WL#342: Replication Heartbeat

Affects: Server-5.5
Status: Complete
Priority: Low

RATIONALE
---------
1. To solve BUG#20435 - extra relay log rotation;
2. to detect failures more easily and precisely;
3. to have a better Seconds_Behind_Master value on the slave (BUG#29309).

DESIGN
------
1. New extension to CHANGE MASTER:

     CHANGE MASTER SET master_heartbeat_period= val;

   The value is stored in the master info file and in the mi object, which
   provides the value sent to the master. The value 'val' needs to be within
   a reasonable interval. As the cost of creating, sending and handling the
   event on the slave side is expected to be low, val can be as small as
   1 second or even less.

2. The slave IO thread prepares and sends the query
   'SET SESSION @master_heartbeat_period= val' to the master. From this
   query the master's dump thread learns the slave's preferred heartbeat
   period.

3. The heartbeat event is created by the master's dump thread and sent each
   time @master_heartbeat_period elapses with nothing having been added to
   the binlog during that period.

4. On the slave's side the heartbeat event is handled exclusively by the IO
   thread; it is not recorded in the relay log and does not engage the slave
   SQL thread. When a heartbeat is received, the coordinates of the last
   sent event are compared against the ones the slave IO thread maintains
   and updates for each received event, except for some "phantoms",
   including the new Heartbeat. The heartbeat event does not update that
   local information.

5. The heartbeat period and the number of received heartbeat events should
   be monitorable via SHOW STATUS like 'slave_heartbeat_period' and
   SHOW STATUS like 'slave_received_heartbeats' respectively.

FEATURES
--------
A newer, heartbeat-aware slave will not receive an error from an older,
heartbeat-unaware master in response to the connect-time query
set @master_heartbeat_period=val; naturally, the old master will not send
heartbeats. In that case the slave shows its chosen heartbeat value in the
status, but nothing actually happens.
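
The slave-side handling from DESIGN step 4, together with the old-master
case just described, can be sketched roughly as follows. This is an
illustrative Python model, not the server's implementation: the names
`Event` and `SlaveIOThread` are hypothetical, and raising an error on a
coordinate mismatch is an assumption, since the worklog only says that the
coordinates are compared.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Event:
    """Hypothetical event as seen by the slave IO thread."""
    kind: str       # "heartbeat", or any other event type such as "query"
    log_name: str   # master binlog file the event refers to
    log_pos: int    # position within that file


@dataclass
class SlaveIOThread:
    """Toy model of the IO thread state relevant to heartbeats."""
    heartbeat_period: float          # value chosen on the slave, in seconds
    log_name: str = ""               # coordinates of the last regular event
    log_pos: int = 0
    received_heartbeats: int = 0     # models 'slave_received_heartbeats'
    relay_log: List[Event] = field(default_factory=list)

    def handle(self, ev: Event) -> None:
        if ev.kind == "heartbeat":
            # A heartbeat carries the coordinates of the last sent event;
            # compare them with what the IO thread already tracks.
            if (ev.log_name, ev.log_pos) != (self.log_name, self.log_pos):
                # Assumed reaction; the worklog does not specify one.
                raise ValueError("heartbeat coordinates do not match")
            # Count it, but neither write it to the relay log nor
            # update the tracked coordinates.
            self.received_heartbeats += 1
            return
        # Regular events go to the relay log and advance the coordinates.
        self.relay_log.append(ev)
        self.log_name, self.log_pos = ev.log_name, ev.log_pos
```

With a heartbeat-unaware master no heartbeat ever reaches `handle`, so
`received_heartbeats` stays at zero while `heartbeat_period` still reports
the value chosen on the slave.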

Using the user variable @master_heartbeat_period instead of a system
variable avoids displaying the name in the list of available variables for
a plain user session. The existence of the user variable
master_heartbeat_period can be noticed only via the general query log.

User observable behaviour
-------------------------
1. Requesting, on the slave, that the master send heartbeats with a given
   period:

     CHANGE MASTER master_heartbeat_period= val

   where val is the period, a decimal value in the range
   [0.001, 4294967] seconds.
   Note that the master sends heartbeats only if there have been no unsent
   events in the current binlog file for longer than
   master_heartbeat_period. Whenever the master's binlog is updated with an
   event, the wait for sending a heartbeat is reset.
   If `val' is zero, no heartbeats are sent.
   Note that the heartbeat is active by default, with the period
   slave_net_timeout/2.

2. SHOW STATUS like 'slave_heartbeat_period'

   Slave-side status variable that takes its value from CHANGE MASTER, from
   master.info, or implicitly as `slave_net_timeout/2' (the default).
   The divisor 2 gives a reasonable default period that guarantees the
   slave will not reconnect to an idle master when slave_net_timeout
   elapses.

3. SHOW STATUS like 'slave_received_heartbeats';

   A counter that is initialized at slave init time, is incremented on
   every received heartbeat, and is reset to zero by CHANGE MASTER.
   The counter is a ulonglong, i.e. normally 8 bytes. Overflowing it, even
   with the fastest heartbeat, is possible only on a cosmic time scale.

4. RESET SLAVE; resets the current heartbeat period to the default (see 2.).
   The valid-range check still applies after slave_net_timeout/2 is
   computed: if the result exceeds the maximum allowed period, the period
   is set to that maximum.

5. SET @@global.slave_net_timeout= `value less than the current heartbeat
   period` produces a warning, since such a setting makes no sense.
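
The period rules in items 1, 2, 4 and 5 can be illustrated with a small
Python model. It is only a sketch under stated assumptions: all function
names are hypothetical, rejecting an out-of-range value with an exception
is an assumption (the worklog gives the range but not the reaction), and
`heartbeat_times` assumes the connection starts at time 0 with every binlog
event sent to the slave the moment it is written.

```python
from typing import Iterable, List

MIN_PERIOD = 0.001       # seconds, lower bound of the valid range
MAX_PERIOD = 4294967.0   # seconds, upper bound of the valid range


def default_period(slave_net_timeout: float) -> float:
    """Default period: slave_net_timeout/2, capped at the allowed maximum."""
    return min(slave_net_timeout / 2, MAX_PERIOD)


def validate_period(val: float) -> float:
    """Check a requested period; zero means heartbeats are disabled."""
    if val == 0:
        return 0.0
    if not (MIN_PERIOD <= val <= MAX_PERIOD):
        # Assumed reaction to an out-of-range value.
        raise ValueError(f"heartbeat period {val} is out of range")
    return val


def timeout_below_period(slave_net_timeout: float, period: float) -> bool:
    """True when the new slave_net_timeout would deserve the warning (item 5)."""
    return slave_net_timeout < period


def heartbeat_times(period: float,
                    binlog_event_times: Iterable[float],
                    until: float) -> List[float]:
    """Times at which an idle master would send heartbeats up to `until`."""
    if period == 0:
        return []
    sent: List[float] = []
    last_activity = 0.0
    for event_time in sorted(t for t in binlog_event_times if t < until):
        # Heartbeats fire only while the binlog stays idle.
        next_hb = last_activity + period
        while next_hb < event_time:
            sent.append(next_hb)
            next_hb += period
        # A new binlog event resets the wait.
        last_activity = event_time
    next_hb = last_activity + period
    while next_hb <= until:
        sent.append(next_hb)
        next_hb += period
    return sent
```

For example, `heartbeat_times(2.0, [3.0, 4.0], 10.0)` yields heartbeats at
2, 6, 8 and 10 seconds: the events at 3 and 4 seconds each restart the wait,
so no heartbeat is sent between them.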

OLD INFO:
---------
Suggestion from Jeremy Zawodny at OSCON, July 2001, as a discussion idea:

Need active heartbeat detection mechanism for replication monitoring.
Writing an error to log file is insufficient notice of failure. Instead
need a way to trigger an alert message to DBA or a method for a watcher
program to immediately detect when a slave has failed.