MySQL Internals Replication
← Back to MySQL Internals overview page
[edit] Replication
This chapter describes MySQL replication principles and code, as it is in version 4.1.1.
MySQL replication works like this: Every time the master executes a statement that updates data (UPDATE, INSERT, DELETE, etc.), it packs this statement into an event, which consists of the statement plus a few bytes of information (timestamp, thread id of the thread which issued the statement, etc., defined later in this chapter). Then the master writes this event to a file (the binary log). When the slave is connected, the master re-reads its binary log and sends the events to the slaves. The slave I/O thread reads the event from the master and writes it to the slave's relay log. Later, the slave SQL thread reads the event from the relay log, unpacks the event, and executes the statement.
[edit] Main Source Code Files
These files are all in the sql directory:
-
log.cc: creating/writing/deleting a binlog. -
log_event.*: all event types and their methods. -
slave.*: all the slave threads' code. -
sql_repl.*: all SQL commands related to replication (START SLAVE,CHANGE MASTER TO). Also all the master's high-level code about replication (binlog sending, a.k.a.COM_BINLOG_DUMP). For example, binlog sending code is insql_repl.cc, but uses low-level commands (single event reading) which are inlog_event.cc. -
repl_failsafe.*: unfinished code about failsafe (master election if the primary master fails). This file will probably be heavily reworked. Presently it's almost unused.
[edit] The Binary Log
When started with --log-bin, mysqld creates a binary log (binlog) of all updates. Only updates that really change the data are written (a DELETE issued on an empty table won't be written to the binary log). Every statement is written in a packed form: an event. The binary log is a sequence of events. The mysqlbinlog utility can be used to print human-readable data from the binary log.
The possible types of events are listed in the Log_event_type enumeration:
enum Log_event_type
{
UNKNOWN_EVENT= 0, START_EVENT_V3= 1, QUERY_EVENT= 2, STOP_EVENT= 3,
ROTATE_EVENT= 4, INTVAR_EVENT= 5, LOAD_EVENT= 6, SLAVE_EVENT= 7,
CREATE_FILE_EVENT= 8, APPEND_BLOCK_EVENT= 9, EXEC_LOAD_EVENT= 10,
DELETE_FILE_EVENT= 11, NEW_LOAD_EVENT= 12, RAND_EVENT= 13,
USER_VAR_EVENT= 14, FORMAT_DESCRIPTION_EVENT= 15, XID_EVENT= 16,
BEGIN_LOAD_QUERY_EVENT= 17, EXECUTE_LOAD_QUERY_EVENT= 18,
TABLE_MAP_EVENT = 19, PRE_GA_WRITE_ROWS_EVENT = 20,
PRE_GA_UPDATE_ROWS_EVENT = 21, PRE_GA_DELETE_ROWS_EVENT = 22,
WRITE_ROWS_EVENT = 23, UPDATE_ROWS_EVENT = 24,
DELETE_ROWS_EVENT = 25, INCIDENT_EVENT= 26,
HEARTBEAT_LOG_EVENT= 27,
ENUM_END_EVENT /* end marker */
};
The INTVAR_EVENT type has "subtypes," listed in the Int_event_type enumeration:
enum Int_event_type
{
INVALID_INT_EVENT = LAST_INSERT_ID_EVENT = INSERT_ID_EVENT = 2
};
There is a C++ class for every type of event (class Start_log_event, class Query_log_event, and so forth). Prototypes are in log_event.h. Code for methods of these classes is in log_event.cc. Code to create, write, rotate, or delete a binary log is in log.cc.
Binary log event type meanings and formats are detailed at The Binary Log.
Here is a sample of what a binary log looks like as displayed by mysqlbinlog. (Minor reformatting has been done for readability.)
[guilhem@gbichot2 1]$ mysqlbinlog gbichot2-bin.005 # at 4 #030710 21:55:35 server id 1 log_pos 4 Start: binlog v 3, server v 4.0.14-debug-log created 030710 21:55:35 at startup # at 79 #030710 21:55:59 server id 1 log_pos 79 Query thread_id=2 exec_time=16 error_code=0 SET TIMESTAMP=1057866959; drop database test; # at 128 #030710 21:56:20 server id 1 log_pos 128 Query thread_id=2 exec_time=0 error_code=0 SET TIMESTAMP=1057866980; create database test; # at 179 #030710 21:57:00 server id 1 log_pos 179 Query thread_id=2 exec_time=1 error_code=0 use test; SET TIMESTAMP=1057867020; create table u(a int primary key, b int, key(b), foreign key (b) references u(a)); # at 295 #030710 21:57:19 server id 1 log_pos 295 Query thread_id=2 exec_time=0 error_code=0 SET TIMESTAMP=1057867039; drop table u; # at 342 #030710 21:57:24 server id 1 log_pos 342 Query thread_id=2 exec_time=0 error_code=0 SET TIMESTAMP=1057867044; create table u(a int primary key, b int, key(b), foreign key (b) references u(a)) type=innodb; # at 470 #030710 21:57:52 server id 1 log_pos 470 Query thread_id=2 exec_time=0 error_code=0 SET TIMESTAMP=1057867072; insert into u values(4,NULL); # at 533 #030710 21:57:59 server id 1 log_pos 533 Query thread_id=2 exec_time=0 error_code=0 SET TIMESTAMP=1057867079; insert into u values(3,4); # at 593 #030710 21:58:34 server id 1 log_pos 593 Query thread_id=4 exec_time=0 error_code=0 SET TIMESTAMP=1057867114; delete from u; # at 641 #030710 21:58:57 server id 1 log_pos 641 Query thread_id=4 exec_time=0 error_code=0 SET TIMESTAMP=1057867137; drop table u; # at 688 #030710 21:59:18 server id 1 log_pos 688 Query thread_id=4 exec_time=0 error_code=0 SET TIMESTAMP=1057867158; create table v(c int primary key) type=innodb; # at 768 #030710 21:59:24 server id 1 log_pos 768 Query thread_id=4 exec_time=0 error_code=0 SET TIMESTAMP=1057867164; create table u(a int primary key, b int, key(b), foreign key (b) references v(c)) type=innodb; # at 896 #030710 21:59:47 server id 1 log_pos 896 Query thread_id=8 exec_time=0 error_code=0 SET TIMESTAMP=1057867187; DROP TABLE IF EXISTS u; # at 953 #030710 21:59:47 server id 1 log_pos 953 Query thread_id=8 exec_time=0 error_code=0 SET TIMESTAMP=1057867187; CREATE TABLE u ( a int(11) NOT NULL default '0', b int(11) default NULL, PRIMARY KEY (a), KEY b (b), CONSTRAINT `0_41` FOREIGN KEY (`b`) REFERENCES `v` (`c`) ) TYPE=InnoDB; # at 1170 #030710 21:59:47 server id 1 log_pos 1170 Query thread_id=8 exec_time=0 error_code=0 SET TIMESTAMP=1057867187; DROP TABLE IF EXISTS v; # at 1227 #030710 21:59:47 server id 1 log_pos 1227 Query thread_id=8 exec_time=0 error_code=0 SET TIMESTAMP=1057867187; CREATE TABLE v ( c int(11) NOT NULL default '0', PRIMARY KEY (c) ) TYPE=InnoDB; # at 1345 #030710 22:00:06 server id 1 log_pos 1345 Query thread_id=9 exec_time=0 error_code=0 SET TIMESTAMP=1057867206; drop table u,v; # at 1394 #030710 22:00:29 server id 1 log_pos 1394 Query thread_id=13 exec_time=0 error_code=0 SET TIMESTAMP=1057867229; create table v(c int primary key) type=innodb; # at 1474 #030710 22:00:32 server id 1 log_pos 1474 Query thread_id=13 exec_time=0 error_code=0 SET TIMESTAMP=1057867232; create table u(a int primary key, b int, key(b), foreign key (b) references v(c)) type=innodb; # at 1602 #030710 22:00:44 server id 1 log_pos 1602 Query thread_id=16 exec_time=0 error_code=0 SET TIMESTAMP=1057867244; drop table v,u; # at 1651 #030710 22:00:51 server id 1 log_pos 1651 Query thread_id=16 exec_time=0 error_code=0 SET TIMESTAMP=1057867251; CREATE TABLE v ( c int(11) NOT NULL default '0', PRIMARY KEY (c) ) TYPE=InnoDB; # at 1769 #030710 22:12:50 server id 1 log_pos 1769 Stop
[edit] Replication Threads
Every time replication is started on the slave mysqld, that is, when mysqld is started with some replication options (--master-host=this_hostname etc.) or some existing master.info and relay-log.info files, or when the user does START SLAVE on the slave, two threads are created on the slave, in slave.cc:
extern "C" pthread_handler_decl(handle_slave_io,arg)
{ ... }
extern "C" pthread_handler_decl(handle_slave_sql,arg)
{ ... }
[edit] The Slave I/O Thread
The I/O thread connects to the master using a user/password. After connecting, it asks the master for its binary logs:
thd->proc_info = "Requesting binlog dump";
if (request_dump(mysql, mi, &suppress_warnings))
Then it enters this loop:
while (!io_slave_killed(thd,mi))
{
...
thd->proc_info = "Reading master update";
ulong event_len = read_event(mysql, mi, &suppress_warnings);
...
thd->proc_info = "Queueing event from master";
if (queue_event(mi,(const char*)mysql->net.read_pos + 1,
event_len))
{
sql_print_error("Slave I/O thread could not queue event
from master");
goto err;
}
flush_master_info(mi);
if (mi->rli.log_space_limit && mi->rli.log_space_limit <
mi->rli.log_space_total &&
!mi->rli.ignore_log_space_limit)
if (wait_for_relay_log_space(&mi->rli))
{
sql_print_error("Slave I/O thread aborted while
waiting for relay log space");
goto err;
}
...
}
read_event() calls net_safe_read() to read what the master has sent over the network. queue_event() writes the read event to the relay log, and also updates the counter which keeps track of the space used by all existing relay logs. flush_master_info() writes to the master.info file the new position up to which the thread has read in the master's binlog. Finally, if relay logs take too much space, the I/O thread blocks until the SQL thread signals that it is okay to read and queue more events. The code handles network failures and reconnections.
When the relay log gets too large, it is rotated: The I/O thread stops writing to it, closes it, opens a new relay log file and writes to the new one. The old file is kept until the SQL thread (see below) has finished executing it, then it is deleted. The meaning of "too large" is determined from the value of one of two system variables as follows:
-
max_relay_log_size, ifmax_relay_log_size> 0 -
max_binlog_size, ifmax_relay_log_size= 0 or MySQL is older than 4.0.14 (before whichmax_relay_log_sizedoes not exist)
[edit] The Slave SQL Thread
while (!sql_slave_killed(thd,rli))
{
thd->proc_info = "Processing master log event";
DBUG_ASSERT(rli->sql_thd == thd);
THD_CHECK_SENTRY(thd);
if (exec_relay_log_event(thd,rli))
{
// do not scare the user if SQL thread was simply killed or stopped
if (!sql_slave_killed(thd,rli))
sql_print_error("Error running query, slave SQL thread
aborted. Fix the problem, and restart
the slave SQL thread with "SLAVE START".
We stopped at log '%s' position %s",
RPL_LOG_NAME, llstr(rli->master_log_pos, llbuff));
goto err;
}
}
exec_relay_log_event() reads an event from the relay log (by calling next_event()). next_event() will start reading the next relay log file if the current one is finished; it will also wait if there is no more relay log to read (because the I/O thread is stopped or the master has nothing more to send to the slave). Finally exec_relay_log_event() executes the read event (all ::exec_event() methods in log_event.cc) (mostly this execution goes through sql_parse.cc), thus updating the slave database and writing to relay-log.info the new position up to which it has executed in the relay log. The ::exec_event() methods in log_event.cc will take care of filter options like replicate-do-table and such.
When the SQL thread hits the end of the relay log, it checks whether a new one exists (that is, whether a rotation has occurred). If so, it deletes the already-read relay log and starts reading the new one. Otherwise, it just waits until there's more data in the relay log.
[edit] Why 2 Threads?
In MySQL 3.23, we had only one thread on the slave, which did the whole job: read one event from the connection to the master, executed it, read another event, executed it, etc.
In MySQL 4.0.2 we split the job into two threads, using a relay log file to exchange between them.
This makes code more complicated. We have to deal with the relay log being written at the end, read at another position, at the same time. Plus handling the detection of EOF on the relay log, switching to the new relay log. Also the SQL thread must do different reads, depending on how the relay log file it is reading is being used:
- If the file is being written to by the I/O thread, the relay log is partly in memory, not all on disk, and mutexes are needed to avoid confusion between threads.
- If the file has already been rotated (the I/O thread is not writing to it anymore), it is a normal file that no other threads touch.
The advantages of having 2 threads instead of one:
- Helps having a more up-to-date slave. Reading a statement is fast, executing it is slow. If the master dies (burns), there are good chances that the I/O thread has caught almost all updates issued on the master, and saved them in the relay log, for use by the SQL thread.
- Reduces the required master-slave connection time. If the slave has not been connected for a long time, it is very late compared to the master. It means the SQL thread will have a lot of executing to do. So with the single-thread read-execute-read-execute technique, the slave will have to be connected for a long time to be able to fetch all updates from the master. Which is stupid, as for a significant part of the time, the connection will be idle, because the single thread is busy executing the statement. Whereas with 2 threads, the I/O thread will fetch the binlogs from the master in a shorter time. Then the connection is not needed anymore, and the SQL thread can continue executing the relay log.
[edit] The Binlog Dump Thread
This thread is created by the master when it receives a COM_BINLOG_DUMP request.
void mysql_binlog_send(THD* thd, char* log_ident, my_off_t pos,
ushort flags)
{
...
if ((file=open_binlog(&log, log_file_name, &errmsg)) < 0)
{
my_errno= ER_MASTER_FATAL_ERROR_READING_BINLOG;
goto err;
}
if (pos < BIN_LOG_HEADER_SIZE || pos > my_b_filelength(&log))
{
errmsg= "Client requested master to start replication from
impossible position";
my_errno= ER_MASTER_FATAL_ERROR_READING_BINLOG;
goto err;
}
my_b_seek(&log, pos); // Seek will done on next read
...
// if we are at the start of the log
if (pos == BIN_LOG_HEADER_SIZE)
{
// tell the client log name with a fake rotate_event
if (fake_rotate_event(net, packet, log_file_name, &errmsg))
{
my_errno= ER_MASTER_FATAL_ERROR_READING_BINLOG;
goto err;
}
...
}
while (!net->error && net->vio != 0 && !thd->killed)
{
pthread_mutex_t *log_lock = mysql_bin_log.get_log_lock();
while (!(error = Log_event::read_log_event(&log, packet, log_lock)))
{
...
if (my_net_write(net, (char*)packet->ptr(), packet->length()) )
{
errmsg = "Failed on my_net_write()";
my_errno= ER_UNKNOWN_ERROR;
goto err;
}
...
If this thread starts reading from the beginning of a binlog, it is possible that the slave does not know the binlog's name (for example it could have just asked give me the FIRST binlog</span>). Using fake_rotate_event(), the master tells the slave the binlog's name (required for master.info and SHOW SLAVE STATUS) by building a Rotate_log_event and sending this event to the slave. In this event the slave will find the binlog's name. This event has zeros in the timestamp (shows up as written in year 1970 when reading the relay log with mysqlbinlog).
[edit] How Replication Deals With...
This section describes how replication handles various problematic issues.
[edit] Columns, LAST_INSERT_ID()
When a statement inserts into such a column, or uses LAST_INSERT_ID(), one or two Intvar_log_event instances are written to the binlog just before the Query_log_event.
[edit] User Variables (Since 4.1)
When a statement uses a user variable, a User_var_log_event is written to the binlog just before the Query_log_event.
[edit] System Variables
Example: SQL_MODE, FOREIGN_KEY_CHECKS. Not dealt with. Guilhem is working on it for version 5.0.
[edit] Some Functions
USER(), LOAD_FILE(). Not dealt with. Will be solved with row-level binlogging (presently we have statement-level binlogging, but in the future we plan to support row-level binlogging too).
[edit] Non-repeatable UDF Functions
Non repeatable means that they have a sort of randomness, for example they depend on the machine (to generate a unique ID for example). Not dealt with. Will be solved with row-level binlogging.
[edit] Prepared Statements
For the moment, a substituted normal statement is written to the master's binlog. Using prepared statements on the slave as well is on the TODO.
[edit] Temporary Tables
Temporary tables depend on the thread which created them, so any query event which uses such tables is marked with the LOG_EVENT_THREAD_SPECIFIC_F flag. All events have in their header the id of the thread which created them, so the slave knows which temporary table the statement refers to.
When the slave is stopped (STOP SLAVE or even mysqladmin shutdown), the in-use replicated temporary tables are not dropped (like clients' temporary tables are). This way, when the slave restarts they are still available.
When a connection using temporary tables terminates on the master, the master automatically writes some DROP TEMPORARY TABLE statements for them so that they are dropped on the slave as well.
When the master brutally dies, then restarts, it drops all temporary tables which remained in tmpdir, but without writing it to the binlog, so these temporary tables are still on the slave, and they will not be dropped before the next slave's mysqld restart. To avoid this, the slave drops all replicated temporary tables when it executes a Start_log_event read from the master. Indeed such an event means the master's mysqld has restarted so all preceding temporary tables have been dropped.
Presently we have a bug: if the slave mysqld is stopped while it was replicating a temporary table, then at restart it deletes this table (like a normal temporary table), which may cause a problem if subsequent statements on the master refer to this table.
[edit] LOAD DATA [LOCAL] INFILE (Since 4.0)
The master writes the loaded file to the binlog, but in small blocks rather than all at once. The slave creates a temporary file, the concatenation of each block. When the slave reads the final Execute_load_log_event, it loads all temporary files into the table and deletes the temporary files. If the final event was instead a Delete_file_log_event then these temporary files are deleted without loading.
[edit] How a Slave Asks Its Master to Send Its Binary Log
The slave server must open a normal connection to its master. The MySQL account used to connect to the master must have the REPLICATION SLAVE privilege on the master. Then the slave must send the COM_BINLOG_DUMP command, as in this example taken from function request_dump():
static int request_dump(MYSQL* mysql, MASTER_INFO* mi,
bool *suppress_warnings)
{
char buf[FN_REFLEN + 10];
int len;
int binlog_flags = 0; // for now
char* logname = mi->master_log_name;
DBUG_ENTER("request_dump");
// TODO if big log files: Change next to int8store()
int4store(buf, (longlong) mi->master_log_pos);
int2store(buf + 4, binlog_flags);
int4store(buf + 6, server_id);
len = (uint) strlen(logname);
memcpy(buf + 10, logname,len);
if (simple_command(mysql, COM_BINLOG_DUMP, buf, len + 10, 1))
{
// act on errors
}
}
Here variable buf contains the arguments for COM_BINLOG_DUMP. It's the concatenation of:
- 4 bytes: the position in the master's binlog from which we want to start (that is, please master, send me the binlog, starting from this position).
- 2 bytes: 0 for the moment.
- 4 bytes: this slave's server id. This is used by the master to delete old
Binlog Dumpthreads which were related to this slave (see functionkill_zombie_dump_threads()for details). - variable-sized part: the name of the binlog we want. The dump will start from this binlog, at the position indicated in the first four bytes.
Then send the command, and start reading the incoming packets from the master, like read_event() does (using net_safe_read() as explained below). One should also, to be safe, handle all possible cases of network problems, disconnections/reconnections, malformed events.
[edit] Where Binary Log Events are Read/Parsed in the Source Code
1. Replication master's binlog dump thread:
This is the thread handling the COM_BINLOG_DUMP command sent by a slave.
It reads events from binary log files and sends them to the replication slave over the SQL connection. Code: mysql_binlog_send() function in sql/sql_repl.cc
2. Replication slave's I/O thread:
Reads events sent by the master's dump thread and stores them in relay logs (local to the slave host). Code: handle_slave_io() function in sql/slave.cc
3. Replication slave's SQL thread:
Reads events from relay logs and executes them.
Code: handle_slave_sql() function in sql/slave.cc
4. mysqlbinlog program:
Reads events from binary log (or relay log) files or sent by a mysqld server and presents them in text format as SQL statements. Code: dump_{local,remote}_log_entries() functions in client/mysqlbinlog.cc
Notes:
1a. If the current binary log file ends, mysql_binlog_send() switches to the next file.
1b. mysql_binlog_send() sends a "fake" Rotate event before sending any other events from a binlog and after each single binlog file.
1c. In case an event offset was specified with a binlog dump request, mysql_binlog_send() finds the format description event at the beginning of the binary log file and sends it first before proceeding to the events at the given offset.
[edit] How Different Binary Log Formats are Handled in the Slave
The I/O thread reads the master server version from the mysql->server_version value (where mysql is an object representing an established connection to the master). It aborts replication if the master's version is x.y.z with x <= 2.
The I/O thread reads events using cli_safe_read() function, which simply reads protocol packets (?). Thus, this is binlog-version transparent.
Before writing events to the relay log, the I/O thread can process them:
- If it considers the current binary log to be in format v1 or v3, it translates events to v4 format
- Otherwise, it copies event packets as they are, except that it recognizes and parses FDE and Rotate events. It also recognizes Stop events, but doesn't write them to the relay log.
The I/O thread decides about binary log version based on the FDE it parses. Before first FDE arrives, the binary log version is based on the master server version. For masters with version x.y.z, where x > 4, the binary log version is assumed to be v4 (even though in fact it can be higher).
Parsing FDE and Rotate events is done by constructors of corresponding objects. These constructors don't do any checks of the binary log version. Note: the Rotate_log_event constructor reads header and post-header lengths from the current event description event (and thus is possibly binlog format sensitive).
Translation of events from old formats to the current format is done by parsing event data using Log_event::read_log_event() (that is, in the constructors of event objects) and then writing the event objects to the relay log.
Note: Looks like the event object constructors don't trust the binlog version stored in the FDE passed to them.
[edit] Network Packets in Detail
The communication protocol between the master and slave is the one that all other normal connections use, as described in MySQL Client/Server Protocol. So after the COM_BINLOG_DUMP command has been sent, the communication between the master and slave is a sequence of packets, each of which contains an event. In slave.cc, function read_event(), one has an example: net_safe_read() is called; it is able to detect wrong packets. After net_safe_read(), the event is ready to be interpreted; it starts at pointer (char*) mysql->net.read_pos + 1. That is, (char*) mysql->net.read_pos + 1 is the first byte of the event's timestamp, etc.
[edit] Replication Event Format in Detail
[Moved to: The Binary Log.]
[edit] Plans
We have already made extensive changes to the above in MySQL 5.0. For an upcoming version we plan a new format where data is replicated by row instead of by statement. This makes it possible to replicate data from one MySQL Cluster to another. We also plan new functionality to handle replication connections from multiple masters into one slave.