Category: MySQLDevelopment

MySQL Internals Replication

← Back to MySQL Internals overview page

Contents

[edit] Replication

This chapter describes MySQL replication principles and code, as it is in version 4.1.1.

MySQL replication works like this: Every time the master executes a statement that updates data (UPDATE, INSERT, DELETE, etc.), it packs this statement into an event, which consists of the statement plus a few bytes of information (timestamp, thread id of the thread which issued the statement, etc., defined later in this chapter). Then the master writes this event to a file (the binary log). When the slave is connected, the master re-reads its binary log and sends the events to the slaves. The slave I/O thread reads the event from the master and writes it to the slave's relay log. Later, the slave SQL thread reads the event from the relay log, unpacks the event, and executes the statement.

[edit] Main Source Code Files

These files are all in the sql directory:

[edit] The Binary Log

When started with --log-bin, mysqld creates a binary log (binlog) of all updates. Only updates that really change the data are written (a DELETE issued on an empty table won't be written to the binary log). Every statement is written in a packed form: an event. The binary log is a sequence of events. The mysqlbinlog utility can be used to print human-readable data from the binary log.

The possible types of events are listed in the Log_event_type enumeration:

enum Log_event_type
{
  UNKNOWN_EVENT= 0, START_EVENT_V3= 1, QUERY_EVENT= 2, STOP_EVENT= 3,
  ROTATE_EVENT= 4, INTVAR_EVENT= 5, LOAD_EVENT= 6, SLAVE_EVENT= 7,
  CREATE_FILE_EVENT= 8, APPEND_BLOCK_EVENT= 9, EXEC_LOAD_EVENT= 10,
  DELETE_FILE_EVENT= 11, NEW_LOAD_EVENT= 12, RAND_EVENT= 13,
  USER_VAR_EVENT= 14, FORMAT_DESCRIPTION_EVENT= 15, XID_EVENT= 16,
  BEGIN_LOAD_QUERY_EVENT= 17, EXECUTE_LOAD_QUERY_EVENT= 18,
  TABLE_MAP_EVENT = 19, PRE_GA_WRITE_ROWS_EVENT = 20,
  PRE_GA_UPDATE_ROWS_EVENT = 21, PRE_GA_DELETE_ROWS_EVENT = 22,
  WRITE_ROWS_EVENT = 23, UPDATE_ROWS_EVENT = 24,
  DELETE_ROWS_EVENT = 25, INCIDENT_EVENT= 26,
  HEARTBEAT_LOG_EVENT= 27,
  ENUM_END_EVENT /* end marker */
};

The INTVAR_EVENT type has "subtypes," listed in the Int_event_type enumeration:

enum Int_event_type
{
  INVALID_INT_EVENT = LAST_INSERT_ID_EVENT = INSERT_ID_EVENT = 2
};

There is a C++ class for every type of event (class Start_log_event, class Query_log_event, and so forth). Prototypes are in log_event.h. Code for methods of these classes is in log_event.cc. Code to create, write, rotate, or delete a binary log is in log.cc.

Binary log event type meanings and formats are detailed at The Binary Log.

Here is a sample of what a binary log looks like as displayed by mysqlbinlog. (Minor reformatting has been done for readability.)

[guilhem@gbichot2 1]$ mysqlbinlog gbichot2-bin.005
 # at 4
 #030710 21:55:35 server id 1  log_pos 4
  Start: binlog v 3, server v 4.0.14-debug-log created 030710 21:55:35 at startup
 # at 79
 #030710 21:55:59 server id 1  log_pos 79   Query   thread_id=2     exec_time=16    error_code=0
 SET TIMESTAMP=1057866959;
 drop database test;
 # at 128
 #030710 21:56:20 server id 1  log_pos 128  Query   thread_id=2     exec_time=0     error_code=0
 SET TIMESTAMP=1057866980;
 create database test;
 # at 179
 #030710 21:57:00 server id 1  log_pos 179  Query   thread_id=2     exec_time=1     error_code=0
 use test;
 SET TIMESTAMP=1057867020;
 create table u(a int primary key, b int, key(b), foreign key (b) references u(a));
 # at 295
 #030710 21:57:19 server id 1  log_pos 295  Query   thread_id=2     exec_time=0     error_code=0
 SET TIMESTAMP=1057867039;
 drop table u;
 # at 342
 #030710 21:57:24 server id 1  log_pos 342  Query   thread_id=2     exec_time=0     error_code=0
 SET TIMESTAMP=1057867044;
 create table u(a int primary key, b int, key(b), foreign key (b) references u(a)) type=innodb;
 # at 470
 #030710 21:57:52 server id 1  log_pos 470  Query   thread_id=2     exec_time=0     error_code=0
 SET TIMESTAMP=1057867072;
 insert into u values(4,NULL);
 # at 533
 #030710 21:57:59 server id 1  log_pos 533  Query   thread_id=2     exec_time=0     error_code=0
 SET TIMESTAMP=1057867079;
 insert into u values(3,4);
 # at 593
 #030710 21:58:34 server id 1  log_pos 593  Query   thread_id=4     exec_time=0     error_code=0
 SET TIMESTAMP=1057867114;
 delete from u;
 # at 641
 #030710 21:58:57 server id 1  log_pos 641  Query   thread_id=4     exec_time=0     error_code=0
 SET TIMESTAMP=1057867137;
 drop table u;
 # at 688
 #030710 21:59:18 server id 1  log_pos 688  Query   thread_id=4     exec_time=0     error_code=0
 SET TIMESTAMP=1057867158;
 create table v(c int primary key) type=innodb;
 # at 768
 #030710 21:59:24 server id 1  log_pos 768  Query   thread_id=4     exec_time=0     error_code=0
 SET TIMESTAMP=1057867164;
 create table u(a int primary key, b int, key(b), foreign key (b) references v(c)) type=innodb;
 # at 896
 #030710 21:59:47 server id 1  log_pos 896  Query   thread_id=8     exec_time=0     error_code=0
 SET TIMESTAMP=1057867187;
 DROP TABLE IF EXISTS u;
 # at 953
 #030710 21:59:47 server id 1  log_pos 953  Query   thread_id=8     exec_time=0     error_code=0
 SET TIMESTAMP=1057867187;
 CREATE TABLE u (
   a int(11) NOT NULL default '0',
   b int(11) default NULL,
   PRIMARY KEY  (a),
   KEY b (b),
   CONSTRAINT `0_41` FOREIGN KEY (`b`) REFERENCES `v` (`c`)
 ) TYPE=InnoDB;
 # at 1170
 #030710 21:59:47 server id 1  log_pos 1170  Query   thread_id=8     exec_time=0     error_code=0
 SET TIMESTAMP=1057867187;
 DROP TABLE IF EXISTS v;
 # at 1227
 #030710 21:59:47 server id 1  log_pos 1227  Query   thread_id=8     exec_time=0     error_code=0
 SET TIMESTAMP=1057867187;
 CREATE TABLE v (
   c int(11) NOT NULL default '0',
   PRIMARY KEY  (c)
 ) TYPE=InnoDB;
 # at 1345
 #030710 22:00:06 server id 1  log_pos 1345  Query   thread_id=9     exec_time=0     error_code=0
 SET TIMESTAMP=1057867206;
 drop table u,v;
 # at 1394
 #030710 22:00:29 server id 1  log_pos 1394  Query   thread_id=13    exec_time=0     error_code=0
 SET TIMESTAMP=1057867229;
 create table v(c int primary key) type=innodb;
 # at 1474
 #030710 22:00:32 server id 1  log_pos 1474  Query   thread_id=13    exec_time=0     error_code=0
 SET TIMESTAMP=1057867232;
 create table u(a int primary key, b int, key(b), foreign key (b) references v(c)) type=innodb;
 # at 1602
 #030710 22:00:44 server id 1  log_pos 1602  Query   thread_id=16    exec_time=0     error_code=0
 SET TIMESTAMP=1057867244;
 drop table v,u;
 # at 1651
 #030710 22:00:51 server id 1  log_pos 1651  Query   thread_id=16    exec_time=0     error_code=0
 SET TIMESTAMP=1057867251;
 CREATE TABLE v (
   c int(11) NOT NULL default '0',
   PRIMARY KEY  (c)
 ) TYPE=InnoDB;
 # at 1769
 #030710 22:12:50 server id 1  log_pos 1769  Stop
 

[edit] Replication Threads

Every time replication is started on the slave mysqld, that is, when mysqld is started with some replication options (--master-host=this_hostname etc.) or some existing master.info and relay-log.info files, or when the user does START SLAVE on the slave, two threads are created on the slave, in slave.cc:

extern "C" pthread_handler_decl(handle_slave_io,arg)
{ ... }

extern "C" pthread_handler_decl(handle_slave_sql,arg)
{ ... }

[edit] The Slave I/O Thread

The I/O thread connects to the master using a user/password. After connecting, it asks the master for its binary logs:

    thd->proc_info = "Requesting binlog dump";
    if (request_dump(mysql, mi, &suppress_warnings))

Then it enters this loop:

    while (!io_slave_killed(thd,mi))
    {
      ...
      thd->proc_info = "Reading master update";
      ulong event_len = read_event(mysql, mi, &suppress_warnings);
      ...
      thd->proc_info = "Queueing event from master";
      if (queue_event(mi,(const char*)mysql->net.read_pos + 1,
                      event_len))
      {
        sql_print_error("Slave I/O thread could not queue event
                         from master");
        goto err;
      }
      flush_master_info(mi);
      if (mi->rli.log_space_limit && mi->rli.log_space_limit <
          mi->rli.log_space_total &&
          !mi->rli.ignore_log_space_limit)
        if (wait_for_relay_log_space(&mi->rli))
        {
          sql_print_error("Slave I/O thread aborted while
                           waiting for relay log space");
          goto err;
        }
      ...
    }

read_event() calls net_safe_read() to read what the master has sent over the network. queue_event() writes the read event to the relay log, and also updates the counter which keeps track of the space used by all existing relay logs. flush_master_info() writes to the master.info file the new position up to which the thread has read in the master's binlog. Finally, if relay logs take too much space, the I/O thread blocks until the SQL thread signals that it is okay to read and queue more events. The code handles network failures and reconnections.

When the relay log gets too large, it is rotated: The I/O thread stops writing to it, closes it, opens a new relay log file and writes to the new one. The old file is kept until the SQL thread (see below) has finished executing it, then it is deleted. The meaning of "too large" is determined from the value of one of two system variables as follows:

[edit] The Slave SQL Thread

  while (!sql_slave_killed(thd,rli))
  {
    thd->proc_info = "Processing master log event";
    DBUG_ASSERT(rli->sql_thd == thd);
    THD_CHECK_SENTRY(thd);
    if (exec_relay_log_event(thd,rli))
    {
      // do not scare the user if SQL thread was simply killed or stopped
      if (!sql_slave_killed(thd,rli))
        sql_print_error("Error running query, slave SQL thread
                         aborted. Fix the problem, and restart
                         the slave SQL thread with "SLAVE START".
                         We stopped at log '%s' position %s",
                         RPL_LOG_NAME, llstr(rli->master_log_pos, llbuff));
      goto err;
    }
  }

exec_relay_log_event() reads an event from the relay log (by calling next_event()). next_event() will start reading the next relay log file if the current one is finished; it will also wait if there is no more relay log to read (because the I/O thread is stopped or the master has nothing more to send to the slave). Finally exec_relay_log_event() executes the read event (all ::exec_event() methods in log_event.cc) (mostly this execution goes through sql_parse.cc), thus updating the slave database and writing to relay-log.info the new position up to which it has executed in the relay log. The ::exec_event() methods in log_event.cc will take care of filter options like replicate-do-table and such.

When the SQL thread hits the end of the relay log, it checks whether a new one exists (that is, whether a rotation has occurred). If so, it deletes the already-read relay log and starts reading the new one. Otherwise, it just waits until there's more data in the relay log.

[edit] Why 2 Threads?

In MySQL 3.23, we had only one thread on the slave, which did the whole job: read one event from the connection to the master, executed it, read another event, executed it, etc.

In MySQL 4.0.2 we split the job into two threads, using a relay log file to exchange between them.

This makes code more complicated. We have to deal with the relay log being written at the end, read at another position, at the same time. Plus handling the detection of EOF on the relay log, switching to the new relay log. Also the SQL thread must do different reads, depending on how the relay log file it is reading is being used:

The advantages of having 2 threads instead of one:

[edit] The Binlog Dump Thread

This thread is created by the master when it receives a COM_BINLOG_DUMP request.

void mysql_binlog_send(THD* thd, char* log_ident, my_off_t pos,
                       ushort flags)
{
  ...
  if ((file=open_binlog(&log, log_file_name, &errmsg)) < 0)
  {
    my_errno= ER_MASTER_FATAL_ERROR_READING_BINLOG;
    goto err;
  }
  if (pos < BIN_LOG_HEADER_SIZE || pos > my_b_filelength(&log))
  {
    errmsg= "Client requested master to start replication from
             impossible position";
    my_errno= ER_MASTER_FATAL_ERROR_READING_BINLOG;
    goto err;
  }

  my_b_seek(&log, pos);                 // Seek will done on next read
  ...
  // if we are at the start of the log
  if (pos == BIN_LOG_HEADER_SIZE)
  {
    // tell the client log name with a fake rotate_event
    if (fake_rotate_event(net, packet, log_file_name, &errmsg))
    {
      my_errno= ER_MASTER_FATAL_ERROR_READING_BINLOG;
      goto err;
    }
    ...
  }

  while (!net->error && net->vio != 0 && !thd->killed)
  {
    pthread_mutex_t *log_lock = mysql_bin_log.get_log_lock();

    while (!(error = Log_event::read_log_event(&log, packet, log_lock)))
    {
      ...
      if (my_net_write(net, (char*)packet->ptr(), packet->length()) )
      {
        errmsg = "Failed on my_net_write()";
        my_errno= ER_UNKNOWN_ERROR;
        goto err;
      }
      ...

If this thread starts reading from the beginning of a binlog, it is possible that the slave does not know the binlog's name (for example it could have just asked give me the FIRST binlog</span>). Using fake_rotate_event(), the master tells the slave the binlog's name (required for master.info and SHOW SLAVE STATUS) by building a Rotate_log_event and sending this event to the slave. In this event the slave will find the binlog's name. This event has zeros in the timestamp (shows up as written in year 1970 when reading the relay log with mysqlbinlog).

[edit] How Replication Deals With...

This section describes how replication handles various problematic issues.

[edit] Columns, LAST_INSERT_ID()

When a statement inserts into such a column, or uses LAST_INSERT_ID(), one or two Intvar_log_event instances are written to the binlog just before the Query_log_event.

[edit] User Variables (Since 4.1)

When a statement uses a user variable, a User_var_log_event is written to the binlog just before the Query_log_event.

[edit] System Variables

Example: SQL_MODE, FOREIGN_KEY_CHECKS. Not dealt with. Guilhem is working on it for version 5.0.

[edit] Some Functions

USER(), LOAD_FILE(). Not dealt with. Will be solved with row-level binlogging (presently we have statement-level binlogging, but in the future we plan to support row-level binlogging too).

[edit] Non-repeatable UDF Functions

Non repeatable means that they have a sort of randomness, for example they depend on the machine (to generate a unique ID for example). Not dealt with. Will be solved with row-level binlogging.

[edit] Prepared Statements

For the moment, a substituted normal statement is written to the master's binlog. Using prepared statements on the slave as well is on the TODO.

[edit] Temporary Tables

Temporary tables depend on the thread which created them, so any query event which uses such tables is marked with the LOG_EVENT_THREAD_SPECIFIC_F flag. All events have in their header the id of the thread which created them, so the slave knows which temporary table the statement refers to.

When the slave is stopped (STOP SLAVE or even mysqladmin shutdown), the in-use replicated temporary tables are not dropped (like clients' temporary tables are). This way, when the slave restarts they are still available.

When a connection using temporary tables terminates on the master, the master automatically writes some DROP TEMPORARY TABLE statements for them so that they are dropped on the slave as well.

When the master brutally dies, then restarts, it drops all temporary tables which remained in tmpdir, but without writing it to the binlog, so these temporary tables are still on the slave, and they will not be dropped before the next slave's mysqld restart. To avoid this, the slave drops all replicated temporary tables when it executes a Start_log_event read from the master. Indeed such an event means the master's mysqld has restarted so all preceding temporary tables have been dropped.

Presently we have a bug: if the slave mysqld is stopped while it was replicating a temporary table, then at restart it deletes this table (like a normal temporary table), which may cause a problem if subsequent statements on the master refer to this table.

[edit] LOAD DATA [LOCAL] INFILE (Since 4.0)

The master writes the loaded file to the binlog, but in small blocks rather than all at once. The slave creates a temporary file, the concatenation of each block. When the slave reads the final Execute_load_log_event, it loads all temporary files into the table and deletes the temporary files. If the final event was instead a Delete_file_log_event then these temporary files are deleted without loading.

[edit] How a Slave Asks Its Master to Send Its Binary Log

The slave server must open a normal connection to its master. The MySQL account used to connect to the master must have the REPLICATION SLAVE privilege on the master. Then the slave must send the COM_BINLOG_DUMP command, as in this example taken from function request_dump():

static int request_dump(MYSQL* mysql, MASTER_INFO* mi,
                        bool *suppress_warnings)
{
  char buf[FN_REFLEN + 10];
  int len;
  int binlog_flags = 0; // for now
  char* logname = mi->master_log_name;
  DBUG_ENTER("request_dump");

  // TODO if big log files: Change next to int8store()
  int4store(buf, (longlong) mi->master_log_pos);
  int2store(buf + 4, binlog_flags);
  int4store(buf + 6, server_id);
  len = (uint) strlen(logname);
  memcpy(buf + 10, logname,len);
  if (simple_command(mysql, COM_BINLOG_DUMP, buf, len + 10, 1))
  {
     // act on errors
  }
}

Here variable buf contains the arguments for COM_BINLOG_DUMP. It's the concatenation of:

Then send the command, and start reading the incoming packets from the master, like read_event() does (using net_safe_read() as explained below). One should also, to be safe, handle all possible cases of network problems, disconnections/reconnections, malformed events.

[edit] Where Binary Log Events are Read/Parsed in the Source Code

1. Replication master's binlog dump thread:

This is the thread handling the COM_BINLOG_DUMP command sent by a slave.

It reads events from binary log files and sends them to the replication slave over the SQL connection. Code: mysql_binlog_send() function in sql/sql_repl.cc

2. Replication slave's I/O thread:

Reads events sent by the master's dump thread and stores them in relay logs (local to the slave host). Code: handle_slave_io() function in sql/slave.cc

3. Replication slave's SQL thread:

Reads events from relay logs and executes them.

Code: handle_slave_sql() function in sql/slave.cc

4. mysqlbinlog program:

Reads events from binary log (or relay log) files or sent by a mysqld server and presents them in text format as SQL statements. Code: dump_{local,remote}_log_entries() functions in client/mysqlbinlog.cc

Notes:

1a. If the current binary log file ends, mysql_binlog_send() switches to the next file.

1b. mysql_binlog_send() sends a "fake" Rotate event before sending any other events from a binlog and after each single binlog file.

1c. In case an event offset was specified with a binlog dump request, mysql_binlog_send() finds the format description event at the beginning of the binary log file and sends it first before proceeding to the events at the given offset.


[edit] How Different Binary Log Formats are Handled in the Slave

The I/O thread reads the master server version from the mysql->server_version value (where mysql is an object representing an established connection to the master). It aborts replication if the master's version is x.y.z with x <= 2.

The I/O thread reads events using cli_safe_read() function, which simply reads protocol packets (?). Thus, this is binlog-version transparent.

Before writing events to the relay log, the I/O thread can process them:

The I/O thread decides about binary log version based on the FDE it parses. Before first FDE arrives, the binary log version is based on the master server version. For masters with version x.y.z, where x > 4, the binary log version is assumed to be v4 (even though in fact it can be higher).

Parsing FDE and Rotate events is done by constructors of corresponding objects. These constructors don't do any checks of the binary log version. Note: the Rotate_log_event constructor reads header and post-header lengths from the current event description event (and thus is possibly binlog format sensitive).

Translation of events from old formats to the current format is done by parsing event data using Log_event::read_log_event() (that is, in the constructors of event objects) and then writing the event objects to the relay log.

Note: Looks like the event object constructors don't trust the binlog version stored in the FDE passed to them.

[edit] Network Packets in Detail

The communication protocol between the master and slave is the one that all other normal connections use, as described in MySQL Client/Server Protocol. So after the COM_BINLOG_DUMP command has been sent, the communication between the master and slave is a sequence of packets, each of which contains an event. In slave.cc, function read_event(), one has an example: net_safe_read() is called; it is able to detect wrong packets. After net_safe_read(), the event is ready to be interpreted; it starts at pointer (char*) mysql->net.read_pos + 1. That is, (char*) mysql->net.read_pos + 1 is the first byte of the event's timestamp, etc.

[edit] Replication Event Format in Detail

[Moved to: The Binary Log.]

[edit] Plans

We have already made extensive changes to the above in MySQL 5.0. For an upcoming version we plan a new format where data is replicated by row instead of by statement. This makes it possible to replicate data from one MySQL Cluster to another. We also plan new functionality to handle replication connections from multiple masters into one slave.

Retrieved from "http://forge.mysql.com/wiki/MySQL_Internals_Replication"

This page has been accessed 7,011 times. This page was last modified 02:28, 14 August 2008.

Find

Browse
MySQLForge
Main Page
Current events
Recent changes
Random page
Help
Edit
Edit this page
Editing help
This page
Discuss this page
Post a comment
Printable version
Context
Page history
What links here
Related changes
My pages
Special pages
New pages
File list
Statistics
Bug reports
More...