MySQL Internals Replication
← Back to MySQL Internals overview page
Replication
[edit] Introduction
Status of this section: up to date 2009-10-21
This chapter describes MySQL replication principles, rules, and code, as it is supposed to work in version 5.1.
The MySQL replication feature allows a server - the master - to send all changes to another server - the slave - and the slave tries to apply all changes to keep up-to-date with the master. Replication works as follows:
- Whenever the master's database is modified, the change is written to a file, the so-called binary log, or binlog. This is done by the client thread that executed the query that modified the database.
- The master has a thread, called the dump thread, that continuously reads the master's binlog and sends it to the slave.
- The slave has a thread, called the IO thread, that receives the binlog that the master's dump thread sent, and writes it to a file: the relay log.
- The slave has another thread, called the SQL thread, that continuously reads the relay log and applies the changes to the slave server.
[edit] Organisation of this doocument
We distinguish between two levels of the architecture: principles, and rules:
- principle
- High-level goal that declares how the program shall work, from an external point of view. Principles do not discuss how the code works. An example of a principle is "no row events shall be written to the binlog if binlog_format=statement".
- rule
- Defines how to implement the principles. Rules can be formulated on a very high level, but they describe implementation (how the code works from the inside) rather than interface (how the program behaves when looking from the outside). An example of a rule is "at ROLLBACK when the transaction cache only contains updates to transactional tables, clear the transaction cache and do not write to the binlog".
In the next section, we list the source code files used by replication and describe what each file contains. The section after describes the replication principles and the following section describes the replication rules. The last section contains very old, obsolete documentation. It will be removed after we have verified that anything useful has been transferred to the main sections.
[edit] Source Code Files
Status of this section: up to date 2009-10-21
Files in the sql directory:
| File | Description |
|---|---|
| slave.h/.cc | Contains the slave IO and SQL threads. This is the high-level administrative logic for the slave threads - i.e., not the low-level functions for parsing the dump from the net, parsing the relay log, formatting the relay log, or executing events. Also contains the code to execute SHOW SLAVE STATUS. |
| log.h/.cc | The high-level binary logging mechanism for organizing events into a sequence so that it becomes a binary log. Routines for creating, writing, and deleting binary log files. Also the handler callbacks for the binlog. |
| log_event.h/.cc | The Log_event class and subclasses, for creating, writing, reading, printing, and applying events of every event type. Reading and writing here is at a low level, that is, serializing values into records. |
| old_log_event.h/.cc | Contains classes to read and execute old versions of the row log events. |
| rpl_rli.h/.cc | Implementation of the data structure Relay_log_info that holds the state of the SQL thread, and also some auxiliary methods used by the slave SQL thread. (The slave SQL thread is in slave.cc and it also uses other auxiliary functions.) |
| rpl_mi.h/.cc | Contains the data structure Master_info that holds some state of the IO thread (the IO thread is in slave.cc and it also uses other auxiliary functions). |
| sql_repl.cc | The dump thread, where the master sends its binary log to the slave. This is also where the code resides for RESET SLAVE, CHANGE MASTER, RESET MASTER, SHOW BINLOG EVENTS, SHOW MASTER STATUS, SHOW BINARY LOGS, PURGE BINARY LOGS, as well as some replication-related system variables. |
| sql_binlog.cc | Contains code to execute BINLOG statements (i.e., the base64-encoded things that mysqlbinlog prints when it sees row events). |
| rpl_record.h/.cc | Utilities for encoding and decoding table rows into and out of the row event format. |
| repl_failsafe.h/.cc | Utilities to initialize and register slaves on the master. Also unfinished and unused code dealing with "failsafe" (master election if the primary master fails). |
| replication.h | Observer class declarations, which together constitute the binary log interface. |
| rpl_constants.h | Enumeration of incidents (events that occur during replication). Also some constants that are local to the replication code. |
| rpl_filter.h/.cc | Implements the table and database filters used by the --{binlog,replicate}-{do,ignore}-db, --replicate[-wild]-{do,ignore}-table, and --replicate-rewrite-db flags. |
| rpl_handler.h/.cc | Coordination classes used by plugins to register to the binary log interface. |
| rpl_injector.h/.cc | The injector class that allows external insertions into the binary log. This is used for cluster replication binary logging. |
| rpl_reporting.h/.cc | Utilities for reporting replication conditions and reporting errors, warnings, and informational messages on the slave. |
| rpl_tblmap.h/.cc | Utilities to generate a mapping from numbers to tables. The mapping is used by the row logging system to identify tables. |
| rpl_utility.h/.cc | Auxiliary classes and functions used for Table_map_events, and also an auxiliary class for smart pointers. |
| sql_base.cc | Prior to Bug#39934, the function decide_logging_format() that determines if statements should be written row-based or statement-based to the binlog. After Bug#39934, there is nothing related to replication here. |
| sql_class.cc | The function binlog_query(), called from commands that need to log a query_log_event. After Bug#39934, also decide_logging_format(). |
| sql_lex.h/.cc | List of all types of unsafe statements, and functions for marking statements unsafe. |
Files in the plugin/semisync directory:
| File | Description |
|---|---|
| semisync.h/.cc | Auxiliary code, particularly for tracing, that is used by both the master semisync module and the slave semisync module. |
| semisync_master.h/.cc | The master semisync module. |
| semisync_master_plugin.cc | The callbacks invoked by the server to use the master semisync module, as well as code to register the master semisync module. |
| semisync_slave.h/.cc | The slave semisync module. |
| semisync_slave_plugin.cc | The callbacks invoked by the server to use the slave semisync module, as well as code to register the master semisync module. |
[edit] Principles
In this section, we describe the architectural principles of replication. These are high-level goals that replication shall achieve. The principles have been used as guidelines to construct the Rules of replication (next section).
[edit] Binlog formats
Status of this subsection: Complete but not reviewed 2009-10-21
The binlog is organized as a linear sequence of events. An SQL query that modifies the database will generate one or more events and append them to the binlog. There are also auxiliary event types that describe the structure of the binlog.
Queries can be logged in two ways:
- In statement format: the SQL query is written to the binlog in text.
- In row format: rows that changed are written to the binlog in a binary format. Each row may consist of a Before Image (BI) and/or an After Image (AI). The BI identifies the row to modify and the AI describes the row after the change. There are three types of log_events:
- Write_rows_log_event: adds a new row to a table. Has only AI.
- Update_rows_log_event: modifies an existing row in a table. Has both BI and AI.
- Delete_rows_log_event: removes an existing row from a table. Has only BI.
Which of the two formats to use is configured with the @@session.binlog_format variable, which takes the values STATEMENT, ROW, or MIXED. The following principles shall hold:
- (P-binlog_format-statement) @@session.binlog_format=STATEMENT:
- We do not guarantee correct logging.
- The client may not generate row events.
- If the server cannot determine that a statement is correctly logged, a warning or error shall be issued:
- If it is possible that the user (through application-specific logic) knows that the statement is correctly logged, then a warning shall be issued.
- If it is inherently impossible for the user to determine that the statement will be correctly logged, an error shall be issued and the statement shall not execute.
- (P-binlog_format-row) @@session.binlog_format=ROW:
- We guarantee correct logging. If a statement cannot be correctly logged, then an error shall be generated and the statement shall not execute.
- DML changes may only be logged in row format, not in statement format. If a DML change cannot be logged in row format, then an error shall be generated and the statement shall not execute.
- (P-binlog_format-mixed) @@session.binlog_format=MIXED:
- We guarantee correct logging. If a statement cannot be correctly logged, then an error shall be generated and the statement shall not execute.
- If correct logging can be guaranteed by logging in statement format, then statement format shall be used. Otherwise, row format shall be used.
- Clarification: If it cannot be determined in a practical manner that statement format leads to correct logging, then row format shall be used.
[edit] Differences between master and slave
Status of this subsection: In progress 2009-10-21
What does the term correct replication really mean? To clarify the notion, we make the following preliminiary definitions:
- (D-identical-environments) Two environments are identical if all the following are identical:
- The hardware representation of floating point numbers and the hardware implementation of floating point arithmetic
- The case sensitivity of the file systems
- The versions of all used components of MySQL
- Note: The following are examples of things not taken into account by this definition:
- The hardware's word size, as long as it is supported by MySQL
- The hardware's endianness
- (D-identical-server-states) Two server states are identical if all the following are identical:
- The sets of databases (a.k.a. schemas)
- The table definitions (including table names, table options, column definitions) of all tables outside the mysql and information_schema databases
- The table contents, modulo row order (in mathematical language: the unordered multisets of rows are equal), of all tables outside the mysql and information_schema databases
- The definitions of all functions, procedures, triggers, views, prepared statements, and events
- Note: The following are examples of things not taken into account in this definition:
- System variables and user variables
- The state of clients, including the replication slave. The state includes session variables and temporary tables.
- Binlog files, relay log files, binlog indexes, relay log indexes
- The internal state of the random number generator
- Which plugins are installed
- (D-rpl-correct) Replication is correct if both the following hold:
- Any change on the master eventually results in the same change on the slave.
- Any intermediate state of the slave is identical to some intermediate state of the master.
- Note: it is not required that each intermediate state of the master is identical to some intermediate state of the slave.
We now state the architectural principles that define when replication shall be correct. The following is the main rule:
- (P-rpl-correct) If a replication master and slave reside on identical environments (D-identical-environments) and the server states are identical (D-identical-server-states), and @@client.binlog_format!=STATEMENT, then replication shall be correct.
[edit] Exceptions: situations where we do not guarantee correct replication
There are some exceptions to (P-rpl-correct) where we do not guarantee correct replication.
- (P-exception-federated) If a table uses a federated table on the master, then TODO
- (P-exception-table-definition) If a CREATE TABLE uses a DATA DIRECTORY or INDEX DIRECTORY clause, then the table may not be correctly replicated.
- (P-exception-plugins) TODO: figure out principles for plugin replication
[edit] Additions: special situations where we do support correct replication
In addition to what we guarantee in (P-rpl-correct), we also guarantee correct replication in the following scenarios:
- (P-rpl-different-file-system-case-sensitivities) TODO: allowed differences in file system case sensitivity
- (P-rpl-different-versions) Replication shall be correct even if master has version a.b.c and slave has version A.B.C, where A.B.C >= a.b.c and A <= a+1.
- (P-rpl-different-table-definitions) Table options may differ in the following fields: TODO: (comments, data/index directories, various hints)
- (P-rpl-different-engines) TODO: allowed differences in storage engines
- (P-rpl-different-column-definitions) Replication shall be correct even if the table definitions differ in one or more of the following ways:
- (rpl-extra-slave-columns) binlog_format=ROW and the slave has extra columns after the columns of the master. Notes:
- Extra indexes still must follow the rules for indexes - see below.
- For purposes of defining "correct replication", the tables are considered equal if they are equal on the common columns.
- (rpl-extra-slave-columns) binlog_format=ROW and the slave has extra columns after the columns of the master. Notes:
- (rpl-missing-slave-columns) binlog_format=ROW and the master has extra columns after the columns of the slave, as long as the following rules apply:
- The slave cannot have both missing columns and extra columns (see above) at the same time.
- If the master uses --binlog_row_image=minimal or binlog_row_image=noblob, then the BI must contain at least one column that exists on the slave. Moreover, the set of columns that are logged in BI must not match two different rows on the slave (but it may match two or more identical rows). This can be ensured, e.g., by one of the following strategies:
- The master has a PK that only includes columns that the slave has.
- The master has a PK that includes all columns that the slave has, and possibly other columns too.
- Note: For purposes of defining "correct replication", the tables are considered equal if they are equal on the common columns.
- (rpl-missing-slave-columns) binlog_format=ROW and the master has extra columns after the columns of the slave, as long as the following rules apply:
- (rpl-type-promotion) The data type of a column differs as allowed in the manual.
- (rpl-different-keys) Keys, indexes, and NOT NULL attributes may differ freely between master and slave, as long as the follwing rule applies:
- If the slave has an enabled key, and the master does not have an enabled key of the same type over the exact same set of columns (e.g., because the key is missing/disabled or of a different type; or because the columns only exist on the slave), then the semantics of the slave's key must be ensured before the rows are inserted on the slave. Specifically:
- If the slave has a uniqueness constraint (PK or UK), then uniqueness must be guaranteed before a row is inserted on the slave. This can be done, for example, through the following strategies:
- Have a uniqueness constraint (UK or PK) on the master, over the same columns or over a subset of the columns.
- If a column only exists on the slave (or if application-specific logic ensures that only NULL values are inserted on the master), then the AUTOINCREMENT attribute can be used on the slave.
- Use application-specific logic on the master that ensures that rows inserted are unique in the key's columns.
- Use BEFORE INSERT and/or BEFORE UPDATE triggers on the slave that ensure (through application-specific logic) that the rows are unique.
- If the slave has a non-NULL constraint (PK or NOT NULL), then the absence of NULL values must be ensured before a row is inserted on the slave. This can be done analogously to how uniqueness constraints are satisfied above:
- Have non-NULL constraints (PK or NOT NULL) on the master covering all the columns.
- Use AUTOINCREMENT as above.
- Use application-specific logic on the master that ensures no NULL values are inserted.
- Use BEFORE INSERT and/or BEFORE UPDATE triggers on the slave that ensure (through application-specific logic) that no NULL values are inserted.
- If the slave has a uniqueness constraint (PK or UK), then uniqueness must be guaranteed before a row is inserted on the slave. This can be done, for example, through the following strategies:
- Note: There are no restrictions on extra keys on the master.
- (rpl-column-names) If binlog_format=ROW, then column names may differ: columns are identified only by their position.
- (P-rpl-different-rows) TODO: allowed extra rows on master or slave
- (P-rpl-different-default-values) TODO:
[edit] Crashes
Status of this subsection: Not started 2009-10-21
[edit] Binlog files and indexes
Status of this subsection: Not started 2009-10-21
[edit] Rules
[edit] Determining the logging format
For each statement, we must determine the logging format: row or statement. This is done as follows.
- At parse time, it is detected if the statement is unsafe to log in statement format (i.e., requires row format). If this is the case, the THD::Lex::set_stmt_unsafe() function is called. This must be done prior to the call to THD::decide_logging_format() (i.e., prior to lock_tables). As a special case, some types of unsafeness are detected inside THD::decide_logging_format(), before the logging format is decided. Note that statements shall be marked unsafe even if binlog_format!=mixed.
- THD::decide_logging_format() determines the logging format, based on the value of binlog_format and the unsafeness of the current statement.
- THD::decide_logging_format() also determines if the statement is impossible to log, in which case it generates an error and the statement is not executed. The statement may be impossible to log for the following reasons:
- both row-incapable engines and statement-incapable engines are involved (ER_BINLOG_ROW_ENGINE_AND_STMT_ENGINE)
- BINLOG_FORMAT = ROW and at least one table uses a storage engine limited to statement-logging (ER_BINLOG_ROW_MODE_AND_STMT_ENGINE)
- statement is unsafe, BINLOG_FORMAT = MIXED, and storage engine is limited to statement-logging and (ER_BINLOG_UNSAFE_AND_STMT_ENGINE)
- statement is a row injection (i.e., a row event executed by the slave SQL thread or a BINLOG statement) and at least one table uses a storage engine limited to statement-logging (ER_BINLOG_ROW_INJECTION_AND_STMT_ENGINE)
- BINLOG_FORMAT = STATEMENT and at least one table uses a storage engine limited to row-logging (ER_BINLOG_STMT_MODE_AND_ROW_ENGINE)
- statement is a row injection (i.e., a row event executed by the slave SQL thread or a BINLOG statement) and BINLOG_FORMAT = STATEMENT (ER_BINLOG_ROW_INJECTION_AND_STMT_MODE)
- more than one engine is involved and at least one engine is self-logging (ER_BINLOG_MULTIPLE_ENGINES_AND_SELF_LOGGING_ENGINE)
- See the comment above decide_logging_format for details.
- THD::decide_logging_format() also determines if a warning shall be issued. A warning is issued for unsafe statements if binlog_format=STATEMENT. Warnings are not issued immediately; instead, THD::binlog_stmt_unsafe_flags is set and the warning is issued in THD::binlog_query(). This prevents warnings in the case that the statement generates an error later so that it is not logged.
Sub-statements. Let T be a statement that invokes an unsafe sub-statement S (S may be a stored function, stored procedure, trigger, view, or prepared statement). Each sub-statement is cached as an sp_head object. The sp_head object stores the Lex that was generated when the statement defining the sub-statement was parsed (i.e., when CREATE FUNCTION/CREATE PROCEDURE/CREATE TRIGGER/CREATE VIEW/PREPARE was parsed). Hence, this cached Lex has the unsafe flag set. When T is parsed, it fetches S from the cache. At this point, it calls sp_head::propagate_attributes(), which marks the current Lex object as unsafe if the cached Lex object was unsafe.
[edit] Unsafe statements
A statement may be flagged as unsafe. An unsafe statement will be logged in row format if binlog_format=MIXED and will generate a warning if binlog_format=STATEMENT.
The following types of sub-statements are currently marked unsafe:
- System functions that may return a different value on slave, including: FOUND_ROWS, GET_LOCK, IS_FREE_LOCK, IS_USED_LOCK, LOAD_FILE, MASTER_POS_WAIT, RELEASE_LOCK, ROW_COUNT, SESSION_USER, SLEEP, SYSDATE, SYSTEM_USER, USER, UUID, UUID_SHORT.
- Note: the following non-deterministic functions are not marked unsafe:
- CONNECTION_ID (Query_log_events contain the connection number)
- CURDATE, CURRENT_DATE, CURRENT_TIME, CURRENT_TIMESTAMP, CURTIME, LOCALTIME, LOCALTIMESTAMP, NOW, UNIX_TIMESTAMP, UTC_DATE, UTC_TIME, UTC_TIMESTAMP (Query_log_event contain timezone and the time when the statement was executed)
- LAST_INSERT_ID (this is replicated in an Intvar_log_event)
- RAND (the seed is replicated in a Rand_log_event)
- Also note that most floating-point math functions will return a hardware-dependent result. We do not mark such function unsafe, because we only support replication between platforms that use identical floating point math.
- System variables, with some exceptions listed at http://dev.mysql.com/doc/refman/5.1/en/binary-log-mixed.html
- UDFs: since we have no control over what the UDF does, it may be doing something unsafe.
- Update of a table that has an autoincrement column. This is unsafe because the order that rows are updated may differ on master and slave. Cf. Bug#45677
- INSERT DELAYED, since the rows inserted may interleave with concurrently executing statements.
- Updates using LIMIT, since the order in which rows are retreived is not specified.
- Statements referencing system log tables, since the contents of those tables may differ between master and slave.
- Non-transactional reads or writes executing after transactional reads or writes in a transaction. (cf. WL#2687 and next section)
- Reads or writes to self-logging tables or statements executing after such reads or writes in the same transaction.
The following has not yet been implemented:
- Statements using fulltext parser plugins (cf. Bug#48183)
[edit] Logging transactions
Status of this subsection: complete but not reviewed 2009-10-21
There are several types of statements that require attention because of their special behavior in transactions:
- Non-transactional updates that take place inside a transaction present problems for logging because (1) they are visible to other clients before the transaction is committed, and (2) they are not rolled back even if the transaction is rolled back. It is not always possible to log correctly in statement format when both transactional and non-transactional tables are used in the same transaction.
- Statements that do an implicit commit (i.e., most but not all DDL, and some utility commands) are logged specially due to unspecified requirements by NDB.
- Statements that update temporary tables need special treatment since they are not logged in row format.
[edit] Definitions
To reason about logging different table types, we make some preliminary definitions.
- (D-T-table) A table that has a transactional engine is called a T-table.
- (D-N-table) A table that has a non-transactional engine is called an N-table.
- (D-N-write) A statement makes an N-write if it makes any type of change to the server state that will not be changed by a ROLLBACK.
- Note: N-writes include updates to N-tables, but also CREATE and DROP statements.
- (D-log-target) Events are either appended to the Transaction Cache (TC) or to the Statement Cache (SC) or written directly to the binlog.
[edit] Preliminary Rules
The following preliminary rules are actually consequences of the principle that statements shall be correctly logged when binlog_format=MIXED or ROW. They also apply when binlog_format=STATEMENT: this makes statement format work in many practical cases.
- (PR-causality) If statement A is executed before statement B, and B is logged in statement format, and B reads tables that A may modifies, then B shall be logged after A.
- (PR-durability) Events shall be written to the binary log at the moment they take effect. In particular, changes to N-tables shall be written to the binary log when they have been executed, and changes to T-tables shall be written to the binary log on commit. If --sync-binlog has been specified, then it suffices that events are be written to the binary log at the next synchronization point.
- (PR-causality-precedence) If P-causality and P-durability cannot both be fulfilled, then P-causality is considered more important.
[edit] Rules for non-committing statements, except CREATE TEMPORARY TABLE...SELECT
The preliminary rules above, together with the principles for logging format, have been used to construct the following rules.
CALL statements are unrolled (see ???TODO: add section about unrolling???), so that each statement executed by the stored procedure is logged separately. (If a stored procedure A invokes a stored procedure B, then B is unrolled recursively). In the following, we assume that unrolling has already been done, and the word "statement" refers to a non-CALL top-level statement or a non-CALL sub-statement.
Let S be a logged statement that does not have an implicit commit, except CREATE TEMPORARY TABLE...SELECT (This includes all "pure DML": INSERT, UPDATE, DELETE, REPLACE, TRUNCATE, SELECT, DO, CALL, EXECUTE, LOAD DATA INFILE, and BINLOG. It also includes CREATE TEMPORARY TABLE without SELECT, and DROP TEMPORARY TABLE. CREATE TEMPORARY TABLE...SELECT is handled in the next subsection).
- Before executing S, determine unsafeness:
- (R-unsafe-nontransactional) If S either makes N-writes or reads from an N-table, and either S or a previous statement in the same transaction reads or writes to a T-table then S is marked unsafe.
- (R-unsafe-self-logging) If either S or a previous statement in the same transaction reads or writes to a self-logging table, then S is marked unsafe.
- When logging S, determine where to log it by applying the following rules in order:
- (R-log-statement-format) If S is to be logged in statement format (i.e., if one of the following holds: (1) @@session.binlog_format=STATEMENT; (2) @@session.binlog_format=MIXED and S is safe; (3) S is of DDL type, i.e., CREATE TEMPORARY TABLE):
- If S produces an error and does not do any N-write, do not log.
- Otherwise, if either S or any previous statement in the same transaction reads or writes in any T-tables, log to TC.
- Otherwise, log to SC.
- (R-log-row-format) If S is to be logged in row format (i.e., if S is DML and one of the following holds: (1) @@session.binlog_format=ROW; (2) @@session.binlog_format=MIXED and S is unsafe):
- Do not log row events that write to temporary tables.
- Log row events that write to non-temporary N-tables to SC.
- Log row events that write to non-temporary T-tables to TC, except rows that are rolled back due to an error. (Note: if there is an error, rows written to a T-table are kept if there are subsequent rows written to an N-table.)
- (R-flush-SC) At the end of S, write BEGIN + SC + COMMIT to the binlog and clear the SC.
- At end of transaction:
- (R-log-commit) At COMMIT or implicit commit, where all XA tables in the transaction succeed in the "prepare" phase:
- If the TC is non-empty, write BEGIN + TC + COMMIT to the binlog.
- If the TC is empty, do nothing.
- (R-log-rollback) At ROLLBACK; or at COMMIT or implicit commit where some XA table fails in the "prepare" phase:
- If the TC contains any N-write, write BEGIN + TC + ROLLBACK to the binlog.
- If the TC does not contain any N-write, do nothing.
- (R-log-rollback-to-savepoint) At ROLLBACK TO SAVEPOINT:
- If the TC contains any N-write after the savepoint, write ROLLBACK TO SAVEPOINT to the TC.
- Otherwise, clear the part of the TC that starts at the savepoint and extends to the end of the TC. (Bug#47327 breaks this rule)
- (R-clear-TC) Clear the TC at the end of the transaction.
[edit] Rules for CREATE [TEMPORARY] TABLE...SELECT
First, unsafeness is determined as above (R-unsafe-transaction). Then the logging format is decided. Then the following rules apply.
- (R-log-create-select-statement-format) If logging in statement format (i.e., one of the following holds: (1) @@session.binlog_format=STATEMENT; (2) @@session.binlog_format=MIXED and statement is safe):
- If there is an error, do not write anything.
- If there is no error and the TEMPORARY keyword is used, write the entire CREATE...SELECT statement to the TC.
- If there is no error and the TEMPORARY keyword is not used, write the entire CREATE...SELECT directly to the binlog.
- (R-log-create-select-row-format) If logging in row format (i.e., one of the following holds: (1) @@session.binlog_format=ROW; (2) @@session.binlog_format=MIXED and statement is unsafe):
- If the TEMPORARY keyword is used, do not write anything.
- If the TEMPORARY keyword is not used, write CREATE TABLE (without select) + BEGIN + row events + COMMIT to the TC. If there is an error, clear the TC; otherwise flush the TC to the binlog at the end of the statement and then clear the TC. (Note: currently Bug#47899 breaks this rule)
- Note: this breaks D-rpl-correct rule, because the slave will have an intermediate state that never existed on the master (namely, a state where the new table exists and is empty).
[edit] Rules for committing statements, except CREATE [TEMPORARY] TABLE...SELECT
- (R-log-commit-statement) All other statements that have a pre-commit are written directly to the binlog. (Note: this is semantically equivalent to writing it to the SC and flushing the SC. However, due to requirements by NDB (which have not been clarified), we write directly to the binlog.)
[edit] Logging updates to auto_increment columns
Status of this subsection: not started 2009-10-21
[edit] Logging access to variables and functions
Status of this subsection: not started 2009-10-21
User variables: User variables (@variable) are logged as
User-defined functions
Server variables
Built-in functions
[edit] Other unsafe statements
Status of this subsection: not started 2009-10-21
INSERT DELAYED
LIMIT
System tables
[edit] binlog_row_image
Status of this subsection: finished, not reviewed, not fully implemented 2009-10-21
The sets of columns recorded in the BI and AI are determined by the value of binlog_row_image. To specify the sets of columns, we define the PKE (for Primary Key Equivalent), as follows:
- If a PK exists, the PKE is equal to the PK.
- Otherwise, if there exists a UK where all columns have the NOT NULL attribute, then that is the PKE (if there are more than one such UKs, then one is chosen arbitrarily).
- Otherwise, the PKE is equal to the set of all columns.
The set of columns included in the BI and AI are defined as in the following tables:
- write event
binlog_row_image Before image After image minimal - All columns where a value was specified, and the autoincrement column if there is one noblob - All columns where a value was specified, and the autoincrement column if there is one, and all non-blob columns full - All columns
- update event
binlog_row_image Before image After image minimal PKE All columns where a value was specified noblob PKE + all non-blob columns All columns where a value was specified, and all non-blob columns full All columns All columns
- delete event
binlog_row_image Before image After image minimal PKE - noblob PKE + all non-blob columns - full All columns -
Cf. WL#5092.
[edit] Replication locks
Status of this subsection: not started 2009-10-21
[edit] Text below this line is obsolete
[edit] Replication Threads
Every time replication is started on the slave mysqld, that is, when mysqld is started with some replication options (--master-host=this_hostname etc.) or some existing master.info and relay-log.info files, or when the user does START SLAVE on the slave, two threads are created on the slave, in slave.cc:
extern "C" pthread_handler_decl(handle_slave_io,arg)
{ ... }
extern "C" pthread_handler_decl(handle_slave_sql,arg)
{ ... }
[edit] The Slave I/O Thread
The I/O thread connects to the master using a user/password. After connecting, it asks the master for its binary logs:
thd->proc_info = "Requesting binlog dump";
if (request_dump(mysql, mi, &suppress_warnings))
Then it enters this loop:
while (!io_slave_killed(thd,mi))
{
...
thd->proc_info = "Reading master update";
ulong event_len = read_event(mysql, mi, &suppress_warnings);
...
thd->proc_info = "Queueing event from master";
if (queue_event(mi,(const char*)mysql->net.read_pos + 1,
event_len))
{
sql_print_error("Slave I/O thread could not queue event
from master");
goto err;
}
flush_master_info(mi);
if (mi->rli.log_space_limit && mi->rli.log_space_limit <
mi->rli.log_space_total &&
!mi->rli.ignore_log_space_limit)
if (wait_for_relay_log_space(&mi->rli))
{
sql_print_error("Slave I/O thread aborted while
waiting for relay log space");
goto err;
}
...
}
read_event() calls net_safe_read() to read what the master has sent over the network. queue_event() writes the read event to the relay log, and also updates the counter which keeps track of the space used by all existing relay logs. flush_master_info() writes to the master.info file the new position up to which the thread has read in the master's binlog. Finally, if relay logs take too much space, the I/O thread blocks until the SQL thread signals that it is okay to read and queue more events. The code handles network failures and reconnections.
When the relay log gets too large, it is rotated: The I/O thread stops writing to it, closes it, opens a new relay log file and writes to the new one. The old file is kept until the SQL thread (see below) has finished executing it, then it is deleted. The meaning of "too large" is determined from the value of one of two system variables as follows:
-
max_relay_log_size, ifmax_relay_log_size> 0 -
max_binlog_size, ifmax_relay_log_size= 0 or MySQL is older than 4.0.14 (before whichmax_relay_log_sizedoes not exist)
[edit] The Slave SQL Thread
while (!sql_slave_killed(thd,rli))
{
thd->proc_info = "Processing master log event";
DBUG_ASSERT(rli->sql_thd == thd);
THD_CHECK_SENTRY(thd);
if (exec_relay_log_event(thd,rli))
{
// do not scare the user if SQL thread was simply killed or stopped
if (!sql_slave_killed(thd,rli))
sql_print_error("Error running query, slave SQL thread
aborted. Fix the problem, and restart
the slave SQL thread with "SLAVE START".
We stopped at log '%s' position %s",
RPL_LOG_NAME, llstr(rli->master_log_pos, llbuff));
goto err;
}
}
exec_relay_log_event() reads an event from the relay log (by calling next_event()). next_event() will start reading the next relay log file if the current one is finished; it will also wait if there is no more relay log to read (because the I/O thread is stopped or the master has nothing more to send to the slave). Finally exec_relay_log_event() executes the read event (all ::exec_event() methods in log_event.cc) (mostly this execution goes through sql_parse.cc), thus updating the slave database and writing to relay-log.info the new position up to which it has executed in the relay log. The ::exec_event() methods in log_event.cc will take care of filter options like replicate-do-table and such.
When the SQL thread hits the end of the relay log, it checks whether a new one exists (that is, whether a rotation has occurred). If so, it deletes the already-read relay log and starts reading the new one. Otherwise, it just waits until there's more data in the relay log.
[edit] Why 2 Threads?
In MySQL 3.23, we had only one thread on the slave, which did the whole job: read one event from the connection to the master, executed it, read another event, executed it, etc.
In MySQL 4.0.2 we split the job into two threads, using a relay log file to exchange between them.
This makes code more complicated. We have to deal with the relay log being written at the end, read at another position, at the same time. Plus handling the detection of EOF on the relay log, switching to the new relay log. Also the SQL thread must do different reads, depending on how the relay log file it is reading is being used:
- If the file is being written to by the I/O thread, the relay log is partly in memory, not all on disk, and mutexes are needed to avoid confusion between threads.
- If the file has already been rotated (the I/O thread is not writing to it anymore), it is a normal file that no other threads touch.
The advantages of having 2 threads instead of one:
- Helps having a more up-to-date slave. Reading a statement is fast, executing it is slow. If the master dies (burns), there are good chances that the I/O thread has caught almost all updates issued on the master, and saved them in the relay log, for use by the SQL thread.
- Reduces the required master-slave connection time. If the slave has not been connected for a long time, it is very late compared to the master. It means the SQL thread will have a lot of executing to do. So with the single-thread read-execute-read-execute technique, the slave will have to be connected for a long time to be able to fetch all updates from the master. Which is stupid, as for a significant part of the time, the connection will be idle, because the single thread is busy executing the statement. Whereas with 2 threads, the I/O thread will fetch the binlogs from the master in a shorter time. Then the connection is not needed anymore, and the SQL thread can continue executing the relay log.
[edit] The Binlog Dump Thread
This thread is created by the master when it receives a COM_BINLOG_DUMP request.
void mysql_binlog_send(THD* thd, char* log_ident, my_off_t pos,
ushort flags)
{
...
if ((file=open_binlog(&log, log_file_name, &errmsg)) < 0)
{
my_errno= ER_MASTER_FATAL_ERROR_READING_BINLOG;
goto err;
}
if (pos < BIN_LOG_HEADER_SIZE || pos > my_b_filelength(&log))
{
errmsg= "Client requested master to start replication from
impossible position";
my_errno= ER_MASTER_FATAL_ERROR_READING_BINLOG;
goto err;
}
my_b_seek(&log, pos); // Seek will done on next read
...
// if we are at the start of the log
if (pos == BIN_LOG_HEADER_SIZE)
{
// tell the client log name with a fake rotate_event
if (fake_rotate_event(net, packet, log_file_name, &errmsg))
{
my_errno= ER_MASTER_FATAL_ERROR_READING_BINLOG;
goto err;
}
...
}
while (!net->error && net->vio != 0 && !thd->killed)
{
pthread_mutex_t *log_lock = mysql_bin_log.get_log_lock();
while (!(error = Log_event::read_log_event(&log, packet, log_lock)))
{
...
if (my_net_write(net, (char*)packet->ptr(), packet->length()) )
{
errmsg = "Failed on my_net_write()";
my_errno= ER_UNKNOWN_ERROR;
goto err;
}
...
If this thread starts reading from the beginning of a binlog, it is possible that the slave does not know the binlog's name (for example it could have just asked give me the FIRST binlog</span>). Using fake_rotate_event(), the master tells the slave the binlog's name (required for master.info and SHOW SLAVE STATUS) by building a Rotate_log_event and sending this event to the slave. In this event the slave will find the binlog's name. This event has zeros in the timestamp (shows up as written in year 1970 when reading the relay log with mysqlbinlog).
[edit] How Replication Deals With...
This section describes how replication handles various problematic issues.
[edit] Columns, LAST_INSERT_ID()
When a statement inserts into such a column, or uses LAST_INSERT_ID(), one or two Intvar_log_event instances are written to the binlog just before the Query_log_event.
[edit] User Variables (Since 4.1)
When a statement uses a user variable, a User_var_log_event is written to the binlog just before the Query_log_event.
[edit] System Variables
Example: SQL_MODE, FOREIGN_KEY_CHECKS. Not dealt with. Guilhem is working on it for version 5.0.
[edit] Some Functions
USER(), LOAD_FILE(). Not dealt with. Will be solved with row-level binlogging (presently we have statement-level binlogging, but in the future we plan to support row-level binlogging too).
[edit] Non-repeatable UDF Functions
Non repeatable means that they have a sort of randomness, for example they depend on the machine (to generate a unique ID for example). Not dealt with. Will be solved with row-level binlogging.
[edit] Prepared Statements
For the moment, a substituted normal statement is written to the master's binlog. Using prepared statements on the slave as well is on the TODO.
[edit] Temporary Tables
Temporary tables depend on the thread which created them, so any query event which uses such tables is marked with the LOG_EVENT_THREAD_SPECIFIC_F flag. All events have in their header the id of the thread which created them, so the slave knows which temporary table the statement refers to.
When the slave is stopped (STOP SLAVE or even mysqladmin shutdown), the in-use replicated temporary tables are not dropped (like clients' temporary tables are). This way, when the slave restarts they are still available.
When a connection using temporary tables terminates on the master, the master automatically writes some DROP TEMPORARY TABLE statements for them so that they are dropped on the slave as well.
When the master brutally dies, then restarts, it drops all temporary tables which remained in tmpdir, but without writing it to the binlog, so these temporary tables are still on the slave, and they will not be dropped before the next slave's mysqld restart. To avoid this, the slave drops all replicated temporary tables when it executes a Start_log_event read from the master. Indeed such an event means the master's mysqld has restarted so all preceding temporary tables have been dropped.
Presently we have a bug: if the slave mysqld is stopped while it was replicating a temporary table, then at restart it deletes this table (like a normal temporary table), which may cause a problem if subsequent statements on the master refer to this table.
[edit] LOAD DATA [LOCAL] INFILE (Since 4.0)
The master writes the loaded file to the binlog, but in small blocks rather than all at once. The slave creates a temporary file, the concatenation of each block. When the slave reads the final Execute_load_log_event, it loads all temporary files into the table and deletes the temporary files. If the final event was instead a Delete_file_log_event then these temporary files are deleted without loading.
[edit] How a Slave Asks Its Master to Send Its Binary Log
The slave server must open a normal connection to its master. The MySQL account used to connect to the master must have the REPLICATION SLAVE privilege on the master. Then the slave must send the COM_BINLOG_DUMP command, as in this example taken from function request_dump():
static int request_dump(MYSQL* mysql, MASTER_INFO* mi,
bool *suppress_warnings)
{
char buf[FN_REFLEN + 10];
int len;
int binlog_flags = 0; // for now
char* logname = mi->master_log_name;
DBUG_ENTER("request_dump");
// TODO if big log files: Change next to int8store()
int4store(buf, (longlong) mi->master_log_pos);
int2store(buf + 4, binlog_flags);
int4store(buf + 6, server_id);
len = (uint) strlen(logname);
memcpy(buf + 10, logname,len);
if (simple_command(mysql, COM_BINLOG_DUMP, buf, len + 10, 1))
{
// act on errors
}
}
Here variable buf contains the arguments for COM_BINLOG_DUMP. It's the concatenation of:
- 4 bytes: the position in the master's binlog from which we want to start (that is, please master, send me the binlog, starting from this position).
- 2 bytes: 0 for the moment.
- 4 bytes: this slave's server id. This is used by the master to delete old
Binlog Dumpthreads which were related to this slave (see functionkill_zombie_dump_threads()for details). - variable-sized part: the name of the binlog we want. The dump will start from this binlog, at the position indicated in the first four bytes.
Then send the command, and start reading the incoming packets from the master, like read_event() does (using net_safe_read() as explained below). One should also, to be safe, handle all possible cases of network problems, disconnections/reconnections, malformed events.
[edit] Where Binary Log Events are Read/Parsed in the Source Code
1. Replication master's binlog dump thread:
This is the thread handling the COM_BINLOG_DUMP command sent by a slave.
It reads events from binary log files and sends them to the replication slave over the SQL connection. Code: mysql_binlog_send() function in sql/sql_repl.cc
2. Replication slave's I/O thread:
Reads events sent by the master's dump thread and stores them in relay logs (local to the slave host). Code: handle_slave_io() function in sql/slave.cc
3. Replication slave's SQL thread:
Reads events from relay logs and executes them.
Code: handle_slave_sql() function in sql/slave.cc
4. mysqlbinlog program:
Reads events from binary log (or relay log) files or sent by a mysqld server and presents them in text format as SQL statements. Code: dump_{local,remote}_log_entries() functions in client/mysqlbinlog.cc
Notes:
1a. If the current binary log file ends, mysql_binlog_send() switches to the next file.
1b. mysql_binlog_send() sends a "fake" Rotate event before sending any other events from a binlog and after each single binlog file.
1c. In case an event offset was specified with a binlog dump request, mysql_binlog_send() finds the format description event at the beginning of the binary log file and sends it first before proceeding to the events at the given offset.
[edit] How Different Binary Log Formats are Handled in the Slave
The I/O thread reads the master server version from the mysql->server_version value (where mysql is an object representing an established connection to the master). It aborts replication if the master's version is x.y.z with x <= 2.
The I/O thread reads events using cli_safe_read() function, which simply reads protocol packets (?). Thus, this is binlog-version transparent.
Before writing events to the relay log, the I/O thread can process them:
- If it considers the current binary log to be in format v1 or v3, it translates events to v4 format
- Otherwise, it copies event packets as they are, except that it recognizes and parses FDE and Rotate events. It also recognizes Stop events, but doesn't write them to the relay log.
The I/O thread decides about binary log version based on the FDE it parses. Before first FDE arrives, the binary log version is based on the master server version. For masters with version x.y.z, where x > 4, the binary log version is assumed to be v4 (even though in fact it can be higher).
Parsing FDE and Rotate events is done by constructors of corresponding objects. These constructors don't do any checks of the binary log version. Note: the Rotate_log_event constructor reads header and post-header lengths from the current event description event (and thus is possibly binlog format sensitive).
Translation of events from old formats to the current format is done by parsing event data using Log_event::read_log_event() (that is, in the constructors of event objects) and then writing the event objects to the relay log.
Note: Looks like the event object constructors don't trust the binlog version stored in the FDE passed to them.
[edit] Network Packets in Detail
The communication protocol between the master and slave is the one that all other normal connections use, as described in MySQL Client/Server Protocol. So after the COM_BINLOG_DUMP command has been sent, the communication between the master and slave is a sequence of packets, each of which contains an event. In slave.cc, function read_event(), one has an example: net_safe_read() is called; it is able to detect wrong packets. After net_safe_read(), the event is ready to be interpreted; it starts at pointer (char*) mysql->net.read_pos + 1. That is, (char*) mysql->net.read_pos + 1 is the first byte of the event's timestamp, etc.
[edit] Replication Event Format in Detail
[Moved to: The Binary Log.]
[edit] Plans
We have already made extensive changes to the above in MySQL 5.0. For an upcoming version we plan a new format where data is replicated by row instead of by statement. This makes it possible to replicate data from one MySQL Cluster to another. We also plan new functionality to handle replication connections from multiple masters into one slave.