16.3.2 Handling an Unexpected Halt of a Replica
In order for replication to be resilient to unexpected halts of the server (sometimes described as crash-safe) it must be possible for the replica to recover its state before halting. This section describes the impact of an unexpected halt of a replica during replication and how to configure a replica for the best chance of recovery to continue replication.
After an unexpected halt of a replica, upon restart the replication SQL thread must recover which transactions have been executed already. The information required for recovery is stored in the replica metadata repository. In older MySQL Server versions, this log could only be created as a file in the data directory that was updated after the transaction had been applied. This held the risk of losing synchrony with the source depending at which stage of processing a transaction the replica halted at, or even corruption of the file itself. In MySQL 5.7 you can instead use an InnoDB
table to store the relay log info log. By using this transactional storage engine the information is always recoverable upon restart. As a table, updates to the relay log info log are committed together with the transactions, meaning that the replica's progress information recorded in that log is always consistent with what has been applied to the database, even in the event of an unexpected server halt.
To configure MySQL 5.7 to store the relay log info log as an InnoDB
table, set the system variable relay_log_info_repository
to TABLE
. The server then stores information required for the recovery of the replication SQL thread in the mysql.slave_relay_log_info
table. For further information on the replication metadata repositories, see Section 16.2.4, “Relay Log and Replication Applier Metadata Repositories”.
Exactly how a replica recovers from an unexpected halt is influenced by the chosen method of replication, whether the replica is single-threaded or multithreaded, the setting of variables such as relay_log_recovery
, and whether features such as MASTER_AUTO_POSITION
are being used.
The following table shows the impact of these different factors on how a single-threaded replica recovers from an unexpected halt.
Table 16.3 Factors Influencing Single-threaded Replica Recovery
GTID |
MASTER_AUTO_POSITION |
Crash type |
Recovery guaranteed |
Relay log impact |
||
---|---|---|---|---|---|---|
OFF |
OFF |
1 |
TABLE |
Server |
Yes |
Lost |
OFF |
OFF |
1 |
Any |
OS |
No |
Lost |
OFF |
OFF |
0 |
TABLE |
Server |
Yes |
Remains |
OFF |
OFF |
0 |
TABLE |
OS |
No |
Remains |
ON |
ON |
1 |
TABLE |
Any |
Yes |
Lost |
ON |
OFF |
0 |
TABLE |
Server |
Yes |
Remains |
ON |
OFF |
0 |
Any |
OS |
No |
Remains |
As the table shows, when using a single-threaded replica the following configurations are most resilient to unexpected halts:
When using GTIDs and
MASTER_AUTO_POSITION
, setrelay_log_recovery=1
. With this configuration the setting ofrelay_log_info_repository
and other variables does not impact on recovery. Note that to guarantee recovery,sync_binlog=1
(which is the default) must also be set on the replica, so that the replica's binary log is synchronized to disk at each write. Otherwise, committed transactions might not be present in the replica's binary log.When using file position based replication, set
relay_log_recovery=1
andrelay_log_info_repository=TABLE
.NoteDuring recovery the relay log is lost.
The following table shows the impact of these different factors on how a multithreaded replica recovers from an unexpected halt.
Table 16.4 Factors Influencing Multithreaded Replica Recovery
GTID |
|
Crash type |
Recovery guaranteed |
Relay log impact |
|||
---|---|---|---|---|---|---|---|
OFF |
1 |
OFF |
1 |
TABLE |
Any |
Yes |
Lost |
OFF |
>1 |
OFF |
1 |
TABLE |
Server |
Yes |
Lost |
OFF |
>1 |
OFF |
1 |
Any |
OS |
No |
Lost |
OFF |
1 |
OFF |
0 |
TABLE |
Server |
Yes |
Remains |
OFF |
1 |
OFF |
0 |
TABLE |
OS |
No |
Remains |
ON |
Any | ON |
1 |
TABLE |
Any |
Yes |
Lost |
ON |
1 |
OFF |
0 |
TABLE |
Server |
Yes |
Remains |
ON |
1 |
OFF |
0 |
Any |
OS |
No |
Remains |
As the table shows, when using a multithreaded replica the following configurations are most resilient to unexpected halts:
When using GTIDs and
MASTER_AUTO_POSITION=ON
, setrelay_log_recovery=1
. With this configuration the setting ofrelay_log_info_repository
and other variables does not impact on recovery. From MySQL 5.7.28 a multithreaded replica automatically skips relay log recovery whenMASTER_AUTO_POSITION
is set toON
, so the setting forrelay_log_recovery
makes no difference.When using file position based replication, set
relay_log_recovery=1
,sync_relay_log=1
, andrelay_log_info_repository=TABLE
.NoteDuring recovery the relay log is lost.
It is important to note the impact of sync_relay_log=1
, which requires a write of to the relay log per transaction. Although this setting is the most resilient to an unexpected halt, with at most one unwritten transaction being lost, it also has the potential to greatly increase the load on storage. Without sync_relay_log=1
, the effect of an unexpected halt depends on how the relay log is handled by the operating system. Also note that when relay_log_recovery=0
, the next time the replica is started after an unexpected halt the relay log is processed as part of recovery. After this process completes, the relay log is deleted.
An unexpected halt of a multithreaded replica using the recommended file position based replication configuration above may result in a relay log with transaction inconsistencies (gaps in the sequence of transactions) caused by the unexpected halt. See Section 16.4.1.32, “Replication and Transaction Inconsistencies”. In MySQL 5.7.13 and later, if the relay log recovery process encounters such transaction inconsistencies they are filled and the recovery process continues automatically. In MySQL versions prior to MySQL 5.7.13, this process was not automatic and required starting the server with relay_log_recovery=0
, starting the replica with START SLAVE UNTIL SQL_AFTER_MTS_GAPS
to fix any transaction inconsistencies, and then restarting the replica with relay_log_recovery=1
.
When you are using multi-source replication and relay_log_recovery=1
, after restarting due to an unexpected halt all replication channels go through the relay log recovery process. Any inconsistencies found in the relay log due to an unexpected halt of a multithreaded replica are filled.