In order to be able to monitor replication effectively, you need to
know and understand the various states that the master and slave can
occupy. Server states can be displayed by using the SHOW PROCESSLIST statement on the
master and the slave. At least one line of the results will be related to
the replication activities for the user account associated with
replication. Following the examples of this chapter, the account is
replicant on the master and system
user on the slave. In the Command column, on
the master the value will be Binlog Dump, meaning a
binary log thread; on the slave the value will be
Connect. The results will also contain a field called
State, in which the state of the thread will be given.
Here is an example from a slave:
SHOW PROCESSLIST \G *************************** 1. row *************************** Id: 16 User: system user Host: db: NULL Command: Connect Time: 119255 State: Waiting for master to send event Info: NULL
These results show only one thread, the I/O thread waiting on the master. If the server were processing entries from the master’s binary log, there would probably be another row shown in the results for the SQL thread. What follows is a list of all of the possible server states that you may see on master and slave servers, along with descriptions of each. In addition to understanding these traits, you may want to develop a script to check that replication is running on the slave and not stalled and to notify you if it’s not running. Replication on MySQL is very stable, but if it does stop, it’s very quiet about it. Fortunately, it’s very good about rapidly catching up once you restart it.
Here is an overview of master server replication states that
can be reported for binary log threads (Binlog
Dump):
This is the most common status message you should see for a slave connection on the master. In this state, the master is basically doing nothing regarding replication. It has sent the slave all entries requested and is now waiting for another event to occur that will cause its binary log to be updated. Notice that it says it is waiting for the binary log to be updated. It doesn’t say it’s waiting for the databases to be updated. That’s handled by a different component of MySQL. The thread lives only to provide information about the binary log to the slave.
After the binary log has been updated, the master informs
the slave that one or more new entries have been made. If the
slave requests the entries, the master enters this state,
indicating that it is in the process of sending a slave entries or
information on pertinent database events. There are obviously
other states in between, but they are so fast and short-lived that
they are not registered and therefore will not show up in the
results of SHOW PROCESSLIST.
If a slave has been offline for a while, the master may have flushed its logs in the interim. Whenever the master does this, it will start a new log file, saving the previous ones. When a slave requests log entries that span more than one log file as the master switches from one file to the next, it enters this state.
Once the master has completed the process of updating a slave, the master shows this status as it’s closing the binary log file and winding down the communication with the slave. When it is finished, the master will return to the first thread state (Has sent all binlog to slave; waiting for binlog to be updated) in which it is waiting for more changes to the binary log.
Here is a list of replication states that can be found on the slave server for I/O threads:
This state indicates that the slave I/O thread is attempting to connect to the master. If it can’t connect, it may stay in this state for a while as it retries.
After the slave connects to the master, it compares versions of MySQL with the master to ensure compatibility. This is very quick.
After the slave connects to the master, it registers itself
with the master as a replication slave server. During this
process, it will be in this state. On the master side of the
connection, the Binlog Dump state will be
Has sent all binlog to slave; waiting for binlog to be
updated, as described previously.
When the slave has been informed of changes to the master binary log, it enters this state to request the new entries. Also, when it first connects to a server—either for the first time or after having been disconnected for a while—it enters this state briefly to request all entries since the last master binary log position that it gives the master. If no changes have occurred, none are returned. If there are new entries, the entries starting from the position given until the end of the master’s binary log will be transmitted to the slave. On the master side, you will see the state Sending binlog event to slave as a result of the request.
If the request for new entries mentioned in the previous
state fails to be received from the master, the slave enters this
state as it waits to be able to connect to the master
periodically. This timeout period is configured using the
--master-connect-retry and defaults to 60
seconds. The number of retries it will make can be found in the
master.info file shown earlier in this
chapter. Each time the slave attempts to reconnect, it will enter
the next state.
If the slave failed to stay connected to the master while
trying to retrieve entries to the master’s binary log (as
mentioned in the previous state description), this state indicates
that the slave is trying to reconnect. If it fails again, it will
go back to the previous state and wait to retry. By default, it
will try 60 times before stopping. You can change the number of
retries with the --master-connect-retry
option.
This state is the most common that you will see, unless your
server is very busy. The SQL thread is currently connected to the
master and is waiting for the master to send it binary log
updates. If there is no activity after a while, the connection
will time out. The number of seconds that will elapse before
timeout is reached can be found in the variable
slave_net_timeout (previously
slave_read_timeout). A timeout is the same as a
lost connection for the slave. Therefore, it will become active
and attempt to reconnect to the master, then inquire about any
changes to the master’s binary log, before entering this state
again.
This state occurs when the slave I/O thread has received changes to the master’s binary log from the master and is writing the SQL statements and the related information to the slave’s relay log. Once it’s done, the slave’s SQL thread will read the relay log and execute the new SQL statements written to the log. On the SQL thread, this is the Reading event from the relay log state described in the next section.
If the connection to the slave failed while reading an event
(represented by an entry in the master’s binary log), the slave
will wait in this state for a certain amount of time before
attempting to reconnect to the master. The number of seconds that
the slave will wait before retrying is found in the
master-connect-retry variable on the slave.
When the slave attempts to reconnect, it enters the next
state.
This state occurs after the slave I/O thread loses its connection to the master while receiving an entry from the master binary log.
If the SQL thread isn’t processing the entries in the relay
log fast enough, and the backlog has caused the relay log files to
become too large, the I/O thread will enter this state. In this
state, it’s waiting for the SQL thread to process enough of the
entries in the relay log so that the I/O thread can delete some of
the older content of the log. The maximum amount of space
allocated for the relay log files is found in the
relay_log_space_limit variable. The slave SQL
thread automatically deletes relay log files. The FLUSH
LOGS statement, though, causes the slave to rotate log
files and to consider deleting old files.
When the I/O thread has been terminated, it enters this state as it closes. The term mutex stands for mutual exclusion. The SQL thread gets the mutex to prevent any other slave replication activities so that replication can be shut down without loss of data or file corruption.
Here is a list of replication states that can be found on the slave server for SQL threads:
Because replication is so fast, you will usually see the slave’s SQL thread in this state unless you have a very busy database system with data constantly being updated. This state indicates that the slave’s SQL thread has read all of the entries in its relay log and has executed all of the SQL statements that it contains. It has no further updates to make to its databases and is waiting for the slave’s I/O thread to add more entries to the relay log file. As mentioned in the similar state for the master, each thread acts somewhat independently and focuses only on the activities of its purview. Messages related to each thread’s state reflect this.
When an entry has been made to the relay log by the slave’s I/O thread, the slave’s SQL thread enters this state. In this state it is reading the current relay log file and is executing the new SQL statements that it contains. Basically, the SQL thread is busy updating the slave’s databases.
When the SQL thread has finished updating the slave’s
databases,it enters this state while it’s closing the relay log
file and terminating communications with the slave server. The SQL
thread gets the mutex to prevent any other slave replication
activities so that replication can be shut down without loss of
data or file corruption. This is a very minimal state. However, if
there is a problem with closing the relay log file or ending the
activities of the slave server, this state is displayed so that
you know the thread is locked. This could be caused by a table or
log file being corrupted. If you see this state, you may want to
run myisamchk or a similar utility, or the
REPAIR TABLE statement on the tables that
accessed at the time of the lockup. You’ll have to look in the
relay log file and the error log file on the slave to determine
which tables might need checking.