In the last month I’ve found two bugs that can cause MySQL to lie about Seconds_Behind_Master (at least on 4.1.22).
Maximum Packet Size on Slaves
If max_allowed_packet is large on the master but smaller on the slave the slave won’t be able to read the data from the master.
What happens is that the slave will hit an error, return the value of NULL for Seconds_Behind_Master, try to execute the SQL again, and then return ZERO seconds behind master.
This process continues in an infinite loop until manual intervention.
Ethernet Segment Saturation
We had a problem where a switch freaked out and set port speed to 10Mbit.
Other boxes were writing to the master at full speed but downstream slaves could only read data at 10Mbit. This saturated the segment and reads were timing out on the slave.
The slave was then timing out but during this process Seconds_Behind_Master stayed at zero.
A user on my blog commented about using a heartbeat table to solve this problem.
Clients insert a timestamp representing a heartbeat into the master and continually assert that there are correct values are on the slave. If the timestamp on the slave is too far behind we KNOW that replication is broken even if the slave is confused.
The thing I like about this is that it’s extremely pragmatic and uses the exact same mechanism as our data.
I think I’m going to update lbpool to use this mechanism to detect slave lag. Is there’s any other information I should include about the clients?