Remember the other day when I blogged about MySQL being broken with binary replication?
I was wrong. Binary replication might actually be functional (I still haven’t tested), but the problem was harder to diagnose than I originally thought.
Here’s what was happening.
The default slave_net_timeout value is 3600 seconds. Our network was congested (heavy traffic, among other variables), which caused the MySQL slave to block on reads from the master. With a one-hour timeout the slave never treated the stalled connection as broken, so as far as it was concerned it was zero seconds behind.
A temporary fix is to set slave_net_timeout to a more realistic value (5 seconds).
This points to a couple more bugs in MySQL that should be fixed:
The default value of slave_net_timeout should NOT be 3600 seconds. This is insane. Let’s select a more realistic value please.
Seconds_Behind_Master should account for the time since the last successful read from the master. If reads are stalling or timing out, Seconds_Behind_Master should reflect that instead of reporting zero.
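A rough sketch of that second suggestion, in Python. This is not how MySQL computes anything: `effective_lag` and `seconds_since_last_master_read` are hypothetical names made up for illustration, and taking the larger of the two values is only an assumption about one sensible way to combine them. On an idle master the result is a worst-case bound rather than real lag.

```python
from typing import Optional


def effective_lag(
    seconds_behind_master: Optional[int],
    seconds_since_last_master_read: float,
) -> Optional[float]:
    """Hypothetical lag figure that does not trust a silent connection.

    seconds_behind_master: the value reported by SHOW SLAVE STATUS
        (None when MySQL reports NULL, i.e. replication is not running).
    seconds_since_last_master_read: how long ago the slave last received
        any data from the master (an assumed input; not a real status field).
    """
    if seconds_behind_master is None:
        # Replication is stopped; there is no meaningful lag number.
        return None
    # If nothing has arrived for a while, the slave could be at least
    # that far behind, no matter what it currently reports.
    return max(seconds_behind_master, seconds_since_last_master_read)
```

With a slave that reports zero seconds behind but has heard nothing for ten minutes, `effective_lag(0, 600.0)` gives 600.0 rather than 0.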
Tip of the hat to Barry @ WordPress for connecting the dots for me.

September 26, 2007 at 9:46 pm
This is why our replication check scripts over at FastMail.FM don’t use those values for anything at all. Instead, we use a ping table:
$ echo "desc ping.PingData" | sql
Field         Type       Null  Key  Default            Extra
ServerId      int(11)    NO    PRI
ExternalTime  int(11)    YES        NULL
InternalTime  timestamp  NO         CURRENT_TIMESTAMP
Each server runs a job every N seconds that inserts or updates its own row with the latest system clock time in ExternalTime (we use Perl, but anything that can get a Unix timestamp is fine), while MySQL fills in InternalTime with CURRENT_TIMESTAMP.
It then checks that the other server’s row is recent enough as well. We get emailed if the lag ever exceeds 30 seconds, and paged if it exceeds 5 minutes.
This is master-master replication (though in the general case one end gets all the updates). An offsite replica also watches these rows (without updating the ping table, obviously) and shuts down its local copy of postfix (backup MX) if the ping table values show it is more than 20 minutes behind.
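The decision logic described above can be sketched roughly as follows. This is an illustration in Python, not the actual FastMail.FM scripts (which are in Perl): the function and constant names are hypothetical, the thresholds are the ones quoted in the comment, and reading the peer's ExternalTime from the ping table is left out entirely.

```python
import time

# Thresholds taken from the description above.
EMAIL_AFTER_SECONDS = 30
PAGE_AFTER_SECONDS = 5 * 60
STOP_BACKUP_MX_AFTER_SECONDS = 20 * 60


def ping_age(peer_external_time, now=None):
    """Seconds since the peer last refreshed its row.

    peer_external_time: Unix timestamp read from the peer's ExternalTime.
    now: current Unix time; defaults to the local clock.
    """
    if now is None:
        now = time.time()
    # Clamp at zero so small clock skew never yields a negative age.
    return max(0.0, now - peer_external_time)


def alert_level(age_seconds):
    """Map a ping age to 'ok', 'email', or 'page'."""
    if age_seconds > PAGE_AFTER_SECONDS:
        return "page"
    if age_seconds > EMAIL_AFTER_SECONDS:
        return "email"
    return "ok"


def should_stop_backup_mx(age_seconds):
    """True when an offsite replica is too stale to keep accepting mail."""
    return age_seconds > STOP_BACKUP_MX_AFTER_SECONDS
```

Because the age comes from a row that must travel through replication, a stalled connection shows up as a growing age even while the slave itself reports zero seconds behind.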