Wow, it's been a busy week. I was totally swamped for several days dealing with the remember.yahoo.com MySQL servers and related stuff. And then I used a day or two to recover (sleep, shower, etc).
Anyway, I made some interesting discoveries along the way. The most surprising one had to do with thread caching on Linux when you have a busy MySQL server--busy in a particular way, mind you.
You see, we had a single master server which all the web servers could connect to (using PHP) whenever someone made a change. That includes creating a tile (there were several hundred thousand tiles created), approves a tile, marks one as "cool", and so on. All told, the master was quite busy.
Because there were between 20 and 45 front-end web servers during that time, and each could have had up to 70 apache processes that might have needed to connect, we faced a problem. That meant that the master needed to handle up to 3,150 connections in the worst case (that's 45 x 70). Most of the PHP code used mysql_pconnect() to hold persistent connections.
Rather than worry about how to do that, I made sure that the wait_timeout was set to a very low value: 15 seconds. That means MySQL would close any connection that was idle for more than 15 seconds. But I didn't realize the extent of the problem until I started getting reports from the web servers that the master was refusing connections. Why? Because I had set the maximum number of connections to a reasonable value in the master's my.cnf file:
set-variable = max_connections=180 set-variable = max_user_connections=140
And at that time, the wait_timeout was set to 600 seconds (10 minutes). Clearly that was a problem. There were a lot of idle clients holding connections and blocking out new clients from connecting and getting real work done.
What to do?
We could have stopped using mysql_pconnect(), but as you'll see, that wouldn't have solved the underlying problem.
I needed to adjust the settings. But I wasn't sure what values to use. And I really didn't want to keep stopping and starting the master. That would just suck. Then I remembered that we were running MySQL 4.0.4. I'd has a new feature that allows you to change most of the server settings on on the fly without restarting! Read about here, it in the on-line manual.
All I needed to do was execute a few variations on this command:
SET GLOBAL wait_timeout=60;
(with different values in the place of "60") to try and strike a balance between letting new clients in and kicking out already connected users too quickly.
Ultimately, I settled on a timeout of 15 seconds.
But that had an interesting and unanticipated side-effect. It meant that the Linux server was having to create new threads (MySQL is a multi-threaded server) at a very high rate. That sucks up a measurable amount of CPU time.
How much CPU time? By the time I got around to looking at the output of SHOW STATUS and seeing this:
| Threads_cached | 0 | | Threads_created | 270194 | | Threads_connected | 46 | | Threads_running | 28 |
Things were pretty bad. The machine had very little idle CPU time--probably 5-10% at the most. But it really wasn't doing that much work--maybe 40 queries per second. I was a bit puzzled. But that Threads_created number jumped out at me. It was high and increasing rapidly.
Luckily I remembered the thread_cache setting. So I decided to investigate (using the new syntax for examining server variables):
mysql> SELECT @@global.thread_cache_size; +---------------------+ | @@thread_cache_size | +---------------------+ | 0 | +---------------------+ 1 row in set (0.00 sec)
Uh oh. I never set the thread cache in my.cnf, so it has assumed the default. That's bad. It's like removing the pre-forking capabilities of Apache 1.3 and letting it get pounded on a busy web site. The "fork a new process for each new request" gets pretty expensive pretty quickly.
Luckily the thread cache is also tunable on the fly now. So all I had to do was this:
SET GLOBAL thread_cache_size=40;
I took a guess and figured that by caching 40 threads, we'd be saving a lot of work. And boy was I right!
In the other window, where I was running vmstat 1 I noticed a dramatic change. The idle CPU on the machine immediately went from 5-10% to 35-40%
If only I had thought of that sooner!
So the moral of the story is this: If you have a busy server that's getting a lot of quick connections, set your thread cache high enough that the Threads_created value in SHOW STATUS stops increasing. Your CPU will thank you.
I don't feel bad though. We were all going nuts to try and tune/optimize the code and servers while it was running and had very little sleep. Thread caching really wasn't the worst of our problems. But it became the worst after we had fixed all the bigger ones.
It was quite a learning experience.