MemcacheDB: Evolutionary Step For Code, Revolutionary Step For Performance
You can't update 40,000 follower accounts at once. Fortunately the queuing system we talked about earlier takes care of that.
The second problem is the huge number of writes that happen. Digg has a write problem. If the average user has 100 followers that’s 300 million diggs day. That's 3,000 writes per second, 7GB of storage per day, and 5TB of data spread across 50 to 60 servers.
With such a heavy write load MySQL wasn’t going to work for Digg. That’s where MemcacheDB comes in. In Initial tests on a laptop MemcacheDB was able to handle 15,000 writes a second. MemcacheDB's own benchmark shows it capable of 23,000 writes/second and 64,000 reads/second. At those write rates it's easy to see why Joe was so excited about MemcacheDB's ability to handle their digg deluge.
What is MemcacheDB? It's a distributed key-value storage system designed for persistent. It is NOT a cache solution, but a persistent storage engine for fast and reliable key-value based object storage and retrieval. It conforms to memcache protocol(not completed, see below), so any memcached client can have connectivity with it. MemcacheDB uses Berkeley DB as a storing backend, so lots of features including transaction and replication are supported.
Before you get too excited keep in mind this is a key-value store. You read and write records by a single key. There aren't multiple indexes and there's no SQL. That's why it can be so fast.
Digg uses MemcacheDB to scale out the huge number of writes that happen when data is denormalized. Remember it's a key-value store. The value is usually a complete application level object merged together from a possibly large number of normalized tables. Denormalizing introduces redundancies because you are keeping copies of data in multiple records instead of just one copy in a nicely normalized table. So denormalization means a lot more writes as data must be copied to all the records that contain a copy. To keep up they needed a database capable of handling their write load. MemcacheDB has the performance, especially when you layer memcached's normal partitioning scheme on top.
I asked Joe why he didn't turn to one of the in-memory data grid solutions? Some of the reasons were:
So it's an evolutionary step for code and a revolutionary step for performance. Digg is looking at using MemcacheDB across the board.