Mustali Kachwala's Blog

Optimizing Memcached Efficiency - Engineering at Quora - Quora

ENGINEERING AT QUORA

Optimizing Memcached Efficiency

As Quora has continued to grow, we've continued to invest in the scalability of our infrastructure. Caching is a key component of our architecture, and we use Memcached as our primary store to cache results from slower, persistent databases.

While data stores like MySQL, HBase, and Redis offer a variety of tools to monitor performance and efficiency, Memcached doesn't support the same level of detailed introspection out of the box. For example, we regularly analyze Redis backup files to find the relative size of each key, but because Memcached doesn't have any persistence, we can't take the same approach. Rather, Memcached provides only general server status and slab information through its client API, and it doesn't expose more detailed information about memory usage and expiration/eviction semantics. As a result, we were using Memcached to cache the results of a wide variety of functions on the same cluster, but we were blind to how many resources each function used, and we had little direction to optimize our caching.

To better understand what was happening with our Memcached processes under the hood, we built a tool called MCInspector, which we're excited to release as open sourcetoday. MCInspector analyzes the memory used by Memcached processes in order to provide an accurate, aggregated summary of all items in a server's memory, and it can also be used to create a filtered dump of all keys in use.

With MCInspector, we were able to identify optimizations that significantly improved both our memory usage and database performance. In this post, we'll describe how we implemented MCInspector as well as our results from using it in production.

Implementation

Memcached stores objects compactly, using chunks of fixed-size blocks called slabs. Most of the address space in a Memcached server is part of a slab, which is used to store key/value pairs and associated metadata. To analyze slab usage, MCInspector first directly copies the memory of a Memcached process, then scans that memory to find item headers, and finally collects data about them. By directly scanning the memory of a Memcached process, we're able to examine any part of Memcached's inner workings, rather than only the components exposed through the client API or console interface. In order to implement this behavior, we needed to first dive into the Memcached source code, where the structure of data stored in memory is defined.

Let's go through each of these three steps—copying, scanning, and analyzing—in more detail.

First, MCInspector copies the memory contents of a running Memcached process into MCInspector's local address space using the /proc virtual filesystem, which contains information about running processes. We find the allocated memory regions in the process by reading /proc/$pid/maps, then use the process_vm_readv system call (added in kernel version 3.2) to copy the contents of each region into local address space. Because most of the memory in Memcached servers tends to be in use, MCInspector performs copies in 64MB chunks, in order to avoid allocating too much memory and having Memcached be killed by the OS OOM-killer. Our initial implementation of MCInspector used ptrace and read from /proc/$pid/mem, but we found that the newer process_vm_readv API had significantly less overhead. Moreover, the copied data doesn't need to be consistent, so we don't have to explicitly interrupt the Memcached process while reading its memory.

After copying memory, MCInspector parses that data to find item headers. From the Memcached source code, we know that a space character is always found at a specific position in each item, which we can use as a heuristic to find each header structure. In order to check that each match is indeed an item header, we validate that each value in the header falls within a reasonable range. The below figure illustrates this process in more detail:

Finally, we need to aggregate the data. At Quora, we follow the convention that the category name of each key is a prefix of the key string, ending with the colon character. So, after we find an item, we can simply increment counters for the corresponding category, using metadata like key size, value size, and relevant timestamps.

Results

On an EC2 r3.xlarge instance, the required time to perform a complete analysis on a Memcached process with 130 million objects is about 85 seconds, including 7 seconds to copy memory chunks between processes. MCInspector uses 100% of one CPU core, but since we run Memcached on multi-core virtualized hardware, the performance impact to both the running Memcached process and the client is negligible; neither end-to-end query latency nor throughput noticeably regresses while running the analysis. Though memory is being changed by the Memcached server while this analysis is running—which means the memory snapshot seen by MCInspector is inconsistent—by comparing MCInspector's results with the curr_items field in the Memcached STATS output, we found that more than 99.9% of all records are detected (even with high concurrent read/write traffic on the running Memcached process).

With MCInspector deployed to our production servers, we were able to make a number of performance optimizations that we wouldn't have been aware of otherwise.

MCInspector reports give us an easy way to identify hotspots and inefficient storage. Using MCInspector, we audited the categories that had the largest number of items as well as those with the most memory used. Then, for those categories, we shortened keys and improved serialization methods in order to reduce total memory usage. Along the same line, we also identified several functions that had a large number of result objects but a low hit rate, and so we refactored their caching mechanisms.
We found that a large number of objects remained in Memcached even though their expiry time had already passed. Because Memcached uses lazy expiration, expired items may not be evicted even if memory is actually needed, which meant that we were wasting a lot of memory space. To reclaim that space, we created a recurring task that attempts to get expired items from Memcached, which causes them to be purged from memory immediately. (We've included this binary in the Github repository as well.)
At first, space freed by our purge task wasn't being used efficiently. By default, once Memcached assigns a memory page to a slab, it can't be moved to another slab, which causes issues in cases where categories have very low hit rates or much shorter lifetimes than other categories. While a quick solution to rebalance slabs is to restart the Memcached process so memory pages are assigned to slabs according to the traffic pattern at that time, newer versions of Memcached support slab automove. After enabling this feature, Memcached automatically balances memory allocation among slabs after the server's start time, which improves storage efficiency.

With these optimizations, the cache ages of some of our critical keys have tripled, while our the total pool size has remained unchanged. Furthermore, load on our MySQL cluster has decreased by over 30%, which provides significantly higher query capacity without any additional cost, making Quora more resilient to traffic spikes. Finally, MCInspector has become an an essential debugging tool for application-level caching issues as well as a key input when scaling up our tier.

Future Work

You can download MCInspector from our Github repository. MCInspector is designed with a plugin architecture in mind, so it should be straightforward to add your own new features that use cache objects' metadata. We plan to keep adding our own new plugins as well!

If you're interested in learning more about our Infrastructure team, which works on scalability problems like this one, check out our careers page!

Mustali Kachwala's Blog

Search This Blog

Tuesday, May 9, 2017

Optimizing Memcached Efficiency - Engineering at Quora - Quora