
Tuesday, May 9, 2017

Optimizing Memcached Efficiency - Engineering at Quora - Quora


ENGINEERING AT QUORA
Optimizing Memcached Efficiency
As Quora has continued to grow, we've continued to invest in the scalability of our infrastructure. Caching is a key component of our architecture, and we use Memcached as our primary store to cache results from slower, persistent databases.
While data stores like MySQL, HBase, and Redis offer a variety of tools to monitor performance and efficiency, Memcached doesn't support the same level of detailed introspection out of the box. For example, we regularly analyze Redis backup files to find the relative size of each key, but because Memcached doesn't have any persistence, we can't take the same approach. Rather, Memcached provides only general server status and slab information through its client API, and it doesn't expose more detailed information about memory usage and expiration/eviction semantics. As a result, we were using Memcached to cache the results of a wide variety of functions on the same cluster, but we were blind to how many resources each function used, and we had little direction to optimize our caching.
To better understand what was happening with our Memcached processes under the hood, we built a tool called MCInspector, which we're excited to release as open source today. MCInspector analyzes the memory used by Memcached processes in order to provide an accurate, aggregated summary of all items in a server's memory, and it can also be used to create a filtered dump of all keys in use.
With MCInspector, we were able to identify optimizations that significantly improved both our memory usage and database performance. In this post, we'll describe how we implemented MCInspector as well as our results from using it in production.
Implementation
Memcached stores objects compactly, using chunks of fixed-size blocks called slabs. Most of the address space in a Memcached server is part of a slab, which is used to store key/value pairs and associated metadata. To analyze slab usage, MCInspector first directly copies the memory of a Memcached process, then scans that memory to find item headers, and finally collects data about them. By directly scanning the memory of a Memcached process, we're able to examine any part of Memcached's inner workings, rather than only the components exposed through the client API or console interface. In order to implement this behavior, we needed to first dive into the Memcached source code, where the structure of data stored in memory is defined.
Let's go through each of these three steps—copying, scanning, and analyzing—in more detail.
First, MCInspector copies the memory contents of a running Memcached process into MCInspector's local address space using the /proc virtual filesystem, which contains information about running processes. We find the allocated memory regions in the process by reading /proc/$pid/maps, then use the process_vm_readv system call (added in kernel version 3.2) to copy the contents of each region into local address space. Because most of the memory in Memcached servers tends to be in use, MCInspector performs copies in 64MB chunks, in order to avoid allocating too much memory and having Memcached be killed by the OS OOM-killer. Our initial implementation of MCInspector used ptrace and read from /proc/$pid/mem, but we found that the newer process_vm_readv API had significantly less overhead. Moreover, the copied data doesn't need to be consistent, so we don't have to explicitly interrupt the Memcached process while reading its memory.
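The actual implementation lives in the MCInspector repository; purely as an illustration of the approach described above, here is a minimal Python sketch (names, chunking details, and error handling are our own assumptions) that walks /proc/$pid/maps and copies readable regions in 64MB chunks via process_vm_readv:

import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

class IOVec(ctypes.Structure):
    _fields_ = [("iov_base", ctypes.c_void_p), ("iov_len", ctypes.c_size_t)]

libc.process_vm_readv.restype = ctypes.c_ssize_t
libc.process_vm_readv.argtypes = [ctypes.c_int,
                                  ctypes.POINTER(IOVec), ctypes.c_ulong,
                                  ctypes.POINTER(IOVec), ctypes.c_ulong,
                                  ctypes.c_ulong]

CHUNK = 64 * 1024 * 1024  # copy in 64MB pieces, as described above

def readable_regions(pid):
    # yield (start, end) address pairs of readable mappings from /proc/$pid/maps
    with open("/proc/%d/maps" % pid) as maps:
        for line in maps:
            addrs, perms = line.split()[:2]
            if perms.startswith("r"):
                start, end = (int(x, 16) for x in addrs.split("-"))
                yield start, end

def copy_chunk(pid, start, length):
    # copy one chunk of the target process's memory into a local buffer
    buf = ctypes.create_string_buffer(length)
    local = IOVec(ctypes.cast(buf, ctypes.c_void_p), length)
    remote = IOVec(ctypes.c_void_p(start), length)
    nread = libc.process_vm_readv(pid, ctypes.byref(local), 1,
                                  ctypes.byref(remote), 1, 0)
    if nread < 0:
        raise OSError(ctypes.get_errno(), "process_vm_readv failed")
    return buf.raw[:nread]

def dump_memory(pid):
    # walk every readable region in CHUNK-sized pieces and yield (address, bytes)
    for start, end in readable_regions(pid):
        offset = start
        while offset < end:
            length = min(CHUNK, end - offset)
            try:
                yield offset, copy_chunk(pid, offset, length)
            except OSError:
                pass  # some special mappings (e.g. [vvar]) cannot be read; skip them
            offset += length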
After copying memory, MCInspector parses that data to find item headers. From the Memcached source code, we know that a space character is always found at a specific position in each item, which we can use as a heuristic to find each header structure. In order to check that each match is indeed an item header, we validate that each value in the header falls within a reasonable range. The below figure illustrates this process in more detail:
Finally, we need to aggregate the data. At Quora, we follow the convention that the category name of each key is a prefix of the key string, ending with the colon character. So, after we find an item, we can simply increment counters for the corresponding category, using metadata like key size, value size, and relevant timestamps.
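For instance, the aggregation might look roughly like the sketch below (hypothetical field names, not MCInspector's actual code), where each recovered item contributes its key and sizes to a per-category counter:

from collections import defaultdict

def aggregate(items):
    # items: an iterable of (key, key_size, value_size) tuples recovered from item headers;
    # the category is the key prefix up to the first ':' (the convention described above)
    stats = defaultdict(lambda: {"count": 0, "key_bytes": 0, "value_bytes": 0})
    for key, key_size, value_size in items:
        category = key.split(":", 1)[0]
        bucket = stats[category]
        bucket["count"] += 1
        bucket["key_bytes"] += key_size
        bucket["value_bytes"] += value_size
    return stats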

Results

On an EC2 r3.xlarge instance, the required time to perform a complete analysis on a Memcached process with 130 million objects is about 85 seconds, including 7 seconds to copy memory chunks between processes. MCInspector uses 100% of one CPU core, but since we run Memcached on multi-core virtualized hardware, the performance impact to both the running Memcached process and the client is negligible; neither end-to-end query latency nor throughput noticeably regresses while running the analysis. Though memory is being changed by the Memcached server while this analysis is running—which means the memory snapshot seen by MCInspector is inconsistent—by comparing MCInspector's results with the curr_items field in the Memcached STATS output, we found that more than 99.9% of all records are detected (even with high concurrent read/write traffic on the running Memcached process).
With MCInspector deployed to our production servers, we were able to make a number of performance optimizations that we wouldn't have been aware of otherwise.
  1. MCInspector reports give us an easy way to identify hotspots and inefficient storage. Using MCInspector, we audited the categories that had the largest number of items as well as those with the most memory used. Then, for those categories, we shortened keys and improved serialization methods in order to reduce total memory usage. Along the same line, we also identified several functions that had a large number of result objects but a low hit rate, and so we refactored their caching mechanisms.
  2. We found that a large number of objects remained in Memcached even though their expiry time had already passed. Because Memcached uses lazy expiration, expired items may not be evicted even if memory is actually needed, which meant that we were wasting a lot of memory space. To reclaim that space, we created a recurring task that attempts to get expired items from Memcached, which causes them to be purged from memory immediately; a minimal sketch of this idea appears after this list. (We've included this binary in the Github repository as well.)
  3. At first, space freed by our purge task wasn't being used efficiently. By default, once Memcached assigns a memory page to a slab, it can't be moved to another slab, which causes issues in cases where categories have very low hit rates or much shorter lifetimes than other categories. While a quick solution to rebalance slabs is to restart the Memcached process so memory pages are assigned to slabs according to the traffic pattern at that time, newer versions of Memcached support slab automove. After enabling this feature, Memcached automatically balances memory allocation among slabs after the server's start time, which improves storage efficiency.
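The purge binary shipped in the repository implements the idea from item 2 above; as a minimal sketch of the same idea (assuming a key dump of (key, expiry) pairs and the pymemcache client, neither of which is part of the original post), it could look like this:

import time
from pymemcache.client.base import Client  # assumed client library for this example

def purge_expired(key_dump, server=("localhost", 11211)):
    # issue a get() for every key whose expiry has passed; because Memcached expires
    # items lazily, the resulting miss frees the item's memory immediately
    client = Client(server)
    now = time.time()
    purged = 0
    for key, expiry in key_dump:
        if 0 < expiry < now:  # an expiry of 0 means the item never expires
            client.get(key)
            purged += 1
    return purged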
With these optimizations, the cache ages of some of our critical keys have tripled, while the total pool size has remained unchanged. Furthermore, load on our MySQL cluster has decreased by over 30%, which provides significantly higher query capacity without any additional cost, making Quora more resilient to traffic spikes. Finally, MCInspector has become an essential debugging tool for application-level caching issues as well as a key input when scaling up our tier.

Future Work
You can download MCInspector from our Github repository. MCInspector is designed with a plugin architecture in mind, so it should be straightforward to add your own new features that use cache objects' metadata. We plan to keep adding our own new plugins as well!
If you're interested in learning more about our Infrastructure team, which works on scalability problems like this one, check out our careers page!

Text classification using natural language processing through python NLTK and Redis - TECHEPI


Text classification using natural language processing through python NLTK and Redis

What is natural language processing?

Natural language processing (NLP) is an approach that enables a computer program to identify and process speech and text the way humans do. NLP is based on artificial intelligence (AI) and is used to analyze, understand, and then generate text or speech. In other words, NLP enables machines to understand human language and extract meaning from it.
NLP can automatically learn all kinds of rules for analyzing a set of text or speech.
“One of the most compelling ways NLP offers valuable intelligence is by tracking sentiment — the tone of a written message (tweet, Facebook update, etc.) — and tagging that text as positive, negative or neutral,” Rehling said.
Besides Facebook, Google, Twitter, and IBM, there are many startups providing business solutions using NLP:
  • Recorded Future (cyber security)
  • Quid (strategy)
  • Narrative Science (journalism)
  • Wit.ai (intent classification, acquired by FB)
  • x.ai (scheduling)
  • Kensho (finance)
  • Predata (open intelligence)
  • Lattice (sales and marketing)
  • AlchemyAPI (NLP APIs, acquired by IBM)
  • Basis (NLP APIs)
NLP business applications today include the following:
  • Machine translation
  • Text classification
  • Text summarization
  • Chat Bot
  • Sentence segmentation
  • Customer service
  • Reputation monitoring
  • Ad placement
  • Market intelligence
  • Regulatory compliance
Stanford NLP (Java) and NLTK (Python) are two major open source libraries for implementing natural language processing; here I am explaining NLTK.

Install the NLTK library and dependencies using pip

pip install -U nltk 
pip install -U numpy

Install the Redis Python client

pip install redis

Identifying whether a tweet is positive or negative using text classification with NLP

Here are some positive tweets for training:
I love this car.
This view is amazing.
I feel great this morning.
I am so excited about the concert.
He is my best friend.
Here are some negative tweets for training:
I do not like this car.
This view is horrible.
I feel tired this morning.
I am not looking forward to the concert.
He is my enemy.
We want to test whether each of the tweets below is positive or negative:
I like this amazing car. (positive)
My house is not great. (negative)

How does Naive Bayes classification work internally?

Naive Bayes classification is based on two formulas.

The first gives the smoothed probability of each word for a class:

P(xk | + or -) = (nk + a) / (nj + a * |Vocabulary|)

where a = 1 (Laplace smoothing), P(xk | + or -) is the probability of word xk in the positive or negative class, nj is the total number of words in the positive or negative training tweets, nk is the number of times word k occurs in the positive or negative case, and |Vocabulary| is the number of distinct words.

The second picks the class with the larger value:

Vnb = argmax over vj of P(vj) * product over all words k of P(xk | vj)

Vnb is the Naive Bayes value.
P(vj) is the probability of the positive or negative class:
P(vj) = Total number of positive or negative tweets / Total tweets
Let's understand how it works.
First, create the list of distinct words (the vocabulary) from the first two positive tweets and the first negative tweet above:
<“i love this car view is amazing do not like”>
Next, convert each tweet into a feature set, where 1 means the tweet contains the word and 0 means it does not:

tweet  i  love  this  car  view  is  amazing  do  not  like  class
1      1  1     1     1    0     0   0        0   0    0     +
2      0  0     1     0    1     1   1        0   0    0     +
3      1  0     1     1    0     0   0        1   1    1     -
calculate P(+) = 2/3 = .666666667
calculate P(-) = 1/3 = .333333333
P(i|+) = (1+1)/(8+10) = .111111111
P(love|+) = (1+1)/(8+10) = .111111111
P(this|+) = (1+1)/(8+10) = .111111111
P(car|+) = (1+1)/(8+10) = .111111111
P(view|+) = (1+1)/(8+10) = .111111111
P(is|+) = (1+1)/(8+10) = .111111111
P(amazing|+) = (1+1)/(8+10) = .111111111
P(do|+) = (0+1)/(8+10) = .055555556
P(not|+) = (0+1)/(8+10) = .055555556
P(like|+) = (0+1)/(8+10) = .055555556
P(i|-) = (1+1)/(6+10) = .125
P(love|-) = (0+1)/(6+10) = .0625
P(this|-) = (1+1)/(6+10) = .125
P(car|-) = (1+1)/(6+10) = .125
P(view|-) = (0+1)/(6+10) = .0625
P(is|-) = (0+1)/(6+10) = .0625
P(amazing|-) = (0+1)/(6+10) = .0625
P(do|-) = (1+1)/(6+10) = .125
P(not|-) = (1+1)/(6+10) = .125
P(like|-) = (1+1)/(6+10) = .125
Now let's test whether “I like this amazing car” is positive or negative.
Vj for +ive = P(+) * P(i|+) * P(like|+) * P(this|+) * P(amazing|+) * P(car|+)
= .666666667 * .111111111 * .055555556 * .111111111 * .111111111 * .111111111
= 0.000005645
Vj for -ive = P(-) * P(i|-) * P(like|-) * P(this|-) * P(amazing|-) * P(car|-)
= .333333333 * .125 * .125 * .125 * .0625 * .125
= 0.000005086
The value is greater for positive, so the tweet is classified as positive.
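As a quick sanity check of the hand calculation above (this snippet is not part of the tutorial code; the counts are taken from the table and calculation above):

vocab_size = 10
n_pos_words, n_neg_words = 8, 6        # nj for the + and - training tweets
p_pos, p_neg = 2 / 3.0, 1 / 3.0        # P(vj) for each class

def p_word(count, n_j):
    # Laplace smoothing with a = 1
    return (count + 1.0) / (n_j + vocab_size)

test_words = ['i', 'like', 'this', 'amazing', 'car']
pos_counts = {'i': 1, 'like': 0, 'this': 1, 'amazing': 1, 'car': 1}
neg_counts = {'i': 1, 'like': 1, 'this': 1, 'amazing': 0, 'car': 1}

v_pos, v_neg = p_pos, p_neg
for w in test_words:
    v_pos *= p_word(pos_counts[w], n_pos_words)
    v_neg *= p_word(neg_counts[w], n_neg_words)

print(v_pos, v_neg)  # roughly 5.6e-06 vs 5.1e-06, so the tweet is classified as positive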

Steps to classify text using NLTK and Redis



Step 1 – Read tweets from files and convert them into lists

Read all positive and negative tweets from both files into lists using the read_file function, then label each tweet as positive or negative and shuffle them together.
import random

def read_file(file_list):
    a_list = []
    for a_file in file_list:
        f = open(a_file, 'r')
        a_list.append(f.read())
        f.close()
    return a_list

# read each file and split it into one tweet per line
for x in read_file(['positive_tweets']):
    positive = [content for content in x.splitlines()]
for x in read_file(['negative_tweets']):
    negative = [content for content in x.splitlines()]

# label the tweets and shuffle them together
all_contents = [(content, 'positive') for content in positive]
all_contents += [(content, 'negative') for content in negative]
random.shuffle(all_contents)

Step 2 – Feature extractor

The feature extractor splits each sentence into words and builds a feature set for it: a dictionary indicating, for each word, whether (or how often) the document contains that word.
from collections import Counter
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

def word_extractor(sentence):
    # tokenize the sentence and lemmatize each lowercased word
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word.lower()) for word in word_tokenize(sentence)]

def feature_extractor(text, setting):
    if setting == 'bow':
        # bag of words: map each word to its count
        return {word: count for word, count in Counter(word_extractor(text)).items()}
    else:
        # presence features: map each word to True
        return {word: True for word in word_extractor(text)}

Step 3 – Storing all feature extractor data into redis

Why store training data in Redis?
For a small training set, Step 1 takes very little time, but as the training data grows it takes longer and longer. So if you want to classify tweets in real time with 100,000 positive and negative training tweets, that is not practical, because training on all 100,000 tweets takes roughly 10-30 minutes.
The solution is to store the extracted training data in a file or in Redis.
All contents are run through the feature extractor and then stored in Redis in chunks of 10,000, as a list of (features, label) tuples.
import redis

content_count = 0
all_features = []
key_count = 1

r = redis.StrictRedis(host='localhost', port=6379, db=0)

for (content, label) in all_contents:
    content_count += 1
    all_features.append((feature_extractor(content, ''), label))
    if content_count == 10000:
        # store each chunk as its string representation; Step 4 reads it back with eval()
        r.set('train_tweets_' + str(key_count), str(all_features))
        print(str(key_count) + " created successfully!")
        all_features = []
        content_count = 0
        key_count += 1

# store the last (possibly partial) chunk
r.set('train_tweets_' + str(key_count), str(all_features))


Step 4 – Read data from redis and train

Sample output of stored redis data.
"[({u'i': True, u'feel': True, u'morning': True, u'this': True, u'tired': True}, 'negative'), ({u'do': True, u'like': True, u'i': True, u'car': True, u'this': True, u'not': True}, 'negative'), ({u'this': True, u'is': True, u'horrible': True, u'view': True}, 'negative'), ({u'this': True, u'is': True, u'amazing': True, u'view': True}, 'positive'), ({u'enemy': True, u'is': True, u'my': True, u'he': True}, 'negative'), ({u'concert': True, u'i': True, u'am': True, u'forward': True, u'looking': True, u'to': True, u'not': True, u'the': True}, 'negative'), ({u'is': True, u'my': True, u'friend': True, u'best': True, u'he': True}, 'positive'), ({u'i': True, u'this': True, u'love': True, u'car': True}, 'positive'), ({u'i': True, u'feel': True, u'great': True, u'this': True, u'morning': True}, 'positive'), ({u'about': True, u'concert': True, u'i': True, u'am': True, u'so': True, u'the': True, u'excited': True}, 'positive')]"
Read all the training data back from Redis and train a classifier using NLTK's Naive Bayes classifier.
from nltk import NaiveBayesClassifier

def train(train_set):
    classifier = NaiveBayesClassifier.train(train_set)
    return classifier

all_features = []
get_keys = [1, 2, 3, 4, 5]
for key in get_keys:
    # each Redis value is the string representation of a list of (features, label) tuples
    all_features += eval(r.get('train_tweets_' + str(key)))

classifier = train(all_features)

Step 5 – Classify tweet

Now test whether a tweet is positive or negative using the evaluate function, which returns the classifier's accuracy on the test set. If the accuracy is greater than 0.5, the tweet below is treated as positive; otherwise it is treated as negative.
from nltk import classify

def evaluate(train_set, test_set, classifier):
    return classify.accuracy(classifier, test_set)

test_tweet = 'I feel happy this morning'
test_contents = [(test_tweet, 'positive')]
test_set = [(feature_extractor(content, 'bow'), label) for (content, label) in test_contents]
accuracy = evaluate(all_features, test_set, classifier)
print(accuracy)
The accuracy for this tweet is greater than 0.5, so the test tweet above is marked as positive.

Wrapping Up

Natural language processing is easy to get started with using the NLTK library. NLTK provides a lot of functionality for implementing NLP, and together with scikit-learn you can also plug in more machine learning algorithms for better accuracy.

Thursday, May 4, 2017

RDS vs Redshift vs DynamoDB vs SimpleDB: Ultimate Comparison


AWS Database Services Complete Overview: RDS vs Redshift vs DynamoDB vs SimpleDB

Cloud database platforms are a worthy representative of cloud services. No patching, hardware troubles or other maintenance efforts, easy integration, scaling that follows natural growth and customers' demands, high availability and security – all these keep the cloud database market growing. Today we are going to look at the database platforms that Amazon Web Services (AWS) offers and find out how to apply them in business and production.

Amazon RDS

Amazon Relational Database Service (RDS) is a good solution for those who want to run a common database engine without dealing with administration and maintenance. AWS positions RDS as a fully functional alternative to common on-premises databases. It is fast, scalable and can be replicated among Availability Zones for greater accessibility.
The following database engines are available:
  • Amazon Aurora
  • MySQL v.5.1, 5.5, 5.6 and 5.7 (Community Edition) with InnoDB as the default database storage engine
  • MariaDB v.10.0
  • Oracle Database v.11gR2, 12c
  • SQL Server 2008 R2, SQL Server 2012 (SP2), SQL Server 2014
  • PostgreSQL 9.3, 9.4
You may have noticed that only the InnoDB storage engine is fully supported for MySQL, and Amazon highly recommends using it. Nevertheless, it has a 2 TB limit per table. Since MySQL has proved itself to be good with big data arrays, being tied to InnoDB may be a challenge.
So, what about other storage engines? There are no strict restrictions on using them, but there are caveats: MyISAM, for instance, breaks the Point-In-Time Restore and Snapshot Restore features of RDS. When you run other engines, you have to stop, lock and flush your tables manually before taking a snapshot, or the active content can be damaged.
There are also some inner software limits for each database instance you may want to know:
Database           Limit
Amazon Aurora      No limit
MySQL              No limit
MariaDB            No limit
Oracle Database    One database per instance; no limit on the number of schemas per database
SQL Server         30 databases per instance
PostgreSQL         No limit

Computing Resources

Another key question to analyse is the computing resources required to run a database. AWS provides automatic horizontal scalability by deploying additional instances, or manual vertical scaling by changing the instance type. Nevertheless, the starting point is instance capacity. Amazon RDS has the following offers:
Standard – Latest Generation instance family, which includes virtual machines (VM) equipped with:
  • 2 to 40 vCPU (virtual Central Processing Units).
  • 8 to 180 GB RAM.
  • Provisioned Input/output Operations Per Second (PIOPS) support.
  • Network performance from Moderate to 10 Gigabit.
Standard – Previous Generation family has VMs equipped with:
  • 1 to 8 vCPU.
  • 3.75 to 30 GB RAM.
  • PIOPS support by top-tier db.m3.xlarge and db.m3.2xlarge.
  • Network performance from Moderate to High, top-tier.
Memory Optimized – Current Generation family provides VMs equipped with:
  • 2 to 32 vCPU.
  • 15 to 244 GB RAM.
  • PIOPS is not supported by the weakest db.r3.large or the strongest db.r3.8xlarge.
  • Network performance from Moderate to 10 Gigabit.
Micro instances family provides inexpensive VMs equipped with:
  • 1-2 vCPU.
  • 1 to 8 GB RAM.
  • No PIOPS support.
  • Network performance from Low to Moderate.
Intrigued by how fast Low, Moderate, and High bandwidths are? There is no official benchmark, though there are some unofficial estimates: "Low" is anywhere from 50 Mbit to 300 Mbit, "Moderate" is 300-900 Mbit, and "High" is 0.9-2.2 Gbit. The exact figure strongly depends on the selected region and the routing between the Amazon data center and the end user.
The instance type to choose also depends on the database engine you want to use. For example, Aurora DB can be deployed only on Memory Optimized VMs, while SQL Server Enterprise Edition isn’t available on Micro instances. So, it should be checked whether your desired hardware meets software requirements.
Note: Amazon RDS limits the number of simultaneously running instances to 40.

Storage Facilities

Amazon RDS provides three types of attached storage for databases and logs, based on various storage technologies, which differ in performance characteristics and price. All storage types are powered by Amazon Elastic Block Store (EBS) technology, which stripes across multiple Amazon EBS volumes to enhance IOPS performance.
  • Magnetic (or Standard) storage is based on HDD and suitable for the use of a database with low input/output requirements and burst possibilities (for example, latency-tolerable workloads, large data blocks processing and data warehousing). Size limits fall between 5 GB and 3 TB and are determined by the database engine. Their performance is around 100-200 IOPS, and the ceiling is 500.
Note: Magnetic storage can’t be reserved for a single instance, so the final capacity also depends on other users.
  • General Purpose (SSD) storage is designed for basic workloads and databases that need to be quick but not too big. SSD storage has minimal latency, and its baseline performance is around three IOPS per gigabyte, which can burst up to 3,000 IOPS for extended periods, with 10,000 IOPS as the upper limit. There are also the following size restrictions:
  1. MySQL, MariaDB, PostgreSQL, Oracle DB support volumes from 5 GB to 6TB.
  2. SQL Server supports 20 GB to 4 TB volumes.
Here is another handy note for you: while the I/O block size of General Purpose volumes is 16 KB, the I/O block size of a Magnetic disk is 1 MiB. This makes a clear distinction between performance-oriented use and big data processing, creating a need to use multiple volumes or databases for complex needs.
  • Provisioned IOPS (PIOPS) storage is based on virtualized volumes, which can provide a stable capacity of 10,000-20,000 IOPS. This is the best choice for intensive database workloads and interactive applications attached to database engines. PIOPS has the following limits:
  1. MySQL, MariaDB, PostgreSQL, Oracle DB can vary in size between 100 GB and 6TB.
  2. SQL Server Express and Web Editions vary between 100 GB and 4 TB.
  3. SQL Server Standard and Enterprise Edition varies between 200 GB and 4 TB.
The most appealing feature of PIOPS is that the number of IOPS is dedicated and configured when the volume is created. This capacity is guaranteed by Amazon with ±10% fluctuation 99.9% of the time over a year, which makes it possible to rely on a cloud database under heavy workloads.
The storage block size of PIOPS storage is 32 KiB, which slightly exceeds that of a General Purpose volume.
Note: the maximum capacity of all storage types is 100 TB. To process bigger data you should use another AWS database platform.
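As an illustrative sketch (not part of the original article), these storage choices map directly onto parameters of the RDS API; for example, provisioning a MySQL instance on PIOPS storage with boto3 might look roughly like this, where the identifier, credentials, and sizes are placeholder values:

import boto3

rds = boto3.client('rds', region_name='us-east-1')

rds.create_db_instance(
    DBInstanceIdentifier='example-db',       # placeholder name
    Engine='mysql',
    DBInstanceClass='db.m3.xlarge',
    MasterUsername='admin',
    MasterUserPassword='change-me-please',   # placeholder credentials
    AllocatedStorage=1000,                   # GB, within the engine's size limits above
    StorageType='io1',                       # Provisioned IOPS (PIOPS) storage
    Iops=10000,                              # the dedicated IOPS figure discussed above
    MultiAZ=True,                            # replicate to a standby in another Availability Zone
)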

Accessibility and Manageability

Among the most heavily promoted RDS features is Multi-AZ (Availability Zone) deployment. The feature replicates the database, with all its settings, to a standby VM instance in a different Availability Zone. The main instance and the Multi-AZ instances do not share hardware or a network and belong to different infrastructure objects.
Failures and disasters can hardly affect two data centers at the same time, so Multi-AZ deployment makes databases highly durable. In case of trouble, AWS performs an automatic failover, and the standby VM starts with the same network settings and endpoint, allowing applications and users to work with the database as if nothing had happened.
Multi-AZ instances cost more than Single-AZ ones, but they have a number of extra advantages:
  • It is fail-resistant, so your database is always available for users.
  • Amazon RDS SLA (Service Level Agreement) Terms cover only Multi-AZ instances. If a Single-AZ database is down, there is no credit or compensation.
  • Maintenance Window downtime cannot influence Multi-AZ instances.
A Maintenance Window is an obligatory downtime period for service tasks. When it’s used for scaling a database instance or software patching, the virtual machine will be offline while maintenance works are in progress. It is automatically scheduled for requested changes within an instance, security and durability patches, and lasts 30 minutes by default. Such actions are commonly required every few months.

Usage

Here are the most typical applications of Amazon RDS:
  • You already have a database with a familiar engine, and it needs to be hosted offsite.
  • You need a platform for an application that requires the database to be fast, durable, scalable, or all of these.
  • You have an uneven workload that requires a highly scalable database in order to avoid unnecessary expenses.
  • Data needs to be processed quickly without storing too much onsite.

Amazon Redshift

Amazon Redshift is a tool designed to work with data volumes of up to dozens of petabytes. Powered by PostgreSQL, it can be used by most kinds of SQL applications with minimal changes. The target feature of the service is creating a data warehouse, where a user can focus on data management without keeping up an effortful and complex infrastructure. From a technical point of view, Redshift is a cluster database without such consistency features as foreign keys and uniqueness constraints on field values.
The cluster includes a number of nodes with virtual databases powered by Amazon Elastic Compute Cloud (EC2) instances. Those nodes are the basic database units that you can use for your tasks.

Computing and Storage

The cluster architecture of Redshift is based on two main roles – a leading node and a computing one:
  • A leading node is connected to the outside network; it receives a user request, compiles executable code for the computing nodes, and forwards tasks to them.
  • Computing nodes perform the user's requests and send responses, which are gathered by the leading node and sent back to the user.
  • If there is just one node in a cluster, it plays leading and computing roles, however, the minimum number of nodes in big clusters is two.
  • Moreover, each computing node is subdivided into slices, conventional computing units that get tasks from a leading node and take part in queries.
So, the first thing to choose with Redshift is node instances. They are subdivided into the following two tiers:
  • Dense Storage (DS) nodes are designed for large data workflow, equipped with an HDD for higher capacity at a lower price and available in two variations.
Node name     vCPU   RAM      Storage   Slices   Nodes per cluster
ds1.xlarge    2      15 GB    2 TB      2        1-32
ds2.xlarge    4      31 GB    2 TB      2        1-32
ds1.8xlarge   16     120 GB   16 TB     16       2-128
ds2.8xlarge   36     244 GB   16 TB     16       2-128
  • Dense Compute (DC) nodes are used for tasks with intensive performance and extremely low latency. They use an SSD as basic storage. Also, these nodes are much faster than DS nodes, that’s why they are considered to be the best choice for the role of a leading node. DC nodes are available in two variations:
Node name     vCPU   RAM      Storage   Slices   Nodes per cluster
dc1.xlarge    2      15 GB    160 GB    2        32
dc1.8xlarge   32     244 GB   2.56 TB   32       2-128
How do you complete a cluster with nodes? The first criterion to consider is the data volume and its growth rate. If you have 32 TB of data and this amount remains almost unchanged, 2 ds1.8xlarge nodes will perfectly fit your demands. If the amount of data increases in small portions, it is better to choose 16 ds1.xlarge nodes, with the possibility of horizontal scaling in 2 TB increments. Like with RDS, you also get storage for backups, whose size is the same as the size of the main storage, thus facilitating maintenance.
The second criterion is the required performance. It can be easily increased by scaling the database horizontally, namely, adding DC nodes to the cluster. With Redshift technology, computing nodes mirror their disks to another one making data processing persistent. You can create a data warehouse of any capacity and complexity combining different cluster builds and node types.
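As a hypothetical sketch of the sizing example above (not from the original article), creating such a cluster through boto3 could look like this, with placeholder identifier and credentials:

import boto3

redshift = boto3.client('redshift', region_name='us-east-1')

# two ds1.8xlarge nodes give roughly 32 TB of storage, matching the example above
redshift.create_cluster(
    ClusterIdentifier='example-warehouse',   # placeholder name
    NodeType='ds1.8xlarge',
    ClusterType='multi-node',
    NumberOfNodes=2,
    DBName='analytics',                      # placeholder database name
    MasterUsername='admin',
    MasterUserPassword='Change-me-1',        # placeholder credentials
)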

Accessibility and manageability

While Redshift’s special appeal is its large scale, there are also some limits:
  • Number of active nodes: 200.
  • Parameter, Security, Subnet groups: 20.
  • Subnets within a Subnet group: 20.
  • Tables (including temporary ones) per cluster: 9,900.
  • Databases per cluster: 60.
  • Concurrent user connection to a cluster: 500.
These are not all technical and structural limits of Redshift, but still they are the most important ones. You can learn more on AWS Limits page and Redshift Limits page.

Maintenance

Like in RDS, the entire infrastructure is maintained and patched by AWS, and a user doesn't have root access. While the data warehouse architecture is really complicated, and it would be effortful and expensive to replicate Redshift using EC2 instances or any other cloud platform, there is one consequent pitfall – the Maintenance Window.
It is much the same as in RDS: scheduled manually or automatically, it takes place once per week, and the exact time can be adjusted. Unlike RDS, in Redshift you have to manage the database downtime manually.
Finally, Redshift supports AWS auto balancing, autoscaling, monitoring and networking features, as well as SQL commands and an API, so it is easy to deploy and control.

Usage

The most common use cases of Amazon Redshift are as follows:
  • Data warehousing – the name speaks for itself
  • Big corporate or scientific data processing, with loads related to big amounts of data and large computing loads
  • Analytical databases for businesses required to store, analyze and transfer big data within a short time
  • Customer activity monitoring for analysis and statistics

DynamoDB

DynamoDB is a NoSQL database service by AWS designed for fast processing of small data items that grow and change dynamically. Its main non-relational feature is the loose structure of a table – it consists of items (compared to rows in a traditional table) and attributes (an analogue of columns). Translated into relational terms, it resembles a table with a different number of columns in each row. The database's mutability and fast I/O rate are powered by SSDs used as the basic (and only) storage hardware.

Features and Limits

With DynamoDB there are no hardware instances on which capacity and billing depend. The main value is the read/write throughput used by the database. There is no limit on storage resources – they grow as the database grows, with no replication of instances or any other typical cloud scaling. The Multi-AZ feature, which requires an additional fee with RDS, comes out of the box here: your data is automatically replicated among 3 Availability Zones (AZ) within the selected region. The total absence of administration activities, the automatic data replication, and the throughput-based scaling model make DynamoDB extremely durable.
Meanwhile, DynamoDB doesn't support complex functions such as advanced querying and transactions. Since data is partitioned for durability, it takes some time to re-write it in each replica after a successful write operation on the main one. The balance between read and write capacity is called Read Consistency, and it can be adjusted in the following ways (see the sketch after this list):
  • Eventually Consistent Reads option gives a priority to a read operation, which forwards data even if it is already modified but hasn’t been yet replicated to a local AZ. This option bursts the reading performance, but read requests shall be performed again to get up-to-date data.
  • Strongly Consistent Reads option is targeted at getting the latest data. It takes more time but it returns the result, which reflects all successful writes made before read initialization.
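As an illustrative sketch (not from the original article), the choice between the two modes is a single flag on a read request; with boto3, for example, it might look like this, where the table and key are placeholders:

import boto3

dynamodb = boto3.client('dynamodb', region_name='us-east-1')

# eventually consistent read (the default): faster and cheaper, may return stale data
eventual = dynamodb.get_item(
    TableName='example-table',            # placeholder table name
    Key={'user_id': {'S': '12345'}},      # placeholder key
)

# strongly consistent read: reflects all writes acknowledged before the read started
strong = dynamodb.get_item(
    TableName='example-table',
    Key={'user_id': {'S': '12345'}},
    ConsistentRead=True,
)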
Read consistency is not the only unique peculiarity of DynamoDB. Some of its main features are listed below:
  • Maximum R&W throughput – 10,000 R&W units per table, 20,000 R&W units per account.
    Note: the maximum R&W throughput for the US East region is 40,000 and 80,000 R&W units respectively.
  • Maximum item size (item key + all attributes) – 400 KB.
  • Maximum table size – unlimited.
  • Tables per account: 256.
  • Supported data: Number, String, Binary, Boolean, collection data (Number Set, String Set, Binary Set) heterogeneous List and heterogeneous Map (NULL values).
  • String data encoding: UTF-8.
More limits can be found on Amazon DynamoDB Limits Page and in its FAQ section. There are also additional features like:
  • Streams – time-ordered sequences of item changes.
  • Triggers – Integration with AWS Lambda to execute a custom function if certain item changes are detected.
  • Integration – an effortless interaction between DynamoDB and Redshift, Data Pipeline, Elastic MapReduce, Hadoop, etc.
  • Compatibility – supports all AWS networking, monitoring and management services.

Usage

In the upshot, the best practices with DynamoDB are as follows:
  • Data blocks systematization and processing.
  • Advertising services: collection of customer data, making trend charts, etc.
  • Messaging and blogging: building message selections, the list of blog entries by author, etc.
  • Gaming: high-scores, world changes, player status and statistics, etc.
  • Any other case where you have to process data rather than store it, and the data needs to be highly available rather than transactional.

SimpleDB

Amazon SimpleDB is another NoSQL database platform, which technically resembles DynamoDB. It has a similar non-relational item/attribute structure, replicates across a few regions for durability, and provides read consistency options to adjust the appropriate access mode. Nevertheless, SimpleDB should be treated as a database core, which supports only basic non-relational indexing, querying and storage functions. The main distinctive features of the database platform are as follows:
  • The basic structural unit is a domain, which is referred to as a table in a relational database. Domains are multiplied in order to increase performance.
  • The domain size limit is 10 GB, and capacity is scaled up by deploying additional domains, which together form the database medium.
  • Maximum query execution time is 5 seconds.
SimpleDB differs from DynamoDB in capacity too. Let's compare them to clear things up:
  • Write capacity (per table): DynamoDB – 10,000-40,000 units; SimpleDB – 25 writes/sec
  • Performance scaling method: DynamoDB – presettable throughput; SimpleDB – horizontal (no bursts available)
  • Attributes per table: DynamoDB – unlimited; SimpleDB – 1 billion
  • Attributes per item: DynamoDB – unlimited; SimpleDB – 256
  • Items per table (with maximum size): DynamoDB – unlimited; SimpleDB – 3,906,250
  • Tables per account: DynamoDB – 256; SimpleDB – 250
  • Maximum item size: DynamoDB – 400 KB; SimpleDB – 1 KB
  • Data types supported: DynamoDB – Number, String, Binary, Boolean, NULL values, collection data; SimpleDB – String
  • Encoding of string data: DynamoDB – UTF-8; SimpleDB – UTF-8
Thus, the main billing metrics are hours of service running and data storage capacity. As a NoSQL database, SimpleDB doesn't support complex transactions, but it can still run conditional PUT/DELETE operations. Domains are easily accessed via web interfaces, managed via the API or the Management Console, and can be integrated with any AWS product.

Usage

Lightweight and easily managed, SimpleDB doesn't stand out against other database platforms in performance, computing capacity or storage facilities. Nevertheless, it's beneficial to use it as an auxiliary service for other AWS products or as a simple database for non-complex needs. Common SimpleDB usage scenarios are as follows:
  • Logging facility
  • Gaming database for scores, player items, client settings, etc
  • Indexing object metadata like rating, format or geolocation

Conclusion

The choice of a database platform always depends on computing resources and flexibility – an external index, a data warehouse and a business activity tracker require different storage capacities, database engines and performance rates. The depth of administration is also significant: if you want to adjust everything within the database, it would be better to deploy one of the preconfigured database images on EC2, with all software installed and root access available. To facilitate your decision and brush up on the features of every platform, we created a little chart below:
  • Database engine: Amazon RDS – Amazon Aurora, MySQL, MariaDB, Oracle Database, SQL Server, PostgreSQL; Amazon Redshift – Redshift (adapted PostgreSQL); Amazon DynamoDB – NoSQL; Amazon SimpleDB – NoSQL (with limited capacity)
  • Computing resources: Amazon RDS – instances with up to 32 vCPU and 244 GB RAM; Amazon Redshift – nodes with up to 36 vCPU and 244 GB RAM; Amazon DynamoDB – not specified, software as a service; Amazon SimpleDB – not specified, software as a service
  • Data storage facilities (max): Amazon RDS – 6 TB per instance, 20,000 IOPS; Amazon Redshift – 16 TB per instance; Amazon DynamoDB – unlimited storage size, 40,000 read/write units per table; Amazon SimpleDB – 10 GB per domain, 25 writes/sec
  • Maintenance Windows: Amazon RDS – 30 minutes per week; Amazon Redshift – 30 minutes per week; Amazon DynamoDB – no effect; Amazon SimpleDB – no effect
  • Multi-AZ replication: Amazon RDS – as an additional service; Amazon Redshift – manual; Amazon DynamoDB – built-in; Amazon SimpleDB – built-in
  • Tables (per basic structural unit): Amazon RDS – defined by the database engine; Amazon Redshift – 9,900; Amazon DynamoDB – 256; Amazon SimpleDB – 250
  • Main usage feature: Amazon RDS – conventional database; Amazon Redshift – data warehouse; Amazon DynamoDB – database for dynamically modified data; Amazon SimpleDB – simple database for small records or auxiliary roles