Text classification using natural language processing with Python NLTK and Redis
What is natural language processing?
Natural language processing (NLP) is an approach to making a computer program process speech and text the way a human does. It is a branch of artificial intelligence (AI) that analyzes, understands, and then generates text or speech. In other words, NLP enables machines to understand human language and extract meaning from it.
NLP systems can automatically learn rules for analyzing a body of text or speech.
“One of the most compelling ways NLP offers valuable intelligence is by tracking sentiment — the tone of a written message (tweet, Facebook update, etc.) — and tag that text as positive, negative or neutral,” Rehling said
Besides Facebook, Google, Twitter, and IBM, there are many startups providing business solutions using NLP:
- Recorded Future (cyber security)
- Quid (strategy)
- Narrative Science (journalism)
- Wit.ai (intent classification, acquired by FB)
- x.ai (scheduling)
- Kensho (finance)
- Predata (open intelligence)
- Lattice (sales and marketing)
- AlchemyAPI (NLP APIs, acquired by IBM)
- Basis (NLP APIs)
NLP business applications today include the following:
- Machine translation
- Text classification
- Text summarization
- Chat Bot
- Sentence segmentation
- Customer service
- Reputation monitoring
- Ad placement
- Market intelligence
- Regulatory compliance
Stanford NLP (Java) and NLTK (Python) are two major open source libraries for natural language processing; here I am explaining NLTK.
Install the NLTK library and its dependencies using pip (for example, pip install nltk)
Install Redis
Solution: identify whether a tweet is positive or negative using text classification through NLP
Here are some positive tweets for training:
I love this car.
This view is amazing.
I feel great this morning.
I am so excited about the concert.
He is my best friend.
Here are some negative tweets for training:
I do not like this car.
This view is horrible.
I feel tired this morning.
I am not looking forward to the concert.
He is my enemy.
I want to test whether the tweets below are positive or negative:
I like this amazing car. (expected: positive)
My house is not great. (expected: negative)
How does Naive Bayes classification work internally?
Naive Bayes classification uses two formulas.
In these formulas, a = 1 is the smoothing constant, P(xk|+ or -) is the probability of word k given the positive or negative class, nj is the total number of words in the + or - tweets, and nk is the number of times word k occurs in the + or - tweets.
Vnb is the class chosen by Naive Bayes, that is, the class with the highest value.
P(Vj) is the prior probability of the positive or negative class.
P(Vj) = number of positive (or negative) tweets / total number of tweets
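Written out with the symbols defined above, the two formulas would look roughly like this. This is a reconstruction from the definitions and from the worked numbers further down, where the denominator adds the number of unique words (10) to nj; the name |Vocabulary| for that count is my own label, not something taken from the original formulas.

```latex
% Smoothed probability of word x_k given class v_j (a = 1)
P(x_k \mid v_j) = \frac{n_k + a}{n_j + |\mathrm{Vocabulary}|}

% Class chosen by Naive Bayes
V_{NB} = \arg\max_{v_j \in \{+,\,-\}} \; P(v_j) \prod_{k} P(x_k \mid v_j)
```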
Let’s understand how it works.
Create a list of the unique words from the first two positive tweets and the first negative tweet above.
<“i love this car view is amazing do not like”>
Convert each tweet into a feature set.
tweet | i | love | this | car | view | is | amazing | do | not | like | class
---|---|---|---|---|---|---|---|---|---|---|---
1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | +
2 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | +
3 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | -
calculate P(+) = 2/3 = .666666667
calculate P(-) = 1/3 = .333333333
P(i|+) = (1+1)/(8+10) = .111111111
P(love|+) = (1+1)/(8+10) = .111111111
P(this|+) = (2+1)/(8+10) = .166666667 (“this” occurs in both positive tweets, so nk = 2)
P(car|+) = (1+1)/(8+10) = .111111111
P(view|+) = (1+1)/(8+10) = .111111111
P(is|+) = (1+1)/(8+10) = .111111111
P(amazing|+) = (1+1)/(8+10) = .111111111
P(do|+) = (0+1)/(8+10) = .055555556
P(not|+) = (0+1)/(8+10) = .055555556
P(like|+) = (0+1)/(8+10) = .055555556
P(i|-) = (1+1)/(6+10) = .125
P(love|-) = (0+1)/(6+10) = .0625
P(this|-) = (1+1)/(6+10) = .125
P(car|-) = (1+1)/(6+10) = .125
P(view|-) = (0+1)/(6+10) = .0625
P(is|-) = (0+1)/(6+10) = .0625
P(amazing|-) = (0+1)/(6+10) = .0625
P(do|-) = (1+1)/(6+10) = .125
P(not|-) = (1+1)/(6+10) = .125
P(like|-) = (1+1)/(6+10) = .125
I want to test whether “I like this amazing car” is positive or negative.
Vj for +ive = P(+) * P(i|+) * P(like|+) * P(this|+) * P(amazing|+) * P(car|+)
= .666666667 * .111111111 * .055555556 * .166666667 * .111111111 * .111111111
= 0.000008468
Vj for -ive = P(-) * P(i|-) * P(like|-) * P(this|-) * P(amazing|-) * P(car|-)
= .333333333 * .125 * .125 * .125 * .0625 * .125
= 0.000005086
The value is greater for the positive class, so the tweet is classified as positive.
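The hand calculation can be checked with a few lines of plain Python. This is only a sketch of the smoothed word-count method described in this section, not NLTK's own implementation; the function names train and score are mine. It counts every occurrence of a word, so “this”, which appears in both positive tweets, gets nk = 2 for the positive class.

```python
from collections import Counter
from fractions import Fraction

# The three training tweets used in the worked example.
TRAINING = [
    ("i love this car", "+"),
    ("this view is amazing", "+"),
    ("i do not like this car", "-"),
]


def train(examples, a=1):
    """Return class priors and smoothed word probabilities per class."""
    vocabulary = {w for text, _ in examples for w in text.split()}
    label_counts = Counter(label for _, label in examples)
    priors = {lbl: Fraction(n, len(examples)) for lbl, n in label_counts.items()}
    word_counts = {lbl: Counter() for lbl in label_counts}
    for text, label in examples:
        word_counts[label].update(text.split())
    likelihood = {}
    for label, counts in word_counts.items():
        n_j = sum(counts.values())  # total words seen in this class
        likelihood[label] = {
            w: Fraction(counts[w] + a, n_j + a * len(vocabulary))
            for w in vocabulary
        }
    return priors, likelihood


def score(text, label, priors, likelihood):
    """P(label) multiplied by P(word|label) for every known word in text."""
    result = priors[label]
    for w in text.lower().split():
        if w in likelihood[label]:  # words outside the vocabulary are skipped
            result *= likelihood[label][w]
    return result


priors, likelihood = train(TRAINING)
scores = {
    label: score("i like this amazing car", label, priors, likelihood)
    for label in priors
}
predicted = max(scores, key=scores.get)
print({label: float(s) for label, s in scores.items()}, predicted)
```

Fractions are used instead of floats so the intermediate values can be compared exactly with the hand-written ones.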
Steps to classify text using NLTK and Redis
Step 1 – Read tweets from files and convert them into a list
The read_file function reads all positive and negative tweets from the two files into a list, after which each tweet is labelled as positive or negative.
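A minimal sketch of this step could look like the following. The read_file name comes from the text above, but its signature, the one-tweet-per-line file layout, the file names, and the load_tweets helper are all assumptions of mine.

```python
import tempfile
from pathlib import Path


def read_file(path, label):
    """Return a list of (tweet, label) pairs, one per non-empty line."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [(line.strip(), label) for line in lines if line.strip()]


def load_tweets(positive_path, negative_path):
    """Combine both files into one labelled list."""
    return (read_file(positive_path, "positive")
            + read_file(negative_path, "negative"))


# Demo with two throwaway files standing in for the real tweet files.
with tempfile.TemporaryDirectory() as tmp:
    pos = Path(tmp) / "positive.txt"
    neg = Path(tmp) / "negative.txt"
    pos.write_text("I love this car.\nThis view is amazing.\n", encoding="utf-8")
    neg.write_text("I do not like this car.\n\n", encoding="utf-8")
    tweets = load_tweets(pos, neg)

print(tweets)
```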
Step 2 – Feature extractor
The feature extractor splits each sentence into words and keeps its positive or negative label. For each list of words it defines a feature set that indicates whether the document contains each word or not.
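One way to write such a feature extractor is sketched below. The tokenizer is a deliberately simple assumption (lowercase, split on whitespace, strip punctuation), and the contains(word) feature names and the helper names are my own choices for illustration.

```python
import string


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(string.punctuation) for w in text.lower().split())
    return [w for w in words if w]


def build_vocabulary(labeled_tweets):
    """Sorted list of every unique word in the training tweets."""
    return sorted({w for text, _ in labeled_tweets for w in tokenize(text)})


def extract_features(text, vocabulary):
    """Map each vocabulary word to True/False: does the text contain it?"""
    words = set(tokenize(text))
    return {f"contains({w})": (w in words) for w in vocabulary}


tweets = [
    ("I love this car.", "positive"),
    ("I do not like this car.", "negative"),
]
vocabulary = build_vocabulary(tweets)
feature_sets = [(extract_features(text, vocabulary), label)
                for text, label in tweets]
print(feature_sets[0])
```

Each element of feature_sets is a (features, label) tuple, which is the shape that NLTK classifiers accept for training.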
Step 3 – Storing all feature extractor data in Redis
Why store training data in Redis?
With a small training set, the steps above take very little time, but the time grows as the training data grows. That becomes a problem if you want to classify a tweet as positive or negative in real time against 100,000 positive and negative tweets, because training on all 100,000 tweets takes approximately 10-30 minutes.
The solution is to store all training data in a file or in Redis.
All content is run through the feature extractor and then stored in Redis in chunks of 10,000, each chunk being a list of tuples.
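The chunked storage could be sketched like this. The serialization format is an assumption: here each chunk is saved as one JSON string in a Redis list, and the key name tweets:features is hypothetical. The client argument is expected to be a redis-py client (or anything with the same delete and rpush methods).

```python
import json

CHUNK_SIZE = 10_000  # chunk size mentioned in the text


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def store_feature_sets(client, key, feature_sets, chunk_size=CHUNK_SIZE):
    """Push feature sets to a Redis list, one JSON string per chunk.

    Returns the number of chunks written.
    """
    client.delete(key)  # start from an empty list
    written = 0
    for chunk in chunked(feature_sets, chunk_size):
        client.rpush(key, json.dumps(chunk))
        written += 1
    return written


# Usage, assuming the redis-py package and a Redis server on localhost:
#   import redis
#   client = redis.Redis(host="localhost", port=6379, db=0)
#   store_feature_sets(client, "tweets:features", feature_sets)
```

JSON turns each (features, label) tuple into a two-element list, so the tuples have to be rebuilt when the data is read back.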
Step 4 – Read data from Redis and train
Sample output of stored Redis data.
Read all the training feature sets from Redis and train the classifier using the Naive Bayes classifier.
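Reading the data back and training might look like the sketch below. It assumes each element of the Redis list is a JSON-encoded chunk of [features, label] pairs, which is my assumption about the storage format, and the function names are hypothetical. NaiveBayesClassifier.train is the NLTK call that builds the classifier from a list of (features, label) tuples.

```python
import json


def load_feature_sets(client, key):
    """Read every chunk from the Redis list and rebuild (features, label) tuples."""
    feature_sets = []
    for raw in client.lrange(key, 0, -1):  # 0..-1 selects the whole list
        for features, label in json.loads(raw):
            feature_sets.append((features, label))
    return feature_sets


def train_classifier(feature_sets):
    """Train an NLTK Naive Bayes classifier on the loaded feature sets."""
    # Imported here so load_feature_sets can be used without NLTK installed.
    from nltk import NaiveBayesClassifier
    return NaiveBayesClassifier.train(feature_sets)


# Usage, assuming the redis-py package and a Redis server on localhost:
#   import redis
#   client = redis.Redis(host="localhost", port=6379, db=0)
#   classifier = train_classifier(load_feature_sets(client, "tweets:features"))
```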
Step 5 – Classify tweet
Now test whether a tweet is positive or negative using the evaluate function, which returns the probability that the tweet is positive. If that probability is greater than .5 the tweet is positive; otherwise it is negative.
The probability for the tweet is greater than .5, so the test tweet above is marked as positive.
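Putting the pieces together, the classification step could be sketched as follows. To keep the example self-contained it trains directly on the ten training tweets instead of reading from Redis. The evaluate name comes from the text, but its body, the simple tokenizer, and the classify helper are my assumptions. prob_classify is the NLTK method that returns a probability distribution over the labels.

```python
import string

from nltk import NaiveBayesClassifier

POSITIVE = [
    "I love this car.",
    "This view is amazing.",
    "I feel great this morning.",
    "I am so excited about the concert.",
    "He is my best friend.",
]
NEGATIVE = [
    "I do not like this car.",
    "This view is horrible.",
    "I feel tired this morning.",
    "I am not looking forward to the concert.",
    "He is my enemy.",
]


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(string.punctuation) for w in text.lower().split())
    return [w for w in words if w]


def extract_features(text, vocabulary):
    """Map each vocabulary word to True/False: does the text contain it?"""
    words = set(tokenize(text))
    return {f"contains({w})": (w in words) for w in vocabulary}


labeled = ([(t, "positive") for t in POSITIVE]
           + [(t, "negative") for t in NEGATIVE])
vocabulary = sorted({w for text, _ in labeled for w in tokenize(text)})
classifier = NaiveBayesClassifier.train(
    [(extract_features(text, vocabulary), label) for text, label in labeled]
)


def evaluate(text):
    """Return the probability that the tweet is positive."""
    dist = classifier.prob_classify(extract_features(text, vocabulary))
    return dist.prob("positive")


def classify(text, threshold=0.5):
    """Label the tweet positive when its probability exceeds the threshold."""
    return "positive" if evaluate(text) > threshold else "negative"


for tweet in ("I like this amazing car.", "My house is not great."):
    print(tweet, round(evaluate(tweet), 3), classify(tweet))
```

With only ten training tweets the probabilities are close to the threshold for unseen sentences, so a larger training set is needed before the labels can be trusted.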
Wrapping Up
Natural language processing is easy to implement with the NLTK library, which provides a lot of NLP functionality. Through its scikit-learn integration you can also use more machine learning algorithms for better accuracy.