Unlocking the Power of Sentiment Analysis with Python
Introduction
Sentiment analysis is a branch of natural language processing (NLP) that aims to extract and quantify the emotional content of a text. Sentiment analysis can be used for various applications, such as analyzing customer reviews, social media posts, product feedback, and more.
In this blog post, we will introduce the basics of sentiment analysis with Python, using the popular NLTK library. We will also show how to train a classifier on a sample dataset of movie reviews and try it out on a few new sentences.
What is sentiment analysis?
Sentiment analysis, also known as opinion mining, is the process of identifying and extracting subjective information from a text, such as the polarity (positive, negative, or neutral), the emotion (joy, anger, sadness, etc.), the intensity (strong or weak), and the aspect (the specific feature or topic that the sentiment is about).
Sentiment analysis can be done at different levels of granularity, such as:
- Document-level: The overall sentiment of a whole document or text.
- Sentence-level: The sentiment of each individual sentence within a text.
- Aspect-level: The sentiment of a specific aspect or feature within a text.
Sentiment analysis can also be done with different approaches, such as:
- Rule-based: The sentiment is determined by a set of predefined rules or heuristics, such as the presence of certain keywords or phrases, the word order, the punctuation, etc.
- Machine learning-based: The sentiment is learned from a labelled dataset of texts and their corresponding sentiments, using various machine learning algorithms and techniques, such as classification, regression, deep learning, etc.
- Hybrid: The sentiment is determined by a combination of rule-based and machine learning-based methods.
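To make the rule-based approach concrete, here is a minimal sketch. The tiny lexicon, the negation list, and the function name rule_based_sentiment are all made up for illustration; real rule-based tools ship far larger lexicons and many more rules.

```python
import re

# Hypothetical toy lexicon: word -> polarity score (not taken from any real library)
LEXICON = {"good": 1, "great": 2, "nice": 1, "bad": -1, "awful": -2, "boring": -1}
# Hypothetical negation words that flip the sentiment word right after them
NEGATIONS = {"not", "never", "no"}


def rule_based_sentiment(text):
    """Label text as positive, negative, or neutral using keyword rules."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            negate = True
            continue
        if token in LEXICON:
            value = LEXICON[token]
            score += -value if negate else value
        # A negation only affects the word right after it
        negate = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(rule_based_sentiment("The plot was great"))
print(rule_based_sentiment("The acting was not good"))
print(rule_based_sentiment("I watched it on Friday"))
```

Even this small example shows the main trade-off of rules: they are easy to read and need no training data, but every new word or pattern has to be added by hand.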
Implementing Sentiment Analysis on Movie Reviews
Let's import the required libraries and download movie_reviews data:
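import nltk
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('movie_reviews')
nltk.download('stopwords')  # needed by stopwords.words("english")
nltk.download('punkt')  # needed by word_tokenize; newer NLTK releases may ask for 'punkt_tab' instead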
# Build the feature dictionary NLTK expects: drop stopwords and map each remaining word to True
stop_words = set(stopwords.words("english"))  # build the set once instead of once per word

def create_word_features(words):
    useful_words = [word for word in words if word not in stop_words]
    my_dict = dict([(word, True) for word in useful_words])
    return my_dict
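To see the shape of the feature dictionary without downloading any NLTK data, the sketch below repeats the same logic with a small hard-coded stopword set. DEMO_STOPWORDS and demo_word_features are stand-ins invented for this example; the function above uses NLTK's full English stopword list.

```python
# Stand-in for stopwords.words("english"); an assumption made for this demo only
DEMO_STOPWORDS = {"the", "is", "a", "of", "and"}


def demo_word_features(words):
    """Map every non-stopword token to True (a bag-of-words presence feature)."""
    return {word: True for word in words if word not in DEMO_STOPWORDS}


features = demo_word_features(["the", "movie", "is", "a", "movie", "of", "surprises"])
print(features)
# {'movie': True, 'surprises': True}
```

Note that the repeated word "movie" appears only once in the result: the features record whether a word is present, not how often it occurs.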
Next, let's create an empty list called neg_reviews and loop over all the files in the neg folder. For each file, we get all of its words, convert them with create_word_features into the format NLTK expects, and append the result together with the label "negative". We then do the same for the pos folder.
neg_reviews = []
for fileid in movie_reviews.fileids('neg'):
    words = movie_reviews.words(fileid)
    neg_reviews.append((create_word_features(words), "negative"))

pos_reviews = []
for fileid in movie_reviews.fileids('pos'):
    words = movie_reviews.words(fileid)
    pos_reviews.append((create_word_features(words), "positive"))

print(len(pos_reviews))
print(len(pos_reviews))
output:
1000
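Train/test split: we use the first 750 reviews of each class for training and the remaining 250 of each class for testing.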
train_set = neg_reviews[:750] + pos_reviews[:750]
test_set = neg_reviews[750:] + pos_reviews[750:]
print(len(train_set), len(test_set))
output:
1500 500
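The slices above take the first 750 files of each class in corpus order. If you would rather randomize which reviews land in the test set, one option is to shuffle each class with a fixed seed first. This sketch uses placeholder lists in place of the real neg_reviews and pos_reviews, and the helper name split_by_class is hypothetical.

```python
import random


def split_by_class(neg, pos, train_fraction=0.75, seed=42):
    """Shuffle each class separately, then split so both sets stay balanced."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for reviews in (neg, pos):
        shuffled = list(reviews)  # copy so the caller's list is left untouched
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test


# Placeholder data standing in for the (features, label) pairs built earlier
demo_neg = [({"word%d" % i: True}, "negative") for i in range(1000)]
demo_pos = [({"word%d" % i: True}, "positive") for i in range(1000)]

demo_train, demo_test = split_by_class(demo_neg, demo_pos)
print(len(demo_train), len(demo_test))
# 1500 500
```

Splitting each class separately keeps the same 750/250 balance per label as the original slices.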
Train the model and measure its accuracy on the test set:
classifier = NaiveBayesClassifier.train(train_set)
accuracy = nltk.classify.util.accuracy(classifier, test_set)
print(accuracy * 100)
output:
72.3999
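Accuracy is a single number, so it cannot tell you whether the classifier struggles more with one class than the other. A sketch of per-class precision and recall follows. The labels in gold and predicted are invented for illustration and are not results from the model above; with the real classifier you would build predicted by calling classifier.classify(features) on each test example.

```python
def precision_recall(gold, predicted, target):
    """Return (precision, recall) for one target label."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == target and p == target)
    predicted_pos = sum(1 for p in predicted if p == target)
    actual_pos = sum(1 for g in gold if g == target)
    # Guard against division by zero when a label is never predicted or never present
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


# Made-up labels for illustration only
gold = ["positive", "positive", "positive", "negative", "negative", "negative"]
predicted = ["positive", "positive", "negative", "negative", "negative", "positive"]

for label in ("positive", "negative"):
    precision, recall = precision_recall(gold, predicted, label)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```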
Try the classifier on a few new reviews:
reviews_list = [
    "It's a nice movie",
    "the Movie is unpleasant",
    "In the movie, the hero of movies do stupid things",
    "OK, the movie is well",
    "OK nice",
    "it's a disaster",
]
for review in reviews_list:
    words = word_tokenize(review)
    words = create_word_features(words)
    print(f"{review} ({classifier.classify(words)})")
output:
It's a nice movie (positive)
the Movie is unpleasant (negative)
In the movie, the hero of movies do stupid things (negative)
OK, the movie is well (positive)
OK nice (positive)
it's a disaster (negative)
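One detail to watch here: word_tokenize keeps the original casing, while the movie_reviews text is, as far as I know, all lowercase. If that holds, a token such as "Movie" will not match the "movie" feature seen during training, and a capitalized stopword such as "It" slips past the stopword filter. A small sketch of lowercasing tokens before building features is shown below; the helper name lowercase_tokens is hypothetical, and the token list is written by hand so the example runs without NLTK data.

```python
def lowercase_tokens(tokens):
    """Lowercase every token so it can match lowercase training features."""
    return [token.lower() for token in tokens]


# Hand-written tokens standing in for the output of word_tokenize
tokens = ["the", "Movie", "is", "unpleasant"]
print(lowercase_tokens(tokens))
# ['the', 'movie', 'is', 'unpleasant']
```

In the loop above, this would mean passing lowercase_tokens(word_tokenize(review)) to create_word_features.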
Conclusion
Sentiment analysis is useful for gauging the public mood about things like movies, politicians, stocks, or even current events. In this blog post, we analysed the sentiment of the movie reviews corpus using NLTK's NaiveBayesClassifier.
Still, the model's accuracy could likely be improved by preprocessing the data more carefully, or by trying lexicon-based tools such as TextBlob.