Sentiment analysis has become crucial in today’s digital age, enabling businesses to glean insights from vast amounts of textual data, including customer reviews, social media comments, and news articles. Using natural language processing (NLP) techniques, sentiment analysis categorizes opinions as positive, negative, or neutral, providing valuable feedback on products, services, or brands. This analysis is powered by algorithms such as Naive Bayes, Support Vector Machines (SVM), and Recurrent Neural Networks (RNN), which help in understanding the overall sentiment and emotional tone conveyed in the text, making it a valuable tool for business intelligence and decision-making.
This article was published as a part of the Data Science Blogathon.
Sentiment analysis using NLP is a method that identifies the emotional state or sentiment expressed in a piece of text. Language serves as a mediator for human communication, and each statement carries a sentiment, which can be positive, negative, or neutral.
Suppose there is a fast-food chain company selling a variety of food items like burgers, pizza, sandwiches, and milkshakes. They have created a website where customers can order food and provide reviews.
By analyzing these reviews, the company could conclude, for example, that it needs to focus on promoting its sandwiches and improving the quality of its burgers to increase overall sales.
A problem arises, however: there will be hundreds or thousands of user reviews for these products, and beyond a certain point it becomes nearly impossible to read each review and draw a conclusion.
A sentiment analysis model is crucial for identifying patterns in user reviews, because impressions formed from the first few customers can give a skewed picture of how positive the feedback really is. By processing a large corpus of user reviews, the model provides substantial evidence, allowing for more accurate conclusions than assumptions drawn from a small sample of data.
We will explore the workings of a basic Sentiment Analysis model using NLP later in this article. Furthermore, principal sentiments like “positive” and “negative” can be broken down into more nuanced sub-sentiments such as “Happy,” “Love,” “Surprise,” “Sad,” “Fear,” and “Angry,” depending on specific business requirements.
Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that involves identifying and extracting the subjective information in an input text. This can be an opinion, an evaluation, or a feeling about a specific topic or product. The main types of sentiment analysis are fine-grained sentiment analysis, aspect-based sentiment analysis, and emotion detection.
Sentiment analysis using NLP is a challenging task because of the inherent ambiguity of human language. Sarcasm, for example, is especially difficult to detect. Consequently, the accuracy of sentiment analysis largely depends on the complexity of the task and the system’s ability to learn from a large amount of data.
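NLP-based sentiment analysis is important for several reasons: it helps businesses and organizations understand public opinion, monitor brand reputation, improve customer service, and gain insights into market trends.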
Keep in mind that the objective of sentiment analysis using NLP isn’t simply to understand opinion, but to use that understanding to accomplish specific goals. It’s a useful tool, yet like any tool, its value comes from how it’s used.
Sentiment analysis, while powerful, comes with its own set of challenges, such as detecting sarcasm and handling the inherent ambiguity of human language.
These challenges highlight the complexity of human language and communication. Overcoming them requires advanced NLP techniques, deep learning models, and a large amount of diverse and well-labelled training data. Despite these challenges, sentiment analysis continues to be a rapidly evolving field with vast potential.
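Sentiment analysis has a wide range of applications across various domains. Key applications include analyzing customer reviews and social media comments, monitoring brand reputation, improving customer service, and tracking market trends.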
Remember, these are just a few examples. The potential applications of sentiment analysis are vast and continue to grow with advancements in AI and machine learning technologies.
First, let’s import all the Python libraries that we will use throughout the program.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score,precision_score,recall_score,confusion_matrix,roc_curve,classification_report
from scikitplot.metrics import plot_confusion_matrix
We will use the dataset which is available on Kaggle for sentiment analysis using NLP, which consists of a sentence and its respective sentiment as a target variable. This dataset contains 3 separate files named train.txt, test.txt and val.txt.
Now, we will read the training data and validation data. As the data is in text format, separated by semicolons and without column names, we will create the data frame with read_csv() and parameters as “delimiter” and “names”.
df_train = pd.read_csv("train.txt",delimiter=';',names=['text','label'])
df_val = pd.read_csv("val.txt",delimiter=';',names=['text','label'])
Next, we concatenate these two data frames. Because we will use cross-validation and already have a separate test dataset, we don’t need a separate validation set. We then reset the index to avoid duplicate index values.
df = pd.concat([df_train,df_val])
df.reset_index(inplace=True,drop=True)
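To make the loading step concrete, here is a minimal sketch that reads semicolon-delimited text without a header row and then combines two frames. The two in-memory samples stand in for train.txt and val.txt; their contents are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-ins for train.txt and val.txt: one "text;label" pair per line.
train_text = "i feel great today;joy\ni am so scared of the dark;fear\n"
val_text = "i miss my old friends;sadness\n"

# There is no header row, so the column names are supplied explicitly.
df_train = pd.read_csv(io.StringIO(train_text), delimiter=";", names=["text", "label"])
df_val = pd.read_csv(io.StringIO(val_text), delimiter=";", names=["text", "label"])

# Concatenating keeps each frame's own index (0, 1, 0), so reset it afterwards.
df = pd.concat([df_train, df_val])
df.reset_index(inplace=True, drop=True)

print(df.shape)
print(df.index.tolist())
```

Without the reset, two rows would share the index value 0, which makes label-based lookups ambiguous.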
We can view a sample of the dataset using the pandas “sample” method and check the number of records and features using the “shape” attribute.
print("Shape of the DataFrame:",df.shape)
print(df.sample(5))
Now, we will check for the various target labels in our dataset using seaborn.
As we can see, there are 6 labels or targets in the dataset. We could build a multi-class classifier for sentiment analysis using NLP, but for the sake of simplicity we will merge these labels into two classes: positive and negative sentiment.
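The label distribution can also be checked as a table. As a rough sketch, pandas’ value_counts() gives the number of rows per label, and seaborn’s countplot would draw the same counts as bars. The small DataFrame below is made up for illustration.

```python
import pandas as pd

# Hypothetical sample standing in for the combined training data.
df = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e"],
    "label": ["joy", "sadness", "joy", "anger", "joy"],
})

# Number of rows per target label, most frequent first.
label_counts = df["label"].value_counts()
print(label_counts)

# With seaborn installed, the same counts can be drawn as a bar chart:
# import seaborn as sns
# sns.countplot(x="label", data=df)
```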
Now, we will create a custom encoder to convert categorical target labels to numerical form, i.e. (0 and 1)
def custom_encoder(labels):
    # positive emotions become 1 and negative emotions become 0 (modified in place)
    labels.replace({"surprise": 1, "love": 1, "joy": 1,
                    "fear": 0, "anger": 0, "sadness": 0}, inplace=True)
    return labels

df['label'] = custom_encoder(df['label'])
Now, we can see that our target has changed to 0 and 1, i.e., 0 for negative and 1 for positive, and the data is more or less balanced.
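As an alternative to calling replace, the same encoding can be sketched with a dictionary and Series.map. This is a different technique from the encoder above, shown only for comparison; the name label_to_binary is introduced here for illustration.

```python
import pandas as pd

# Illustrative mapping: positive emotions -> 1, negative emotions -> 0.
label_to_binary = {
    "surprise": 1, "love": 1, "joy": 1,
    "fear": 0, "anger": 0, "sadness": 0,
}

labels = pd.Series(["joy", "anger", "love", "sadness", "surprise", "fear"])

# map() returns a new Series; labels missing from the dictionary would become NaN.
encoded = labels.map(label_to_binary)
print(encoded.tolist())
```

Because map() does not modify the original Series, the result has to be assigned back to the column that should hold the encoded labels.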
Now, we will perform some pre-processing on the data before converting it into vectors and passing it to the machine learning model.
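We will create a function for pre-processing the data. It replaces non-alphabetic characters with spaces, lowercases the text, splits it into words, removes English stopwords, and lemmatizes the remaining words.
A lemma is the base form of a word. For example, “run”, “running” and “runs” are all forms of the same lexeme, where “run” is the lemma. Hence, we convert all occurrences of the same lexeme to their respective lemma, and then return a corpus of processed data. Note that WordNetLemmatizer treats each word as a noun unless a part-of-speech tag is passed, so with the default call used here “runs” becomes “run” while “running” is left unchanged.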
But first, we will create an object of WordNetLemmatizer and then we will perform the transformation.
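# the stopword list and WordNet data must be downloaded once
nltk.download('stopwords')
nltk.download('wordnet')

# object of WordNetLemmatizer
lm = WordNetLemmatizer()

# build the stopword set once instead of once per word
stop_words = set(stopwords.words('english'))

def text_transformation(df_col):
    corpus = []
    for item in df_col:
        # keep letters only, lowercase, and split into words
        new_item = re.sub('[^a-zA-Z]',' ',str(item))
        new_item = new_item.lower()
        new_item = new_item.split()
        # drop stopwords and lemmatize the remaining words
        new_item = [lm.lemmatize(word) for word in new_item if word not in stop_words]
        corpus.append(' '.join(str(x) for x in new_item))
    return corpus

corpus = text_transformation(df['text'])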
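The cleaning steps can be tried on a single sentence without downloading any NLTK data. In the sketch below, a tiny hand-written stopword set stands in for NLTK’s English list and lemmatization is left out, so the output only approximates what text_transformation produces. The names SAMPLE_STOPWORDS and clean_text are hypothetical.

```python
import re

# Assumption: a tiny stand-in for stopwords.words('english').
SAMPLE_STOPWORDS = {"i", "am", "so", "the", "with", "it", "s"}

def clean_text(text):
    # Replace everything except letters with spaces.
    letters_only = re.sub("[^a-zA-Z]", " ", str(text))
    # Lowercase and split on whitespace.
    words = letters_only.lower().split()
    # Drop stopwords and rejoin the remaining words.
    return " ".join(word for word in words if word not in SAMPLE_STOPWORDS)

cleaned = clean_text("I am SO happy with the new phone!!! It's great :)")
print(cleaned)
```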
Now, we will create a word cloud. It is a data visualization technique that depicts text in such a way that more frequent words appear larger than less frequent words. This gives us a little insight into how the data looks after being processed through all the steps so far.
plt.rcParams['figure.figsize'] = 20,8
# join all processed documents into one string, separated by spaces
word_cloud = " ".join(corpus)
wordcloud = WordCloud(width = 1000, height = 500,background_color ='white',min_font_size = 10).generate(word_cloud)
plt.imshow(wordcloud)
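A word cloud sizes each word by how often it occurs, so the underlying numbers can be inspected directly with collections.Counter. This is a minimal sketch on a made-up two-document corpus, not the article’s data.

```python
from collections import Counter

# Hypothetical processed corpus: each entry is one cleaned document.
corpus = ["feel happy today", "feel sad feel tired"]

# Join the documents with spaces so words from adjacent documents stay separate.
all_words = " ".join(corpus).split()
word_counts = Counter(all_words)

# The most frequent words are the ones a word cloud would draw largest.
print(word_counts.most_common(2))
```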
Now, we will use the Bag of Words (BoW) model, which represents text as a bag of words: the grammar and the order of words in a sentence are not given any importance. Instead, multiplicity (the number of times a word occurs in a document) is the main point of concern.
Basically, it describes how often each word occurs within a document.
Scikit-Learn provides a neat way of performing the bag of words technique using CountVectorizer.
Now, we will convert the text data into vectors, by fitting and transforming the corpus that we have created.
cv = CountVectorizer(ngram_range=(1,2))
traindata = cv.fit_transform(corpus)
X = traindata
y = df.label
We set ngram_range to (1,2), which means that both single words (unigrams) and pairs of consecutive words (bigrams) are used as features.
An n-gram is a sequence of ‘n’ consecutive words in a sentence. ‘ngram_range’ is a parameter that lets us give importance to combinations of words; for example, “social media” has a different meaning than “social” and “media” taken separately.
We can experiment with the value of the ngram_range parameter and select the option which gives better results.
Now comes the machine learning model creation part and in this project, I’m going to use Random Forest Classifier, and we will tune the hyperparameters using GridSearchCV.
First, we will create a dictionary, “parameters”, which will contain the values of the different hyperparameters.
We will pass this as a parameter to GridSearchCV to train our random forest classifier model using all possible combinations of these parameters to find the best model.
parameters =
Now, we will fit the data into the grid search and view the best parameter using the “best_params_” attribute of GridSearchCV.
grid_search = GridSearchCV(RandomForestClassifier(),parameters,cv=5,return_train_score=True,n_jobs=-1)
grid_search.fit(X,y)
grid_search.best_params_
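The search itself can be sketched on synthetic data. The grid below is deliberately tiny and its values are hypothetical, chosen only so the example runs quickly; it is not the grid used in this article. The data comes from scikit-learn’s make_classification rather than the review corpus, and the demo_ names are introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data standing in for the vectorized text.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical, deliberately small grid: 2 x 2 = 4 combinations.
demo_parameters = {
    "n_estimators": [10, 20],
    "max_depth": [3, None],
}

demo_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    demo_parameters,
    cv=3,  # 3-fold cross-validation for each combination
    return_train_score=True,
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(len(demo_search.cv_results_["params"]))  # number of combinations tried
```

The number of model fits grows as the product of the list lengths times the number of folds, so large grids over several hyperparameters can take a long time to run.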
And then, we can view all the models and their respective parameters, mean test score and rank as GridSearchCV stores all the results in the cv_results_ attribute.
# loop over every parameter combination that was evaluated
for i in range(len(grid_search.cv_results_['params'])):
    print('Parameters: ',grid_search.cv_results_['params'][i])
    print('Mean Test Score: ',grid_search.cv_results_['mean_test_score'][i])
    print('Rank: ',grid_search.cv_results_['rank_test_score'][i])
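Because cv_results_ is a dictionary of equal-length arrays, it can also be loaded into a pandas DataFrame and sorted by rank, which avoids hard-coding the number of combinations. The sketch below uses a small hypothetical grid on synthetic data, not the article’s search.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical grid with 3 combinations, for illustration only.
demo_search = GridSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=0),
    {"max_depth": [2, 4, None]},
    cv=3,
)
demo_search.fit(X_demo, y_demo)

# One row per parameter combination, best rank first.
results = pd.DataFrame(demo_search.cv_results_)
summary = results[["params", "mean_test_score", "rank_test_score"]]
summary = summary.sort_values("rank_test_score")
print(summary)
```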
Now, we will choose the best parameters obtained from GridSearchCV and create a final random forest classifier model and then train our new model.
rfc = RandomForestClassifier(max_features=grid_search.best_params_['max_features'],
                             max_depth=grid_search.best_params_['max_depth'],
                             n_estimators=grid_search.best_params_['n_estimators'],
                             min_samples_split=grid_search.best_params_['min_samples_split'],
                             min_samples_leaf=grid_search.best_params_['min_samples_leaf'],
                             bootstrap=grid_search.best_params_['bootstrap'])
rfc.fit(X,y)
Now, we will read the test data and perform the same transformations we did on training data and finally evaluate the model on its predictions.
test_df = pd.read_csv('test.txt',delimiter=';',names=['text','label'])

X_test,y_test = test_df.text,test_df.label
# encode the labels into two classes, 0 and 1 (custom_encoder modifies y_test in place)
custom_encoder(y_test)
# pre-processing of text
test_corpus = text_transformation(X_test)
# convert text data into vectors
testdata = cv.transform(test_corpus)
# predict the target
predictions = rfc.predict(testdata)
We will evaluate our model using various metrics such as accuracy, precision, and recall, together with a confusion matrix, and create a ROC curve to visualize how our model performed.
plt.rcParams['figure.figsize'] = 10,5
plot_confusion_matrix(y_test,predictions)
acc_score = accuracy_score(y_test,predictions)
pre_score = precision_score(y_test,predictions)
rec_score = recall_score(y_test,predictions)
print('Accuracy_score: ',acc_score)
print('Precision_score: ',pre_score)
print('Recall_score: ',rec_score)
print("-"*50)
cr = classification_report(y_test,predictions)
print(cr)
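To see how these metrics relate to the confusion matrix, here is a small sketch on made-up labels and predictions (not the model’s output).

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Made-up ground truth and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all samples
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

print(cm)
print("Accuracy: ", accuracy)
print("Precision:", precision)
print("Recall:   ", recall)
```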
We will find the probability of each class using the predict_proba() method of the Random Forest Classifier, and then we will plot the ROC curve.
predictions_probability = rfc.predict_proba(testdata)
fpr,tpr,thresholds = roc_curve(y_test,predictions_probability[:,1])
plt.plot(fpr,tpr)
# diagonal reference line from (0,0) to (1,1)
plt.plot([0,1])
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
As we can see, our model performed very well in classifying the sentiments, with accuracy, precision, and recall scores of approximately 96%. The ROC curve and confusion matrix look good as well, which means that our model is able to classify the labels accurately, with few errors.
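The ROC curve can also be summarized in a single number, the area under the curve (AUC). A minimal sketch with made-up scores follows; roc_auc_score is not used elsewhere in this article and is added here as an assumption about what a reader might want next.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Each threshold yields one (false positive rate, true positive rate) point.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

print(fpr, tpr)
print("AUC:", auc)
```

An AUC of 0.5 corresponds to the diagonal reference line, and values closer to 1 indicate that positive examples tend to receive higher scores than negative ones.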
Now, we will check for custom input as well and let our model identify the sentiment of the input statement.
Predict for Custom Input:
def expression_check(prediction_input):
    if prediction_input == 0:
        print("Input statement has Negative Sentiment.")
    elif prediction_input == 1:
        print("Input statement has Positive Sentiment.")
    else:
        print("Invalid Statement.")

# function to take the input statement and perform the same transformations we did earlier
def sentiment_predictor(statements):
    processed = text_transformation(statements)
    transformed_input = cv.transform(processed)
    prediction = rfc.predict(transformed_input)
    # predict() returns an array; check the label of the single input statement
    expression_check(prediction[0])

input1 = ["Sometimes I just want to punch someone in the face."]
input2 = ["I bought a new phone and it's so good."]

sentiment_predictor(input1)
sentiment_predictor(input2)
Hurray! As we can see, our model accurately classified the sentiments behind the two sentences.
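The same flow (vectorize, predict, map the label to a message) can be tried end to end on a toy example. The training sentences and labels below are made up, and the names toy_cv, toy_model, and describe are hypothetical. A model trained on six sentences only demonstrates the mechanics; it is not a usable classifier.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Made-up training data: 1 = positive, 0 = negative.
train_texts = [
    "love this phone", "great food", "really happy today",
    "hate this phone", "terrible food", "really sad today",
]
train_labels = [1, 1, 1, 0, 0, 0]

toy_cv = CountVectorizer(ngram_range=(1, 2))
toy_model = RandomForestClassifier(n_estimators=50, random_state=0)
toy_model.fit(toy_cv.fit_transform(train_texts), train_labels)

def describe(statement):
    # transform() expects a list of documents and predict() returns an array,
    # so wrap the statement in a list and take the first prediction.
    label = toy_model.predict(toy_cv.transform([statement]))[0]
    return "Positive" if label == 1 else "Negative"

print(describe("great phone"))
```

Note that the new statement is passed through transform(), not fit_transform(), so it is encoded with the vocabulary learned from the training data.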
Sentiment analysis using NLP stands as a powerful tool in deciphering the complex landscape of human emotions embedded within textual data. By leveraging various techniques and methodologies such as text analysis and lexicon-based approaches, analysts can extract valuable insights, ranging from consumer preferences to political sentiment, thereby informing decision-making processes across diverse domains. The polarity of sentiments identified helps in evaluating brand reputation and other significant use cases. As we conclude this journey through sentiment analysis, it becomes evident that its significance transcends industries, offering a lens through which we can better comprehend and navigate the digital realm.
Q1. What Is Sentiment Analysis?
A. Sentiment analysis is a technique used to determine whether a piece of text (like a review or a tweet) expresses a positive, negative, or neutral sentiment. It helps in understanding people’s opinions and feelings from written language.
Q2. What Are the Three Types of Sentiment Analysis?
A. Fine-grained Sentiment Analysis: This involves classifying sentiments into categories like very positive, positive, neutral, negative, and very negative.
Aspect-based Sentiment Analysis: This focuses on identifying sentiments about specific aspects or features of a product or service, like the taste of food or the speed of service in a restaurant.
Emotion Detection: This type categorizes text into different emotions such as happiness, anger, sadness, etc.
Q3. What Is the Objective of Sentiment Analysis?
A. The objective of sentiment analysis is to automatically identify and extract subjective information from text. It helps businesses and organizations understand public opinion, monitor brand reputation, improve customer service, and gain insights into market trends.
Q4. What Is Sentiment Analysis in Python?
A. Sentiment analysis in Python involves using libraries and tools to analyze text data and determine its sentiment. Commonly used libraries include:
1. NLTK (Natural Language Toolkit): For text processing and classification.
2. TextBlob: For simple sentiment analysis and text processing.
3. VADER (Valence Aware Dictionary and sEntiment Reasoner): For analyzing social media texts.
4. Transformers (Hugging Face): For using pre-trained models to perform sentiment analysis.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
Nikhil, 30 Jul, 2024
Data Scientist with 6 years of experience in analysing large datasets and delivering valuable insights via advanced data-driven methods. Proficient in Time Series Forecasting, Natural Language Processing and with a demonstrated history of working in the Telecom, Healthcare and Retail Supply Chain industries.