In our previous post, we fed Twitter data to the Text Analytics API, which detected the language of each tweet. In this post we build on that work, using the same API and the same Twitter data to determine the sentiment of each tweet. Sentiment analysis tells us the overall attitude of a piece of text: does it have a positive, negative, or neutral connotation?
Knowing the overall sentiment of what people say about your company is crucial to understanding how customers feel about it. If the overall sentiment is low, some investigation is needed to determine why customers aren’t satisfied with your products or services. The next step is to resolve any issues you discover before they have a significant impact on the business.
The notebook for this post is on GitHub. And speaking of code, let’s set up our imports. Our Text Analytics API key is still stored in an external config file, so we will make sure that gets loaded as well.
import json
import pandas as pd
import requests
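# Use a context manager so the config file is closed after loading.
with open("config.json") as config_file:
    config = json.load(config_file)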
text_analytics_sentiment_url = "https://westcentralus.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment"
You may have noticed that the sentiment URL is nearly identical to the language detection URL from the previous post. Only the end differs: that URL ended in “language”, while this one ends in “sentiment”.
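Since the endpoints share everything but their final path segment, one option is to build them from a single base URL. This is only a minimal sketch: `BASE_URL` and `build_url` are names made up for illustration and are not part of the API or the original notebook.

```python
# Hypothetical helper: derive endpoint URLs from one shared base URL.
BASE_URL = "https://westcentralus.api.cognitive.microsoft.com/text/analytics/v2.0"


def build_url(operation, base_url=BASE_URL):
    """Join the base URL and an operation name with exactly one slash."""
    return f"{base_url.rstrip('/')}/{operation.strip('/')}"


text_analytics_sentiment_url = build_url("sentiment")
print(text_analytics_sentiment_url)
```

Keeping the region and API version in one place means only `BASE_URL` has to change if the resource moves.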
Reviewing our Data
Let’s review the data from the previous post. We collected a set of tweets using the tweepy package and saved the results to CSV files. Now we can load those files and continue working with them.
df = pd.read_csv("./tweets.csv")
languages_df = pd.read_csv("./detected_languages.csv")
df.head()
And thanks to the language detection portion of the Text Analytics API, we determined the language of each tweet.
languages_df.head()
That turned out to be useful, since one of our tweets was in French.
languages_df.language.unique()
To use the sentiment analysis portion of the Text Analytics API, let’s merge the tweets and language data frames into one data frame. As usual, pandas comes to the rescue with an easy way to accomplish this. Since the two data frames have no columns in common to join on, we can use the index of each data frame to merge them.
df = pd.merge(df, languages_df, left_index=True, right_index=True)
df.head()
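To see what an index merge does, here is a small sketch on two toy data frames. The rows are invented for illustration; only the `text` and `language` column names come from our data. Note that merging on the index assumes both files were saved in the same row order.

```python
import pandas as pd

# Toy stand-ins for the tweets and detected-languages data frames.
tweets = pd.DataFrame({"text": ["Great talk today!", "Bienvenue dans l'equipe"]})
languages = pd.DataFrame({"language": ["en", "fr"]})

# With no shared key column, rows are matched by their index labels (0, 1, ...).
merged = pd.merge(tweets, languages, left_index=True, right_index=True)
print(merged)
```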
Sentiment Analysis API
As with language detection, the sentiment analysis documentation shows the structure of the data that should be in the request, and that structure is similar too. We can call iterrows on the data frame to build our documents. The result is almost identical to the language detection document structure; the difference is that each document now also carries a language field.
Passing in the correct language is critical. If the wrong language is used, the sentiment scores can come back much lower. Remember that we discovered one tweet was in French? If that tweet is processed as English instead of French, its sentiment score drops from 50% to around 20%.
documents = {"documents": []}
for idx, row in df.iterrows():
    documents["documents"].append({
        "id": str(idx + 1),
        "language": row["language"],
        "text": row["text"]
    })
documents
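For larger data sets it may help to wrap this step in a function and send the documents in batches. The sketch below works on plain dictionaries rather than a data frame. `build_documents`, `batches`, the `default_language` fallback, and the batch size are all assumptions for illustration, so check the service limits before picking a real size. The fallback only handles `None` or empty values; a pandas `NaN` would need its own check.

```python
def build_documents(rows, default_language="en"):
    """Build a request body from dicts that have 'text' and 'language' keys."""
    documents = []
    for idx, row in enumerate(rows, start=1):
        documents.append({
            "id": str(idx),
            # Assumption: fall back to a default when no language is present.
            "language": row.get("language") or default_language,
            "text": row["text"],
        })
    return {"documents": documents}


def batches(payload, size):
    """Yield request bodies holding at most `size` documents each."""
    docs = payload["documents"]
    for start in range(0, len(docs), size):
        yield {"documents": docs[start:start + size]}


# Invented rows for illustration only.
rows = [
    {"text": "Great talk today!", "language": "en"},
    {"text": "Bienvenue", "language": "fr"},
    {"text": "No language here", "language": None},
]
payload = build_documents(rows)
chunks = list(batches(payload, size=2))
print([len(chunk["documents"]) for chunk in chunks])
```

Each chunk can then be posted on its own, and the ids stay unique across chunks because they are assigned before splitting.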
Now we can structure our API request almost exactly the same as we did in the previous post.
headers = {"Ocp-Apim-Subscription-Key": config["subscriptionKey"]}
response = requests.post(text_analytics_sentiment_url, headers=headers, json=documents)
sentiments = response.json()
sentiments
In our response, we receive a sentiment score for each tweet. The scores range from 0 to 1. A score close to 0 indicates negative sentiment, a score close to 1 indicates positive sentiment, and a score near 0.5 indicates a more neutral tweet. We also got back no errors, which is always good.
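Rather than checking for errors by eye, we could split the response into scores and errors in code. This is a sketch under an assumption: that the body has a `documents` list of `id` and `score` pairs and an `errors` list of `id` and `message` pairs. The sample below is hand-written, not real API output, and `split_response` is a made-up helper name.

```python
def split_response(body):
    """Separate per-document scores from per-document errors."""
    scores = {doc["id"]: doc["score"] for doc in body.get("documents", [])}
    errors = {err["id"]: err["message"] for err in body.get("errors", [])}
    return scores, errors


# Hand-written sample shaped like the assumed response (not real output).
sample = {
    "documents": [{"id": "1", "score": 0.92}, {"id": "2", "score": 0.5}],
    "errors": [{"id": "3", "message": "example error text"}],
}
scores, errors = split_response(sample)
print(scores)
print(errors)
```

An empty `errors` dictionary then confirms that every document was scored.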
Analysis on Tweet Sentiment
Now comes the fun part. We have sentiment data for each tweet, which means we can now analyze it. But first, let’s put this data into a pandas data frame to make the analysis easier.
Since the data is in a nested dictionary, we can use list comprehensions to extract the “score” and “id” of each document. The ids become the index, and we name the single column “sentiment_score”.
sentiment_df = pd.DataFrame([d["score"] for d in sentiments["documents"]],
                            index=[d["id"] for d in sentiments["documents"]],
                            columns=["sentiment_score"])
sentiment_df.head()
We can calculate the sentiment as a percentage, rounded to two decimal places, and add that as a column to the data frame.
sentiment_df["sentiment_percentage"] = round(sentiment_df.sentiment_score * 100, 2)
sentiment_df.head()
Next we can call the describe method to get some descriptive statistics.
sentiment_df.sentiment_percentage.describe()
We can already draw some interesting conclusions. The mean (average) sentiment is 70%, which indicates that our tweets are positive overall. The minimum sentiment is 50%, so even the lowest-scoring tweets are neutral, and the maximum sentiment is 99%. We have some positive tweets about Wintellect!
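To turn percentages like these into coarse labels, we could bucket them. The cutoffs of 40 and 60 below are arbitrary assumptions chosen for illustration, not thresholds defined by the API, and the percentages are made up.

```python
from collections import Counter


def label(percentage, low=40.0, high=60.0):
    """Map a 0-100 sentiment percentage to a coarse label."""
    if percentage < low:
        return "negative"
    if percentage > high:
        return "positive"
    return "neutral"


# Made-up percentages for illustration only.
percentages = [50.0, 50.0, 72.5, 99.0, 18.3]
counts = Counter(label(p) for p in percentages)
print(counts)
```

With a real data frame, the same function could be applied to the sentiment_percentage column.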
Let’s look at the text of the tweet with the maximum sentiment.
sentiment_df[sentiment_df.sentiment_percentage == sentiment_df.sentiment_percentage.max()]
[doc for doc in documents["documents"] if doc["id"] == "20"]
Looks like our most positive tweet is welcoming Jeff Prosise. Now let’s see what the tweet text is for our minimum sentiment.
min_sentiment = sentiment_df[sentiment_df.sentiment_percentage == sentiment_df.sentiment_percentage.min()]
min_sentiment
We can look at all of these tweets at once with a list comprehension similar to the one before.
[doc for doc in documents["documents"] if doc["id"] in min_sentiment.index.values]
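The list comprehension rescans every document for each lookup. If we expect to look up many ids, one alternative is to index the documents by id once. The sketch below uses a hand-written stand-in for the documents payload, and `text_by_id` and `wanted_ids` are illustrative names.

```python
# Hand-written stand-in for the documents payload built earlier.
documents = {"documents": [
    {"id": "1", "language": "en", "text": "first tweet"},
    {"id": "2", "language": "fr", "text": "deuxième tweet"},
    {"id": "3", "language": "en", "text": "third tweet"},
]}

# Index the documents once so each lookup is a dictionary access.
text_by_id = {doc["id"]: doc["text"] for doc in documents["documents"]}

wanted_ids = ["1", "3"]  # for example, the ids in min_sentiment.index
texts = [text_by_id[doc_id] for doc_id in wanted_ids]
print(texts)
```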
We can even go a bit further and do a plot of sentiment frequencies.
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline
# Pass the column by keyword; recent seaborn versions no longer accept it positionally.
g = sns.countplot(x=sentiment_df.sentiment_percentage)
loc, labels = plt.xticks()
g.set_xticklabels(labels, rotation=90)
g.set_ylabel("Count")
g.set_xlabel("Sentiment %")
There are many more tweets with a 50% sentiment than with any other score. Also, as we saw from the describe method, 50% is the lowest and 99% is the highest sentiment.
Conclusion
You may have noticed that we didn’t have to do any pre-processing of our text. We submitted the raw text to the API, and the API was able to give us results. This approach saved us a lot of time that would otherwise go into cleaning our text data. That is part of the power of Microsoft Cognitive Services. The APIs not only add intelligence to your applications, but you also don’t have to clean the data to get that intelligence. Microsoft does the heavy lifting for you.
The Text Analytics API also allows you to extract key phrases. I’ll leave that as an exercise for you so you can play around with the API.
In this post, we covered getting sentiment analysis from our Twitter data and then doing some quick analysis of the sentiment scores. Without having to do the pre-processing of our data, we were able to quickly get our sentiment analysis and start analyzing the results to gain insights.