This post elaborates on a simple opinion mining POC I was recently asked to develop as part of an internal company event. The general idea is to combine supervised machine learning with sentiment analysis to build an opinion mining tool. The tool is written in Python and builds on the nltk library, tweepy, and Spark. For readability, the post is broken into short subsections.
The basic idea
At the core of this application lies a Naïve Bayes prediction model. The intention is to train the model to identify the polarity of a Brexit-related tweet—leave or stay—and then use the model to classify person A's tweets as well as the tweets of the people person A follows on Twitter, in order to effectively classify person A.
What is a Naïve Bayes classifier?
In brief, it is a classifier that applies Bayes' Theorem to identify a tweet's class membership (e.g. positive/negative), based on previously classified tweets of 'similar type'. Simply put:

P(label | tweet) = P(tweet | label) × P(label) / P(tweet)

which is interpreted as:
P(label | tweet): The probability of a label given a tweet (the quantity we want to compute).

P(tweet | label): The probability of a tweet given a label (estimated from previously gathered, labelled tweets).

P(label): The prior probability of the label, estimated from the proportion of each label in the training data (e.g. 50% for a balanced two-label corpus).

P(tweet): The probability of the tweet, which acts as a normalising constant; since it is the same for every label, it can be ignored when comparing labels.
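To make the formula concrete, here is a minimal sketch of the calculation with hypothetical token counts (the counts, labels, and tokens below are illustrative, not taken from the actual corpus). It scores each label as P(label) times the product of the smoothed per-token probabilities, exactly as described above, and picks the larger:

```python
from collections import Counter

# Hypothetical token counts per label (illustration only)
token_counts = {
    "Leave": Counter({"vote": 2, "leave": 4, "eu": 2}),
    "Stay":  Counter({"vote": 2, "stay": 4, "eu": 2}),
}
priors = {"Leave": 0.5, "Stay": 0.5}          # balanced corpus
vocab = set().union(*token_counts.values())

def score(tokens, label):
    # Unnormalised P(label) * prod P(token | label), with add-one smoothing
    total = sum(token_counts[label].values())
    p = priors[label]
    for t in tokens:
        p *= (token_counts[label][t] + 1) / (total + len(vocab))
    return p

def classify(tokens):
    # P(tweet) is the same for both labels, so we just compare the scores
    return max(priors, key=lambda lbl: score(tokens, lbl))

print(classify(["vote", "leave"]))   # → Leave
```

Note that P(tweet) never appears in the code: since it divides both scores equally, the comparison is unaffected.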
Why Naïve Bayes?
The classifier is characterised as "Naïve" because it assumes, in our case, that a tweet's tokens are independent random events, which greatly simplifies the calculations and speeds up prediction. Some argue that this assumption may reduce accuracy, but that is context dependent—some models do require tokens to be treated as dependent random events to predict accurately—and in practice it is very rarely a problem, while the model remains extremely fast.
The first step of this process is training the model. To do so I had to first:
1. accumulate a corpus of Brexit related tweets, and;
2. label them as either “Leave” or “Stay”.
To that end, a BBC article that listed close to 440 MPs and where they stand on the referendum proved very helpful.
Getting a list of @MP-handles using the twitter API
Using the Twitter API was an essential part of this project. Anyone who intends to do so must first create a new Twitter Application by following the steps here.
To my good fortune, a Twitter list to which all UK MPs with active Twitter accounts were subscribed was available on Twitter, and I was able to get my hands on it. I used Python for the whole project. Below you can see the full code I used to download the list.
import tweepy
import csv

# -- Twitter API credentials --
consumer_key = "<Your Consumer Key>"
consumer_secret = "<Your Consumer Secret>"
access_key = "<Your Access Key>"
access_secret = "<Your Access Secret>"

# authorize twitter app
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)

# -- SUB-FUNCTIONS --
def get_list_members(owner_screen_name, slug):
    twitter_list = api.list_members(owner_screen_name=owner_screen_name, slug=slug, count=700)
    # transform the tweepy users into a 2D array that will populate the csv
    users = [[user.name.encode("utf-8")
                  .replace(" MP", "")
                  .replace(" MEP", "")
                  .replace(" MdEP", "")
                  .replace(" ASE/MEP", ""),
              user.screen_name.encode("utf-8")]
             for user in twitter_list]
    # write the csv
    with open('%s_list.csv' % slug, 'wb') as f:
        writer = csv.writer(f)
        writer.writerow(["Name", "Screen_Name"])
        writer.writerows(users)
    return twitter_list

# -- MAIN CODE --
if __name__ == '__main__':
    # UK MPs: owner_screen_name "tweetminster", slug "ukmps"
    print("Instantiating UK-MPs's List...")
    ukMPs = get_list_members("tweetminster", "ukmps")
    print("List created successfully... ")
Once I had both lists in my possession (BBC list and twitter handles), it was a simple matter of matching MP names with their respective handles. In order to effectively do so, I had to strip all the prefixes from the registered MP names as they appear in the twitter list I downloaded, hence the use of:
replace(" prefix", "")
Note that to run the .get_list_members() method you will need both the owner's screen name and the list slug. The slug is the short name that identifies the list in its URL (in this case, "ukmps").
Accumulating the corpus
This part was generally easy. Once more, using the tweepy library, I was able to iterate through the list of labelled MPs, get their handles, access their timelines and download as many as 3,240 (the maximum allowed by Twitter) of their latest tweets. For most accounts, this number was more than enough to collect their full Twitter history. A couple of things are worth noting, however.
The main method used here is:
new_tweets = api.user_timeline(screen_name=screen_name, count=200)
The count parameter takes a maximum value of 200, which means one call can collect at most 200 tweets. If you want more, you need to call the method repeatedly (while not exceeding the overall maximum of 3,240 tweets), each time resuming from where the previous call stopped. This means you have to identify the last tweet collected in the previous call and use its id as a parameter in your next call. This is done with:
oldest = new_tweets[-1].id - 1
Hence, every subsequent call should be of the form:

new_tweets = api.user_timeline(screen_name=screen_name, count=200, max_id=oldest)
Needless to say, you need to append every batch of 200 tweets to a running list as you iterate.
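Putting the pieces above together, the collection loop can be sketched as follows. The function below is an illustration, not the post's actual code: it assumes an `api` object exposing `user_timeline(screen_name=..., count=..., max_id=...)` that returns tweet objects with an `.id` attribute, as tweepy's does.

```python
def collect_all_tweets(api, screen_name, per_call=200, cap=3240):
    # First call: the newest `per_call` tweets on the timeline
    all_tweets = list(api.user_timeline(screen_name=screen_name, count=per_call))
    while all_tweets and len(all_tweets) < cap:
        # Resume just before the oldest tweet collected so far
        oldest = all_tweets[-1].id - 1
        batch = api.user_timeline(screen_name=screen_name,
                                  count=per_call, max_id=oldest)
        if not batch:          # reached the start of the timeline
            break
        all_tweets.extend(batch)
    return all_tweets[:cap]    # respect Twitter's overall limit
```

In practice you would call this once per MP handle and label the returned tweets accordingly.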
In addition to this, you need to wrap all your calls in try/except blocks to handle exceptions. There are generally 3 exceptions you need to worry about:
1. Not authorised: Some users won’t allow you to collect their tweets.
2. Page does not exist: It may simply be the case that you have misspelled a twitter handle.
3. Rate limit exception: Twitter generally allows uninterrupted collection of tweets for roughly 10 minutes. You should be able to download more than 50k tweets within that time. After that, you will get a “Rate limit exceeded” exception and you will be blocked from accessing any more twitter timelines for as long as 15 minutes (the temporary ban is consumer key based so there is a way to bypass it—I’ll get to this later in the article).
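A minimal sketch of how the first two cases can be handled is shown below. To keep it self-contained it catches a generic exception from a caller-supplied fetch function; with tweepy you would catch the library's own exception type instead, and the message strings checked here are assumptions based on the errors described above.

```python
def collect_or_skip(fetch, screen_name):
    """Return the user's tweets, or [] if the account can't be read.

    `fetch` stands in for a call like api.user_timeline(screen_name=...).
    """
    try:
        return fetch(screen_name)
    except Exception as e:               # with tweepy: the library's error type
        msg = str(e)
        if "Not authorized" in msg or "does not exist" in msg:
            return []                    # protected or misspelled account: skip
        raise                            # rate limit etc.: let the caller wait/retry
```

Rate-limit errors are deliberately re-raised, since the right reaction there is to pause (or switch credentials) rather than skip the account.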
Furthermore, if you are only interested in collecting tweets up to a given point in the past, e.g. no earlier than 2015, a simple approach is to break your loop when the year of the last tweet collected is smaller than the year of your choice. To get the year of creation of the last tweet collected, you can run the code below:
year = int(str(new_tweets[-1].created_at)[0:4])
where [0:4] simply refers to the first 4 characters of the .created_at value which refer to the year.
Finally, for every batch of 200 tweets collected, I had to filter out the non-brexit-related ones, encode every tweet based on “utf-8” and label it in accordance to the label associated with the MP whose account I accessed. I initially used relatively uncommon terms like:
list_words = ['European union', 'European Union', 'european union', 'EUROPEAN UNION', 'Brexit', 'brexit', 'BREXIT', 'euref', 'EUREF', 'euRef', 'eu_ref', 'EUref', 'leaveeu', 'leave_eu', 'leaveEU', 'leaveEu', 'borisvsdave', 'BorisVsDave', 'StrongerI', 'strongerI', 'strongeri', 'strongerI', 'votestay', 'vote_stay', 'voteStay', 'votein', 'voteout', 'voteIn', 'voteOut', 'vote_In', 'vote_Out', 'referendum', 'Referendum', 'REFERENDUM']
to make sure that the tweets would definitely be related to the referendum (identification of word participation in the tweet was done using the .find(word) function). However, the corpus I was able to accumulate was no bigger than 19k tweets. This climbed to 56k once I added: ' eu ' and ' EU '. It is true that this resulted in including some tweets about the European Union that were unrelated to the referendum but after a good inspection of the data the proportion of those tweets was insignificant and should not materially affect the accuracy of the model. Finally note that accumulating this many tweets can take as long as a whole day so you will need to plan ahead in case you need to meet some deadline.
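The keyword filter described above can be expressed as a small helper; this is a sketch of the `.find(word)` check, using a shortened sample of the keyword list (the helper's name is my own, not from the original code):

```python
def is_referendum_related(text, keywords):
    # Mirrors the .find(word) check: True if any keyword occurs in the tweet
    return any(text.find(word) != -1 for word in keywords)

# Shortened sample of the keyword list shown above
list_words = ['Brexit', 'brexit', 'euref', 'referendum', 'Referendum', ' eu ', ' EU ']

print(is_referendum_related("Why I will vote leave in the euref", list_words))  # → True
```

Note that padding ' eu ' with spaces avoids matching words that merely contain "eu", at the cost of missing tweets where "EU" starts or ends the text.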
The full code can be found here.
Dividing the Corpus: Training/Cross-Validation/Test sets
The corpus (labelled tweets) was saved in a *.csv file, which I loaded using a pandas dataframe. From it I extracted two columns: one for the tweets and one for the corresponding tweet-labels.
# Load Corpus using Pandas
dataFrame = pandas.read_csv('/Users/username/PycharmProjects/twitter_analytics/corpus.csv',
                            header=None,
                            names=['name', 'screen_name', 'id', 'created_at', 'text', 'label'])

# Load Columns as Arrays (Notice: first element = column name)
tweets = dataFrame['text']
del tweets[0]
labels = dataFrame['label']
del labels[0]
The del statements simply remove the first element of each column, which holds the column name from the top of the csv file.
The model was developed using the Apache Spark MLlib. The first step of the training process was to break the corpus into a training set (60%), cross-validation set (20%) and test set (20%). Unfortunately, an additional restriction required that the original set of 56k tweets had to be cut down to 18202 * 2 = 36404 tweets, where 18202 is the number of the “Leave” labels (the smaller of the two sets).
This had to be done in order to balance the training set between the “Leave” and the “Stay” tweets. You may be wondering why this was necessary, i.e. why sacrifice close to 20k tweets for it?
The answer is not as simple as one might think, so allow me to elaborate a bit. It is reasonable to expect that words (tokens) with significant semantic content, like "voteIn" or "voteLeave", will appear mostly (in some cases exclusively) in stay- or leave-labelled tweets respectively. However, less semantically significant words like "say" or "support" are very likely to appear in both sets in roughly equal proportions. If one dismisses this as unimportant, such neutral words end up being quantified in favour of one result over the other when they shouldn't be, which introduces a bias into the prediction model.
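A quick numeric sketch makes the bias concrete (the counts and rates below are hypothetical, chosen only to mirror the rough corpus proportions mentioned above). Take a neutral word that appears at the same rate in both classes; with an unbalanced corpus, the priors alone tip the posterior towards the majority class:

```python
# Hypothetical figures for illustration
n_stay, n_leave = 37000, 18200   # unbalanced corpus sizes
rate = 0.01                      # a neutral word like "support" appears in
                                 # 1% of tweets of EACH class

p_stay = n_stay / (n_stay + n_leave)
p_leave = 1 - p_stay

# Posterior for a tweet containing only the neutral word (Bayes' theorem)
p_stay_given_word = (rate * p_stay) / (rate * p_stay + rate * p_leave)
print(round(p_stay_given_word, 2))   # → 0.67: the neutral word now "votes" Stay
```

The per-class rates cancel out, so the posterior collapses to the prior: every semantically neutral word effectively casts a vote for the majority class. Balancing the sets makes that prior 50/50 and removes the effect.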
So, first things first, I began the process by creating a Spark context object:
sc = SparkContext('local', 'EU_Tweet_Sentiment_Analyser')
and started by producing an RDD composed of the tweet labels, to which I applied simple transformations and actions to count the "Leave" and "Stay" labels respectively.
# Instantiate Tweet RDDs
labels_RDD = sc.parallelize(labels, 4)
total_tweets = labels_RDD.count()
pos_tweets = labels_RDD.filter(lambda x: x == "Stay").count()
neg_tweets = labels_RDD.filter(lambda x: x == "Leave").count()
This was followed by splitting the tweets into two sets: positives and negatives (leave and stay respectively).
# Break tweets between positive and negative
pos_tweets = []
neg_tweets = []
for (tweet, label) in itertools.izip(tweets, labels):
    if label == "Stay":
        pos_tweets.append(tweet)
    else:
        neg_tweets.append(tweet)
This was followed by calculating the number of tweets to include in each of the three sets and calling the populate_with() function, which chose random tweets from the positives' and negatives' sets to populate each of them (the example below shows just the training set).
# Divide respectively to 60%-20%-20%
training_no = int(min(len(pos_tweets), len(neg_tweets)) * 60 / 100)
cross_validation_no = int(min(len(pos_tweets), len(neg_tweets)) * 20 / 100)
test_no = min(len(pos_tweets), len(neg_tweets)) - training_no - cross_validation_no

# Training Set
training_set = []
training_labels = []
(training_set, training_labels) = populate_with(training_no, pos_tweets, "STAY",
                                                training_set, training_labels)
(training_set, training_labels) = populate_with(training_no, neg_tweets, "LEAVE",
                                                training_set, training_labels)
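The body of populate_with() is not shown in the post; one plausible implementation, sketched here purely as an assumption about what it does (draw n random tweets from a source list and append them, with their label, to the destination lists), is:

```python
import random

def populate_with(n, source_tweets, label, dest_set, dest_labels):
    # Hypothetical sketch: sample n tweets without replacement and
    # append them, each paired with the given label
    for tweet in random.sample(source_tweets, n):
        dest_set.append(tweet)
        dest_labels.append(label)
    return dest_set, dest_labels
```

Sampling without replacement matters here: it keeps the training, cross-validation and test sets disjoint only if each tweet is removed from (or marked in) the source pool between calls, so a full implementation would also need to track which tweets were already drawn.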
Next comes tokenizing each tweet in the training set. To do so using Spark transformations, I converted the training list of tweets into an RDD:
training_RDD = sc.parallelize(training_set, 4)
For the tokenizing part I used the tokenize method from the nltk library:
from nltk.tokenize import RegexpTokenizer
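With a pattern such as r'\w+' (the exact regex used in the project is not shown, so this pattern is an assumption), RegexpTokenizer splits a tweet into runs of word characters and drops punctuation. The dependency-free sketch below reproduces that behaviour with the standard library, since RegexpTokenizer(r'\w+').tokenize(text) is equivalent to re.findall(r'\w+', text):

```python
import re

def tokenize(text):
    # Same effect as nltk's RegexpTokenizer(r'\w+'): keep runs of word
    # characters; '@' and '#' are stripped but the word after them survives
    return re.findall(r'\w+', text)

print(tokenize("RT @someMP: Vote to leave the EU! #euref"))
# → ['RT', 'someMP', 'Vote', 'to', 'leave', 'the', 'EU', 'euref']
```

Depending on the goal, one might prefer a pattern that keeps hashtags and @-mentions intact, since tokens like "#euref" carry a lot of signal for this task.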