1. Introduction
1.1 What is Twitter?
Figure 1. Twitter
Twitter is a popular social media platform where users interact with each other by publishing tweets. Tweets are constrained by a character limit (280 characters) and can include media such as photos and videos. These constraints shift the focus of Twitter to sharing brief thoughts and opinions, usually about trending news. Another key feature of Twitter is the use of hashtags: words and phrases prefaced with the '#' character that link an individual tweet to broader topics. With 330 million monthly active users reported in the first quarter of 2019, Twitter offers a constant stream of unfiltered thoughts and opinions [1]. In addition to registered users, Twitter reports that 500 million people access the platform every month without logging into an account, suggesting just how popular the site is as a source of information and entertainment regarding current news [1]. But is all this content coming from real people? Following the events of the 2016 US election, Twitter reportedly closed 70 million fake accounts after questions arose about the effects of misleading content being circulated by fake accounts on the platform [1]. As evidence of state-backed Twitter bots came to light, many were left asking how this false information affects the population, its voting patterns, and its political opinions. Our group decided to explore this issue further by modeling the activity of Twitter bots to detect potentially harmful state-backed Twitter accounts that exist solely to circulate highly politicized propaganda targeted at the United States of America.
1.2 Twitter Bots
Twitter bots automate common actions performed on Twitter such as tweeting, retweeting, and liking. The purpose and complexity of Twitter bots vary: some exist merely to follow one specific account (i.e. to boost that account's follower count) while others post content and interact with real Twitter users to promote a viewpoint. In the context of this project, we focus on state-backed Twitter bots that exist to promote a political agenda and spread false information to disrupt and warp the public's knowledge. After evidence was revealed that Russia's Internet Research Agency (IRA) attempted to interfere with the 2016 U.S. presidential election using fake Twitter accounts, government officials and users alike have raised concerns about differentiating authentic accounts from "trolling" fake ones [2]. In this case, the fake accounts created by the IRA were operated as bots that pushed propaganda into the feeds of real users during critical election times. Such false information could alter the perspective of voters and change the outcome of national elections.
Figure 2. Bots
1.3 Motivation and Background
The motivation of our project is to use machine learning algorithms to model the behavior of state-backed Twitter bots in an effort to detect and differentiate a state-backed Twitter bot from an authentic account. State-backed trolls are a very real problem in today's digital world as more and more nation-states jockey to control and manage thousands of bots on Twitter. These nation-states benefit from the propaganda the bots promote: the bots can work to sway public opinion, as suggested by the controversy surrounding Russian involvement in the 2016 United States election. Twitter has taken notice and worked to remove flagged bot accounts, but the rate at which fake accounts are created will only continue to grow, and Twitter's best efforts to mitigate their use will likely lag behind the damage the bots do.
Fortunately, Twitter releases datasets of tweets it suspects to be backed by nation-states every month. This information is free to use, as Twitter hopes others will join in analyzing the data in an effort to fight back against the use of biased bots. We decided to utilize Twitter's datasets as they proved to contain a thorough collection of compromised accounts and tweets. With another US election drawing nearer in the fall, our team wanted to join the countless data scientists also working to create models that mitigate the effects bots could have in swaying public opinion with inaccurate and misleading information.
Figure 3. Credit goes to Pew Research Center
1.4 Approach
In approaching this problem, we decided to analyze the tweets in two ways: one approach used the tweet text to predict whether or not the tweet came from a bot, while the other used a set of features we defined to draw conclusions about a tweet's authenticity; both are binary classification. We used two labels, one for a real tweet and one for a troll tweet. Below, we outline our approach to collecting and preprocessing the data as well as creating our models.
2. Methodology
2.1 Collecting the Data
As mentioned above, Twitter releases datasets of tweets that it believes come from government-backed efforts in multiple countries every month. The datasets are publicly available and consist of tweets originating from a variety of countries. The datasets we decided to use in this study include a set of Russian-linked tweets (January 2019), a set of Iranian-linked troll tweets (June 2019), and a set of China-linked tweets (July 2019).
In addition to obtaining our troll tweets, we needed a collection of tweets that came from authentic accounts; we used a mixture of verified and unverified accounts for this portion of our dataset. We used Tweepy, a Python wrapper for the Twitter API, to scrape tweets with a custom script we created. We ran the script overnight and accumulated approximately 60 thousand tweets from 90 sampled accounts, all politically related since that was also the nature of the bot tweets. This resulted in a thorough mixture of authentic and fake tweets that we could train our models with.
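A minimal sketch of what such a Tweepy collection script can look like is shown below; the credentials, account names, file name, and exact field names are placeholders rather than our actual script.

```python
import tweepy
import pandas as pd

# Placeholder credentials; the real script used our own API keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

rows = []
for screen_name in ["example_political_account"]:  # stand-in for the ~90 sampled accounts
    cursor = tweepy.Cursor(api.user_timeline, screen_name=screen_name,
                           tweet_mode="extended", count=200)
    for tweet in cursor.items(700):
        rows.append({
            "tweet_time": tweet.created_at,
            "tweet_text": tweet.full_text,
            "is_retweet": hasattr(tweet, "retweeted_status"),
            "like_count": tweet.favorite_count,
            "retweet_count": tweet.retweet_count,
            "hashtag_count": len(tweet.entities.get("hashtags", [])),
            "url_count": len(tweet.entities.get("urls", [])),
            "mention_count": len(tweet.entities.get("user_mentions", [])),
            "follower_count": tweet.user.followers_count,
            "following_count": tweet.user.friends_count,
            "label": 0,  # 0 = real tweet, 1 = troll tweet
        })

pd.DataFrame(rows).to_csv("real_tweets.csv", index=False)
```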
Since we used one portion of our compiled dataset for exploration and the remaining portion for the construction of the models, we needed to be aware of what data was used in which process so as to not draw incorrect conclusions from duplicate use of specific data. We made it a priority to keep track of how our dataset was organized and in what way portions of it were being used in each step of our methodology so as to not compromise the accuracy of our trained models.
At this point, we had collected data from a variety of sources using several collection methods. This meant that our data, while comprehensive, was somewhat unorganized as far as conforming to a standard format we could start working with.
Our data was compiled into a .csv format where each column was its own characteristic (i.e. tweet time, tweet text, and whether or not the tweet was retweeted). From this, we could import the data into our notebook workspace and further manipulate it using the Pandas framework.
2.2 Preprocessing
The preprocessing step was rather tedious because, while we had an abundance of data, it lacked a standardized format. To preprocess effectively, we finalized the key characteristics we wanted to select as features for the next step. In the datasets provided by Twitter, the tweet time was listed in a format different from the one seen in the data collected by our Tweepy script. Tweet time also had the most variance in its distribution, as seen in the histogram in Figure 4, which plots tweets by time of day over 24 hours. We thought it was interesting that while the real data had clear spikes of high and low activity, the troll data remained oddly consistent throughout the day, with only small spikes here and there. The frequency of the real tweets spikes as it gets closer to night, which makes sense, as people come home from work or school in the evening and generally have more time to tweet. The consistent nature of the fake tweets' frequency appears to be linked to their automated nature.
Figure 4. Tweet Time Histogram
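The histogram itself can be reproduced roughly as follows; the column names and CSV path are assumptions about our compiled dataset rather than the exact schema.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Tweet time formats differed between Twitter's archives and our Tweepy scrape,
# so everything is parsed into datetimes first and then binned by hour of day.
df = pd.read_csv("combined_tweets.csv")
df["tweet_time"] = pd.to_datetime(df["tweet_time"], errors="coerce", utc=True)
df["tweet_hour"] = df["tweet_time"].dt.hour

fig, ax = plt.subplots()
for label, name in [(0, "Real"), (1, "Troll")]:
    df.loc[df["label"] == label, "tweet_hour"].plot.hist(
        bins=24, range=(0, 24), alpha=0.5, density=True, label=name, ax=ax)
ax.set_xlabel("Hour of day (24h)")
ax.set_ylabel("Relative frequency")
ax.legend()
plt.show()
```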
In addition to tweet time, several other features we selected were represented differently across our compiled datasets. To standardize this, we imported the raw data into Pandas dataframes and worked through each dataset, relabeling columns and removing unwanted characteristics. One characteristic we wanted as a feature was the number of hashtags in a tweet; since not all of our data represented this the same way (some datasets listed the actual hashtags while others gave a count), we had to transform this characteristic in each dataframe to show the hashtag count. We also had to convert characteristics like whether or not a tweet was a quote from boolean values to a numerical 1 or 0.
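A sketch of the kind of per-dataset cleanup this involved is shown below, assuming one frame stores the raw hashtag list as a string while another already stores a count; the file and column names are illustrative.

```python
import ast
import pandas as pd

def hashtag_count(value):
    """Return a hashtag count whether the cell holds a count or a list of hashtags."""
    if pd.isna(value):
        return 0
    if isinstance(value, str) and value.startswith("["):
        return len(ast.literal_eval(value))  # e.g. "['election', 'vote']" -> 2
    return int(value)                        # already a numeric count

troll_df = pd.read_csv("russia_tweets.csv")
troll_df["hashtag_count"] = troll_df["hashtags"].apply(hashtag_count)

# Convert boolean characteristics such as is_quote to numeric 0/1.
troll_df["is_quote"] = troll_df["is_quote"].astype(bool).astype(int)
```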
We focused on nine features: whether or not the tweet was a retweet, the time of the tweet, the tweet's like count, the number of times the tweet was retweeted, the number of hashtags the tweet contained, the number of links the tweet contained, the number of user mentions, and the follower and following counts of the account the tweet came from. For the text classifiers, the sole feature was the tweet text. Most of our features were continuous, with the exception of whether or not the tweet was a retweet (a binary feature) and the time of the tweet (a temporal feature). We used the pairplot function in seaborn to plot the relationships among these features for our normal classifiers.
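The pairplot came from a call along these lines; the column names are assumptions matching the feature list above.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Numeric features plus the binary label (0 = real, 1 = bot); the diagonal
# cells show each column's distribution, the rest show pairwise relationships.
cols = ["tweet_hour", "like_count", "retweet_count", "hashtag_count",
        "url_count", "mention_count", "follower_count", "following_count", "label"]
sns.pairplot(df[cols])
plt.show()
```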
Figure 5. Tweet Data Pairplot
As seen above in Figure 5, the diagonals represent the variance of each feature, while the remaining cells show the relationship between the two features that make up their respective rows and columns. This reaffirmed our observation that tweet time had the most variance, and it also yielded some interesting insights, such as the relationship between tweet time and like/retweet counts: both had higher values when the tweet was posted later in the day, supporting the conjecture that more popular accounts post later in the day. We also observed a positive correlation between like count and retweet count, which again makes sense for Twitter. Another interesting insight was that as the number of URLs and hashtags increased, the number of likes trended downwards, suggesting that bot accounts were more likely to tweet excessive numbers of them. The label column and row simply show the distribution of real and bot tweets across the other features, zero being real and one being a bot.
Part of why we felt our approach was valid was that we split the tweet text from the rest of the features and analyzed it separately. Building classifiers for the tweet text proved challenging in itself, as we had to use text vectorization to turn the text into data that could be passed into the machine learning algorithms used to create the models. Once we did, we had a series of models that focused solely on the text data, which was interesting in itself because we could clearly see how the models predicted the likelihood of a tweet coming from a bot based on the text alone. The rest of the features required far less manipulation, and most of the work in processing them involved standardizing their values. The two different approaches, text classification and normal classification, allowed us to achieve better insight into this binary classification problem.
We performed some data exploration on the textual data as well, in the form of the word clouds in Figures 6 and 7.
Figure 6. Wordcloud of Bot tweets         Figure 7. Wordcloud of Real tweets
We thought it was interesting how both word clouds shared some phrases and terms, yet there were also many distinct differences among some of the less frequent terms.
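The word clouds were generated with the wordcloud package roughly as follows (a sketch, assuming the tweet text and labels live in the combined dataframe).

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, (label, title) in zip(axes, [(1, "Bot tweets"), (0, "Real tweets")]):
    # Concatenate all tweets of the class into one string for the cloud.
    text = " ".join(df.loc[df["label"] == label, "tweet_text"].astype(str))
    cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
    ax.imshow(cloud, interpolation="bilinear")
    ax.set_title(title)
    ax.axis("off")
plt.show()
```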
In addition to why we chose this approach, we would like to discuss what we felt was new about the way we approached the problem. From the research we did, we did not see anyone use anything other than verified tweets in their real tweet dataset. While this makes sense considering how easy it is to obtain tweets from verified accounts, we felt it did not reflect the majority of authentic, unverified accounts that more often come into contact with bots on Twitter. This is precisely why we adjusted our data collection process by manually sifting through accounts on Twitter and finding real, unverified accounts to take tweets from. Tweets associated with verified accounts tend to be distinctly different from a troll's, which can lead to highly biased models and results that would not scale well to larger tweet samples. Since we incorporated many authentic, unverified accounts into our sample, our models should be more accurate, as the differences between unverified authentic accounts and trolls are much more subtle, for example in follower count.
3. Text Classification
3.1 Feature Selection
For text classification, we focused on analyzing each tweet’s text. Using pandas, we constructed a dataframe to store the text of each tweet. In the next section, we discuss how we engineered the text feature for Natural Language Processing (NLP).
Two example tweets, one from each class:

Bot: "Is it fair that Indians buy Gasoline and Oil much more expensive than before by this ridicules sanctions on Iran oil????? Response me if you have any answer...."

Real: "Exactly, it's just a perfect example of what trumps talking about"
3.2 Feature Engineering
First, we needed to transform the textual data into something that could be passed into the machine learning models. This step was rather challenging, as each tweet had many characteristics that needed to be addressed in order to standardize its textual representation. We started by using regex to strip away any links from the tweet's text. We then cleaned each tweet's text by removing punctuation and symbols, making the text lowercase, and trimming unnecessary whitespace. We also removed stop words like "the", "an", and "in", which can be ignored since they hold little meaning and have no real effect on the model. Last, we used stemming to reduce each word down to its root where possible; this ensured words like "voted" and "votes" were reduced to their fundamental form, "vote". With nothing but the tweet's cleaned textual content remaining, we prepared the tweets for tokenization. In the context of NLP, we treated each tweet as its own document, and the collection of tweets in the dataset became the corpus. Each document, or tweet, was tokenized using a bag-of-words representation before using TF-IDF to assign weights to each word in the corpus, as demonstrated in Figure 8.
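A condensed sketch of this cleaning pipeline using NLTK's stop word list and Porter stemmer is shown below; our original script may differ in its details.

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean_tweet(text):
    text = re.sub(r"http\S+|www\.\S+", "", text)                 # strip links
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in STOP_WORDS]     # drop stop words
    return " ".join(stemmer.stem(t) for t in tokens)              # "voted"/"votes" -> "vote"

df["clean_text"] = df["tweet_text"].astype(str).apply(clean_tweet)
```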
Figure 8. NLP Techniques
TF-IDF stands for 'Term Frequency-Inverse Document Frequency': the words in the corpus are scored not only by how frequently they appear in a document but also by how distinct, or unique, they are across all documents in the corpus.
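In scikit-learn, this tokenization and weighting is handled by TfidfVectorizer; a minimal sketch, assuming the cleaned text from the step above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Each cleaned tweet is a document; fit_transform builds the vocabulary over
# the whole corpus and returns a sparse matrix of TF-IDF weights.
vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(df["clean_text"])
y = df["label"].values
```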
3.3 Model Selection
Using what we learned in class as well as our own research, we decided on three different classifier models: Naive Bayes, Logistic Regression, and Random Forest, all of which work well for binary classification.
3.4 Dimension Reduction
The TF-IDF vectorizer creates term frequency matrices, and in text classification the matrices generated were sparse. Since these matrices were sparse, performing PCA on them did not work, so we changed our approach and used truncated SVD to reduce the number of terms in the matrix from over 400,000 to just 100. We lose some accuracy this way, but it makes the approach much more scalable and efficient.
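Unlike PCA, TruncatedSVD operates directly on the sparse TF-IDF matrix without densifying or centering it; a sketch of the reduction to 100 components, reusing the matrix from the vectorizer sketch above:

```python
from sklearn.decomposition import TruncatedSVD

# Reduce the sparse TF-IDF matrix from its full vocabulary (~400,000 terms)
# down to 100 latent components.
svd = TruncatedSVD(n_components=100, random_state=42)
X_text_reduced = svd.fit_transform(X_text)
```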
3.5 Results
3.5.1 Accuracy
For our text classifiers, we used logistic regression, random forest, and naive Bayes. Each of these models was trained and tested on data partitioned with sklearn's train/test split using a 75/25 ratio. For each of these, with the exception of naive Bayes, we also ran the models with dimension reduction; naive Bayes assumes that all features are independent, while the components produced by dimension reduction combine the original features, so we did not apply it there. Without dimension reduction, the naive Bayes model was the most accurate at approximately 87.66%, followed by random forest at 86.81% and logistic regression at 86.69%. With dimension reduction, random forest performed the best at approximately 80%, making it an accurate and fast option for classifying datasets of growing size.
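Training and scoring the three text classifiers followed the standard scikit-learn pattern; the sketch below uses default hyperparameters, which are not necessarily the exact settings we used.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

# 75/25 split of the TF-IDF matrix and labels.
X_train, X_test, y_train, y_test = train_test_split(
    X_text, y, test_size=0.25, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Naive Bayes": MultinomialNB(),  # naive Bayes was only run without dimension reduction
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```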
Figure 9. Text Classification Accuracies
Model | Percent Accuracy (without dimensionality reduction) | Percent Accuracy (with dimensionality reduction)
---|---|---
Logistic Regression | 87% | 75% |
Random Forest | 87% | 80% |
Naive Bayes | 88% | n/a |
3.5.2 Confusion Matrices
Figure 10. Logistic Regression w/ Dim. Red.         Figure 11. Random Forest w/ Dim. Red.
A confusion matrix is a sound way to visualize the accuracy of a classification model: the diagonal squares (top left and bottom right) represent correctly classified tweets of each class, where '1' means a bot and '0' means a real tweet. It can be seen how much more accurate the random forest model with dimension reduction is than the logistic regression model, and it is especially good at predicting real tweets. The logistic regression model struggled mightily to predict bot tweets accurately, though its high score when predicting real tweets balanced it out somewhat.
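A confusion matrix like those in Figures 10 and 11 can be produced with sklearn's confusion_matrix and a seaborn heatmap; a minimal sketch, reusing the random forest and test split from the sketch above:

```python
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

y_pred = models["Random Forest"].predict(X_test)
cm = confusion_matrix(y_test, y_pred)  # rows: true label, columns: predicted label

sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["0 (real)", "1 (bot)"], yticklabels=["0 (real)", "1 (bot)"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
```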
4. Classification
4.1 Feature Selection
For the normal classification, we compared different characteristics of the datasets to identify and select features ideal for analysis. As mentioned previously, the compiled dataset was a mixture of tweets taken from official Twitter datasets and tweets we scraped, so the characteristics were not fully aligned. The two sources did have about eight features in common, and those served as the basis for the features we selected.
To scale our data, we used several of sklearn's built-in scalers. Scaling standardizes the values in the dataframe for more accurate comparison. We made use of several scaling methods, including sklearn's Normalizer (rescales each sample to unit norm), StandardScaler (centers each feature by removing its mean and divides it by its standard deviation), and MinMaxScaler (rescales each feature to a fixed range, typically 0 to 1).
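A sketch of applying these scalers to the numeric feature columns; swapping in Normalizer or MinMaxScaler is a one-line change, and the column names are assumptions matching the earlier feature list.

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer

feature_cols = ["is_retweet", "tweet_hour", "like_count", "retweet_count",
                "hashtag_count", "url_count", "mention_count",
                "follower_count", "following_count"]

scaler = StandardScaler()          # or MinMaxScaler() / Normalizer()
X = scaler.fit_transform(df[feature_cols])
y = df["label"].values
```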
Figure 12. Our Selected Features
4.2 Model Selection
After preprocessing our data, it was time to start building our models. We used a multitude of models and ML algorithms to analyze our data including logistic regression, SVM, random forest, naive bayes, PCA, k-means, gradient boosted decision trees, and Gaussian mixture model (GMM).
Out of curiosity, we chose to use K-means and GMM to see how data was interpreted with unsupervised learning.
4.3 Dimension Reduction
When building our models, we wanted to focus on scalability. To us, this was an exercise in creating models that could potentially be run every time someone tweets, stopping bots from the start. Since we were working with a multitude of features, we used PCA to reduce them down to two components, as we felt this would yield the most scalable results.
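A minimal sketch of the two-component PCA applied to the scaled feature matrix from the previous section:

```python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)  # X is the scaled feature matrix from above

# Color each point by its true label (0 = real, 1 = bot), as in Figure 13.
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap="coolwarm", s=5, alpha=0.5)
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.show()
```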
Figure 13. PCA Component Graph
As seen above, the two components have a positive correlation, and the points are colored by their true labels.
4.4 Results
4.4.1 Supervised Learning Models
4.4.1(a) Accuracy
We created five models that fall under the supervised learning umbrella: logistic regression, SVM, naive Bayes, random forest, and gradient boosted decision trees. As with the text classification models, the tree-based models were among the most accurate, with the random forest achieving accuracies of 99% (without dimension reduction) and 95% (with dimension reduction). The gradient boosted decision tree had an accuracy of 97% without dimension reduction and 86% with it. Logistic regression recorded an accuracy of 82% without dimension reduction and 75% with it, and SVM performed comparably at 83% without dimension reduction and 73% with it. Naive Bayes had an accuracy of 74% without dimension reduction.
Figure 14. Supervised Classification Accuracies
Model | Percent Accuracy (without dimensionality reduction) | Percent Accuracy (with dimensionality reduction)
---|---|---
Random Forest | 99% | 95% |
Gradient Boosted Decision Tree | 97% | 86% |
Logistic Regression | 82% | 75% |
SVM | 83% | 73% |
Naive Bayes | 74% | n/a |
4.4.1(b) Random Forest Analysis
As stated above, the superior model by far was random forest, and we achieved this by tuning its two main parameters, maximum depth and number of estimators, varying one while keeping the other constant.
Figure 15. Maximum Depth
The maximum depth had a large influence on the performance of the model: accuracy grew quickly before starting to plateau at a depth of around 25, reaching 95% accuracy.
Figure 16. N-Estimators
The number of estimators proved not to be as influential as the maximum depth, though it still improved accuracy slightly, from 93% to 95%. This difference, while not minuscule, is not exactly noteworthy.
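The tuning sweeps behind Figures 15 and 16 amount to varying one parameter while holding the other fixed; a sketch of how such a sweep can be run (the exact grids we searched are not reproduced here):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Sweep maximum depth with the number of estimators held constant.
for depth in [5, 10, 15, 20, 25, 30]:
    rf = RandomForestClassifier(max_depth=depth, n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
    print(f"max_depth={depth}: {rf.score(X_test, y_test):.3f}")

# Sweep the number of estimators with the maximum depth held constant.
for n in [10, 50, 100, 200]:
    rf = RandomForestClassifier(max_depth=25, n_estimators=n, random_state=42)
    rf.fit(X_train, y_train)
    print(f"n_estimators={n}: {rf.score(X_test, y_test):.3f}")
```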
4.4.1(c) Confusion Matrices
Figure 17. Logistic Regression w/ Dim. Red.         Figure 18. Random Forest w/ Dim. Red.
The Random Forest model had high accuracy for predicting both bots and real tweets, while the logistic regression model had more success with predicting bots.
4.4.2 Unsupervised Models
4.4.2(a) Accuracy
To see how unsupervised learning interpreted our data, we used two types of clustering models, GMM and K-means. The GMM model achieved an accuracy of approximately 43% without dimension reduction and 49% with it, while the K-means model achieved approximately 51% accuracy regardless of whether dimension reduction was used.
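A sketch of the two clustering models is shown below. Since clusters carry no labels, we assume here that accuracy is computed by taking the better of the two possible cluster-to-label mappings; the exact scoring in our notebooks may differ.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def clustering_accuracy(y_true, cluster_ids):
    """Accuracy under the better of the two mappings from cluster ids to labels."""
    acc = np.mean(cluster_ids == y_true)
    return max(acc, 1 - acc)

kmeans_ids = KMeans(n_clusters=2, random_state=42).fit_predict(X)
gmm_ids = GaussianMixture(n_components=2, random_state=42).fit_predict(X)

print("KMeans:", clustering_accuracy(y, kmeans_ids))
print("GMM:", clustering_accuracy(y, gmm_ids))
```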
Figure 19. Unsupervised Classification Accuracies
Model | Percent Accuracy (without dimensionality reduction) | Percent Accuracy (with dimensionality reduction)
---|---|---
Kmeans | 51% | 51% |
GMM | 43% | 49% |
4.4.2(b) Confusion Matrices
Figure 20. KMeans w/ Dim. Red.         Figure 21. GMM w/ Dim. Red.
It can be seen that the K-means model performed quite poorly; we can infer that it struggled to separate the observations into distinct clusters, as it predicted nearly every tweet to be real. The GMM model performed admirably for an unsupervised algorithm, at almost 50% accuracy for both bots and real tweets. We suspect that soft clustering may be more conducive to this kind of classification than hard clustering.
5. Conclusion
We tried to follow best practices to the best of our ability throughout the entire process to ensure we produced results with a valid approach. With that in mind, we evaluated our approach based on both our models' accuracy and the plausibility of those accuracies. Low accuracies would have implied that we were not able to generate a valid prediction from the features we selected and engineered, while very high accuracies would have implied some error in our methodology where our models became too biased and gave us misleadingly accurate results (i.e. 100% predictor accuracy). For example, even though random forest performed very well for us in this project, there is no guarantee it would have this kind of success in a real-world scenario with millions of tweets. There is always some potential bias, and we recognize that we probably would not see this accuracy in practice; however, we are confident that random forest would work the best out of the models we tested, although with different model tuning and different features some other model might work better. It is also worth noting that we used a series of scoring methods and visualizations to further evaluate our models. The confusion matrices gave an indication of the proportion of true and false positive predictions our models were making.
Overall, this project was very interesting and we were satisfied with our results. With some of our models like the random forest classifier demonstrating a high accuracy in identifying bots, we felt we contributed to the latest efforts in building a safer internet with other data scientists.