
Tutorial: Analysing Tweets using Python

This tutorial is based on a term paper I wrote for the course "Data Analysis & Evaluation" as part of my Information Science studies at Humboldt University Berlin. Even though countless resources are already available online, I decided to publish this one, since similar examples and explanations have often helped me out in the past.

Table of Contents

  1. Introduction
  2. Preparation
  3. Preprocessing
  4. Text Processing
  5. Data Analysis
    1. Descriptive Analysis
    2. Relationships between variables
    3. Additional Analysis
  6. Conclusion


Introduction

In today's world, politics, science, and society are aware of the dangers surrounding disinformation and "fake news" and are working on solutions - a topic that plays an important role especially in the information sciences. This tutorial focuses on this current debate, motivated by the following tweet by Donald Trump from December 11, 2014:

The fact that Trump later repeatedly referred to the (primarily nonconservative) news media as the enemy of the American population is highlighted in the article "Defining the Enemy: How Donald Trump Frames the News Media" [Meeks 2019]. In general, news and media are a frequently addressed topic in Trump's tweets [Meeks 2019, p. 17] [Wang et al. 2016]. Therefore, an analysis might produce relevant and novel results. Our media system is currently undergoing a transformation from controlled top-down processes with journalistic gatekeepers to a more decentralized "hybrid media system" (according to A. Chadwick, "The Hybrid Media System: Politics and Power") in which digital and printed media coexist. Social media facilitate addressing target groups and disseminating information to a potentially unlimited audience, so that they also support the "development of communities of populist, ethnonationalist, and anti-establishment sentiment" [Wells et al. 2020].

Trump also uses so-called “framing” as a persuasion strategy: “the presence or absence of certain keywords, stock phrases, stereotyped images, sources of information, and sentences that provide thematically reinforcing clusters of facts or judgments” [Entman 1993, p. 52, quoted from Meeks (2019), p. 3]. Twitter's retweet function additionally supports framing:

“Twitter gives users an easy way to repeat Trump’s frames verbatim via the retweet function. […] Twitter’s ‘shareability’ enables Trump’s frames and influence to spread outward across peer networks, adding momentum to his framing.”

Meeks 2019, p. 7.

Therefore, the information producer is no longer solely responsible for his or her content; responsibility also lies with the platform that enables these diverse functionalities for networking and information diffusion between actors.


Preparation

This tutorial uses the programming language Python. The code snippets are extracted from a Jupyter Notebook which can be accessed via GitHub.

To analyze the Twitter data, I used a range of Python libraries, for example for formatting and processing data as well as for statistics and visualizations. These libraries first need to be installed with a package manager such as pip or Conda before they can be imported as follows:

#import all necessary modules

# general
import json
import pandas as pd

#numbers and time
import numpy as np
from collections import Counter
from datetime import datetime

# text
import re #regular expressions
import preprocessor as p # from the tweet-preprocessor library
from textblob import TextBlob # text processing library, here used to extract sentiment
from nltk import ngrams

from scipy import stats
import researchpy as rp

# visualizations
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
from plotly.colors import n_colors
import seaborn as sn

The dataset used here was retrieved from the Trump Twitter Archive. This collection consists of more than 50,000 tweets from Donald Trump's personal Twitter account @realDonaldTrump, and I would recommend reusing these data (for this specific use case). In contrast to the Twitter API (see Tweet Availability), the Trump Twitter Archive gives access to the majority of deleted tweets (a recommended blog post on this subject is Tweets and Deletes). The dataset was generated on August 4, 2020 by searching for the following keywords (exact word search):

  • news
  • media
  • press
  • fact
  • facts
  • information
  • cnn
  • nbc
  • abc
  • cbs
  • nytimes
  • newyorktimes
  • ny times
  • fox
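The archive performs this exact-word search itself, but an equivalent filter can be sketched in pandas using word-boundary regular expressions. The mini-dataset below is purely illustrative (not the real archive export):

```python
import re
import pandas as pd

# hypothetical mini-dataset standing in for the archive export
df = pd.DataFrame({"text": [
    "The Fake News Media is at it again!",
    "Great rally tonight in Ohio.",
    "Just spoke with FoxNews about the ratings.",
]})

# exact-word search: \b word boundaries ensure e.g. "press" does not match "pressure"
keywords = ["news", "media", "press", "fact", "facts", "information",
            "cnn", "nbc", "abc", "cbs", "nytimes", "fox"]
pattern = re.compile(r"\b(" + "|".join(keywords) + r")\b", re.IGNORECASE)

# keep only rows containing at least one keyword as a whole word
matches = df[df.text.str.contains(pattern)]
```

Note that "FoxNews" in the third row does not match the whole word "fox" - exactly the behavior of an exact-word search.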


Preprocessing

The next step is data preparation and cleaning. This also means adding new columns by extracting existing information from our dataset. First things first - we need to load our dataset (currently a JSON file) and convert it into a pandas DataFrame. This will make the later analysis much easier.

#read json file

with open("dataset_media.json", 'r',encoding="utf8") as f:
        datastore = json.load(f)

# convert json to dataframe

df = pd.json_normalize(datastore)
source text created_at retweet_count favorite_count is_retweet id_str
0 Twitter for iPhone RT @WhiteHouse: LIVE: President @realDonaldTru… Mon Aug 03 21:36:46 +0000 2020 4971 0 True 1290401249251270663
1 Twitter for iPhone My visits last week to Texas and Florida had m… Mon Aug 03 15:27:41 +0000 2020 24241 111167 False 1290308363872538624
2 Twitter for iPhone RT @realDonaldTrump: FAKE NEWS IS THE ENEMY OF… Mon Aug 03 13:53:10 +0000 2020 78121 0 True 1290284578578419712
3 Twitter for iPhone Wow! Really bad TV Ratings for Morning Joe (@J… Mon Aug 03 12:57:35 +0000 2020 15131 72174 False 1290270589945430016
4 Twitter for iPhone My visits last week to Texas and Frorida had m… Mon Aug 03 12:46:02 +0000 2020 12976 59426 False 1290267685117460481
4238 Twitter Web Client Donald Trump appearing today on CNN Internatio… Wed Feb 10 15:17:56 +0000 2010 7 1 False 8905123688
4239 Twitter Web Client Celebrity Apprentice returns to NBC, Sunday, 3… Tue Jan 12 18:05:08 +0000 2010 20 3 False 7677152231
4240 Twitter Web Client Reminder: The Miss Universe competition will b… Sun Aug 23 21:12:37 +0000 2009 1 4 False 3498743628
4241 Twitter Web Client Watch the Miss Universe competition LIVE from … Fri Aug 21 14:32:45 +0000 2009 1 3 False 3450626731
4242 Twitter Web Client Read a great interview with Donald Trump that … Wed May 20 22:29:47 +0000 2009 4 3 False 1864367186

Additionally, I filtered out retweets because they are not relevant for this analysis (Trump's own statements should be the focus).
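The filtering step itself is not shown in the notebook excerpt; assuming the is_retweet column is boolean (as in the JSON export), it can be sketched like this (toy values for illustration):

```python
import pandas as pd

# toy frame mirroring the relevant archive columns (values are illustrative)
df = pd.DataFrame({
    "text": ["RT @WhiteHouse: LIVE ...", "My visits last week ..."],
    "is_retweet": [True, False],
})

# keep only Trump's own statements and reset the index
df_media = df[df.is_retweet == False].reset_index(drop=True)
```

If is_retweet is stored as the string "true"/"false" instead, compare against the string accordingly.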

A final pre-processing step concerns the time format. The datetime values were transformed into UTC format and individual years were extracted in a similar manner.

# transform col "created_at" into recognizable UTC time values as dates/date type

# define function, using datetime module
def change_dates(dataframe, index):
    for tweet in dataframe:
        dataframe.iloc[index] = datetime.strftime(datetime.strptime(dataframe.iloc[index],'%a %b %d %H:%M:%S +0000 %Y'), '%Y-%m-%d %H:%M:%S')
        index += 1

# call function for col "created_at"
change_dates(df_media["created_at"], 0)

Text Processing

By using regular expressions, hashtags and mentions can be identified and saved in separate columns for analysis and evaluation purposes.

# search for hashtags and mentions, put them in own columns

hashtag_re = re.compile(r"#(\w+)")
df_media['hashtags'] = np.where(df_media.text.str.contains(hashtag_re), df_media.text.str.findall(hashtag_re), "")

mention_re = re.compile(r"@(\w+)")
df_media['mentions'] = np.where(df_media.text.str.contains(mention_re), df_media.text.str.findall(mention_re), "")

Next, a simplified sentiment analysis is performed on all tweets. For this, the tweets need to be cleaned of noise such as hashtags, mentions, and URLs using the preprocessor module so that these elements cannot influence the results.

df_media['cleaned_text'] = df_media["text"] #copy col text to cleaned_text

p.set_options(p.OPT.HASHTAG, p.OPT.URL, p.OPT.EMOJI, p.OPT.MENTION) #set options for cleaning: clean text from hashtags, mentions, urls and emojis

# define function clean_text() which will append the predefined options to each row
def clean_text(dataframe, index):
    for row in dataframe:
        dataframe.iloc[index] = p.clean(dataframe.iloc[index])
        index += 1

# call clean_text() for col "cleaned_text"
clean_text(df_media["cleaned_text"], 0)

The following example uses the library TextBlob for sentiment analysis.

#create df containing polarity values for each tweet in the dataset

tweet_text = [] # create an empty list

for index, col in df_media.iterrows(): 
    tweet_text.append(df_media.cleaned_text[index]) # append each tweet to it

sentiment_objects = [TextBlob(tweet) for tweet in tweet_text] # apply textblob for each tweet in tweet_text

sentiment_values = [[tweet.sentiment.polarity, str(tweet)] for tweet in sentiment_objects] # create a new list of polarity values and tweet text

sentiment_df = pd.DataFrame(sentiment_values, columns=["polarity", "tweet"]) # transform list into dataframe and sort the values accordingly
sentiment_df.sort_values("polarity", ascending=True)

We receive a list of all tweets, sorted by their polarity values:

polarity tweet
1756 -1.0 find the leakers within the FBI itself. Classi…
2119 -1.0 FMR PRES of Mexico, Vicente Fox horribly used …
2620 -1.0 If you look at the horrible picture on the fro…
2908 -1.0 The media is pathetic. Our embassies are savag…
1316 -1.0 Some people HATE the fact that I got along wel…
472 1.0 Triggered, a great book by my son, Don. Now nu…
972 1.0 Great news!
969 1.0 Finally great news at the Border!
1501 1.0 Great news, as a result of our TAX CUTS & …
696 1.0 Jesse & Emma, Great News. Congratulations!…

To compare a tweet's polarity with other attributes in our dataframe, an additional column must be added.

df_media = df_media.reset_index(drop=True)
df_media['sentiment_values'] = pd.Series(sentiment_df['polarity'])

Alternatively, we can simply convert all polarity values into descriptions (we still need to choose fitting thresholds; in this example I used: negative ≤ -0.33 < neutral ≤ 0.33 < positive).

df_media["sentiment"] = "" # new col

# define function to divide sentiment_values into sentiments ("negative", "neutral", "positive")
def get_sentiment(dataframe, index):
    for tweet in dataframe:
        if df_media.sentiment_values.iloc[index] <= -0.33:
            df_media.sentiment.iloc[index] = "negative"
        elif (df_media.sentiment_values.iloc[index] > -0.33) & (df_media.sentiment_values.iloc[index] <= 0.33):
            df_media.sentiment.iloc[index] = "neutral"
        else:
            df_media.sentiment.iloc[index] = "positive"
        index += 1

# apply get_sentiment()
get_sentiment(df_media["sentiment_values"], 0)

Data Analysis

Descriptive Analysis

By using the pandas.DataFrame.describe() function, we can generate initial key figures: minimum, maximum, mean, standard deviation, and quantiles. To visualize the distributions, we can use a histogram or box plot.
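As a minimal sketch (with made-up values, not the real dataset), describe() returns these key figures for the selected columns:

```python
import pandas as pd

# illustrative values only, not the real dataset
df_media = pd.DataFrame({
    "retweet_count": [4971, 24241, 15131, 12976, 7],
    "favorite_count": [0, 111167, 72174, 59426, 1],
})

# count, mean, std, min, quartiles and max per column
summary = df_media[["retweet_count", "favorite_count"]].describe()
print(summary)
```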

The histograms shown above already indicate a right-skewed distribution for retweets and likes in this dataset. For the polarity values calculated afterwards, we can observe an approximately normal distribution (although a majority of the values are close to 0, which might be due to the dictionary-based approach). The following example shows how such a histogram can be created using matplotlib.pyplot.hist().

# plot data
fig, ax = plt.subplots()
ax.hist(df_media.retweet_count, color = "blue", alpha=0.5, bins=20)
ax.hist(df_media.favorite_count, color = "aquamarine", alpha=0.5, bins=60)

# add labels and title
ax.set_xlabel('Number of retweets/likes')
ax.set_title('Distributions of retweet_count/favorite_count')

# show plot
plt.show()

Box plots are another method to visualize the distribution of numerical values. The figure shows examples for the same three variables, created with plotly.graph_objects.Box.

# boxplot favorite_count
data = df_media.favorite_count

# plot
fig = go.Figure()
fig.add_trace(go.Box(y=data, name="favorite_count", marker_color="lightseagreen", boxpoints="all", jitter=0.6,
    pointpos=-2, marker=dict(opacity=0.5), marker_size=5, width=40))

# update layout
fig.update_layout(width=400, height=800, font=dict(color="black"), plot_bgcolor="white", xaxis = dict(gridcolor = "#ededed", gridwidth=3), yaxis=dict(gridcolor = "#ededed", gridwidth=3))

# show plot
fig.show()

The visualizations presented so far are applicable to metric variables. When dealing with nominal or ordinal data, one may need to use alternative, simpler methods.

This sunburst chart contains all hashtags from our dataset, which were grouped into more general, heuristically built categories ("media", "media criticism", "political", "show", "other", "election campaign"). In addition to the expected hashtags for media ("media", "media criticism", "show"), many hashtags are included that were assigned to the category "election campaign" (e.g. #maga, #trump2016).

# plot
fig = px.sunburst(hashtags_df, path=["category", "hashtag"], values='frequency', color="frequency", color_continuous_scale=[[0,"white"],[1.0, "blue"]])

# update layout
fig.update_layout(width=900, height=900, uniformtext=dict(minsize=13, mode='hide'),coloraxis_showscale=False) # labels of hashtags with frequency=1 are not shown (by "minsize=13, mode='hide'")

# show plot
fig.show()

The bar chart represents the 10 most frequently mentioned accounts (mentions). It mainly consists of accounts of those news channels that were already used as terms for the generation of the dataset in the first place (@cnn, @nytimes, @foxnews, @nbc, @foxandfriends, @abc, @nbcnews and also @washingtonpost).

First, all of the mentions are transferred into a list, then the individual frequencies of these mentions are counted and afterwards the results are saved in a new dataframe and sorted by frequency (this step is similar for the hashtag example above).

mentions = [] # create empty list

# iterate through all mentions and append them to the list
for index, col in df_media.iterrows():
    mentions.append(df_media.mentions[index])

# change nested list to flat list

flat_list = []
for sublist in mentions:
    for item in sublist:
        flat_list.append(item.lower())

# create dict for mentions by counting occurences of each item, then transform mentions and frequencies into a dataframe (mentions_df)
mentions_dict = Counter(flat_list)
mentions_df = pd.DataFrame(list(mentions_dict.items()),columns = ["mention", "frequency"]) 

# sort mentions in descending order 
mentions_df = mentions_df.sort_values(by="frequency", ascending=False).reset_index(drop=True)

Subsequently, the first ten mentions can be visualized accordingly:

data = mentions_df.iloc[:10] # top 10 mentions

# plot
fig = px.bar(data, x="mention", y="frequency")
fig.update_traces(marker_color='lightseagreen', opacity=.6, texttemplate='%{y}', textposition='outside')

# update layout
fig.update_layout(width=800, height=500, font=dict(size=12, color="black"), plot_bgcolor="white", xaxis = dict(gridcolor = "#ededed", gridwidth=1), yaxis=dict(gridcolor = "#ededed", gridwidth=1))

# show plot
fig.show()

Relationships between variables

Scatter plots can be employed to show relationships between variables. For these visualizations, I used the library Plotly once again. Calculations of statistical values were done with the Python library researchpy. Researchpy returns various statistical summaries and is based on the statistical functions of scipy.stats.

The figure above shows the attributes source (device/application) and created_at (publication date/time of the tweet) in a scatter plot. The colours represent the polarity values of each tweet. We can assume that the three most frequently used channels (Twitter for Android, Twitter Web Client and Twitter for iPhone) differ mainly in their temporal usage. This temporal subdivision according to the "tweet sources" used for sending a tweet corresponds to the observations of [Clarke/Grieve 2019, p. 5f.] and their more extensive data.

df = df_media

# plot
fig = px.scatter(df, x="created_at", y="source",color="sentiment_values", color_continuous_scale=["orangered","greenyellow","blue"], opacity=.5)

# update layout
fig.update_layout(font=dict(size=12, color="black"), plot_bgcolor="white", xaxis = dict(gridcolor = "#ededed", gridwidth=2), yaxis=dict(gridcolor = "#ededed", gridwidth=2), width=1200)

# show plot
fig.show()

Since source is a nominal variable, the Chi-squared test is applied here. It should be ensured that the expected frequencies are above 5 for each cell; in any case, "the proportion of expected frequencies that are less than 5 should not exceed 20%" [Bortz 2005, p. 177, freely translated]. In this example, the dataset was reduced by rarely occurring values (all "sources" except Twitter Web Client, Twitter for Android and Twitter for iPhone; the year 2009) to fulfill this requirement.

dataset Pearson Chi-squared p-value Cramers V
original (df=143.0) 4984.11 0.00 0.38
adjusted (df=20.0) 3236.34 0.00 0.73

For the adjusted data (df=20) a strong effect (V = 0.73) can be observed. This suggests at least a certain association between the most frequently used devices and the time intervals (years) in the dataset. The Chi-squared test was performed using researchpy's crosstab() function.

crosstab, res, expected = rp.crosstab(df_media.source, df_media.year,prop="cell",test="chi-square",correction=True, cramer_correction=True, expected_freqs=True)

These scatter plots show the distributions of retweets and likes over time. While at first glance they appear very similar, the values for favorite_count span a larger range. Most striking is the strong increase in both retweets and likes for tweets from 2016 onwards, at the time of Trump's presidential candidacy. The following table summarizes the correlation values, calculated with parametric (Pearson) and non-parametric (Spearman, Kendall) methods, for the variable pairs retweet_count x year, favorite_count x year, and favorite_count x retweet_count.

variables method r-value p-value
retweet_count x year Pearson 0.69 0.00
Spearman 0.75 0.00
Kendall 0.59 0.00
favorite_count x year Pearson 0.71 0.00
Spearman 0.78 0.00
Kendall 0.64 0.00
favorite_count x retweet_count Pearson 0.96 0.00
Spearman 0.98 0.00
Kendall 0.88 0.00

The correlation coefficients can be calculated with researchpy.corr_pair():

rp.corr_pair(df_media[["retweet_count","favorite_count"]], method="kendall")

The values for the correlation of retweet_count x year and favorite_count x year recorded in the table imply a strong and positive linear correlation. However, the extremely high correlation values between retweet_count and favorite_count could be an indication of a spurious relationship:

"A relationship between two variables that is caused by a statistical artifact or a factor, not included in the model, that is related to both variables."

Downey 2014, p. 143

This means the correlation might not imply causation. Probably a confounding variable ("confounder"), such as the number of followers increasing over time, is responsible for the similarly high numbers of retweets and likes [see Wang et al. 2016, p. 721].
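This confounding effect can be illustrated with a small simulation (synthetic data, not the real dataset): two noisy series that both depend only on a growing follower count correlate strongly with each other, but the correlation largely disappears once the linear follower effect is removed from both.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# simulated confounder: follower count growing over time
followers = np.linspace(1e5, 8e7, n)

# retweets and likes are each driven by followers, plus independent noise
retweets = 0.001 * followers + rng.normal(0, 5000, n)
likes = 0.005 * followers + rng.normal(0, 20000, n)

# raw correlation is high, purely because of the shared confounder
raw_r = np.corrcoef(retweets, likes)[0, 1]

# regress out the linear follower effect and correlate the residuals
res_rt = retweets - np.polyval(np.polyfit(followers, retweets, 1), followers)
res_fav = likes - np.polyval(np.polyfit(followers, likes, 1), followers)
partial_r = np.corrcoef(res_rt, res_fav)[0, 1]

print(round(raw_r, 2), round(partial_r, 2))
```

The raw correlation comes out close to 1, while the residual correlation is close to 0 - the pattern one would expect if follower growth is the common driver.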

Additional Analysis

The selection of examined relations between variables presented here allows the assumption that there is a certain correlation between different variables and the temporal component (e.g. through increasing numbers of followers or strategic adjustments over time). The temporal dependency of linguistic change in Trump's tweets has already been discussed in detail [Clarke/Grieve 2019]. Further analysis could focus on determining how the content Trump publishes on Twitter varies and how the Twitter community reacts to it. One research question could therefore be:

What are the most frequent statements (measured in n-grams) in the dataset and how are they distributed over time?

To address this question, all tweets first need to be segmented into so-called n-grams, here 5-grams ("pentagrams"). For this step I used the ngrams module of NLTK (Natural Language Toolkit).

# copy df for this step, it gets messy
# df_media should stay as it is (deep=True)
df_ngrams = df_media.copy(deep=True)

#create a new column "ngrams"
df_ngrams["ngrams"] = ""

# define function to iterate through df and get all 5-grams within "cleaned_text"
def get_ngrams(dataframe, index):
    for tweet in dataframe:
        n = 5
        fivegrams = list(ngrams(df_ngrams.cleaned_text.iloc[index].split(), n))
        dataframe.iloc[index] = fivegrams
        index += 1

# call get_ngrams()
get_ngrams(df_ngrams["ngrams"], 0)

In order to actually evaluate the n-grams, the nested lists need to be dissolved. However, splitting them with pandas.DataFrame.explode() does not solve the problem, because it splits all contained lists until nothing but separate words is left over. Dividing a list only at its top level can be achieved by implementing the tidy_split() function (see on Github):

# call function tidy_split() and separate all ngrams per tweet from each other
df_grams = tidy_split(df_ngrams,"ngrams",sep="), (")

The following table presents the 10 most frequent pentagrams.

Pentagram Count
'the', 'failing', 'new', 'york', 'times' 30
'the', 'enemy', 'of', 'the', 'people!' 25
'the', 'fake', 'news', 'media', 'is' 19
'be', 'on', 'fox', '&', 'friends' 15
'will', 'be', 'on', 'fox', '&' 15
'i', 'will', 'be', 'having', 'a' 13
'be', 'doing', 'fox', '&', 'friends' 12
'is', 'the', 'enemy', 'of', 'the' 12
'the', 'history', 'of', 'our', 'country.' 11

Reflecting upon these most frequent text fragments within the context of their date of publication, one aspect becomes even clearer now: at least after Trump's inauguration, the language and content of his tweets seem to change as well. The results of Clarke/Grieve (2019) coincide with this statement:

"All four dimensions showed clear temporal patterns and most major shifts in style align to a small number of indisputably important points in the Trump timeline, especially the 2011 Birther controversy, the 2012 election, his 2015 declaration, his 2016 Republican nomination, the 2016 election, and his 2017 inauguration, as well as the seasons of his television series The Apprentice."

Clarke/Grieve 2019, p. 19

It should be noted that the figure covers only a very small fraction of all n-grams (a total of 75,756 n-grams were extracted from the dataset). Nevertheless, it can be assumed that earlier tweets are more likely to deal with media in the sense of entertainment television (here: Fox & Friends), while the number of distancing and incriminating statements towards the media (keywords: "enemy", "fake news media", "failing") clearly increases, especially from 2017 onwards. For example, Trump antagonizes "the fake news" or the "fake news media" (he often calls them "the enemy of the people"):

Of course, these observations alone are not proof enough and further analysis would have to be carried out in order to consolidate corresponding assumptions.


Conclusion

In conclusion, I would like to summarize some key findings.

1. Trump pursues a clear communication strategy on Twitter. Indicators can be the temporal dependency of various aspects - for instance, the devices/applications used to publish tweets as well as the diverging content of the tweets. In general, a distinct increase in tweets can be observed from 2016 onwards. Important events are likely to have an impact on the content and style of the tweets and could serve as the subject of later studies. [Clarke/Grieve 2019]

2. Trump's media criticism on Twitter is of a strongly political nature and is used for election campaign purposes. Hashtags and terms extracted from the tweets contain critical and/or attacking statements against the media and news as well as slogans such as #makeamericagreatagain / #maga and #trump2016, later #kag ("Keep America Great"). By repeating certain phrases such as "'fake news', 'failing' and 'enemy of the American People'" [Meeks 2019, p. 5], he frames distinct media channels and even the entire media system. Thereby, Trump receives a lot of feedback on social media as well as in more traditional media.

3. By exploiting the hybrid media system, Trump aims for broad media coverage. The sharing of information on social media such as Twitter ensures that users can give their approval to specific information or opinions. Information can be passed on in ideologically oriented communities, but also beyond them - which can then attract the attention of traditional news media. [Wells et al. 2020, p. 664f.]

"Furthermore, Trump tweeted more at times when he had recently garnered less of a relative advantage in news attention, suggesting he strategically used Twitter to trigger coverage."
Wells et al. 2020, p. 559

Accordingly, Trump does not only attack political opponents with his tweets, but also the media [Clarke/Grieve 2019, p. 20] [Wang et al. 2016, p. 719] [Wells et al. 2020, p. 661], using frames to reinforce an already existing mistrust among the population [Meeks 2019, p. 5]. Ultimately, this media image created by Trump and disseminated via Twitter could also have an impact on the public perception of media worldwide [Meeks 2019, p. 19].


References

Bortz, Jürgen (2005): Statistik für Human- und Sozialwissenschaftler, 6th ed., Berlin/Heidelberg: Springer.

Clarke, Isobelle & Grieve, Jack (2019): Stylistic variation on the Donald Trump Twitter account: A linguistic analysis of tweets posted between 2009 and 2018, in: PLoS One 14(9), pp. 1-27.

Downey, Allen B. (2014): Think Stats. Exploratory Data Analysis, 2nd ed., O'Reilly Media, Inc.

Entman, Robert M. (1993): Framing: Toward clarification of a fractured paradigm, in: Journal of Communication 43(4), pp. 51-58.

Meeks, Lindsey (2019): Defining the Enemy: How Donald Trump Frames the News Media, in: Journalism & Mass Communication Quarterly 97(1), pp. 211-234.

Wang, Yu; Luo, Jiebo; Niemi, Richard; Li, Yuncheng & Hu, Tianran (2016): Catching Fire via "Likes": Inferring Topic Preferences of Trump Followers on Twitter, in: Proceedings of the Tenth International AAAI Conference on Web and Social Media (ICWSM 2016), pp. 719-722.

Wells, Chris; Shah, Dhavan; Lukito, Josephine; Pelled, Ayellet; Pevehouse, Jon CW & Yang, Jung Hwan (2020): Trump, Twitter, and news media responsiveness: A media system approach, in: new media & society 22(4), pp. 659-682.
