Election 2019 - Exploratory Analysis of Party Manifestos using R and TidyText
- 11th December 2019
- Big Data & Advanced Analytics
- Paul Clough
This report describes exploratory text analysis on the manifestos produced for the 2019 Election (12th December 2019) from the 5 main political parties: Labour, Conservative, Liberal Democrats, SNP and the Green Party. The documents are publicly available for download in PDF from the party websites. In this exploratory analysis we profile and summarise the manifestos using text analysis and mining.
The analysis described uses R/RStudio together with the tidytext package. This study demonstrates the versatility and capabilities of this widely used open source library for text mining. The library is also accompanied by a publicly accessible O'Reilly textbook by Julia Silge and David Robinson called Text Mining with R, which provides further examples of analysing text.
This report is produced using R Markdown - a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
Setting up
After installing and loading the necessary packages into RStudio, next we import the data - PDF versions of the manifestos (stored in the “Data” directory). We use the functions provided by pdftools to read in the PDFs and extract words etc. The manifestos are loaded into a data frame:
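The loading step can be sketched as follows. This is a minimal sketch: the file names are illustrative assumptions (the actual PDF names in the "Data" directory may differ), and the read is guarded so the sketch runs even when the files are absent.

```r
library(pdftools)
library(dplyr)
library(purrr)

# Illustrative file names for the five manifesto PDFs in the "Data" directory.
manifesto_files <- c(
  Labour       = "Data/labour.pdf",
  Conservative = "Data/conservative.pdf",
  LibDem       = "Data/libdem.pdf",
  SNP          = "Data/snp.pdf",
  Green        = "Data/green.pdf"
)

# pdf_text() returns one character string per page; we keep one row per page
# alongside the party name, so all manifestos end up in a single data frame.
read_manifesto <- function(path, party) {
  pages <- pdf_text(path)
  tibble(party = party, page = seq_along(pages), text = pages)
}

# Guarded so the sketch also runs when the PDFs are not present.
if (all(file.exists(manifesto_files))) {
  manifestos <- imap_dfr(manifesto_files, ~ read_manifesto(.x, .y))
}
```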
Finally, before starting the analysis we use tidytext functions to extract words from the text and store them as a tibble (a modern data frame format) together with the line numbers of the words (note: I found a spurious 'l' in the Lib Dems text, so we remove this with the filter() function):
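The tokenisation step can be sketched like this on toy data (the text below is made up; in the real analysis it comes from the manifesto data frame):

```r
library(dplyr)
library(tidytext)

# Toy stand-in for one manifesto's text, one row per line of the document.
raw <- tibble(
  party = "LibDem",
  line  = 1:2,
  text  = c("We will build a fairer economy l",  # note the stray 'l'
            "and tackle the climate emergency")
)

tidy_words <- raw %>%
  unnest_tokens(word, text) %>%   # one lower-cased word per row
  filter(word != "l")             # drop the spurious 'l' found in the text

tidy_words
```

unnest_tokens() keeps the other columns (party, line), so each word remains tied to its line number.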
We start by listing the words across all documents, ordered by frequency. This highlights that there are common words (e.g. 'the') - referred to as stopwords - which should be ignored during the analysis (in the following example we remove them with anti_join(stop_words) from tidytext):
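The stopword removal looks like this on a toy word list (the words are illustrative):

```r
library(dplyr)
library(tidytext)

words <- tibble(word = c("the", "economy", "and", "climate", "we", "nhs"))

# stop_words is a data frame shipped with tidytext; anti_join() keeps only
# the rows whose word does NOT appear in the stopword list.
content_words <- words %>%
  anti_join(stop_words, by = "word")

content_words
```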
Let’s look at the number of lines in each of the documents:
Now let’s count the total number of words and unique words in the documents:
Analysis of words
We can learn a lot about documents by inspecting the words and phrases they contain. We already looked at frequently occurring words, but let's now look at the words used in each document. We assign weights to the words based on a measure called tf-idf. This assigns an importance score to each word within a document - giving a higher score to words that occur frequently within a document but are not common across documents.
The effect of this is to isolate words that reflect the content of each manifesto, which we can visualise as follows (just the top 15 words):
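The tf-idf weighting itself is a one-liner with tidytext's bind_tf_idf(); here is a minimal sketch on invented word counts:

```r
library(dplyr)
library(tidytext)

# Toy per-party word counts; the numbers are made up for illustration.
word_counts <- tibble(
  party = c("Labour", "Labour", "Green", "Green"),
  word  = c("nhs", "economy", "climate", "economy"),
  n     = c(10, 5, 8, 5)
)

# bind_tf_idf(term, document, count) adds tf, idf and tf_idf columns.
weighted <- word_counts %>%
  bind_tf_idf(word, party, n) %>%
  arrange(desc(tf_idf))

weighted
```

Note that "economy" appears in both documents, so its idf (and hence tf-idf) is zero, while the distinctive words "nhs" and "climate" rise to the top. For the plot, the real analysis takes the top 15 words per party (e.g. with slice_max()) and feeds them to ggplot2.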
We can already start to see some of the key themes for each document emerging.
Analysing the co-occurrence of words
A common step in analysing word occurrences in text is to consider words that commonly co-occur, such as phrases. The following focuses on pairs of words (bigrams) and then uses the tf-idf weighting again to try and identify commonly occurring phrases within each manifesto:
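The bigram extraction can be sketched like this on a toy sentence (the text is illustrative):

```r
library(dplyr)
library(tidyr)
library(tidytext)

raw <- tibble(
  party = "LibDem",
  text  = "the liberal democrats will build a brighter future"
)

# token = "ngrams", n = 2 tokenises into overlapping word pairs; we then
# split each pair so stopwords can be filtered from either position.
bigrams <- raw %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(party, word1, word2, sort = TRUE)

bigrams
```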
Doing this we can see that the top-ranked pairs of words include phrases such as "liberal democrats" and "real change", which we would expect as terms that are commonly used together. We can visualise this again by manifesto:
Let’s use some further visualisation methods to demonstrate word co-occurrences. A really interesting approach is to use the ggraph package that automatically creates a network plot. We will use this to show the words that commonly co-occur (we have to do a bit more coding this time):
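A sketch of the network plot, assuming toy bigram counts in place of the real ones; each word becomes a node and each bigram a directed, weighted edge:

```r
library(dplyr)
library(igraph)
library(ggraph)
library(grid)   # for arrow() and unit()

# Toy bigram counts; in the real analysis these come from the manifesto bigrams.
bigram_counts <- tibble(
  word1 = c("boris",   "public", "health"),
  word2 = c("johnson", "health", "services"),
  n     = c(12, 9, 7)
)

# graph_from_data_frame() treats the first two columns as edges (word1 -> word2)
# and carries n along as an edge attribute.
bigram_graph <- graph_from_data_frame(bigram_counts)

set.seed(2019)
ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n),
                 arrow = arrow(length = unit(3, "mm"))) +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)
```

The edge transparency maps to the pair frequency n, giving the "strength of connection" shading described below the plot.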
This is interesting as it shows us groups of words that commonly occur together, e.g. “Boris Johnson”. This is a directed graph and therefore the direction of the arrow shows the ordering of the words. The colour of the line shows the strength of the connection based on how often the word patterns occur. We also see chains of word occurrences, for example “public health services”.
A final thought would be to consider trigrams - these are groups of 3 consecutive words that occur together:
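Trigrams only need n = 3 in the same unnest_tokens() call; a minimal sketch on an invented sentence:

```r
library(dplyr)
library(tidytext)

raw <- tibble(party = "Labour",
              text  = "we will rebuild our public health services")

trigrams <- raw %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  count(party, trigram, sort = TRUE)

trigrams
```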
Sentiment analysis
A very common text analysis operation is to consider the sentiment (i.e. the feeling or emotion) expressed in text. This is a challenging task to carry out computationally due to the beauty of natural language (e.g. irony and sarcasm), but let's just use a simple dictionary-lookup approach, with a dictionary of words labelled as positive and negative, and count how often these occur in our manifestos:
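The dictionary lookup can be sketched with tidytext's bundled Bing lexicon and a toy word table (the words are illustrative):

```r
library(dplyr)
library(tidytext)

tidy_words <- tibble(
  party = c("Labour", "Labour", "Labour", "Green"),
  word  = c("good", "bad", "happy", "poverty")
)

# get_sentiments("bing") returns a word list labelled positive/negative;
# an inner_join keeps only the words found in the lexicon, which we then count.
sentiment_counts <- tidy_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(party, sentiment)

sentiment_counts
```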
We’d have to do a lot more work here to make the analysis more robust. For example, in our dictionary the word ‘conservative’ is considered negative, but that is not how the word is being used in the Conservative manifesto. Also, this simple approach does not deal with negation, e.g. “not poverty”.
Topic modelling
Another common approach to analysing text is to consider topics - more abstract groupings of words that commonly occur together and might form some kind of coherent theme (or topic). Topic modelling is hard as it is what we call an unsupervised method, similar to clustering. The challenge is that we don't know how many topics we are likely to expect, so we have to guess a number. But here goes . . . we will use a technique called LDA (Latent Dirichlet Allocation) to identify 10 topics in our manifestos and then visualise the words that are commonly associated with these topics. Read the relevant chapter in the Text Mining with R book if you want to know more:
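The LDA step can be sketched on a tiny invented corpus (two "documents" with distinct vocabularies, and k = 2 rather than 10, purely to keep the example small). It assumes the topicmodels package:

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Toy word counts for two documents with clearly different vocabularies.
word_counts <- tibble(
  doc  = rep(c("a", "b"), each = 3),
  word = c("energy", "climate", "green", "jobs", "economy", "wages"),
  n    = c(10, 8, 6, 9, 7, 5)
)

# Cast the tidy counts to a document-term matrix, then fit LDA with k topics.
dtm <- word_counts %>% cast_dtm(doc, word, n)
lda <- LDA(dtm, k = 2, control = list(seed = 1234))

# tidy() with matrix = "beta" gives the per-topic word probabilities,
# which is what the word-per-topic visualisation is built from.
topics <- tidy(lda, matrix = "beta")
topics
```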
So in this case there is quite a high probability that the word 'people' is assigned to topic 1. We can now use a visualisation to show the most probable words for each of the topics:
So you get the general idea: there are words that generally co-occur to form topics, and some of the topics probably reflect themes such as energy, jobs, etc. As well as deciding how many topics to choose, actually trying to interpret the topics is another really challenging part of topic modelling. We can also look to see which topics occur in which manifestos like this:
A good test for an unsupervised approach like topic modelling is to set the number of topics to match the number of documents (5 in our case). If we assume the documents are largely independent and about different topics, then running LDA should produce 5 topics, each one representing a manifesto. Let's see if this is the case:
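Checking which topic dominates each document uses the "gamma" (per-document topic proportion) matrix; a sketch on the same kind of toy corpus as above (with k matching the number of documents):

```r
library(dplyr)
library(tidytext)
library(topicmodels)

word_counts <- tibble(
  doc  = rep(c("a", "b"), each = 3),
  word = c("energy", "climate", "green", "jobs", "economy", "wages"),
  n    = c(10, 8, 6, 9, 7, 5)
)

dtm <- word_counts %>% cast_dtm(doc, word, n)
lda <- LDA(dtm, k = 2, control = list(seed = 1234))

# matrix = "gamma" gives per-document topic proportions; slice_max() keeps
# the single most probable topic for each document.
doc_topics <- tidy(lda, matrix = "gamma") %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE)

doc_topics
```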
The output is as we expect - one predominant topic per manifesto. We can now look at the words in each topic, which basically reflect which manifesto they align with, consistent with what we saw from analysing the words and bigrams before:
Clustering the manifestos
One final piece of analysis involves considering how 'similar' the documents are. This can be done by computing pairwise similarity between documents: the more words two documents share, the more similar they are considered. We can use cluster analysis for this, and the following computes a hierarchical clustering of the documents based on measuring the distance between document word vectors:
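A minimal sketch of the hierarchical clustering using base R, with an invented document-term matrix standing in for the real manifesto word vectors:

```r
# Toy document-term matrix: rows are manifestos, columns are word counts.
m <- rbind(
  Labour = c(10, 2, 1),
  Con    = c(8, 3, 1),
  Green  = c(1, 1, 12)
)
colnames(m) <- c("economy", "nhs", "climate")

# Euclidean distance between the word-count vectors, then hierarchical
# clustering; similar documents join low down in the resulting dendrogram.
d  <- dist(m)
hc <- hclust(d)
plot(hc)
```

In this toy matrix Labour and Con have the closest word vectors, so they merge first.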
This shows us that the Lib Dems manifesto is more similar to the Conservative manifesto than to the Green Party's, for example. It also shows us that the SNP manifesto is the most different from the rest. We can observe something similar using k-means clustering. In this case I set k=3 to identify 3 clusters:
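The k-means version is a one-liner in base R; a sketch on a toy document-term matrix (the counts are invented to make the grouping visible):

```r
set.seed(42)

m <- rbind(
  Labour = c(10, 2, 1),
  Con    = c(8, 3, 1),
  LibDem = c(9, 2, 2),
  Green  = c(1, 1, 12),
  SNP    = c(2, 9, 2)
)
colnames(m) <- c("economy", "nhs", "independence")

# nstart = 10 reruns the random initialisation to get a stable solution.
km <- kmeans(m, centers = 3, nstart = 10)

# List which documents fall in which cluster.
split(rownames(m), km$cluster)
```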
Again, we see that the Green Party and SNP documents are quite different from the documents produced by the three main parties.
To find out how you could utilise text mining and text analysis to benefit your business contact us.