Using R to create a Word Cloud from a PDF Document

## Introduction

Word clouds are visually appealing representations of the most important keywords in a document. They are not only pleasing to the eye but also give a rapid overview of the major topics discussed in a document. Such word clouds are usually generated by computer programs, since the manual classification of words according to their relevance is a tedious task that can be automated with suitable algorithms. In a typical word cloud, the most important words are written in larger characters than the less important ones, and the words are often arranged radially, with the most important ones in the center. Color coding can further help to highlight the relative importance of the various keywords.

Generating a word cloud is a common and relatively simple task in a domain that has become known as text mining. In analogy to data mining, which usually refers to the extraction of information and the statistical evaluation of numerical data, the term text mining is typically employed to describe numerical methods to obtain information from words and text in general, ranging from the counting of words to the semantic interpretation of a text. Text mining can thus be considered as a subcategory of data mining that has evolved into a highly specialized branch which treats text-related tasks. In many cases these tasks are much more complex than generating a word cloud.

In this blog post I will discuss how a word cloud can be generated using the R programming language on the basis of a given PDF document. As an example, I have chosen the PDF file of a recent article that I have co-authored on the properties of artificial spin ice systems. The code can be easily adapted to treat any PDF document. In this case R mainly provides the fundamental framework, while the lion’s share of the work is done by the text-mining package tm for the analysis of the text and by the wordcloud package, which will generate an image as a result of this analysis. In addition, we will use some fancy colors in our word cloud by employing the RColorBrewer package.

The code discussed below uses mainly two packages, tm for text mining and wordcloud for the graphics. With the following commands we can load these libraries. If any of these packages or dependencies is not yet installed on the computer, the code will download and install the missing packages. This is achieved here by using the function tryCatch().

setwd("~/Desktop/Website/blog/")
needed_libs <- c("tm", "wordcloud")
# Install a missing package from CRAN, then load it
install_missing <- function(lib) {
  install.packages(lib, repos = "https://cran.r-project.org/", dependencies = TRUE)
  library(lib, character.only = TRUE)
}
for (lib in needed_libs) tryCatch(library(lib, character.only = TRUE), error = function(e) install_missing(lib))
## Loading required package: NLP
## Loading required package: RColorBrewer

Although we will also use features provided by the RColorBrewer package, we don’t need to explicitly load this library because it is loaded (and if necessary installed) automatically with the wordcloud package. The same holds for the NLP package, which is required by the tm package.

The first line of the code defines the working directory. Unless specified otherwise, this is where the code will begin to search for files if necessary. In this particular case, this is the folder (directory) where we will store the PDF file to be analyzed. If you want to reproduce this code, the argument of setwd() should be modified depending on your system and file structure. After choosing one folder in your system as your working directory, you can copy the full path to that folder into the argument of setwd("<path-to-my-working-directory>") (don’t forget to put the path in quotation marks).
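
As a quick sanity check of this setup, the following sketch reports which of the packages are available and whether the PDF file can be found in the current working directory. It uses only base R; the variable names are my own choice, and the file name is the one used later in this post.

```r
# Packages used in this post. requireNamespace() only checks availability:
# it does not attach the package and does not install anything.
needed_libs <- c("tm", "wordcloud")
is_available <- vapply(needed_libs, requireNamespace, logical(1), quietly = TRUE)
print(is_available)

# Show the current working directory and check whether the PDF file is there.
pdf_name <- "PhysRevB.92.060413.pdf"  # file name used in this post
cat("Working directory:", getwd(), "\n")
cat("PDF found:", file.exists(pdf_name), "\n")
```

If the last line reports FALSE, either the working directory or the location of the file needs to be adjusted before continuing.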

The arrangement of the words in the cloud will follow a certain order, as described in the introduction, but there will still be a random component concerning details of the position and the orientation of the words. To obtain a reproducible result we use the function set.seed() which ensures that, for a given arbitrary number that is passed as an argument, the resulting word cloud is always the same:

set.seed(8)
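
The effect of the seed can be illustrated with a small sketch that is independent of the word cloud: drawing random numbers twice after setting the same seed gives identical results. The variable names are arbitrary.

```r
# Drawing random numbers twice with the same seed gives identical results.
set.seed(8)
first_draw <- runif(3)
set.seed(8)
second_draw <- runif(3)
print(identical(first_draw, second_draw))  # TRUE

# Without resetting the seed, the next draw continues the random sequence
# and therefore yields different numbers.
third_draw <- runif(3)
print(third_draw)
```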

With this we have accomplished the prerequisites of the code as far as R is concerned, and we can move on to the analysis of our PDF file.

## Reading the text in the PDF file

One powerful aspect of R is its cross-platform compatibility, a feature that is maintained in most packages, which usually operate identically on all common operating systems. When we convert the content of a PDF file into text, this seamless portability of the code from one system to another may not hold, because the extraction of the text relies on external programs that may or may not be installed on your system.

Often there is no problem at all, and the PDF file can be converted into text by simply using the readPDF() function provided by the tm package. However, the same command may lead to an error on other operating systems. Since the resulting error message can be quite obscure, it is useful to discuss how such a situation can be handled.

First we should place a copy of the PDF file that we want to examine in the working directory defined above. In our example the name of the file is “PhysRevB.92.060413.pdf”. If you want to reproduce the example, you can download the file following this link, but you need to have access to the journal site. This access is often granted if you are a member of a university or of a research facility which has a physics department. In any case, it is not important which PDF file you use, provided that it contains a sufficient amount of meaningful text. The first step in the analysis is, quite obviously, to extract the text from the PDF file. This can be done with:

my_pdf <- readPDF(control=list(text="-layout"))(elem=list(uri="PhysRevB.92.060413.pdf"), language="en")

If you obtain an error message, it probably means that the system could not find the default PDF extraction engine. You can learn more about this by reading the help page of this command with ?tm::readPDF, where you may also find ways to install the required engine. On a Windows system you might see a frightening window popping up stating that jpeg8.dll is missing on your computer and that you should try to re-install it. The recommendations in this error message can safely be ignored; the important information is that there is an error message, which means that the default PDF extraction engine could not be found. If you run into such an error, one way to resolve the problem is to download the precompiled binaries from http://www.foolabs.com/xpdf/download.html that match your operating system. You may either carefully install these programs as described on that page or, as a quick fix, simply copy the necessary executable files into the working directory. The required programs are pdftotext and pdfinfo.

On my system, which runs Ubuntu 14.04.3 LTS, the programs pdfinfo version 0.24.5 and pdftotext version 0.24.5 are installed by default, and the code does not throw any error message.
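
Whether these two programs can be found on your system may be checked from within R before calling readPDF(). The following is a minimal sketch using the base R function Sys.which(); note that it only reports programs on the search path, so executables that were merely copied into the working directory might not show up here on every operating system.

```r
# External programs mentioned above.
tools_needed <- c("pdftotext", "pdfinfo")

# Sys.which() returns the full path of each program, or "" if it is not found.
tool_paths <- Sys.which(tools_needed)
print(tool_paths)

missing_tools <- names(tool_paths)[!nzchar(tool_paths)]
if (length(missing_tools) > 0) {
  message("Not found on the search path: ", paste(missing_tools, collapse = ", "))
} else {
  message("Both programs were found.")
}
```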

## Creating the text corpus

Once the command readPDF() has been successfully executed, we have a lot of information stored in the variable my_pdf. We are only interested in the content of the file, and we can exclude certain parts of the text that we consider irrelevant:

text_raw <- my_pdf$content
text_raw <- text_raw[-c(1:5)]   # remove journal header
text_raw <- text_raw[-c(2:17)]  # remove author names, affiliations
text_raw <- text_raw[-11]       # remove bibliographic reference details
text_raw <- text_raw[1:211]     # remove list of references

The line numbers that have been used in the code above to remove certain parts of the text were chosen manually, by repeatedly viewing and changing the content of text_raw. In my opinion there is no point in attempting to find a programmatic approach for this, since the parts of the text that you may want to remove will depend on the specific document and on your preferences.
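
To illustrate this manual procedure, here is a small sketch with an invented character vector in place of the real text_raw. It prints the lines together with their index and shows that the indices shift after each removal, which is why the content has to be inspected again after every step.

```r
# Invented stand-in for text_raw; the real content comes from the PDF file.
toy_text <- c("JOURNAL HEADER",
              "Title of the article",
              "Author A, Author B",
              "First paragraph.",
              "Second paragraph.",
              "[1] A reference")

# Print every line with its index to decide what to remove.
cat(sprintf("%2d: %s", seq_along(toy_text), toy_text), sep = "\n")

step1 <- toy_text[-1]   # drop the header; the title is now element 1
step2 <- step1[-2]      # drop the author line, which moved from index 3 to 2
body  <- step2[1:3]     # keep title and paragraphs, dropping the reference
print(body)
```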

At this point, we have filtered from the PDF document the raw text that we consider to be relevant. Now it is time for the text mining package tm to further shape the text:

text_corpus <- Corpus(VectorSource(text_raw))
corpus_clean <- tm_map(text_corpus, stripWhitespace)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, content_transformer(tolower))

The tm package operates using a variable of the class “Corpus”, and it uses the command tm_map() to modify the content of this corpus. A detailed description of these aspects can be found in the manual of the tm package. In the first line of the code displayed above, we have created the corpus based on the raw text we had extracted before. The subsequent commands are rather self-explanatory: first we collapse runs of white space into single blanks, then we remove all numbers, and finally we transform the text to lower case.
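
The effect of these three steps can be tried out on a single invented line of text. The sketch below does not use tm at all; the regular expressions are base R stand-ins that, as far as I can tell, mimic what stripWhitespace, removeNumbers and tolower do to a character string, so please treat them as an approximation.

```r
raw_line <- "The   60  Vertices   form an  Artificial Spin Ice"

# Collapse runs of white space into a single blank (cf. stripWhitespace).
step_ws <- gsub("[[:space:]]+", " ", raw_line)

# Delete all digits (cf. removeNumbers).
step_num <- gsub("[[:digit:]]+", "", step_ws)

# Convert to lower case (cf. content_transformer(tolower)).
step_low <- tolower(step_num)

print(step_ws)
print(step_num)
print(step_low)
```

Note that deleting the number leaves two adjacent blanks behind. This is harmless for the word cloud, since only the words themselves are counted.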

In the next step we remove the so-called stop words. Those are frequently occurring words that usually don’t provide essential information. The tm package has a predefined list of stop words, which in the English language are the following:

print(stopwords("en"))
##   [1] "i"          "me"         "my"         "myself"     "we"
##   [6] "our"        "ours"       "ourselves"  "you"        "your"
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"
##  [16] "his"        "himself"    "she"        "her"        "hers"
##  [21] "herself"    "it"         "its"        "itself"     "they"
##  [26] "them"       "their"      "theirs"     "themselves" "what"
##  [31] "which"      "who"        "whom"       "this"       "that"
##  [36] "these"      "those"      "am"         "is"         "are"
##  [41] "was"        "were"       "be"         "been"       "being"
##  [46] "have"       "has"        "had"        "having"     "do"
##  [51] "does"       "did"        "doing"      "would"      "should"
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"
## [101] "who's"      "what's"     "here's"     "there's"    "when's"
## [106] "where's"    "why's"      "how's"      "a"          "an"
## [111] "the"        "and"        "but"        "if"         "or"
## [116] "because"    "as"         "until"      "while"      "of"
## [121] "at"         "by"         "for"        "with"       "about"
## [126] "against"    "between"    "into"       "through"    "during"
## [131] "before"     "after"      "above"      "below"      "to"
## [136] "from"       "up"         "down"       "in"         "out"
## [141] "on"         "off"        "over"       "under"      "again"
## [146] "further"    "then"       "once"       "here"       "there"
## [151] "when"       "where"      "why"        "how"        "all"
## [156] "any"        "both"       "each"       "few"        "more"
## [161] "most"       "other"      "some"       "such"       "no"
## [166] "nor"        "not"        "only"       "own"        "same"
## [171] "so"         "than"       "too"        "very"

We can easily remove these words from the corpus, and we can furthermore remove a few words that are specific to our file:

corpus_clean <- tm_map(corpus_clean, removeWords, stopwords("english"))
my_stopwords <- c("e-ii", "can", "due", "will",                      # additional user-defined stop words
                  "fig", "figs", "figure", "online",                 # stop words related to figure captions
                  "rapid", "physical", "review", "communications")   # stop words related to the journal
corpus_clean <- tm_map(corpus_clean, removeWords, my_stopwords)

As a last modification, we can also remove the punctuation, such as commas, periods or quotation marks.

corpus_clean <- tm_map(corpus_clean, removePunctuation)
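
The order of these operations matters: several of the predefined stop words contain an apostrophe, and they can only be matched as long as the punctuation is still present. The following sketch illustrates this with two small helper functions written in base R. They are hypothetical stand-ins for removeWords and removePunctuation, not the tm implementations.

```r
# Base R stand-ins (assumptions for illustration, not the tm code).
remove_words <- function(text, words) {
  pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", text, perl = TRUE)
}
remove_punct <- function(text) gsub("[[:punct:]]+", "", text)

toy_stopwords <- c("don't", "the")
sentence <- "we don't measure the field"

# Order used in this post: stop words first, punctuation last.
words_first <- remove_punct(remove_words(sentence, toy_stopwords))

# Reversed order: "don't" turns into "dont" and no longer matches the stop word.
punct_first <- remove_words(remove_punct(sentence), toy_stopwords)

print(words_first)
print(punct_first)
```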

## Generating the word cloud

Now we are basically done. All that is left to do is to generate the word cloud. This can be achieved with the following command:

wordcloud(corpus_clean, max.words=Inf, random.order=FALSE, scale= c(3, 0.1), colors=brewer.pal(8,"Dark2"))

Here is a brief explanation of the parameters used in the wordcloud() function. The values of scale (which in this case are 3 and 0.1) define the size of the largest and the smallest letters used to display the words. Setting random.order=FALSE plots the words in decreasing order of frequency, so that the most important words end up in the center of the cloud. For the colors we have chosen to use all eight colors of the palette named Dark2 from the RColorBrewer package. The various color palettes provided by RColorBrewer can be visualized with the command display.brewer.all(). Finally, we have set no limit to the number of words in the cloud by specifying max.words=Inf. In the case of particularly long texts it may be helpful to reduce the total number of displayed words to improve the visibility.
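
The size of a word in the cloud reflects how often it occurs, so it can help to look at the word frequencies before choosing a limit for max.words. The sketch below counts the words of an invented sentence with base R only; for the real corpus, a term-document matrix created with tm would play the same role.

```r
# Invented text; in the post the words come from corpus_clean.
toy_text <- "spin ice vertices spin magnetic ice spin"
words <- strsplit(toy_text, " ", fixed = TRUE)[[1]]

# Count how often each word occurs and sort by decreasing frequency.
word_freq <- sort(table(words), decreasing = TRUE)
print(word_freq)

# Keep only the two most frequent words, similar in spirit to max.words = 2.
top_words <- word_freq[1:2]
print(names(top_words))
```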

## Stemming

A further improvement of the word cloud might be obtained by stemming the words in the corpus. The tm package provides a function stemDocument() for this purpose. In our example we can see that the cloud contains the words “magnetic” and “magnetization”, which share the same word stem. Another instance is given by the words “symmetric” and “symmetry”, and there are a few more such cases. However, stemming is not a trivial task, and simply applying the stemming function stemDocument() does not always lead to the desired result. In this specific case the need for stemming does not seem pronounced enough to justify much effort in this direction, and we can therefore decide to leave the cloud as it is.
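
If you would like to see what a stemmer does to the words mentioned above, the following sketch may serve as a starting point. It assumes that the SnowballC package is installed, which to my knowledge is the package that stemDocument() relies on, and it simply skips the stemming otherwise. No output is shown here, since the resulting stems depend on the stemming algorithm.

```r
words <- c("magnetic", "magnetization", "symmetric", "symmetry")

if (requireNamespace("SnowballC", quietly = TRUE)) {
  # wordStem() returns one stem per input word.
  stems <- SnowballC::wordStem(words, language = "english")
  print(setNames(stems, words))
} else {
  message("The SnowballC package is not installed; skipping the example.")
}
```

Comparing the stems of each pair shows whether the stemmer actually merges the two words, which is a quick way to judge if stemming would improve the cloud.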

This document has been produced with RStudio version 0.99.467, using the tm package version 0.6-2, the wordcloud package version 2.5, the RColorBrewer package version 1.1-2, and the NLP package version 0.1-8. The rmarkdown package version 0.8 was used for the typesetting, and the output, including the word cloud, was generated dynamically within the document using R version 3.2.2 (2015-08-14) and the knitr package version 1.11.