Summarizing a Corpus
Text data is messy, and to make sense of it you usually have to clean it a bit first. For example, do you want “Tuesday” and “Tuesdays” to count as separate words or as the same word? Most of the time we would want to count them as the same word; similarly with “run” and “running”. Furthermore, we are often not interested in including punctuation in the analysis - we just want to treat the text as a “bag of words”. There are libraries in R that will help you do this.
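For instance, a stemmer maps related word forms to a common root, and base R can handle lower-casing and punctuation removal. Here is a minimal sketch, assuming the SnowballC package is installed (it is not used elsewhere in this module, so treat it purely as an illustration):
## optional illustration - assumes install.packages("SnowballC") has been run
library(SnowballC)
wordStem(c("tuesday","tuesdays","run","running"))  # related forms map to a common stem (e.g. "run")
## lower-casing and stripping punctuation for a "bag of words" view
tolower(gsub("[[:punct:]]", "", "The dog ran, and the dogs run!"))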
You can download the code and data for this module using this Dropbox link: https://www.dropbox.com/sh/q3w2bvju4lowy4d/AAC_Noadt_lYpSULcnCtcAaJa?dl=0. If you are using RStudio, you can also do this from within RStudio:
usethis::use_course("https://www.dropbox.com/sh/q3w2bvju4lowy4d/AAC_Noadt_lYpSULcnCtcAaJa?dl=1")
In the following let’s look at Yelp reviews of Las Vegas hotels:
library(tidyverse)
reviews <- read_csv('data/vegas_hotel_reviews.csv')
business <- read_csv('data/vegas_hotel_info.csv')
This data contains customer reviews of 18 hotels in Las Vegas. In addition to the review text, each review also contains a star rating from 1 to 5. Let’s see how the 18 hotels fare on average in terms of star ratings:
library(scales)
library(lubridate)
reviews %>%
left_join(select(business,business_id,name),
by='business_id') %>%
group_by(name) %>%
summarize(n = n(),
mean.star = mean(as.numeric(stars))) %>%
arrange(desc(mean.star)) %>%
ggplot() +
geom_point(aes(x=reorder(name,mean.star),y=mean.star,size=n))+
coord_flip() +
ylab('Mean Star Rating (1-5)') +
xlab('Hotel')
So The Venetian, Bellagio and The Cosmopolitan are clearly the highest-rated hotels, while Luxor and LVH are the lowest-rated. OK, but what is behind these ratings? What are customers actually saying about these hotels? This is what we can hope to uncover through a text analysis.
Basic Text Analysis in R
Constructing a Document Term Matrix
In the following we will make use of two libraries for text mining:
## install packages
install.packages(c("wordcloud","tidytext")) ## only run once
library(tidytext)
library(wordcloud)
Simple text summaries are often based on the document-term matrix. This is an array where each row corresponds to a document and each column corresponds to a word (or “term”). The entries of the array are simply counts of how many times a given word occurs in a given document. Let’s look at a simple example:
example <- data.frame(doc_id = c(1:4),
text=c("I have a brown dog. My dog loves walks.",
"My dog likes food.",
"I like food.",
"Some dogs are black."),
stringsAsFactors = F)
This data contains four documents. The first document contains 8 unique words (or “terms”); the word “brown” occurs once while “dog” occurs twice. We can create document-term counts using tidytext’s unnest_tokens function followed by a simple count:
example %>%
unnest_tokens(word,text) %>%
count(doc_id,word)
   doc_id  word n
1 1 a 1
2 1 brown 1
3 1 dog 2
4 1 have 1
5 1 i 1
6 1 loves 1
7 1 my 1
8 1 walks 1
9 2 dog 1
10 2 food 1
11 2 likes 1
12 2 my 1
13 3 food 1
14 3 i 1
15 3 like 1
16 4 are 1
17 4 black 1
18 4 dogs 1
19 4 some 1
The output matches what we expect: the first document has 8 unique terms, with “dog” counted twice and every other term counted once. Note that punctuation is removed and lower-casing is done automatically.
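The tidy counts above are just the document-term matrix in “long” format. If you ever need the actual matrix - for example, to pass to a model - tidytext can cast the counts into a sparse matrix (rows = documents, columns = terms). A minimal sketch using the toy example:
example %>%
  unnest_tokens(word,text) %>%
  count(doc_id,word) %>%
  cast_sparse(doc_id, word, n)  # sparse document-term matrix from the Matrix package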
We usually perform text summaries with the goal of understanding what a document or set of documents is about. To this end, many of the most frequently used words in the English language are useless for our purpose - words such as “the”, “and”, “it” and so on. These words are called “stop words”, and we can remove them automatically by using an anti_join:
example %>%
unnest_tokens(word,text) %>%
count(doc_id,word) %>%
anti_join(stop_words)
   doc_id  word n
1 1 brown 1
2 1 dog 2
3 1 loves 1
4 1 walks 1
5 2 dog 1
6 2 food 1
7 2 likes 1
8 3 food 1
9 4 black 1
10 4 dogs 1
We now have a more distilled and interesting representation of each document. Let’s try this out on some real data - the hotel reviews above. Suppose we are interested in summarizing the reviews for the Aria hotel:
aria.id <- filter(business,
name=='Aria Hotel & Casino')$business_id
aria.reviews <- filter(reviews,
business_id==aria.id)
AriaTidy <- aria.reviews %>%
select(review_id,text,stars) %>%
unnest_tokens(word,text)
AriaFreqWords <- AriaTidy %>%
count(word)
Let’s plot the top words:
AriaFreqWords %>%
top_n(25) %>%
ggplot(aes(x=fct_reorder(word,n),y=n)) + geom_bar(stat='identity') +
coord_flip() + theme_bw()+
labs(title='Top 25 Words in Aria Reviews',
       x='Word',
       y='Count')
OK - so here we see the problem with stop words. Let’s remove them:
AriaFreqWords %>%
anti_join(stop_words) %>%
top_n(25) %>%
ggplot(aes(x=fct_reorder(word,n),y=n)) + geom_bar(stat='identity') +
coord_flip() + theme_bw()+
labs(title='Top 25 Words in Aria Reviews',
subtitle = 'Stop Words Removed',
x='Word',
       y='Count')
Much better! But how do these top words vary with the rating of the underlying reviews? This is an easy extension:
AriaFreqWordsByRating <- AriaTidy %>%
count(stars,word)
AriaFreqWordsByRating %>%
anti_join(stop_words) %>%
group_by(stars) %>%
top_n(20) %>%
ggplot(aes(x=reorder_within(word,n,stars),
y=n,
fill=stars)) +
geom_bar(stat='identity') +
coord_flip() +
scale_x_reordered() +
facet_wrap(~stars,scales = 'free',nrow=1) +
theme_bw() +
theme(legend.position = "none")+
labs(title = 'Top Words by Review Rating',
subtitle = 'Stop words removed',
x = 'Word',
y = 'Count')
Here we can begin to see what is behind low-rated and high-rated experiences.
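One refinement worth trying at this point: generic terms such as “aria”, “hotel”, “vegas” and “stay” dominate the overall counts and likely show up in every panel, so they do little to separate good from bad experiences. A sketch of how you could append a custom stop word list (the particular words are only an illustration - choose them after inspecting your own plots):
my.stop.words <- bind_rows(stop_words,
                           tibble(word = c("aria","hotel","vegas","stay","stayed"),
                                  lexicon = "custom"))
AriaFreqWordsByRating %>%
  anti_join(my.stop.words, by = "word") %>%
  group_by(stars) %>%
  top_n(20)  # then rebuild the faceted plot above with these counts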
A typical visualization of document-term counts is the word cloud. These are easy to produce in R:
topWords <- AriaFreqWords %>%
anti_join(stop_words) %>%
top_n(100)
wordcloud(topWords$word,
topWords$n,
scale=c(5,0.5),
colors=brewer.pal(8,"Dark2"))
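Note that the word cloud layout is randomized, so the plot will look a little different each time you run it. If you want a reproducible figure, fix the random seed first:
set.seed(1234)  # makes the (random) word cloud layout reproducible
wordcloud(topWords$word,
          topWords$n,
          scale=c(5,0.5),
          colors=brewer.pal(8,"Dark2"))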
Above we looked at terms in isolation. Sometimes we may want to count terms two at a time; such pairs are called “bi-grams”:
aria.reviews %>%
select(review_id,text) %>%
unnest_tokens(bigram,text,token="ngrams",n=2) %>%
count(bigram) %>%
top_n(40) %>%
ggplot(aes(x=fct_reorder(bigram,n),y=n)) + geom_bar(stat='identity') +
coord_flip() +
labs(title="Top Bi-grams",
x = "Counts",
y = "Terms")
All of the above analyses looked at overall term frequencies. Sometimes we have a different objective: what is one given document about? A standard metric for this question is a document’s term frequency-inverse document frequency (TF-IDF). A word’s TF-IDF measures how important that word is to a particular document within a collection of documents. It is calculated as
\[ \text{TF}(\text{word}) = \frac{\text{number of times word appears in document}}{\text{total number of words in document}} \]
\[ \text{IDF}(\text{word}) = \log_e \frac{\text{total number of documents}}{\text{number of documents containing word}} \]
\[ \text{TF-IDF}(\text{word}) = \text{TF}(\text{word}) \times \text{IDF}(\text{word}) \]
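As a quick worked example, take the word “dog” in the first document of the toy corpus above: it appears 2 times out of 9 total words, and 2 of the 4 documents contain “dog” (the token “dogs” in document 4 is a different term), so
\[ \text{TF-IDF}(\text{dog}) = \frac{2}{9} \times \log_e \frac{4}{2} \approx 0.22 \times 0.69 \approx 0.15 \]
A word that appears in every document has an IDF of zero and hence a TF-IDF of zero, no matter how often it is used.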
Words with a high TF-IDF in a document tend to be words that are used often in that document but rarely in the rest of the collection. In this way they are informative about what the document is about. We can easily calculate TF-IDF using bind_tf_idf:
tidyReviews <- aria.reviews %>%
select(review_id,text) %>%
unnest_tokens(word, text) %>%
count(review_id,word)
minLength <- 200 # focus on long reviews
tidyReviewsLong <- tidyReviews %>%
group_by(review_id) %>%
summarize(length = sum(n)) %>%
filter(length >= minLength)
tidyReviewsTFIDF <- tidyReviews %>%
filter(review_id %in% tidyReviewsLong$review_id) %>%
bind_tf_idf(word,review_id,n) %>%
group_by(review_id) %>%
arrange(desc(tf_idf)) %>%
slice(1:15) %>% # get top 15 words in terms of tf-idf
ungroup() %>%
mutate(xOrder=n():1) %>% # for plotting
inner_join(select(aria.reviews,review_id,stars),by='review_id') # get star ratings
Here is a plot of the top TF-IDF words for 12 reviews:
nReviewPlot <- 12
plot.df <- tidyReviewsTFIDF %>%
filter(review_id %in% tidyReviewsLong$review_id[1:nReviewPlot])
plot.df %>%
mutate(review_id_n = as.integer(factor(review_id))) %>%
ggplot(aes(x=xOrder,y=tf_idf,fill=factor(review_id_n))) +
geom_bar(stat = "identity", show.legend = FALSE) +
facet_wrap(~ review_id_n,scales='free') +
scale_x_continuous(breaks = plot.df$xOrder,
labels = plot.df$word,
expand = c(0,0)) +
coord_flip()+
labs(x='Word',
y='TF-IDF',
title = 'Top TF-IDF Words in Reviews of Aria',
subtitle = paste0('Based on first ',
nReviewPlot,
' reviews'))+
theme(legend.position = "none")
One can think of these as “key-words” for each review.
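A quick way to sanity-check these key-words is to read the underlying review. For example, to print the full text of the first review shown in the plot (any review_id from plot.df works just as well):
aria.reviews %>%
  filter(review_id == plot.df$review_id[1]) %>%
  pull(text) %>%
  cat()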
Aria v2.0
If you want to try out a much bigger set of Aria reviews, you can use the data in the file AriaReviewsTrip.csv. This contains 10,905 reviews of the Aria hotel scraped from Tripadvisor.
aria <- read_csv('data/AriaReviewsTrip.csv') %>%
rename(text = reviewText)
meta.data <- aria %>%
select(reviewID,reviewRating,date,year.month.group)
With this sample size we can do more in-depth analyses. For example, how do word frequencies change over time? In the following we focus on the terms “buffet”, “pool” and “staff”, and calculate the relative frequency of each term for each month (relative to the total number of terms used that month).
ariaTidy <- aria %>%
select(reviewID,text) %>%
unnest_tokens(word,text) %>%
count(reviewID,word) %>%
inner_join(meta.data,by="reviewID")
total.terms.time <- ariaTidy %>%
group_by(year.month.group) %>%
summarize(n.total=sum(n))
## pick every third month for the x-axis labels
a <- 1:nrow(total.terms.time)
b <- a[seq(1, length(a), 3)]
ariaTidy %>%
filter(word %in% c("pool","staff","buffet")) %>%
group_by(word,year.month.group) %>%
summarize(n = sum(n)) %>%
left_join(total.terms.time, by='year.month.group') %>%
ggplot(aes(x=year.month.group,y=n/n.total,color=word,group=word)) +
geom_line() +
facet_wrap(~word)+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
scale_x_discrete(breaks=as.character(total.terms.time$year.month.group[b]))+
scale_y_continuous(labels=percent)+xlab('Year/Month')+
ylab('Word Frequency relative to Month Total')+
ggtitle('Dynamics of Word Frequency for Aria Hotel')
We see three different patterns for the relative frequencies: “buffet” is used in a fairly stable manner over this time period, while “pool” displays clear seasonality, rising in popularity in the summer months. Finally, we see an upward trend in the use of “staff”.
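To double-check the seasonality of “pool”, we can aggregate mentions of “pool” by calendar month across years. This is a rough check based on raw counts (not adjusted for how many reviews are written each month), and it assumes the date column was parsed as a date so lubridate’s month() applies:
ariaTidy %>%
  filter(word == "pool") %>%
  mutate(month = month(date, label = TRUE)) %>%  # calendar month, pooled across years
  group_by(month) %>%
  summarize(n = sum(n)) %>%
  ggplot(aes(x = month, y = n, group = 1)) +
  geom_line() +
  labs(title = 'Mentions of "pool" by Calendar Month',
       x = 'Month',
       y = 'Count')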
We can try a similar analysis where we consider word frequency dynamics for different satisfaction segments:
aria.tidy2 <- ariaTidy %>%
mutate(year = year(date),
satisfaction = fct_recode(factor(reviewRating),
"Not Satisfied"="1",
"Not Satisfied"="2",
"Neutral"="3",
"Neutral"="4",
"Satisfied"="5"))
total.terms.rating.year <- aria.tidy2 %>%
group_by(satisfaction,year) %>%
summarize(n.total = sum(n))
aria.tidy2 %>%
filter(word %in% c("pool","staff","buffet","food","wait","casino","line","check","clean")) %>%
group_by(satisfaction,year,word) %>%
summarize(n = sum(n)) %>%
left_join(total.terms.rating.year, by=c('year','satisfaction')) %>%
ggplot(aes(x=year,y=n/n.total,color=satisfaction,group=satisfaction)) +
geom_line(size=1,alpha=0.25) + geom_point() +
facet_wrap(~word,scales='free')+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
scale_y_continuous(labels=percent)+xlab('Year')+
ylab('Word Frequency relative to Segment/Year Total')+
labs(title='Dynamics of Word Frequency for Aria Hotel',
subtitle='Three Satisfaction Segments')
Basic Text Analysis in Python
The most popular library for basic text analysis in Python is NLTK (the Natural Language Toolkit). We start by loading the required libraries and functions. We also need to download a few NLTK resources - tokenizer models, stop word lists and the WordNet lemmatizer data (you only need to do this once):
import nltk
from nltk.probability import FreqDist
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
import string
from nltk.corpus import stopwords
from nltk.util import ngrams
import pandas as pd
import re # for working with regular expressions
import matplotlib.pyplot as plt
import seaborn as sns
#nltk.download('punkt') # only need to run this once
#nltk.download('stopwords') # only need to run this once
#nltk.download('wordnet') # only need to run this once
#nltk.download('omw-1.4')
lang="english"We will use the same data as above and focus on the Aria hotel reviews:
reviews = pd.read_csv('../data/vegas_hotel_reviews.csv')
business = pd.read_csv('../data/vegas_hotel_info.csv')
ariaID = "JpHE7yhMS5ehA9e8WG_ETg"
ariaReviews = reviews[reviews.business_id==ariaID]
OK - now we are good to go. Let’s start by lower-casing all words and removing punctuation and other non-word characters. We then concatenate all the reviews into one long string so we can compute word frequencies on it:
ariaReviewsList = [re.sub(r"[^\w ]", " ", x.lower()) for x in ariaReviews['text']]
ariaReviewsStr = ' '.join(ariaReviewsList)
print(ariaReviewsStr[:100])
they likely have some issues to work out but they are a new establishment the trend here as evide
Next let’s create tokens - here we set up our tokenizer to keep only tokens with words that are at least two letters long:
tokeniser = RegexpTokenizer(r'[A-Za-z]{2,}')
ariaReviewsTokenized = tokeniser.tokenize(ariaReviewsStr)
print(ariaReviewsTokenized[:100])
['they', 'likely', 'have', 'some', 'issues', 'to', 'work', 'out', 'but', 'they', 'are', 'new', 'establishment', 'the', 'trend', 'here', 'as', 'evidenced', 'by', 'the', 'wynn', 'is', 'non', 'theming', 'one', 'concern', 'that', 'some', 'might', 'have', 'is', 'that', 'the', 'aria', 'and', 'city', 'center', 'look', 'nothing', 'like', 'vegas', 'stayed', 'one', 'night', 'at', 'aria', 'on', 'opening', 'night', 'and', 'checked', 'out', 'most', 'of', 'the', 'bars', 'and', 'common', 'areas', 'in', 'the', 'place', 'the', 'room', 'was', 'quite', 'nice', 'and', 'very', 'high', 'tech', 'like', 'the', 'rest', 'of', 'the', 'hotel', 'decor', 'is', 'muted', 'and', 'brownish', 'walking', 'in', 'the', 'curtains', 'open', 'to', 'do', 'reveal', 'of', 'the', 'view', 'over', 'the', 'strip', 'bellagio', 'is', 'few', 'buildings']
We can now count words (or - we should say - tokens). Here are the 10 most frequent tokens:
fdist = FreqDist(ariaReviewsTokenized)
fdist.most_common(10)
[('the', 25948), ('and', 12337), ('to', 11406), ('was', 6888), ('in', 6374), ('of', 6042), ('it', 5885), ('is', 5155), ('for', 4269), ('we', 4123)]
Alright - here we have the same problem as in the R version, so let’s remove the stop words:
stop_words=set(stopwords.words("english"))
ariaReviewsTokenizedStop = [word for word in ariaReviewsTokenized if not word in stop_words]
fdistStop = FreqDist(ariaReviewsTokenizedStop)
fdistStop.most_common(10)
[('room', 4000), ('hotel', 2750), ('aria', 2379), ('like', 1520), ('nice', 1443), ('vegas', 1337), ('get', 1318), ('one', 1282), ('stay', 1264), ('great', 1207)]
Better! As usual in Python, it is good practice to wrap frequently used operations in functions. Let’s do that, find the 30 most frequent non-stop-word tokens, and plot them:
def process_raw_text(text):
    # Tokenize words
    tokeniser = RegexpTokenizer(r'[A-Za-z]{2,}')
    tokens = tokeniser.tokenize(text)
    # Lowercase tokens
    tokens_lower = [token.lower() for token in tokens]
    # Remove stopwords
    clean = [token for token in tokens_lower if token not in stop_words]
    return clean
def frequent_tokens(corpus, n):
    # Preprocess each document
    documents = [process_raw_text(document) for document in corpus]
    # Put all tokens into one long list
    allTokens_flat = [token for document in documents for token in document]
    # Find frequency of tokens
    freq_dist = FreqDist(allTokens_flat)
    top_freq = freq_dist.most_common(n)
    return pd.DataFrame(top_freq, columns=["token", "count"])

data = frequent_tokens(ariaReviews['text'], 30)
print(data)
    token  count
0 room 4000
1 hotel 2750
2 aria 2379
3 like 1520
4 nice 1443
5 vegas 1337
6 get 1318
7 one 1282
8 stay 1264
9 great 1207
10 service 1201
11 casino 1198
12 rooms 1175
13 would 1147
14 time 1111
15 check 1075
16 really 1008
17 good 928
18 place 904
19 us 903
20 also 854
21 back 794
22 even 780
23 strip 758
24 stayed 731
25 everything 731
26 bed 727
27 night 701
28 go 698
29 got 677
plt.figure(figsize=(12,10))
sns.barplot(x="count", y="token", data=data)
plt.title("Most common tokens")
plt.show()
Looks good! Now let’s take this a little further. First, when we count words, do we really want to treat “room” and “rooms” as different words? How about “meeting” and “meet”? For most purposes we don’t. The process of reducing a word to its root form is called lemmatization, and the root words are called lemmas. Below we add lemmatization to the text cleaning function. Second, we might want to count single tokens (unigrams), pairs of tokens (bigrams), or longer sequences. We can use NLTK’s ngrams function for that, and we add it to our frequency function:
def process_raw_text(text):
    # Tokenize words
    tokeniser = RegexpTokenizer(r'[A-Za-z]{2,}')
    tokens = tokeniser.tokenize(text)
    # Lowercase and lemmatize tokens, then remove stopwords
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(token.lower(), pos='v') for token in tokens]
    clean = [lemma for lemma in lemmas if lemma not in stop_words]
    return clean
def frequent_ngram(corpus, ngram, n):
    # Preprocess each document
    documents = [process_raw_text(document) for document in corpus]
    # Find ngrams per document and put them into one long list
    n_grams = [list(ngrams(document, ngram)) for document in documents]
    n_grams_flat = [item for sublist in n_grams for item in sublist]
    # Get frequencies of ngrams
    freq_dist = FreqDist(n_grams_flat)
    top_freq = freq_dist.most_common(n)
    return pd.DataFrame(top_freq, columns=["ngram", "count"])

OK - now let’s try this out for unigrams and bigrams and plot the results:
data_unigrams=frequent_ngram(ariaReviews['text'],1,30)
data_bigrams=frequent_ngram(ariaReviews['text'],2,30)
plt.figure(figsize=(12,10))
sns.barplot(x="count", y="ngram", data=data_unigrams)
plt.title("Most common unigrams")
plt.show()
plt.figure(figsize=(12,10))
sns.barplot(x="count", y="ngram", data=data_bigrams)
plt.title("Most common bigrams")
plt.show()