Summarizing a Corpus
Text data is messy and to make sense of it you often have to clean it a bit first. For example, do you want “Tuesday” and “Tuesdays” to count as separate words or the same word? Most of the time we would want to count this as the same word. Similarly with “run” and “running”. Furthermore, we often are not interested in including punctuation in the analysis - we just want to treat the text as a “bag of words”. There are libraries in R that will help you do this.
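As a quick illustration of the kind of normalization these libraries perform, here is a minimal sketch, just for illustration, assuming the tidytext and SnowballC packages are installed (SnowballC's wordStem is not used later in this module):

library(tidyverse)
library(tidytext)
library(SnowballC)

tibble(text = "She runs on Tuesdays. He loves running!") %>%
  unnest_tokens(word, text) %>%   ## lower-cases the text and strips punctuation
  mutate(stem = wordStem(word))   ## "runs" and "running" both reduce to the stem "run"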
You can download the code and data for this module by using this dropbox link: https://www.dropbox.com/sh/q3w2bvju4lowy4d/AAC_Noadt_lYpSULcnCtcAaJa?dl=0. If you are using RStudio, you can do this from within RStudio:

usethis::use_course("https://www.dropbox.com/sh/q3w2bvju4lowy4d/AAC_Noadt_lYpSULcnCtcAaJa?dl=1")
In the following let’s look at Yelp reviews of Las Vegas hotels:
library(tidyverse)

reviews <- read_csv('data/vegas_hotel_reviews.csv')
business <- read_csv('data/vegas_hotel_info.csv')
This data contains customer reviews of 18 hotels in Las Vegas. In addition to text, each review also contains a star rating from 1 to 5. Let’s see how the 18 hotels fare on average in terms of star ratings:
library(scales)
library(lubridate)
reviews %>%
  left_join(select(business,business_id,name),
            by='business_id') %>%
  group_by(name) %>%
  summarize(n = n(),
            mean.star = mean(as.numeric(stars))) %>%
  arrange(desc(mean.star)) %>%
  ggplot() +
  geom_point(aes(x=reorder(name,mean.star),y=mean.star,size=n))+
  coord_flip() +
  ylab('Mean Star Rating (1-5)') +
  xlab('Hotel')
So The Venetian, Bellagio and The Cosmopolitan are clearly the highest rated hotels, while Luxor and LVH are the lowest rated. Ok, but what is behind these ratings? What are customers actually saying about these hotels? This is what we can hope to find through a text analysis.
Basic Text Analysis in R
Constructing a Document Term Matrix
In the following we will make use of two libraries for text mining:
## install packages
install.packages(c("wordcloud","tidytext")) ## only run once
library(tidytext)
library(wordcloud)
Simple text summaries are often based on the document-term matrix. This is an array where each row corresponds to a document and each column corresponds to a word. The entries of the array are simply counts of how many times a certain word occurs in a certain document. Let’s look at a simple example:
example <- data.frame(doc_id = c(1:4),
                      text = c("I have a brown dog. My dog loves walks.",
                               "My dog likes food.",
                               "I like food.",
                               "Some dogs are black."),
                      stringsAsFactors = F)
This data contains four documents. The first document contains 8 unique words (or “terms”). The word “brown” occurs once, while “dog” occurs twice. We can create document-term counts using tidytext’s unnest_tokens function followed by a simple count:
example %>%
  unnest_tokens(word,text) %>%
  count(doc_id,word)
doc_id word n
1 1 a 1
2 1 brown 1
3 1 dog 2
4 1 have 1
5 1 i 1
6 1 loves 1
7 1 my 1
8 1 walks 1
9 2 dog 1
10 2 food 1
11 2 likes 1
12 2 my 1
13 3 food 1
14 3 i 1
15 3 like 1
16 4 are 1
17 4 black 1
18 4 dogs 1
19 4 some 1
Note that punctuation is removed and lower-casing is done automatically.
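If you want to see these counts laid out as an actual document-term matrix - one row per document, one column per term - you can pivot the counts to wide format. Here is a minimal sketch using tidyr’s pivot_wider (loaded as part of the tidyverse):

example %>%
  unnest_tokens(word,text) %>%
  count(doc_id,word) %>%
  pivot_wider(names_from = word, values_from = n, values_fill = 0)   ## absent words become 0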
We usually perform text summaries with the goal of understanding what a document or set of documents is about. To this end, many of the most frequently used words in the English language are useless - words such as “the”, “and”, “it” and so on. These words are called “stop words”, and we can remove them automatically by using an anti_join:
example %>%
  unnest_tokens(word,text) %>%
  count(doc_id,word) %>%
  anti_join(stop_words)
doc_id word n
1 1 brown 1
2 1 dog 2
3 1 loves 1
4 1 walks 1
5 2 dog 1
6 2 food 1
7 2 likes 1
8 3 food 1
9 4 black 1
10 4 dogs 1
We now have a more distilled and interesting representation of each document. Let’s try this out on some real data - the hotel reviews above. Suppose we are interested in summarizing the reviews for the Aria hotel:
aria.id <- filter(business,
                  name=='Aria Hotel & Casino')$business_id

aria.reviews <- filter(reviews,
                       business_id==aria.id)

AriaTidy <- aria.reviews %>%
  select(review_id,text,stars) %>%
  unnest_tokens(word,text)

AriaFreqWords <- AriaTidy %>%
  count(word)
Let’s plot the top words:
AriaFreqWords %>%
  top_n(25) %>%
  ggplot(aes(x=fct_reorder(word,n),y=n)) + geom_bar(stat='identity') +
  coord_flip() + theme_bw()+
  labs(title='Top 25 Words in Aria Reviews',
       x='Word',
       y='Count')
Ok - so here we see the problem with stop words. Let’s remove them:
AriaFreqWords %>%
  anti_join(stop_words) %>%
  top_n(25) %>%
  ggplot(aes(x=fct_reorder(word,n),y=n)) + geom_bar(stat='identity') +
  coord_flip() + theme_bw()+
  labs(title='Top 25 Words in Aria Reviews',
       subtitle = 'Stop Words Removed',
       x='Word',
       y='Count')
Much better! But how do these top words vary with the rating of the underlying reviews? This is an easy extension:
AriaFreqWordsByRating <- AriaTidy %>%
  count(stars,word)

AriaFreqWordsByRating %>%
  anti_join(stop_words) %>%
  group_by(stars) %>%
  top_n(20) %>%
  ggplot(aes(x=reorder_within(word,n,stars),
             y=n,
             fill=stars)) +
  geom_bar(stat='identity') +
  coord_flip() +
  scale_x_reordered() +
  facet_wrap(~stars,scales = 'free',nrow=1) +
  theme_bw() +
  theme(legend.position = "none")+
  labs(title = 'Top Words by Review Rating',
       subtitle = 'Stop words removed',
       x = 'Word',
       y = 'Count')
Here we can begin to see what is behind low-rated and high-rated experiences.
A typical visualization of document-term counts is the word cloud. These are easy to create in R:
topWords <- AriaFreqWords %>%
  anti_join(stop_words) %>%
  top_n(100)

wordcloud(topWords$word,
          topWords$n,
          scale=c(5,0.5),
          colors=brewer.pal(8,"Dark2"))
Above we looked at terms in isolation. Sometimes we may want to summarize terms two at a time. These are called “bi-grams”:
aria.reviews %>%
  select(review_id,text) %>%
  unnest_tokens(bigram,text,token="ngrams",n=2) %>%
  count(bigram) %>%
  top_n(40) %>%
  ggplot(aes(x=fct_reorder(bigram,n),y=n)) + geom_bar(stat='identity') +
  coord_flip() +
  labs(title="Top Bi-grams",
       x = "Terms",
       y = "Counts")
All of the above analyses looked at overall term frequencies. Sometimes we may have a different objective: what is one given document about? The default metric for this question is a document’s Term Frequency-Inverse Document Frequency (TF-IDF). A word’s TF-IDF is a measure of how important a word is to a document. It is calculated as \[ \text{TF}(\text{word}) = \frac{\text{number of times word appears in document}}{\text{total number of words in document}} \] \[ \text{IDF}(\text{word}) = \log_e \frac{\text{total number of documents}}{\text{number of documents containing word}} \] \[ \text{TF-IDF}(\text{word}) = \text{TF}(\text{word}) \times \text{IDF}(\text{word}) \]
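To make these formulas concrete, here is a small check on the toy example from above - a sketch using tidytext’s bind_tf_idf function (introduced more fully below), which uses the natural log for the IDF term:

example %>%
  unnest_tokens(word,text) %>%
  count(doc_id,word) %>%
  bind_tf_idf(word,doc_id,n) %>%
  filter(word == "dog")
## Document 1 has 9 words and "dog" occurs twice, so TF = 2/9.
## "dog" appears in 2 of the 4 documents, so IDF = log(4/2) = log(2),
## giving TF-IDF = (2/9)*log(2), roughly 0.15.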
Words with high TF-IDF in a document will tend to be words that are rare (when compared to other documents) but not too rare. In this way they are informative about what the document is about! We can easily calculate TF-IDF using bind_tf_idf:
tidyReviews <- aria.reviews %>%
  select(review_id,text) %>%
  unnest_tokens(word, text) %>%
  count(review_id,word)

minLength <- 200 # focus on long reviews

tidyReviewsLong <- tidyReviews %>%
  group_by(review_id) %>%
  summarize(length = sum(n)) %>%
  filter(length >= minLength)

tidyReviewsTFIDF <- tidyReviews %>%
  filter(review_id %in% tidyReviewsLong$review_id) %>%
  bind_tf_idf(word,review_id,n) %>%
  group_by(review_id) %>%
  arrange(desc(tf_idf)) %>%
  slice(1:15) %>% # get top 15 words in terms of tf-idf
  ungroup() %>%
  mutate(xOrder=n():1) %>% # for plotting
  inner_join(select(aria.reviews,review_id,stars),by='review_id') # get star ratings
Here is a plot of the top TF-IDF words for 12 reviews:
nReviewPlot <- 12

plot.df <- tidyReviewsTFIDF %>%
  filter(review_id %in% tidyReviewsLong$review_id[1:nReviewPlot])

plot.df %>%
  mutate(review_id_n = as.integer(factor(review_id))) %>%
  ggplot(aes(x=xOrder,y=tf_idf,fill=factor(review_id_n))) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~ review_id_n,scales='free') +
  scale_x_continuous(breaks = plot.df$xOrder,
                     labels = plot.df$word,
                     expand = c(0,0)) +
  coord_flip()+
  labs(x='Word',
       y='TF-IDF',
       title = 'Top TF-IDF Words in Reviews of Aria',
       subtitle = paste0('Based on first ',
                         nReviewPlot,' reviews'))+
  theme(legend.position = "none")
One can think of these as “keywords” for each review.
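The same idea can be taken one level up: if we treat all reviews of a hotel as one long document, the top TF-IDF terms act as keywords for that hotel relative to the other 17 hotels. This is not part of the analysis above, but a minimal sketch could look like this:

hotelTFIDF <- reviews %>%
  left_join(select(business,business_id,name), by='business_id') %>%
  unnest_tokens(word,text) %>%
  count(name,word) %>%
  bind_tf_idf(word,name,n) %>%
  group_by(name) %>%
  slice_max(tf_idf, n = 10) %>%   ## top 10 keywords per hotel
  ungroup()

Note that stop words need no special treatment here: common stop words appear in the reviews of every hotel, so their IDF, and hence their TF-IDF, is zero.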
Aria v2.0
If you want to try out a much bigger database of Aria reviews, you can use the data in the file AriaReviewsTrip.csv. This contains 10,905 reviews of the Aria hotel scraped from Tripadvisor.
aria <- read_csv('data/AriaReviewsTrip.csv') %>%
  rename(text = reviewText)

meta.data <- aria %>%
  select(reviewID,reviewRating,date,year.month.group)
With this sample size you can do more in-depth analyses. For example, how do word frequencies change over time? In the following we focus on the terms “buffet”, “pool” and “staff”. We calculate the relative frequency of these three terms for each month (relative to the total number of terms used that month).
ariaTidy <- aria %>%
  select(reviewID,text) %>%
  unnest_tokens(word,text) %>%
  count(reviewID,word) %>%
  inner_join(meta.data,by="reviewID")

total.terms.time <- ariaTidy %>%
  group_by(year.month.group) %>%
  summarize(n.total=sum(n))

## pick every third month for the x-axis labels
a <- 1:nrow(total.terms.time)
b <- a[seq(1, length(a), 3)]

ariaTidy %>%
  filter(word %in% c("pool","staff","buffet")) %>%
  group_by(word,year.month.group) %>%
  summarize(n = sum(n)) %>%
  left_join(total.terms.time, by='year.month.group') %>%
  ggplot(aes(x=year.month.group,y=n/n.total,color=word,group=word)) +
  geom_line() +
  facet_wrap(~word)+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  scale_x_discrete(breaks=as.character(total.terms.time$year.month.group[b]))+
  scale_y_continuous(labels=percent)+xlab('Year/Month')+
  ylab('Word Frequency relative to Month Total')+
  ggtitle('Dynamics of Word Frequency for Aria Hotel')
We see three different patterns for the relative frequencies: “buffet” is used in a fairly stable manner over this time period, while “pool” displays clear seasonality, rising in popularity in the summer months. Finally, we see an upward trend in the use of “staff”.
We can try a similar analysis where we consider word frequency dynamics for different satisfaction segments:
aria.tidy2 <- ariaTidy %>%
  mutate(year = year(date),
         satisfaction = fct_recode(factor(reviewRating),
                                   "Not Satisfied"="1",
                                   "Not Satisfied"="2",
                                   "Neutral"="3",
                                   "Neutral"="4",
                                   "Satisfied"="5"))

total.terms.rating.year <- aria.tidy2 %>%
  group_by(satisfaction,year) %>%
  summarize(n.total = sum(n))

aria.tidy2 %>%
  filter(word %in% c("pool","staff","buffet","food","wait","casino","line","check","clean")) %>%
  group_by(satisfaction,year,word) %>%
  summarize(n = sum(n)) %>%
  left_join(total.terms.rating.year, by=c('year','satisfaction')) %>%
  ggplot(aes(x=year,y=n/n.total,color=satisfaction,group=satisfaction)) +
  geom_line(size=1,alpha=0.25) + geom_point() +
  facet_wrap(~word,scales='free')+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  scale_y_continuous(labels=percent)+xlab('Year')+
  ylab('Word Frequency relative to Year Total')+
  labs(title='Dynamics of Word Frequency for Aria Hotel',
       subtitle='Three Satisfaction Segments')
Basic Text Analysis in Python
The most popular library for basic text analysis in Python is the NLTK library (Natural Language Toolkit). We start by loading the required libraries and functions. We also need to download the dictionaries we will be using (you only need to do this once):
import nltk
from nltk.probability import FreqDist
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
import string
from nltk.corpus import stopwords
from nltk.util import ngrams
import pandas as pd
import re # for working with regular expressions
import matplotlib.pyplot as plt
import seaborn as sns
#nltk.download('punkt') # only need to run this once
#nltk.download('stopwords') # only need to run this once
#nltk.download('wordnet') # only need to run this once
#nltk.download('omw-1.4')
lang = "english"
We will use the same data as above and focus on the Aria hotel reviews:
reviews = pd.read_csv('../data/vegas_hotel_reviews.csv')
business = pd.read_csv('../data/vegas_hotel_info.csv')

ariaID = "JpHE7yhMS5ehA9e8WG_ETg"
ariaReviews = reviews[reviews.business_id==ariaID]
Ok - now we are good to go. Let’s start by lower-casing all words and removing punctuation and other non-word characters. We then concatenate all the reviews into one long string so we can compute word frequencies on the whole corpus:
ariaReviewsList = [re.sub(r"[^\w ]", " ", x.lower()) for x in ariaReviews['text']]
ariaReviewsStr = ' '.join(ariaReviewsList)
print(ariaReviewsStr[:100])
they likely have some issues to work out but they are a new establishment the trend here as evide
Next, let’s create tokens - here we set up our tokenizer to keep only tokens that are at least two letters long:
tokeniser = RegexpTokenizer(r'[A-Za-z]{2,}')
ariaReviewsTokenized = tokeniser.tokenize(ariaReviewsStr)
print(ariaReviewsTokenized[:100])
['they', 'likely', 'have', 'some', 'issues', 'to', 'work', 'out', 'but', 'they', 'are', 'new', 'establishment', 'the', 'trend', 'here', 'as', 'evidenced', 'by', 'the', 'wynn', 'is', 'non', 'theming', 'one', 'concern', 'that', 'some', 'might', 'have', 'is', 'that', 'the', 'aria', 'and', 'city', 'center', 'look', 'nothing', 'like', 'vegas', 'stayed', 'one', 'night', 'at', 'aria', 'on', 'opening', 'night', 'and', 'checked', 'out', 'most', 'of', 'the', 'bars', 'and', 'common', 'areas', 'in', 'the', 'place', 'the', 'room', 'was', 'quite', 'nice', 'and', 'very', 'high', 'tech', 'like', 'the', 'rest', 'of', 'the', 'hotel', 'decor', 'is', 'muted', 'and', 'brownish', 'walking', 'in', 'the', 'curtains', 'open', 'to', 'do', 'reveal', 'of', 'the', 'view', 'over', 'the', 'strip', 'bellagio', 'is', 'few', 'buildings']
We can now count words (or - we should say - tokens). Here are the 10 most frequent tokens:
fdist = FreqDist(ariaReviewsTokenized)
fdist.most_common(10)
[('the', 25948), ('and', 12337), ('to', 11406), ('was', 6888), ('in', 6374), ('of', 6042), ('it', 5885), ('is', 5155), ('for', 4269), ('we', 4123)]
Alright - here we have the same problem as in the R version so let’s remove the stop words:
stop_words = set(stopwords.words("english"))
ariaReviewsTokenizedStop = [word for word in ariaReviewsTokenized if word not in stop_words]
fdistStop = FreqDist(ariaReviewsTokenizedStop)
fdistStop.most_common(10)
[('room', 4000), ('hotel', 2750), ('aria', 2379), ('like', 1520), ('nice', 1443), ('vegas', 1337), ('get', 1318), ('one', 1282), ('stay', 1264), ('great', 1207)]
Better! As always in Python, it is a good idea to put frequently used operations into functions. Let’s do that, then find the top 30 most frequent non-stopword tokens and plot them:
def process_raw_text(text):
    # Tokenize words
    tokeniser = RegexpTokenizer(r'[A-Za-z]{2,}')
    tokens = tokeniser.tokenize(text)

    # Lowercase tokens
    tokens_lower = [token.lower() for token in tokens]

    # Remove stopwords
    clean = [token for token in tokens_lower if token not in stop_words]
    return clean

def frequent_tokens(corpus, n):
    # Preprocess each document
    documents = [process_raw_text(document) for document in corpus]

    # Get all tokens and put them into one long list
    allTokens = [document for document in documents]
    allTokens_flat = [item for sublist in allTokens for item in sublist]

    # Find frequency of tokens
    freq_dist = FreqDist(allTokens_flat)
    top_freq = freq_dist.most_common(n)
    return pd.DataFrame(top_freq, columns=["token", "count"])

data = frequent_tokens(ariaReviews['text'],30)
print(data)
token count
0 room 4000
1 hotel 2750
2 aria 2379
3 like 1520
4 nice 1443
5 vegas 1337
6 get 1318
7 one 1282
8 stay 1264
9 great 1207
10 service 1201
11 casino 1198
12 rooms 1175
13 would 1147
14 time 1111
15 check 1075
16 really 1008
17 good 928
18 place 904
19 us 903
20 also 854
21 back 794
22 even 780
23 strip 758
24 stayed 731
25 everything 731
26 bed 727
27 night 701
28 go 698
29 got 677
plt.figure(figsize=(12,10))
sns.barplot(x="count", y="token", data=data)
plt.title("Most common tokens")
plt.show()
Looks good! Now let’s advance this a little. First of all, when we count words, do we really want to treat “room” and “rooms” as different words? How about “meeting” and “meet”? For most purposes we don’t want to treat these differently. The process of finding the root of a word is called lemmatization, and the root words are called lemmas. Below we add lemmatization to the text cleaning function. Second, we might want to work with single tokens (unigrams), pairs of tokens (bigrams) or more. We can use the ngrams function for that, and we add it to our frequency function:
def process_raw_text(text):
    # Tokenize words
    tokeniser = RegexpTokenizer(r'[A-Za-z]{2,}')
    tokens = tokeniser.tokenize(text)

    # Lowercase, lemmatize and remove stopwords
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(token.lower(), pos='v') for token in tokens]
    lemmas = [token.lower() for token in lemmas]

    clean = [lemma for lemma in lemmas if lemma not in stop_words]
    return clean

def frequent_ngram(corpus, ngram, n):
    # Preprocess each document
    documents = [process_raw_text(document) for document in corpus]

    # Find ngrams per document and put them into one long list
    n_grams = [list(ngrams(document, ngram)) for document in documents]
    n_grams_flat = [item for sublist in n_grams for item in sublist]

    # Get frequencies of ngrams
    freq_dist = FreqDist(n_grams_flat)
    top_freq = freq_dist.most_common(n)
    return pd.DataFrame(top_freq, columns=["ngram", "count"])
Ok - now let’s try this out for unigrams and bigrams and plot the results:
data_unigrams = frequent_ngram(ariaReviews['text'],1,30)
data_bigrams = frequent_ngram(ariaReviews['text'],2,30)
plt.figure(figsize=(12,10))
sns.barplot(x="count", y="ngram", data=data_unigrams)
plt.title("Most common unigrams")
plt.show()

plt.figure(figsize=(12,10))
sns.barplot(x="count", y="ngram", data=data_bigrams)
plt.title("Most common bigrams")
plt.show()