Group Summaries
Imagine that you are three years old and someone asks you to take a pile of Lego bricks and count the number of bricks of each color. How would you do this?
If you are a smart three-year-old, you would intuitively use the following algorithm: first divide all bricks into separate piles of identical color, i.e., a pile of red bricks, another of green bricks and so on, and then count the number of bricks in each pile. Easy.
Now imagine that you are a 25-year-old analyst and your manager asks you to take the company customer database and compute average transaction size per customer for the most recent quarter. This is essentially the Lego problem, and you should solve it the same way: first divide transactions into “customer piles” and then compute the average of each pile. This idea of taking a collection of objects, first grouping them, and then performing some operation on each group is also the foundation of many modern large-scale computing frameworks such as MapReduce.
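Before bringing in any libraries, here is the three-year-old's algorithm written out in plain Python. It is only a toy sketch - the list of brick colors is made up - but it shows the split-then-summarize pattern that every group-by performs:

from collections import defaultdict

bricks = ['red', 'green', 'red', 'blue', 'red', 'green']   # a made-up pile of bricks

# Step 1: split the bricks into one pile per color
piles = defaultdict(list)
for color in bricks:
    piles[color].append(color)

# Step 2: apply an operation (here: count) to each pile
counts = {color: len(pile) for color, pile in piles.items()}
print(counts)   # {'red': 3, 'green': 2, 'blue': 1}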
Grouped Data Summaries in Business Analytics
Whenever you are faced with an objective like the ones above, you should think: Python with the pandas or polars library, or R with the tidyverse. These are packages for performing group operations (and many other things, of course). Once you understand how to do this, it will change your life as a data scientist and - quite possibly - permanently make you a happier person. Seriously.
The basic structure of a group statement is:

1. Take some data
2. Define the group assignments for the data
3. Perform some operation on each group
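In pandas, for example, those three steps map onto a single chained expression. Here is a toy sketch of the pattern (the small color/bricks dataframe is invented purely for illustration):

import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'red', 'blue'],
                   'bricks': [3, 1, 2, 5]})

result = (df                      # 1. take some data
          .groupby('color')       # 2. define the group assignments
          ['bricks'].sum()        # 3. perform some operation on each group
          .reset_index())
print(result)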
Let us try this out on some real data. As always, you can find the code and data here: https://www.dropbox.com/scl/fo/pu6pe4i7mncytqbm58q1b/h?rlkey=khuypes7iohteya2d8mfiq0ge&dl=0
Case Study: Retail Transactions
Let’s start out by looking at the same transaction data we used in the relational data module here. We will see how a variety of interesting data summaries can easily be carried out using group-by statements.
Let’s start with a simple objective: What is the total sales (in dollars) for each product (UPC) in the data? Recall that this data is transactions of different products for different stores in different weeks. The current objective requires us to group the data (i.e., rows of the data file) into buckets defined by UPC. For each of the UPC-buckets, we then simply sum the sales (SPEND) to get at the required metric.
We start by importing the polars library and reading in the data:

import polars as pl

transactions = pl.read_csv('data/transactions.csv')
products = pl.read_csv('data/products.csv')
stores = pl.read_csv('data/stores.csv')

We can then use the groupby function in polars to group rows by UPC and then sum sales within all rows for the same UPC:
salesUPC = transactions.groupby('UPC').agg(pl.sum('SPEND'))
We group by UPC and sum over SPEND. You can read this - left to right - like you read a recipe for cooking: first take the transactions dataframe, then group it by UPC, and then sum the variable SPEND within each group. The result is a dataframe with one row per UPC:
print(salesUPC.head())
shape: (5, 2)
┌─────────────┬───────────┐
│ UPC ┆ SPEND │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════════════╪═══════════╡
│ 1111038078 ┆ 113465.39 │
│ 7110410455 ┆ 73248.74 │
│ 31254742735 ┆ 296693.48 │
│ 88491201426 ┆ 954418.23 │
│ 1111009507 ┆ 346753.8 │
└─────────────┴───────────┘
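One version note before moving on: newer releases of polars renamed groupby to group_by and favor column expressions inside agg. If the code above errors on your installation, the equivalent query would look roughly like this (a sketch assuming a recent polars version):

salesUPC = transactions.group_by('UPC').agg(pl.col('SPEND').sum())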
Note that polars does not guarantee any particular row order for a group-by result (the output above is not sorted by UPC). Suppose you wanted the result arranged by total SPEND. That’s easy - we can just add to the “recipe”:
salesUPC = transactions.groupby('UPC').agg(pl.sum('SPEND')).sort('SPEND', descending=True)
Ok - now let’s find the 5 weeks with the highest sales (across all stores and products). In this case we just change the grouping variable to WEEK_END_DATE
and then grab the top 5 rows in the sorted dataframe:
salesWeek = transactions.groupby('WEEK_END_DATE').agg(pl.sum('SPEND'))
top5salesWeek = salesWeek.sort('SPEND', descending=True).head(5)
print(top5salesWeek)
shape: (5, 2)
┌───────────────┬───────────┐
│ WEEK_END_DATE ┆ SPEND │
│ --- ┆ --- │
│ str ┆ f64 │
╞═══════════════╪═══════════╡
│ 13-Jan-10 ┆ 261133.84 │
│ 6-Jan-10 ┆ 254203.56 │
│ 9-Feb-11 ┆ 245929.22 │
│ 3-Mar-10 ┆ 235027.24 │
│ 9-Dec-09 ┆ 233886.03 │
└───────────────┴───────────┘
Rather than look at UPCs or weeks, now let’s assume that we wanted to find the total sales for each of the four product categories (SNACKS, CEREAL, PIZZA and ORAL HYGIENE). Since each UPC’s product information is only found in the products data file, we first join that with the transactions file and then group by CATEGORY:
transactionsProducts = transactions.join(products, on='UPC')
salesCategory = transactionsProducts.groupby('CATEGORY').agg(pl.sum('SPEND'))
print(salesCategory)
shape: (4, 2)
┌───────────────────────┬──────────┐
│ CATEGORY ┆ SPEND │
│ --- ┆ --- │
│ str ┆ f64 │
╞═══════════════════════╪══════════╡
│ ORAL HYGIENE PRODUCTS ┆ 1.7278e6 │
│ FROZEN PIZZA ┆ 6.4596e6 │
│ BAG SNACKS ┆ 4.7320e6 │
│ COLD CEREAL ┆ 1.5008e7 │
└───────────────────────┴──────────┘
We can easily generalize the group-by approach to multiple groups. For example, what is total sales by category for each store? To meet this objective, we use the merged transactionsProducts dataframe from above (which contains the category information) and group by both STORE_NUM and CATEGORY:
salesCategoryStore = transactionsProducts.groupby(['CATEGORY','STORE_NUM']).agg(pl.sum('SPEND'))
print(salesCategoryStore.head())
shape: (5, 3)
┌───────────────────────┬───────────┬──────────┐
│ CATEGORY ┆ STORE_NUM ┆ SPEND │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 │
╞═══════════════════════╪═══════════╪══════════╡
│ BAG SNACKS ┆ 23055 ┆ 7730.19 │
│ COLD CEREAL ┆ 2541 ┆ 82259.67 │
│ ORAL HYGIENE PRODUCTS ┆ 17615 ┆ 23234.34 │
│ BAG SNACKS ┆ 23061 ┆ 88277.23 │
│ BAG SNACKS ┆ 29159 ┆ 46111.67 │
└───────────────────────┴───────────┴──────────┘
Finally, remember that we can keep adding steps in the chain of operations. For example, we might want the result above in wide format. In this case, we can just add a final pivot step at the end:
salesCategoryStoreWide = salesCategoryStore.pivot(values='SPEND', index='STORE_NUM', columns='CATEGORY')
<string>:1: DeprecationWarning: In a future version of polars, the default `aggregate_function` will change from `'first'` to `None`. Please pass `'first'` to keep the current behaviour, or `None` to accept the new one.
print(salesCategoryStoreWide.head())
shape: (5, 5)
┌───────────┬────────────┬─────────────┬───────────────────────┬──────────────┐
│ STORE_NUM ┆ BAG SNACKS ┆ COLD CEREAL ┆ ORAL HYGIENE PRODUCTS ┆ FROZEN PIZZA │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════════╪════════════╪═════════════╪═══════════════════════╪══════════════╡
│ 23055 ┆ 7730.19 ┆ 84935.97 ┆ 7730.11 ┆ 42152.96 │
│ 2541 ┆ 16084.14 ┆ 82259.67 ┆ 11965.54 ┆ 34656.53 │
│ 17615 ┆ 28017.43 ┆ 139788.43 ┆ 23234.34 ┆ 65096.92 │
│ 23061 ┆ 88277.23 ┆ 208878.41 ┆ 19951.5 ┆ 98472.77 │
│ 29159 ┆ 46111.67 ┆ 167652.55 ┆ 19101.84 ┆ 60915.59 │
└───────────┴────────────┴─────────────┴───────────────────────┴──────────────┘
Let’s start with a simple objective: What is the total sales (in dollars) for each product (UPC) in the data? Recall that this data is transactions of different products for different stores in different weeks. The current objective requires us to group the data (i.e., rows of the data file) into buckets defined by UPC. For each of the UPC-buckets, we then simply sum the sales (SPEND) to get at the required metric.
We start by importing the pandas library and reading in the data:
import pandas as pd

transactions = pd.read_csv('data/transactions.csv')
products = pd.read_csv('data/products.csv')
stores = pd.read_csv('data/stores.csv')
We can then use the groupby
function in pandas to group rows by UPC and then sum sales within all rows for the same UPC:
salesUPC = transactions.groupby(['UPC'])['SPEND'].sum().reset_index()
We group by UPC
and sum over SPEND
. You can read this - left to right - like you read a recipe for cooking: First take the transactions dataframe, then group it by UPC
and then sum the variable SPEND
within each group. The reset_index()
function at the end of the chain is to make sure that the result is a new pandas dataframe - without that we would get a pandas series. This is fine but dataframes are easier to use for later queries, e.g., merges. The result is a dataframe with one row per UPC:
print(salesUPC.head())
UPC SPEND
0 1111009477 912633.74
1 1111009497 715124.34
2 1111009507 346753.80
3 1111035398 95513.37
4 1111038078 113465.39
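To make the role of reset_index() concrete, here is a minimal sketch contrasting the two forms (the names s and df are introduced only for this illustration):

# Without reset_index() the group keys become the index and we get a pandas Series
s = transactions.groupby(['UPC'])['SPEND'].sum()

# reset_index() moves UPC back into an ordinary column and returns a DataFrame
df = s.reset_index()

print(type(s))   # <class 'pandas.core.series.Series'>
print(type(df))  # <class 'pandas.core.frame.DataFrame'>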
Note that the result is sorted by UPC
by default. Suppose you wanted the result returned arranged by total SPEND
instead. That’s easy - we can just add to the “recipe”:
salesUPC = transactions.groupby(['UPC'])['SPEND'].sum().reset_index().sort_values(by=['SPEND'], ascending=False)
Ok - now let’s find the 5 weeks with the highest sales (across all stores and products). In this case we just change the grouping variable to WEEK_END_DATE
and then grab the top 5 rows in the sorted dataframe:
salesWeek = transactions.groupby(['WEEK_END_DATE'])['SPEND'].sum().reset_index()
print(salesWeek.sort_values(by=['SPEND'], ascending=False)[0:5])
WEEK_END_DATE SPEND
20 13-Jan-10 261133.84
136 6-Jan-10 254203.56
151 9-Feb-11 245929.22
116 3-Mar-10 235027.24
150 9-Dec-09 233886.03
Rather than look at UPCs or weeks, now let’s assume that we wanted to find the total sales for each of the four product categories (SNACKS, CEREAL, PIZZA and ORAL HYGIENE). Since each UPC’s product information is only found in the products data file, we first merge that with the transactions file and then group by CATEGORY:
transactionsProducts = transactions.merge(products, on='UPC')
salesCategory = transactionsProducts.groupby(['CATEGORY'])['SPEND'].sum()
print(salesCategory)
CATEGORY
BAG SNACKS 4731982.53
COLD CEREAL 15008351.27
FROZEN PIZZA 6459557.59
ORAL HYGIENE PRODUCTS 1727831.19
Name: SPEND, dtype: float64
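Notice that salesCategory comes back as a pandas Series here because we skipped reset_index(). That is fine for printing, but if you want a regular dataframe - easier to use in later queries such as merges, as noted above - you could add reset_index() just as before. A small sketch:

# Same summary, returned as a DataFrame instead of a Series
salesCategory = transactionsProducts.groupby(['CATEGORY'])['SPEND'].sum().reset_index()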
We can easily generalize the group-by approach to multiple groups. For example, what is total sales by category for each store? To meet this objective, we use the merged transactionsProducts dataframe from above (which contains the category information) and group by both STORE_NUM and CATEGORY:
salesCategoryStore = transactionsProducts.groupby(['CATEGORY','STORE_NUM'])['SPEND'].sum().reset_index()
print(salesCategoryStore.head())
CATEGORY STORE_NUM SPEND
0 BAG SNACKS 367 20988.36
1 BAG SNACKS 387 100710.62
2 BAG SNACKS 389 137481.40
3 BAG SNACKS 613 61620.43
4 BAG SNACKS 623 32654.77
Finally, remember that we can keep adding steps in the chain of operations. For example, we might want the result above in wide format. In this case, we can just add a final pivot_table
step at the end:
salesCategoryStoreWide = pd.pivot_table(salesCategoryStore, index='STORE_NUM', columns='CATEGORY', values='SPEND').reset_index()
print(salesCategoryStoreWide.head())
CATEGORY STORE_NUM BAG SNACKS ... FROZEN PIZZA ORAL HYGIENE PRODUCTS
0 367 20988.36 ... 52475.44 6871.20
1 387 100710.62 ... 102733.14 34575.06
2 389 137481.40 ... 116476.88 29707.91
3 613 61620.43 ... 89158.54 23485.61
4 623 32654.77 ... 87677.56 24244.83
[5 rows x 5 columns]
Suppose that you want to calculate multiple summaries in one groupby statement. We can solve this by creating a function that calculates any summary on each groupby subset of the data and then apply that function to each of those subsets. Here’s an example:
# Find total spend, average spend, minimum spend, maximum spend across all weeks for each
# store and category
# this function contains all the summaries we want for each of the groupby combinations
def f(x):
    d = {}
    d['totalSpend'] = x['SPEND'].sum()
    d['avgSpend'] = x['SPEND'].mean()
    d['minSpend'] = x['SPEND'].min()
    d['maxSpend'] = x['SPEND'].max()
    return pd.Series(d)

salesCategoryStoreStats = transactionsProducts.groupby(['CATEGORY','STORE_NUM']).apply(f)
print(salesCategoryStoreStats.head())
totalSpend avgSpend minSpend maxSpend
CATEGORY STORE_NUM
BAG SNACKS 367 20988.36 20988.36 0.00 106.64
387 100710.62 100710.62 2.50 268.95
389 137481.40 137481.40 2.00 305.14
613 61620.43 61620.43 2.39 150.80
623 32654.77 32654.77 1.12 107.06
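As an aside, the same four summaries can also be computed without a custom function by passing a list of aggregation functions to agg. A minimal sketch (the name salesCategoryStoreStats2 is introduced just for this example, and the resulting columns are labeled sum, mean, min and max rather than the names used above):

salesCategoryStoreStats2 = (transactionsProducts
                            .groupby(['CATEGORY', 'STORE_NUM'])['SPEND']
                            .agg(['sum', 'mean', 'min', 'max']))
print(salesCategoryStoreStats2.head())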
Let’s start with a simple objective: What is the total sales (in dollars) for each product (UPC) in the data? Recall that this data is transactions of different products for different stores in different weeks. The current objective requires us to group the data (i.e., rows of the data file) into buckets defined by UPC. For each of the UPC-buckets, we then simply sum the sales (SPEND) to get at the required metric.
We start by loading the required packages and the data:
library(tidyverse)

transactions <- read_csv('data/transactions.csv')
products <- read_csv('data/products.csv')
stores <- read_csv('data/stores.csv')
Warning: package 'tidyverse' was built under R version 4.1.3
Warning: package 'ggplot2' was built under R version 4.1.3
Warning: package 'tibble' was built under R version 4.1.3
Warning: package 'stringr' was built under R version 4.1.3
Warning: package 'forcats' was built under R version 4.1.3
-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr 1.1.1.9000 v readr 2.0.2
v forcats 1.0.0 v stringr 1.5.0
v ggplot2 3.4.2 v tibble 3.2.1
v lubridate 1.8.0 v tidyr 1.1.4
v purrr 0.3.4
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 524950 Columns: 12
-- Column specification --------------------------------------------------------
Delimiter: ","
chr (1): WEEK_END_DATE
dbl (11): STORE_NUM, UPC, UNITS, VISITS, HHS, SPEND, PRICE, BASE_PRICE, FEAT...
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 58 Columns: 6
-- Column specification --------------------------------------------------------
Delimiter: ","
chr (5): DESCRIPTION, MANUFACTURER, CATEGORY, SUB_CATEGORY, PRODUCT_SIZE
dbl (1): UPC
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 77 Columns: 9
-- Column specification --------------------------------------------------------
Delimiter: ","
chr (4): STORE_NAME, ADDRESS_CITY_NAME, ADDRESS_STATE_PROV_CODE, SEG_VALUE_NAME
dbl (5): STORE_ID, MSA_CODE, PARKING_SPACE_QTY, SALES_AREA_SIZE_NUM, AVG_WEE...
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
What is the total sales (in dollars) for each product (UPC) in the data? Recall that this data is transactions of different products for different stores in different weeks. The current objective requires us to group the data (i.e., rows of the data file) into buckets defined by UPC. For each of the UPC-buckets, we then simply sum the sales (SPEND) to get at the required metric:
salesUPC <- transactions %>%
  group_by(UPC) %>%
  summarize(total = sum(SPEND))
We group by UPC and sum over SPEND. You can read this like you read a recipe for cooking. Think of the %>% sign as an “and”: first take the transactions dataframe AND then group it by UPC AND sum the variable SPEND within each group. The result is a dataframe with one row per UPC:
head(salesUPC)
# A tibble: 6 x 2
UPC total
<dbl> <dbl>
1 1111009477 912634.
2 1111009497 715124.
3 1111009507 346754.
4 1111035398 95513.
5 1111038078 113465.
6 1111038080 88450.
Note that we can easily add more steps to the “recipe” above - for example, add information on the UPCs and sort in descending order of spending:
salesUPCsort <- salesUPC %>%
  inner_join(products, by = 'UPC') %>%
  arrange(desc(total))
head(salesUPCsort)
# A tibble: 6 x 7
UPC total DESCRIPTION MANUFACTURER CATEGORY SUB_CATEGORY PRODUCT_SIZE
<dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 1600027527 2.05e6 GM HONEY N~ GENERAL MI COLD CE~ ALL FAMILY ~ 12.25 OZ
2 3800031838 1.48e6 KELL FROST~ KELLOGG COLD CE~ KIDS CEREAL 15 OZ
3 7192100339 1.44e6 DIGRN PEPP~ TOMBSTONE FROZEN ~ PIZZA/PREMI~ 28.3 OZ
4 1600027528 1.43e6 GM CHEERIOS GENERAL MI COLD CE~ ALL FAMILY ~ 18 OZ
5 1600027564 1.38e6 GM CHEERIOS GENERAL MI COLD CE~ ALL FAMILY ~ 12 OZ
6 3800031829 1.22e6 KELL BITE ~ KELLOGG COLD CE~ ALL FAMILY ~ 18 OZ
Suppose we wanted to identify the 5 weeks with the highest total weekly spending across all stores and products. Easy - we just group by the week variable WEEK_END_DATE and pick out the top 5:
salesWeek <- transactions %>%
  group_by(WEEK_END_DATE) %>%
  summarize(total = sum(SPEND)) %>%
  top_n(5, wt = total)
salesWeek
# A tibble: 5 x 2
WEEK_END_DATE total
<chr> <dbl>
1 13-Jan-10 261134.
2 3-Mar-10 235027.
3 6-Jan-10 254204.
4 9-Dec-09 233886.
5 9-Feb-11 245929.
Rather than look at UPCs or weeks, now let’s assume that we wanted to find the total sales for each of the four product categories (SNACKS, CEREAL, PIZZA and ORAL HYGIENE). Since each UPC’s product information is only found in the products data file, we first join that with the transactions file and then group by CATEGORY:
transactionsProduct <- transactions %>%
  inner_join(products, by = 'UPC')

salesCategory <- transactionsProduct %>%
  group_by(CATEGORY) %>%
  summarize(total = sum(SPEND))
salesCategory
# A tibble: 4 x 2
CATEGORY total
<chr> <dbl>
1 BAG SNACKS 4731983.
2 COLD CEREAL 15008351.
3 FROZEN PIZZA 6459558.
4 ORAL HYGIENE PRODUCTS 1727831.
We can easily generalize the group-by approach to multiple groups. For example, what is total sales by category for each store? To meet this objective, we use the joined transactionsProduct dataframe from above (which contains the category information) and group by both STORE_NUM and CATEGORY:
salesCategoryStore <- transactionsProduct %>%
  group_by(STORE_NUM, CATEGORY) %>%
  summarize(total = sum(SPEND))
`summarise()` has grouped output by 'STORE_NUM'. You can override using the
`.groups` argument.
head(salesCategoryStore)
# A tibble: 6 x 3
# Groups: STORE_NUM [2]
STORE_NUM CATEGORY total
<dbl> <chr> <dbl>
1 367 BAG SNACKS 20988.
2 367 COLD CEREAL 127585.
3 367 FROZEN PIZZA 52475.
4 367 ORAL HYGIENE PRODUCTS 6871.
5 387 BAG SNACKS 100711.
6 387 COLD CEREAL 241718.
Finally, remember that we can keep adding steps in the chain (by adding more lines with %>%). For example, we might want the result above in wide format. In this case, we can just add a final pivot_wider step at the end:
salesCategoryStore <- transactionsProduct %>%
  group_by(STORE_NUM, CATEGORY) %>%
  summarize(total = sum(SPEND)) %>%
  pivot_wider(id_cols = 'STORE_NUM', names_from = 'CATEGORY', values_from = 'total')
`summarise()` has grouped output by 'STORE_NUM'. You can override using the
`.groups` argument.
head(salesCategoryStore)
# A tibble: 6 x 5
# Groups: STORE_NUM [6]
STORE_NUM `BAG SNACKS` `COLD CEREAL` `FROZEN PIZZA` `ORAL HYGIENE PRODUCTS`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 367 20988. 127585. 52475. 6871.
2 387 100711. 241718. 102733. 34575.
3 389 137481. 327986. 116477. 29708.
4 613 61620. 207302. 89159. 23486.
5 623 32655. 165673. 87678. 24245.
6 2277 235623. 480789. 190066. 66659.
Suppose that you want to calculate multiple summaries in one groupby statement. Easy - you just keep adding more statements in the summarize
function:
##
## Find total spend, average spend, minimum spend, maximum spend across all weeks for each
## store and category
##
salesCategoryStoreStats <- transactions %>%
  inner_join(products, by = 'UPC') %>%
  group_by(STORE_NUM, CATEGORY) %>%
  summarize(totalSpend = sum(SPEND),
            avgSpend = mean(SPEND),
            minSpend = min(SPEND),
            maxSpend = max(SPEND))
`summarise()` has grouped output by 'STORE_NUM'. You can override using the
`.groups` argument.
head(salesCategoryStoreStats)
# A tibble: 6 x 6
# Groups: STORE_NUM [2]
STORE_NUM CATEGORY totalSpend avgSpend minSpend maxSpend
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 367 BAG SNACKS 20988. 15.2 0 107.
2 367 COLD CEREAL 127585. 58.2 1.65 492.
3 367 FROZEN PIZZA 52475. 39.2 3.89 250.
4 367 ORAL HYGIENE PRODUCTS 6871. 6.63 1 39.9
5 387 BAG SNACKS 100711. 54.5 2.5 269.
6 387 COLD CEREAL 241718. 130. 1.99 796.
Case Study: Pandemic Library Checkouts
The city of Seattle’s open data portal contains a database that records the monthly checkout count of every item (mostly books) in the Seattle Public Library system. You can find the full data here.
Let’s do a quick analysis of how the global Covid-19 pandemic impacted usage of the public library system in Seattle. Here we use only data for the years 2019-2021. Did users check out more or fewer books during the pandemic compared to before? Which books were more popular during the lockdown than before? This and many other interesting questions can be answered using this database.
Since seaborn does not directly support polars DataFrames, we will stick to pandas for this case study.
We start by loading the required libraries and the data and printing the column names:
import pandas as pd
checkouts = pd.read_csv('data/Checkouts_by_Title.csv')
for col in checkouts.columns:
    print(col)
<string>:1: DtypeWarning: Columns (7) have mixed types. Specify dtype option on import or set low_memory=False.
UsageClass
CheckoutType
MaterialType
CheckoutYear
CheckoutMonth
Checkouts
Title
ISBN
Creator
Subjects
Publisher
PublicationYear
year
Each row in the data is a (item,year,month) combination and the column Checkouts
tells us how many times the corresponding item was checked out in year CheckoutYear
and month CheckoutMonth
.
Let’s start by looking at total checkouts by year and month. We use a group-by to get monthly totals and then plot them to inspect the trends:
import seaborn as sns
totalYearMonth = checkouts.groupby(['CheckoutYear','CheckoutMonth'])['Checkouts'].sum().reset_index()
totalYearMonth['CheckoutYear'] = totalYearMonth.CheckoutYear.astype('category')

sns.lineplot(x="CheckoutMonth", y="Checkouts",
             hue="CheckoutYear",
             data=totalYearMonth)
We see a massive drop in checkouts starting in months 3 and 4 of 2020. This is consistent with the national Covid-19 emergency that was declared in mid-March 2020. Around this time many states in the US went into lockdown, with businesses closing and most people working from home. In fact, the Seattle Public Library branches themselves were closed for a large part of 2020, so it was impossible to check out physical books from the library during this time. This explains the large drop in checkouts in early 2020.
While the libraries were closed, people could still check out digital items online. To see what these patterns look like we can focus attention on ebooks:
checkoutsE = checkouts[checkouts['MaterialType'] == 'EBOOK']

totalYearMonth = checkoutsE.groupby(['CheckoutYear','CheckoutMonth'])['Checkouts'].sum().reset_index()
totalYearMonth['CheckoutYear'] = totalYearMonth.CheckoutYear.astype('category')

sns.lineplot(x="CheckoutMonth", y="Checkouts",
             hue="CheckoutYear",
             data=totalYearMonth)
Ok - this is interesting: We now see a large INCREASE in checkouts starting around the beginning of the lockdown. One possible explanation is simply people having more time on their hands (while working from home).
Did all types of ebook checkouts increase by the same amount, or did some increase and others decrease? To answer this we can look at each book’s subject (Subjects). There are a LOT of different book subjects in the data - here we focus attention on the 25 most frequent ones:
n = 25
topSubjects = checkoutsE['Subjects'].value_counts()[:n].index.tolist()
print(topSubjects)
['Fiction, Literature', 'Fiction, Mystery', 'Juvenile Fiction, Juvenile Literature', 'Fiction, Romance', 'Cooking & Food, Nonfiction', 'Fiction, Mystery, Suspense, Thriller', 'Fantasy, Fiction', 'Fiction, Literature, Romance', 'Fiction, Historical Fiction, Romance', 'Fantasy, Fiction, Romance', 'Fiction, Science Fiction', 'Juvenile Fiction, Picture Book Fiction', 'Fiction, Suspense, Thriller', 'Fiction, Romance, Historical Fiction', 'Fiction, Mystery, Thriller', 'Fiction, Thriller', 'Fantasy, Juvenile Fiction, Juvenile Literature', 'Business, Nonfiction', 'Health & Fitness, Nonfiction', 'History, Nonfiction', 'Fiction, Literature, Historical Fiction', 'Fiction, Literature, Mystery', 'Juvenile Fiction, Juvenile Literature, Picture Book Fiction', 'Fiction, Literature, Suspense, Thriller', 'Biography & Autobiography, Nonfiction']
Since the overall level of ebook checkouts is increasing, we can look at the relative share of each subject and then study which subjects have increasing versus decreasing shares during the lockdown period. So we will keep only the top subjects, then do a three-level group-by to get total checkout counts at the (Year, Month, Subject) level, and then divide by the monthly total to get shares:
checkoutsEsub = checkoutsE[checkoutsE.Subjects.isin(topSubjects)]

totalYearMonth = totalYearMonth.rename(columns={'Checkouts':'Total'})
totalSubjectYearMonth = checkoutsEsub.groupby(['CheckoutYear','CheckoutMonth','Subjects'])['Checkouts'].sum().reset_index()
totalSubjectYearMonth = totalSubjectYearMonth.merge(totalYearMonth, on=['CheckoutYear','CheckoutMonth'])
totalSubjectYearMonth['Share'] = totalSubjectYearMonth['Checkouts']/totalSubjectYearMonth['Total']
We drop the 2021 data, convert the year to a category to get the right color mapping in the visualization, and then plot the shares for each subject:
totalSubjectYearMonth19_20 = totalSubjectYearMonth[totalSubjectYearMonth['CheckoutYear'] != 2021]
totalSubjectYearMonth19_20.loc[:, "CheckoutYear"] = totalSubjectYearMonth19_20.loc[:, 'CheckoutYear'].astype('category')

g = sns.FacetGrid(totalSubjectYearMonth19_20, hue="CheckoutYear", col="Subjects", sharey=False, col_wrap=3, height=4.5, aspect=1)
g = g.map(sns.lineplot, "CheckoutMonth", "Share")
g.add_legend()
These shares are quite informative. In relative terms, it is clear that checkouts of children’s ebooks are the main driver of the increase in overall ebook checkouts.
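One way to put a number on that claim is to compare each subject’s average monthly share before the pandemic with its share during the lockdown months. The sketch below reuses the totalSubjectYearMonth dataframe built above; treating April-December 2020 as the lockdown window is an assumption made purely for this comparison:

# Average monthly share per subject: 2019 vs. April-December 2020 (assumed lockdown window)
pre = totalSubjectYearMonth[totalSubjectYearMonth['CheckoutYear'] == 2019]
post = totalSubjectYearMonth[(totalSubjectYearMonth['CheckoutYear'] == 2020) &
                             (totalSubjectYearMonth['CheckoutMonth'] >= 4)]

shareChange = (post.groupby('Subjects')['Share'].mean()
               - pre.groupby('Subjects')['Share'].mean()).sort_values(ascending=False)
print(shareChange.head())   # subjects whose share grew the most during the lockdown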
We start by loading the required libraries and data and then taking a peek at the data:
library(tidyverse)

checkouts <- read_csv('data/Checkouts_by_Title.csv')
head(checkouts)
Rows: 9750667 Columns: 13
-- Column specification --------------------------------------------------------
Delimiter: ","
chr (9): UsageClass, CheckoutType, MaterialType, Title, ISBN, Creator, Subje...
dbl (4): CheckoutYear, CheckoutMonth, Checkouts, year
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 6 x 13
UsageClass CheckoutType MaterialType CheckoutYear CheckoutMonth Checkouts
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Digital OverDrive EBOOK 2019 1 1
2 Digital OverDrive EBOOK 2019 2 1
3 Digital OverDrive EBOOK 2019 2 2
4 Digital OverDrive EBOOK 2019 3 2
5 Physical Horizon VIDEODISC 2019 3 2
6 Digital OverDrive EBOOK 2019 3 2
# i 7 more variables: Title <chr>, ISBN <chr>, Creator <chr>, Subjects <chr>,
# Publisher <chr>, PublicationYear <chr>, year <dbl>
Each row in the data is a (item,year,month) combination and the column Checkouts tells us how many times the corresponding item was checked out in year CheckoutYear
and month CheckoutMonth
.
Let’s start by looking at total checkouts by year and month. We use a group-by to get monthly totals and then plot them to inspect the trends:
totalYearMonth <- checkouts %>%
  group_by(CheckoutYear, CheckoutMonth) %>%
  summarize(total = sum(Checkouts))
`summarise()` has grouped output by 'CheckoutYear'. You can override using the
`.groups` argument.
totalYearMonth %>%
  mutate(CheckoutMonth = factor(CheckoutMonth),
         CheckoutYear = factor(CheckoutYear)) %>%
  ggplot(aes(x = CheckoutMonth, y = total, group = CheckoutYear, color = CheckoutYear)) +
  geom_line() +
  labs(title = 'Checkouts by Year and Month')
We see a massive drop in checkouts starting in months 3 and 4 of 2020. This is consistent with the national Covid-19 emergency that was declared in mid-March 2020. Around this time many states in the US went into lockdown, with businesses closing and most people working from home. In fact, the Seattle Public Library branches themselves were closed for a large part of 2020, so it was impossible to check out physical books from the library during this time. This explains the large drop in checkouts in early 2020.
While the libraries were closed, people could still check out digital items online. To see what these patterns look like we can focus attention on ebooks:
checkoutsE <- checkouts %>%
  filter(MaterialType == 'EBOOK')

totalYearMonth <- checkoutsE %>%
  group_by(CheckoutYear, CheckoutMonth) %>%
  summarize(total = sum(Checkouts))
`summarise()` has grouped output by 'CheckoutYear'. You can override using the
`.groups` argument.
totalYearMonth %>%
  mutate(CheckoutMonth = factor(CheckoutMonth),
         CheckoutYear = factor(CheckoutYear)) %>%
  ggplot(aes(x = CheckoutMonth, y = total, group = CheckoutYear, color = CheckoutYear)) +
  geom_line() +
  labs(title = 'Checkouts by Year and Month',
       subtitle = 'Ebooks Only')
Ok - this is interesting: We now see a large INCREASE in checkouts starting around the beginning of the lockdown. One possible explanation is simply people having more time on their hands (while working from home).
Did all types of ebook checkouts increase by the same amount, or did some increase and others decrease? To answer this we can look at each book’s subject (Subjects). There are a LOT of different book subjects in the data - here we focus attention on the 25 most frequent ones:
topSubjects <- checkoutsE %>%
  count(Subjects, sort = T) %>%
  slice(1:25)
Since the overall level of ebook checkouts is increasing, we can look at the relative share of each subject and then study which subjects have increasing versus decreasing shares during the lockdown period. So we will keep only the top subjects, then do a three-level group-by to get total checkout counts at the (Year, Month, Subject) level, and then divide by the monthly total to get shares:
totalSubjectYearMonth <- checkoutsE %>%
  filter(Subjects %in% topSubjects$Subjects) %>%
  group_by(CheckoutYear, CheckoutMonth, Subjects) %>%
  summarize(totalSubject = sum(Checkouts)) %>%
  left_join(totalYearMonth, by = c('CheckoutYear','CheckoutMonth')) %>%
  mutate(share = totalSubject/total)
`summarise()` has grouped output by 'CheckoutYear', 'CheckoutMonth'. You can
override using the `.groups` argument.
The resulting dataframe contains quite a bit of data, so it is best to visualize it in order to inspect all the patterns:
totalSubjectYearMonth %>%
  filter(!CheckoutYear == 2021) %>%
  mutate(CheckoutMonth = factor(CheckoutMonth),
         CheckoutYear = factor(CheckoutYear)) %>%
  ggplot(aes(x = CheckoutMonth, y = share, group = CheckoutYear, color = CheckoutYear)) +
  geom_line() +
  scale_y_continuous(labels = scales::percent) +
  labs(title = 'Ebook Subject Checkout Share by Year and Month') +
  facet_wrap(~Subjects, scales = 'free', ncol = 3)
These shares are quite informative. In relative terms, it is clear that checkouts of children’s ebooks are the main driver of the increase in overall ebook checkouts.