Working with Relational Data

In many situations your data will be stored in separate files with dimensions defined by the nature of the objects that the data represents. For example, a large transactional database might contain four individual files (or "tables"): (1) a transaction file, (2) a product file, (3) a customer file, and (4) a store file. Each of these will contain different information that can be linked to the others. This is also called a relational database, since the objects stored in it can be linked or related to each other.

You can download the code and data for this module by using this Dropbox link: https://www.dropbox.com/sh/f6ty4rvkgxzx9io/AAA9QcwEByOanbDvCo4yhWiSa?dl=0. If you are using RStudio you can also do this from within RStudio:

usethis::use_course('https://www.dropbox.com/sh/f6ty4rvkgxzx9io/AAA9QcwEByOanbDvCo4yhWiSa?dl=1')
Here is a simple example of a relational database: Suppose you have data on the results of two running races, race 1 and race 2. Each set of race results form a table. A third table contains a list of the runners, while a fourth contains the list of races. These four tables define a relational database.
Let’s look at a mini-version with data:
import polars as pl
Race1Results = pl.read_csv('data/race1_results.csv')
Race2Results = pl.read_csv('data/race2_results.csv')
Races = pl.read_csv('data/races.csv')
Runners = pl.read_csv('data/runners.csv')
shape: (4, 3)
┌────────┬──────────┬──────────┐
│ RaceId ┆ RunnerId ┆ RaceTime │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞════════╪══════════╪══════════╡
│ 1 ┆ 1 ┆ 100 │
│ 1 ┆ 2 ┆ 120 │
│ 1 ┆ 3 ┆ 90 │
│ 1 ┆ 4 ┆ 140 │
└────────┴──────────┴──────────┘
shape: (4, 3)
┌────────┬──────────┬──────────┐
│ RaceId ┆ RunnerId ┆ RaceTime │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞════════╪══════════╪══════════╡
│ 2 ┆ 2 ┆ 115 │
│ 2 ┆ 4 ┆ 145 │
│ 2 ┆ 5 ┆ 105 │
│ 2 ┆ 6 ┆ 95 │
└────────┴──────────┴──────────┘
shape: (2, 2)
┌────────┬─────────────────────────────┐
│ RaceId ┆ RaceName │
│ --- ┆ --- │
│ i64 ┆ str │
╞════════╪═════════════════════════════╡
│ 1 ┆ La Jolla Half Marathon │
│ 2 ┆ Silver Strand Half Marathon │
└────────┴─────────────────────────────┘
shape: (6, 4)
┌──────────┬─────────┬────────┬───────┐
│ RunnerId ┆ Name ┆ Gender ┆ Age │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str │
╞══════════╪═════════╪════════╪═══════╡
│ 1 ┆ Peter ┆ M ┆ 35-39 │
│ 2 ┆ Jenny ┆ F ┆ 30-34 │
│ 3 ┆ Heather ┆ F ┆ 25-29 │
│ 4 ┆ John ┆ M ┆ 30-34 │
│ 5 ┆ Max ┆ M ┆ 35-39 │
│ 6 ┆ Julie ┆ F ┆ 40-44 │
└──────────┴─────────┴────────┴───────┘
There are multiple relationships across the tables in this database. For example, we can link the same runners across races, or we can link runner characteristics to race results, or race names to results. All of these operations require us to join tables, which is the standard operation performed on relational databases. Once the join operation is done, we can continue with summaries, visualizations or other more advanced analytics tools. In Python we join dataframes by using the join function in polars.
There are different join operations available depending on what your objective is. First, suppose you are interested in comparing the same runner's performance across different races. In this case we need to extract results for runners who show up in both the race 1 and race 2 tables. You can think of this as the intersection between the two tables. This is done by performing an inner merge:
## intersection - returning both race results using polars
inner_merged = Race1Results.join(Race2Results, on = "RunnerId", how = 'inner')
print(inner_merged)
shape: (2, 5)
┌────────┬──────────┬──────────┬──────────────┬────────────────┐
│ RaceId ┆ RunnerId ┆ RaceTime ┆ RaceId_right ┆ RaceTime_right │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞════════╪══════════╪══════════╪══════════════╪════════════════╡
│ 1 ┆ 2 ┆ 120 ┆ 2 ┆ 115 │
│ 1 ┆ 4 ┆ 140 ┆ 2 ┆ 145 │
└────────┴──────────┴──────────┴──────────────┴────────────────┘
Notice that polars renamed the race-specific variables by appending "_right". This is because RaceId and RaceTime have the same names in the race 1 and race 2 tables, and the suffix distinguishes them by indicating which table a column came from (here it tells us that these columns came from Race2Results, which was on the right side of our join).
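If you prefer a more descriptive label than "_right", the join function in polars also accepts a suffix argument. Here is a minimal sketch (the suffix string "_race2" is just an illustrative choice):

## same inner join, but with an explicit suffix for the columns coming from the right table
inner_merged2 = Race1Results.join(Race2Results, on = "RunnerId", how = 'inner', suffix = '_race2')
print(inner_merged2)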
Suppose instead that you were interested in a master file that had information on all races for all runners. You can think of this as the union of the race 1 and race 2 table. You can do this operation by using outer merge:
union_merged = Race1Results.join(Race2Results, on = "RunnerId", how = "outer")
print(union_merged)
shape: (6, 5)
┌────────┬──────────┬──────────┬──────────────┬────────────────┐
│ RaceId ┆ RunnerId ┆ RaceTime ┆ RaceId_right ┆ RaceTime_right │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞════════╪══════════╪══════════╪══════════════╪════════════════╡
│ 1 ┆ 2 ┆ 120 ┆ 2 ┆ 115 │
│ 1 ┆ 4 ┆ 140 ┆ 2 ┆ 145 │
│ null ┆ 5 ┆ null ┆ 2 ┆ 105 │
│ null ┆ 6 ┆ null ┆ 2 ┆ 95 │
│ 1 ┆ 3 ┆ 90 ┆ null ┆ null │
│ 1 ┆ 1 ┆ 100 ┆ null ┆ null │
└────────┴──────────┴──────────┴──────────────┴────────────────┘
Notice that we now have a number of nulls in the merged result. That is because the result - being the union of the two race tables - has one row per runner with all of their race results, and several runners only ran in one race.
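If, for example, you wanted only the runners who appear in just one of the two races, you could filter for nulls on either side. A minimal sketch, assuming the union_merged frame from above:

## runners who ran only one of the two races have a null on one side of the union
single_race = union_merged.filter(pl.col('RaceTime').is_null() | pl.col('RaceTime_right').is_null())
print(single_race)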
Suppose instead you wanted to extract the race data for all races that runners in race 1 have ever run. This can be accomplished by a left merge:
left_merged = Race1Results.join(Race2Results, on = "RunnerId", how = "left")
print(left_merged)
shape: (4, 5)
┌────────┬──────────┬──────────┬──────────────┬────────────────┐
│ RaceId ┆ RunnerId ┆ RaceTime ┆ RaceId_right ┆ RaceTime_right │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞════════╪══════════╪══════════╪══════════════╪════════════════╡
│ 1 ┆ 1 ┆ 100 ┆ null ┆ null │
│ 1 ┆ 2 ┆ 120 ┆ 2 ┆ 115 │
│ 1 ┆ 3 ┆ 90 ┆ null ┆ null │
│ 1 ┆ 4 ┆ 140 ┆ 2 ┆ 145 │
└────────┴──────────┴──────────┴──────────────┴────────────────┘
import pandas as pd
Race1Results = pd.read_csv('data/race1_results.csv')
Race2Results = pd.read_csv('data/race2_results.csv')
Races = pd.read_csv('data/races.csv')
Runners = pd.read_csv('data/runners.csv')
RaceId RunnerId RaceTime
0 1 1 100
1 1 2 120
2 1 3 90
3 1 4 140
RaceId RunnerId RaceTime
0 2 2 115
1 2 4 145
2 2 5 105
3 2 6 95
RaceId RaceName
0 1 La Jolla Half Marathon
1 2 Silver Strand Half Marathon
RunnerId Name Gender Age
0 1 Peter M 35-39
1 2 Jenny F 30-34
2 3 Heather F 25-29
3 4 John M 30-34
4 5 Max M 35-39
5 6 Julie F 40-44
There are multiple relationships across the tables in this database. For example, we can link the same runners across races, or we can link runner characteristics to race results, or race names to results. All of these operations require us to join tables, which is the standard operation performed on relational databases. Once the join operation is done, we can continue with summaries, visualizations or other more advanced analytics tools. In Python we join dataframes by using the merge function in pandas.
There are different merge operations available depending on what your objective is. First, suppose you are interested in comparing the same runner's performance across different races. In this case we need to extract results for runners who show up in both the race 1 and race 2 tables. You can think of this as the intersection between the two tables. This is done by performing an inner merge:
## intersection - returning both race results
inner_merged = pd.merge(Race1Results, Race2Results, on = "RunnerId", how = 'inner')
print(inner_merged)
RaceId_x RunnerId RaceTime_x RaceId_y RaceTime_y
0 1 2 120 2 115
1 1 4 140 2 145
Notice that pandas renamed the race-specific variables by appending "_x" and "_y". This is because RaceId and RaceTime have the same names in the race 1 and race 2 tables and we need to distinguish between them. The "on" option tells pandas which variable the join operation is to be performed on.
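If you prefer more descriptive names, merge also accepts a suffixes argument. Here is a minimal sketch (the suffix strings are just an illustrative choice):

## same inner merge, but with explicit suffixes for the overlapping column names
inner_merged2 = pd.merge(Race1Results, Race2Results, on = "RunnerId", how = 'inner', suffixes = ('_race1', '_race2'))
print(inner_merged2)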
Suppose instead that you were interested in a master file that had information on all races for all runners. You can think of this as the union of the race 1 and race 2 table. You can do this operation by using outer merge:
union_merged = pd.merge(Race1Results, Race2Results, how = "outer", on = "RunnerId")
print(union_merged)
RaceId_x RunnerId RaceTime_x RaceId_y RaceTime_y
0 1.0 1 100.0 NaN NaN
1 1.0 2 120.0 2.0 115.0
2 1.0 3 90.0 NaN NaN
3 1.0 4 140.0 2.0 145.0
4 NaN 5 NaN 2.0 105.0
5 NaN 6 NaN 2.0 95.0
Notice that we now have a number of NaNs in the merged result. That is because the result - being the union of the two race tables - has one row per runner with all of their race results, and several runners only ran in one race.
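If, for example, you wanted only the race 1 runners who did not run race 2, you could filter on the missing values. A minimal sketch, assuming the union_merged frame from above:

## race 1 runners with no race 2 result show up with NaN in the race-2 columns
only_race1 = union_merged[union_merged['RaceTime_y'].isna()]
print(only_race1)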
Suppose instead you wanted to extract the race data for all races that runners in race 1 have ever run. This can be accomplished by a left merge:
left_merged = pd.merge(Race1Results, Race2Results, how = "left", on = "RunnerId")
print(left_merged)
RaceId_x RunnerId RaceTime_x RaceId_y RaceTime_y
0 1 1 100 NaN NaN
1 1 2 120 2.0 115.0
2 1 3 90 NaN NaN
3 1 4 140 2.0 145.0
library(tidyverse)
Race1Results <- read_csv('data/race1_results.csv')
Race2Results <- read_csv('data/race2_results.csv')
Races <- read_csv('data/races.csv')
Runners <- read_csv('data/runners.csv')

Race1Results
Race2Results
Races
Runners
# A tibble: 4 x 3
RaceId RunnerId RaceTime
<dbl> <dbl> <dbl>
1 1 1 100
2 1 2 120
3 1 3 90
4 1 4 140
# A tibble: 4 x 3
RaceId RunnerId RaceTime
<dbl> <dbl> <dbl>
1 2 2 115
2 2 4 145
3 2 5 105
4 2 6 95
# A tibble: 2 x 2
RaceId RaceName
<dbl> <chr>
1 1 La Jolla Half Marathon
2 2 Silver Strand Half Marathon
# A tibble: 6 x 4
RunnerId Name Gender Age
<dbl> <chr> <chr> <chr>
1 1 Peter M 35-39
2 2 Jenny F 30-34
3 3 Heather F 25-29
4 4 John M 30-34
5 5 Max M 35-39
6 6 Julie F 40-44
There are multiple relationships across the tables in this database. For example, we can link the same runners across races, or we can link runner characteristics to race results, or race names to results. All of these operations require us to join tables, which is the standard operation performed on relational databases. Once the join operation is done, we can continue with summaries, visualizations or other more advanced analytics tools.
There are different join operations available depending on what your objective is. First, suppose you are interested in comparing the same runner's performance across different races. In this case we need to extract results for runners who show up in both the race 1 and race 2 tables. You can think of this as the intersection between the two tables. This is done by using the inner_join function:
runners.race12 <- inner_join(Race1Results,Race2Results,by='RunnerId')
runners.race12
# A tibble: 2 x 5
RaceId.x RunnerId RaceTime.x RaceId.y RaceTime.y
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 120 2 115
2 1 4 140 2 145
Notice that R renamed the race-specific variables by appending ".x" and ".y". This is because RaceId and RaceTime have the same names in the race 1 and race 2 tables and we need to distinguish between them. The "by" option informs R on which variable the join operation is to be performed.
Suppose instead that you were interested in a master file that had information on all races for all runners. You can think of this as the union of the race 1 and race 2 table. You can do this operation by using full_join:
runners.race12.all <- full_join(Race1Results,Race2Results,by='RunnerId')
runners.race12.all
# A tibble: 6 x 5
RaceId.x RunnerId RaceTime.x RaceId.y RaceTime.y
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 100 NA NA
2 1 2 120 2 115
3 1 3 90 NA NA
4 1 4 140 2 145
5 NA 5 NA 2 105
6 NA 6 NA 2 95
Notice that we now have a number of NAs in the merged result. That is because the result - being the union of the two race tables - has one row per runner with all of their race results, and several runners only ran in one race.
Suppose instead you wanted to extract the race data for all races that runners in race 1 have ever run. This can be accomplished by a left_join operation:
runners.race1.all <- left_join(Race1Results,Race2Results,by='RunnerId')
runners.race1.all
# A tibble: 4 x 5
RaceId.x RunnerId RaceTime.x RaceId.y RaceTime.y
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 100 NA NA
2 1 2 120 2 115
3 1 3 90 NA NA
4 1 4 140 2 145
Suppose now you wanted to compare race results by age group for race 1. How should we do that? Well, now we need to join the race1 and runners table:
race1.groups <- inner_join(Race1Results,Runners,by='RunnerId')
race1.groups
# A tibble: 4 x 6
RaceId RunnerId RaceTime Name Gender Age
<dbl> <dbl> <dbl> <chr> <chr> <chr>
1 1 1 100 Peter M 35-39
2 1 2 120 Jenny F 30-34
3 1 3 90 Heather F 25-29
4 1 4 140 John M 30-34
Suppose instead you wanted results for runners participating in both races along with the associated race names. We can do this as follows:
runners.race12 <- inner_join(
  inner_join(Race1Results,Races,by='RaceId'),
  inner_join(Race2Results,Races,by='RaceId'),
  by='RunnerId'
)
runners.race12
# A tibble: 2 x 7
RaceId.x RunnerId RaceTime.x RaceName.x RaceId.y RaceTime.y RaceName.y
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
1 1 2 120 La Jolla Half Mar~ 2 115 Silver St~
2 1 4 140 La Jolla Half Mar~ 2 145 Silver St~
Finally, you may sometimes be interested in extracting the records that do not overlap between two tables rather than the ones that do. For example, suppose you were interested in the runners who ran in race 1 but not in race 2. You can think of this as the complement of the runners.race12 table when compared to the original race 1 table. This can be generated by using an anti_join operation:
race1.only <- anti_join(Race1Results,Race2Results,by='RunnerId')
race1.only
# A tibble: 2 x 3
RaceId RunnerId RaceTime
<dbl> <dbl> <dbl>
1 1 1 100
2 1 3 90
Before considering any join operation, first think about what you are trying to accomplish - what is the end result you are looking for? Then find the appropriate join operation consistent with your goal. When you are working with large databases, the wrong type of join can be disastrous. For example, wrongly using a full_join when you should be using an inner_join or left_join can generate enormous data frames which will often crash your computer.
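To see how quickly a join can inflate a result, here is a small hypothetical illustration in polars (the orders and visits tables below are made up for this sketch): when the join key is not unique in either table, every matching combination of rows is returned.

import polars as pl

## toy tables: CustId 1 appears twice in orders and three times in visits
orders = pl.DataFrame({'CustId': [1, 1, 2], 'Amount': [10, 20, 30]})
visits = pl.DataFrame({'CustId': [1, 1, 1, 2], 'Channel': ['web', 'app', 'web', 'app']})

## 2 x 3 + 1 x 1 = 7 rows - already larger than either input table
blown_up = orders.join(visits, on = 'CustId', how = 'inner')
print(blown_up.shape)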
The Structure of a Transactional Database
The classic example of a relational database is a transactional database. This database will contain one table with the actual transactions, i.e., one row for each transaction, where a transaction is an interaction between a certain customer in a certain store buying a certain product at a certain date. An “order” is then a series of transactions made by the same customer on a certain date. The product file will contain one row for each product available with a set of variables capturing various product characteristics. The customer file will contain one row for each customer (with associated customer information), while the store file will contain one row per store (with store information).
We can illustrate the structure of the relations between the files as:
| | Transaction.File | Product.File | Customer.File | Store.File |
|---|---|---|---|---|
| Order ID | X | | | |
| Order Date | X | | | |
| Product ID | X | X | | |
| Customer ID | X | | X | |
| Store ID | X | | | X |
| Volume | X | | | |
| Total Price | X | | | |
| Product Info | | X | | |
| Customer Info | | | X | |
| Store Info | | | | X |
The specific joins to perform in this application will depend on your objectives. For example, if you wanted to investigate how sales of a certain product group varied with the regional location of stores you would need to join both the product file (to get information about which product IDs belong to the group of interest) and the store file (to get the regional location of each store) with the transaction file. On the other hand, if you wanted to know the characteristics of your top customers, you would need to join the transaction file with the customer file.
Example: Retail Sales Data
Here we will look at some retail data for consumer packaged goods in four categories: mouthwash, frozen pizza, pretzels and cereal. The source of the data is the research company dunnhumby. This is sales data for 156 weeks for each of 77 stores. There are a total of 58 products (UPCs) covering the four categories. This type of data is very typical for the consumer packaged goods industry and is used by both retailers and manufacturers.
Suppose we have the following simple objective: What are the total dollar sales of pretzels in stores located in Texas? To meet this objective we have to sum the sales of UPCs that are pretzel UPCs for stores that are located in Texas. Note that the raw transaction file doesn't tell us which UPCs are pretzel UPCs - this information is found in the product data. Furthermore, the transaction file doesn't tell us which stores are located in Texas - this information is in the store data. This means we have to join both the store data AND the product data with the transaction data.
We start by reading the required data, then do two inner joins (left joins would give the same result here since all products appear in both the transaction file and the product file, and all stores appear in both the transaction file and the store file) and then filter down to Texas stores and pretzel UPCs. Note that the store key has different names in the transaction and store files - we need to state this explicitly to get the right join. Finally, we simply sum up to get total sales.
import polars as pl
## read data
products = pl.read_csv('data/products.csv')
stores = pl.read_csv('data/stores.csv')
transactions = pl.read_csv('data/transactions.csv')

## join
allDF = transactions.join(stores, left_on='STORE_NUM', right_on='STORE_ID').join(products, on='UPC')
pretzelTX = allDF.filter((pl.col('ADDRESS_STATE_PROV_CODE') == 'TX') & (pl.col('SUB_CATEGORY') == 'PRETZELS'))

SpendPretzelTX = pretzelTX['SPEND'].sum()
print(SpendPretzelTX)
1446886.7700000007
import pandas as pd
## read data
products = pd.read_csv('data/products.csv')
stores = pd.read_csv('data/stores.csv')
transactions = pd.read_csv('data/transactions.csv')

## join
allDF = transactions.merge(stores, left_on='STORE_NUM', right_on='STORE_ID').merge(products, on='UPC')
pretzelTX = allDF[(allDF.ADDRESS_STATE_PROV_CODE=='TX') & (allDF.SUB_CATEGORY=='PRETZELS')]

SpendPretzelTX = sum(pretzelTX.SPEND)
print(SpendPretzelTX)
1446886.7699998382
library(tidyverse)
products <- read_csv('data/products.csv')
stores <- read_csv('data/stores.csv')
transactions <- read_csv('data/transactions.csv')

## Find total spending on pretzel UPCS in TX stores
pretzelTX <- transactions %>%
  inner_join(stores, by = c('STORE_NUM'='STORE_ID')) %>%
  inner_join(products, by = 'UPC') %>%
  filter(ADDRESS_STATE_PROV_CODE == 'TX',
         SUB_CATEGORY == 'PRETZELS')

SpendPretzelTX <- sum(pretzelTX$SPEND)
SpendPretzelTX
[1] 1446887
Example: DonorsChoose
DonorsChoose is a non-profit organization that raises money for school projects. Potential donors can go to the website and pick among active projects and donate to any project of their choosing. When a project is funded the organization channels the funds to the school in question.
Here we will focus on projects and donations in 2015 and 2016 (if you are interested, you can go to the DonorsChoose website and download the full database). The data is available as two zipped csv files so we start by importing them:
import polars as pl
import zipfile
## read data
with zipfile.ZipFile('projects.zip', 'r') as zip_ref:
    zip_ref.extractall('extracted_files1')

projects = pl.read_csv('extracted_files1/projects.csv')

with zipfile.ZipFile('donations.zip', 'r') as zip_ref:
    zip_ref.extractall('extracted_files2')

donations = pl.read_csv('extracted_files2/donations.csv')
Let us first look at where the projects are located: How many projects are there for each state? We can calculate that by using the value_counts command:
frq = projects['school_state'].value_counts().sort(by="counts", descending=True)
print(frq)
shape: (51, 2)
┌──────────────┬────────┐
│ school_state ┆ counts │
│ --- ┆ --- │
│ str ┆ u32 │
╞══════════════╪════════╡
│ CA ┆ 45901 │
│ NY ┆ 22531 │
│ TX ┆ 21470 │
│ FL ┆ 16998 │
│ … ┆ … │
│ MT ┆ 729 │
│ ND ┆ 404 │
│ VT ┆ 335 │
│ WY ┆ 184 │
└──────────────┴────────┘
California, New York and Texas have the largest number of projects. This makes sense - these are also the three most populous states in the US.
Consider now the following question: How local is giving? In other words, do donors mainly care about their own state or do they give to projects in many states? Alternatively, we can ask where donations to projects located in New York originated. We will later return to this question in more detail, but for now let's focus on donors in New York state. Where do these donors give? We can start by considering only the donation data for donors who reside in New York:
ny_donors = donations.filter(pl.col('donor_state') == "NY")
Now we need to find out where these donations were made. To do this we need information on the location of each project. This is in the projects file. Therefore we need to merge the ny_donors file with the projects file:
ny_donors_projects = ny_donors.join(projects, on="_projectid", how="inner")
This file will have the same number of rows as the ny_donors file, but it will now have added columns with information on the project that each donation was made to - including the state id of the school for each project. We can now simply count the project locations:
frq = ny_donors_projects['school_state'].value_counts().sort(by="counts", descending=True)
print(frq)
shape: (51, 2)
┌──────────────┬────────┐
│ school_state ┆ counts │
│ --- ┆ --- │
│ str ┆ u32 │
╞══════════════╪════════╡
│ NY ┆ 47950 │
│ CA ┆ 7368 │
│ PA ┆ 4168 │
│ FL ┆ 3453 │
│ … ┆ … │
│ MT ┆ 143 │
│ NH ┆ 113 │
│ ND ┆ 44 │
│ WY ┆ 28 │
└──────────────┴────────┘
What would you say this says about how local giving is?
Here we will focus on projects and donations in 2015 and 2016 (if you are interested, you can go to the DonorsChoose website and download the full database). The data is available as two zipped csv files so we start by importing them:
import pandas as pd
## read data
projects = pd.read_csv('data/projects.zip')
donations = pd.read_csv('data/donations.zip')
Let us first look at where the projects are located: How many projects are there for each state? We can calculate that by using the value_counts command:
frq = projects['school_state'].value_counts(normalize=True)
print(frq)
school_state
CA 0.144154
NY 0.070760
TX 0.067428
FL 0.053383
NC 0.045123
IL 0.041531
SC 0.038990
GA 0.037084
PA 0.027216
AZ 0.026958
IN 0.026952
MI 0.026723
WA 0.025624
OK 0.023237
MO 0.022006
MA 0.021663
OH 0.020555
NJ 0.020043
VA 0.018749
TN 0.017326
LA 0.016284
UT 0.014754
MD 0.014547
AL 0.014242
CT 0.014120
OR 0.012345
NV 0.011341
CO 0.011174
WI 0.010760
AR 0.010543
MS 0.010401
KY 0.009742
MN 0.009390
HI 0.006331
IA 0.006209
ID 0.005411
KS 0.005264
WV 0.005160
NM 0.004899
DC 0.004758
ME 0.004259
DE 0.003357
NH 0.003181
NE 0.003062
AK 0.002805
RI 0.002575
SD 0.002393
MT 0.002289
ND 0.001269
VT 0.001052
WY 0.000578
Name: proportion, dtype: float64
California, New York and Texas have the largest number of projects. This makes sense - these are also the three most populous states in the US.
Consider now the following question: How local is giving? In other words, do donors mainly care about their own state or do they give to projects in many states? Alternatively, we can ask where donations to projects located in New York originated. We will later return to this question in more detail, but for now let's focus on donors in New York state. Where do these donors give? We can start by considering only the donation data for donors who reside in New York:
ny_donors = donations[donations['donor_state'] == "NY"]
Now we need to find out where these donations were made. To do this we need information on the location of each project. This is in the projects file. Therefore we need to merge the ny_donors file with the projects file:
ny_donors_projects = pd.merge(ny_donors, projects, on = "_projectid", how = 'inner')
This file will have the same number of rows as the ny_donors file, but it will now have added columns with information on the project that each donation was made to - including the state id of the school for each project. We can now simply count the project locations:
frq = ny_donors_projects['school_state'].value_counts(normalize=True)
print(frq)
school_state
NY 0.458983
CA 0.070527
PA 0.039897
FL 0.033053
NC 0.031004
TX 0.030478
IL 0.023576
SC 0.021681
GA 0.020503
MI 0.017833
MA 0.016320
NJ 0.015708
MO 0.014664
CT 0.013688
VA 0.013133
MD 0.012970
AZ 0.012425
OH 0.011697
IN 0.011171
TN 0.010931
AL 0.009419
OK 0.009247
CO 0.007840
LA 0.007648
WA 0.007093
OR 0.006509
WI 0.006452
NV 0.005810
MN 0.005140
MS 0.004987
UT 0.004346
AR 0.004145
DC 0.003752
KY 0.003599
ID 0.003379
WV 0.003207
ME 0.003121
RI 0.003082
NM 0.002575
DE 0.002498
IA 0.002230
HI 0.002106
KS 0.001934
AK 0.001704
VT 0.001646
SD 0.001637
NE 0.001512
MT 0.001369
NH 0.001082
ND 0.000421
WY 0.000268
Name: proportion, dtype: float64
What would you say this says about how local giving is?
Here we will focus on projects and donations in 2015 and 2016 (if you are interested, you can go to the DonorsChoose website and download the full database). The data is available as two zipped csv files so we start by importing them:
donations <- read_csv('data/donations.zip')
projects <- read_csv('data/projects.zip')
The data contains 1643289 donations made to 318416 projects. There is a lot of information about both donors and projects (you can see a sample of the data if you write glimpse(projects) in R). Let us first look at where the projects are located: How many projects are there for each state? We can calculate that by using the table command:
table(projects$school_state)
AK AL AR AZ CA CO CT DC DE FL GA HI IA
893 4535 3357 8584 45901 3558 4496 1515 1069 16998 11808 2016 1977
ID IL IN KS KY LA MA MD ME MI MN MO MS
1723 13224 8582 1676 3102 5185 6898 4632 1356 8509 2990 7007 3312
MT NC ND NE NH NJ NM NV NY OH OK OR PA
729 14368 404 975 1013 6382 1560 3611 22531 6545 7399 3931 8666
RI SC SD TN TX UT VA VT WA WI WV WY
820 12415 762 5517 21470 4698 5970 335 8159 3426 1643 184
California, New York and Texas have the largest number of projects. This makes sense - these are also the three most populous states in the US. You can easily convert the counts into proportions:
prop.table(table(projects$school_state))
AK AL AR AZ CA CO
0.0028045073 0.0142423748 0.0105428119 0.0269584443 0.1441541882 0.0111740616
CT DC DE FL GA HI
0.0141198935 0.0047579267 0.0033572434 0.0533829958 0.0370835636 0.0063313401
IA ID IL IN KS KY
0.0062088589 0.0054111602 0.0415305764 0.0269521632 0.0052635546 0.0097419728
LA MA MD ME MI MN
0.0162837295 0.0216634842 0.0145470077 0.0042585800 0.0267229034 0.0093902316
MO MS MT NC ND NE
0.0220058037 0.0104014874 0.0022894578 0.0451233606 0.0012687805 0.0030620321
NH NJ NM NV NY OH
0.0031813728 0.0200429627 0.0048992513 0.0113405105 0.0707596352 0.0205548716
OK OR PA RI SC SD
0.0232368976 0.0123454852 0.0272159690 0.0025752475 0.0389898749 0.0023930958
TN TX UT VA VT WA
0.0173263906 0.0674275162 0.0147542837 0.0187490578 0.0010520828 0.0256237124
WI WV WY
0.0107595096 0.0051599166 0.0005778604
Consider now the following question: How local is giving? In other words, do donors mainly care about their own state or do they give to projects in many states? Alternatively, we can ask where donations to projects located in New York originated. We will later return to this question in more detail, but for now let's focus on donors in New York state. Where do these donors give? We can start by considering only the donation data for donors who reside in New York:
ny.donations <- donations %>%
  filter(donor_state=='NY')
This leaves us with a donation file with 104470 donations. Now we need to find out where these donations were made. To do this we need information on the location of each project. This is in the projects file. Therefore we need to join the ny.donations file with the projects file:
ny.donations.projects <- ny.donations %>%
  inner_join(projects,by='_projectid')
This file will have the same number of rows as the ny.donations file, but it will now have added columns with information on the project that each donation was made to - including the state id of the school for each project. We can now simply count the project locations:
table(ny.donations.projects$school_state)
AK AL AR AZ CA CO CT DC DE FL GA HI IA
178 984 433 1298 7368 819 1430 392 261 3453 2142 220 233
ID IL IN KS KY LA MA MD ME MI MN MO MS
353 2463 1167 202 376 799 1705 1355 326 1863 537 1532 521
MT NC ND NE NH NJ NM NV NY OH OK OR PA
143 3239 44 158 113 1641 269 607 47950 1222 966 680 4168
RI SC SD TN TX UT VA VT WA WI WV WY
322 2265 171 1142 3184 454 1372 172 741 674 335 28
prop.table(table(ny.donations.projects$school_state))
AK AL AR AZ CA CO
0.0017038384 0.0094189720 0.0041447305 0.0124246195 0.0705274241 0.0078395712
CT DC DE FL GA HI
0.0136881401 0.0037522734 0.0024983249 0.0330525510 0.0205034938 0.0021058677
IA ID IL IN KS KY
0.0022303054 0.0033789605 0.0235761463 0.0111706710 0.0019335694 0.0035991194
LA MA MD ME MI MN
0.0076481286 0.0163204748 0.0129702307 0.0031205131 0.0178328707 0.0051402316
MO MS MT NC ND NE
0.0146644970 0.0049870776 0.0013688140 0.0310041160 0.0004211735 0.0015123959
NH NJ NM NV NY OH
0.0010816502 0.0157078587 0.0025749019 0.0058102805 0.4589834402 0.0116971379
OK OR PA RI SC SD
0.0092466737 0.0065090457 0.0398966210 0.0030822246 0.0216808653 0.0016368335
TN TX UT VA VT WA
0.0109313679 0.0304776491 0.0043457452 0.0131329568 0.0016464057 0.0070929453
WI WV WY
0.0064516129 0.0032066622 0.0002680195
So 46 percent of the donations originating in New York are made to school projects in New York. What would you say this says about how local giving is?
Reshaping Data
The standard data frame in R and Python is organized in "short and wide" format, i.e., rows are records (e.g., users, transactions, etc.) and columns are fields (i.e., different measures of those users, transactions, etc.). However, there are many situations where this is not the optimal structure for your data. In particular, it is often recommended to organize your data as "tall and thin" instead. This is sometimes referred to as "tidy" data; see here for a lengthy introduction to the idea of tidy data.
Here we will consider the most common problem in tidy data organization. This is the problem where your data looks like this:
Customer | V1 | V2 | V3 | V4 | V5 |
---|---|---|---|---|---|
ID1 | 5 | 5 | 4 | 4 | 8 |
ID2 | 9 | 4 | 3 | 1 | 9 |
ID3 | 1 | 8 | 7 | 2 | 6 |
…but you want it to look like this:
Customer | Variable | Value |
---|---|---|
ID1 | V1 | 5 |
ID1 | V2 | 5 |
ID1 | V3 | 4 |
. | . | . |
. | . | . |
The first version of the data is “short and wide”, while the second is “tall and thin”. For example, when you import data from an Excel sheet it will often be organized with one row per customer/user and one column for each measurement. However, this is often not the ideal way to organize your data. Suppose, for example, that you are analyzing a customer satisfaction survey. Each row is a customer and you may have measured 20 to 30 different dimensions of satisfaction (e.g., ordering process, product, packaging, etc.). Now you want to summarize mean satisfaction for each dimension and different customer segments. The most elegant and efficient way of doing this is first to make the data long and thin and then do a group-by statement.
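As a hypothetical sketch of that pattern (the survey columns and segments below are invented for illustration), in pandas you could melt the survey into tall-and-thin form and then do a single group-by:

import pandas as pd

## toy satisfaction survey: one row per customer, one column per dimension
survey = pd.DataFrame({
    'customer': ['ID1', 'ID2', 'ID3', 'ID4'],
    'segment':  ['A', 'A', 'B', 'B'],
    'ordering': [5, 4, 2, 3],
    'product':  [4, 5, 3, 4],
    'packaging':[3, 4, 5, 2]
})

## make it tall and thin: one row per (customer, dimension) measurement
surveyThin = pd.melt(survey, id_vars=['customer', 'segment'],
                     var_name='dimension', value_name='satisfaction')

## one group-by now gives mean satisfaction per segment and dimension
print(surveyThin.groupby(['segment', 'dimension'])['satisfaction'].mean())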
Let's look at a simple example:
We can do similar wide-to-thin and thin-to-wide actions in Python using the melt and pivot functions in polars:
import polars as pl
states = pl.read_csv('data/states.csv')

statesThin = states.melt(id_vars='state',variable_name='variable',value_name='value')

print(statesThin)
shape: (18, 3)
┌──────────────┬────────────┬─────────┐
│ state ┆ variable ┆ value │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞══════════════╪════════════╪═════════╡
│ California ┆ Population ┆ 21198.0 │
│ Illinois ┆ Population ┆ 11197.0 │
│ Rhode Island ┆ Population ┆ 931.0 │
│ California ┆ Income ┆ 5114.0 │
│ … ┆ … ┆ … │
│ Rhode Island ┆ HS.Grad ┆ 46.4 │
│ California ┆ Frost ┆ 20.0 │
│ Illinois ┆ Frost ┆ 127.0 │
│ Rhode Island ┆ Frost ┆ 127.0 │
└──────────────┴────────────┴─────────┘
melt takes wide format data and makes it thin. The option "id_vars" specifies which variable we do NOT want melted. "variable_name" and "value_name" are the user-specified names of the two other columns in the melted thin data frame. We can return to wide by using pivot:
statesWide = statesThin.pivot(index='state',columns='variable',values='value')
<string>:1: DeprecationWarning: In a future version of polars, the default `aggregate_function` will change from `'first'` to `None`. Please pass `'first'` to keep the current behaviour, or `None` to accept the new one.
print(statesWide)
shape: (3, 7)
┌──────────────┬────────────┬────────┬──────────┬────────┬─────────┬───────┐
│ state ┆ Population ┆ Income ┆ Life.Exp ┆ Murder ┆ HS.Grad ┆ Frost │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞══════════════╪════════════╪════════╪══════════╪════════╪═════════╪═══════╡
│ California ┆ 21198.0 ┆ 5114.0 ┆ 71.71 ┆ 10.3 ┆ 62.6 ┆ 20.0 │
│ Illinois ┆ 11197.0 ┆ 5107.0 ┆ 70.14 ┆ 10.3 ┆ 52.6 ┆ 127.0 │
│ Rhode Island ┆ 931.0 ┆ 4558.0 ┆ 71.9 ┆ 2.4 ┆ 46.4 ┆ 127.0 │
└──────────────┴────────────┴────────┴──────────┴────────┴─────────┴───────┘
Here "index" specifies the row identifier of the wide format, "columns" is the variable name in the thin data that becomes column names in the wide data, and "values" is the column name in the thin data that becomes the values of the columns in the wide data.
We can do similar wide-to-thin and thin-to-wide actions in Python using the melt and pivot_table functions in pandas:
import pandas as pd
states = pd.read_csv('data/states.csv')

statesThin = pd.melt(states,id_vars='state',var_name='variable',value_name='value')

print(statesThin)
state variable value
0 California Population 21198.00
1 Illinois Population 11197.00
2 Rhode Island Population 931.00
3 California Income 5114.00
4 Illinois Income 5107.00
5 Rhode Island Income 4558.00
6 California Life.Exp 71.71
7 Illinois Life.Exp 70.14
8 Rhode Island Life.Exp 71.90
9 California Murder 10.30
10 Illinois Murder 10.30
11 Rhode Island Murder 2.40
12 California HS.Grad 62.60
13 Illinois HS.Grad 52.60
14 Rhode Island HS.Grad 46.40
15 California Frost 20.00
16 Illinois Frost 127.00
17 Rhode Island Frost 127.00
melt takes wide format data and makes it thin. The option "id_vars" specifies which variable we do NOT want melted. "var_name" and "value_name" are the user-specified names of the two other columns in the melted thin data frame. We can return to wide by using pivot_table:
statesWide = pd.pivot_table(statesThin,index='state',columns='variable',values='value')
statesWide = statesWide.reset_index()
print(statesWide)
variable state Frost HS.Grad Income Life.Exp Murder Population
0 California 20.0 62.6 5114.0 71.71 10.3 21198.0
1 Illinois 127.0 52.6 5107.0 70.14 10.3 11197.0
2 Rhode Island 127.0 46.4 4558.0 71.90 2.4 931.0
Here “index” specifies the row names of the wide format, while “columns” is the variable name in the thin data that becomes column names in the wide data. Finally, “values” is the column name in the thin data that becomes the values of the columns in the wide data.
library(tidyverse)
states <- read_csv('data/states.csv')
states
# A tibble: 3 x 7
state Population Income Life.Exp Murder HS.Grad Frost
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 California 21198 5114 71.7 10.3 62.6 20
2 Illinois 11197 5107 70.1 10.3 52.6 127
3 Rhode Island 931 4558 71.9 2.4 46.4 127
This data is not tidy: We want each row to be a (state,measurement) combination. To make it tidy we use the pivot_longer function:
statesThin <- states %>%
  pivot_longer(-state,names_to='variable',values_to = 'value')
This will take all the columns except "state" and stack them into one column called "variable", with all the values placed in another column called "value". The result is
statesThin
# A tibble: 18 x 3
state variable value
<chr> <chr> <dbl>
1 California Population 21198
2 California Income 5114
3 California Life.Exp 71.7
4 California Murder 10.3
5 California HS.Grad 62.6
6 California Frost 20
7 Illinois Population 11197
8 Illinois Income 5107
9 Illinois Life.Exp 70.1
10 Illinois Murder 10.3
11 Illinois HS.Grad 52.6
12 Illinois Frost 127
13 Rhode Island Population 931
14 Rhode Island Income 4558
15 Rhode Island Life.Exp 71.9
16 Rhode Island Murder 2.4
17 Rhode Island HS.Grad 46.4
18 Rhode Island Frost 127
We will see later on that it is much easier to perform group summaries and visualizations of data when it is “tidy”. Note that you can also do the reverse action using pivot_wider:
statesWide <- statesThin %>%
  pivot_wider(names_from = 'variable', values_from = 'value')
statesWide
# A tibble: 3 x 7
state Population Income Life.Exp Murder HS.Grad Frost
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 California 21198 5114 71.7 10.3 62.6 20
2 Illinois 11197 5107 70.1 10.3 52.6 127
3 Rhode Island 931 4558 71.9 2.4 46.4 127
Here the "names_from" option specifies which column's values become the columns in the wide data format, and the "values_from" option is the column in the thin format that contains the values of those columns in the wide format.
Example: Wide Retail Sales Data
Suppose we have store sales data in wide format. The file storeSalesWide.csv contains data on total weekly sales for each of the stores in the retail sales example. Let’s load data along with the store information file:
We can apply the exact same thin-before-join trick with Polars:
import polars as pl
salesWide = pl.read_csv('data/storeSalesWide.csv')
stores = pl.read_csv('data/stores.csv')
Now we are going to melt the wide data and perform a merge with the store information:
salesWide_melt = salesWide.melt(id_vars='WEEK_END_DATE',variable_name='STORE_NUM',value_name='SPEND')

salesWide_melt = salesWide_melt.with_columns(pl.col('STORE_NUM').cast(pl.Int64).alias('STORE_NUM')) # need same type for merge

storeSalesThin = salesWide_melt.join(stores, left_on = 'STORE_NUM', right_on = 'STORE_ID')
Finally, we filter down to the Texas stores and sum:
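The polars chunk for this step is not shown in the source, so here is a minimal sketch that mirrors the pandas version below, assuming the storeSalesThin frame created above:

## keep only Texas stores, then sum weekly spend
storeSalesThinTX = storeSalesThin.filter(pl.col('ADDRESS_STATE_PROV_CODE') == 'TX')

totalSalesTX = storeSalesThinTX['SPEND'].sum()
print(totalSalesTX)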
We can apply the exact same thin-before-join trick in Python:
import pandas as pd
salesWide = pd.read_csv('data/storeSalesWide.csv')
stores = pd.read_csv('data/stores.csv')
Now we are going to melt the wide data and perform a merge with the store information:
df = pd.melt(salesWide,id_vars='WEEK_END_DATE',var_name='STORE_NUM',value_name='SPEND')
df['STORE_NUM'] = df['STORE_NUM'].astype(int) # need same type for merge

storeSalesThin = df.merge(stores, left_on = 'STORE_NUM', right_on = 'STORE_ID')
Finally, we filter down to the Texas stores and sum:
storeSalesThinTX = storeSalesThin[storeSalesThin.ADDRESS_STATE_PROV_CODE=='TX']

totalSalesTX = sum(storeSalesThinTX.SPEND)
print(totalSalesTX)
11355083.889999993
library(tidyverse)

storeSalesWide <- read_csv('data/storeSalesWide.csv')
stores <- read_csv('data/stores.csv') %>%
  mutate(STORE_ID = as.character(STORE_ID))
The storeSalesWide dataframe has 78 columns: the first is a column for week, while the remaining 77 are one column per store. Suppose we have the following objective: Calculate total sales for all stores in Texas. We know that the store file contains information on which stores are located in Texas but how do we join that dataframe with the sales data? That’s difficult since the sales data is given to us in wide format with one column per store. The answer is that BEFORE we make the join, we first convert the sales data to thin, and then we perform the join:
storeSalesThin <- storeSalesWide %>%
  pivot_longer(-WEEK_END_DATE,names_to='STORE_NUM',values_to = 'SPEND') %>%
  inner_join(stores, by = c('STORE_NUM'='STORE_ID'))
Now we can simply filter down to the Texas stores and sum:
storeSalesTX <- storeSalesThin %>%
  filter(ADDRESS_STATE_PROV_CODE == 'TX')

totalSalesTX <- sum(storeSalesTX$SPEND)

totalSalesTX
[1] 11355084
We will see later that this trick is also extremely useful for visualizations.