Working with Relational Data

In many situations your data will be stored in separate files with dimensions defined by the nature of the objects that the data represents. For example, a large transactional database might contain four individual files (or "tables"): (1) a transaction file, (2) a product file, (3) a customer file, and (4) a store file. Each of these will contain different information that can be linked to the others. This is also called a relational database, since the objects stored in it can be linked or related to each other.

You can download the code and data for this module by using this Dropbox link: https://www.dropbox.com/sh/f6ty4rvkgxzx9io/AAA9QcwEByOanbDvCo4yhWiSa?dl=0. If you are using RStudio you can also do this from within RStudio:

usethis::use_course('https://www.dropbox.com/sh/f6ty4rvkgxzx9io/AAA9QcwEByOanbDvCo4yhWiSa?dl=1')
Here is a simple example of a relational database: Suppose you have data on the results of two running races, race 1 and race 2. Each set of race results form a table. A third table contains a list of the runners, while a fourth contains the list of races. These four tables define a relational database.
Let’s look at a mini-version with data:
import polars as pl
Race1Results = pl.read_csv('data/race1_results.csv')
Race2Results = pl.read_csv('data/race2_results.csv')
Races = pl.read_csv('data/races.csv')
Runners = pl.read_csv('data/runners.csv')
shape: (4, 3)
┌────────┬──────────┬──────────┐
│ RaceId ┆ RunnerId ┆ RaceTime │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞════════╪══════════╪══════════╡
│ 1 ┆ 1 ┆ 100 │
│ 1 ┆ 2 ┆ 120 │
│ 1 ┆ 3 ┆ 90 │
│ 1 ┆ 4 ┆ 140 │
└────────┴──────────┴──────────┘
shape: (4, 3)
┌────────┬──────────┬──────────┐
│ RaceId ┆ RunnerId ┆ RaceTime │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞════════╪══════════╪══════════╡
│ 2 ┆ 2 ┆ 115 │
│ 2 ┆ 4 ┆ 145 │
│ 2 ┆ 5 ┆ 105 │
│ 2 ┆ 6 ┆ 95 │
└────────┴──────────┴──────────┘
shape: (2, 2)
┌────────┬─────────────────────────────┐
│ RaceId ┆ RaceName │
│ --- ┆ --- │
│ i64 ┆ str │
╞════════╪═════════════════════════════╡
│ 1 ┆ La Jolla Half Marathon │
│ 2 ┆ Silver Strand Half Marathon │
└────────┴─────────────────────────────┘
shape: (6, 4)
┌──────────┬─────────┬────────┬───────┐
│ RunnerId ┆ Name ┆ Gender ┆ Age │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str │
╞══════════╪═════════╪════════╪═══════╡
│ 1 ┆ Peter ┆ M ┆ 35-39 │
│ 2 ┆ Jenny ┆ F ┆ 30-34 │
│ 3 ┆ Heather ┆ F ┆ 25-29 │
│ 4 ┆ John ┆ M ┆ 30-34 │
│ 5 ┆ Max ┆ M ┆ 35-39 │
│ 6 ┆ Julie ┆ F ┆ 40-44 │
└──────────┴─────────┴────────┴───────┘
There are multiple relationships across the tables in this database. For example, we can link the same runners across races, or we can link runner characteristics to race results, or race names to results. All of these operations require us to join tables, which is the standard operation performed on relational databases. Once the join operation is done, we can continue with summaries, visualizations or other more advanced analytics tools. In Python we join dataframes by using the join function in polars.
There are different join operations available depending on what your objective is. First, suppose you are interested in comparing the same runner's performance across different races. In this case we need to extract results for runners who show up in both the race 1 and race 2 tables. You can think of this as the intersection between the two tables. This is done by performing an inner merge:
## intersection - returning both race results using polars
inner_merged = Race1Results.join(Race2Results, on = "RunnerId", how = 'inner')
print(inner_merged)
shape: (2, 5)
┌────────┬──────────┬──────────┬──────────────┬────────────────┐
│ RaceId ┆ RunnerId ┆ RaceTime ┆ RaceId_right ┆ RaceTime_right │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞════════╪══════════╪══════════╪══════════════╪════════════════╡
│ 1 ┆ 2 ┆ 120 ┆ 2 ┆ 115 │
│ 1 ┆ 4 ┆ 140 ┆ 2 ┆ 145 │
└────────┴──────────┴──────────┴──────────────┴────────────────┘
Notice that polars renamed the race-specific variables by appending "_right". This is because RaceId and RaceTime have the same names in the race 1 and race 2 tables, and the suffix distinguishes them by indicating which table a column came from (here it tells us that these columns came from Race2Results, which was on the right side of our join).
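If you prefer a more descriptive label than "_right", the join function in polars also accepts a suffix argument. Here is a minimal sketch (the suffix string "_race2" is just an illustrative choice):

## same inner join, but with an explicit suffix for the columns coming from the right table
inner_merged2 = Race1Results.join(Race2Results, on = "RunnerId", how = 'inner', suffix = '_race2')
print(inner_merged2)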
Suppose instead that you were interested in a master file that had information on all races for all runners. You can think of this as the union of the race 1 and race 2 table. You can do this operation by using outer merge:
union_merged = Race1Results.join(Race2Results, on = "RunnerId", how = "outer")
print(union_merged)
shape: (6, 5)
┌────────┬──────────┬──────────┬──────────────┬────────────────┐
│ RaceId ┆ RunnerId ┆ RaceTime ┆ RaceId_right ┆ RaceTime_right │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞════════╪══════════╪══════════╪══════════════╪════════════════╡
│ 1 ┆ 2 ┆ 120 ┆ 2 ┆ 115 │
│ 1 ┆ 4 ┆ 140 ┆ 2 ┆ 145 │
│ null ┆ 5 ┆ null ┆ 2 ┆ 105 │
│ null ┆ 6 ┆ null ┆ 2 ┆ 95 │
│ 1 ┆ 3 ┆ 90 ┆ null ┆ null │
│ 1 ┆ 1 ┆ 100 ┆ null ┆ null │
└────────┴──────────┴──────────┴──────────────┴────────────────┘
Notice that we now have a number of nulls in the merged result. That is because the result - being the union of the two race tables - has one row per runner with all of their race results, and several runners only ran in one race.
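If, for example, you wanted only the runners who appear in just one of the two races, you could filter for nulls on either side. A minimal sketch, assuming the union_merged frame from above:

## runners who ran only one of the two races have a null on one side of the union
single_race = union_merged.filter(pl.col('RaceTime').is_null() | pl.col('RaceTime_right').is_null())
print(single_race)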
Suppose instead you wanted to extract the race data for all races that runners in race 1 have ever run. This can be accomplished by a left merge:
left_merged = Race1Results.join(Race2Results, on = "RunnerId", how = "left")
print(left_merged)
shape: (4, 5)
┌────────┬──────────┬──────────┬──────────────┬────────────────┐
│ RaceId ┆ RunnerId ┆ RaceTime ┆ RaceId_right ┆ RaceTime_right │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞════════╪══════════╪══════════╪══════════════╪════════════════╡
│ 1 ┆ 1 ┆ 100 ┆ null ┆ null │
│ 1 ┆ 2 ┆ 120 ┆ 2 ┆ 115 │
│ 1 ┆ 3 ┆ 90 ┆ null ┆ null │
│ 1 ┆ 4 ┆ 140 ┆ 2 ┆ 145 │
└────────┴──────────┴──────────┴──────────────┴────────────────┘
import pandas as pd
Race1Results = pd.read_csv('data/race1_results.csv')
Race2Results = pd.read_csv('data/race2_results.csv')
Races = pd.read_csv('data/races.csv')
Runners = pd.read_csv('data/runners.csv')
RaceId RunnerId RaceTime
0 1 1 100
1 1 2 120
2 1 3 90
3 1 4 140
RaceId RunnerId RaceTime
0 2 2 115
1 2 4 145
2 2 5 105
3 2 6 95
RaceId RaceName
0 1 La Jolla Half Marathon
1 2 Silver Strand Half Marathon
RunnerId Name Gender Age
0 1 Peter M 35-39
1 2 Jenny F 30-34
2 3 Heather F 25-29
3 4 John M 30-34
4 5 Max M 35-39
5 6 Julie F 40-44
There are multiple relationships across the tables in this database. For example, we can link the same runners across races, or we can link runner characteristics to race results, or race names to results. All of these operations require us to join tables, which is the standard operation performed on relational databases. Once the join operation is done, we can continue with summaries, visualizations or other more advanced analytics tools. In Python we join dataframes by using the merge function in pandas.
There are different merge operations available depending on what your objective is. First, suppose you are interested in comparing the same runner's performance across different races. In this case we need to extract results for runners who show up in both the race 1 and race 2 tables. You can think of this as the intersection between the two tables. This is done by performing an inner merge:
## intersection - returning both race results
inner_merged = pd.merge(Race1Results, Race2Results, on = "RunnerId", how = 'inner')
print(inner_merged)
RaceId_x RunnerId RaceTime_x RaceId_y RaceTime_y
0 1 2 120 2 115
1 1 4 140 2 145
Notice that pandas renamed the race-specific variables by appending "_x" and "_y". This is because RaceId and RaceTime have the same names in the race 1 and race 2 tables and we need to distinguish between them. The "on" option tells pandas which variable the join operation is to be performed on.
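If you prefer more descriptive names, merge also accepts a suffixes argument. Here is a minimal sketch (the suffix strings are just an illustrative choice):

## same inner merge, but with explicit suffixes for the overlapping column names
inner_merged2 = pd.merge(Race1Results, Race2Results, on = "RunnerId", how = 'inner', suffixes = ('_race1', '_race2'))
print(inner_merged2)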
Suppose instead that you were interested in a master file that had information on all races for all runners. You can think of this as the union of the race 1 and race 2 table. You can do this operation by using outer merge:
union_merged = pd.merge(Race1Results, Race2Results, how = "outer", on = "RunnerId")
print(union_merged)
RaceId_x RunnerId RaceTime_x RaceId_y RaceTime_y
0 1.0 1 100.0 NaN NaN
1 1.0 2 120.0 2.0 115.0
2 1.0 3 90.0 NaN NaN
3 1.0 4 140.0 2.0 145.0
4 NaN 5 NaN 2.0 105.0
5 NaN 6 NaN 2.0 95.0
Notice that we now have a number of NaNs in the merged result. That is because the result - being the union of the two race tables - has one row per runner with all of their race results, and several runners only ran in one race.
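If, for example, you wanted only the race 1 runners who did not run race 2, you could filter on the missing values. A minimal sketch, assuming the union_merged frame from above:

## race 1 runners with no race 2 result show up with NaN in the race-2 columns
only_race1 = union_merged[union_merged['RaceTime_y'].isna()]
print(only_race1)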
Suppose instead you wanted to extract the race data for all races that runners in race 1 have ever run. This can be accomplished by a left merge:
left_merged = pd.merge(Race1Results, Race2Results, how = "left", on = "RunnerId")
print(left_merged)
RaceId_x RunnerId RaceTime_x RaceId_y RaceTime_y
0 1 1 100 NaN NaN
1 1 2 120 2.0 115.0
2 1 3 90 NaN NaN
3 1 4 140 2.0 145.0
library(tidyverse)
Race1Results <- read_csv('data/race1_results.csv')
Race2Results <- read_csv('data/race2_results.csv')
Races <- read_csv('data/races.csv')
Runners <- read_csv('data/runners.csv')

Race1Results
Race2Results
Races
Runners
# A tibble: 4 x 3
RaceId RunnerId RaceTime
<dbl> <dbl> <dbl>
1 1 1 100
2 1 2 120
3 1 3 90
4 1 4 140
# A tibble: 4 x 3
RaceId RunnerId RaceTime
<dbl> <dbl> <dbl>
1 2 2 115
2 2 4 145
3 2 5 105
4 2 6 95
# A tibble: 2 x 2
RaceId RaceName
<dbl> <chr>
1 1 La Jolla Half Marathon
2 2 Silver Strand Half Marathon
# A tibble: 6 x 4
RunnerId Name Gender Age
<dbl> <chr> <chr> <chr>
1 1 Peter M 35-39
2 2 Jenny F 30-34
3 3 Heather F 25-29
4 4 John M 30-34
5 5 Max M 35-39
6 6 Julie F 40-44
There are multiple relationships across the tables in this database. For example, we can link the same runners across races, or we can link runner characteristics to race results, or race names to results. All of these operations require us to join tables, which is the standard operation performed on relational databases. Once the join operation is done, we can continue with summaries, visualizations or other more advanced analytics tools.
There are different join operations available depending on what your objective is. First, suppose you are interested in comparing the same runner's performance across different races. In this case we need to extract results for runners who show up in both the race 1 and race 2 tables. You can think of this as the intersection between the two tables. This is done by using the inner_join function:
runners.race12 <- inner_join(Race1Results,Race2Results,by='RunnerId')
runners.race12
# A tibble: 2 x 5
RaceId.x RunnerId RaceTime.x RaceId.y RaceTime.y
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 120 2 115
2 1 4 140 2 145
Notice that R renamed the race-specific variables by appending ".x" and ".y". This is because RaceId and RaceTime have the same names in the race 1 and race 2 tables and we need to distinguish between them. The "by" option informs R on which variable the join operation is to be performed.
Suppose instead that you were interested in a master file that had information on all races for all runners. You can think of this as the union of the race 1 and race 2 table. You can do this operation by using full_join:
runners.race12.all <- full_join(Race1Results,Race2Results,by='RunnerId')
runners.race12.all
# A tibble: 6 x 5
RaceId.x RunnerId RaceTime.x RaceId.y RaceTime.y
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 100 NA NA
2 1 2 120 2 115
3 1 3 90 NA NA
4 1 4 140 2 145
5 NA 5 NA 2 105
6 NA 6 NA 2 95
Notice that we now have a number of NAs in the merged result. That is because the result - being the union of the two race tables - has one row per runner with all of their race results, and several runners only ran in one race.
Suppose instead you wanted to extract the race data for all races that runners in race 1 have ever run. This can be accomplished by a left_join operation:
runners.race1.all <- left_join(Race1Results,Race2Results,by='RunnerId')
runners.race1.all
# A tibble: 4 x 5
RaceId.x RunnerId RaceTime.x RaceId.y RaceTime.y
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 100 NA NA
2 1 2 120 2 115
3 1 3 90 NA NA
4 1 4 140 2 145
Suppose now you wanted to compare race results by age group for race 1. How should we do that? Well, now we need to join the race1 and runners table:
race1.groups <- inner_join(Race1Results,Runners,by='RunnerId')
race1.groups
# A tibble: 4 x 6
RaceId RunnerId RaceTime Name Gender Age
<dbl> <dbl> <dbl> <chr> <chr> <chr>
1 1 1 100 Peter M 35-39
2 1 2 120 Jenny F 30-34
3 1 3 90 Heather F 25-29
4 1 4 140 John M 30-34
Suppose instead you wanted results for runners participating in both races along with the associated race names. We can do this as follows:
runners.race12 <- inner_join(
  inner_join(Race1Results,Races,by='RaceId'),
  inner_join(Race2Results,Races,by='RaceId'),
  by='RunnerId'
)
runners.race12
# A tibble: 2 x 7
RaceId.x RunnerId RaceTime.x RaceName.x RaceId.y RaceTime.y RaceName.y
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
1 1 2 120 La Jolla Half Mar~ 2 115 Silver St~
2 1 4 140 La Jolla Half Mar~ 2 145 Silver St~
Finally, you may sometimes be interested in extracting the records that do not overlap between two tables rather than the ones that do. For example, suppose you were interested in the runners who ran in race 1 but not in race 2. You can think of this as the complement of the runners.race12 table when compared to the original race 1 table. This can be generated by using an anti_join operation:
race1.only <- anti_join(Race1Results,Race2Results,by='RunnerId')
race1.only
# A tibble: 2 x 3
RaceId RunnerId RaceTime
<dbl> <dbl> <dbl>
1 1 1 100
2 1 3 90
Before considering any join operation, first think about what you are trying to accomplish - what is the end result you are looking for? Then find the appropriate join operation consistent with your goal. When you are working with large databases, the wrong type of join can be disastrous. For example, wrongly using a full_join when you should be using an inner_join or left_join can generate enormous data frames which will often crash your computer.
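To see how quickly a join can inflate a result, here is a small hypothetical illustration in polars (the orders and visits tables below are made up for this sketch): when the join key is not unique in either table, every matching combination of rows is returned.

import polars as pl

## toy tables: CustId 1 appears twice in orders and three times in visits
orders = pl.DataFrame({'CustId': [1, 1, 2], 'Amount': [10, 20, 30]})
visits = pl.DataFrame({'CustId': [1, 1, 1, 2], 'Channel': ['web', 'app', 'web', 'app']})

## 2 x 3 + 1 x 1 = 7 rows - already larger than either input table
blown_up = orders.join(visits, on = 'CustId', how = 'inner')
print(blown_up.shape)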
The Structure of a Transactional Database
The classic example of a relational database is a transactional database. This database will contain one table with the actual transactions, i.e., one row for each transaction, where a transaction is an interaction between a certain customer in a certain store buying a certain product at a certain date. An “order” is then a series of transactions made by the same customer on a certain date. The product file will contain one row for each product available with a set of variables capturing various product characteristics. The customer file will contain one row for each customer (with associated customer information), while the store file will contain one row per store (with store information).
We can illustrate the structure of the relations between the files as:
| | Transaction.File | Product.File | Customer.File | Store.File |
|---|---|---|---|---|
| Order ID | X | | | |
| Order Date | X | | | |
| Product ID | X | X | | |
| Customer ID | X | | X | |
| Store ID | X | | | X |
| Volume | X | | | |
| Total Price | X | | | |
| Product Info | | X | | |
| Customer Info | | | X | |
| Store Info | | | | X |
The specific joins to perform in this application will depend on your objectives. For example, if you wanted to investigate how sales of a certain product group varied with the regional location of stores you would need to join both the product file (to get information about which product IDs belong to the group of interest) and the store file (to get the regional location of each store) with the transaction file. On the other hand, if you wanted to know the characteristics of your top customers, you would need to join the transaction file with the customer file.
Example: Retail Sales Data
Here we will look at some retail data for consumer packaged goods in four categories: mouthwash, frozen pizza, pretzels and cereal. The source of the data is the research company dunnhumby. This is sales data for 156 weeks for each of 77 stores. There are a total of 58 products (UPCs) covering the four categories. This type of data is very typical for the consumer packaged goods industry and is used by both retailers and manufacturers.
Suppose we have the following simple objective: What are the total dollar sales of pretzels in stores located in Texas? To meet this objective we have to sum the sales of UPCs that are pretzel UPCs for stores that are located in Texas. Note that the raw transaction file doesn't tell us which UPCs are pretzel UPCs - this information is found in the product data. Furthermore, the transaction file doesn't tell us which stores are located in Texas - this information is in the store data. This means we have to join both the store data AND the product data with the transaction data.
We start by reading the required data, then do two inner joins (left joins would give the same result here since all products appear in both the transaction file and the product file, and all stores appear in both the transaction file and the store file) and then filter down to Texas stores and pretzel UPCs. Note that the store key has different names in the transaction and store files - we need to state this explicitly to get the right join. Finally, we simply sum up to get total sales.
import polars as pl
## read data
products = pl.read_csv('data/products.csv')
stores = pl.read_csv('data/stores.csv')
transactions = pl.read_csv('data/transactions.csv')

## join
allDF = transactions.join(stores, left_on='STORE_NUM', right_on='STORE_ID').join(products, on='UPC')
pretzelTX = allDF.filter((pl.col('ADDRESS_STATE_PROV_CODE') == 'TX') & (pl.col('SUB_CATEGORY') == 'PRETZELS'))

SpendPretzelTX = pretzelTX['SPEND'].sum()
print(SpendPretzelTX)
1446886.7700000007
import pandas as pd
## read data
products = pd.read_csv('data/products.csv')
stores = pd.read_csv('data/stores.csv')
transactions = pd.read_csv('data/transactions.csv')

## join
allDF = transactions.merge(stores, left_on='STORE_NUM', right_on='STORE_ID').merge(products, on='UPC')
pretzelTX = allDF[(allDF.ADDRESS_STATE_PROV_CODE=='TX') & (allDF.SUB_CATEGORY=='PRETZELS')]

SpendPretzelTX = sum(pretzelTX.SPEND)
print(SpendPretzelTX)
1446886.7699998382
library(tidyverse)
products <- read_csv('data/products.csv')
stores <- read_csv('data/stores.csv')
transactions <- read_csv('data/transactions.csv')

## Find total spending on pretzel UPCS in TX stores
pretzelTX <- transactions %>%
  inner_join(stores, by = c('STORE_NUM'='STORE_ID')) %>%
  inner_join(products, by = 'UPC') %>%
  filter(ADDRESS_STATE_PROV_CODE == 'TX',
         SUB_CATEGORY == 'PRETZELS')

SpendPretzelTX <- sum(pretzelTX$SPEND)
SpendPretzelTX
[1] 1446887
Example: DonorsChoose
DonorsChoose is a non-profit organization that raises money for school projects. Potential donors can go to the website and pick among active projects and donate to any project of their choosing. When a project is funded the organization channels the funds to the school in question.
Here we will focus on projects and donations in 2015 and 2016 (if you are interested, you can go to the DonorsChoose website and download the full database). The data is available as two zipped csv files so we start by importing them:
import polars as pl
import zipfile
## read data
with zipfile.ZipFile('projects.zip', 'r') as zip_ref:
    zip_ref.extractall('extracted_files1')

projects = pl.read_csv('extracted_files1/projects.csv')

with zipfile.ZipFile('donations.zip', 'r') as zip_ref:
    zip_ref.extractall('extracted_files2')

donations = pl.read_csv('extracted_files2/donations.csv')
Let us first look at where the projects are located: How many projects are there for each state? We can calculate that by using the value_counts command:
frq = projects['school_state'].value_counts().sort(by="counts", descending=True)
print(frq)
shape: (51, 2)
┌──────────────┬────────┐
│ school_state ┆ counts │
│ --- ┆ --- │
│ str ┆ u32 │
╞══════════════╪════════╡
│ CA ┆ 45901 │
│ NY ┆ 22531 │
│ TX ┆ 21470 │
│ FL ┆ 16998 │
│ … ┆ … │
│ MT ┆ 729 │
│ ND ┆ 404 │
│ VT ┆ 335 │
│ WY ┆ 184 │
└──────────────┴────────┘
California, New York and Texas have the largest number of projects. This makes sense - these are also the three most populous states in the US.
Consider now the following question: How local is giving? In other words, do donors mainly care about their own state or do they give to projects in many states? Alternatively, we can ask where donations to projects located in New York originated. We will later return to this question in more detail, but for now let's focus on donors in New York state. Where do these donors give? We can start by considering only the donation data for donors who reside in New York:
ny_donors = donations.filter(pl.col('donor_state') == "NY")
Now we need to find out where these donations were made. To do this we need information on the location of each project. This is in the projects file. Therefore we need to merge the ny_donors file with the projects file:
ny_donors_projects = ny_donors.join(projects, on="_projectid", how="inner")
This file will have the same number of rows as the ny_donors file, but it will now have added columns with information on the project that each donation was made to - including the state id of the school for each project. We can now simply count the project locations:
frq = ny_donors_projects['school_state'].value_counts().sort(by="counts", descending=True)
print(frq)
shape: (51, 2)
┌──────────────┬────────┐
│ school_state ┆ counts │
│ --- ┆ --- │
│ str ┆ u32 │
╞══════════════╪════════╡
│ NY ┆ 47950 │
│ CA ┆ 7368 │
│ PA ┆ 4168 │
│ FL ┆ 3453 │
│ … ┆ … │
│ MT ┆ 143 │
│ NH ┆ 113 │
│ ND ┆ 44 │
│ WY ┆ 28 │
└──────────────┴────────┘
What would you say this says about how local giving is?
Here we will focus on projects and donations in 2015 and 2016 (if you are interested, you can go to the DonorsChoose website and download the full database). The data is available as two zipped csv files so we start by importing them:
import pandas as pd
## read data
projects = pd.read_csv('data/projects.zip')
donations = pd.read_csv('data/donations.zip')
Let us first look at where the projects are located: How many projects are there for each state? We can calculate that by using the value_counts command:
frq = projects['school_state'].value_counts(normalize=True)
print(frq)
school_state
CA 0.144154
NY 0.070760
TX 0.067428
FL 0.053383
NC 0.045123
IL 0.041531
SC 0.038990
GA 0.037084
PA 0.027216
AZ 0.026958
IN 0.026952
MI 0.026723
WA 0.025624
OK 0.023237
MO 0.022006
MA 0.021663
OH 0.020555
NJ 0.020043
VA 0.018749
TN 0.017326
LA 0.016284
UT 0.014754
MD 0.014547
AL 0.014242
CT 0.014120
OR 0.012345
NV 0.011341
CO 0.011174
WI 0.010760
AR 0.010543
MS 0.010401
KY 0.009742
MN 0.009390
HI 0.006331
IA 0.006209
ID 0.005411
KS 0.005264
WV 0.005160
NM 0.004899
DC 0.004758
ME 0.004259
DE 0.003357
NH 0.003181
NE 0.003062
AK 0.002805
RI 0.002575
SD 0.002393
MT 0.002289
ND 0.001269
VT 0.001052
WY 0.000578
Name: proportion, dtype: float64
California, New York and Texas have the largest number of projects. This makes sense - these are also the three most populous states in the US.
Consider now the following question: How local is giving? In other words, do donors mainly care about their own state or do they give to projects in many states? Alternatively, we can ask where donations to projects located in New York originated. We will later return to this question in more detail, but for now let's focus on donors in New York state. Where do these donors give? We can start by considering only the donation data for donors who reside in New York:
ny_donors = donations[donations['donor_state'] == "NY"]
Now we need to find out where these donations were made. To do this we need information on the location of each project. This is in the projects file. Therefore we need to merge the ny_donors file with the projects file:
ny_donors_projects = pd.merge(ny_donors, projects, on = "_projectid", how = 'inner')
This file will have the same number of rows as the ny_donors file, but it will now have added columns with information on the project that each donation was made to - including the state id of the school for each project. We can now simply count the project locations:
frq = ny_donors_projects['school_state'].value_counts(normalize=True)
print(frq)
school_state
NY 0.458983
CA 0.070527
PA 0.039897
FL 0.033053
NC 0.031004
TX 0.030478
IL 0.023576
SC 0.021681
GA 0.020503
MI 0.017833
MA 0.016320
NJ 0.015708
MO 0.014664
CT 0.013688
VA 0.013133
MD 0.012970
AZ 0.012425
OH 0.011697
IN 0.011171
TN 0.010931
AL 0.009419
OK 0.009247
CO 0.007840
LA 0.007648
WA 0.007093
OR 0.006509
WI 0.006452
NV 0.005810
MN 0.005140
MS 0.004987
UT 0.004346
AR 0.004145
DC 0.003752
KY 0.003599
ID 0.003379
WV 0.003207
ME 0.003121
RI 0.003082
NM 0.002575
DE 0.002498
IA 0.002230
HI 0.002106
KS 0.001934
AK 0.001704
VT 0.001646
SD 0.001637
NE 0.001512
MT 0.001369
NH 0.001082
ND 0.000421
WY 0.000268
Name: proportion, dtype: float64
What would you say this says about how local giving is?
Here we will focus on projects and donations in 2015 and 2016 (if you are interested, you can go to the DonorsChoose website and download the full database). The data is available as two zipped csv files so we start by importing them:
donations <- read_csv('data/donations.zip')
projects <- read_csv('data/projects.zip')
The data contains 1643289 donations made to 318416 projects. There is a lot of information about both donors and projects (you can see a sample of the data if you write glimpse(projects) in R). Let us first look at where the projects are located: How many projects are there for each state? We can calculate that by using the table command:
table(projects$school_state)
AK AL AR AZ CA CO CT DC DE FL GA HI IA
893 4535 3357 8584 45901 3558 4496 1515 1069 16998 11808 2016 1977
ID IL IN KS KY LA MA MD ME MI MN MO MS
1723 13224 8582 1676 3102 5185 6898 4632 1356 8509 2990 7007 3312
MT NC ND NE NH NJ NM NV NY OH OK OR PA
729 14368 404 975 1013 6382 1560 3611 22531 6545 7399 3931 8666
RI SC SD TN TX UT VA VT WA WI WV WY
820 12415 762 5517 21470 4698 5970 335 8159 3426 1643 184
California, New York and Texas have the largest number of projects. This makes sense - these are also the three most populous states in the US. You can easily convert the counts into proportions:
prop.table(table(projects$school_state))
AK AL AR AZ CA CO
0.0028045073 0.0142423748 0.0105428119 0.0269584443 0.1441541882 0.0111740616
CT DC DE FL GA HI
0.0141198935 0.0047579267 0.0033572434 0.0533829958 0.0370835636 0.0063313401
IA ID IL IN KS KY
0.0062088589 0.0054111602 0.0415305764 0.0269521632 0.0052635546 0.0097419728
LA MA MD ME MI MN
0.0162837295 0.0216634842 0.0145470077 0.0042585800 0.0267229034 0.0093902316
MO MS MT NC ND NE
0.0220058037 0.0104014874 0.0022894578 0.0451233606 0.0012687805 0.0030620321
NH NJ NM NV NY OH
0.0031813728 0.0200429627 0.0048992513 0.0113405105 0.0707596352 0.0205548716
OK OR PA RI SC SD
0.0232368976 0.0123454852 0.0272159690 0.0025752475 0.0389898749 0.0023930958
TN TX UT VA VT WA
0.0173263906 0.0674275162 0.0147542837 0.0187490578 0.0010520828 0.0256237124
WI WV WY
0.0107595096 0.0051599166 0.0005778604
Consider now the following question: How local is giving? In other words, do donors mainly care about their own state or do they give to projects in many states? Alternatively, we can ask where donations to projects located in New York originated. We will later return to this question in more detail, but for now let's focus on donors in New York state. Where do these donors give? We can start by considering only the donation data for donors who reside in New York:
ny.donations <- donations %>%
  filter(donor_state=='NY')
This leaves us with a donation file with 104470 donations. Now we need to find out where these donations were made. To do this we need information on the location of each project. This is in the projects file. Therefore we need to join the ny.donations file with the projects file:
ny.donations.projects <- ny.donations %>%
  inner_join(projects,by='_projectid')
This file will have the same number of rows as the ny.donations file, but it will now have added columns with information on the project that each donation was made to - including the state id of the school for each project. We can now simply count the project locations:
table(ny.donations.projects$school_state)
AK AL AR AZ CA CO CT DC DE FL GA HI IA
178 984 433 1298 7368 819 1430 392 261 3453 2142 220 233
ID IL IN KS KY LA MA MD ME MI MN MO MS
353 2463 1167 202 376 799 1705 1355 326 1863 537 1532 521
MT NC ND NE NH NJ NM NV NY OH OK OR PA
143 3239 44 158 113 1641 269 607 47950 1222 966 680 4168
RI SC SD TN TX UT VA VT WA WI WV WY
322 2265 171 1142 3184 454 1372 172 741 674 335 28
prop.table(table(ny.donations.projects$school_state))
AK AL AR AZ CA CO
0.0017038384 0.0094189720 0.0041447305 0.0124246195 0.0705274241 0.0078395712
CT DC DE FL GA HI
0.0136881401 0.0037522734 0.0024983249 0.0330525510 0.0205034938 0.0021058677
IA ID IL IN KS KY
0.0022303054 0.0033789605 0.0235761463 0.0111706710 0.0019335694 0.0035991194
LA MA MD ME MI MN
0.0076481286 0.0163204748 0.0129702307 0.0031205131 0.0178328707 0.0051402316
MO MS MT NC ND NE
0.0146644970 0.0049870776 0.0013688140 0.0310041160 0.0004211735 0.0015123959
NH NJ NM NV NY OH
0.0010816502 0.0157078587 0.0025749019 0.0058102805 0.4589834402 0.0116971379
OK OR PA RI SC SD
0.0092466737 0.0065090457 0.0398966210 0.0030822246 0.0216808653 0.0016368335
TN TX UT VA VT WA
0.0109313679 0.0304776491 0.0043457452 0.0131329568 0.0016464057 0.0070929453
WI WV WY
0.0064516129 0.0032066622 0.0002680195
So 46 percent of the donations originating in New York are made to school projects in New York. What would you say this says about how local giving is?
Reshaping Data
The standard data frame in R and Python is organized in "short and wide" format, i.e., rows are records (e.g., users, transactions, etc.) and columns are fields (i.e., different measures of those users, transactions, etc.). However, there are many situations where this is not the optimal structure for your data. In particular, it is often recommended to organize your data as "tall and thin" instead. This is sometimes referred to as "tidy" data; see here for a lengthy introduction to the idea of tidy data.
Here we will consider the most common problem in tidy data organization. This is the problem where your data looks like this:
Customer | V1 | V2 | V3 | V4 | V5 |
---|---|---|---|---|---|
ID1 | 5 | 5 | 4 | 4 | 8 |
ID2 | 9 | 4 | 3 | 1 | 9 |
ID3 | 1 | 8 | 7 | 2 | 6 |
…but you want it to look like this:
Customer | Variable | Value |
---|---|---|
ID1 | V1 | 5 |
ID1 | V2 | 5 |
ID1 | V3 | 4 |
. | . | . |
. | . | . |
The first version of the data is “short and wide”, while the second is “tall and thin”. For example, when you import data from an Excel sheet it will often be organized with one row per customer/user and one column for each measurement. However, this is often not the ideal way to organize your data. Suppose, for example, that you are analyzing a customer satisfaction survey. Each row is a customer and you may have measured 20 to 30 different dimensions of satisfaction (e.g., ordering process, product, packaging, etc.). Now you want to summarize mean satisfaction for each dimension and different customer segments. The most elegant and efficient way of doing this is first to make the data long and thin and then do a group-by statement.
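As a hypothetical sketch of that pattern (the survey columns and segments below are invented for illustration), in pandas you could melt the survey into tall-and-thin form and then do a single group-by:

import pandas as pd

## toy satisfaction survey: one row per customer, one column per dimension
survey = pd.DataFrame({
    'customer': ['ID1', 'ID2', 'ID3', 'ID4'],
    'segment':  ['A', 'A', 'B', 'B'],
    'ordering': [5, 4, 2, 3],
    'product':  [4, 5, 3, 4],
    'packaging':[3, 4, 5, 2]
})

## make it tall and thin: one row per (customer, dimension) measurement
surveyThin = pd.melt(survey, id_vars=['customer', 'segment'],
                     var_name='dimension', value_name='satisfaction')

## one group-by now gives mean satisfaction per segment and dimension
print(surveyThin.groupby(['segment', 'dimension'])['satisfaction'].mean())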
Let's look at a simple example:
We can do similar wide-to-thin and thin-to-wide actions in Python using the melt and pivot functions in polars:
import polars as pl
states = pl.read_csv('data/states.csv')

statesThin = states.melt(id_vars='state',variable_name='variable',value_name='value')

print(statesThin)
shape: (18, 3)
┌──────────────┬────────────┬─────────┐
│ state ┆ variable ┆ value │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞══════════════╪════════════╪═════════╡
│ California ┆ Population ┆ 21198.0 │
│ Illinois ┆ Population ┆ 11197.0 │
│ Rhode Island ┆ Population ┆ 931.0 │
│ California ┆ Income ┆ 5114.0 │
│ … ┆ … ┆ … │
│ Rhode Island ┆ HS.Grad ┆ 46.4 │
│ California ┆ Frost ┆ 20.0 │
│ Illinois ┆ Frost ┆ 127.0 │
│ Rhode Island ┆ Frost ┆ 127.0 │
└──────────────┴────────────┴─────────┘
melt takes wide format data and makes it thin. The option "id_vars" specifies which variable we do NOT want melted. "variable_name" and "value_name" are the user-specified names of the two other columns in the melted thin data frame. We can return to wide by using pivot:
statesWide = statesThin.pivot(index='state',columns='variable',values='value')
<string>:1: DeprecationWarning: In a future version of polars, the default `aggregate_function` will change from `'first'` to `None`. Please pass `'first'` to keep the current behaviour, or `None` to accept the new one.
print(statesWide)
shape: (3, 7)
┌──────────────┬────────────┬────────┬──────────┬────────┬─────────┬───────┐
│ state ┆ Population ┆ Income ┆ Life.Exp ┆ Murder ┆ HS.Grad ┆ Frost │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞══════════════╪════════════╪════════╪══════════╪════════╪═════════╪═══════╡
│ California ┆ 21198.0 ┆ 5114.0 ┆ 71.71 ┆ 10.3 ┆ 62.6 ┆ 20.0 │
│ Illinois ┆ 11197.0 ┆ 5107.0 ┆ 70.14 ┆ 10.3 ┆ 52.6 ┆ 127.0 │
│ Rhode Island ┆ 931.0 ┆ 4558.0 ┆ 71.9 ┆ 2.4 ┆ 46.4 ┆ 127.0 │
└──────────────┴────────────┴────────┴──────────┴────────┴─────────┴───────┘
Here "index" specifies the row identifier of the wide format, "columns" is the variable name in the thin data that becomes column names in the wide data, and "values" is the column name in the thin data that becomes the values of the columns in the wide data.
We can do similar wide-to-thin and thin-to-wide actions in Python using the melt and pivot_table functions in pandas:
import pandas as pd
states = pd.read_csv('data/states.csv')

statesThin = pd.melt(states,id_vars='state',var_name='variable',value_name='value')

print(statesThin)
state variable value
0 California Population 21198.00
1 Illinois Population 11197.00
2 Rhode Island Population 931.00
3 California Income 5114.00
4 Illinois Income 5107.00
5 Rhode Island Income 4558.00
6 California Life.Exp 71.71
7 Illinois Life.Exp 70.14
8 Rhode Island Life.Exp 71.90
9 California Murder 10.30
10 Illinois Murder 10.30
11 Rhode Island Murder 2.40
12 California HS.Grad 62.60
13 Illinois HS.Grad 52.60
14 Rhode Island HS.Grad 46.40
15 California Frost 20.00
16 Illinois Frost 127.00
17 Rhode Island Frost 127.00
melt takes wide format data and makes it thin. The option "id_vars" specifies which variable we do NOT want melted. "var_name" and "value_name" are the user-specified names of the two other columns in the melted thin data frame. We can return to wide by using pivot_table:
statesWide = pd.pivot_table(statesThin,index='state',columns='variable',values='value')
statesWide = statesWide.reset_index()
print(statesWide)
variable state Frost HS.Grad Income Life.Exp Murder Population
0 California 20.0 62.6 5114.0 71.71 10.3 21198.0
1 Illinois 127.0 52.6 5107.0 70.14 10.3 11197.0
2 Rhode Island 127.0 46.4 4558.0 71.90 2.4 931.0
Here “index” specifies the row names of the wide format, while “columns” is the variable name in the thin data that becomes column names in the wide data. Finally, “values” is the column name in the thin data that becomes the values of the columns in the wide data.
library(tidyverse)
states <- read_csv('data/states.csv')
states
# A tibble: 3 x 7
state Population Income Life.Exp Murder HS.Grad Frost
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 California 21198 5114 71.7 10.3 62.6 20
2 Illinois 11197 5107 70.1 10.3 52.6 127
3 Rhode Island 931 4558 71.9 2.4 46.4 127
This data is not tidy: We want each row to be a (state,measurement) combination. To make it tidy we use the pivot_longer function:
statesThin <- states %>%
  pivot_longer(-state,names_to='variable',values_to = 'value')
This will take all the columns except "state" and stack them into one column called "variable", with all the values placed in another column called "value". The result is
statesThin
# A tibble: 18 x 3
state variable value
<chr> <chr> <dbl>
1 California Population 21198
2 California Income 5114
3 California Life.Exp 71.7
4 California Murder 10.3
5 California HS.Grad 62.6
6 California Frost 20
7 Illinois Population 11197
8 Illinois Income 5107
9 Illinois Life.Exp 70.1
10 Illinois Murder 10.3
11 Illinois HS.Grad 52.6
12 Illinois Frost 127
13 Rhode Island Population 931
14 Rhode Island Income 4558
15 Rhode Island Life.Exp 71.9
16 Rhode Island Murder 2.4
17 Rhode Island HS.Grad 46.4
18 Rhode Island Frost 127
We will see later on that it is much easier to perform group summaries and visualizations of data when it is “tidy”. Note that you can also do the reverse action using pivot_wider:
statesWide <- statesThin %>%
  pivot_wider(names_from = 'variable', values_from = 'value')
statesWide
# A tibble: 3 x 7
state Population Income Life.Exp Murder HS.Grad Frost
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 California 21198 5114 71.7 10.3 62.6 20
2 Illinois 11197 5107 70.1 10.3 52.6 127
3 Rhode Island 931 4558 71.9 2.4 46.4 127
Here the "names_from" option specifies which column's values become the columns in the wide data format, and the "values_from" option is the column in the thin format that contains the values of those columns in the wide format.
Example: Wide Retail Sales Data
Suppose we have store sales data in wide format. The file storeSalesWide.csv contains data on total weekly sales for each of the stores in the retail sales example. Let’s load data along with the store information file:
We can apply the exact same thin-before-join trick with Polars:
import polars as pl
salesWide = pl.read_csv('data/storeSalesWide.csv')
stores = pl.read_csv('data/stores.csv')
Now we are going to melt the wide data and perform a merge with the store information:
salesWide_melt = salesWide.melt(id_vars='WEEK_END_DATE',variable_name='STORE_NUM',value_name='SPEND')

salesWide_melt = salesWide_melt.with_columns(pl.col('STORE_NUM').cast(pl.Int64).alias('STORE_NUM')) # need same type for merge

storeSalesThin = salesWide_melt.join(stores, left_on = 'STORE_NUM', right_on = 'STORE_ID')
Finally, we filter down to the Texas stores and sum:
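The polars chunk for this step is not shown in the source, so here is a minimal sketch that mirrors the pandas version below, assuming the storeSalesThin frame created above:

## keep only Texas stores, then sum weekly spend
storeSalesThinTX = storeSalesThin.filter(pl.col('ADDRESS_STATE_PROV_CODE') == 'TX')

totalSalesTX = storeSalesThinTX['SPEND'].sum()
print(totalSalesTX)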
We can apply the exact same thin-before-join trick in Python:
import pandas as pd
salesWide = pd.read_csv('data/storeSalesWide.csv')
stores = pd.read_csv('data/stores.csv')
Now we are going to melt the wide data and perform a merge with the store information:
df = pd.melt(salesWide,id_vars='WEEK_END_DATE',var_name='STORE_NUM',value_name='SPEND')
df['STORE_NUM'] = df['STORE_NUM'].astype(int) # need same type for merge

storeSalesThin = df.merge(stores, left_on = 'STORE_NUM', right_on = 'STORE_ID')
Finally, we filter down to the Texas stores and sum:
storeSalesThinTX = storeSalesThin[storeSalesThin.ADDRESS_STATE_PROV_CODE=='TX']

totalSalesTX = sum(storeSalesThinTX.SPEND)
print(totalSalesTX)
11355083.889999993
library(tidyverse)

storeSalesWide <- read_csv('data/storeSalesWide.csv')
stores <- read_csv('data/stores.csv') %>%
  mutate(STORE_ID = as.character(STORE_ID))
The storeSalesWide dataframe has 78 columns: the first is a column for week, while the remaining 77 are one column per store. Suppose we have the following objective: Calculate total sales for all stores in Texas. We know that the store file contains information on which stores are located in Texas but how do we join that dataframe with the sales data? That’s difficult since the sales data is given to us in wide format with one column per store. The answer is that BEFORE we make the join, we first convert the sales data to thin, and then we perform the join:
storeSalesThin <- storeSalesWide %>%
  pivot_longer(-WEEK_END_DATE,names_to='STORE_NUM',values_to = 'SPEND') %>%
  inner_join(stores, by = c('STORE_NUM'='STORE_ID'))
Now we can simply filter down to the Texas stores and sum:
storeSalesTX <- storeSalesThin %>%
  filter(ADDRESS_STATE_PROV_CODE == 'TX')

totalSalesTX <- sum(storeSalesTX$SPEND)

totalSalesTX
[1] 11355084
We will see later that this trick is also extremely useful for visualizations.