import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import datetime
# get raw data
taxi = pd.read_csv('data/yellow_tripdata_2015-06.csv')

# transformations for visualizations
taxi['pickupDateTime'] = pd.to_datetime(taxi['tpep_pickup_datetime'])
taxi['dropoffDateTime'] = pd.to_datetime(taxi['tpep_dropoff_datetime'])
taxi['tripDuration'] = (taxi['dropoffDateTime'] - taxi['pickupDateTime']).dt.total_seconds()/60
taxi['pickupDay'] = taxi['pickupDateTime'].dt.day
taxi['pickupDate'] = taxi['pickupDateTime'].dt.date
taxi['pickupHour'] = taxi['pickupDateTime'].dt.hour
taxi['weekDay'] = taxi['pickupDateTime'].dt.weekday
taxi['weekDay2'] = taxi['pickupDateTime'].dt.strftime('%A')
taxi['paymentType'] = taxi['payment_type'].map({1: 'Credit Card',
                                                2: 'Cash',
                                                3: 'No Charge',
                                                4: 'Other'})
Basic Data Visualization in R and Python
Visualizations bring data to life. A good visualization will give you new insights and will often lead to new ideas for additional analyses or visualizations. As humans, we are much better at processing visual information than numeric information - both in terms of comprehension and speed. So unless you can think of a reason not to, you should always present your raw data AND the results of any analysis you have done as a visualization.
One of the real strengths of both R and Python is the ability to easily visualize even very complex data. See the map on the right? This shows incidents of 6 types of crimes in San Diego for the year 2012. This map shows both the geographical dispersion of different crimes and their actual incidence. You can produce this map with one line of code (you will see how in the maps section). You can even make interactive maps allowing the user to obtain further information by clicking on the map.
In this section we will focus on using the powerful ggplot2 library in R and the seaborn library in Python. Once you have mastered these, you will have a wide range of visualization tools at your disposal with very little coding effort.
You can download the code and data for this module by using the Dropbox file transfer link https://www.dropbox.com/t/vL52OAVBBF1Xqr8h.
Case Study: New York Taxi Cabs
This dataset contains information on every single trip taken with a yellow New York City taxi cab in the month of June, 2015. This is over 12 million trips! You can download data in raw format for other months of the year, green cabs and limousine rides here. The New York data doesn’t contain an individual taxi id code so you cannot link trips for the same cab driver. However, there are data for other cities where this is possible, e.g., Chicago (you can find the raw Chicago data here).
In Python the standard visualization libraries for exploratory analysis are Seaborn and its lower-level cousin matplotlib. You can think of Seaborn as being a user-friendly version of matplotlib. We start by importing those libraries along with pandas and datetime - the main library for working with date/time stamps. We also read in the data and make a few transformations for use in later visualizations.
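The date/time transformations can be tried on a couple of made-up rows before running them on the full file. The sketch below is illustrative only: the two trips are invented, while the column names and the payment-code mapping are the ones used in this module.

```python
import pandas as pd

# Two invented trips, using the raw column names from the taxi file
toy = pd.DataFrame({
    'tpep_pickup_datetime': ['2015-06-01 08:15:00', '2015-06-06 23:50:00'],
    'tpep_dropoff_datetime': ['2015-06-01 08:45:00', '2015-06-07 00:05:00'],
    'payment_type': [1, 2],
})

pickup = pd.to_datetime(toy['tpep_pickup_datetime'])
dropoff = pd.to_datetime(toy['tpep_dropoff_datetime'])

# Duration in minutes: timedelta -> seconds -> minutes
toy['tripDuration'] = (dropoff - pickup).dt.total_seconds() / 60
toy['pickupHour'] = pickup.dt.hour
toy['weekDay2'] = pickup.dt.strftime('%A')  # full weekday name
toy['paymentType'] = toy['payment_type'].map({1: 'Credit Card', 2: 'Cash',
                                              3: 'No Charge', 4: 'Other'})

print(toy[['tripDuration', 'pickupHour', 'weekDay2', 'paymentType']])
```

Checking a transformation on rows where you already know the answer is a cheap way to catch mistakes such as swapped pickup and dropoff columns.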
Let’s start by learning about different aspects of the number of taxi trips.
Number of Trips
We start by looking at the number of trips for each day of the month using a bar plot
plt.figure(figsize=(15, 8))
pl = sns.countplot(x='pickupDay', data=taxi)
pl.set(title='Taxi Pickups by Day of Month', xlabel = 'Day')
plt.show()
This produces a simple bar chart with counts of the number of rides (or rows in the data) for each value of pickupDay. The chart clearly shows a weekday effect repeated throughout the month.
Is the pattern the same for cash trips only?
pl = sns.countplot(x='pickupDay', data=taxi[taxi['paymentType']=='Cash'])
pl.set(title='Taxi Pickups by Day of Month\nCash Trips Only', xlabel = 'Day')
plt.show()
Again we see weekday effects, but the peaks are more pronounced than for all trips above. It’s a bit hard to compare the two plots. How about we plot the main payment types - credit card and cash - on the same bar chart? That’s pretty straightforward using the hue option to plot separate bars for each payment type
paymentTypes = ['Credit Card','Cash']
pl = sns.countplot(x='pickupDay', data=taxi[taxi.paymentType.isin(paymentTypes)],
                   hue = 'paymentType')
pl.set(title='Taxi Pickups by Day and Payment Type', xlabel = 'Day')
plt.show()
There are many more credit card trips than cash trips. Furthermore, the peaks appear on different weekdays. We can’t tell what these weekdays are - we will look at that below.
How about trips by hour of the day?
pl = sns.countplot(x='pickupHour', data=taxi)
pl.set(title='Taxi Pickups by Hour of Day', xlabel = 'Hour')
plt.show()
Between 8am and 3pm there is a stable and roughly constant number of rides. Trip demand then increases between 6pm and 10pm. Above we saw that, overall, there were substantially more credit card rides than cash rides. Is this true throughout the day?
Let’s see how trip counts vary by hour of day and payment type
pl = sns.countplot(x='pickupHour', data=taxi[taxi.paymentType.isin(paymentTypes)],
                   hue = 'paymentType')
pl.set(title='Taxi Pickups by Hour of Day and Payment Type', xlabel = 'Hour')
plt.show()
We see a large variation in the ratio of payment types throughout the day. For example, in the evening there are about twice as many credit card trips compared to cash trips. However, in the early morning it is close to 50-50.
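Reading ratios off side-by-side bars is imprecise, so it can help to compute the shares directly. The following is a minimal sketch on invented data (the seven rows are made up so that the result mirrors the pattern described above); it assumes the pickupHour and paymentType columns defined earlier and uses a row-normalized cross-tabulation.

```python
import pandas as pd

# Invented trips: hour of pickup and payment type
toy = pd.DataFrame({
    'pickupHour': [5, 5, 5, 5, 20, 20, 20],
    'paymentType': ['Cash', 'Cash', 'Credit Card', 'Credit Card',
                    'Credit Card', 'Credit Card', 'Cash'],
})

# Row-normalized cross-tabulation: each row (hour) sums to 1
share = pd.crosstab(toy['pickupHour'], toy['paymentType'], normalize='index')
print(share)
```

On the real data the same call would be applied to the taxi DataFrame after filtering to the two main payment types.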
How does the number of trips vary by weekday? We want to make sure that the days of the week are ordered correctly in the plot, so we enforce the ordering of the x-axis using the order option
weekDayOrder = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
pl = sns.countplot(x='weekDay2', data=taxi, order=weekDayOrder)
pl.set(title='Taxi Pickups by Weekday', xlabel = 'Day of Week')
plt.show()
Substantially more trips on Tuesdays? That sounds weird. As discussed in the R example of this data, this happens because the weekdays do not all occur the same number of times in a single month (e.g., there are 5 Tuesdays and only 4 Wednesdays in June 2015). To correct this - and get a better sense of how trips vary by weekday - we calculate the average number of trips by weekday using a groupby command and then plot the averages
df = taxi.pickupDate.value_counts().reset_index()
df['weekDay'] = pd.to_datetime(df['pickupDate']).dt.strftime('%A')

avgTrips = df.groupby(['weekDay'])['count'].mean().reset_index()

pl = sns.barplot(x='weekDay', y = 'count',
                 order=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'],
                 data=avgTrips)
pl.set(title='Average Taxi Pickups by Weekday', xlabel = 'Day', ylabel = 'Number of Trips')
plt.show()
We see that Fridays and Saturdays have the most trips on average.
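To see why raw weekday counts mislead, consider a hypothetical month in which every single day has exactly the same number of trips. The sketch below builds such a month for June 2015 (the figure of 100 trips per day is invented) and compares weekday totals with weekday averages.

```python
import pandas as pd

# Hypothetical month: every day of June 2015 has exactly 100 trips
days = pd.DataFrame({'pickupDate': pd.date_range('2015-06-01', '2015-06-30')})
days['count'] = 100
days['weekDay'] = days['pickupDate'].dt.strftime('%A')

totals = days.groupby('weekDay')['count'].sum()     # what a raw count plot shows
averages = days.groupby('weekDay')['count'].mean()  # per-day average

print(totals)
print(averages)
```

The totals differ across weekdays only because some weekdays occur five times in the month, while the averages are identical. Any difference that survives averaging on the real data is therefore a genuine weekday effect.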
To highlight day-to-day variations we could go with a line or points plot instead
dayCountsDF = taxi.pickupDay.value_counts().reset_index(name='Count')
dayCountsDF.head()
pickupDay Count
0 5 22798
1 6 22618
2 27 22396
3 12 22067
4 13 21960
pl = sns.lineplot(x='pickupDay', y = 'Count', marker="o", data=dayCountsDF)
pl.set(title='Taxi Pickups by Day of Month', xlabel = 'Day', ylabel = 'Number of Trips')
plt.show()
Here it is easier to ascertain the variation in trip counts throughout the month than in the bar plot. The same holds if we are interested in comparing trips by payment type over the month
paymentTypes = ['Credit Card','Cash']
dayPaymentDF = taxi[taxi.paymentType.isin(paymentTypes)].value_counts(["pickupDay", "paymentType"]).reset_index(name="Count")

pl = sns.lineplot(x='pickupDay', y = 'Count', hue = 'paymentType',marker="o", data=dayPaymentDF)
pl.set(title='Taxi Pickups by Day of Month and Payment', xlabel = 'Day', ylabel = 'Number of Trips')
plt.show()
Trip Duration
Let’s now turn to visualizing the duration of trips. What is the overall distribution of trip durations? We can use a histogram (where we trim the data for the extreme outliers - see the discussion of this under the R example)
maxDur = 120
pl = sns.histplot(data=taxi[(taxi.tripDuration <= maxDur) & (taxi.tripDuration > 0)], x='tripDuration')
pl.set(title='Distribution of Trip Durations', xlabel = 'Minutes')
plt.show()
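Before trimming, it is worth checking how much data the cutoff actually removes. A minimal sketch, using invented durations and the same two conditions as the histogram filter (greater than 0 and at most maxDur minutes):

```python
import pandas as pd

# Invented durations in minutes, including a zero-length record and an extreme outlier
dur = pd.Series([4.0, 7.5, 12.0, 18.0, 25.0, 0.0, 41.0, 1439.0])

maxDur = 120
keep = (dur > 0) & (dur <= maxDur)  # same conditions as the histogram filter

# The mean of a boolean Series is the share of True values
print(f"kept {keep.mean():.1%} of trips, dropped {(~keep).sum()} rows")
```

If the dropped share turned out to be large on the real data, the cutoff would be hiding a meaningful part of the distribution rather than a handful of outliers.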
How do trip durations vary by day of month? Since this is a very skewed distribution we will use the median rather than the mean to represent a “typical” trip for each day. We start by using a groupby command to get the median duration for each day and then plot that using a line plot
durDay = taxi.groupby(['pickupDay'])['tripDuration'].median().reset_index()

pl = sns.lineplot(x='pickupDay', y = 'tripDuration', marker="o", data=durDay)
pl.set(title='Trip Duration by Day of Month', xlabel = 'Day', ylabel = 'Median Duration of Trips (mins)')
plt.show()
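The choice of the median over the mean matters a great deal when a few trips are extremely long. The small example below uses invented durations with a single day-long outlier to show how differently the two summaries react.

```python
from statistics import mean, median

# Invented trip durations in minutes; the last value is an extreme outlier
durations = [5, 8, 10, 12, 15, 1440]

print(mean(durations))    # pulled far upward by the single outlier
print(median(durations))  # stays near the bulk of the data
```

One bad record is enough to move the mean by hours, while the median barely changes, which is why it is the safer summary of a “typical” trip here.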
Here is median duration by day of week
durDay = taxi.groupby(['weekDay2'])['tripDuration'].median().reset_index()

durDay['weekDay2'] = pd.Categorical(durDay['weekDay2'],
                                    categories=weekDayOrder,
                                    ordered=True)

pl = sns.lineplot(x='weekDay2', y = 'tripDuration', marker="o", data=durDay)
pl.set(title='Trip Duration by Day of Week', xlabel = 'Day', ylabel = 'Median Duration of Trips (mins)')
plt.show()
In terms of duration, the longest trips happen mid-week, while the shortest are on weekends.
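The conversion of weekDay2 to an ordered categorical is what keeps the weekdays in calendar order rather than alphabetical order. A small sketch of the idea, using four invented median values:

```python
import pandas as pd

weekDayOrder = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                'Friday', 'Saturday', 'Sunday']

# A groupby on plain strings returns weekdays in alphabetical order
df = pd.DataFrame({'weekDay2': ['Friday', 'Monday', 'Saturday', 'Sunday'],
                   'tripDuration': [11.0, 10.5, 9.8, 9.5]})  # invented medians

df['weekDay2'] = pd.Categorical(df['weekDay2'],
                                categories=weekDayOrder, ordered=True)

# Sorting now follows the category order, not the alphabet
ordered_days = df.sort_values('weekDay2')['weekDay2'].tolist()
print(ordered_days)
```

Without the categorical conversion, Friday would sort before Monday.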
Let’s look at duration by weekday and hour of day. Let’s first do a version where we superimpose the hourly line for each weekday
durDayHour = taxi.groupby(['weekDay2','pickupHour'])['tripDuration'].median().reset_index()
durDayHour['weekDay2'] = pd.Categorical(durDayHour['weekDay2'],
                                        categories=weekDayOrder,
                                        ordered=True)

pl = sns.lineplot(x='pickupHour', y = 'tripDuration', hue='weekDay2', marker="o", data=durDayHour)
pl.set(title='Trip Duration by Hour of Day and Day of Week', xlabel = 'Hour', ylabel = 'Median Duration of Trips (mins)')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.) # put legend outside plot area
plt.show()
Here we clearly see how weekend days stand out. An alternative is to make use of the FacetGrid option in Seaborn to create a “small multiples” plot
g = sns.FacetGrid(durDayHour, hue="weekDay2", col="weekDay2",sharey=True,col_wrap=3,height=4.5, aspect=1)
g = g.map(sns.lineplot, "pickupHour", "tripDuration")
g.add_legend()
Here we impose the same y-axis scaling on all plots with the sharey=True option.
The ggplot2 library is one of the gems of R. The syntax for producing plots may appear a bit strange at first, but once you get it, you will be producing beautiful and insightful visualizations in no time. With ggplot2 you create visualizations by adding layers to a plot. The ggplot2 library is part of the tidyverse, which we always import in an R session, so you don’t need to import it separately.
Any plot in ggplot2 consists of
- Data: what you want to plot, duh!
- Aesthetics: which variables go on the x-axis, y-axis, colors, styles etc.
- Style of plot: Bar, scatter, line etc. These are called plot layers in ggplot and are specified using the syntax geom_layer, e.g., geom_point, geom_line, geom_histogram etc.
We start by loading the tidyverse, a couple of other helpful libraries that we will rely on below and - of course - the data. We then apply a few transformations using the mutate function
library(tidyverse)
library(lubridate)
library(forcats)
library(scales)
taxi <- read_csv('data/yellow_tripdata_2015-06.csv')

taxi <- taxi %>%
  mutate(weekday = wday(tpep_dropoff_datetime,label=TRUE,abbr=TRUE),
hour.trip.start = factor(hour(tpep_pickup_datetime)),
day = factor(mday(tpep_dropoff_datetime)),
trip.duration = as.numeric(difftime(tpep_dropoff_datetime,tpep_pickup_datetime,units="mins")),
trip.speed = ifelse(trip.duration >= 1, trip_distance/(trip.duration/60), NA),
payment_type_label = fct_recode(factor(payment_type),
"Credit Card"="1",
"Cash"="2",
"No Charge"="3",
"Other"="4"))
Rows: 616247 Columns: 19
-- Column specification --------------------------------------------------------
Delimiter: ","
chr (1): store_and_fwd_flag
dbl (16): VendorID, passenger_count, trip_distance, pickup_longitude, picku...
dttm (2): tpep_pickup_datetime, tpep_dropoff_datetime
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Number of Trips
We can start by looking at the total number of cab rides. For example, total number by day of the month
ggplot(data=taxi, aes(x=day)) + geom_bar()
This produces a simple bar chart with counts of the number of rides (or rows in the data) for each value of day. The command aes means “aesthetic” in ggplot. Plot aesthetics are used to tell R what should be plotted, which colors or shapes to use etc. You can also use the “chain” (pipe) syntax in conjunction with ggplot. For example, the command
taxi %>%
  ggplot(aes(x=day)) + geom_bar()
will produce the exact same plot. This is quite useful since you now have all the usual tools available to use prior to calling ggplot. For example, suppose you only wanted trips paid with cash. Then you could simply insert a filter statement prior to the plot command
taxi %>%
  filter(payment_type_label=='Cash') %>%
ggplot(aes(x=day)) + geom_bar()
Let’s compare the number of credit and cash rides
taxi %>%
  filter(payment_type_label %in% c('Credit Card','Cash')) %>%
ggplot(aes(x=day,fill=payment_type_label)) + geom_bar()
There are clearly more credit card rides than cash rides. If you “dodge” the bars you can plot them next to each other instead
taxi %>%
  filter(payment_type_label %in% c('Credit Card','Cash')) %>%
ggplot(aes(x=day,fill=payment_type_label)) + geom_bar(position='dodge')
Next, let’s look at ride activity by time of day
taxi %>%
  ggplot(aes(x=hour.trip.start)) + geom_bar()
Between 8am and 3pm there is a stable and roughly constant number of rides. Trip demand then increases between 6pm and 10pm. Above we saw that, overall, there were substantially more credit card rides than cash rides. Is this true throughout the day?
taxi %>%
  filter(payment_type_label %in% c('Credit Card','Cash')) %>%
ggplot(aes(x=hour.trip.start,fill=payment_type_label)) + geom_bar(position='dodge')
We see a large variation in the ratio of payment types throughout the day. For example, in the evening there are about twice as many credit card trips compared to cash trips. However, in the early morning it is close to 50-50.
Let’s break out trips by day of week
taxi %>%
  ggplot(aes(x=weekday)) + geom_bar()
Substantially more trips on Tuesdays? That sounds weird. Here it is important to remember two things: What the structure of the data is and how R plots the data. Remember that the data are all trips for each day of June 2015. To determine the height of a bar, R will count the number of rows for each value of weekday. If your objective is to compare the number of trips for each day of week, this calculation will only make sense if there are the same number of each weekday in a month. Let’s check
taxi %>%
  group_by(day) %>%
summarize(weekday=weekday[1]) %>%
count(weekday)
# A tibble: 7 x 2
weekday n
<ord> <int>
1 Sun 4
2 Mon 4
3 Tue 5
4 Wed 5
5 Thu 4
6 Fri 4
7 Sat 4
So there were 5 Mondays and Tuesdays but only 4 of every other weekday in June 2015. That’s why Mondays and Tuesdays appear to have the most rides. To correct this, we can manually calculate the number of rides for each day of the month, while recording what weekday it is. Then we can simply average across weekdays and plot the result. Here is one way of doing this
taxi %>%
  group_by(day) %>%
summarize(n = n(),
wday = weekday[1]) %>%
group_by(wday) %>%
summarize(n.trip.mean=mean(n)) %>%
ggplot(aes(x=wday,y=n.trip.mean)) + geom_bar(stat='identity')
Since you have already calculated the height of each bar, you need to tell R which variable captures the bar height (here n.trip.mean) and that no more counting is necessary (stat='identity').
If you don’t like bar charts, you can create point-chart versions of the plots instead. In this case you have to explicitly inform R about what goes on the x and y-axis
taxi %>%
  count(day) %>%
ggplot(aes(x=day,y=n)) + geom_point()
Here it might be good to connect the points by a line to indicate the time-series nature of the data
taxi %>%
  count(day) %>%
ggplot(aes(x=day,y=n)) + geom_point() + geom_line(aes(group=1),linetype='dotted')
You need to tell R which points to connect. The option group=1 simply means all of them. Here is the time of day version using points and lines
taxi %>%
  count(hour.trip.start) %>%
ggplot(aes(x=hour.trip.start,y=n)) +
geom_point() +
geom_line(aes(group=1),linetype='dotted')
Let’s add payment type using a different color for each payment
taxi %>%
  filter(payment_type_label %in% c('Credit Card','Cash')) %>%
count(payment_type_label, hour.trip.start) %>%
ggplot(aes(x=hour.trip.start,y=n,color=payment_type_label,group=payment_type_label)) +
geom_point() +
geom_line(linetype='dotted')
Trip Duration
Let’s now turn to visualizing the duration of trips. What is the overall distribution of trip durations? We can use a histogram
taxi %>%
  ggplot(aes(x=trip.duration)) + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ehh….what?? - that’s a weird histogram. This is the (very standard) problem of outliers. Here are the 5 longest trips in the data
taxi %>%
  arrange(desc(trip.duration)) %>%
select(tpep_pickup_datetime,tpep_dropoff_datetime,trip.duration) %>%
slice(1:5)
# A tibble: 5 x 3
tpep_pickup_datetime tpep_dropoff_datetime trip.duration
<dttm> <dttm> <dbl>
1 2015-06-27 21:42:24 2015-06-30 14:53:08 3911.
2 2015-06-19 02:37:26 2015-06-20 02:37:02 1440.
3 2015-06-07 22:40:43 2015-06-08 22:40:08 1439.
4 2015-06-28 03:47:54 2015-06-29 03:47:16 1439.
5 2015-06-10 01:45:47 2015-06-11 01:45:09 1439.
Alright - those are some long trips! Recall that this is measured in minutes. Is this normal? What percentage of rides are above 2 hours?
sum(taxi$trip.duration > 120)/nrow(taxi)
[1] 0.0009330674
Only 0.09% of trips are longer than 2 hours. So let’s cut off the histogram at 2 hours
taxi %>%
  ggplot(aes(x=trip.duration)) + geom_histogram() + xlim(0,120)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 578 rows containing non-finite values (`stat_bin()`).
Warning: Removed 2 rows containing missing values (`geom_bar()`).
Much better! Most trips are less than 60 minutes with the vast majority of trips between 0 and 25 minutes. This is also a highly skewed distribution so if we want to characterize the “typical” trip duration we should probably not use the average. In the following we will focus on the median trip duration.
Here is the median duration for each day of the month
taxi %>%
  group_by(day) %>%
summarize(med.duration=median(trip.duration)) %>%
ggplot(aes(x=day,y=med.duration)) +
geom_point() +
geom_line(aes(group=1),linetype='dotted')
Let’s “pretty-up” this plot a bit by adding some axis titles and weekday information
taxi %>%
  group_by(day) %>%
summarize(med.duration=median(trip.duration),
weekday=weekday[1]) %>%
ggplot(aes(x=day,y=med.duration,group=1)) +
geom_point(aes(color=weekday),size=5) +
geom_line(linetype='dotted')+
labs(x='Day of Month',
y='Median Trip Duration (Mins.)',
title='Median Trip Duration by Day of Month')
In terms of duration, the longest trips happen mid-week, while the shortest are on weekends. An alternative approach is to add labels directly on the plot
taxi %>%
  group_by(day) %>%
summarize(med.duration=median(trip.duration),
weekday=weekday[1]) %>%
ggplot(aes(x=day,y=med.duration)) + geom_text(aes(label=weekday)) +
geom_line(aes(group=1),linetype='dotted')+
labs(x='Day of Month',
y='Median Trip Duration (Mins.)',
title='Median Trip Duration by Day of Month')
Now let’s look at median trip duration by time of day
taxi %>%
  group_by(hour.trip.start) %>%
summarize(med.duration=median(trip.duration)) %>%
ggplot(aes(x=hour.trip.start,y=med.duration)) +
geom_point() +
geom_line(aes(group=1),linetype='dotted')
Does this pattern stay stable throughout the week? Let’s break out this relationship for each weekday
taxi %>%
  group_by(weekday,hour.trip.start) %>%
summarize(med.duration=median(trip.duration)) %>%
ggplot(aes(x=hour.trip.start,y=med.duration,group=weekday,color=weekday)) +
geom_point(size=3) +
geom_line(size=0.5) +
facet_wrap(~weekday,nrow=1) +
theme(legend.position="none")+
scale_x_discrete(breaks=c(0,3,6,9,12,15,18,21))
`summarise()` has grouped output by 'weekday'. You can override using the
`.groups` argument.
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
i Please use `linewidth` instead.
This visualization is an example of a “facet” and this feature alone makes it worthwhile to learn ggplot. A facet repeats the same base plot for every value of the facet variable - here weekday. This makes it laughably easy to make complex and highly informative plots.
You can even create two-dimensional facets. Suppose we wanted to repeat the above plot for each payment type. Easy
taxi %>%
  filter(payment_type_label %in% c('Credit Card','Cash')) %>%
group_by(weekday,hour.trip.start,payment_type_label) %>%
summarize(med.duration=median(trip.duration)) %>%
ggplot(aes(x=hour.trip.start,y=med.duration,group=weekday,color=weekday)) +
geom_point(size=3) +
geom_line(size=0.5) +
facet_grid(weekday~payment_type_label) +
theme(legend.position="none")+
scale_x_discrete(breaks=c(0,3,6,9,12,15,18,21))
`summarise()` has grouped output by 'weekday', 'hour.trip.start'. You can
override using the `.groups` argument.
Admittedly this is not a good visualization if the objective is to highlight differences between payment types by weekday and time of day. Here is a better version for that purpose
taxi %>%
  filter(payment_type_label %in% c('Credit Card','Cash')) %>%
group_by(weekday,hour.trip.start,payment_type_label) %>%
summarize(med.duration=median(trip.duration)) %>%
ggplot(aes(x=hour.trip.start,y=med.duration,group=payment_type_label,
color=payment_type_label,linetype=payment_type_label,shape=payment_type_label)) +
geom_point(size=2) +
geom_line(size=0.5) +
facet_wrap(~weekday,nrow=1) +
labs(x='Time of Day',
y='Median Trip Duration')+
scale_x_discrete(breaks=c(0,6,12,18))
`summarise()` has grouped output by 'weekday', 'hour.trip.start'. You can
override using the `.groups` argument.
Trips paid with credit card tend to be slightly longer in duration - especially for mid-day and mid-week trips.
Trip Distance
Here is median trip distance for each day of the month
taxi %>%
  group_by(day) %>%
summarize(med.trip=median(trip_distance),
weekday=weekday[1]) %>%
ggplot(aes(x=day,y=med.trip)) + geom_point(aes(color=weekday),size=5) +
geom_line(aes(group=1),linetype='dotted')+
labs(x = 'Day of Month',
y = 'Median Trip Distance (Miles)',
title = 'Median Trip Distance by Day of Month')
In terms of distance, we see the longest trips on weekends. For time of day we get
taxi %>%
  group_by(hour.trip.start) %>%
summarize(med.trip=median(trip_distance)) %>%
ggplot(aes(x=hour.trip.start,y=med.trip)) +
geom_point(size=3) +
geom_line(aes(group=1),linetype='dotted')
Trips are longest at night and shortest during the day. Here is the version where we cut it by weekday
taxi %>%
  group_by(weekday,hour.trip.start) %>%
summarize(med.trip=median(trip_distance)) %>%
ggplot(aes(x=hour.trip.start,y=med.trip,group=weekday,color=weekday)) +
geom_point(size=3) +
geom_line(size=0.5) +
facet_wrap(~weekday,nrow=1) +
theme(legend.position="none")+
scale_x_discrete(breaks=c(0,3,6,9,12,15,18,21))
`summarise()` has grouped output by 'weekday'. You can override using the
`.groups` argument.
Taxi Exercise 1: Trip Speed
Try to visualize trip speed and distance across time of day and day of week. Do you see any interesting patterns? Do your findings make sense when compared to the findings for trip duration and distance?
Fares
Let’s look at fare amounts for each payment type:
taxi %>%
  filter(payment_type_label %in% c('Credit Card','Cash')) %>%
ggplot(aes(x=fare_amount,fill=payment_type_label)) + geom_histogram() + facet_wrap(~payment_type_label) + xlim(0,75)+
theme(legend.position="none")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 914 rows containing non-finite values (`stat_bin()`).
Warning: Removed 4 rows containing missing values (`geom_bar()`).
These distributions are not too different - credit card trips appear to have slightly larger fares.
How do fares vary by day of week?
taxi %>%
  filter(fare_amount < 100) %>%
group_by(day) %>%
summarize(mean.fare=mean(fare_amount),
weekday=weekday[1]) %>%
ggplot(aes(x=day,y=mean.fare)) + geom_point(aes(color=weekday),size=5) +
geom_line(aes(group=1),linetype='dotted')+
labs(x = 'Day of Month',
y = 'Mean Fare ($)',
title = 'Mean Fare Amount by Day of Month')
Average fares are smallest on Saturdays and largest on Thursdays and Sundays.
Let’s investigate the relationship between fare amount, hour of day, weekday and payment type:
taxi %>%
  filter(fare_amount < 100,payment_type_label %in% c('Credit Card','Cash')) %>%
group_by(weekday,hour.trip.start,payment_type_label) %>%
summarize(mean.fare=mean(fare_amount)) %>%
ggplot(aes(x=hour.trip.start,y=mean.fare,color=payment_type_label,group=payment_type_label)) +
geom_point(size=2) + geom_line() +
facet_wrap(~weekday,nrow=1)+
scale_x_discrete(breaks=c(0,6,12,18))+
labs(y ='Mean Fare ($)',
x = 'Hour of Trip Start',
title = 'Mean Fares by Time of Day and Weekday')
`summarise()` has grouped output by 'weekday', 'hour.trip.start'. You can
override using the `.groups` argument.
Mean fares tend to be $2-$3 higher for credit card trips.
Taxi Exercise 2: Tips
Visualize relationships between tips and payment type, and between tips and weekday and time of day. The dollar amount of the tip is tip_amount.
Taxi Exercise 3: Passenger Count
Can you find any interesting patterns for passenger count?
Case Study: New York Citibike
In this section we will visualize parts of the citibike data introduced in the Group Summaries section. We start by reading in the data and adding a few transformations:
citibike <- read_rds('data/201508.rds') %>%
  mutate(day = factor(mday(as.Date(start.time, "%m/%d/%Y"))),
start.hour=factor(start.hour))
How many trips are there for each hour of the day? Let’s check:
ggplot(data=citibike,aes(x=start.hour)) +
geom_bar() +
labs(x = 'Time of Day',
y = 'Number of Trips')+
theme(axis.text.x = element_text(size=8,angle=90))
Hmmm…looks like there are large rush hour effects - both morning and afternoon. But is this true for both user segments?
ggplot(data=citibike,aes(x=start.hour)) + geom_bar() +
labs(x = 'Time of Day',
y = 'Number of Trips')+
theme(axis.text.x = element_text(size=8,angle=90)) +
facet_wrap(~usertype)
No - rush hour spikes seem to be limited to the “Subscriber” segment.
How about trips by weekday?
ggplot(data=citibike,aes(x=weekday)) + geom_bar() +
labs(x = 'Day of Week',
y = 'Number of Trips',
title = 'Number of Bike Trips by Day of Week')
This suffers from the same problem that we encountered for the taxi data - some weekdays occur 5 times in a month while others only occur 4 times. We can correct this the same way as for the taxi data:
citibike %>%
  group_by(day) %>%
summarize(n=n(),
weekday = weekday[1]) %>%
group_by(weekday) %>%
summarize(n.m=mean(n)) %>%
ggplot(aes(x=weekday,y=n.m)) + geom_bar(stat='identity') +
labs(x = 'Day of Week',
y = 'Number of Trips',
title = 'Number of Bike Trips by Day of Week')
The fewest trips occur on weekends. Is this pattern the same for both segments?
citibike %>%
  group_by(day,usertype) %>%
summarize(n=n(),
weekday = weekday[1]) %>%
group_by(weekday,usertype) %>%
summarize(n.m=mean(n)) %>%
ggplot(aes(x=weekday,y=n.m)) + geom_bar(stat='identity') +
labs(x = 'Day of Week',
y = 'Number of Trips',
title = 'Number of Bike Trips by Day of Week') +
facet_wrap(~usertype)
`summarise()` has grouped output by 'day'. You can override using the `.groups`
argument.
`summarise()` has grouped output by 'weekday'. You can override using the
`.groups` argument.
That’s interesting! For “Customers” we see spikes on weekends, while the opposite is true for “Subscribers”. This is consistent with the interpretation of customers as tourists and subscribers as locals.
Let’s put it all together - trips by weekday by segment by time of day:
citibike %>%
  group_by(day,usertype,start.hour) %>%
summarize(n=n(),
weekday = weekday[1]) %>%
group_by(weekday,usertype,start.hour) %>%
summarize(n.m=mean(n)) %>%
ggplot(aes(x=start.hour,y=n.m,fill=weekday)) +
geom_bar(stat='identity') +
labs(x = 'Time of Day',
y = 'Number of Trips',
title = 'Number of Bike Trips by Time of Day and Weekday')+
facet_grid(weekday~usertype) +
theme(axis.text.x = element_text(size=8,angle=90),
legend.position="none")
`summarise()` has grouped output by 'day', 'usertype'. You can override using
the `.groups` argument.
`summarise()` has grouped output by 'weekday', 'usertype'. You can override
using the `.groups` argument.
Even more interesting: On weekends, “Subscribers” behave like “Customers” - no rush hour spikes.
Now let’s turn to analyzing trip durations rather than the number of trips. What does the distribution of trip durations look like? Remember from above that trip duration is recorded in seconds. This is hard to think about. Let’s start by defining a new variable, which is trip duration in minutes. Also, based on analyzing this data in the Group Summaries section, we ignore the few outlier trips of extreme length:
citibike <- citibike %>%
  mutate(tripduration.m = tripduration/60)

citibike %>%
  filter(tripduration.m < 100) %>%
  ggplot(aes(x=tripduration.m)) + geom_histogram()+
  labs(x = 'Trip Duration (min.)',
       y = 'Number of Trips')
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This is a skewed distribution with a long right tail. Most trips are less than 30 minutes. Do the two segments have similar duration distributions?
citibike %>%
  filter(tripduration.m < 100) %>%
ggplot(aes(x=tripduration.m)) +
geom_histogram()+
labs(x = 'Trip Duration (min.)',
y = 'Number of Trips') +
facet_wrap(~usertype)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
These distributions are different - the “Customer” distribution is much less skewed with more weight on longer trips. This is more evident if we plot the density versions of the histograms (a “density” is just a smoothed version of a histogram):
citibike %>%
  filter(tripduration.m < 100) %>%
  ggplot(aes(x=tripduration.m,fill=usertype)) +
  geom_density(alpha=0.2) +
  labs(x = 'Trip Duration (min.)',
       y = 'Density')
Here we clearly see that customers take longer trips than subscribers.
Finally, let’s look at the effect of gender and birth year on trip duration. Do segments defined by gender and age take different trips in terms of duration?
citibike %>%
  filter(!birth.year=='NA', gender %in% c('female','male')) %>%
mutate(birth.year.f=cut(as.numeric(birth.year),
breaks = c(0,1955,1965,1975,1985,1990,2000),
labels=c('<1955','1955-1964','1965-1974','1975-1984','1985-1989','>=1990'))) %>%
group_by(birth.year.f,gender) %>%
summarize(med.trip.dur = median(tripduration.m)) %>%
ggplot(aes(x=birth.year.f,y=med.trip.dur,group=gender,color=gender)) +
geom_point() +
geom_line(linetype='dotted') +
labs(y = 'Trip Duration (min.)',
x = 'Age Group',
title = 'Median Trip Duration by Gender and Birth Year')
`summarise()` has grouped output by 'birth.year.f'. You can override using the
`.groups` argument.
Answer: Yes! Men take shorter trips (in time) than women at any age. Furthermore, younger riders of any gender take shorter trips than older riders.
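For readers working in Python, the birth-year binning can be sketched with pd.cut. This is an illustrative analogue on invented birth years, not the code used for the plot above. One detail to be aware of: R's cut() closes intervals on the right by default, so a birth year of exactly 1955 lands in the '<1955' group in the R code. The sketch sets right=False so that the bin boundaries agree with the labels.

```python
import pandas as pd

birth_year = pd.Series([1950, 1955, 1964, 1975, 1989, 1990, 1999])  # invented

groups = pd.cut(birth_year,
                bins=[0, 1955, 1965, 1975, 1985, 1990, 2000],
                labels=['<1955', '1955-1964', '1965-1974',
                        '1975-1984', '1985-1989', '>=1990'],
                right=False)  # intervals are [lower, upper)

print(groups.tolist())
```

Whether a boundary year belongs to the lower or upper group rarely changes the picture, but it is worth making the choice explicit when the labels state it.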