New York City’s Taxi and Limousine Commission (TLC) Trip Data

Wencong (Priscilla) Li & Trang Le

2017-10-25

NYC Taxi

NYC’s Taxi and Limousine Commission Trip Data is a collection of trip records including fields capturing pick-up and drop-off locations, times, trip distances, fares, rate types, and driver-reported passenger counts. The data was collected and provided to the NYC TLC by technology providers under the Taxicab & Livery Passenger Enhancement Programs.

Getting started

To start using the package, install it and then load it into your R session. Because the ‘nyctaxi’ package currently lives on GitHub and not on CRAN, you have to install it using ‘devtools’.

#install.packages("devtools")
#devtools::install_github("beanumber/nyctaxi")

Using the NYC Taxi Trip Data

Two data frames are included in this package: ‘green_2016_01_sample’ and ‘yellow_2016_01_sample’. These are random samples of 1000 observations, generated by the ‘sample’ function in base R from the January 2016 green and yellow taxi trip data.

library(nyctaxi)
data(green_2016_01_sample)
head(green_2016_01_sample)
## # A tibble: 6 x 21
##   VendorID lpep_pickup_datetime Lpep_dropoff_datetime Store_and_fwd_flag
##      <int>               <dttm>                <dttm>              <chr>
## 1        1  2016-01-01 19:17:11   2016-01-01 19:25:55                  N
## 2        2  2016-01-05 21:02:16   2016-01-05 21:17:13                  N
## 3        1  2016-01-22 19:32:24   2016-01-22 19:49:32                  N
## 4        2  2016-01-30 23:31:57   2016-01-30 23:39:28                  N
## 5        2  2016-01-17 22:55:03   2016-01-17 23:00:39                  N
## 6        2  2016-01-25 19:27:34   2016-01-25 19:33:05                  N
## # ... with 17 more variables: RateCodeID <int>, Pickup_longitude <dbl>,
## #   Pickup_latitude <dbl>, Dropoff_longitude <dbl>,
## #   Dropoff_latitude <dbl>, Passenger_count <int>, Trip_distance <dbl>,
## #   Fare_amount <dbl>, Extra <dbl>, MTA_tax <dbl>, Tip_amount <dbl>,
## #   Tolls_amount <dbl>, Ehail_fee <chr>, improvement_surcharge <dbl>,
## #   Total_amount <dbl>, Payment_type <int>, Trip_type <int>

Extracting, Transforming and Loading the data

To access data covering wider time spans, use the ‘etl’ package to download the data and import it into a database. Please see the documentation for ‘etl_extract’ for further details and examples.

help("etl_extract.etl_nyctaxi")

The code below creates a directory on your desktop and downloads the NYC green taxi trip data for January 2016 into it. It then transforms (cleans) the data and loads it into a SQLite database.

taxi <- etl("nyctaxi", dir = "~/Desktop/nyctaxi/")

taxi %>%
  etl_extract(years = 2016, months = 1, types = c("green")) %>% 
  etl_transform(years = 2016, months = 1, types = c("green")) %>% 
  etl_load(years = 2016, months = 1, types = "green")

Using the NYC Green Taxi Trip Data

library(dplyr)
library(leaflet)
library(lubridate)

We can use leaflet to visualize the pickup and dropoff locations of the 1000 trips in the green taxi trip dataset:

my_trips <- green_2016_01_sample

# drop trips whose pickup longitude was recorded as 0 (no usable location)
one_cab <- my_trips %>% 
  filter(Pickup_longitude != 0)

leaflet(data = one_cab) %>% 
  addTiles() %>% 
  addCircles(lng = ~Pickup_longitude, lat = ~Pickup_latitude) %>% 
  addCircles(lng = ~Dropoff_longitude, lat = ~Dropoff_latitude, color = "green")
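
The filter above only checks the pickup longitude. A slightly stricter variant, sketched here in base R, drops a trip when any of its four coordinates is recorded as 0. The three-row data frame is made up for illustration (it is not taken from the package data), and `has_valid_coords` is a hypothetical helper name, not a function from ‘nyctaxi’.

```r
# Hypothetical toy data: rows 2 and 3 each contain zero coordinates.
toy_trips <- data.frame(
  Pickup_longitude  = c(-73.95, 0, -73.98),
  Pickup_latitude   = c(40.71, 0, 40.75),
  Dropoff_longitude = c(-73.96, -73.94, 0),
  Dropoff_latitude  = c(40.72, 40.70, 0)
)

# Hypothetical helper: TRUE for rows where all four coordinates are non-zero.
has_valid_coords <- function(trips) {
  coord_cols <- c("Pickup_longitude", "Pickup_latitude",
                  "Dropoff_longitude", "Dropoff_latitude")
  # Count the zero entries per row and keep rows with none.
  rowSums(trips[, coord_cols] == 0) == 0
}

valid_trips <- toy_trips[has_valid_coords(toy_trips), ]
nrow(valid_trips)
```

The same logical vector could be used on the sample data, for example `my_trips[has_valid_coords(my_trips), ]`, before passing the result to leaflet.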

We can use lubridate to clean the datetime variables and then derive the weekday of each pickup and dropoff:

clean_datetime <- my_trips %>% 
  mutate(lpep_pickup_datetime = ymd_hms(lpep_pickup_datetime)) %>%
  mutate(Lpep_dropoff_datetime = ymd_hms(Lpep_dropoff_datetime)) %>% 
  mutate(weekday_pickup = weekdays(lpep_pickup_datetime)) %>%
  mutate(weekday_dropoff = weekdays(Lpep_dropoff_datetime))
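
Once both columns are proper datetimes, they can also be subtracted from one another. As a small illustration that is not part of the original analysis, the base R sketch here computes trip durations in minutes for the first two trips shown in the `head()` output earlier. The time zone is assumed to be UTC for the example, and `trip_minutes` is a name chosen for illustration.

```r
# Pickup and dropoff times of the first two sample trips (time zone assumed).
pickup  <- as.POSIXct(c("2016-01-01 19:17:11", "2016-01-05 21:02:16"),
                      tz = "UTC")
dropoff <- as.POSIXct(c("2016-01-01 19:25:55", "2016-01-05 21:17:13"),
                      tz = "UTC")

# difftime() with explicit units avoids R picking seconds, minutes, or hours
# automatically depending on the size of the difference.
trip_minutes <- as.numeric(difftime(dropoff, pickup, units = "mins"))
round(trip_minutes, 2)
```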

We can now count the trips that occurred on each day of the week, along with the average distance, passenger count, and total price:

clean_datetime %>% 
  group_by(weekday_pickup) %>%
  summarize(N = n(), avg_dist = mean(Trip_distance), 
            avg_passengers = mean(Passenger_count), 
            avg_price = mean(Total_amount))
## # A tibble: 7 x 5
##   weekday_pickup     N avg_dist avg_passengers avg_price
##            <chr> <int>    <dbl>          <dbl>     <dbl>
## 1         Friday   197 2.917462       1.324873  14.82411
## 2         Monday   119 2.225882       1.344538  12.91471
## 3       Saturday   183 2.862787       1.448087  14.77727
## 4         Sunday   133 2.940075       1.413534  13.99323
## 5       Thursday   115 2.850000       1.191304  15.27861
## 6        Tuesday   132 2.545985       1.401515  13.75205
## 7      Wednesday   121 2.469256       1.586777  13.11273

It looks like Friday and Saturday had the most trips (197 and 183, respectively).
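
Because `weekday_pickup` is a character column, the summary is sorted alphabetically rather than in calendar order. One possible way to get Monday-to-Sunday order is to store the weekday as a factor with explicit levels. The sketch here uses base R on the six pickup times shown in the `head()` output earlier; the UTC time zone is an assumption, and `sunday_first`, `day_levels`, and `pickup_day` are names chosen for illustration.

```r
# Pickup times copied from the first six rows of green_2016_01_sample
# (time zone assumed to be UTC for this example).
pickups <- as.POSIXct(
  c("2016-01-01 19:17:11", "2016-01-05 21:02:16", "2016-01-22 19:32:24",
    "2016-01-30 23:31:57", "2016-01-17 22:55:03", "2016-01-25 19:27:34"),
  tz = "UTC"
)

# as.POSIXlt()$wday is 0 for Sunday through 6 for Saturday and does not
# depend on the session locale, unlike weekdays().
sunday_first <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                  "Thursday", "Friday", "Saturday")
day_names <- sunday_first[as.POSIXlt(pickups)$wday + 1]

# Factor levels in calendar order, starting on Monday.
day_levels <- c(sunday_first[2:7], "Sunday")
pickup_day <- factor(day_names, levels = day_levels)

# table() follows the factor levels, so counts appear Monday to Sunday.
trips_per_day <- table(pickup_day)
trips_per_day
```

In the dplyr pipeline, the equivalent step would be something like `mutate(weekday_pickup = factor(weekday_pickup, levels = day_levels))` before `group_by()`, assuming the session locale produces English weekday names.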