NYC’s Taxi and Limousine Commission Trip Data is a collection of trip records including fields capturing pick-up and drop-off locations, times, trip distances, fares, rate types, and driver-reported passenger counts. The data was collected and provided to the NYC TLC by technology providers under the Taxicab & Livery Passenger Enhancement Programs.
The ‘nyctaxi’ package currently lives on GitHub and not on CRAN, so you have to install it using ‘devtools’ before you can load it into your R session.
#install.packages("devtools")
#devtools::install_github("beanumber/nyctaxi")
Two data frames are included in this package: ‘green_2016_01_sample’ and ‘yellow_2016_01_sample’. Each is a random sample of 1000 observations, generated by the ‘sample’ function in base R from the January 2016 green and yellow taxi trip data, respectively.
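The sampling step described above can be sketched in base R. This is only an illustration under stated assumptions: `full_month` is a hypothetical toy stand-in for a full month of trips, and the seed is arbitrary; the seed and script actually used to build the packaged samples are not given here.

```r
set.seed(2016)  # arbitrary seed, chosen only to make this sketch reproducible

# Toy stand-in for a full month of trip records (hypothetical data,
# not real TLC values); the real monthly files are far larger.
full_month <- data.frame(
  trip_id = seq_len(5000),
  Trip_distance = round(runif(5000, min = 0.1, max = 20), 2)
)

# Draw 1000 row indices without replacement, then subset those rows
sampled_rows <- sample(nrow(full_month), size = 1000)
trip_sample <- full_month[sampled_rows, ]

nrow(trip_sample)
```

Because `sample` draws without replacement by default, no trip appears twice in the result.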
library(nyctaxi)
data(green_2016_01_sample)
head(green_2016_01_sample)
## # A tibble: 6 x 21
##   VendorID lpep_pickup_datetime Lpep_dropoff_datetime Store_and_fwd_flag
##      <int>               <dttm>                <dttm>              <chr>
## 1        1  2016-01-01 19:17:11   2016-01-01 19:25:55                  N
## 2        2  2016-01-05 21:02:16   2016-01-05 21:17:13                  N
## 3        1  2016-01-22 19:32:24   2016-01-22 19:49:32                  N
## 4        2  2016-01-30 23:31:57   2016-01-30 23:39:28                  N
## 5        2  2016-01-17 22:55:03   2016-01-17 23:00:39                  N
## 6        2  2016-01-25 19:27:34   2016-01-25 19:33:05                  N
## # ... with 17 more variables: RateCodeID <int>, Pickup_longitude <dbl>,
## # Pickup_latitude <dbl>, Dropoff_longitude <dbl>,
## # Dropoff_latitude <dbl>, Passenger_count <int>, Trip_distance <dbl>,
## # Fare_amount <dbl>, Extra <dbl>, MTA_tax <dbl>, Tip_amount <dbl>,
## # Tolls_amount <dbl>, Ehail_fee <chr>, improvement_surcharge <dbl>,
## # Total_amount <dbl>, Payment_type <int>, Trip_type <int>
To access data covering wider time spans, use the ‘etl’ package to download the data and import it into a database. Please see the documentation for ‘etl_extract’ for further details and examples.
help("etl_extract.etl_nyctaxi")
The code below creates a directory on your desktop and downloads the NYC green taxi trip data for January 2016 into it. It then transforms and cleans the data and loads it into a SQLite database.
# create an etl object backed by a local directory
taxi <- etl("nyctaxi", dir = "~/Desktop/nyctaxi/")
# download, clean, and load the January 2016 green taxi data
taxi %>%
  etl_extract(years = 2016, months = 1, types = "green") %>%
  etl_transform(years = 2016, months = 1, types = "green") %>%
  etl_load(years = 2016, months = 1, types = "green")
library(dplyr)
library(leaflet)
library(lubridate)
We can use ‘leaflet’ to visualize the pickup and dropoff locations of the trips in the green taxi sample:
my_trips <- green_2016_01_sample
# drop trips whose pickup longitude is recorded as 0, which is not a valid NYC location
one_cab <- my_trips %>%
  filter(Pickup_longitude != 0)
# pickups use the default circle color; dropoffs are drawn in green
leaflet(data = one_cab) %>%
  addTiles() %>%
  addCircles(lng = ~Pickup_longitude, lat = ~Pickup_latitude) %>%
  addCircles(lng = ~Dropoff_longitude, lat = ~Dropoff_latitude, color = "green")
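The filter above checks only the pickup longitude, so a trip with a zero dropoff coordinate would still be drawn at (0, 0). Whether the sample contains such rows is not stated here; the following base R sketch, using a hypothetical toy data frame `toy_trips`, shows one way to require all four coordinates to be non-zero.

```r
# Toy trips (hypothetical values); a 0 stands for a missing coordinate
toy_trips <- data.frame(
  Pickup_longitude  = c(-73.95, 0, -73.98, -73.92),
  Pickup_latitude   = c(40.80, 0, 40.75, 40.70),
  Dropoff_longitude = c(-73.94, -73.96, 0, -73.90),
  Dropoff_latitude  = c(40.81, 40.77, 0, 40.71)
)

coord_cols <- c("Pickup_longitude", "Pickup_latitude",
                "Dropoff_longitude", "Dropoff_latitude")

# Keep a row only if none of its four coordinates equals 0
has_all_coords <- rowSums(toy_trips[coord_cols] == 0) == 0
mapped_trips <- toy_trips[has_all_coords, ]

nrow(mapped_trips)
```

In this toy example the second row lacks pickup coordinates and the third lacks dropoff coordinates, so only the first and fourth rows are kept.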
We can use ‘lubridate’ to clean the datetime variables and derive the weekday of each pickup and dropoff:
clean_datetime <- my_trips %>%
  mutate(lpep_pickup_datetime = ymd_hms(lpep_pickup_datetime)) %>%
  mutate(Lpep_dropoff_datetime = ymd_hms(Lpep_dropoff_datetime)) %>%
  mutate(weekday_pickup = weekdays(lpep_pickup_datetime)) %>%
  mutate(weekday_dropoff = weekdays(Lpep_dropoff_datetime))
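For readers who want to see what the parsing and weekday steps do without ‘lubridate’, here is a rough base R equivalent. It is a sketch, not the package's method: the UTC time zone is an assumption (it matches the default of `ymd_hms`), and the numeric weekday column is an addition for illustration, since `weekdays` returns names in the session's locale.

```r
# Pickup timestamps copied from the first rows of the sample shown earlier
pickup_chr <- c("2016-01-01 19:17:11", "2016-01-05 21:02:16",
                "2016-01-22 19:32:24", "2016-01-30 23:31:57")

# Parse the strings as POSIXct; "UTC" is an assumption for this sketch
pickup <- as.POSIXct(pickup_chr, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Locale-independent weekday number: 1 = Monday, ..., 7 = Sunday
weekday_num <- as.integer(format(pickup, "%u"))

# Locale-dependent weekday names, as produced by weekdays() in the pipeline
weekday_name <- weekdays(pickup)

data.frame(pickup, weekday_num, weekday_name)
```

The numeric form is handy when the results need to be sorted in calendar order or compared across machines with different locale settings.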
We can now analyze the number of trips that occurred on each day of the week, along with the average distance, passenger count, and total price:
clean_datetime %>%
  group_by(weekday_pickup) %>%
  summarize(N = n(), avg_dist = mean(Trip_distance),
            avg_passengers = mean(Passenger_count),
            avg_price = mean(Total_amount))
## # A tibble: 7 x 5
##   weekday_pickup     N avg_dist avg_passengers avg_price
##            <chr> <int>    <dbl>          <dbl>     <dbl>
## 1         Friday   197 2.917462       1.324873  14.82411
## 2         Monday   119 2.225882       1.344538  12.91471
## 3       Saturday   183 2.862787       1.448087  14.77727
## 4         Sunday   133 2.940075       1.413534  13.99323
## 5       Thursday   115 2.850000       1.191304  15.27861
## 6        Tuesday   132 2.545985       1.401515  13.75205
## 7      Wednesday   121 2.469256       1.586777  13.11273
It looks like Friday and Saturday had the most trips in this sample, with 197 and 183 of the 1000 trips, respectively.
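The summary is sorted alphabetically because the weekday column is a character vector. One way to put the days in calendar order and express the counts as shares is sketched below in base R, using the trip counts copied from the table above. Starting the week on Monday is a choice made for this sketch, not something the data prescribes.

```r
# Trip counts copied from the summary table above
trips_by_day <- data.frame(
  weekday_pickup = c("Friday", "Monday", "Saturday", "Sunday",
                     "Thursday", "Tuesday", "Wednesday"),
  N = c(197L, 119L, 183L, 133L, 115L, 132L, 121L)
)

# Calendar order for the factor levels (Monday first is an assumption)
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")
trips_by_day$weekday_pickup <- factor(trips_by_day$weekday_pickup,
                                      levels = day_levels)

# Sort by the factor, then add each day's share of all trips
trips_by_day <- trips_by_day[order(trips_by_day$weekday_pickup), ]
trips_by_day$share <- trips_by_day$N / sum(trips_by_day$N)

trips_by_day
```

Since the counts sum to 1000, each share is simply the count divided by 1000; Friday, for example, accounts for 0.197 of the sampled trips.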