A Reasonable Approximation

Latest posts

London Cycle Hires and Weather

Preface: I wrote this report for Udacity's "Explore and Summarize Data" module. The structure is kind of strange for a blog post, but I'm submitting the finished report essentially unchanged.

One thing I will note. I find that the cycle hire usage doesn't change much throughout the year. Shortly after submitting, I read this article which finds that it does vary quite a lot. I'm inclined to trust that result more. It's intuitively sensible, and it looks directly at the number of rides taken, instead of looking at a proxy like I do.

Take this as evidence for how much to trust my other results.

My goal is to investigate usage of the London cycle hire scheme, and in particular how it varies with the weather. I'm running an analysis from July 2013 to June 2014.

I'm using two data sets here. Daily weather data comes from Weather Underground, using the weather station at London Heathrow airport.

(London City Airport is closer to the bike stations that I use, but the data from that airport reports 0 precipitation on every single day. The data from Heathrow seems to be more complete, and I expect it to be almost as relevant.)

I collected the cycle hire data myself, over the course of the year, by downloading CSV files from an unofficial API which now appears to be defunct. It has a granularity of about ten minutes. That's about 50,000 entries per docking station for the year, so for this analysis, I'm only using the data from four docking stations near my office.

All data and source code used for this project can be found in the git repository.

Exploring the weather data

Temperature

plot of chunk temp.1v plot of chunk temp.1v

These variables measure the minimum, average, and maximum daily temperatures. The graphs all look similar, and overlap a lot. The shape is a little surprising, as I didn't expect the density graphs to be bimodal. It could potentially be caused by significant differences between summer and winter, with an abrupt shift between the two.

Rainfall

plot of chunk rain.1v plot of chunk rain.1v

According to the rain column, There are over 225 rainy days and only about 125 non-rainy days. But by far the most common bin for precip.mm is the leftmost one. Table of values of precip.mm:

## 
##     0  0.25  0.51  0.76  1.02  2.03  3.05  4.06  5.08   6.1  7.11  7.87 
##   207    35    20     9    17    22    12     8    12     4     4     2 
##  8.89  9.91 10.92 11.94 13.97 
##     3     5     2     1     2

Although more than half of observations have rain == TRUE, more than half of them also have precip.mm == 0, which needs more investigation. Rainfall as measured by precip.mm versus as measured by rain:

plot of chunk rain.precip.2v plot of chunk rain.precip.2v

The two measures don't always agree. Sometimes rain is false but precip.mm is nonzero; and often rain is true but precip.mm is zero. Neither of those is surprising individually: if rain is only counted when the rainfall exceeds a certain threshold, then that threshold could be large (giving false/nonzero) or small (giving true/zero). But the combination suggests that that isn't what's going on, and I don't know what is.

This table counts the anomalies by turning precip.mm into a boolean zero/nonzero (false/true) and comparing it to rain:

##        
##         FALSE TRUE
##   FALSE   119    9
##   TRUE     88  149

There are 88 instances of true/zero, 9 instances of false/nonzero, but the cases where they agree are the most common.

I find precip.mm to me more plausible here. I feel like fewer than half of days are rainy. This website agrees with me, saying that on average, 164 days out of the year are rainy (rain - 237, precip.mm - 158).

Wind

plot of chunk wind.1v plot of chunk wind.1v

These three measures of wind speed are all averages. wind is simply the average wind speed over a day. wind.max is the daily maximum of the average wind speed over a short time period (I think one minute). gust is the same thing, but with a shorter time period (I think 14 seconds).

Unlike with temperature, the three measures look different. All are right-skewed, although gust looks less so. There are several outliers (the isolated points on the box plots), and the quartiles don't overlap. The minimum gust speed (about 24) is almost as high as the median wind.max.

Exploring the bike data

Time between updates

plot of chunk dt.1v

There are a few outliers here. Not all the lines are visible due to rendering artifacts, but above 5000, we only have five entries:

##                     name        prev.updated             updated
## 46779    Earnshaw Street 2013-10-03 08:50:23 2013-10-13 09:20:28
## 46899  Southampton Place 2013-10-03 08:50:22 2013-10-13 09:20:27
## 46918       High Holborn 2013-10-03 08:50:24 2013-10-13 09:20:30
## 47049         Bury Place 2013-10-03 08:50:26 2013-10-13 09:20:32
## 175705 Southampton Place 2014-06-20 17:36:06 2014-06-30 08:30:03

The first four of these happened when my collection script broke and I failed to realize it. The other occurred when Southampton Place was taken out of service temporarily.

Let's zoom in on the lower ones:

plot of chunk dt.1v.left

There are several instances where the time between updates is unusually large, on the order of hours or days. The times of entries with between 2000 and 5000 minutes between updates:

##                     name        prev.updated             updated
## 32650       High Holborn 2013-08-31 15:10:07 2013-09-02 12:30:05
## 32660         Bury Place 2013-08-31 15:10:08 2013-09-02 12:30:07
## 32672  Southampton Place 2013-08-31 15:10:05 2013-09-02 12:30:04
## 32674    Earnshaw Street 2013-08-31 15:10:06 2013-09-02 12:30:05
## 38546       High Holborn 2013-09-14 22:39:00 2013-09-16 08:24:22
## 38719         Bury Place 2013-09-14 22:39:02 2013-09-16 08:24:23
## 38734  Southampton Place 2013-09-14 22:38:58 2013-09-16 08:24:20
## 38735    Earnshaw Street 2013-09-14 22:38:59 2013-09-16 08:24:21
## 84066         Bury Place 2013-12-27 15:40:08 2013-12-29 23:10:14
## 84069       High Holborn 2013-12-27 15:40:06 2013-12-29 23:10:13
## 84073  Southampton Place 2013-12-27 15:40:05 2013-12-29 23:10:11
## 84078    Earnshaw Street 2013-12-27 15:40:05 2013-12-29 23:10:12
## 84186    Earnshaw Street 2013-12-30 00:10:05 2013-12-31 13:10:07
## 84202       High Holborn 2013-12-30 00:10:06 2013-12-31 13:10:09
## 84269  Southampton Place 2013-12-30 00:10:05 2013-12-31 13:10:06
## 84330         Bury Place 2013-12-30 00:10:07 2013-12-31 13:10:11
## 89443  Southampton Place 2014-01-12 20:20:10 2014-01-14 18:40:07
## 89459       High Holborn 2014-01-12 20:20:13 2014-01-14 18:40:11
## 89467         Bury Place 2014-01-12 20:20:14 2014-01-14 18:40:16
## 89524    Earnshaw Street 2014-01-12 20:20:11 2014-01-14 18:40:09
## 121381   Earnshaw Street 2014-03-15 14:50:06 2014-03-17 01:50:04
## 121398      High Holborn 2014-03-15 14:50:07 2014-03-17 01:50:05
## 121444        Bury Place 2014-03-15 14:50:10 2014-03-17 01:50:07
## 121591 Southampton Place 2014-03-15 14:50:05 2014-03-17 01:50:04
## 133765      High Holborn 2014-04-11 16:59:37 2014-04-14 01:29:07
## 133900   Earnshaw Street 2014-04-11 16:59:36 2014-04-14 01:29:05
## 133961        Bury Place 2014-04-11 16:59:38 2014-04-14 01:29:08
## 134027 Southampton Place 2014-04-11 16:59:35 2014-04-14 01:29:05

It looks like these happened to all stations simultaneously, suggesting problems with either my collection script or the API, rather than problems with individual locations.

Entries with less than 60 minutes between updates, no longer on a log scale:

plot of chunk dt.1v.60

In the vast majority of cases, updates are approximately ten minutes apart. This encourages me to take a subset of the data (bikes.all -> bikes), considering only entries with d.updated less than 15 minutes. This eliminates many outliers in future graphs.

Date and time of update

plot of chunk date.time.1v plot of chunk date.time.1v

All times of day are approximately equally represented to within ten minutes, which is good. There are five noticeable troughs preceeded by spikes, but they probably don't signify much. Dates are a lot less uniform, however. Even apart from the ten-day period where my script was broken, many days have significantly fewer updates than typical, and some have none at all.

Number of days spent with a given number of active docks

plot of chunk ndocks.time.2v

It was common for every station to report less than a full complement of docks. At least two had a full complement for less than half the time (High Holborn and Bury place are unclear in that respect). This isn't surprising, since a bike reported as defective will be locked in, using up a slot but not being available for hire.

Journeys taken throughout the year

plot of chunk date.journeys.2v

The time of year makes very little difference to the number of rides. There appears to be a slight sinusoidal relationship, but it's very weak. (I didn't do a PMCC test because that assumes that any relationship is linear, which we would naively expect not to be the case here, and also doesn't look true from the graph.)

Journeys by weekday

plot of chunk weekday.journeys.2v plot of chunk weekday.journeys.2v plot of chunk weekday.journeys.2v

Fewer journeys are taken on weekends. The median number of bikes available doesn't change much throughout the week (5 on monday and friday, 4 on other days), but the distribution does. Saturday and Sunday have noticeably different shapes to the others. They have a single peak, while weekdays are somewhat bimodal, with a small peak where the station is full (probably when people are arriving at work).

(Since the stations have different numbers of docks, I did a graph of fullness rather than of number of bikes. The density plot doesn't show peaks exactly at 0 and 1 because of how the density window works, but histograms of num.bikes and num.spaces show that that's where they are. It would be difficult to use a histogram for this graph because there's no sensible binwidth.)

Change in number of bikes between updates

plot of chunk bikes.prevbikes.name.mv

## 
##  Pearson's product-moment correlation
## 
## data:  bikes$num.bikes and bikes$prev.num.bikes
## t = 2466.8, df = 173250, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9859301 0.9861908
## sample estimates:
##      cor 
## 0.986061

There's very strong correlation between the number of bikes in adjacent entries. This is as expected, especially given what we saw about d.num.bikes previously. The colors here don't show any particular station-dependent trends.

Number of bikes at any given time