Preface: I wrote this report for Udacity's "Explore and Summarize Data" module. The structure is kind of strange for a blog post, but I'm submitting the finished report essentially unchanged.
One thing I will note. I find that the cycle hire usage doesn't change much throughout the year. Shortly after submitting, I read this article which finds that it does vary quite a lot. I'm inclined to trust that result more. It's intuitively sensible, and it looks directly at the number of rides taken, instead of looking at a proxy like I do.
Take this as evidence for how much to trust my other results.
My goal is to investigate usage of the London cycle hire scheme, and in particular how it varies with the weather. My analysis covers July 2013 to June 2014.
I'm using two data sets here. Daily weather data comes from Weather Underground, using the weather station at London Heathrow airport.
(London City Airport is closer to the bike stations that I use, but the data from that airport reports 0 precipitation on every single day. The data from Heathrow seems to be more complete, and I expect it to be almost as relevant.)
I collected the cycle hire data myself, over the course of the year, by downloading CSV files from an unofficial API which now appears to be defunct. It has a granularity of about ten minutes. That's about 50,000 entries per docking station for the year, so for this analysis, I'm only using the data from four docking stations near my office.
All data and source code used for this project can be found in the git repository.
These variables measure the minimum, average, and maximum daily temperatures. The graphs all look similar, and overlap a lot. The shape is a little surprising, as I didn't expect the density graphs to be bimodal. It could potentially be caused by significant differences between summer and winter, with an abrupt shift between the two.
According to the rain column, there are over 225 rainy days and only about 125 non-rainy days. But by far the most common bin for precip.mm is the leftmost one. Table of values of precip.mm:
##
##     0  0.25  0.51  0.76  1.02  2.03  3.05  4.06  5.08   6.1  7.11  7.87
##   207    35    20     9    17    22    12     8    12     4     4     2
##  8.89  9.91 10.92 11.94 13.97
##     3     5     2     1     2
Although more than half of observations have rain == TRUE, more than half of them also have precip.mm == 0, which needs more investigation. Rainfall as measured by precip.mm versus as measured by rain:
The two measures don't always agree. Sometimes rain is false but precip.mm is nonzero; and often rain is true but precip.mm is zero. Neither of those is surprising individually: if rain is only counted when the rainfall exceeds a certain threshold, then that threshold could be large (giving false/nonzero) or small (giving true/zero). But the combination suggests that that isn't what's going on, and I don't know what is.
This table counts the anomalies by turning precip.mm into a boolean zero/nonzero (false/true) and comparing it to rain:
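A table like this is a one-line cross-tabulation. A sketch, where weather is an assumed name for the daily weather data frame:

```r
# Cross-tabulate the rain flag against whether any precipitation was recorded.
table(rain = weather$rain, "precip.mm > 0" = weather$precip.mm > 0)
```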
##        precip.mm > 0
## rain    FALSE  TRUE
##   FALSE   119     9
##   TRUE     88   149
There are 88 instances of true/zero and 9 instances of false/nonzero, but the cases where the two measures agree are the most common.
I find precip.mm to be more plausible here. I feel like fewer than half of days are rainy. This website agrees with me, saying that on average, 164 days out of the year are rainy (compare rain, which gives 237, and precip.mm, which gives 158).
These three measures of wind speed are all averages. wind is simply the average wind speed over a day. wind.max is the daily maximum of the average wind speed over a short time period (I think one minute). gust is the same thing, but with a shorter time period (I think 14 seconds).
Unlike with temperature, the three measures look different. All are right-skewed, although gust looks less so. There are several outliers (the isolated points on the box plots), and the quartiles don't overlap. The minimum gust speed (about 24) is almost as high as the median wind.max.
There are a few outliers here. Not all the lines are visible due to rendering artifacts, but above 5000 minutes between updates, we have only five entries:
## name prev.updated updated
## 46779 Earnshaw Street 2013-10-03 08:50:23 2013-10-13 09:20:28
## 46899 Southampton Place 2013-10-03 08:50:22 2013-10-13 09:20:27
## 46918 High Holborn 2013-10-03 08:50:24 2013-10-13 09:20:30
## 47049 Bury Place 2013-10-03 08:50:26 2013-10-13 09:20:32
## 175705 Southampton Place 2014-06-20 17:36:06 2014-06-30 08:30:03
The first four of these happened when my collection script broke and I failed to realize it. The other occurred when Southampton Place was taken out of service temporarily.
Let's zoom in on the lower ones:
There are several instances where the time between updates is unusually large, on the order of hours or days. Here are the entries with between 2000 and 5000 minutes between updates:
## name prev.updated updated
## 32650 High Holborn 2013-08-31 15:10:07 2013-09-02 12:30:05
## 32660 Bury Place 2013-08-31 15:10:08 2013-09-02 12:30:07
## 32672 Southampton Place 2013-08-31 15:10:05 2013-09-02 12:30:04
## 32674 Earnshaw Street 2013-08-31 15:10:06 2013-09-02 12:30:05
## 38546 High Holborn 2013-09-14 22:39:00 2013-09-16 08:24:22
## 38719 Bury Place 2013-09-14 22:39:02 2013-09-16 08:24:23
## 38734 Southampton Place 2013-09-14 22:38:58 2013-09-16 08:24:20
## 38735 Earnshaw Street 2013-09-14 22:38:59 2013-09-16 08:24:21
## 84066 Bury Place 2013-12-27 15:40:08 2013-12-29 23:10:14
## 84069 High Holborn 2013-12-27 15:40:06 2013-12-29 23:10:13
## 84073 Southampton Place 2013-12-27 15:40:05 2013-12-29 23:10:11
## 84078 Earnshaw Street 2013-12-27 15:40:05 2013-12-29 23:10:12
## 84186 Earnshaw Street 2013-12-30 00:10:05 2013-12-31 13:10:07
## 84202 High Holborn 2013-12-30 00:10:06 2013-12-31 13:10:09
## 84269 Southampton Place 2013-12-30 00:10:05 2013-12-31 13:10:06
## 84330 Bury Place 2013-12-30 00:10:07 2013-12-31 13:10:11
## 89443 Southampton Place 2014-01-12 20:20:10 2014-01-14 18:40:07
## 89459 High Holborn 2014-01-12 20:20:13 2014-01-14 18:40:11
## 89467 Bury Place 2014-01-12 20:20:14 2014-01-14 18:40:16
## 89524 Earnshaw Street 2014-01-12 20:20:11 2014-01-14 18:40:09
## 121381 Earnshaw Street 2014-03-15 14:50:06 2014-03-17 01:50:04
## 121398 High Holborn 2014-03-15 14:50:07 2014-03-17 01:50:05
## 121444 Bury Place 2014-03-15 14:50:10 2014-03-17 01:50:07
## 121591 Southampton Place 2014-03-15 14:50:05 2014-03-17 01:50:04
## 133765 High Holborn 2014-04-11 16:59:37 2014-04-14 01:29:07
## 133900 Earnshaw Street 2014-04-11 16:59:36 2014-04-14 01:29:05
## 133961 Bury Place 2014-04-11 16:59:38 2014-04-14 01:29:08
## 134027 Southampton Place 2014-04-11 16:59:35 2014-04-14 01:29:05
It looks like these happened to all stations simultaneously, suggesting problems with either my collection script or the API, rather than problems with individual locations.
Entries with less than 60 minutes between updates, no longer on a log scale:
In the vast majority of cases, updates are approximately ten minutes apart. This encourages me to take a subset of the data (bikes.all -> bikes), considering only entries with d.updated less than 15 minutes. This eliminates many outliers in future graphs.
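In code, the subsetting step is tiny. A sketch, assuming d.updated is stored as minutes between updates (bikes.all, bikes and d.updated are the names used in the text):

```r
# Keep only entries updated less than 15 minutes after the previous one.
bikes <- subset(bikes.all, d.updated < 15)
```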
All times of day are approximately equally represented to within ten minutes, which is good. There are five noticeable troughs preceded by spikes, but they probably don't signify much. Dates are a lot less uniform, however. Even apart from the ten-day period where my script was broken, many days have significantly fewer updates than typical, and some have none at all.
It was common for every station to report less than a full complement of docks. At least two had a full complement for less than half the time (High Holborn and Bury Place are unclear in that respect). This isn't surprising, since a bike reported as defective will be locked in, using up a slot but not being available for hire.
The time of year makes very little difference to the number of rides. There appears to be a slight sinusoidal relationship, but it's very weak. (I didn't do a PMCC (Pearson product-moment correlation) test, because that assumes any relationship is linear, which we would naively expect not to be the case here, and which also doesn't look true from the graph.)
Fewer journeys are taken on weekends. The median number of bikes available doesn't change much throughout the week (5 on Monday and Friday, 4 on other days), but the distribution does. Saturday and Sunday have noticeably different shapes to the others: they have a single peak, while weekdays are somewhat bimodal, with a small peak where the station is full (probably when people are arriving at work).
(Since the stations have different numbers of docks, I did a graph of fullness rather than of number of bikes. The density plot doesn't show peaks exactly at 0 and 1 because of how the density window works, but histograms of num.bikes and num.spaces show that that's where they are. It would be difficult to use a histogram for this graph because there's no sensible binwidth.)
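For reference, fullness is just a ratio of the two columns mentioned above. A sketch; treating num.bikes + num.spaces as a station's capacity is my assumption:

```r
# Fraction of docks currently holding a bike.
bikes$fullness <- bikes$num.bikes / (bikes$num.bikes + bikes$num.spaces)
```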
##
## Pearson's product-moment correlation
##
## data: bikes$num.bikes and bikes$prev.num.bikes
## t = 2466.8, df = 173250, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9859301 0.9861908
## sample estimates:
## cor
## 0.986061
There's a very strong correlation between the number of bikes in adjacent entries. This is as expected, especially given what we saw about d.num.bikes previously. The colors here don't show any particular station-dependent trends.
The correlation also looks strong between the number of bikes at each station at any given time. Since they're all close to each other, that's not surprising. The time is a big factor, with large numbers of bikes in the stations during office hours, and few in the evening and early morning. There's a slight dip around 1pm, which could be related to people using the bikes on their lunch breaks.
This graph gives an overview of global trends, but I mostly use the bikes at specific times. We can zoom in on those:
(when I'm trying to arrive at work)
This is a proportional frequency plot: within each facet of the graph, the heights of the bins add up to 1. Only weekdays are considered.
About 40% of the time, Earnshaw Street has no spaces. That's actually less often than I'd realized. It's directly outside my office, and I haven't even been checking it, because I'd assumed it was always full.
(in case I'm running late)
If I'm late, I have slightly less chance of finding a docking station, but not much less.
Here, rain is the original variable in the dataset, and rain2 simply measures whether precip.mm is nonzero. We have graphs looking at d.num.bikes on each type of day, and tables comparing its mean absolute value.
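Judging by the “local data frame” headers, the tables below come from dplyr. A sketch of equivalent code, with rain2 defined as described above (the report's actual code may differ):

```r
library(dplyr)

# rain2: TRUE whenever any precipitation was recorded.
bikes <- bikes %>% mutate(rain2 = precip.mm > 0)

# Mean absolute change in bike count under each rain measure, and both.
bikes %>% group_by(rain) %>% summarise(mean(abs(d.num.bikes)))
bikes %>% group_by(rain2) %>% summarise(mean(abs(d.num.bikes)))
bikes %>% group_by(rain, rain2) %>% summarise(mean(abs(d.num.bikes)))
```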
## Source: local data frame [2 x 2]
##
## rain mean(abs(d.num.bikes))
## 1 FALSE 0.5160167
## 2 TRUE 0.4156172
## Source: local data frame [2 x 2]
##
## rain2 mean(abs(d.num.bikes))
## 1 FALSE 0.4824637
## 2 TRUE 0.4073405
## Source: local data frame [4 x 3]
## Groups: rain
##
## rain rain2 mean(abs(d.num.bikes))
## 1 FALSE FALSE 0.5184101
## 2 FALSE TRUE 0.4755501
## 3 TRUE FALSE 0.4351990
## 4 TRUE TRUE 0.4042656
Earlier I said I feel like precip.mm is more accurate than rain. Despite that, rain seems to be capturing something that precip.mm doesn't, because bike usage responds slightly more to it. This would seem to suggest that days where rain is true but precip.mm is zero have less bike usage than average; and indeed this is what we see.
Taking rain to be our measure, slightly over 70% of observations had no bikes added or removed on rainy days, and slightly under 70% on non-rainy days. The mean absolute difference is about 25% higher on non-rainy days.
## Source: local data frame [2 x 2]
##
## fog mean(abs(d.num.bikes))
## 1 FALSE 0.4488018
## 2 TRUE 0.4568736
##
## Pearson's product-moment correlation
##
## data: bikes$t and abs(bikes$d.num.bikes)
## t = 31.414, df = 173250, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07057403 0.07993830
## sample estimates:
## cor
## 0.07525782
##
## Pearson's product-moment correlation
##
## data: bikes$wind and abs(bikes$d.num.bikes)
## t = -22.389, df = 173250, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.05840721 -0.04901677
## sample estimates:
## cor
## -0.05371317
Unlike rain, it seems that fog, wind and temperature make approximately no difference. The mean absolute difference in the number of bikes is about the same regardless of fog, and its correlations with temperature and wind speed are close to zero.
Rain reduces the variance, with fewer bikes during office hours and more outside of them.
With the data in its current format, not all the questions we want to ask are easy to answer. For example: how does the number of bikes at one station correlate with the number at another at any given time? I previously said it “looks strong”, but that's pretty vague.
To answer questions like that, we need to be somewhat forgiving with our definition of 'any given time'. Updates don't necessarily happen simultaneously, so we need to bin them together.
I'm going to create bins ten minutes wide, and assign every observation to a bin. Then in each bin, we can ask how many bikes were at each station. Using this, we can check correlation between each station:
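A sketch of that binning step (the column names follow earlier sections; the report's actual code may differ):

```r
library(dplyr)
library(tidyr)

# Assign each observation to a ten-minute bin, keep one reading per
# station per bin, then spread stations into columns.
binned <- bikes %>%
  mutate(bin = floor(as.numeric(updated) / 600) * 600) %>%
  group_by(bin, name) %>%
  summarise(num.bikes = last(num.bikes)) %>%
  ungroup() %>%
  spread(name, num.bikes)

# Pairwise correlations between stations at matching times.
cor(binned[, -1], use = "pairwise.complete.obs")
```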
Correlations range between 0.703 and 0.758, and the scatter plots and density histograms all look pretty similar. Does the correlation depend on time? Let's go for 0930, 1800, midnight, and noon.
The correlations are almost all lower. That surprised me, but I think it's an example of Simpson's paradox: much of the overall correlation comes from the shared daily cycle, and fixing the time of day removes that source of covariation. I note that the darkest points in the graph are at midnight, with no bikes in any station much of the time. Bikes are periodically moved in vans to account for anticipated demand; I assume that these stations are emptied most nights to prepare for people coming to work in the morning.
An interesting point is that the weakest correlation on any of the graphs is 0.149, between Earnshaw Street and Bury Place at 1800. But the strongest correlation at a specific time is 0.757, also between those two stations, at 0930.
We also see the density charts sometimes having very different shapes, especially at 0930 and 1800. But this seems to be at least partly to do with the way that ggpairs chooses the axes on its density plots. For example, here's 0930:
The troughs look a lot less significant now.
We can view a histogram of the total number of bikes available at different times:
We see heavy rightward skews overnight (most of the mass near zero), with much flatter, somewhat left-skewed distributions during office hours, and gradual transitions between the two.
We can also check correlation between times more distant than a single tick. If I check the slots available when I leave the house, can I learn how many will be there when I arrive?
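One way to line up the two vectors is to pair each station-day's 09:00 reading with its 09:30 reading. A sketch (spaces, date and time.of.day are assumed names; spaces.0900.0930 appears in the model output further down):

```r
library(dplyr)

# Pair observations by station and date, renaming the counts as we go.
spaces.0900.0930 <- inner_join(
  spaces %>% filter(time.of.day == "09:00") %>%
    select(date, name, at.0900 = num.spaces),
  spaces %>% filter(time.of.day == "09:30") %>%
    select(date, name, at.0930 = num.spaces),
  by = c("date", "name"))

with(spaces.0900.0930, cor.test(at.0900, at.0930))
```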
##
## Pearson's product-moment correlation
##
## data: at.0900 and at.0930
## t = 68.675, df = 1228, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8785868 0.9017383
## sample estimates:
## cor
## 0.8907389
This is a good correlation! Does it depend on the rain?
##
## Pearson's product-moment correlation
##
## data: at.0900[rain] and at.0930[rain]
## t = 55.466, df = 816, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8737218 0.9025687
## sample estimates:
## cor
## 0.8890242
##
## Pearson's product-moment correlation
##
## data: at.0900[!rain] and at.0930[!rain]
## t = 39.748, df = 410, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8692649 0.9093735
## sample estimates:
## cor
## 0.8910456
Not much, if at all.
We can construct a model
##
## Call:
## lm(formula = at.0930 ~ at.0900, data = spaces.0900.0930)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.334 -1.502 0.561 1.708 16.477
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.43496 0.15556 -9.225 <2e-16 ***
## at.0900 0.97899 0.01426 68.675 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.802 on 1228 degrees of freedom
## (69 observations deleted due to missingness)
## Multiple R-squared: 0.7934, Adjusted R-squared: 0.7932
## F-statistic: 4716 on 1 and 1228 DF, p-value: < 2.2e-16
with an R² of 0.79, which is okay. But this isn't the best we can do, because it groups all stations together. Ideally we would create one model per station, with inputs from every station.
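The per-station fits can be produced in a loop. A sketch (the formula and the spaces.tmp name are taken from the output below; sp, hh, bp and es are the four stations' 09:00 values):

```r
# One model per station, each using all four stations' 09:00 counts
# as predictors of that station's 09:30 count.
stations <- c("Southampton Place", "High Holborn",
              "Bury Place", "Earnshaw Street")
models <- lapply(stations, function(station) {
  lm(at.0930 ~ sp + hh + bp + es,
     data = spaces.tmp[spaces.tmp$name == station, ])
})
lapply(models, summary)
```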
##
## Call:
## lm(formula = at.0930 ~ sp + hh + bp + es, data = spaces.tmp[spaces.tmp$name ==
## "Southampton Place", ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.566 -1.857 -0.148 1.152 15.420
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.79625 0.47090 -1.691 0.0919 .
## sp 0.74101 0.04179 17.731 <2e-16 ***
## hh 0.05424 0.05307 1.022 0.3075
## bp 0.11811 0.05092 2.320 0.0210 *
## es 0.07909 0.04550 1.738 0.0832 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.17 on 296 degrees of freedom
## (17 observations deleted due to missingness)
## Multiple R-squared: 0.736, Adjusted R-squared: 0.7324
## F-statistic: 206.3 on 4 and 296 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = at.0930 ~ sp + hh + bp + es, data = spaces.tmp[spaces.tmp$name ==
## "High Holborn", ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4068 -1.1295 0.1503 1.2304 8.3941
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.87268 0.29894 -9.610 < 2e-16 ***
## sp 0.08354 0.02653 3.149 0.00181 **
## hh 0.76021 0.03369 22.567 < 2e-16 ***
## bp 0.09533 0.03232 2.949 0.00344 **
## es 0.15937 0.02888 5.518 7.5e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.012 on 296 degrees of freedom
## (17 observations deleted due to missingness)
## Multiple R-squared: 0.8349, Adjusted R-squared: 0.8327
## F-statistic: 374.2 on 4 and 296 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = at.0930 ~ sp + hh + bp + es, data = spaces.tmp[spaces.tmp$name ==
## "Bury Place", ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3465 -1.3008 0.3121 1.4809 9.4211
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.28068 0.32734 -13.077 < 2e-16 ***
## sp 0.18778 0.02907 6.460 4.32e-10 ***
## hh 0.03132 0.03687 0.850 0.396253
## bp 0.91255 0.03538 25.796 < 2e-16 ***
## es 0.11197 0.03160 3.543 0.000459 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.201 on 295 degrees of freedom
## (18 observations deleted due to missingness)
## Multiple R-squared: 0.8969, Adjusted R-squared: 0.8955
## F-statistic: 641.5 on 4 and 295 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = at.0930 ~ sp + hh + bp + es, data = spaces.tmp[spaces.tmp$name ==
## "Earnshaw Street", ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.8978 -1.4508 0.3118 1.3272 11.8323
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.98653 0.35005 -8.532 7.60e-16 ***
## sp 0.05579 0.03107 1.796 0.0735 .
## hh 0.03405 0.03945 0.863 0.3887
## bp 0.17361 0.03785 4.587 6.65e-06 ***
## es 0.83329 0.03382 24.638 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.356 on 296 degrees of freedom
## (17 observations deleted due to missingness)
## Multiple R-squared: 0.8579, Adjusted R-squared: 0.856
## F-statistic: 446.9 on 4 and 296 DF, p-value: < 2.2e-16
Southampton Place's model has slightly regressed, but the others have improved slightly. In particular, Bury Place gets an R² of 0.89, which is pretty good. (It's important to note that this doesn't make our model worse for Southampton Place than the aggregate model; the aggregate model was just overconfident on that station.)
The total number of bikes available changes gradually throughout the day, with few bikes typically available at night, but often many available during the daytime. The distribution looks left-skewed from around 10:00 to 17:00, and right-skewed from around 19:00 to 07:30. The left skew is never as extreme as the right skew, but because the stations have different numbers of slots, that doesn't tell us much.
This time around, I restricted the graph to weekdays only. It's rare for the number of spaces to go up between 09:00 and 09:30. All four stations have similar usage patterns.
At 09:00, if there are five or fewer spaces available, it looks as though the most common single outcome at 09:30 is no spaces at all.
Points above the dotted black line are ones where more spaces were available at 09:30 than at 09:00. (Caveat: I've applied slight jittering, so points very close to that line are ones where the same number of spaces were available.) There are obviously far fewer of them. However, the top-left corner of the graph has a few points in it where the bottom-right corner is empty. The number of bikes never goes down by more than eleven, but it goes up by as much as fifteen.
I took advantage of the binning to calculate per-bin summary statistics. All stations show similar patterns: at night, there are few bikes available; during office hours, there are almost always some, and the 10-90 percentile range is a lot higher. The trough around 1pm in the previous version of this plot no longer shows up, which makes me suspect it was simply an artifact of the smoothing method.
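A sketch of the per-bin summaries behind this plot (time.of.day is an assumed column name for the ten-minute bin within the day):

```r
library(dplyr)

# Median and 10th/90th percentiles of available bikes, per station
# and per time-of-day bin.
bikes %>%
  group_by(name, time.of.day) %>%
  summarise(p10 = quantile(num.bikes, 0.1),
            med = median(num.bikes),
            p90 = quantile(num.bikes, 0.9))
```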
During the day, the number of bikes available is generally ranked by the number of docking slots at each station - so High Holborn has the fewest, and Bury Place has the most. When the bikes are taken around 18:00, High Holborn seems to lose them more slowly than the other stations. For Earnshaw Street and especially Bury Place, the 90th percentile lines suggest that those two stations were often completely full.
I've learned a lot about how to fight ggplot when it doesn't do exactly what I want by default, and in particular about how to shape my data for it.
I feel like a data frame isn't an ideal structure for the data I have. The fact that I had to create prev.* and d.* copies of the columns that need them seems suboptimal; ideally I would have been able to refer directly to offset rows in the data. (For example, there's currently no easy way to ask “what's the difference between the number of bikes now and 30 minutes ago?”) But I couldn't find anything that worked better. In particular, time series only allow one data type, so I would have had to fight to use them at all, and I don't know if they would have been any more useful.
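For what it's worth, dplyr's lag() offers a lighter way to build the row-offset columns, though it still doesn't answer time-based questions like “30 minutes ago”. A sketch using the column names from this report:

```r
library(dplyr)

# Rebuild prev.num.bikes and d.num.bikes without hand-maintained copies.
bikes <- bikes %>%
  arrange(name, updated) %>%
  group_by(name) %>%
  mutate(prev.num.bikes = lag(num.bikes),
         d.num.bikes = num.bikes - prev.num.bikes) %>%
  ungroup()
```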
My data set itself isn't ideal, particularly in the amount of missing data. Unfortunately, I don't think any better historical bike record data is available. I think I have enough data to trust my conclusions.
In general, it seems that weather doesn't have much impact on bike usage. I checked rain, fog, temperature and wind speed, and only rain made a significant difference. But since the rainfall data seems to be internally inconsistent, I don't know how much we can learn from it. It would be useful to validate it from another source. We might also learn more with finer-grained weather data. For example, when predicting bike availability at a specific time, it doesn't help much if we know whether or not it rained at all on a given day; but it might help more to know whether it was raining at that particular time.
On the other hand, we can make pretty good predictions about future bike (and slot) availability just from current availability. An ambitious future project might be a prediction system. A user could specify a station and an arrival time, and the system could tell her how likely it would be that she could find a slot in that station and nearby ones, and suggest an earlier arrival time that would increase that chance.
One thing I didn't examine was public holidays. For example, we might ask whether, on plot 2 above, many of the points where spaces were freed up fell on holidays. (We can count 85 points above the line. At most 8*4 = 32 of them could fall on public holidays: eight public holidays in the year, times four stations. But that's still potentially a third of them.)
After initially submitting this report, I noticed a big problem. All timestamps were collected and reported in physical (UTC) time, but bike usage patterns follow local clock time, which in London is offset by an hour during the summer. So some of my graphs, particularly later ones, were mixing in data from two different clock times (e.g. 09:00 and 10:00) as if they were the same. My first submission was rejected for unrelated reasons, and I've corrected the error in all later versions.
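The fix itself is a one-liner if the timestamps are POSIXct values in UTC, which is my assumption here:

```r
library(lubridate)

# with_tz() keeps each instant the same but relabels it in London clock
# time, so 09:00 means 09:00 on the clock all year round.
bikes$updated <- with_tz(bikes$updated, tzone = "Europe/London")
```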
Posted on 19 August 2015
Comments elsewhere: Hacker News