Preface: I wrote this report for Udacity's "Explore and Summarize Data" module. The structure is kind of strange for a blog post, but I'm submitting the finished report essentially unchanged.
One thing I will note. I find that the cycle hire usage doesn't change much throughout the year. Shortly after submitting, I read this article which finds that it does vary quite a lot. I'm inclined to trust that result more. It's intuitively sensible, and it looks directly at the number of rides taken, instead of looking at a proxy like I do.
Take this as evidence for how much to trust my other results.
My goal is to investigate usage of the London cycle hire scheme, and in particular how it varies with the weather. I'm running an analysis from July 2013 to June 2014.
I'm using two data sets here. Daily weather data comes from Weather Underground, using the weather station at London Heathrow airport.
(London City Airport is closer to the bike stations that I use, but the data from that airport reports 0 precipitation on every single day. The data from Heathrow seems to be more complete, and I expect it to be almost as relevant.)
I collected the cycle hire data myself, over the course of the year, by downloading CSV files from an unofficial API which now appears to be defunct. It has a granularity of about ten minutes. That's about 50,000 entries per docking station for the year, so for this analysis, I'm only using the data from four docking stations near my office.
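As a rough check on that figure, here is a small Python sketch (not part of the original analysis) that counts how many ten-minute samples fit into a year. It assumes a perfectly regular ten-minute interval and no missed updates, so it gives an upper bound rather than the real count.

```python
# Back-of-envelope check: one sample every ten minutes for a full year,
# assuming the collection script never missed an update.
MINUTES_PER_DAY = 24 * 60
SAMPLE_INTERVAL_MIN = 10   # "about ten minutes", per the text
DAYS = 365                 # July 2013 to June 2014

samples_per_day = MINUTES_PER_DAY // SAMPLE_INTERVAL_MIN
samples_per_year = samples_per_day * DAYS

print(samples_per_day, samples_per_year)
```

The upper bound is a little over 50,000 per station; any missed updates bring the real count below that.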
All data and source code used for this project can be found in the git repository.
These variables measure the minimum, average, and maximum daily temperatures. The graphs all look similar, and overlap a lot. The shape is a little surprising, as I didn't expect the density graphs to be bimodal. It could potentially be caused by significant differences between summer and winter, with an abrupt shift between the two.
According to the rain column, there are over 225 rainy days and only about 125 non-rainy days. But by far the most common bin for precip.mm is the leftmost one. Table of values of precip.mm:
##
##      0  0.25  0.51  0.76  1.02  2.03  3.05  4.06  5.08   6.1  7.11  7.87
##    207    35    20     9    17    22    12     8    12     4     4     2
##   8.89  9.91 10.92 11.94 13.97
##      3     5     2     1     2
Although more than half of observations have rain == TRUE, more than half of all observations also have precip.mm == 0, which needs more investigation.
Rainfall as measured by precip.mm versus as measured by rain.
The two measures don't always agree. Sometimes rain is false but precip.mm is nonzero; and often rain is true but precip.mm is zero. Neither of those is surprising individually: if rain is only counted when the rainfall exceeds a certain threshold, then that threshold could be large (giving false/nonzero) or small (giving true/zero). But the combination suggests that that isn't what's going on, and I don't know what is.
This table counts the anomalies by turning precip.mm into a boolean zero/nonzero (false/true) and comparing it to rain. Rows are rain; columns are the precip.mm boolean.
##
##         FALSE TRUE
##   FALSE   119    9
##   TRUE     88  149
There are 88 instances of true/zero, 9 instances of false/nonzero, but the cases where they agree are the most common.
I find precip.mm more plausible here. I feel like fewer than half of days are rainy. This website agrees with me, saying that on average, 164 days out of the year are rainy (rain gives 237, precip.mm gives 158).
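The comparison between the two rain measures can be sketched in a few lines of Python. This is only an illustration, not the code that produced the report: it rebuilds one (rain, precip.mm > 0) pair per day from the four counts quoted above, then tallies the totals and the disagreements. The variable names are my own.

```python
from collections import Counter

# Rebuild one (rain, precip.mm > 0) pair per day from the counts quoted in
# the text; a real analysis would read these pairs from the weather data.
counts = {
    (False, False): 119,  # rain FALSE, precip.mm zero
    (False, True): 9,     # rain FALSE, precip.mm nonzero
    (True, False): 88,    # rain TRUE,  precip.mm zero
    (True, True): 149,    # rain TRUE,  precip.mm nonzero
}
days = [pair for pair, n in counts.items() for _ in range(n)]

crosstab = Counter(days)                              # the 2x2 table
rainy_by_flag = sum(1 for rain, _ in days if rain)    # days with rain TRUE
rainy_by_precip = sum(1 for _, wet in days if wet)    # days with precip.mm > 0
disagreements = sum(1 for rain, wet in days if rain != wet)

print(len(days), rainy_by_flag, rainy_by_precip, disagreements)
```

The two measures disagree on 97 of the 365 days, which is why the choice between them matters.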
These three measures of wind speed are all averages.
wind is simply the average wind speed over a day.
wind.max is the daily maximum of the average wind speed over a short time period (I think one minute).
gust is the same thing, but with a shorter time period (I think 14 seconds).
Unlike with temperature, the three measures look different. All are right-skewed, although gust looks less so. There are several outliers (the isolated points on the box plots), and the quartiles don't overlap. The minimum gust speed (about 24) is almost as high as the median
There are a few outliers here. Not all the lines are visible due to rendering artifacts, but above 5000, we only have five entries:
##                     name        prev.updated             updated
## 46779    Earnshaw Street 2013-10-03 08:50:23 2013-10-13 09:20:28
## 46899  Southampton Place 2013-10-03 08:50:22 2013-10-13 09:20:27
## 46918       High Holborn 2013-10-03 08:50:24 2013-10-13 09:20:30
## 47049         Bury Place 2013-10-03 08:50:26 2013-10-13 09:20:32
## 175705 Southampton Place 2014-06-20 17:36:06 2014-06-30 08:30:03
The first four of these happened when my collection script broke and I failed to realize it. The other occurred when Southampton Place was taken out of service temporarily.
Let's zoom in on the lower ones:
There are several instances where the time between updates is unusually large, on the order of hours or days. The times of entries with between 2000 and 5000 minutes between updates:
##                     name        prev.updated             updated
## 32650       High Holborn 2013-08-31 15:10:07 2013-09-02 12:30:05
## 32660         Bury Place 2013-08-31 15:10:08 2013-09-02 12:30:07
## 32672  Southampton Place 2013-08-31 15:10:05 2013-09-02 12:30:04
## 32674    Earnshaw Street 2013-08-31 15:10:06 2013-09-02 12:30:05
## 38546       High Holborn 2013-09-14 22:39:00 2013-09-16 08:24:22
## 38719         Bury Place 2013-09-14 22:39:02 2013-09-16 08:24:23
## 38734  Southampton Place 2013-09-14 22:38:58 2013-09-16 08:24:20
## 38735    Earnshaw Street 2013-09-14 22:38:59 2013-09-16 08:24:21
## 84066         Bury Place 2013-12-27 15:40:08 2013-12-29 23:10:14
## 84069       High Holborn 2013-12-27 15:40:06 2013-12-29 23:10:13
## 84073  Southampton Place 2013-12-27 15:40:05 2013-12-29 23:10:11
## 84078    Earnshaw Street 2013-12-27 15:40:05 2013-12-29 23:10:12
## 84186    Earnshaw Street 2013-12-30 00:10:05 2013-12-31 13:10:07
## 84202       High Holborn 2013-12-30 00:10:06 2013-12-31 13:10:09
## 84269  Southampton Place 2013-12-30 00:10:05 2013-12-31 13:10:06
## 84330         Bury Place 2013-12-30 00:10:07 2013-12-31 13:10:11
## 89443  Southampton Place 2014-01-12 20:20:10 2014-01-14 18:40:07
## 89459       High Holborn 2014-01-12 20:20:13 2014-01-14 18:40:11
## 89467         Bury Place 2014-01-12 20:20:14 2014-01-14 18:40:16
## 89524    Earnshaw Street 2014-01-12 20:20:11 2014-01-14 18:40:09
## 121381   Earnshaw Street 2014-03-15 14:50:06 2014-03-17 01:50:04
## 121398      High Holborn 2014-03-15 14:50:07 2014-03-17 01:50:05
## 121444        Bury Place 2014-03-15 14:50:10 2014-03-17 01:50:07
## 121591 Southampton Place 2014-03-15 14:50:05 2014-03-17 01:50:04
## 133765      High Holborn 2014-04-11 16:59:37 2014-04-14 01:29:07
## 133900   Earnshaw Street 2014-04-11 16:59:36 2014-04-14 01:29:05
## 133961        Bury Place 2014-04-11 16:59:38 2014-04-14 01:29:08
## 134027 Southampton Place 2014-04-11 16:59:35 2014-04-14 01:29:05
It looks like these happened to all stations simultaneously, suggesting problems with either my collection script or the API, rather than problems with individual locations.
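One way to check the "all stations simultaneously" observation is to compute each gap and group the affected stations by when the gap began. The Python sketch below does this for a handful of rows copied from the tables above. Treating d.updated as the difference between updated and prev.updated, in minutes, is my assumption about how that column is defined, and grouping by hour is a crude choice made for brevity.

```python
from collections import defaultdict
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# A few rows copied from the tables above: (station, prev.updated, updated).
rows = [
    ("High Holborn", "2013-08-31 15:10:07", "2013-09-02 12:30:05"),
    ("Bury Place", "2013-08-31 15:10:08", "2013-09-02 12:30:07"),
    ("Southampton Place", "2013-08-31 15:10:05", "2013-09-02 12:30:04"),
    ("Earnshaw Street", "2013-08-31 15:10:06", "2013-09-02 12:30:05"),
    ("Southampton Place", "2014-06-20 17:36:06", "2014-06-30 08:30:03"),
]

def gap_minutes(prev, cur):
    """Minutes between two timestamps (assumed meaning of d.updated)."""
    delta = datetime.strptime(cur, FMT) - datetime.strptime(prev, FMT)
    return delta.total_seconds() / 60

# Group stations by the hour in which the gap started ("YYYY-MM-DD HH").
stations_by_start = defaultdict(set)
for name, prev, cur in rows:
    stations_by_start[prev[:13]].add(name)

for start, names in sorted(stations_by_start.items()):
    print(start, len(names), sorted(names))
```

A gap shared by all four stations points at the script or the API; a gap affecting a single station, like the June 2014 one, points at that station.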
Entries with less than 60 minutes between updates, no longer on a log scale:
In the vast majority of cases, updates are approximately ten minutes apart. This encourages me to take a subset of the data (bikes), considering only entries with d.updated less than 15 minutes. This eliminates many outliers in future graphs.
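The subsetting step might look something like the following Python sketch. It is an illustration only: the first entry is a made-up, well-behaved update, the second is one of the long gaps listed earlier, and d_updated is a hypothetical helper standing in for the d.updated column.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"
MAX_GAP = timedelta(minutes=15)   # threshold taken from the text

# (station, prev.updated, updated). The first row is hypothetical; the
# second is one of the long gaps from the table of 2000-5000 minute gaps.
entries = [
    ("High Holborn", "2013-08-31 15:00:06", "2013-08-31 15:10:07"),
    ("High Holborn", "2013-08-31 15:10:07", "2013-09-02 12:30:05"),
]

def d_updated(entry):
    """Time elapsed between an entry and the previous one for its station."""
    _, prev, cur = entry
    return datetime.strptime(cur, FMT) - datetime.strptime(prev, FMT)

# Keep only entries that follow their predecessor by less than 15 minutes.
bikes = [e for e in entries if d_updated(e) < MAX_GAP]
print(len(entries), len(bikes))
```

Only the first entry survives the filter; the entry that follows the two-day gap is dropped.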
All times of day are approximately equally represented to within ten minutes, which is good. There are five noticeable troughs preceded by spikes, but they probably don't signify much. Dates are a lot less uniform, however. Even apart from the ten-day period where my script was broken, many days have significantly fewer updates than typical, and some have none at all.
It was common for every station to report less than a full complement of docks. At least two had a full complement for less than half the time (High Holborn and Bury Place are unclear in that respect). This isn't surprising, since a bike reported as defective will be locked in, using up a slot but not being available for hire.
The time of year makes very little difference to the number of rides. There appears to be a slight sinusoidal relationship, but it's very weak. (I didn't do a PMCC test because that assumes that any relationship is linear, which we would naively expect not to be the case here, and also doesn't look true from the graph.)
Fewer journeys are taken on weekends. The median number of bikes available doesn't change much throughout the week (5 on Monday and Friday, 4 on other days), but the distribution does. Saturday and Sunday have noticeably different shapes to the others. They have a single peak, while weekdays are somewhat bimodal, with a small peak where the station is full (probably when people are arriving at work).
(Since the stations have different numbers of docks, I did a graph of fullness rather than of number of bikes. The density plot doesn't show peaks exactly at 0 and 1 because of how the density window works, but histograms of num.bikes and num.spaces show that that's where they are. It would be difficult to use a histogram for this graph because there's no sensible binwidth.)
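For concreteness, here is a minimal Python sketch of a fullness measure. It assumes fullness means num.bikes divided by num.bikes plus num.spaces, that is, the share of currently reported docks that hold a bike. That definition is my assumption, the function name is hypothetical, and the readings are invented for illustration.

```python
def fullness(num_bikes, num_spaces):
    """Fraction of reported docks holding a bike, from 0.0 to 1.0."""
    total = num_bikes + num_spaces
    if total == 0:
        return None   # no usable docks reported; fullness is undefined
    return num_bikes / total

# Hypothetical readings: (num.bikes, num.spaces)
readings = [(0, 20), (5, 15), (12, 12), (18, 0)]
values = [fullness(b, s) for b, s in readings]
print(values)
```

Dividing by the docks actually reported, rather than by a fixed station capacity, means an empty station always maps to 0 and a full one to 1, even when some docks are out of service.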
##
##  Pearson's product-moment correlation
##
## data:  bikes$num.bikes and bikes$prev.num.bikes
## t = 2466.8, df = 173250, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9859301 0.9861908
## sample estimates:
##      cor
## 0.986061
There's very strong correlation between the number of bikes in adjacent entries. This is as expected, especially given what we saw about d.num.bikes previously. The colors here don't show any particular station-dependent trends.
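To make the test output above a little less opaque, the Python sketch below computes a Pearson correlation between a series and its lagged copy, and shows how the t statistic follows from the correlation and the degrees of freedom via t = r * sqrt(df / (1 - r^2)). The short num_bikes series is invented for illustration; only the reported cor and df values come from the output above.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def t_statistic(r, df):
    """t statistic for testing whether a correlation r differs from zero."""
    return r * math.sqrt(df / (1 - r * r))

# Hypothetical readings for one station; each entry pairs with its predecessor.
num_bikes = [5, 5, 6, 8, 8, 7, 4, 4, 3, 5]
prev_num_bikes = num_bikes[:-1]
cur_num_bikes = num_bikes[1:]
r_demo = pearson_r(cur_num_bikes, prev_num_bikes)

# Plug in the cor and df reported in the output above.
t_reported = t_statistic(0.986061, 173250)
print(round(r_demo, 3), round(t_reported, 1))
```

Plugging in the reported correlation and degrees of freedom gives a t of roughly 2467, in line with the output. With this many observations, almost any nonzero correlation would be statistically significant, so the size of the correlation is the more informative number.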