A Reasonable Approximation

London Cycle Hires and Weather

Preface: I wrote this report for Udacity's "Explore and Summarize Data" module. The structure is kind of strange for a blog post, but I'm submitting the finished report essentially unchanged.

One thing I will note: I find that cycle hire usage doesn't change much throughout the year. Shortly after submitting, I read this article, which finds that it varies quite a lot. I'm inclined to trust that result more. It's intuitively sensible, and it looks directly at the number of rides taken, instead of at a proxy like I do.

Take this as evidence for how much to trust my other results.

My goal is to investigate usage of the London cycle hire scheme, and in particular how it varies with the weather. I'm running an analysis from July 2013 to June 2014.

I'm using two data sets here. Daily weather data comes from Weather Underground, using the weather station at London Heathrow airport.

(London City Airport is closer to the bike stations that I use, but the data from that airport reports 0 precipitation on every single day. The data from Heathrow seems to be more complete, and I expect it to be almost as relevant.)

I collected the cycle hire data myself, over the course of the year, by downloading CSV files from an unofficial API which now appears to be defunct. It has a granularity of about ten minutes. That's about 50,000 entries per docking station for the year, so for this analysis, I'm only using the data from four docking stations near my office.

All data and source code used for this project can be found in the git repository.

Exploring the weather data

Temperature

plot of chunk temp.1v plot of chunk temp.1v

These variables measure the minimum, average, and maximum daily temperatures. The graphs all look similar, and overlap a lot. The shape is a little surprising, as I didn't expect the density graphs to be bimodal. It could potentially be caused by significant differences between summer and winter, with an abrupt shift between the two.

Rainfall

plot of chunk rain.1v plot of chunk rain.1v

According to the rain column, there are over 225 rainy days and only about 125 non-rainy days. But by far the most common bin for precip.mm is the leftmost one. A table of the values of precip.mm:

## 
##     0  0.25  0.51  0.76  1.02  2.03  3.05  4.06  5.08   6.1  7.11  7.87 
##   207    35    20     9    17    22    12     8    12     4     4     2 
##  8.89  9.91 10.92 11.94 13.97 
##     3     5     2     1     2

Although more than half of observations have rain == TRUE, more than half of all observations also have precip.mm == 0, which needs more investigation. Rainfall as measured by precip.mm versus as measured by rain:

plot of chunk rain.precip.2v plot of chunk rain.precip.2v

The two measures don't always agree. Sometimes rain is false but precip.mm is nonzero; and often rain is true but precip.mm is zero. Neither of those is surprising individually: if rain is only counted when the rainfall exceeds a certain threshold, then that threshold could be large (giving false/nonzero) or small (giving true/zero). But the combination suggests that that isn't what's going on, and I don't know what is.

This table counts the anomalies by turning precip.mm into a boolean zero/nonzero (false/true) and comparing it to rain:

##        precip.mm > 0
## rain    FALSE TRUE
##   FALSE   119    9
##   TRUE     88  149

There are 88 instances of true/zero and 9 instances of false/nonzero, but the cases where the two measures agree are the most common.
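The original analysis is in R; as a rough sketch of the same cross-tabulation in Python/pandas (the column names rain and precip.mm come from the dataset, but the sample rows here are illustrative):

```python
import pandas as pd

# Illustrative stand-in for the weather data frame; the real data has one
# row per day with the Weather Underground columns.
weather = pd.DataFrame({
    "rain":      [True, True, False, False, True],
    "precip.mm": [0.0,  2.03, 0.0,   0.25,  0.0],
})

# Compare the boolean rain flag against a zero/nonzero view of precip.mm.
anomalies = pd.crosstab(weather["rain"], weather["precip.mm"] > 0)
print(anomalies)
```

Off-diagonal cells of the resulting table are the anomalies: days where one measure reports rain and the other doesn't.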

I find precip.mm to be more plausible here: it doesn't feel like more than half of days are rainy. This website agrees with me, saying that on average, 164 days out of the year are rainy (rain gives 237, precip.mm gives 158).

Wind

plot of chunk wind.1v plot of chunk wind.1v

These three measures of wind speed are all averages. wind is simply the average wind speed over a day. wind.max is the daily maximum of the average wind speed over a short time period (I think one minute). gust is the same thing, but with a shorter time period (I think 14 seconds).

Unlike with temperature, the three measures look different. All are right-skewed, although gust looks less so. There are several outliers (the isolated points on the box plots), and the quartiles don't overlap. The minimum gust speed (about 24) is almost as high as the median wind.max.

Exploring the bike data

Time between updates

plot of chunk dt.1v

There are a few outliers here. Not all the lines are visible due to rendering artifacts, but above 5000 minutes, we have only five entries:

##                     name        prev.updated             updated
## 46779    Earnshaw Street 2013-10-03 08:50:23 2013-10-13 09:20:28
## 46899  Southampton Place 2013-10-03 08:50:22 2013-10-13 09:20:27
## 46918       High Holborn 2013-10-03 08:50:24 2013-10-13 09:20:30
## 47049         Bury Place 2013-10-03 08:50:26 2013-10-13 09:20:32
## 175705 Southampton Place 2014-06-20 17:36:06 2014-06-30 08:30:03

The first four of these happened when my collection script broke and I failed to realize it. The other occurred when Southampton Place was taken out of service temporarily.

Let's zoom in on the lower ones:

plot of chunk dt.1v.left

There are several instances where the time between updates is unusually large, on the order of hours or days. Here are the entries with between 2000 and 5000 minutes between updates:

##                     name        prev.updated             updated
## 32650       High Holborn 2013-08-31 15:10:07 2013-09-02 12:30:05
## 32660         Bury Place 2013-08-31 15:10:08 2013-09-02 12:30:07
## 32672  Southampton Place 2013-08-31 15:10:05 2013-09-02 12:30:04
## 32674    Earnshaw Street 2013-08-31 15:10:06 2013-09-02 12:30:05
## 38546       High Holborn 2013-09-14 22:39:00 2013-09-16 08:24:22
## 38719         Bury Place 2013-09-14 22:39:02 2013-09-16 08:24:23
## 38734  Southampton Place 2013-09-14 22:38:58 2013-09-16 08:24:20
## 38735    Earnshaw Street 2013-09-14 22:38:59 2013-09-16 08:24:21
## 84066         Bury Place 2013-12-27 15:40:08 2013-12-29 23:10:14
## 84069       High Holborn 2013-12-27 15:40:06 2013-12-29 23:10:13
## 84073  Southampton Place 2013-12-27 15:40:05 2013-12-29 23:10:11
## 84078    Earnshaw Street 2013-12-27 15:40:05 2013-12-29 23:10:12
## 84186    Earnshaw Street 2013-12-30 00:10:05 2013-12-31 13:10:07
## 84202       High Holborn 2013-12-30 00:10:06 2013-12-31 13:10:09
## 84269  Southampton Place 2013-12-30 00:10:05 2013-12-31 13:10:06
## 84330         Bury Place 2013-12-30 00:10:07 2013-12-31 13:10:11
## 89443  Southampton Place 2014-01-12 20:20:10 2014-01-14 18:40:07
## 89459       High Holborn 2014-01-12 20:20:13 2014-01-14 18:40:11
## 89467         Bury Place 2014-01-12 20:20:14 2014-01-14 18:40:16
## 89524    Earnshaw Street 2014-01-12 20:20:11 2014-01-14 18:40:09
## 121381   Earnshaw Street 2014-03-15 14:50:06 2014-03-17 01:50:04
## 121398      High Holborn 2014-03-15 14:50:07 2014-03-17 01:50:05
## 121444        Bury Place 2014-03-15 14:50:10 2014-03-17 01:50:07
## 121591 Southampton Place 2014-03-15 14:50:05 2014-03-17 01:50:04
## 133765      High Holborn 2014-04-11 16:59:37 2014-04-14 01:29:07
## 133900   Earnshaw Street 2014-04-11 16:59:36 2014-04-14 01:29:05
## 133961        Bury Place 2014-04-11 16:59:38 2014-04-14 01:29:08
## 134027 Southampton Place 2014-04-11 16:59:35 2014-04-14 01:29:05

It looks like these happened to all stations simultaneously, suggesting problems with either my collection script or the API, rather than problems with individual locations.

Entries with less than 60 minutes between updates, no longer on a log scale:

plot of chunk dt.1v.60

In the vast majority of cases, updates are approximately ten minutes apart. This encourages me to take a subset of the data (bikes.all -> bikes), considering only entries with d.updated less than 15 minutes. This eliminates many outliers in future graphs.
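The d.updated computation and the subset can be sketched as follows (again in Python/pandas rather than the original R, with illustrative timestamps):

```python
import pandas as pd

# Illustrative updates for one docking station; the real data has roughly
# ten-minute ticks, with occasional long gaps.
bikes_all = pd.DataFrame({
    "name": ["Bury Place"] * 4,
    "updated": pd.to_datetime([
        "2013-10-03 08:30:02", "2013-10-03 08:40:05",
        "2013-10-03 08:50:26", "2013-10-13 09:20:32",  # long gap: script down
    ]),
})

# Within each station, take the previous timestamp and the gap in minutes.
bikes_all["prev.updated"] = bikes_all.groupby("name")["updated"].shift(1)
bikes_all["d.updated"] = (
    bikes_all["updated"] - bikes_all["prev.updated"]
).dt.total_seconds() / 60

# Keep only ticks less than 15 minutes apart (bikes.all -> bikes).
bikes = bikes_all[bikes_all["d.updated"] < 15]
```

The first row of each station drops out too (its gap is undefined), which is harmless here.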

Date and time of update

plot of chunk date.time.1v plot of chunk date.time.1v

All times of day are approximately equally represented to within ten minutes, which is good. There are five noticeable troughs preceded by spikes, but they probably don't signify much. Dates are a lot less uniform, however. Even apart from the ten-day period where my script was broken, many days have significantly fewer updates than typical, and some have none at all.

Number of days spent with a given number of active docks

plot of chunk ndocks.time.2v

It was common for every station to report less than a full complement of docks. At least two had a full complement for less than half the time (High Holborn and Bury Place are unclear in that respect). This isn't surprising, since a bike reported as defective will be locked in, using up a slot without being available for hire.

Journeys taken throughout the year

plot of chunk date.journeys.2v

The time of year makes very little difference to the number of rides. There appears to be a slight sinusoidal relationship, but it's very weak. (I didn't do a PMCC test, because that assumes a linear relationship, which we would naively expect not to hold here, and which doesn't look true from the graph.)

Journeys by weekday

plot of chunk weekday.journeys.2v plot of chunk weekday.journeys.2v plot of chunk weekday.journeys.2v

Fewer journeys are taken on weekends. The median number of bikes available doesn't change much throughout the week (5 on Monday and Friday, 4 on other days), but the distribution does. Saturday and Sunday have noticeably different shapes to the other days: they have a single peak, while weekdays are somewhat bimodal, with a small peak where the station is full (probably when people are arriving at work).

(Since the stations have different numbers of docks, I did a graph of fullness rather than of number of bikes. The density plot doesn't show peaks exactly at 0 and 1 because of how the density window works, but histograms of num.bikes and num.spaces show that that's where they are. It would be difficult to use a histogram for this graph because there's no sensible binwidth.)

Change in number of bikes between updates

plot of chunk bikes.prevbikes.name.mv

## 
##  Pearson's product-moment correlation
## 
## data:  bikes$num.bikes and bikes$prev.num.bikes
## t = 2466.8, df = 173250, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9859301 0.9861908
## sample estimates:
##      cor 
## 0.986061

There's very strong correlation between the number of bikes in adjacent entries. This is as expected, especially given what we saw about d.num.bikes previously. The colors here don't show any particular station-dependent trends.

Number of bikes at any given time

plot of chunk bikes.time.name.mv

The correlation also looks strong between the number of bikes at each station at any given time. Since they're all close to each other, that's not surprising. The time is a big factor, with large numbers of bikes in the stations during office hours, and few in the evening and early morning. There's a slight dip around 1pm, which could be related to people using them on their lunch breaks.

This graph gives an overview of global trends, but I mostly use the bikes at specific times. We can zoom in on those:

Number of slots available at 0930

(when I'm trying to arrive at work)

This is a proportional frequency plot: within each facet of the graph, the heights of the bins add up to 1. Only weekdays are considered.

plot of chunk slots.0930.2v

About 40% of the time, Earnshaw Street has no spaces. That's actually less than I'd realized. It's directly outside my office, and I haven't even been checking it because I'd assumed it was always full.

And at 0940

(in case I'm running late)

plot of chunk slots.0940.2v

If I'm late, I have slightly less chance of finding a docking station, but not much less.

Combining the two

Journeys taken on rainy vs. non-rainy days

Here, rain is the original variable in the dataset, and rain2 simply measures whether precip.mm is nonzero. We have graphs looking at d.num.bikes on each type of day, and tables comparing its mean absolute value.

plot of chunk rain.rain2.journeys.2v

## Source: local data frame [2 x 2]
## 
##    rain mean(abs(d.num.bikes))
## 1 FALSE              0.5160167
## 2  TRUE              0.4156172

plot of chunk rain.rain2.journeys.2v

## Source: local data frame [2 x 2]
## 
##   rain2 mean(abs(d.num.bikes))
## 1 FALSE              0.4824637
## 2  TRUE              0.4073405
## Source: local data frame [4 x 3]
## Groups: rain
## 
##    rain rain2 mean(abs(d.num.bikes))
## 1 FALSE FALSE              0.5184101
## 2 FALSE  TRUE              0.4755501
## 3  TRUE FALSE              0.4351990
## 4  TRUE  TRUE              0.4042656

Earlier I said I feel like precip.mm is more accurate than rain. Despite that, rain seems to be capturing something that precip.mm doesn't, because bike usage responds slightly more to it. This would seem to suggest that days where rain is true but precip.mm is zero have less bike usage than average; and indeed this is what we see.

Taking rain to be our measure, slightly over 70% of observations had no bikes added or removed on rainy days, and slightly under 70% on non-rainy days. The mean absolute difference is about 25% higher on non-rainy days.
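The grouped means used above can be sketched like this (a Python/pandas version of the R summaries; the rows are illustrative, not the real data):

```python
import pandas as pd

# Illustrative merged bike/weather rows; d.num.bikes is the change in bikes
# between ticks, and rain comes from the weather data.
bikes = pd.DataFrame({
    "rain":        [True, True, False, False, False],
    "precip.mm":   [0.0,  2.03, 0.0,   0.0,   0.25],
    "d.num.bikes": [0,    -1,   2,     0,     1],
})

# rain2: an alternative rain flag derived from measured precipitation.
bikes["rain2"] = bikes["precip.mm"] > 0

# Mean absolute change in bikes, as a proxy for usage, under each flag.
usage_by_rain = bikes.groupby("rain")["d.num.bikes"].apply(lambda s: s.abs().mean())
usage_by_both = bikes.groupby(["rain", "rain2"])["d.num.bikes"].apply(lambda s: s.abs().mean())
```

The second grouping splits out the anomalous true/zero and false/nonzero days, mirroring the four-row table above.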

Foggy versus non-foggy days

plot of chunk fog.journeys.2v

## Source: local data frame [2 x 2]
## 
##     fog mean(abs(d.num.bikes))
## 1 FALSE              0.4488018
## 2  TRUE              0.4568736

Journeys by temperature and wind:

plot of chunk temp.wind.journeys.2v

## 
##  Pearson's product-moment correlation
## 
## data:  bikes$t and abs(bikes$d.num.bikes)
## t = 31.414, df = 173250, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07057403 0.07993830
## sample estimates:
##        cor 
## 0.07525782

plot of chunk temp.wind.journeys.2v

## 
##  Pearson's product-moment correlation
## 
## data:  bikes$wind and abs(bikes$d.num.bikes)
## t = -22.389, df = 173250, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.05840721 -0.04901677
## sample estimates:
##         cor 
## -0.05371317

Unlike rain, it seems that fog, wind and temperature make approximately no difference. The mean absolute difference in number of bikes is about the same regardless of fog, and the correlation between that and temperature/wind is close to zero.

Number of bikes at any given time, depending on rain:

plot of chunk bikes.time.rain.mv

Rain reduces the variance, with fewer bikes during office hours and more outside of them.

Reformatting

With the data in the current format, not all the questions we want to ask are easy. For example: how does the number of bikes at one station correlate with another at any given time? I previously said it “looks strong”, but that's pretty vague.

To answer questions like that, we need to be somewhat forgiving with our definition of 'any given time'. Updates don't necessarily happen simultaneously, so we need to bin them together.

I'm going to create bins ten minutes wide, and assign every observation to a bin. Then in each bin, we can ask how many bikes were at each station. Using this, we can check correlation between each station:

plot of chunk ggpairs

Correlations range between 0.703 and 0.758, and the scatter plots and density histograms all look pretty similar. Does the correlation depend on time? Let's go for 0930, 1800, midnight, and noon.
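The binning step can be sketched as follows (Python/pandas rather than the original R; station names are real, the tick values are illustrative):

```python
import pandas as pd

# Illustrative ticks; updates from different stations are close together
# but not simultaneous, so we bin them to a shared ten-minute grid.
obs = pd.DataFrame({
    "name": ["Bury Place", "High Holborn", "Bury Place", "High Holborn"],
    "updated": pd.to_datetime([
        "2013-10-03 08:30:02", "2013-10-03 08:30:09",
        "2013-10-03 08:40:05", "2013-10-03 08:40:11",
    ]),
    "num.bikes": [5, 3, 7, 4],
})

# Floor every timestamp to its ten-minute bin, then one column per station.
obs["bin"] = obs["updated"].dt.floor("10min")
wide = obs.pivot_table(index="bin", columns="name", values="num.bikes")

# Station-vs-station correlations across bins (what ggpairs plots).
corrs = wide.corr()
```

With the data in this wide shape, the per-time-of-day versions are just row subsets of `wide` before calling `.corr()`.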

plot of chunk ggpairs.times plot of chunk ggpairs.times plot of chunk ggpairs.times plot of chunk ggpairs.times

The correlations are almost all lower. That surprised me, but I think it's an example of Simpson's paradox. I note that the darkest points in the graph are at midnight, with no bikes in any station much of the time. Bikes are periodically moved in vans to account for anticipated demand; I assume that these stations are emptied most nights to prepare for people coming to work in the morning.

An interesting point is that the weakest correlation on any of the graphs is 0.149, between Earnshaw Street and Bury Place at 1800. But the strongest correlation at a specific time is 0.757, also between those two stations, at 0930.

We also see the density charts sometimes having very different shapes, especially at 0930 and 1800. But this seems to be at least partly to do with the way that ggpairs chooses the axes on its density plots. For example, here's 0930:

plot of chunk bikes.0930.density

The troughs look a lot less significant now.

We can view a histogram of the total number of bikes available at different times:

plot of chunk bikes.time.hists

We see heavy leftward skews overnight, with much flatter (but somewhat right-skewed) distributions during office hours, and gradual transitions between the two.

We can also check correlation between times more distant than a single tick. If I check the slots available when I leave the house, can I learn how many will be there when I arrive?

plot of chunk cor.0900.0930

## 
##  Pearson's product-moment correlation
## 
## data:  at.0900 and at.0930
## t = 68.675, df = 1228, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8785868 0.9017383
## sample estimates:
##       cor 
## 0.8907389

This is good correlation! Does it depend on the rain?

plot of chunk cor.0900.0930.rain

## 
##  Pearson's product-moment correlation
## 
## data:  at.0900[rain] and at.0930[rain]
## t = 55.466, df = 816, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8737218 0.9025687
## sample estimates:
##       cor 
## 0.8890242
## 
##  Pearson's product-moment correlation
## 
## data:  at.0900[!rain] and at.0930[!rain]
## t = 39.748, df = 410, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8692649 0.9093735
## sample estimates:
##       cor 
## 0.8910456

Not much, if at all.

We can construct a model

## 
## Call:
## lm(formula = at.0930 ~ at.0900, data = spaces.0900.0930)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.334 -1.502  0.561  1.708 16.477 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.43496    0.15556  -9.225   <2e-16 ***
## at.0900      0.97899    0.01426  68.675   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.802 on 1228 degrees of freedom
##   (69 observations deleted due to missingness)
## Multiple R-squared:  0.7934, Adjusted R-squared:  0.7932 
## F-statistic:  4716 on 1 and 1228 DF,  p-value: < 2.2e-16

with an R2 of 0.79, which is okay. But this isn't the best we can do, because it groups all stations together. Ideally we would create one model per station, with inputs from every station.

## 
## Call:
## lm(formula = at.0930 ~ sp + hh + bp + es, data = spaces.tmp[spaces.tmp$name == 
##     "Southampton Place", ])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.566 -1.857 -0.148  1.152 15.420 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.79625    0.47090  -1.691   0.0919 .  
## sp           0.74101    0.04179  17.731   <2e-16 ***
## hh           0.05424    0.05307   1.022   0.3075    
## bp           0.11811    0.05092   2.320   0.0210 *  
## es           0.07909    0.04550   1.738   0.0832 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.17 on 296 degrees of freedom
##   (17 observations deleted due to missingness)
## Multiple R-squared:  0.736,  Adjusted R-squared:  0.7324 
## F-statistic: 206.3 on 4 and 296 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = at.0930 ~ sp + hh + bp + es, data = spaces.tmp[spaces.tmp$name == 
##     "High Holborn", ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.4068 -1.1295  0.1503  1.2304  8.3941 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.87268    0.29894  -9.610  < 2e-16 ***
## sp           0.08354    0.02653   3.149  0.00181 ** 
## hh           0.76021    0.03369  22.567  < 2e-16 ***
## bp           0.09533    0.03232   2.949  0.00344 ** 
## es           0.15937    0.02888   5.518  7.5e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.012 on 296 degrees of freedom
##   (17 observations deleted due to missingness)
## Multiple R-squared:  0.8349, Adjusted R-squared:  0.8327 
## F-statistic: 374.2 on 4 and 296 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = at.0930 ~ sp + hh + bp + es, data = spaces.tmp[spaces.tmp$name == 
##     "Bury Place", ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.3465 -1.3008  0.3121  1.4809  9.4211 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.28068    0.32734 -13.077  < 2e-16 ***
## sp           0.18778    0.02907   6.460 4.32e-10 ***
## hh           0.03132    0.03687   0.850 0.396253    
## bp           0.91255    0.03538  25.796  < 2e-16 ***
## es           0.11197    0.03160   3.543 0.000459 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.201 on 295 degrees of freedom
##   (18 observations deleted due to missingness)
## Multiple R-squared:  0.8969, Adjusted R-squared:  0.8955 
## F-statistic: 641.5 on 4 and 295 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = at.0930 ~ sp + hh + bp + es, data = spaces.tmp[spaces.tmp$name == 
##     "Earnshaw Street", ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8978 -1.4508  0.3118  1.3272 11.8323 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.98653    0.35005  -8.532 7.60e-16 ***
## sp           0.05579    0.03107   1.796   0.0735 .  
## hh           0.03405    0.03945   0.863   0.3887    
## bp           0.17361    0.03785   4.587 6.65e-06 ***
## es           0.83329    0.03382  24.638  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.356 on 296 degrees of freedom
##   (17 observations deleted due to missingness)
## Multiple R-squared:  0.8579, Adjusted R-squared:  0.856 
## F-statistic: 446.9 on 4 and 296 DF,  p-value: < 2.2e-16

Southampton Place has regressed slightly, but the others have improved slightly. In particular, Bury Place gets an R2 of 0.89, which is pretty good. (It's important to note that this doesn't make our model worse for Southampton Place than the aggregate model; the aggregate model was just overconfident on that station.)
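The fits above come from R's lm; as a sketch of the ordinary-least-squares fit behind the aggregate model, here is a Python version with made-up availability numbers (the real fit uses about 1,230 days):

```python
import numpy as np

# Illustrative spaces available at 09:00 and at 09:30 on the same days.
at_0900 = np.array([0, 2, 5, 8, 10, 12])
at_0930 = np.array([0, 1, 4, 7, 10, 13])

# Ordinary least squares: at.0930 ~ intercept + slope * at.0900.
slope, intercept = np.polyfit(at_0900, at_0930, deg=1)
predicted = intercept + slope * at_0900

# R^2 = 1 - SS_res / SS_tot, matching lm's "Multiple R-squared".
ss_res = np.sum((at_0930 - predicted) ** 2)
ss_tot = np.sum((at_0930 - at_0930.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
```

The per-station models are the same idea with four predictors (sp, hh, bp, es) instead of one.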

Final plots and summary

Plot 1

plot of chunk final.plot.1

The total number of bikes available changes gradually throughout the day, with few bikes typically available at night, but often many available during the daytime. The distribution looks left-skewed from around 10:00 to 17:00, and right-skewed from around 19:00 to 07:30. The left skew is never as extreme as the right skew, but because the stations have different numbers of slots, that doesn't tell us much.

Plot 2

plot of chunk final.plot.2

This time around, I restricted the graph to weekdays only. It's rare for the number of spaces to go up between 09:00 and 09:30. All four stations have similar usage patterns.

At 09:00, if there are five or fewer spaces available, it looks as though the most common single outcome at 09:30 is no spaces at all.

Points above the dotted black line are ones where more spaces were available at 09:30 than at 09:00. (Caveat: I've applied slight jittering, so points very close to that line are ones where the same number of spaces were available.) There are far fewer of them. However, the top-left corner of the graph has a few points in it, while the bottom-right corner is empty. The number of spaces never goes down by more than eleven, but it goes up by as much as fifteen.

Plot 3

plot of chunk final.plot.3

I took advantage of binning to calculate specific summary functions. All stations show similar patterns: at night, there are few bikes available; during office hours, there are almost always some, and the 10-90 percentile range is a lot higher. The trough around 1pm in the previous version of this plot no longer shows up, which makes me suspect it was simply an artifact of the smoothing method.

During the day, the number of bikes available is generally ranked by the number of docking slots at each station - so High Holborn has the least, and Bury Place has the most. When the bikes are taken around 18:00, High Holborn seems to lose them more slowly than the other stations. For Earnshaw Street and especially Bury Place, the 90th percentile lines suggest that those two stations were often completely full.

Reflection

I've learned a lot about how to fight ggplot when it doesn't do exactly what I want by default, and in particular about how to shape my data for it.

I feel like a data frame isn't an ideal structure for the data I have. Having to create prev.* and d.* copies of the columns that need them seems suboptimal; ideally I could refer directly to offset rows in the data. (For example, there's currently no easy way to ask “what's the difference between the number of bikes now and 30 minutes ago?”) But I couldn't find anything that worked better. In particular, time series only allow one data type, so I would have had to fight to use them at all, and I don't know whether they would have been any more useful.
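For what it's worth, once the observations sit on a regular time index, that offset lookup does become a single shift; a sketch in Python/pandas with illustrative numbers:

```python
import pandas as pd

# Bikes at one station on a regular ten-minute grid (illustrative values).
ticks = pd.date_range("2013-10-03 08:00", periods=6, freq="10min")
num_bikes = pd.Series([5, 6, 6, 4, 3, 7], index=ticks)

# "Number of bikes now minus 30 minutes ago" is a three-tick shift;
# the first three ticks have no 30-minutes-ago value and come out NaN.
d_30min = num_bikes - num_bikes.shift(3)
```

This only works after the binning step above has forced the irregular updates onto a regular grid, which is exactly the reformatting the report had to do by hand.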

My data set itself isn't ideal, particularly in the amount of missing data. Unfortunately, I don't think any better historical bike record data is available. I think I have enough data to trust my conclusions.

In general, it seems that weather doesn't have much impact on bike usage. I checked rain, fog, temperature and wind speed, and only rain made a significant difference. But since the rainfall data seems to be internally inconsistent, I don't know how much we can learn from it. It would be useful to validate it from another source. We might also learn more with finer-grained weather data. For example, when predicting bike availability at a specific time, it doesn't help much if we know whether or not it rained at all on a given day; but it might help more to know whether it was raining at that particular time.

On the other hand, we can make pretty good predictions about future bike (and slot) availability just from current availability. An ambitious future project might be a prediction system. A user could specify a station and an arrival time, and the system could tell her how likely it would be that she could find a slot in that station and nearby ones, and suggest an earlier arrival time that would increase that chance.

One thing I didn't examine was public holidays. For example, we might ask whether, on plot 2 above, many of the points where spaces were freed up fell on holidays. (We can calculate 85 points above the line, and only 8*4 = 32 of them could be on public holidays, but that's still potentially a third of them.)

After initially submitting this report, I noticed a big problem. All timestamps were collected and reported in physical (UTC) time, but bike usage patterns follow local clock time, which shifts with daylight saving. So some of my graphs, particularly the later ones, were mixing data from two different clock times (e.g. 09:00 and 10:00) as if they were the same. My first submission was rejected for unrelated reasons, and I've corrected the error in all subsequent versions.

Posted on 19 August 2015

Comments elsewhere: Hacker News