A Reasonable Approximation

Latest posts

Chris Masterjohn on Coronavirus, Part 1

Chris Masterjohn is a nutritionist who has some advice on supplements to take to help protect against Covid-19, and some to avoid. The raw advice is available for free, but the full report with explanation and references costs $10. I bought a copy.

Should we trust him? On the one hand, "nutritionist" is not a profession I necessarily hold much respect for (it's not a protected term). Nor do I tend to think highly of people emphasizing that they have a PhD. Also, his website looks super sketchy to my eyes. Also also, something that costs money gets fewer eyeballs than something free, and so mistakes are less likely to be discovered.

(Only the last one of those should be considered a problem with the report; the rest are just priors.)

On the other hand, Chris previously turned me on to zinc so he has some credibility with me. Also, although I'm out of my depth in actually evaluating the science, I do think I have a decent bullshit detector, and the report is superficially very convincing to me.1 I get the sense that he actually really knows what he's doing, is weighing the evidence appropriately and making claims that are justified by it. He admits to uncertainty, and some of his recommendations are "this probably won't help, but just in case"; but he does speak precisely enough to be falsifiable. This doesn't mean he's right, of course. But I think he's worth some people paying some attention to.

My intention here is mostly to summarize his reasoning, in more detail than is on the linked page but less detail than is in the full report. You can make up your own minds. Much of it I haven't even checked how actionable it is right now (i.e. are these things available in stores or for delivery, and how much do they cost). I do have some questions and commentary of my own that I've scattered about, possibly not always well-flagged.

The report makes twelve recommendations:

  • Four "essentials": elderberry, nutritional zinc, ionic zinc, and copper.
  • Four "optional add-ons": garlic or stabilized allicin, echinacea, vitamin C, and N-acetyl-cysteine. Chris is taking all of these except vitamin C, and he plans to take a low-dose supplement of that if he can't get fresh food.
  • Four "things to limit or avoid": these are much less pithy. "Don't take high doses of vitamins A or D" (but don't get deficient either); "Limit Calcium and Don't Use Calcium Supplements That Aren't Balanced by Phosphorus"; "Don't Use Monolaurin"; "Don't Use High-Dose Vitamin C, Pelargonium Sidoides (Umcka), or Bee Propolis".

I'm not convinced twelve recommendations is the natural grouping for this, but there we are. In this post I'm going to focus on the "essentials"; my current plan is to do the rest in future posts.

So far, I've followed this advice to the extent of:

  • I've stopped taking daily vitamin D supplements, and advised my partner to do the same. (Somewhat to my embarassment, as they'd only started taking them on my recommendation. I've been taking it 5x/week for years.) He says not to supplement with vitamin D at all, but he also says to get normal amounts from sunshine and vitamin D-rich foods, and we're not doing that. So maybe we should take one a week instead of one a day.
  • I bought a bulb of garlic, but I haven't done anything with it yet.
  • I took one of the zinc capsules that came free with my last order of zinc lozenges. Then I had a bad gastrointesinal day. Possibly unrelated, but they also had something of an unpleasant smell that I don't know whether it was normal. I haven't tried again. Maybe I will. They're 50mg, which is more than Chris recommends at one time.

If you're going to follow any of it yourself, you should at least read his public post, and ideally also the report.

General info: ACE2

SARS-CoV-2 (the virus that causes the disease Covid-19) is not like the common cold or the flu, and things which work against those may not be helpful here. On the other hand, it's very like SARS-CoV (the virus that causes SARS). The genomes are 80% identical, and "the majority of its proteins are 85-100% homologous, with an average homology of 87%".

(Question: That seems like a really low average, given that range?)

The two main things the report focuses on are ACE2 and interferon.

ACE2 is an enzyme used in regulating blood pressure. SARS-CoV-2 enters our cells by docking to ACE2. It has this in common with just two other human coronaviruses, SARS-CoV and HCoV-NL63. So heightened levels of ACE2 could increase susceptibility to SARS-CoV-2.

(By comparison, the common cold is mostly caused by rhinoviruses, and most of those dock to ICAM-1. Some colds are caused by coronaviruses, but those dock to aminopeptidase N or sialic acid. The flu docks to sialic acid. So if something protects against those by preventing them from docking, it's likely to have no effect on Covid-19.)

I'll talk about interferon in a future post, because it's not relevant to any of the "essentials".

Essential: Elderberry

"In rhesus monkey kidney cell culture, elderberry has virucidal, anti-plaque, anti-replication and anti-attachment activity toward HCoV-NL63". Most of the effect seems to come from caffeic acid. That binds directly to ACE2, which suggests elderberry would be similarly effective against SARS-CoV and SARS-CoV-2.

(Question: is that in vitro? Doesn't seem very reliable in general, but maybe the specific mechanism makes it moreso.)

As a bonus, elderberry is also effective against avian infectious bronchitis virus through compromising the lipid envelope. Since all coronaviruses have one of those, that effect might generalize to SARS-CoV-2.

Other foods include caffeic acid, but only black chokeberries have anything like a comparable amount. And elderberry extract is the thing that's been actually studied, so that's what Chris recommends.

There are studies using elderberry in humans to treats colds, the flu, and cardiovascular disease, but Chris doesn't mention their results. He just uses them to determine a safe dose.

Essential: Nutritional zinc

This is zinc from food or most supplements (including tablets or capsules).

Zinc "inhibits three proteins required for SARS-CoV replication: papain-like protease-2, 3CL protease, and helicase." So it probably inhibits the homologous proteins in SARS-CoV-2.

(Question: how similar do proteins need to be for this to be a decent guess? This suggests proteins called "homologous" might be only 40% similar in sequence. If I guess that "homologous" means this is likely to be a decent guess; and that these ones are >85% similar (based on base rates of similarity between the viruses)… that suggests it's probably a pretty good guess? But I'm not at all cofident in this thinking.)

So we should try to deliver zinc to the relevant tissues. What are those?

The infection would begin somewhere between the nose or throat (where the virus mostly enters our body) and lungs (where it primarily infects), wherever the virus first encounters ACE2. There are two papers trying to answer this question, and they give different opinions.

(Question: this seems to assume that the virus doesn't first infect something other than the lungs, and then move on to them. Is that a safe assumption? I would guess so.)

Hamming et al (2004)2 suggests that the virus wouldn't find any ACE2 until it reached the lungs. They did find ACE2 in the oral and nasal mucous membranes, but on the wrong side to be any use to the virus.

Xu et al (2020)3 argues that ACE2 is highly expressed through the surface of the mouth, especially the tongue.

The two used different methods. Xu had better resolution, down to single-cell. But Hamming could tell where on the cells the ACE2 was expressed. The second thing matters, and the first doesn't. Xu's results are what Hamming would have predicted, they just aren't relevant. (The symptoms are also what Hamming would predict: a cough suggests lung infection, and is present. A runny nose or other cold symptoms would suggest throat infection, but they aren't present.)

We don't have a specific mechanism to target the lung with zinc. We just have to take it orally in the normal way (that is, in food or supplements) and let it be distributed there.

Chris recommends 7-15mg of zinc four times a day, away from sources of phytate ("whole grains, nuts, seeds, and legumes") which can inhibit zinc intake.

(At one point he says 10-15mg, and at one point 7-10, but I think this is just bad proofreading. Mostly he says 7-15.)

Conventional wisdom says we can't absorb nearly that much, but Chris thinks we just need more dakka: "the relevant studies have been limited to less than 20 mg/d. Supplementation with 100 mg/d zinc sulfate has been shown to more than double total zinc absorbed from 4.5 to 10.5 mg/d, while decreasing the fraction absorbed from 43% to 9%."

At such high doses, side effects have been observed. "Zinc at 50 mg/d has been shown to lower superoxide dismutase, and at 85 mg/d increased self-reported anemia. Both of these could probably have been averted by proper balancing with copper, which is addressed in the section below. However, the increased need for copper at high zinc intakes reflects increased expression of metallothionein, which can bind to many positively charged metals besides copper." I confess I'm not sure what this "however" is meant to mean. It kind of sounds like "so we still probably shouldn't go that high", but then we go that high anyway. I'm a bit concerned about this.

(I also confess that I have no idea what superoxide dismutase is.)

If you take zinc, you should balance it with copper.

Essential: Ionic zinc

This specifically means zinc from the kind of lozenges that work for a cold, or failing that from an ionic zinc throat spray.

(The mechanism that makes this work against a cold will not help with SARS-CoV-2.)

It delivers ionic zinc to the mouth, nose and throat tissues, like we couldn't do with the lungs. But as discussed above, those probably aren't infected, so this probably won't help. It's included "as a hedge against the possibility" that they are. He takes one a day, and plans to increase that if he gets cold symptoms or Covid-19 symptoms.

(Question: this delivers ionic zinc to the surfaces of these tissues, while we want it on the inside. Will that work?)

I'm not really sure why this is an "essential" instead of an "optional add-on".

Essential: Copper

Copper surfaces work great against coronaviruses. This knowledge is not super helpful, since we are not copper surfaces.

It does suggest that copper ions in our cells might be toxic to the virus. But this has never been well studied.

Like zinc, copper inhibits papain-like protease 2 of SARS CoV. But it's much less effective at it.

The main reason to take copper is to keep the zinc-to-copper ratio acceptable. It should be between 10:1 and 15:1. (At one point he says 2:1 - 15:1, but again, I think that's bad proofreading.)

Like with zinc, he recommends also using a copper spray just in case SARS-CoV-2 infects the throat.

  1. There are some things that kind of smell like bullshit to me. Most notably, I feel like at times, the report goes into a lot of detail on things that aren't super relevant, like the renin-angiotensin system that ACE2 plays a part in. As far as I've seen so far, the precise mechanics of that don't really matter. Meanwhile, a lot of the important claims are speculative - necessarily so, because things are moving too fast to have good evidence here, but speculative all the same. In combination, this can kind of feel like… "throw a lot of impeccably researched, uncontroversial and unimportant science at the reader; then try to sneak in the difficult bits under the radar"? I'm sure there's a term for this that I'm forgetting.

    I don't actually think this is what's happening. My sense is that it's more likely to be bad editing, and I'm not even confident it's that. But it seemed important to note.

    I've also noticed some inconsistencies that I chalk up to bad proofreading. 

  2. Hamming, I. et al. Tissue distribution of ACE2 protein, the functional receptor for SARS coronavirus. A first step in understanding SARS pathogenesis. J. Pathol. 203, 631–637 (2004). 

  3. Xu, H. et al. High expression of ACE2 receptor of 2019-nCoV on the epithelial cells of oral mucosa. Int. J. Oral Sci. 12, 8 (2020). 

Posted on 29 March 2020 | Comments

My Bet Log

I occasionally make bets about future events. I think I probably should do that more often than I currently do. Given that, I think it's good practice to have a public record of the bets I've made and their outcomes, when available.

I'll try to keep this updated going forward, if I continue to make bets. Currently I'm down £72.46 with one win and one loss.

(This has nothing to do with the matched betting that I previously wrote about. There I was earning money from sign-up bonuses. Here I'm trying to earn money by being better at predicting the future than other people.)

In order of date I made the bet:

Undecided: NO on Trump leaving before the end of his first term.

I placed this bet on 20-Oct-2019, on Betfair, after a friend said on Facebook that the odds seemed very favorable to him. The odds at the time were 1.3, giving an implied probability of ${1.3 - 1 \over 1.3} = 30\%$ that he'd leave. I bet £50 against this, giving me £15 profit if he doesn't. (Minus commission, so £14.25.) As of 19-Mar-2020 the odds are 1.12, for an implied probability of 11% that he'll leave. (I could cash out now, taking my £50 back immediately and £7.56 minus commission when the market closes1. I'm not inclined to do that.)

I lost: NO on a large increase in working from home.

LessWrong user Liron said:

I also bet more than 50% chance that within 3 years at least one of {Google, Microsoft, Facebook, Amazon} will give more than 50% of their software engineers the ability to work from home for at least 80% of their workdays.

I took him up on this for $100 each. We confirmed on 02-Jan-2020. I don't think I'd heard the term "coronavirus" at the time.

This cost me $100 (£92.69 over Paypal, which I think included £2.99 transaction fee). But I got a bunch of LW karma for it. Swings and roundabouts.

Undecided: YES on someone contracting Covid-19.

In a Telegram group chat I'm in, I bet another member €30 each that at least one person in the chat would contract Covid-19 by 28-Feb-2021. (We made the bet on 28-Feb-2020.)

I wasn't confident of winning at the time, and I'm still not. Confirmed infection rates are lower than I'd anticipated - for example, Hubei has had just over 0.1% of its population confirmed as having had the disease at one point, and Italy just over 0.05%.

I think an ambiguous result is more likely than I used to, because people who get mild cases of the disease often aren't tested.

I won: DECLINE in Trump's odds of being re-elected.

On 04-Mar-2020, I noticed that the betting markets' odds on Trump's re-election had gone up since Covid-19 became a thing. (According to Election Betting Odds, rising from about 40% mid-november to about 56% at the time.) At the time, the number of officially diagnosed cases in the US was very small. I predicted that as the official number rose, and as it became more widely known that the CDC and FDA had basically fucked up, the odds would drop again.

The obvious objection was: "if you can predict that happening, so can anyone. Why haven't the odds changed already?" I didn't have an answer for that, but the stock markets had seemed to react slowly to the news2 so maybe it would happen here too. (And the next obvious objection was: "oh, so you're expecting the same thing to work twice in a row?" I didn't have an answer for that either.)

Less obvious objections that I realized/someone pointed out after I'd made the bet:

  • Bush's popularity rose after 9/11.
  • Covid-19 seems to play very nicely into Trump's anti-immigrant, anti-globalist narrative.

In any case, I went ahead. Again on Betfair, I bet £100 against Trump at odds of 1.76. The odds started to drop pretty soon after that. Two weeks later, on 18-Mar-2020, I cashed out (by betting £110.26 for Trump at odds of 2.1, but actually by pressing the "cash out" button). I've claimed my initial £100 back, and no matter who wins I'll earn £21.29. (£20.23 after commission, but nothing if the market is voided.)

  1. I can't take any profit now, because if he gets assassinated the market is void. 

  2. This didn't do me any good. I hadn't sold any stocks at the time, and I still haven't. 

Posted on 19 March 2020 | Comments

Tapping Out In Two

I'm one of those people who feels personally called out by xkcd's "Duty Calls" ("someone is wrong on the internet").

(Not as much as I used to. At some point I stopped reading most of the subreddits that I would argue on, partly for this reason, and Hacker News, for unrelated reasons, and now I don't do it as much.)

As pastimes go, there's nothing particularly wrong with internet arguments. But sometimes I get involved in one and I want to stop being involved in one and that's not easy.

I could just, like, stop posting. Or I could let them say something, and then respond with something like, "yeah, I'm still not convinced, but I don't really have time to get into this any more". But then the other person wins, and that's terrible. It potentially looks to onlookers like I stopped because I couldn't find something to say any more. And in at least some cases, it would feel kind of rude: if they've put a lot of thought into their most recent post, it's a bit dismissive to just leave. In the past, when people have done that to me, I've disliked it, at least some times. (Well, at least once. I remember one specific occasion when I disliked it, and there may have been others.)

Another thing I could do is respond to their most recent post, and then say "and now I'm done". But that feels rude, too, and I certainly haven't liked it when people have done it to me. (Why not? It puts me in the position where I don't know whether to respond. Responding feels petty, and kind of a waste of time; not responding feels like they win, and that's terrible.) If they do reply, then I'm shut off completely; I can't even make a minor clarification on the order of "no, you somehow interpreted me as saying exactly the opposite of what I actually said".

So I don't like those tactics. There are probably mental shifts I could make that would bring me more inclined towards them, but… about a year ago I came up with another tactic, which has seemed quite helpful.

What I do now is say something to the effect of: "after this, I'm limiting myself to two more replies in this thread."

This has various advantages. It feels less rude. It doesn't look like I'm quitting because I have no reply. It helps the conversation reach a more natural conclusion. And it also feels a lot easier to do, partly for the above reasons and partly for the normal reasons precommitment helps me to do things.

For some quantitative data, I went over my reddit history. It looks like I've used it ten times. (I don't think I've ever used it outside of reddit, though I initially thought of this after an argument on Facebook.)

  1. "(This doesn't seem super productive, so I'm going to limit myself to two more relies in this thread.)" This was my third comment, and we each made one more afterwards.
  2. "Limiting myself to two more replies in this thread." This was my third comment, and afterwards, I made two more and they made three more. Their final two replies got increasingly rude (they'd been civil until then), but have subsequently been deleted. This is the only time I ran into the two-more-comments limit. Also, someone else replied to my first comment (in between my final two comments in the main thread) and I replied to that as well.
  3. "I find myself getting annoyed, so I'm limiting myself to two more replies in this thread." This was my fourth comment, and afterwards we made each made one more. (This was the only person who questioned why I was continuing to reply at all. A perfectly reasonable question, to which I replied "not going down this rabbithole".)
  4. "(Limiting myself to two more replies in this thread.)" This was my third comment, and afterwards we made two more each. If their final comment had come after I hit the limit, I would have been tempted to reply anyway. (Uncharitably paraphrased: "oh, we've been talking past each other, this whole time I've been assuming X and you've been assuming Y, which is very silly of you" / "well I explicitly said I was assuming Y and not X in my very first post in this thread, and even if we assume X and not Y your behaviour still makes no sense".) All of their comments in this thread have since been deleted.
  5. "Pretty sure this is all a big waste of time, so I'm limiting myself to two more replies in this thread." This was my sixth comment linearly, but only second in reply to this specific user, and I'd previously made another two in different subthreads below my first comment. Afterwards, the person I'd been most recently replying to didn't make any more comments. But someone else replied to this comment, and I replied to them; and two other people replied in those other subthreads, and I replied to both of them as well (but one of those replies was just a link to the other).
  6. "Limiting myself to two more replies in this thread." This was my fifth comment, and neither of us replied afterwards, though I did quickly edit this comment to add a clarification that arguably could have counted as its own comment. Someone else replied to one of my comments higher up the chain, and I replied to them.
  7. "(I don't want to get sucked into this, so I'm limiting myself to two more replies in this thread.)" This was only my second comment, and neither of us replied further.
  8. "(Limiting myself to two more replies in this thread.)" But in their next reply, they made a mistake that I had previously explicitly pointed out. So instead of replying in depth, I said, "It's quite annoying that you continue to misunderstand that, so I'm out." They replied saying "I think you're struggling to justify your initial comments and that's why you're "out", but that's fine. I accept your unconditional surrender." I didn't reply further. Since it was /r/science, and I was quite miffed, I reported this comment as unscientific to see if it would help me feel better. I don't remember if it worked. The comment did not get removed.
  9. "I think I'm going to reply at most twice more." This was my fifth comment, and neither of us made any more afterwards. Their first and final comments are still there, but their other three have been deleted.
  10. "I'm going to limit myself to two more replies after this." This was my third comment, plus I'd made one in another subthread below my first comment. Afterwards we each replied twice more.

In thread #5, I made 11 comments in total; but in others, my max comment count was 6. This feels like I mostly did a decent job of not getting sucked in too deeply. And since I generally got the last word in (threads #2 and #7 were the only exceptions), I think (though I don't specificly remember) I rarely had the feeling of "if I don't reply now then they win and that's terrible". Thread #7 is the only one I think I still feel lingering resentment about. From before I thought of this tactic, I can think of at least two arguments where I didn't get the last word and I still feel lingering resentment. (One is the argument that triggered me to think of this tactic.)

So generally this seems like a success. Ideally we'd compare to a world where I didn't think of this tactic, but I'm not really sure how to do that. (We could go over my reddit arguments where I didn't use it, but that's clearly going to have sampling bias. We could go over my reddit arguments from before I thought of it, but we can't know when I would have used it or how things would have progressed. Possible experiment: going forward, each time I want to limit my future replies, I toss a coin for whether I actually do it in that comment. Keep track of when I did this. I am unlikely to actually run this experiment.) For a way it plausibly might not have been a success: I suspect it's the case that having limited my total investment, I spent more effort on many of these comments than I would have otherwise. If these arguments would have ended just as quickly in any case, then this tactic caused me to spend more time and effort on them.

I'm a bit surprised that I only ran into the two-more-comments limit once. A part of me would like to interpret that along the lines of: once I started putting more effort into my comments, I demolished my opponents' arguments so thoroughly that they accepted defeat. But this seems unlikely at best.

I will say that none of these comment chains felt like collaborative discussions. Some of them started that way, but by the end, they all just felt like "I am right and you are wrong". (This is not very virtuous of me, I admit.) My thinking had been that this tactic would be most valuable in collaborative discussions. But it seems I don't have many of those on reddit, at least not ones that I spend much time on. So, no direct evidence on that yet.

I'm not sure how to handle replies to other people, or replies in subthreads other than the main one. A strict wording would suggest that I should count subthreads against the limit, but I haven't so far and it hasn't caused me problems. Even a weak wording would suggest that I should count replies to other users against the limit, but I've only had one of those and didn't reach the limit whether you count them or not.

I'd ideally like to have a get-out clause like "…unless I actually decide that replying after that is worth my time". But I'm not quite sure that's the "unless" that I want. (Plus it seems kind of rude, but it's not like I'm being super polite as-is.) Anyway, I haven't needed that clause yet.

Posted on 05 December 2019 | Comments

The Effect pattern: Transparent updates in Elm

I don't like writing tests in Elm1. Naively, it seems like the fact that all Elm functions are pure would make it easy. But I've still never really been able to test the things I really want to.

It's possible that I'm not very good at writing tests in general. But even if that's true, I don't think it's the whole story.

I have another hypothesis, which is: much of the reason is that although Elm functions are pure, in the sense of "given the same input, they return the same output" - they aren't necessarily transparent, in the sense of "you can look at the output and verify that it's correct".

To recap2, I would describe the Elm architecture as consisting of:

  • Model - a type describing the data you're interested in. This is usually mostly transparent. There are some opaque blobs where components haven't exposed their internals, but I think those mostly aren't the kind of opacity that gets in the way3.

  • Msg - a type describing the messages your app passes around. Again, mostly transparent, with opaque blobs where components haven't exposed their messages. I feel like this opacity is maybe a bit more annoying, but still not very.

  • update : Msg -> Model -> (Model, Cmd Msg) - a function describing the logic of your app (or whatever it is you're architecting). The Model and Msg are still transparent, but the second return value is a Cmd Msg which is completely opaque, and which is the main source of the problem.

  • view : Model -> Html Msg - a function describing the rendering of your app. Html Msg is also arguably opaque, but you can inspect it using elmer4, or (as I discovered while writing this post) the Test.Html modules in the test package. I'm not sure if you can inspect it as deeply as you might want, I haven't explored these much yet.

Now, in my experience5, a lot of the logic in an app is contained in the way the update function chains. "After receiving this message, we send a command which eventually results in receiving that message, and then we want to put some value in the model." And we can't test those chains, because Cmd is opaque. So that's part of why I haven't liked writing Elm tests.

But I think I've found a method that will help, at least somewhat. You don't need to rewrite anything except your update function, and I actually think it's a mild improvement even if you ignore testing. I call it the Effect pattern.

We've implemented this pattern at work, in a couple of places. When I first did so, the first two tests that I wrote both revealed bugs in my code. (I think I'd known about both of those bugs but forgotten them. I wasn't consciously expecting the tests to fail.) Neither of those tests needed to be updated when I fixed the bugs. I think this is a good sign, but it's not very strong evidence. Beyond that, I mostly have theory.


I'm going to demonstrate with a simple app. It'll be buggy, and the bug will be hard to test for. Then I'm going to refactor the app to make it more testable, and write a test. Then I'm going to fix the bug, and verify that the test now passes. I think the app is too simple to make the benefits really obvious, but hopefully I can convince you anyway6.

So here's the app. It has a number, and a button to increase the number by one, and a button to multiply it by two. When you press one of those buttons, the number gets "saved to the server"; normally that would be an http request, but we emulate that by just sending a time delayed response. There's a message telling you whether the number has been successfully saved yet.

Actually, most of the app supports increasing the number by any integer, and multiplying it by any integer. That's not implemented in the view though. The Watsonian explanation for this is that the developer has a habit of overengineering things. The Doylist explanation is that it makes the interesting bits more interesting, but not significantly more complicated; and the view code is boring.

The bug is: if you press buttons quickly, the "saved" message will temporarily be wrong. You've sent two messages to the server, and one hasn't come back yet; the number that's been saved is not the number on display. Silly, but I think it works as a demonstration.

(There are surely also bugs related to the size of numbers, but I'm not interested in those right now. Additionally, a real life server would have other problems like "sometimes requests will fail" and "sometimes requests will be handled out of order"; the second in particular is hairy, but I haven't emulated these possibilities.)

Here's the update function:

update : Msg -> Model -> (Model, Cmd Msg)
update msg model =
    case msg of
        IncBy n ->
            ( { model | count = model.count + n, saved = False }
            , fakeSaveCount (model.count + n)
        MultBy n ->
            ( { model | count = model.count * n, saved = False }
            , fakeSaveCount (model.count * n)
        Saved ->
            ( { model | saved = True }, Cmd.none )

You can probably guess roughly what Model and Msg look like just from that.

As written, how would you test for the bug? I think the way to do it is to write a wrapper update1 msg model = Tuple.first <| update msg model that ignores the Cmd. Then test that

    |> update1 (IncBy 1)
    |> update1 (IncBy 1)
    |> Saved
    |> .saved

is False. And that works, but you need to know exactly what messages get sent in response to IncBy 1. In this case that's easy. In more involved cases, you'll need to know it for every Msg constructor, and you're going to make a mistake.

Here's how I'd rewrite for testability:

type Effect = ESaveCount Int

update : Msg -> Model -> (Model, Cmd Msg)
update msg model =
    let (newModel, effs) = updateE msg model
    in (newModel, Cmd.batch <| List.map (\e -> runEffect e model) effs)

updateE : Msg -> Model -> (Model, List Effect)
updateE msg model =
    case msg of
        IncBy n ->
            ( { model | count = model.count + n, saved = False }
            , [ESaveCount (model.count + n)]
        MultBy n ->
            ( { model | count = model.count * n, saved = False }
            , [ESaveCount (model.count + n)]
        Saved ->
            ( { model | saved = True }, [] )

runEffect : Effect -> Model -> Cmd Msg
runEffect eff _ =
    case eff of
        ESaveCount n -> fakeSaveCount n

That is, we split it into two parts. One part updates the model, and describes the side effects transparently. The other part turns the transparent side effects into opaque ones.

Even if we're not testing this, I kind of like it. It's nice to have a complete list of possible side effects in one place, where it isn't mixed in with model updates. But I do also want to be able to test it. Obviously, now we test updateE instead of update. Is that any better? I think so.

Here's a simple example of something we can now test that we couldn't before: "whenever update changes model.count, it makes a server request to save the new value; and whenever it makes a server request to save a value, that value is now in model.count." (We may need to write fuzzers for our Msg and Model to do this.) To do this, we need to trust that runEffect works properly, that ESaveCount actually generates a server request; but I think that's a fairly okay thing to trust.

But that's small fry. I spoke above about chaining the update function, and that's harder. If you're going to do it properly, you still need to know what messages get sent in response to every Effect constructor, and you can't learn that from calling the function. But I still think this is an improvement: if done right, I would expect (and my limited experience agrees) that typically Effect will have many fewer constructors than Msg, and for each one, the transformation to messages is fairly simple.

Here's a helper function that you can use to test updateE:

    : (state -> msg -> model -> (state, (model, List effect)))
    -> (state -> effect -> model -> (state, List msg))
    -> state
    -> model
    -> List (List msg)
    -> (state, model)
runUpdates updateE runEffect initialState initialModel messages =
    let go = runUpdates updateE runEffect
    in case messages of
        [] -> (initialState, initialModel)
        [] :: moreMsgs -> go initialState initialModel moreMsgs
        (msg1 :: moreMsgs1) :: moreMoreMsgs ->
            let (newState1, (newModel, effs)) =
                    updateE initialState msg1 initialModel
                (newState2, newMsgs) =
                        (\e (s, m) -> runEffect s e newModel
                            |> Tuple.mapSecond ((++) m)
                        (newState1, [])
            in go newState2 newModel ((moreMsgs1 ++ newMsgs) :: moreMoreMsgs)

On a high level, the way it works is this: you pass it your regular updateE function and a mocked runEffect function, together with an initial model and a list of messages to send. The messages get sent, one at a time, to the model. Any effects caused by updateE are handled by the mock runEffect, which returns a list of additional messages to be sent in future. We keep running until there are no more messages.

There are two complications. One is that we also thread through some arbitrary state, that can be updated both by updateE and runEffect7. We'll see the value of that later, but if you don't need it,

stateless : (a -> b -> c) -> (() -> a -> b -> ((), c))
stateless f () a b = ((), f a b)

is a helper function that lets you forget about it.

The other complication is that we pass in a nested list of messages. That's to give us some control over timing. Any messages returned by runEffect will be handled after the current "batch" of messages, but before the next "batch". So if you want to test the sequency "run msg1, then msg2, then handle the results of msg1", you would pass in [[ msg1, msg2 ]]. If you want to test the sequence "run msg1, handle its results, then run msg2", you would pass in [ [msg1], [msg2] ].

I have a feeling this won't always be enough control, and in future the mocked runEffect will need to return List (List msg) in place of List msg. (With the first item being messages to run immediately, the second being items to run between the current and the next batch, and so on. You'd get the current behaviour by returning [ [], msgs ].) But it suffices for now.

And here's one way to test the app:

type alias TState = (Int, List Expectation)

testSave : Test
testSave =
        mockUpdateE : TState -> Msg -> Model -> (TState, (Model, List Effect))
        mockUpdateE (n, exps) msg model =
            let (newModel, effs) = updateE msg model
                newState = case msg of
                    Saved ->
                        ( n - 1
                        , (newModel.saved |> Expect.equal (n - 1 == 0))
                            :: exps
                    _ -> (n, exps)
            in (newState, (newModel, effs))

            : TState
            -> Effect
            -> Model
            -> (TState, List Msg)
        mockRunEffect (n, exps) eff model =
            case eff of
                ESaveCount _ -> ( (n+1, exps), [Saved] )
    test "Doesn't pretend to be saved" <| \() ->
        let ((_, exps), _) =
                runUpdates mockUpdateE mockRunEffect (0, []) initialModel
                    [[IncBy 1, IncBy 1]]
        in Expect.all (List.map (\e () -> e) exps) ()

I admit, this is pretty ugly. But I think it's conceptually quite simple. The state keeps track of two things: how many save requests are currently "in flight", which gets updated as we step through; and a list of assertions, which we verify at the end. Every time we send a request (with ESaveCount), we increase the in-flight count. Every time we receive a Saved message, we decrease it, and add a new assertion to the list: model.saved should be True iff there are no requests remaining in-flight.

You can see this version of the app here. Note that to avoid the hassle of playing with Test.Runner, I've replaced the Test with an Expectation by commenting out the test "..." line (but nothing underneath), and put the result in the view. You can remove the second IncBy and check that it now passes (because if there's only one IncBy, the bug doesn't exhibit).

Now to fix the bug, and see what effect that has on the tests. If you can't fix a bug without changing the tests you wrote for it, that's a bad sign about your tests.

The test uncovered the bug by keeping track of some additional state. So the obvious thing to do is to move that state into the model. Count the number of requests in-flight, and only count as "saved" if there are none8. In fact, we no longer need saved as a field at all; as written in the test, saved is true iff there are no in-flight requests.

(This doesn't work in the real world, where requests can fail. I think it's fine as a demonstration.)

Removing saved actually does mean we need to rewrite the test. But only minorly, and we can do that before fixing the bug. All we need is to replace the reference newModel.saved with isSaved newModel where isSaved = .saved. Then when we fix the bug, we rewrite that as isSaved m = m.inFlight == 0, and the test passes with no more changes. We'll want to use that function in the view anyway. (Avoiding this kind of thing is one reason people like to write opaque models.)

You can see the fixed version of the app here.

However… now that we've put that state in the model, although the test still works, it's looking a bit silly. We've got some state in the tests that should always be identical to the state in the model; why not just test that the model is updating the state correctly?

So here are some other tests that we could write (but that I'm not going to, for time reasons):

  • We could test that, given the same messages as above, the model doesn't become isSaved until the very end of runUpdate. We can still do that with runUpdates; our state parameter is now List Model, we use mockUpdateE to keep track of every intermediate model (and mockRunEffect no longer needs to deal with state at all), and then finally verify that isSaved is false for all but the last. (This test we could have written with the old model. As described it fails if, in future, Saved starts having other effects - maybe it wants to send something to a port. We can get around that by doing more bookkeeping, but we could also simply decide that mockRunEffect will pretend it doesn't have those effects.)

  • We could test that any update which returns an ESaveCount also increases inFlight; and further, that any messages sent in response to ESaveCount will decrease inFlight. I think this test is most valuable if it uses a mocked runEffect that's also used in other tests.

Really though, I'm not sure what's best. I do think the Effect pattern will help.

Further thoughts

We haven't used this much yet, so there's a lot left to explore. Here are some thoughts I've had, but not developed very far, in no particular order.

Nomenclature. We've gone from update to update, updateE and runEffects. I currently think runEffects is an okay name, but I feel like with no historical baggage, the name update would have gone to updateE. Then what would update be called? I left it as-is partly to avoid stepping on historical toes, and partly because I don't know. runUpdate? That would give us the pattern "update functions update the model and return some kind of what-next value, run functions produce Cmds". (Of course, runUpdates violates that pattern. Maybe that should become foldUpdate…?)

Also, we'll want a helper function to convert updateE and runEffects into the correct form. What do we call that function? The only sensible suggestion I have is liftUpdate, but I'm not sure I like it.

Randomness. The test I've demonstrated was deterministic. So have all the tests I've written for Effect so far. (At least one used a fuzzed input, but that doesn't feel like it counts.) To get randomness I imagine you'd need to put a Seed in the state parameter of runUpdates, and then use Random.step (maybe with Test.Runner.fuzz to get you the Generator that you need).

Simpler test helpers. runUpdates is quite complicated, and as mentioned, I suspect it'll need to be even more so. I think most tests probably won't use all of its features. What simpler variants of runUpdates are worth creating (and what should they be named)? An obvious candidate is a variant with no state. Maybe also a variant that returns a list of models instead of just the final model.

Full state machine. In Haskell we use quickcheck-state-machine, which I really like. Could we do something similar in Elm? I think it would be possible, though probably looking quite different. Elm doesn't have all of the type-level features that QSM relies on, but it also doesn't have all of the features that QSM needs to support.

Should runEffect see the model? Reasons yes: it simplifies the Effect constructors; it reduces total amount of code; it allows one stateless function to work with both updateE and runEffects. Reasons no: it gives runEffect more rope to hang you with (more likely to be complicated, more likely to diverge in the codebase and test suite). We went with yes because in one of our components, many constructors would have needed many parameters. But we probably could have refactored a little so that many constructors would have needed only one parameter each.

Can you reuse mocked runEffect functions? I touched on this above. That is, will you be able to write just one or two per component and use those in most of the tests for that component? (If not directly, then just wrapping them in some state-updating code that doesn't change the Msg they return.) Or will each test need a separate one? I certainly hope you can reuse them. If not, that might diminish a lot of the value. The fewer you have, the easier it is to keep them all correct when the real one changes.

(What if you have a potentially large number of functions, but selected by a configuration parameter? I think depending on how this is handled, it could be closer to either end of the scale.)

Composability. A component using this pattern can be embedded in one that doesn't, no problem. You just don't get any of the benefits in the parent.

What if you embed a non-Effect component in an Effect component? This forces you to put some Cmd in your Effect type. How do you test that? I'm not sure you reasonably can. So far I think we've just avoided testing those parts of our components.

What if you embed an Effect component in another Effect component? In the codebase, I imagine your parent's Effect type has a constructor ChildEffect Child.Effect which in turn calls Child.runEffect. That lets you test the parent component, but… you may need to mock runEffect for every child component, and that doesn't sound fun. (If the mocked functions do turn out to be fairly reusable, maybe not so bad.)

Also, if you want to provide an Effect component in a library, you may want the Effect type to be opaque (like Model and Msg often are today). But then if you want people to be able to test your component in theirs, maybe that makes you responsible for providing suitable mocked runEffect functions.

Unconventional updates. I've been assuming your existing update function has type Msg -> Model -> (Model, Cmd Msg). But a lot don't. A common pattern is something like Msg -> Model -> (Model, Cmd Msg, MsgToParent) where MsgToParent is something that the parent may need to act on. (In our case, it's often just global state that needs to be sent back up in turn.) How do you handle this kind of thing?

In the codebase, I think the correct decomposition for this is still to change only the Cmd Msg return value, since that's the only opaque one, and for runEffect to be the same as it would be otherwise. (You could give it MsgToParent as an extra parameter, but like with Model that won't give you any extra power. By default I wouldn't do it, but I don't think it would be an obviously bad idea.) If you had two opaque return values… honestly I don't think I've seen this and I don't think I want to speculate about what to do in that situation.

In the test suite, you won't be able to use the existing runUpdates function. What to do instead? It might prove useful to have a runUpdates3 (or something) which can handle this case, but I'm not sure exactly how that would work.

Another possibility would be using wrappers to put your updateE in the right shape. For example, maybe you can do type EffectOrMsg = Effect Effect | MsgToParent MsgToParent, and then combine the final two results of updateE. I don't know if that would always get you what you want.


You can take your existing function update : Msg -> Model -> (Model, Cmd Msg), and decompose it into two others:

  • updateE : Msg -> Model -> (Model, List Effect)
  • runEffects : Effect -> Model -> Cmd Msg

defining the new Effect type as whatever makes this decomposition work nicely. I think this is a slight improvement in the codebase; and independently of that, I think it helps you to write better tests.

  1. I'm not a fan of writing Elm code in general, but that's for mostly unrelated reasons

  2. I'm not going to recap in detail. This post is mostly targeted at people who already know Elm. I kind of hope others can get something from it too, but I'm not trying hard for that. 

  3. I'm somewhat philosophically opposed to this kind of opacity, even if practically I don't think it's a big problem. But my philosophical objections aren't relevant here. 

  4. I'm confused about how Elmer works. It uses kernel code, which isn't supposed to be available to mere mortals. The elm.json file says its package name is "elm-explorations/elmer", which would explain how it can use kernel code; but the name typically points at a github repository, and that repository doesn't exist. Has Elmer found some way to circumvent the kernel code restrictions? It seems to be something to do with the elmer-test library on npm, but… maybe I'm blind, but I can't even find the source code for that online? It'll be in my filesystem at work somewhere. Maybe some day I'll investigate further.

    I wish Elmer had a README section like "oh, you thought this was impossible? Surprise, sucker! Here's how it works!" I've seen many projects that I wish had that. 

  5. My vast experience of writing one Elm app, in one company, mostly with other developers who have approximately the same amount of experience. 

  6. Or, if it happens that I'm wrong about the benefits, I hope I can explain myself well enough that someone can understand why I'm wrong, and then convince me

  7. My initial version only exposed state to runEffect, because we didn't need it for updateE on the tests we've written so far at work. But I couldn't see a way to avoid it, or something like it, for this example.

    I weakly guess that if the state updates generated by updateE depend on the message, that's already a bad sign. We need it for the test I'm about to write, but that's precisely because the model doesn't keep track of everything it needs to. 

  8. I had planned to take a different approach, storing both "current count" and "the count saved on the server" (which would be placed in the model through a new argument on the Saved constructor). But that has its own bugs: suppose the count starts at 0 and quickly goes 1, 0, 1. Then we'll get messages Saved 1, Saved 0, Saved 1. After the first Saved 1, we'll think we're saved until Saved 0 arrives. 

Posted on 20 October 2019 | Comments

A reckless introduction to Hindley-Milner type inference

(I've been editing this post on and off for almost a year. I'm not really happy with it, but I suspect I never will be.)

Several months ago I gave a talk at work about Hindley-Milner type inference. When I agreed to give the talk I didn't know much about the subject, so I learned about it. And now I'm writing about it, based on the contents of my talk but more fleshed out and hopefully better explained.

I call this a reckless introduction, because my main source is wikipedia. A bunch of people on the internet have collectively attempted to synthesise a technical subject. I've read their synthesis, and now I'm trying to re-synthesise it, without particularly putting in the effort to check my own understanding. I'm not going to argue that this is a good idea. Let's just roll with it.

I'm also trying to tie in some quasi-philosophy that surely isn't original to me but I don't know if or where I've encountered it before.1


When people write software, sometimes it doesn't do exactly what we want. One way to find out is to try running it and see, but that's not ideal because any complicated program will have way too many possible inputs to test. (Especially when you consider that inputs include things like "amount of free space on disk" and "time taken for a web server to respond to a request".) So it would be nice if we could mathematically prove whether our software does what we want, without actually running it. Can we do that?

That's not a very well-defined question, but we can ask more precise versions of it. Here's a well-known one: given some possible input to our software, we might want to prove that our software will eventually stop running. Can we prove that?

That question is known as the halting problem, and the simple answer is that we can't, not in general; the halting problem is undecideable. But the full answer is more complicated.

To solve the halting problem, we want a program that, when shown another program and some input to be fed to that program, satisfies three different conditions:

  1. It will always return an answer.
  2. The answer will always be either "yes, this always terminates" or "no, sometimes this doesn't terminate".
  3. The answer is always correct.

And that's not possible. But we can compromise on any of the three. We can make a program that sometimes doesn't return an answer, or one that sometimes gets the answer wrong. But perhaps most interestingly, we can make a program that sometimes says "I don't know".

And when you allow that answer, you can create a language on which the halting problem is decideable. You can write a program that will tell you truthfully whether any program written in that language will terminate; and for any other program, will say "I don't know". (Perhaps expressed in words like "syntax error on line 1".)

Now, the halting problem is tricky. It turns out that if you create a language like that, there are a lot of interesting things that programs written in that language just won't be able to do; the language will necessarily be Turing incomplete.2 But there are also lots of interesting things that they can do. To give three examples of such languages3:

  • Regular expressions are really useful for certain operations on strings, but that's about all they're good for.
  • SQL is really useful for working with databases. According to some people on stack overflow, the ANSI SQL-92 standard was Turing incomplete and the ANSI SQL-99 standard is Turing complete. (No mention of the SQL-96 standard that came between these, but reading between the lines, probably Turing incomplete.) If I understand correctly, the feature required to make SQL-99 Turing complete4 is one I've literally never used; so for my purposes, it may as well be Turing incomplete.
  • Coq is used for proving math theorems. It's an interesting one because when you write your program, you have to also provide a proof that your program terminates. (I think this is slightly false, but again, good enough for the point I'm making.)

So although these languages can't do everything, they can still be incredibly useful in their domains. More useful than a more general purpose language might be. One reason for this is that being able to prove non-termination is a useful property of the language. If you had to write a SQL query in C, it would be all too easy to write some C code that would accidentally loop forever.

I'm trying to illustrate here something that seems to me important, which is that there's a tradeoff between what I'll call expressiveness and legibility. A programming language is expressive if you can easily write many interesting programs in it5; it's legible if you can easily say many interesting things about the programs you've written in it. And I claim that the most expressive programming languages won't be the most legible, and vice-versa; though there will certainly be languages which are neither expressive nor legible. This tradeoff seems fundamental to me, and I expect that some approximation of it has been proven as a theorem.6

I haven't defined these very well, but hopefully some examples will help. I will also clarify that both of them are highly dimensional; and that "raw computational power" is one of the things that expressiveness can point at, but not the only thing; and "human readability" is not really one of the things that legibility points at, but many things that increase legibility will also increase human readability.

  • Perl-compatible regular expressions can classify sets of strings that normal regular expressions can't. But they're harder to make time and space guarantees about. And it's possible to prove whether two regular expressions are equivalent, but that's not possible in general for PCREs (proof: PCREs can encode CFGs; CFGs can't be proved equivalent).

  • Under certain assumptions, Haskell's monadic IO lets you look at the type of a piece of code and know that it won't depend on external state. In return, a function can only bring in external state if its caller allows it to (which requires having permission from its own caller, and so on).

    The assumptions in question are false (partly because unsafePerformIO exists), but I've been able to get away with pretending they're true (partly because unsafePerformIO is punishable with excommunication).

  • Custom operators (at least as implemented in Haskell and Elm) are equivalent to named functions, and don't gain or cost much in terms of legibility and expressivity. They simply make code more or less readable. But operator overloading, at least when combined with dynamic typing, gains expressivity at the cost of legibility (you no longer know that a + b will do anything remotely like an addition).

  • Macros make it easier to do things like create DSLs, reduce boilerplate, and set compile-time config options. But they mean that a function call might not look like one, or vice-versa; expressions might get evaluated many times, or not at all; and the code might perform differently depending on the phase of the moon when it was compiled.


So we've got this tradeoff, and in our programming language design we try to navigate it. We try to find kinds of legibility that can be bought for little cost in expressiveness. Or more precisely, we try to find kinds of legibility that we care about, and that can be bought for little cost in kinds of expressiveness that we care about.

And Hindley-Milner type systems are a tradeoff that's proved fairly successful, both in direct use and as inspiration. At my company7, we use Elm8, which runs on an approximately HM type system. (I don't think it's pure HM, due to extensible record types.) We also use Haskell9, which runs on a type system that extends HM in many directions. Haskell's system is more expressive and less legible, but still successful. (I'll mostly be using Elm for examples in this post, and not extensible records.) ML and OCaml are other notable languages based on HM, though I haven't used either.

The legibility HM offers is, roughly, the ability to prove that a program typechecks. I'm not going to clarify exactly what that means, but we probably all have a decent idea. It's the thing that lets the Elm compiler say "no, that program is trying to add a string to an int, bad program", while the Python interpreter doesn't know that's going to happen until it's too late. The Elm compiler will refuse to compile your program unless it can logically prove that it will typecheck.

More precisely, what HM offers isn't type checking but the more general type inference. (And beyond that, type inference in roughly linear time.) Type inference doesn't just tell you whether a program typechecks, but what its type is; a program fails to typecheck iff no type can be inferred for it.

What this means is that there's no need to supply type annotations. And indeed, in Elm you can get away without them, except I think for extensible records. In Haskell you sometimes can't, because Haskell loses some of the legibility that HM offers.

(We typically do supply type annotations, but that's because they're useful. Partly as documentation for humans, partly to help pinpoint errors when our programs fail to typecheck.)

And so in an HM system you get no runtime type errors. And although not all runtime errors are type errors, in many cases they could be. For example, an array out-of-bounds exception isn't a type error. But when designing a language, you can decide that array out-of-bounds exceptions won't exist, any array lookup will return either a value from the array or null. If type errors are possible, you've just eliminated one source of errors by pushing them somewhere else, and possibly somewhere harder to debug. But in HM, you've eliminated one source of errors by pushing them somewhere more visible, where they can be ruthlessly executed.

Elm actually tries to promise no runtime errors, period, provided you stay inside Elm. On one level, I think that's a fairly minor imposition on language design, something you get "for free" by deciding that none of the built-in functions you provide will ever throw a runtime error. On another level, it seems completely impractical to decide for example that cons will return a meaningful value if it can't allocate more memory. I'm not aware that Elm even tries to handle those errors.

(Haskell doesn't try to promise the same thing, and allows functions to return undefined. This is another legibility-expressiveness tradeoff.)

So HM's legibility gain is: type inference, powerful type system, no runtime type errors, optionally no runtime errors at all. It's good.

Meanwhile, the expressiveness cost is that you need to write your programs in ways that the type inference algorithms can work with, which forbids some things that you might like to do.

For example, suppose you want to clamp a number to between -1 and +1. In Python, you could write that like

def clamp(x): sorted([-1, x, 1])[1]

and as long as sorted always returns a list of the same length it started with, that works fine10. But it only works because the Python interpreter allows you to be reckless with array indexing. Elm doesn't let you be reckless, and so Elm has no equivalent way to perform array lookup. If you tried to write the same function in the same way in Elm, the result in the compiler's eyes would not be a number but a Maybe number - AKA "either a number or Nothing". (Nothing is roughly equivalent to None in python or null in many other languages, but you have to explicitly flag when it's allowed.) When you actually run this code, you will always get a number and never Nothing. But the compiler can't prove that.

(Again, I stress that you will never get Nothing as long as your sort function always returns a list of the same length it started with. That's something you can prove for yourself, but it's not something the Elm compiler can prove. It's not even the sort of thing the Elm compiler knows can be proven. And so in turn, it can't prove that you'll never get a Nothing here.)

And then the Elm compiler would force you to account for the possibility of Nothing, even though there's no way that possibility could occur at runtime. One option is to pick an arbitrary result that will never be exposed. That works fine until the code goes through several layers of changes, an assumption that used to be true is now violated, and suddenly that arbitrary result is wreaking havoc elsewhere. Or in Haskell, your program is crashing at runtime.

To be clear, that's not significantly worse than what we get in Python, where the code can also go through several layers of changes that result in it crashing at runtime. But we were hoping for better.

And in this case "better" is easy enough, you can just write your function to avoid indexing into a list, and then it can return a number with no need for trickery. The point isn't that you can't do the thing. The point is that (a), even if the thing is safe, the compiler might not know that; (b), if you decide it's safe anyway and find some way to trick the compiler, the compiler no longer protects you; and (c), if you want to do it in a way the compiler knows is safe, you might need to put in some extra work.

For another example, HM type systems can't implement heterogenous lists. So this is really easy in python:

def stringify_list(l):
    return [ repr(x) for x in l ]

                ["here's a", "nested list", {"and": "maybe", "a": "dict"}],
                "it can even be passed itself, like so:",

but impossible in Elm. You can sort of get the same effect by creating a type with many constructors

type HeteroType = HTInt Int
                | HTString String
                | HTBool Bool
                | HTList (List HeteroType)
                | ...

but it's not quite the same, because it can only accept types you know about in advance. Also, it's a massive pain to work with.

For a third example: Haskell is known for its monads. But Elm has no equivalent, because an HM type system can't support generic monad programming. You can implement the generic monad functions for specific cases, so there's Maybe.map and List.map, but there's no equivalent of Haskell's fmap which works on all monads.

Hindley-Milner type systems

I've talked about the tradeoffs that HM type systems offer, but not what HM type systems actually are. So here is where I get particularly reckless.

This bit is more formal than the rest. It's based on the treatment at wikipedia, but I've tried to simplify the notation. I'm aiming for something that I would have found fairly readable several months ago, but I no longer have access to that version of me.

Also, this part is likely to make more sense if you're familiar with at least one HM-based language. That's not a design feature, I just don't trust myself to bridge that inferential gap.

For an HM system, you need a language to run type inference on, and you need types to run type inference with, and you need some way to combine the two. You could use the language with no type inference, if you didn't mind crashes or weird behaviour at runtime, when you made a mistake with typing. (Haskell allows this with a compiler option.11) And you could run type inference without caring about the semantics of the language, treating it as essentially a SuDoku, an interesting puzzle but meaningless. (Haskell supports this, too.) But by combining them, the semantics of the language are constrained by the type system, and runtime type errors are eliminated.


Types come in a conceptual hierarchy which starts with type constants. That's things like, in Elm, Int, Float, Bool, String, Date, (). It also includes type variables, which in Elm are notated with initial lower-case, like a and msg. (Though the type variables number, comparable and appendable are special cases that I won't cover here.)

Next in the type hierarchy is applied types. Here a "type function" is applied to arguments, which are type constants and/or other applied types. These are things like List Int, Maybe (List Float), Result () Date, and a -> String. (In that last one, the type function is the arrow; Haskell would allow you to write it (->) a String. Also, (->) is the only type that HM specifically requires to exist.) Notably, an applied type must have a specific named type function as its root; you can't have m Int, which you would need for generalised monads.

Type constants and applied types are monotypes. You get a polytype by optionally sticking one or more "∀"s in front of a monotype. So for example a -> Int is a monotype, but ∀a. a -> Int is a polytype. So is ∀a. ∀b. a -> Int -> b, which is written equivalently as ∀a b. a -> Int -> b. ∀b. a -> Int is also a polytype; since the quantified variable doesn't show up, it's equivalent to the monotype a -> Int. We can do something like that to any monotype, so for simplicity we might as well decide that monotypes count as a special case of polytypes, not as a distinct set.

Type signatures in Elm typically have an implied "∀" over whichever variables it makes sense to quantify. (There's no syntax for explicitly writing the "∀".) So the type of List.map would be written

map : (a -> b) -> List a -> List b

but I'll be writing

map : a b. (a -> b) -> List a -> List b

for clarity. Because there's one place where Elm doesn't give an implied ∀, which is when you have scoped types. To demonstrate by example,

const : a b. a -> b -> a
const x = let foo : b -> a
              foo y = x
           in foo

const has a polytype here, but foo has a monotype, because (in context) its argument type and return type are constrained. If you tried to swap a and b in the type signature for foo, or rename either of them, the Elm compiler would complain.


The language has four kinds of expression, and each has a rule relating it to the type system. You need variables and constants, function calls, lambda expressions, and let statements.

Variables and constants

Variables and constants are things like True, 0.2, Just, "Hello", [], (), List.map. Each of these has a declared type, which in Elm is notated with :. So True : Bool, 0.2 : Float, Just : ∀a. a -> Maybe a, "Hello": String, [] : ∀a. List a, () : (), List.map : ∀a b. (a -> b) -> List a -> List b.

The rule that relates these to the type system is that type declarations imply type judgments. Mathematically it looks like

$$ \frac{x : π \quad π ⊑ μ}{x \sim μ}. $$

Reading clockwise from top left, this says: if you have a variable $x$ declared to have some polytype $π$, and if the monotype $μ$ is a specialisation of $π$, then $x$ can be judged to have type $μ$. ($π$ always denotes a polytype, and $μ$ always denotes a monotype.)

A type judgment, as opposed to a declaration, provides a type that an expression can be used as. A judgment is always as a monotype.

And type specialisation, denoted $⊑$, is the process of replacing quantified variables with less-quantified ones. So for example the type ∀a b. a -> b -> a might be specialized to ∀a. a -> String -> a, or to ∀b. Int -> b -> Int; and from either of those, it could be further specialised to Int -> String -> Int. Of course String -> Int -> String and List Float -> (Float -> String) -> List Float are valid specialisations too.

Thus: we have the type declaration [] : ∀a. List a, and we have (∀a. List a) ⊑ List Int, and so we can form the type judgment [] ~ List Int. We also have (∀a. List a) ⊑ List String, and so [] ~ List String. And [] ~ List (List (Maybe Bool)), and so on.

Function calls

Function calls are things like not True, (+ 1), List.Map Just. And the rule relating them to the type system is that function calls consume function types. This is the simplest of the rules. Mathematically it looks like

$$ \frac{f \sim μ → μ' \quad v \sim μ}{f v \sim μ'}. $$

Or: if $f$ can be judged to have a function type $μ → μ'$, and $v$ can be judged to have type $μ$, then the function call $fv$ can be judged to have type $μ'$.

Thus: we can infer the type judgment toString ~ (Int -> String), and we can infer 3 ~ Int, and so we can infer toString 3 ~ String.

Also, we can infer List.map ~ ((Int -> Maybe Int) -> (List Int -> List (Maybe Int))), and we can infer Just ~ (Int -> Maybe Int). So we can infer List.map Just ~ (List Int -> List (Maybe Int))

Lambda expressions

Lambda expressions are things like \x -> Just x, and in Elm they're used implicitly when something like const x y = x is turned into const = \x -> \y -> x. The type system rule is that lambda expressions produce function types. Mathematically:

$$ \frac{x : μ ⇒ e \sim μ'}{λx.e \sim μ → μ'}. $$

Or: suppose that the type declaration $x : μ$ would allow us to infer the judgment $e \sim μ'$. In that case, we could judge that $λx.e \sim (μ → μ)'$.

Typically $e$ would be some expression mentioning the variable $x$, but it's no problem if not. In that case, if you can get $e \sim μ'$ at all, you can get it assuming any $x : μ$, and so you have $λx.e \sim (\mathtt{Int} → μ')$ and $λx.e \sim (\mathtt{String} → μ')$ and $λx.e \sim (\mathtt{Result String (List (Maybe Float))} → μ')$ and so on.

Thus: given the declaration x : Int, we can infer the judgment [x] ~ List Int. And so we can infer the judgment (\x -> [x]) ~ (Int -> List Int).

Let expressions

Let expressions read like let x = y in a. Semantically, this is very similar to using a lambda expression, (\x -> a) y. But HM treats them differently in the type system, allowing a let expression to introduce polytypes. That permits code like

let f x = [x]
in (f "", f True)
-- returns ([""], [True])

If you tried to rewrite this as a lambda, you would get

(\f -> (f "", f True))(\x -> [x])

But type inference fails here, because there's no monotype declaration for f that allows a type judgment for (f "", f True). So the precondition for the lambda rule never obtains, and so in turn, no type judgment can be made for the expression \f -> (f "", f True).

Let expressions compensate for this deficiency, with the rule let expressions are like polymorphic lambda applications. (I don't have a good name for it.) Mathematically:

$$ \frac{a \sim μ \quad x : \bar{μ} ⇒ b \sim μ'}
        {(\mathtt{let}\ x = a\ \mathtt{in}\ b) \sim μ'} $$

Or: suppose that $a$ can be judged to have type $μ$, and that the declaration $x : \bar{μ}$ would allow us to infer the judgment $b \sim μ'$. In that case, we could judge that $(\mathtt{let}\ x = a\ \mathtt{in}\ b)$ has type $μ'$.

This introduces the notation $\bar{μ}$, which generalises a monotype to a polytype. How it works is: if $μ$ mentions a type variable $a$, and $a$ isn't quantified over in the surrounding context, then $\bar{μ}$ contains a "$∀a$".

Thus: we can infer (\x -> [x]) ~ (a -> List a), where a is a type variable unused in the surrounding context. That type generalises to ∀a. a -> List a. And given the declaration f : ∀a. a -> List a, we can infer (f "", f True) ~ (List String, List Bool). So in total, we can infer

$$ (\mathtt{let\ f\ x\ =\ [x]\ in\ (f\ "",\ f\ True)})
   \sim \mathtt{(List\ String,\ List\ Bool)}. $$

(It seems a little strange to me that the approach here is to first construct a meaningless type, and then quantify over it. Still, that's my understanding. It's of course possible I'm mistaken.)

Why do we need both let and lambda? Well, we can't replace lambda expressions with let expressions: they're not re-usable. (When you translate a let expression into a lambda expression, you actually generate a lambda applied to an argument. There's no way to translate a lambda expression by itself into a let expression.) Meanwhile, I'm not entirely sure why we can't make lambdas polymorphic in the same way let expressions are. I think the answer is that if we tried it, we'd lose some of the legibility that HM offers - so let can be more powerful in the type system because it's less powerful in the language. But I'm not sure exactly what legibility would be lost.


There's an interesting thing about the system I just described: it may or may not be Turing complete.

The problem is that there's no specified way of doing recursion. A function can't call itself, and it can't call any other function that can call it.

But a fixed-point combinator allows recursion, and might be included in the initial set of variables. Failing that, the proper recursive types can be used to define one. (Elm and Haskell allow us to define such types12.)

Failing both of those, we can introduce a new kind of expression

$$ \frac{x : μ ⇒ a \sim μ \quad x : \bar{μ} ⇒ b \sim μ'}
        {(\mathtt{letrec}\ x = a\ \mathtt{in}\ b) \sim μ'}. $$

This is much the same as let, but makes the variable x = a available when evaluating a. It's only available as a monotype when evaluating a, and still doesn't get generalised to a polytype until evaluating b.

(Elm and Haskell provide letrec as let and don't provide simple let at all.)

But if an HM language doesn't provide the appropriate variables or types, and doesn't implement letrec or something similar, it won't be Turing complete. Legibility gain, expressivity cost.

Wrapping up

And modulo some small details, that's the entirety of a Hindley-Milner type system. If you have a language with those features, and a suitable set of types, you can perform type inference.

What we have is a set of rules that allows us to construct proofs. That is, if we look at a program written in this language, we would be able to construct a proof of its type (or lack thereof). But I already said HM is better than that: it lets us mechanically construct a proof, in (roughly) linear time.

I confess, I'm not entirely sure how to do that. The outline is obvious, recurse down the parse tree and at each step apply the appropriate rule. But since a constant can be judged as one of many types, you need to keep track of which types are acceptable. Wikipedia hints at how it works, but not in a way that I understand particularly well.

Elm and Haskell both support many things not covered so far. To look at some of them briefly, and occasionally getting even more recklesss,

  • It seems obvious, but both allow you to evaluate the language, something I haven't touched on much. And it does need to be touched on, because there's more than one way to do it. Haskell uses a lazy evaluation model, while Elm is strict.

  • Both have ways to introduce new types. That doesn't change what we've seen, but it does separate the languages into two parts. One part describes the types used in a program and one part implements the semantics of a program.

  • Both also support case statements along with destructuring, like

    mHead : Maybe (List a) -> Result Bool a
    mHead ml = case ml of
        Just (a::_) -> Ok a
        Just _ ->     Err True
        Nothing ->    Err False

    To implement these, you'd want to add a fifth class of language expression. But I think it would be possible in theory to write a "thin" first-pass compiler to translate these statements into the existing language. By "thin" I mean to do this in such a way that we don't lose any of the legibility guarantees we care about.13 (For example, if this compiler turned $n$ bytes of code in a case statement into more than $O(n)$ bytes of code in the base language, or if it ran in more than O(n) time, this condition would fail.)

    If I'm right about that, then case statements neither make the language more expressive nor less legible, at least in one important sense.

  • (By comparison, if-then-else statements are also another class of language expression, but one which can obviously be thinly compiled down to the existing ones.)

  • In the type system, Elm supports record types, which are a lot like tuples but with nicer syntax. I believe these too could be thinly compiled down. But it also supports extensible records, which are more complicated. On one level you can think of a type like {a | x : Int, y : Int} like a tuple ∀a. (a, Int, Int). But then this tuple needs to be unpacked and manipulated when you pass it into a function expecting an {a | x : Int}.

    I believe this is unresolvable, and extensible records represent an extension of Elm from HM. (But one with fairly minor legibility costs, in comparison to the expressiveness gains.)

  • Haskell supports typeclasses, which are a way of allowing functions to operate on multiple different types. (For example, the show function can be applied to a String, an Int, a (), a [Float], ….) Elm doesn't, but simple typeclasses can be emulated with only a little added verbosity.

Another thing I'll say is that I've been talking about legibility and expressivity of a language. But a type system is itself a language, and may be more or less legible and expressive. I don't have a strong intuition for how these interact.

There's a lot more I could add to this post. Some things that I omitted for brevity, some that I omitted because I don't know enough about them yet14, and some that I omitted because I don't know about them at all. I don't know what a sensible cutoff point is, so I'm just going to end it here.

From writing my original talk, and subsequently this blog post, I think I understand HM type systems much better than I used to. Hopefully you think the same. Hopefully we're both correct. If you see any inaccuracies, please point them out.

  1. While writing this essay I came across the talk Constraints Liberate, Liberties Constrain. From the title and the context I encountered it, it sounds like it's on the same subject. But I haven't watched it, because it's in the form of a 50 minute video. 

  2. If the halting problem is decideable on a language, the language is Turing incomplete. I don't know whether the reverse is true: are there Turing incomplete languages on which the halting problem is still undecideable? I'm mostly going to assume not. At any rate, I don't think I'm going to discuss any such languages. 

  3. To nitpick myself: these aren't just languages for which you can prove termination, they're languages which never terminate, at least not for finite inputs. I don't offhand know any languages which are Turing incomplete but have the ability to loop forever, though such a thing can exist. 

  4. Specifically, it looks to me like SQL-99 without recursive common table expressions is Turing incomplete. I've only ever used nonrecursive CTEs. 

  5. I've subsequently discovered that wikipedia uses the same name for this concept. 

  6. I think this is related to the way that ZF set theory can encode Peano arithmetic. Thus, ZF is more expressive than PA. But because ZF allows you to construct objects that PA doesn't, there are more things you can say about "all objects in PA" than about "all objects in ZF". So PA is more legible than ZF. I don't understand the Curry-Howard correspondence, but I think that's related too. 

  7. "My company" is a phrase which sometimes means "the company I own or run" and sometimes "the company I work for". Here it means the latter. I don't know an unambigous way to phrase that which I don't find slightly awkward, so instead I'm using a super-awkward footnote. But, y'know. Ironically, or something. 

  8. We use Elm 0.18. 0.19 is a fairly significant version change, but I think not different enough to be relevant for this post. 

  9. Specifically GHC, which offers many extensions over Haskell proper. Whenever I refer to Haskell, I'm really talking about the language that GHC implements. 

  10. At any rate, it works fine when you pass it a number. If you pass it something else, it might do anything. 

  11. Well, sort of. It still performs type inference, it just allows it to fail. I'm not sure if "no type inference at all" would work for Haskell; but I do think it would work for a pure HM system, if you count things like "3 is of type Int" as a raw fact, non-inferred. 

  12. Minor brag: I myself contributed the Elm implementation on that page. 

  13. I think it might look something like this:

    mHead ml =
        if *isJust ml && (*fromJust ml (\_x -> *isCons _x)) then
            *fromJust ml (\_x -> *fromCons _x (\a _ -> Ok a))
        else if *isJust ml then
            *fromJust ml (\_ -> Err True)
        else if *isNothing ml then
            Err False

    functions marked with a * can be hidden from the language user. Additionally, *fromJust, *fromCons and *fail would be able to throw runtime errors. These don't violate Elm's "no runtime errors" policy, because the compiler would only generate them in contexts where it could prove they wouldn't throw. (In the case of *fail, when it could prove that code branch was unreachable, so it could also just not bother.)

    I'm very much spitballing here. I wouldn't be surprised to discover that the approach I've described is completely unworkable. 

  14. Not that that stopped me from writing this entire post. 

Posted on 05 May 2019 | Comments

"Now here's why I'm punching you..."

Related: be nice, at least until you can coordinate meanness.

A premise of this post is that punching people is sometimes better than the alternatives.

I mean that literally, but mostly metaphorically. Things I take as metaphorical punching include name calling, writing angry tweets to or about someone, ejecting them from a group, callout posts, and arguing that we should punch them.

Given that punching people is sometimes better than the alternatives, I think we need to be able to have conversations about when "sometimes" is. And indeed we can and do have those conversations. Many words have been spilled on the subject.

But I think it's probably a good idea to try to avoid having those conversations while actually punching people.

Here's what I mean. Alice thinks that punching Bob is better than the alternatives. But she thinks that if she just starts punching, Carol and Dave and Eve might not understand why. Not even if she tells them what Bob has done. She thinks punching Bob is better than the alternives, but she thinks the reasons for that are slightly complicated and haven't previously been articulated very well, at least not in a way that makes them common knowledge.

So she writes an essay in which:

  1. She proposes a theory for when punching people is better than the alternatives. (She readily admits that the theory is not complete, nor is it intended to be, but it covers part of the space.)

  2. She describes the situation with Bob, and how the theory justifies punching him.

  3. She punches Bob.

I think this could be a mistake. I think she should maybe split that post into at least two parts, published separately. In the first part, she proposes the theory with no mention of Bob. Then, if Carol and Dave and Eve seem to more-or-less agree with the theory, she can also publish the part where it relates to Bob, and punch him.

I think this has a few advantages.

  • Suppose Alice can't convince anyone that the theory holds. Then Bob is kept out of things entirely, unless Alice wants to go ahead and punch him even knowing that people won't join in. In that case, people know in advance that Alice is punching under a theory that isn't commonly subscribed to.

  • Suppose the theory is sound, and also justifies punching Fred. Then someone can link to the theory post separately, without implicitly bringing up the whole Bob thing. This is especially good if the theory doesn't actually justify punching Bob, but it's somewhat good regardless.

  • Suppose Bob disagrees with some part of the argument. When he gets punched, he's likely to be triggered or at least defensive. That's going to make it harder for him to articulate his disagreement. If it comes split up, the "thing he has to react to while triggered" may be smaller. (It may not be, if he has to react to the whole thing; but even then, he may have seen the first article, and had a chance to respond to it, before getting punched.)

  • Suppose that splitting-things-up like this becomes a community norm. Now, if Alice just wants to come up with excuses to punch Bob, it's harder for her to do that and get away with it, harder for her to make it look like an honest mistake.

It might seem even better to split into three posts: theory, then application ("and here's why that justifies punching Bob"), and then wait for another post to actually punch him. But since "arguing that we should punch Bob" is a form of punching Bob, splitting those two out isn't necessarily possible. At best it would be "theory, then application and mild punching, then full-strength punching". It's more likely to be worth it if there's a big difference between the two levels. "Here is why I think I should kick Bob out of the group" is considerably weaker than "I hereby kick Bob out of the group". But "here is why I think you all should stop trusting Bob now" is not much weaker than "you all should stop trusting Bob now".

However, I don't think this is always unambiguously a good thing. There are some disadvantages too:

  • The initial post is likely to be drier, less compelling, without concrete examples.1 And perhaps harder to evaluate, especially for less abstract thinkers.
  • You can't really remove the initial post from its context of "Alice thinks we should punch Bob". You can hide that context, but that doesn't remove its influence. For example, if there are cases similar to Bob's that would be covered by the same theory, Alice's post is likely to gloss over the parts of the theory that relate to them-but-not-Bob, and to focus too much on the parts that relate to Bob-but-not-them.

  • Suppose the theory is sound, but the facts of the case don't support punching Bob. Splitting the posts adds more opportunity for sleight-of-hand, such as using a term to mean different things in different places. This would be harder to notice in a split post than a monolithic post, if each part is internally consistent.

  • It may be harder to write this way, which may cause some better-than-the-alternatives punching to go unperformed.

  • It's slower. Sometimes that's probably neutral-to-good. But often, if punching someone is better than the alternatives, it's because they're currently hurting other people. If punching them will make them stop, then ideally we want to punch quickly.

I'm not sure how all these factors really shake out, and I expect it'll vary from case to case. So I don't want to offer a blanket suggestion. I think my advice is: if you're thinking of writing one of those all-in-one posts, consider splitting it up. It won't always be the right thing to do, but I think it's an option to bear in mind. Here are some questions to ask that might sway you in one direction or the other:

  • How hard are you punching?2 If someone googles Bob, will they find your punches? (At least once, Scott Alexander used a pseudonym for a person he was punching; this seems like a useful tool.)
  • If the punching is delayed, does anything bad happen?

  • Does the theory apply more generally than it needs to for this specific case? Thinking of similar cases might help, especially real ones but also fictional. (If you can think of lots of real cases, the value of having a reference post for the theory goes up, and its value as a reference post goes up if it has less baggage.)

(As an aside: I want to note that a post which looks like an all-in-one might not be. It may be recapping previously established theory. Common knowledge is rarely absolutely common, so I suspect this will usually be a good idea.)

  1. See for example, this post. (Though the reason I don't have examples here is different. My motivating example hasn't been written yet3, and I didn't go looking for others. Still, I expect the effects of not having examples are similar.) 

  2. And not just you personally, but your audience. If your audience is large and vicious, then no matter how gently you yourself punch someone, they're going to experience a lot of pummelling. 

  3. And there's a decent chance it won't ever, given my track record. 

Posted on 16 October 2018 | Comments

The Mathematics of Matched Betting

I've been doing a little bit of matched betting lately1. The idea is that you place two opposite bets on the same event, calibrated so that your profit will be the same no matter which bet wins. If you do this entirely with your own money, your profit will (under reasonable assumptions) be negative. But bookmakers often offer free bets; you can use matched betting to extract most of the amount of that free bet as actual money.

This post isn't advice about how to get into matched betting. That market is probably saturated; if you want to learn, I used this guide and it served me well. (However, if anyone is inspired to try it by this post, I have a referral code for smarkets: safto14. It looks like if you sign up with that and bet $£20$, both you and I will receive $£10$ in risk-free bets. I swear that's not why I'm posting this. I might not even get around to using it.)

(Um, but after that I feel obliged to give at least some safety information. So here goes: if you're not a UK citizen currently in the UK, this may be a bad idea. Don't use credit cards to deposit funds; it seems they interpret that as a cash transaction and charge fees. Start small; that way there's less at risk in case you do something silly like use a credit card to deposit funds. Probably don't expect to earn lots of money this way, either in total or per-hour.)

Instead, I want to go into the math behind it, in more depth than I've seen in other places. None of this math is complicated, but some of it is useful, and I haven't found it anywhere else. (I even deliberately went looking.)

A simple example

(If you've seen one example of matched betting, you've seen them all, and you can skip this particular one.)

You have a $£10$ free bet at a bookmaker. You find a football game, say Manchester Utd versus Liverpool, that you want to bet on. The bookmaker offers odds of $4$ on Liverpool, and you bet your $£10$ on them.

(A note on odds: the usual convention in gambling seems to be to use decimal odds. Odds of $x$ mean that your potential winnings are $x-1$ times your stake. Thus, odds of $4$ mean a bet of $£10$ has the potential to pay out $£30$. If you're used to odds notated $a:b$ or (equivalently) $a/b$, then the decimal odds are given by $a/b + 1$.)

So if Liverpool wins, you'll earn $£30$; if they lose or draw, you lose nothing. You then look up the same match at a betting exchange. An exchange allows you to take both sides of a bet, which a bookmaker won't. The exchange offers odds of $4.3$ to lay Liverpool; this means that you win your bet in the exchange only if Liverpool doesn't win. You accept a stake of $£6.98$, which means your own stake is $£23.03$.

Now if Liverpool wins the match, the bookmaker pays you $£30$ and you lose $£23.03$ in the exchange, for a net profit of $£6.97$. And if Liverpool loses, you earn $£6.98$ in the exchange and lose nothing at the bookmaker, for a net profit of $£6.98$. You've turned a $£10$ free bet into almost $£7$ of actual money.

(I'm ignoring for now the commission that the exchange will usually collect when you win a bet on them. With $2\%$ commission, you would instead accept stakes of $£7.01$, wagering your own $£23.13$; if Liverpool doesn't win, you would earn $£7.01 · 0.98 = £6.87$, which is also what you'd earn if Liverpool does win.)

Before bookmakers will give you a free bet, you'll usually have to place a bet with them using your own money. You lose a small amount of money on this bet, but you can use the same principles to ensure that you lose the same amount no matter who wins. You might lose around $£0.50$ on a $£10$ qualifying bet, in which case you end up with around $£6.50$ profit when all's said and done.

This has been a very brief introduction to matched betting. Now, into the math. I'm going to be focusing on two kinds of bet: qualifying bets, which are usually known as just bets, and free bets, where you don't lose anything if your back bet loses. I'm also going to ignore rounding; let's just pretend that the sterling is infinitely divisible.

Some definitions

We can think of a paired bet as having six parameters, $(O_b, O_l, S_b, S_l, C_b, C_l)$. These are three parameters for each of a pair of back and lay bets.

$O_b, O_l ≥ 1$ are the odds on the back and lay bets. It's typically safe to assume $O_b < O_l$; otherwise, modulo commission, you could make a profit even on your qualifying bets2. They can't be less than $1$ because we're using decimal odds; that would correspond to a probability below $0$.

$S_b, S_l ≥ 0$ are the stakes on the back and lay bets. Note that $S_l$ is the stake offered by the other party to your lay bet; it's (roughly) the amount you stand to win on that bet, not the amount you stand to lose. This may seem strange, but it's the convention used.

And $C_b, C_l ∈ [0, 1]$ are the commission charged on your winnings on each side. Usually $C_b = 0$: bookmakers don't charge commissions, they make money by offering low odds. The two exchanges I've used have $C_l = 2\% = 0.02$ (Smarkets) and $C_l = 5\% = 0.05$ (Betfair).

I'm also going to introduce the symbol $C^+ = (1 - C_l)(1 - C_b)$. If you passed $£1$ through your bookmaker and exchange, and they each charged commission and nothing else, you would have $£C^+$ left at the end. $C^+$ isn't enough for us to fully understand a matched bet, we need the individual back and lay commissions as well, but it'll be convenient shorthand.

Now let $R_{xy}$ (where $x,y ∈ \{b,l\}$) be your return on side $y$ if your bet on side $x$ wins. So for a qualifying bet (which is just a regular bet, outside the context of matched betting), we have:

$$ \begin{align*}
     R_{bb} &= S_b (O_b - 1) (1 - C_b)     \\
     R_{bl} &= - S_l (O_l - 1)             \\
     R_{lb} &= - S_b                       \\
     R_{ll} &= S_l (1 - C_l).
\end{align*} $$

For a free bet, the only change is

$$ R_{lb} = 0. $$

So your profit is $R_{bb} + R_{bl}$ if your back bet wins; and $R_{lb} + R_{ll}$ if your lay bet wins.

And now we can say that a matched bet is simply a paired bet, where your profit is the same in either case. I won't need to talk about simple paired bets from now on; all bets are matched. When I talk about a "free bet" or "qualifying bet", those are matched bets too.3

Incidentally, the six prameters are over-determined. Most commonly we want to learn $S_l$ given the other five; but knowing any five of them will fix the value of the sixth.

Optimal lay stake

The first question we'll ask is, given $O_*$, $C_*$ and $S_b$, what must $S_l$ be to make our bet a matched bet? Or in other words, what $S_l$ should we choose to eliminate all risk?

We need

$$ R_{bb} + R_{bl} = R_{lb} + R_{ll} $$

which after some (perhaps silly-seeming) substitution and rearrangement gives

$$ S_l = { R_{bb} - R_{lb} \over R_{ll}/S_l - R_{bl}/S_l }. $$

This looks circular, but when we substitute for the values of $R_{**}$, $S_l$ disappears from the right hand side. For a qualifying bet, this gives

$$ S_l = S_b { (O_b - 1)(1 - C_b) + 1 \over O_l - C_l }, $$

and for a free bet,

$$ S_l = S_b { (O_b - 1)(1 - C_b) \over O_l - C_l }. $$

A thing to note here is that $O_l$ and $C_l$ only appear in the term $O_l - C_l$. In other words, the effect of lay commission is to decrease the effective lay odds in the most natural way. It would be nice if this happened in other contexts too, but unfortunately I haven't come across it. The $O_l - C_l$ term is common, but it's usually accompanied by another $O_l$ and/or $C_l$ somewhere else in the expression.


Next, we'll want to know how much profit we make. This is given by $R_{lb} + R_{ll}$, where we calculate $R_{ll}$ using the lay stake we just found. But since both of these terms are proportional to $S_b$, we'll find it more convenient to think in terms of profit per unit of back stake,

$$ P = { R_{lb} + R_{ll} \over S_b }. $$

Under a qualifying bet, this is

$$ P_q = C^+ { O_b + C_b/(1 - C_b) \over O_l - C_l } - 1, $$

and for a free bet, it's

$$ P_f = C^+ { O_b - 1 \over O_l - C_l }. $$

We can look at these functions graphically:

(Graph of P_q(O_b, O_l)) (Graph of P_f(O_b, O_l))

(all images link to larger versions)

each line represents a contour of the function, a set of points that all have the same profit. The sets of contours look superficially similar, but they're generally steeper for a free bet, and they get cut off on the bottom edge instead of the left edge. In both cases, profit increases with $O_b$ and decreases with $O_l$.

We can reparameterise in terms of $O_b$ and $σ = O_l - O_b$, the spread between the back and lay odds. Since $O_l ≥ O_b$, we only need to consider $σ ≥ 0$. This gives us

$$ \begin{align*}
    P_q &= C^+ { O_b + C_b/(1 - C_b) \over O_b + σ - C_l } - 1  \\
    P_f &= C^+ { O_b - 1 \over O_b + σ - C_l }.
\end{align*} $$

(Graph of P_q(O_b, σ)) (Graph of P_f(O_b, σ))

These are slightly more distinct. Looking at these graphs, it seems that for a qualifying bet, having low $σ$ is more significant than having high $O_b$; but for a free bet, having high $O_b$ is more significant than having low $σ$. If so, that suggests you might want to be looking for different sorts of bets at each stage. (It is so, and we'll make it more precise later.)

We can also look at $P_f - P_q$, the difference in profit between a qualifying bet and a free bet. This isn't particularly useful to compare bets: you place qualifying bets to get free bets, and you place free bets to get money, and if you're doing pure matched betting, I don't think you'll ever be asking yourself should I place this bet free or as a qualifier? Still, the difference is

$$ P_f - P_q = 1 - { (1 - C_l)(1 - 2C_b) \over O_l - C_l }. $$

The more $O_l$ grows, the worse a qualifier becomes relative to a free bet. This is another suggestion that you should be looking at different sorts of bets for your qualifiers and your free bets.


One more thing is important when making a matched bet: lay liability. This is how much you stand to lose on the exchange where you make your lay bet. (It's only important for boring real-world reasons like liquidity and the possibility of doing something stupid, but those are still important.) You need to have this much money in your account at the exchange, which means you need to be able to spare it from your bank account for a week or so. Low-liability bets are also safer if something goes wrong, which makes them a good choice for early dabblers in matched betting.

Liability is simply given by $-R_{bl} = S_l (O_l - 1)$, which is

$$ S_b (O_l - 1) { (O_b - 1)(1 - C_b) + 1 \over O_l - C_l } $$

for a qualifying bet and

$$ S_b (O_l - 1) { (O_b - 1)(1 - C_b) \over O_l - C_l }  $$

for a free bet.

(Graph of liability(O_b, O_l), qualifying bet) (Graph of liability(O_b, σ), free bet)

(I made the graphs in $σ$ as well, but they're too boring to include inline)

Unlike profit, liability increases with both $O_b$ and $O_l$. But it increases arbitrarily with $O_b$, and asymtotically with $O_l$; it's bounded above by roughly $S_b O_b$ for a qualifying bet and $S_b (O_b - 1)$ for a free bet. (If the graphs were extended further, as they stretch upwards the lines would become ever more vertical, but they'd always stay separate. To the right, the lines would become ever more horizontal, all of them converging on $O_l = 1$.)

Improving on a free bet

Matched bet calculators aren't hard to find, and what I've given so far is nothing that they can't do for you. But they don't tell you everything you might want to know. Let's look at a bet, and see how we might find a better bet. Since the two types have different behaviours, we'll treat them separately.

To maximise profit, we usually need to consider that $S_b, C_b$ and $C_l$ are fixed, and find the dependence of $P$ on $O_b$ and $O_l$. For a free bet, that means we want to maximise the term

$$ P_f ∝ {O_b - 1 \over O_l - C_l}. $$

This tells us a few things. The first is that we want high back odds and low lay odds. We already knew that, and it's not very helpful; we expect back and lay odds to more-or-less rise and fall together. It also tells us that adding a constant to both odds will increase profit; odds of 5 and 6 will be better than odds of 4 and 5. (This, too, we could have deduced before; or we could have seen it on the graph of $P_f(O_b, σ)$.)

But consider what happens when $σ = 0$. Then the term in question is

$$ { O_b - 1 \over O_b - C_l } $$

which, as $O_b$ ranges from $1$ to $∞$, takes all values in $[0, 1)$. But when $σ > 0$, the possible values are exactly the same; high $σ$ changes the $O_b$ that gives you any particular profit, but it doesn't make any profit value available or unavailable.

What that means is: given any free bet, we can construct another free bet with equal profit but $σ = 0$, not changing $S_b$ or $C_*$.

Or: given odds $O_b, O_l$, we can calulate the odds $O'$ that would give you the same profit, if you could find these odds for both a back and a lay bet.

In turn, that tells you that if you want to improve your profits, you can ignore bets with $O_b < O'$. (Because for those bets, $P_f(O_b, σ) < P_f(O', σ) ≤ P_f(O', 0)$. The first inequality comes from adding a constant to both odds, and the second comes from reducing $O_l$.) This is a useful thing to know, that matched bet calculators don't tell you.

To find $O'$, we set

$$ { O_b - 1 \over O_l - C_l } = { O' - 1 \over O' - C_l } $$

and deduce

$$ \begin{align*}
    O' &= { O_l - O_bC_l \over 1 + O_l - O_b - C_l } \\
       &= O_b { 1 - C_l + σ/O_b \over 1 - C_l + σ }.
\end{align*} $$

The expression with $σ$ isn't exactly simpler, but I think it's more aesthetically pleasing. (Consider that $1-C_l$ is approximately as fundamental as $C_l$ itself.) Graphically:

(Graph of P_f(O_b, Ol)) (Graph of P_f(O_b, σ))

We can also calculate $O'$ simply as a function of profit, and vice versa:

(Graph of P_f(O'))

$$ P_f = C^+ { O' - 1 \over O' - C_l }                 \\
   O' = { C_lP_f - C^+ \over P_f - C^+ } $$

$P_f$ approaches an asymtote at $C^+$, but slowly. With $C_b = 0, C_l = 0.02$, extracting $80\%$ of a free bet is only possible if $O_b ≥ 5.36$. For $90\%$, you need $O_b ≥ 12.03$. Such bets are somewhat rare in my experience, and typically have high spread.

We can go more general. Given a profit, we can calculate the level curve of all bets which generate that profit; the case $σ=0$ gives us only a single point on that curve. The curve divides bet-space into two regions, so that it's easy to see whether a bet gives more or less than this amount of profit.

(Earlier we saw this level curve graphically, for certain specific profits. Now we find the explicit formula for the curve, which I secretly already used to draw the graphs.)

We already have

$$ \begin{align*}
    P_f &= C^+ { O_b - 1 \over O_l - C_l }    \\
        &= C^+ { O_b - 1 \over O_b + σ - C_l },
\end{align*} $$

and it's just a matter of rearranging these:

$$ O_b C^+ = P_f (O_l - C_l) + C^+\\
   O_b (C^+ - P_f) = P_f (σ - C_l) + C^+. $$

These two equations can be used to find $O_b$ in terms of $O_l$ or $σ$, and vice-versa. Both are very simple at heart: they're linear relationships, that could be rearranged to the form $y = mx + c$.

Looking more closely at the second one, notice that $C^+$ is the upper bound on profit. So the term $C^+ - P_f$ can be thought of as how much profit is being left on the table, compared to what you could hypothetically get if odds of $∞$ were a thing. The less profit you leave behind, the less $σ$ has to change to compensate for a given change in $O_b$. In other words, when profit is high, the level curve on the graph of $P_f(O_b, σ)$ becomes shallower, as we saw above.

Improving on a qualifying bet

For a qualifying bet, we can't quite do the same thing. If we temporarily assume $C_b = 0$, then the term we want to maximise is

$$ P_q + 1 ∝ {O_b \over O_l - C_l}. $$

This doesn't work the same as the equivalent term for a free bet. If you keep $σ$ fixed and consider profit as a function of $O_b$, then this function acts differently depending on the sign of $σ - C_l$. If $σ ≤ C_l$, then regardless of $O_b$ you get more profit than is ever possible with $σ > C_l$.

This isn't immediately practically important, because $σ > C_l$ is a pretty safe assumption. But it's mathematically significant. For a free bet, setting $σ$ to $0$ doesn't rule out any profit levels, so we could ask "how would we get this particular profit with $σ = 0$?" If we try to ask that for a qualifying bet, the answer is typically that we can't. So the approach we used for a free bet doesn't work on a qualifying bet.

We also can't set $O_b$ to its best possible value, because it can go arbitrarily high. But we can try setting it to its limiting worst value ($O_b = 1$). We find $σ'$ such that

$$ { O_b + C_b/(1 - C_b) \over O_b + σ - C_l }
   = { 1 + C_b/(1 - C_b) \over 1 + σ' - C_l }, $$

which gives us

(Graph of σ'(O_b, σ))

$$ σ' = { σ + (O_b - 1)(1 - C^+) \over 1 + (O_b - 1)(1 - C_b) }. $$

Now we know that any bet with a spread less than $σ'$ will give better profit than the bet we started with. Unfortunately, I think this still isn't as good as what we got for a free bet, for three reasons.

  1. For a free bet, we had an easy negative test: some bets (those with $O_b < O'$) could be ruled out on a glance, but verifying them took more work. Here, the test is positive: some bets (those with $σ < σ'$) can be accepted on a glance, but verifying the others takes more work.4

    In practice, I expect the positive test will almost alway be inconclusive, meaning you still need to do the more difficult check on every bet. (I haven't done enough betting myself, while writing this, to say from experience.)

  2. My workflow is to find a plausible-looking back bet and then see how it would be matched. With a free bet, I can run the easy test without looking for the match. For a qualifying bet, I need to find both sides of the bet before I can run the easy test.

  3. Qualifying bets often must be placed at a minimum odds (on the back side) in order to count. That typically rules out the lowest-spread bets (see below digression).

Still, this is what we have. Following a similar theme as before, we can calculate $σ'$ and $P_q$ as functions of each other:

(Graph of P_q(σ'))

$$ P_q = { 1 - C_l \over 1 - C_l + σ' } - 1 \\
   σ' = { 1 - C_l \over P_q + 1 } + C_l - 1. $$

(Note that these equations don't contain $C_b$. That's not because we're assuming it's $0$: when you set $O_b = 1$, $C_b$ completely disappears from the equation for $P_q$.)

Interestingly, the bounds of $P_q$ don't depend on commission at all. As $σ'$ grows, $P_q$ always approaches an asymtote at $-1$, which isn't surprising: you can't quite lose your entire back stake, but you can get arbitrarily close to that, even with no commission.

On the other edge of the graph, we always have $P_q(O_b=1, σ'=0) = 0$. (That may not be clear on this image, but it's easy to see algrebraically.) That's because at $O_b = O_l = 1$, both bets are completely one-sided. On the back side you have a chance of losing money, but no way to win it; on the lay side you have a chance of winning money, but no way to lose it. In particular, if the back bet wins, you make no profit or loss on either bet, so commission is irrelevant. And so the lay stake is calibrated for your lay winnings, after commission, to cancel out your back loss. (But if someone is willing to give you free maybe-money, you might as well ask for as much maybe-money as they're willing to give you.)

And again, given profit, we can calculate the level curve of bets which return that profit. Unsurprisingly, we find another linear relationship; it comes to

$$ O_bC^+ + C_b(1 - C_l) = (Pq + 1)(Ol - Cl) \\
   O_b(1 - C_b - Λ) + C_b = Λ(σ - C_l), $$


$$ Λ = { P_q + 1 \over 1 - C_l }. $$

I'm afraid I can offer no particular interpretation of what $Λ$ means, though I observe that we can substitute it into a previous equation, $σ' = 1/Λ + C_l - 1$. Note that if $Λ ≥ 1 - C_b$, equivalently if $P_q + 1 ≥ C^+$, then $σ$ and $O_b$ start to move in opposite directions: for fixed profit, $σ$ goes up as $O_b$ goes down. At this point, you get more profit with low $O_b$ as well as with low $σ$, which would be convenient if it was ever realistically going to happen.

(It turns out that $P_q + 1 ≥ C^+$ precisely when $σ ≤ C_l + C_b/(1 - C_b)$. I noted above that if $C_b = 0$, the possible values of $P_q + 1$ depend on the sign of $σ-C_l$. This is the same result, generalized to all values of $C_b$.)

A digression on odds

Note that in general, you can expect spread to be lower at lower odds. That's because odds are more sensitive to evidence when they're high than when they're low.

There's a technical interpretation of that, but I'm just going to illustrate by example. Consider the probabilities $1/5$ and $1/6$. These are complementary to $4/5$ and $5/6$ - the two pairs of probabilities encode the same information. Converting to decimal, probabilities $1/5$ and $1/6$ are decimal odds $1.25$ and $1.2$; and probabilities $4/5$ and $5/6$ are decimal odds $5$ and $6$.

So the difference between the odds $1.25$ and $1.2$ is, in a very important sense, the same as the difference between $5$ and $6$. But when it comes to betting, the spreads of $0.05$ and $1$ are very different.

The takeaway from this is that for qualifying bets, you should be looking at bets with low odds. High odds have better returns, but the effect of low spread is much more significant, and low spread comes with low odds.

The effects of commission

I want to explore one more question: how does profit depend on commission? For this, we'll keep $C_b$ and $O_b$ fixed, and explore how $O_l$ and $C_l$ affect profit.

Conveniently, the term we want to maximise is the same,

$$ \begin{align*}
    P_q + 1 ∝ { 1 - C_l \over O_l - C_l }  \\
    P_f ∝ { 1 - C_l \over O_l - C_l }.
\end{align*} $$

So if we find the same lay bet on two different exchanges, we can compare them without regard for the back bet we'd be matching.

The two exchanges I've used have $C_l$ of $0.02$ and $0.05$, so they give equal profits when

$$ { 0.98 \over O_S - 0.02 } = { 0.95 \over O_B - 0.95 } $$

where $O_S$ is the odds offered on Smarkets and $O_B$ is the odds offered on Betfair. This rearranges to

$$ 98·O_b = 95·O_S. $$

Since $98/95 ≈ 1.03$, it's better to use Betfair than Smarkets if the offered odds are roughly $3%$ lower, which happens to be the difference in commission. So for example, odds of $6$ on Betfair correspond to roughly $6.19$ on Smarkets.

It should be easy to take a bunch of equivalent bets on the two sites, and compare to see which seems likely to give better profit in general. I started doing that, but then I looked at three football games and got bored and stopped.

They all had exactly the same odds on all three positions (win/draw/win), even when they fluctuated slightly pre-game. (I did look at one game as it began, and the two sites didn't quite stay in sync then. But betting while odds are fluctuating a lot is a bad idea.) This suggests that Smarkets typically offers better profits. But Betfair is a more popular site, which probably has advantages; in particular, it's plausible that a large bet would be harder to fully match on Smarkets.

And that's it. There are plenty more interesting questions you could ask5, but I'm going to stop there.

Something that would be nice would be a calculator that can make use of this. The online calculators all seem pretty crap: they only tell you profit, lay stake and lay liability, and only for one bet at a time. Being able to compare bets seems like it would be a pretty important feature, but I haven't seen it anywhere. (Some of them have features to work with more complicated types of bets than I've looked at, but I don't care about that. Here's one that's no worse than any other.) I've also seen an excel calculator, which had the neat feature of automatically adding bets to a spreadsheet. But again, only one bet at a time; plus, I don't have excel, and don't keep track of my bets in a spreadsheet. (I use ledger, though it's not a great fit.)

I've written a command-line tool that can show you multiple bets at a time for comparison purposes. It also shows you, for a free bet, the lowest possible back odds to improve on your highest-profit bet ($O'$); or, for a qualifying bet, the highest possible spread that won't improve on it ($σ'$).

But the interface isn't very good, I think partly because of the limits of the command line and partly because of a questionable design decision (see the README). And it can't show graphs, which I think would be nice. If you want to use it anyway, it's on github.

If I were to dedicate more time to the project, I currently think I'd start again in Javascript. I think I have a vague idea of how a decent one could work. But right now, as far as I can tell there are no good calculators.

  1. Well, I haven't done any for a few months now. This post took a long time to write. 

  2. I think maybe sometimes bookmakers will offer improved odds as a special offer, and maybe then sometimes you'll find $O_b > O_l$. But it's not a case I'll particularly consider in this post. 

  3. I'm also assuming that all free bets are "stake not returned". In a standard bet, if you win, you get your stake back and also collect your winnings. If you win a stake not returned free bet, you collect your winnings; but you don't also get to keep the stake, because that wasn't yours in the first place. If you have a "stake returned" free bet, that effectively increases the odds on the back side by $1$. I've not yet encountered one of these, myself.

    Another type is the "risk-free" bet, which I won't go into here partly because I'm not 100% sure what it means. But I think that "£10 in risk-free bets" allows you to make a bet of more than £10, and if you lose, you get £10 back. I think the way to treat it is as putting £10 into a free bet (stake not returned) and the remainder of your stake into a qualifying bet, and so by default you should put in no more than the risk-free amount. 

  4. Shape of the "more profitable" space

    Another way to look at this is by the shape of the "more profitable" space on the graphs of $P_q(O_b, σ)$ and $P_f(O_b, σ)$, as seen here. On the $P_f$ graph, $O'$ carves up the "less profitable" space without entering the "more profitable" space. On the $P_q$ graph, $σ'$ carves up the "more profitable" space without entering the "less profitable" space. There's no equivalent of $O'$ for a qualifying bet, or of $σ'$ for a free bet. 

  5. A thing I'd like to explore at some point, is qualifying bets where you only win a free bet under certain circumstances. For example, you bet on a team to win a football match, and you win a free bet iff your team is winning at the end of both halves. I think you can still guarantee a profit with the right combination of bets, but there are more variables involved. (On the exchange, you lay your team to win, like normal. That way you turn a profit if they win both halves, and lose a small amount otherwise. So you also lay them to win both halves - that's not a standard win/lose/draw bet, but I think it's generally available. If you pick the right stakes, you can turn a profit in both cases. Though probably not exactly the same profit in both cases, because you can't easily predict exactly how much the free bet will be worth.) 

Posted on 02 June 2018 | Comments

Pareto improvements are rarer than they seem

this is surely not an original insight, but I haven't seen it before

A Pareto improvement is where you make one party better off and no parties worse off.

Suppose Adam has a rare baseball card. He assigns no intrinsic value to baseball cards. Adam likes Beth, and somewhat values her happiness. Beth collects baseball cards, and would happily pay $100 for Adam's card.

If Adam just gives Beth his baseball card, is that a Pareto improvement? Naively, yes: he loses the card that he doesn't care about, and gains her happiness; she gains the card. Both are better off.

But I claim not, because if Adam has the card, he can sell it to Beth for $100. He would much prefer doing that over just giving her the card. But if Beth has the card, he can't do that. He assigns no intrinsic value to the card, but he can still value it as a trading chip.

Now suppose Adam has the baseball card but Beth also has a copy of that card. Then Beth has less desire for Adam's card, so this situation also isn't a Pareto improvement over the original. By giving something to Beth, we've made Adam's situation worse, even though Adam likes Beth and values her happiness.

And I think situations like this are common. The ability to give someone something they want, is a form of power; and power is instrumentally useful. And the less someone wants, the less able you are to give them something they want1.

For a closer-to-reality example, the reddit comment that sparked this post said:

bringing Platform 3 back into use at Liverpool Street Underground Station was denied because the platform would not be accessible. Neither of the platforms currently in use for that line is accessible, so allowing Platform 3 to be used would be a Pareto improvement

The model here is that there are two parties, people who can access the platforms at Liverpool St and those who can't. If Platform 3 is brought back into use, the first group gains something and the second group loses nothing.

But I think that if Platform 3 is brought back into use, the second group loses some power. They lose the power to say "we'll let you bring back Platform 3 if you give us…". Maybe Platform 3 can be made accessible for $1 million. Then they can say "we'll let you bring it back if you make it accessible", but they can't do that if it's already back in use.

And they lose some power to say "if you ignore us, we'll make things difficult for you". Maybe it would take \$1 trillion to make Platform 3 accessible. If Platform 3 remains out of use, people are more likely to spend \$1 million to make their building projects accessible, because they've seen what happens if they don't. Conversely, if Platform 3 comes back, people are more likely to exaggerate future costs of accessibility. "If I say it costs \$1 million, I'll have to pay. If I say it costs \$10 million, maybe I won't."

I haven't researched the situation in question, and I expect that the actual power dynamics in play don't look quite like that. But I think the point stands.

(My original reply said: "If it's easier to turn an unused inaccessible platform into a used accessible platform, than to turn a used inaccessible platform into a used accessible platform - I don't know if that's the case, but it sounds plausible - then opening the platform isn't a Pareto improvement." That still seems true to me, but it's not what I'm talking about here. There are lots of reasons why something might not be a Pareto improvement.)

This doesn't mean Pareto improvements don't exist. But I think a lot of things that look like them are not.

Update 2018-02-02: some good comments on reddit and LessWrong. Following those, I have two things in particular to add.

First, that I like /u/AntiTwister's summary: "If you have the power to prevent another party from gaining utility, then you lose utility by giving up that power even if you are allies. There is opportunity cost in abstaining from using your power as a bargaining chip to increase your own utility."

Second, that there is a related (weaker) concept called Kaldor-Hicks efficiency. I think that a lot of the things that look-like-but-aren't Pareto improvements, are still Kaldor-Hicks improvements - meaning that the utility lost by the losing parties is still less than the utility gained by the winners. In theory, that means that the winners could compensate the losers by giving them some money, to reach a Pareto improvement over the original state. But various political and practical issues can (and often do) get in the way of that.

  1. This feels like it generalizes far beyond questions of Pareto efficiency, but I'm not sure how to frame it. Something like game theory is more competitive than it appears. Even when no two players value the same resource, even when all players genuinely want all other players to do well, players still have an incentive to sabotage each other. 

Posted on 27 January 2018 | Comments

Improving goimports

Epistemic status: Pure dilettantism. I have never used Go. This might make me unqualified to say what I'm about to say. But it's okay, because I use the word "seems" a lot.

In Go, if you have an unused import, your program fails to compile. This has made a lot of people mildly annoyed and been sometimes regarded as a bad idea, but not universally.

The devs decline to add a compiler flag to change this behaviour, "because compiler options should not affect the semantics of the language and because the Go compiler does not report warnings, only errors that prevent compilation". This strikes me as reasonable, if not the decision I personally would make.

Instead, they suggest this pattern:

import "unused"

// This declaration marks the import as used by referencing an
// item from the package.
var _ = unused.Item  // TODO: Delete before committing!

- that is, add a line of code to trick the compiler into thinking that you're using the import, even though you're not. (Unless you forget to delete the trick line, which the compiler won't verify for you.)

This does not strike me as a very good substitute. With a compiler flag, I could turn it on while debugging and turn it off for production to make sure I had no unused imports. I could use a commit hook to keep unused imports out of the repository. By tricking the compiler, there's no particularly easy way to do this. (I suppose I could put the string "FAKEIMPORT" is a comment on those lines, and grep for that string. This is still not a great solution.)

I also refer to the devs' opinion that "if it's worth complaining about, it's worth fixing in the code. (And if it's not worth fixing, it's not worth mentioning.)" I claim that there's no fundamental difference between

import unused


import unused
var _ = unused.Item

Neither affects the semantics of the program, if we ignore that the first one doesn't compile1. If one is worth complaining about, so is the other. But the devs are sending mixed signals. It seems the first is worth complaining about, because the compiler complains. But it seems the second is not, because the devs recommend it. This should be a sign that something about this whole situation is less than ideal2.

(For myself, I would say that both are worth complaining about in production, and neither is worth complaining about when debugging. But the trick works equally well in both instances.)

There is another solution, in the form of a tool called goimports. The idea is that you don't write imports at all. If you have a symbol that isn't imported, it searches your filesystem for an appropriate package and adds an import line. If you have an unused import, it deletes it.

But word on the street is that some names are ambiguous, supplied by multiple packages, and goimports has no way to choose which one you mean. So if you comment out a line, and then uncomment it, goimports might add back a different import than the one you originally used. This, too, seems less than ideal.

All of which is to say: although I don't use Go, it seems to me that Go has a problem, and that the existing solutions aren't perfect.

I propose a new solution, which could be implemented as two new modes for goimports to run in or as one or two completely new tools.

In the first mode, this tool acts like goimports, but more conservatively. Instead of removing unused imports, it merely comments them out. And instead of searching your filesystem for packages, it merely searches your comment lines, and uncomments them if there's a match.

So if you're debugging, and comment out the only use of a package, this tool will comment out the import for you. When you uncomment that use, the import will be added back in, but without the ambiguous naming problem. At no point do you have to trick the compiler, so you don't have to remember to stop tricking the compiler.

In the second mode, this tool checks for commented out import lines, and tells you whether it found any (or optionally deletes them). It can be called in commit hooks, to prevent such lines from cluttering up a repository.

This seems to me like it would be an improvement on the status quo.

  1. At least, they don't look like they do. Not knowing Go, it's conceivable that they could. If the second can have effects, then the devs' recommendation seems particularly bad. In that case, if a var _ = unused.Item line shows up in a codebase, it's probably a mistake - but you can't be sure without checking, which takes much longer. 

  2. I want to clarify three things. First, I do not think the devs are wrong to recommend using this trick. Second, I do not think the second version is just as bad as the first. The second version has a more obvious smell, which is valuable, because bad code that smells bad is less bad than bad code that smells good. Third, I said the situation is less than ideal, and that's what I meant. I do not think the situation is terrible. 

Posted on 13 January 2017 | Comments

The Parable of the Clock

The monk Dawa had a clock that had stopped, and he was content. When he wished to know the hour, he would glance at the clock, and discover that it was noon.

One day a visiting friend commented on the clock. "Why does your clock say that the hour is noon, when I am quite sure that it is six in the evening?"

Dawa found this unlikely, for the hour had always been noon in his experience. But he had been instilled with the virtues of curiosity and empiricism. If the hour is noon, I desire to believe it is noon. If the hour is six in the evening, I desire to believe it is six in the evening. Let me not become attached to beliefs I may not want. Thus fortified, he sought out other clocks.

The time was indeed six in the evening. In accordance with the virtue of relinquishment, and gently laughing inside at his past foolishness, Dawa serenely set his broken clock forwards by six hours.

Posted on 22 December 2016 | Comments

Data surgery is not rocket science

This is a story of me failing to do something, and some thoughts on how I might have succeeded.

A friend had a problem. He'd been recording some audio on his phone when the battery died, leaving him with a .m4a file that couldn't be played. He had a look at the contents and it contained a bunch of data, so he offered $40 if someone could recover the audio for him.

This seemed like a fun challenge that would exercise skills there isn't much call for. I didn't necessarily expect to succeed, but I decided to give it a go. (I only remember doing something like this once before, when I had a corrupt FAT partition that I managed to fix with some bit-flipping.)

To help, the friend provided two small files from the same app: one a successful recording, and one corrupt like the target file.

The simplest thing was to simply try playing the broken files with mplayer, just in case. It didn't work, and gave an error message saying (among other things) "moov atom not found".

The next thing was to look at all the files in a hex editor, which in this case was "hexdump -C piped into less" because I don't think I have a dedicated hex editor installed. I quickly noticed that the good recording had the bytes moov at location 0x1d, while the corrupt recordings both had the bytes free there.

I also noticed that all three files had the bytes mdat at location 0xc95, followed by some low-entropy data, and then some apparently high-entropy data that seemed to go on to the end of the file. I guessed that this was the actual audio data, while the start of the good recording was a valid audio header.

00000000  00 00 00 18 66 74 79 70  6d 70 34 32 00 00 00 00  |....ftypmp42....|
00000010  69 73 6f 6d 6d 70 34 32  00 00 0c 79 66 72 65 65  |isommp42...yfree|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000c90  00 3f 3f 3f 3f 6d 64 61  74 01 40 22 80 a3 7f f8  |.????mdat.@"....|
00000ca0  85 2d 2d 2d 2d 2d 2d 2d  2d 2d 2d 2d 2d 2d 2d 2d  |.---------------|
00000cb0  2d 2d 2d 2d 2d 2d 2d 2d  2d 2d 2d 2d 2d 2d 2d 2d  |----------------|
00000da0  2d 2d 2d 2d 2d 2d 2d 2d  2d 2d 2d 2d 2e ff f1 0a  |------------....|
00000db0  5a 5a 5a 5a 5a 5a 5a 5a  5a 5a 5a 5a 5a 5a 5a 5a  |ZZZZZZZZZZZZZZZZ|
00000eb0  5a 5a 5a 5a 5a 5a 5a 5a  5a 5a 5a 5d e0 e2 14 b4  |ZZZZZZZZZZZ]....|
00000ec0  b4 b4 b4 b4 b4 b4 b4 b4  b4 b4 b4 b4 b4 b4 b4 b4  |................|
00000ed0  b4 b4 bc 00 e2 36 2e 70  5b 0c 09 88 8b b1 2a ae  |.....6.p[.....*.|
00000ee0  9d 55 6c 14 4c 2a 2a 55  45 44 01 c5 35 93 14 a0  |.Ul.L**UED..5...|

The start of the target file. hexdump prints a * to indicate that several lines have been skipped which were identical to the one above. The audio data seems to start around position 0xed2.

So that gave me another simple thing to try. I took the first 0xc95 bytes from the good file, and byte 0xc95 onwards from the small corrupt file, and concatenated them.

head -c 3221 tinytest_notbroken.m4a > fixtest-1.m4a
tail -c +3222 tinytest_broken.m4a >> fixtest-1.m4a

After this, fixtest-1.m4a played in mplayer. It printed lots of warning messages while it did so, but whatever.

So I did the same thing with the target file, and recovered two seconds of the original audio. I could report to my friend that it started with him saying "um".

This is what I was expecting, since the good recording was only two seconds, and it would presumably have had the length in the header. But now I needed to work out how to lengthen it.

I played around with ffmpeg, but it didn't seem to have an option for "ignore the audio duration stated in the header", so it seemed I would have to fix the header myself.

ftyppm42 seemed like it might be a file type indicator, so I googled that. A few clicks led to the MP4 registration authority, which suggested that the relevant standard was called "ISO/IEC 14496-14". So I then googled "ISO/IEC 14496-14 track header", and found a pdf of the relevant standard. Unfortunately this is a high-context document, I didn't particularly understand it, and it didn't help me very much.

I also found wikipedia on MP4, which pointed me to ISO base media file format, ISO/IEC 14496-12. Google then gave me a pdf of that standard, which was much more detailed, helpful, and lengthy.

I didn't attempt to understand it all. But I searched for "mdat", and shortly after one of the hits, I found the following description of a data structure:

aligned(8) class MovieHeaderBox extends FullBox(‘mvhd’, version, 0) {
    if (version==1) {
        unsigned int(64) creation_time;
        unsigned int(64) modification_time;
        unsigned int(32) timescale;
        unsigned int(64) duration;
    } else { // version==0
        unsigned int(32) creation_time;
        unsigned int(32) modification_time;
        unsigned int(32) timescale;
        unsigned int(32) duration;
    template int(32) rate = 0x00010000; // typically 1.0
    template int(16) volume = 0x0100; // typically, full volume
    const bit(16) reserved = 0;
    const unsigned int(32)[2] reserved = 0;
    template int(32)[9] matrix =
        { 0x00010000,0,0,0,0x00010000,0,0,0,0x40000000 };
        // Unity matrix
    bit(32)[6] pre_defined = 0;
    unsigned int(32) next_track_ID;

Promising! My header contained the following two lines:

00000020  00 00 00 6c 6d 76 68 64  00 00 00 00 d4 6b 4f 0d  |...lmvhd.....kO.|
00000030  d4 6b 4f 0d 00 00 03 e8  00 00 08 17 00 01 00 00  |.kO.............|

That's mvhd from the description, followed by four null bytes, followed by two identical 32-bit ints - the creation and modification times would naturally have been the same - and then two more ints.

timescale was 0x3e8, decimal 1000, which per spec means 1000 time units pass in a second. duration was 0x817, decimal 2071, indicating a track 2.071 seconds long. So presumably I need to edit this value.

What to set it to? As long as it's sufficiently long, it's not important. So I set it to 0x10000817, using emacs (in hexl-mode) as my editor, and tried again. No dice, it still cut off at two seconds.

So I searched "duration" in the same file, and found two more places to edit. One soon after the bytes tkhd, which used the same timescale as the mvhd; and one soon after the bytes mdhd, which had its own timescale. In this case the timescale was 0xac44, decimal 44100.

But adding 0x1000000 to both of these durations didn't help either.

(After one of these three edits - I don't remember which - mplayer started reporting the file as being 74 hours long. But it still cut out after two seconds.)

At this point I was up late and out of ideas. But I realized that I could just ask my friend to record another long track, and then use the header from that to do the job. So I left him instructions, plus some notes on where I'd got to, so someone else could pick up where I'd left off if that didn't work.

Then while I was asleep, someone else came along and fixed it for him before he tried that. I don't know how.

But here are some thoughts on how I might have proceeded, if my friend hadn't been able to record the longer audio.

For one thing, I might have been able to generate my own long file in ffmpeg. But attempting this now, I can't make it work. I concatenate several copies of the good recording, and get a file that starts with these three lines:

00000000  00 00 00 18 66 74 79 70  4d 34 41 20 00 00 02 00  |....ftypM4A ....|
00000010  69 73 6f 6d 69 73 6f 32  00 00 00 08 66 72 65 65  |isomiso2....free|
00000020  00 07 93 5e 6d 64 61 74  01 40 22 80 a3 7f f8 85  |...^mdat.@".....|

It plays fine. But when I try to mix it with the target file, it doesn't play. It complains about the missing moov atom, even though that's also missing in the working concatenated file. I'm not sure what's wrong.

Failing that, I could have recorded the long file myself in the same app my friend used, to spare his attention for things he actually wanted to be doing. (Which, after all, was the reason he was willing to pay someone else to fix the file.) I could also, for curiousity's sake, have recorded another short file, and attempted to find more durations by comparing the headers.

But perhaps the simplest thing would have been to take a completely different approach from the beginning. It turns out that other people have encountered this problem before, and paved the way for those like me. For example, someone on facebook posted this page, which teaches you to fix these files using a piece of software called "faad.exe". More research reveals that this is open source, and even available in portage.

(I also find references to FAAD if I google "fix corrupt m4a file".)

It looks like this was a case of "more haste, less speed". At least the haste was fun, but probably not worth the $40 I might have earned by being thorough.

Posted on 17 December 2016 | Comments

The Sally-Anne Fallacy

I'd like to coin a term. The Sally-Anne fallacy is the mistake of assuming that somone believes something, simply because that thing is true.1

The name comes from the Sally-Anne test, used in developmental psychology to detect theory of mind. Someone who lacks theory of mind will fail the Sally-Anne test, thinking that Sally knows where the marble is. The Sally-Anne fallacy is also a failure of theory of mind.

In internet arguments, this will often come up as part of a chain of reasoning, such as: you think X; X implies Y; therefore you think Y. Or: you support X; X leads to Y; therefore you support Y.2

So for example, we have this complaint about the words "African dialect" used in Age of Ultron. The argument goes: a dialect is a variation on a language, therefore Marvel thinks "African" is a language.

You think "African" has dialects; "has dialects" implies "is a language"; therefore you think "African" is a language.

Or maybe Marvel just doesn't know what a "dialect" is.

This is also a mistake I was pointing at in Fascists and Rakes. You think it's okay to eat tic-tacs; tic-tacs are sentient; therefore you think it's okay to eat sentient things. Versus: you think I should be forbidden from eating tic-tacs; tic-tacs are nonsentient; therefore you think I should be forbidden from eating nonsentient things. No, in both cases the defendant is just wrong about whether tic-tacs are sentient.

Many political conflicts include arguments that look like this. You fight our cause; our cause is the cause of [good thing]; therefore you oppose [good thing]. Sometimes people disagree about what's good, but sometimes they just disagree about how to get there, and think that a cause is harmful to its stated goals. Thus, liberals and libertarians symmetrically accuse each other of not caring about the poor.3

If you want to convince someone to change their mind, it's important to know what they're wrong about. The Sally-Anne fallacy causes us to mistarget our counterarguments, and to mistake potential allies for inevitable enemies.

  1. From the outside, this looks like "simply because you believe that thing". 

  2. Another possible misunderstanding here, is if you agree that X leads to Y and Y is bad, but still think X is worth it. 

  3. Of course, sometimes people will pretend not to believe the obvious truth so that they can further their dastardly ends. But sometimes they're just wrong. And sometimes they'll be right, and the obvious truth will be untrue. 

Posted on 09 April 2016 | Comments

Political Polarization in the US House of Representatives

I've created an interactive graph of historical levels of political polarization in the US House of Representatives. It would be tricky to embed in this blog, so I'm only linking it. Summary:

The x-axis on this graph is based on DW-NOMINATE left-right scores of each member of each U.S. House of Representatives from 1865 to 2015. This uses a member's voting record to measure the direction and extremity of their political views, regardless of party affiliation.

If a member's score on this axis is known, it's possible to predict their vote on any given issue with high confidence, given no other information about the member. Members whose votes are typically left-aligned receive negative scores, while members whose votes are typically right-aligned receive positive scores.

(However, see The Thin Blue Line That Stays Strangely Horizontal, which questions the validity of DW-NOMINATE.)

Background: I made this last year for a Udacity course, "Data Visualization and D3.js". I needed to submit a visualization for marking, and this was my submission. I'm grateful for feedback provided by Udacity and by some of my friends. Without that, the result would certainly have been worse.

The source is available on GitHub, including datasets and some python scripts I used to process them. The README also documents some of the design history.

I'm aware of one bug: in firefox (38.6.1 on linux), the legend appears to display the 5-95 and 25-75 percentile boxes identically. They're implemented as rects with fill-opacity: 0.3: the 25-75 boxes have two of these rects on top of each other. This is also how the paths on the graph itself are colored.

I assume there are other bugs.

Posted on 26 February 2016 | Comments

Officially faceblind

I realized a few years ago that I was at least somewhat faceblind/prosopagnosic. A while back I took an online test out of curiousity, and scored low. They said that if I was in London and interested in further tests, I should leave my email address. A few days ago I went in for those tests, and now I have a PhD student (Katie) also telling me I'm faceblind. Which makes it official, I guess.

Next she wants to run EEGs on me, which should be cool. That will help work out where my brain is going wrong, in the long chain between "photons stimulate nerve endings in my eyeballs" and "I recognize a face" (whatever that means). Also, apparently there's a phenomon which sounds to me like blindsight, where some prosopagnosics' brains are clearly reacting to faces on some level that doesn't reach their consciousness. She wants to learn more about that too.

What follows is discussion of the tests, my scores, and what they mean. I've been given a powerpoint with my scores reported as percentiles, along with absolute scores and average control scores. 2% or lower is counted as "impaired". Percentiles are only given as integers, or as "<1%". On the day, Katie also gave me some numbers in terms of standard deviations (σ). Under a normal distribution, 2.5% would be approximately -2σ, but I'm not sure any of these results will be normally distributed, so I don't know if σ scores really tell me anything.

A note: if you think you might be faceblind, and you'd be interested in getting more detailed tests, it might be a good idea not to read the below. I expect it wouldn't significantly bias the results if you did, except for one bit that I've rot13ed. But I don't trust myself to make that call. If you're in London, you can take the above test like me and see what happens. Otherwise I'm not sure how you'd go about getting more tests.

The object/face recognition tests were "memorise these things, then we show you a sequence of things and you have to say if each of these things was a thing in the first set". The things were houses, cars, horses, and bald women's faces. I was bad at all of these: 4% for cars, 2% for houses, and <1% for horses and women. (Average score was higher for women than horses, and my score was higher for horses than women, so I'm worse at women than horses. I think Katie told me I was somewhere between -5σ and -6σ for women. Under normality, -5σ is one in thirty million, but this is clearly not normal.) So it seems I have some level of general object agnosia, but more specific prosopagnosia on top of that.

I was 11% for reading emotions from eyes, which is a point for Team Phil Does Not Have Aspergers (some of my friends are divided about that). In fact, the average score is 26 and I scored 23, and there were a few cases where I said an answer, then thought "wait no it's this" and didn't say anything because I wasn't sure if I should. (I was speaking my answer and Katie was recording it. I had to choose from four emotions, so I'm not sure why this wasn't recorded by a computer like most of the other tests.) So plausibly I'm actually above 11%.

I was <1% at famous face recognition, recognising five out of fifty that I'd been exposed to, out of sixty in total. (I got Jvyy Fzvgu, Uneevfba Sbeq, Dhrra Ryvmnorgu, Ebova Jvyyvnzf, and surprisingly Ovyy Pyvagba.) It seems that controls tend to get above 40, so even counting that "exposed to" is vague, I did really badly at this. I think Katie said I was -9σ, which would be one in 10^19 under normality.

I'm <1% at the Cambridge Memory Test for Faces, which is the one I linked above. I actually scored worse in the lab than online. (47% versus 58%, IIRC, with a control average of 80%, and 60% indicating impairment. But the lab score I've been given is 34 against control of 58, so it's clearly been adjusted.) There could be any number of reasons for this, including "chance". But when I took it online, I often thought that one of the faces looked a little like Matt Damon, and chose that one. I like to think that "mistaking people for Matt Damon" is the way forward in face recognition.

I was somewhat okay at half of the Cambridge Face Perception Test. In this one, they showed me a face at an angle, and below it the same face face-on, six times, with varying degrees of modification. I had to order them according to how similar each was to the original face, within a minute. For half the tests, the faces were all upside down. For all of the tests, they all looked incredibly similar and my instinctive reaction was WTF.

On the upright test, I got <1%. On the inverted test, I got 7%. One strategy I used a few times was to focus on the lips, specifically on the size of the dip in the bow. I just ordered them according to that. I guess it helps, but I found it a lot easier for inverted faces.

Doing better at inverted would seem to suggest that I'm doing some kind of holistic face processing that goes wrong and blocks off later avenues for perception. Buuut, objectively I scored worse on the inverted faces, just not as much worse as average, so I'm not sure if this does suggest that. (And I'm not sure it is "objectively" - if all the faces had been assigned to the other condition, would my scores have changed?)

Hypothetically, high scores on both tests could indicate my problem is with memory, not initial perception. The low score here doesn't mean I don't have a problem with memory, but it does seem to hint that I do have a problem with initial perception. And I suspect the famous faces test points at me also having a memory problem.

Posted on 19 January 2016 | Comments

A Scream of Swifties

Inspired by Slate Star Codex

Explaining a joke is like dissecting a frog: it's one way to learn about frogs. If you want me to explain any of these, ask, and I will explain without making fun of you.

"I hear someone playing a triangle in the corridor," said Tom haltingly.

"We've got to overturn every last insect in this garden," said Tom flippantly.

"Goose feathers are annoyingly fragile," said Tom, breaking down.

"Anastasia gives me pins and needles," said Christian gratingly.

"I miss my submissive," René opined.

"I didn't do it, and nor did any of my siblings," Tom insisted.

"It's not much paint, it won't hurt to run your tongue up it," said Tom metallically.

"I'm so sick, even my flu has the flu," said Tom metallurgically.

"It's pitch black and I can hear a monster doing arithmetic," said Tom gruesomely.

"Man City don't hold a candle to the real heros of Manchester," said Tom manually.

"I just bought Manchester United," said Tom virtuously.

"Lancelot told me I was his favourite!" said Tom, surprised.

"I don't think this tube of semen is entirely straight," said the incumbent.

"I can fit inside my GameCube!" said Tom inconsolably.

"In a former life, I was a priest in pre-Columbian Peru," said Tom inconsolably.

"I need a name for my squid-and-flatfish restaurant," said Tom inconsolably.

"I make a living as the red tellytubby," said Tom, apropos.

"I'm doing crunches so I can get a six-pack," said Treebeard absently.

"I'm half-fish and made of lithium," said Treebeard limerently.

"Figure three plots counts of close-ups on male versus female genitals," said Tom pornographically.

"My breasts don't have enough room in this corset," said Victoria, double depressed.

"Bring me the head of my enemy," said Emacs vicariously.

"I have affirming the consequent, base rate neglect, and now also ad hominem," said Tom, getting cocky.

"We lost the treaty, so we had to ratify it again," said Tom, resigned.

"I'm in the group supporting Shiva's wife," said Tom with satisfaction.

Posted on 01 January 2016 | Comments

Unix pro tip: Print and pipe

Sometimes you (or at least, I) want to run a command for its output, but also want to pipe it through another command. For example, see the results of a find but also count how many hits it got. I've sometimes lamented that there's no easy way to do this. But the other day I had a flash of insight and figured out how:

find . | tee /dev/stderr | wc -l
proc1 | tee /dev/stderr | proc2    # general case

(I'm pretty proud of this. I don't know if it's original to me, but I discovered it independently even if not.)

tee will print the output of proc1 to both stdout and stderr. stderr goes to the terminal and stdout goes to proc2.

You can make it more convenient with an alias:

alias terr='tee /dev/stderr | '
find . | terr wc -l

(Putting a pipe in an alias seems to work in both zsh and bash.)

If you want to concatenate the streams, to pipe them to another process, you can use subshells:

proc1 | ( terr proc2 ) 2>&1 | proc3

but note that stderr output from proc2 will also get sent to proc3, unless you send it somewhere else. I haven't yet thought of a use for this.

There are potential issues with buffering here. I'm not aware that tee makes any promises1 about which order it writes the streams in. It's going to be interlacing them while it writes, so that it doesn't need to keep a whole copy in memory. So (if the input is large enough) proc2 will be receiving input before it's finished being written to stderr, and might start writing output, and then the output streams can get interlaced.

For some values of proc2, commands which start printing before they've finished reading, this is inevitable. But I think useful proc2s are likely to be aggregators2 - by which I mean, commands which can't do anything until they've finished reading all their input3. In my tests so far, those have been safe, but that doesn't prove much.

We can do a more reliable check with strace:

find . | strace tee /dev/stderr | wc -l

By the looks of things, tee will read into a buffer, then write it to stdout (the pipe), then write it to the specified target (stderr, which goes to the terminal), and repeat to exhaustion. But the important thing is, it doesn't close any file descriptors until it's finished writing everything, and then it closes the target before it closes stdout. If this is consistent amongst tee implementations - and it seems sensible4 - then aggregators almost definitely won't interlace their output with the output from proc1. I don't want to say "definitely", because there might be other stuff going on that I haven't accounted for. But at any rate, tee will finish writing before the aggregator starts.

Anyway, I see this as being the sort of thing that you're likely use manually, not in an automated process. So if the output does get interlaced a little, it's probably not that big a deal.

  1. Note how I didn't say guaranTEEs? You're welcome. 

  2. In fact, wc is the only proc2 I can currently think of, that I expect to be regularly useful. (And then just piping through cat -n does something similar, but it's not ideal.) numsum or perhaps some other tools from num-utils could conceivably be handy from time to time. 

  3. You can turn a non-aggregator into an aggregator with moreutils' sponge command, which is like cat but reads all its input before writing anything. 

  4. They could reasonably close their outputs in either order, which might make a difference. But I don't think it's usual for programs to explicitly close their std* streams, that gets taken care of by the surrounding infrastructure. So if tee explicitly closes its target, I'd expect that to happen before stdout gets closed. 

Posted on 07 October 2015 | Comments

London Cycle Hires and Weather

Preface: I wrote this report for Udacity's "Explore and Summarize Data" module. The structure is kind of strange for a blog post, but I'm submitting the finished report essentially unchanged.

One thing I will note. I find that the cycle hire usage doesn't change much throughout the year. Shortly after submitting, I read this article which finds that it does vary quite a lot. I'm inclined to trust that result more. It's intuitively sensible, and it looks directly at the number of rides taken, instead of looking at a proxy like I do.

Take this as evidence for how much to trust my other results.

My goal is to investigate usage of the London cycle hire scheme, and in particular how it varies with the weather. I'm running an analysis from July 2013 to June 2014.

I'm using two data sets here. Daily weather data comes from Weather Underground, using the weather station at London Heathrow airport.

(London City Airport is closer to the bike stations that I use, but the data from that airport reports 0 precipitation on every single day. The data from Heathrow seems to be more complete, and I expect it to be almost as relevant.)

I collected the cycle hire data myself, over the course of the year, by downloading CSV files from an unofficial API which now appears to be defunct. It has a granularity of about ten minutes. That's about 50,000 entries per docking station for the year, so for this analysis, I'm only using the data from four docking stations near my office.

All data and source code used for this project can be found in the git repository.

Exploring the weather data


plot of chunk temp.1v plot of chunk temp.1v

These variables measure the minimum, average, and maximum daily temperatures. The graphs all look similar, and overlap a lot. The shape is a little surprising, as I didn't expect the density graphs to be bimodal. It could potentially be caused by significant differences between summer and winter, with an abrupt shift between the two.


plot of chunk rain.1v plot of chunk rain.1v

According to the rain column, There are over 225 rainy days and only about 125 non-rainy days. But by far the most common bin for precip.mm is the leftmost one. Table of values of precip.mm:

##     0  0.25  0.51  0.76  1.02  2.03  3.05  4.06  5.08   6.1  7.11  7.87 
##   207    35    20     9    17    22    12     8    12     4     4     2 
##  8.89  9.91 10.92 11.94 13.97 
##     3     5     2     1     2

Although more than half of observations have rain == TRUE, more than half of them also have precip.mm == 0, which needs more investigation. Rainfall as measured by precip.mm versus as measured by rain:

plot of chunk rain.precip.2v plot of chunk rain.precip.2v

The two measures don't always agree. Sometimes rain is false but precip.mm is nonzero; and often rain is true but precip.mm is zero. Neither of those is surprising individually: if rain is only counted when the rainfall exceeds a certain threshold, then that threshold could be large (giving false/nonzero) or small (giving true/zero). But the combination suggests that that isn't what's going on, and I don't know what is.

This table counts the anomalies by turning precip.mm into a boolean zero/nonzero (false/true) and comparing it to rain:

##         FALSE TRUE
##   FALSE   119    9
##   TRUE     88  149

There are 88 instances of true/zero, 9 instances of false/nonzero, but the cases where they agree are the most common.

I find precip.mm to me more plausible here. I feel like fewer than half of days are rainy. This website agrees with me, saying that on average, 164 days out of the year are rainy (rain - 237, precip.mm - 158).


plot of chunk wind.1v plot of chunk wind.1v

These three measures of wind speed are all averages. wind is simply the average wind speed over a day. wind.max is the daily maximum of the average wind speed over a short time period (I think one minute). gust is the same thing, but with a shorter time period (I think 14 seconds).

Unlike with temperature, the three measures look different. All are right-skewed, although gust looks less so. There are several outliers (the isolated points on the box plots), and the quartiles don't overlap. The minimum gust speed (about 24) is almost as high as the median wind.max.

Exploring the bike data

Time between updates

plot of chunk dt.1v

There are a few outliers here. Not all the lines are visible due to rendering artifacts, but above 5000, we only have five entries:

##                     name        prev.updated             updated
## 46779    Earnshaw Street 2013-10-03 08:50:23 2013-10-13 09:20:28
## 46899  Southampton Place 2013-10-03 08:50:22 2013-10-13 09:20:27
## 46918       High Holborn 2013-10-03 08:50:24 2013-10-13 09:20:30
## 47049         Bury Place 2013-10-03 08:50:26 2013-10-13 09:20:32
## 175705 Southampton Place 2014-06-20 17:36:06 2014-06-30 08:30:03

The first four of these happened when my collection script broke and I failed to realize it. The other occurred when Southampton Place was taken out of service temporarily.

Let's zoom in on the lower ones:

plot of chunk dt.1v.left

There are several instances where the time between updates is unusually large, on the order of hours or days. The times of entries with between 2000 and 5000 minutes between updates:

##                     name        prev.updated             updated
## 32650       High Holborn 2013-08-31 15:10:07 2013-09-02 12:30:05
## 32660         Bury Place 2013-08-31 15:10:08 2013-09-02 12:30:07
## 32672  Southampton Place 2013-08-31 15:10:05 2013-09-02 12:30:04
## 32674    Earnshaw Street 2013-08-31 15:10:06 2013-09-02 12:30:05
## 38546       High Holborn 2013-09-14 22:39:00 2013-09-16 08:24:22
## 38719         Bury Place 2013-09-14 22:39:02 2013-09-16 08:24:23
## 38734  Southampton Place 2013-09-14 22:38:58 2013-09-16 08:24:20
## 38735    Earnshaw Street 2013-09-14 22:38:59 2013-09-16 08:24:21
## 84066         Bury Place 2013-12-27 15:40:08 2013-12-29 23:10:14
## 84069       High Holborn 2013-12-27 15:40:06 2013-12-29 23:10:13
## 84073  Southampton Place 2013-12-27 15:40:05 2013-12-29 23:10:11
## 84078    Earnshaw Street 2013-12-27 15:40:05 2013-12-29 23:10:12
## 84186    Earnshaw Street 2013-12-30 00:10:05 2013-12-31 13:10:07
## 84202       High Holborn 2013-12-30 00:10:06 2013-12-31 13:10:09
## 84269  Southampton Place 2013-12-30 00:10:05 2013-12-31 13:10:06
## 84330         Bury Place 2013-12-30 00:10:07 2013-12-31 13:10:11
## 89443  Southampton Place 2014-01-12 20:20:10 2014-01-14 18:40:07
## 89459       High Holborn 2014-01-12 20:20:13 2014-01-14 18:40:11
## 89467         Bury Place 2014-01-12 20:20:14 2014-01-14 18:40:16
## 89524    Earnshaw Street 2014-01-12 20:20:11 2014-01-14 18:40:09
## 121381   Earnshaw Street 2014-03-15 14:50:06 2014-03-17 01:50:04
## 121398      High Holborn 2014-03-15 14:50:07 2014-03-17 01:50:05
## 121444        Bury Place 2014-03-15 14:50:10 2014-03-17 01:50:07
## 121591 Southampton Place 2014-03-15 14:50:05 2014-03-17 01:50:04
## 133765      High Holborn 2014-04-11 16:59:37 2014-04-14 01:29:07
## 133900   Earnshaw Street 2014-04-11 16:59:36 2014-04-14 01:29:05
## 133961        Bury Place 2014-04-11 16:59:38 2014-04-14 01:29:08
## 134027 Southampton Place 2014-04-11 16:59:35 2014-04-14 01:29:05

It looks like these happened to all stations simultaneously, suggesting problems with either my collection script or the API, rather than problems with individual locations.

Entries with less than 60 minutes between updates, no longer on a log scale:

plot of chunk dt.1v.60

In the vast majority of cases, updates are approximately ten minutes apart. This encourages me to take a subset of the data (bikes.all -> bikes), considering only entries with d.updated less than 15 minutes. This eliminates many outliers in future graphs.

Date and time of update

plot of chunk date.time.1v plot of chunk date.time.1v

All times of day are approximately equally represented to within ten minutes, which is good. There are five noticeable troughs preceeded by spikes, but they probably don't signify much. Dates are a lot less uniform, however. Even apart from the ten-day period where my script was broken, many days have significantly fewer updates than typical, and some have none at all.

Number of days spent with a given number of active docks

plot of chunk ndocks.time.2v

It was common for every station to report less than a full complement of docks. At least two had a full complement for less than half the time (High Holborn and Bury place are unclear in that respect). This isn't surprising, since a bike reported as defective will be locked in, using up a slot but not being available for hire.

Journeys taken throughout the year

plot of chunk date.journeys.2v

The time of year makes very little difference to the number of rides. There appears to be a slight sinusoidal relationship, but it's very weak. (I didn't do a PMCC test because that assumes that any relationship is linear, which we would naively expect not to be the case here, and also doesn't look true from the graph.)

Journeys by weekday

plot of chunk weekday.journeys.2v plot of chunk weekday.journeys.2v plot of chunk weekday.journeys.2v

Fewer journeys are taken on weekends. The median number of bikes available doesn't change much throughout the week (5 on monday and friday, 4 on other days), but the distribution does. Saturday and Sunday have noticeably different shapes to the others. They have a single peak, while weekdays are somewhat bimodal, with a small peak where the station is full (probably when people are arriving at work).

(Since the stations have different numbers of docks, I did a graph of fullness rather than of number of bikes. The density plot doesn't show peaks exactly at 0 and 1 because of how the density window works, but histograms of num.bikes and num.spaces show that that's where they are. It would be difficult to use a histogram for this graph because there's no sensible binwidth.)

Change in number of bikes between updates

plot of chunk bikes.prevbikes.name.mv

##  Pearson's product-moment correlation
## data:  bikes$num.bikes and bikes$prev.num.bikes
## t = 2466.8, df = 173250, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9859301 0.9861908
## sample estimates:
##      cor 
## 0.986061

There's very strong correlation between the number of bikes in adjacent entries. This is as expected, especially given what we saw about d.num.bikes previously. The colors here don't show any particular station-dependent trends.

Number of bikes at any given time