This is a mini-post – I do have a proper one for this week, but I realised that I wanted to link to something I hadn’t actually written for one of the asides. It’s another in my series of “statistical inference by one of its consumers” …
At various points in the past, I’ve noted that the classification of statistical errors into “Type 1” (false positive) and “Type 2” (false negative) is incomplete. There are also “Type 3” errors (getting the wrong answer because you asked the wrong question) and even “Type 4” (a correct answer which isn’t the one your boss wanted). Note that even this classification isn’t complete, and that the “Type 4” error isn’t just a joke – in a lot of cases, a Type 4 error can be a genuine error, demonstrating that there is something wrong with the underlying data collection process and accounting system, so that a correct statistical process can nonetheless deliver something that an experienced boss familiar with “ground truth” can spot as absurd.
In this post, I extend the plain man’s guide to types of errors to the concept of statistical confidence and goodness of fit. This is a subject about which people talk hilariously unrigorously (in fact, the more rigorous the maths, the more cavalier statistics bods tend to be in talking about what they have or haven’t proved). In my view, the way to think about all statistical work is that it’s talking about a three-way relationship between a model (in the sense of a set of assumptions), some data, and a statement of interest.
The stuff that goes on in the computer program, or the equations that the ordinary reader skips over to get to the conclusion, are all about setting out the bounds of what kinds of interesting statements are compatible with the combination of the data and the model.
If you’re only interested in setting very wide bounds on the statements of interest, you can use a very general model. (At the extreme, if you want to know whether it’s possible for a given quantity to exceed 10 and one of the observations is 15, you hardly need a model at all).
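To make that concrete, here’s a toy sketch (in Python, with made-up numbers – nothing from any real dataset) of how little a wide statement asks of the model compared with a narrow one:

```python
import numpy as np

# Hypothetical observations; the numbers are purely illustrative.
observations = np.array([8.2, 9.1, 15.0, 7.4, 10.3])

# A very wide statement needs almost no model: can the quantity exceed 10?
# One observation of 15 settles it, whatever distribution generated the data.
print("Can exceed 10:", observations.max() > 10)

# A narrow statement (a tight interval for the mean) leans on the model.
# Here we assume i.i.d. sampling and a normal approximation --
# swap in a different model and the interval moves.
mean = observations.mean()
se = observations.std(ddof=1) / np.sqrt(len(observations))
print(f"Approx. 95% interval for the mean: [{mean - 1.96*se:.1f}, {mean + 1.96*se:.1f}]")
```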
Which is important to bear in mind; often it’s worth remembering that you don’t always need a piece of research to prove anything like as much as it’s trying to prove. Back in the 2000s, some readers may remember I got into a lot of arguments about the Lancet studies of civilian casualties from the Iraq War. One of the points I kept trying to make was that one shouldn’t spend time arguing over the exact estimated number; the important thing to realise was that given the data, it was almost impossible to conceive of any model which might conclude that the excess death rate was negative (ie, that the post-war environment was better than under Saddam, something which might have been considered a reasonable ex ante benchmark to have reached after three years), and hardly any possible model in which the number of excess deaths was less than multiple tens of thousands.
But unfortunately, you often do need something to be tied down to quite narrow bounds – you either need a point estimate of a number of interest, or you need a reasonable degree of confidence in whether some hypothesis is true or not.
When that happens, you’re basically dependent on the model-data combination. Either the model has to be really convincing from first principles (such that if it’s even broadly consistent with the data you’d be inclined to accept it), or the data have to be really well-aligned (such that you can be fairly sure that the result would still be there under a variety of alternative models).
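A rough way of checking whether you’re in the “data are really well-aligned” situation is to run the same question past a couple of deliberately different models and see whether the answer survives. A toy sketch, again with purely illustrative numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical paired differences (e.g. "after minus before"); illustrative only.
diffs = np.array([0.4, 1.2, -0.3, 0.8, 0.9, 0.1, 1.5, -0.2, 0.6, 0.7])

# Model 1: normal-theory interval for the mean difference.
m, se = diffs.mean(), diffs.std(ddof=1) / np.sqrt(len(diffs))
normal_ci = (m - 1.96 * se, m + 1.96 * se)

# Model 2: a bootstrap percentile interval, which drops the normality assumption
# but still assumes the observations are exchangeable draws from one population.
boot_means = np.array([rng.choice(diffs, size=len(diffs), replace=True).mean()
                       for _ in range(5000)])
boot_ci = tuple(np.percentile(boot_means, [2.5, 97.5]))

# If the sign of the effect survives both models, the data are doing the work;
# if it flips depending on the assumptions, you're leaning on the model.
print("Normal-theory CI:", normal_ci)
print("Bootstrap CI:   ", boot_ci)
```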
And you don’t usually get that. A reasonable but not unarguable model with a reasonable but not spectacular fit is the normal outcome from statistical research when you commission it yourself or look it up because you’re interested in the answer. (A really good fit of the data to a reasonable model is the normal outcome for something published in a scientific journal, but that’s a whole nother issue).
So while the general classification of statistical significance is “rejects the null hypothesis” or “fails to reject the null”, I’d argue that the most normal degree of significance is “makes you go hmm”. That’s my name for a statistical result which makes you feel a bit better if you were already inclined to believe the conclusion, but doesn’t overly upset you if you weren’t.
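For what it’s worth, a “makes you go hmm” result in code might look something like this – an effect pointing the “right” way, with an interval wide enough that nobody should change their mind over it (hypothetical simulated data, not anything from a real study):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical sample: a genuinely small effect in a noisy, modest-sized sample.
sample = rng.normal(loc=0.3, scale=1.0, size=25)

# One-sample t-test against a null effect of zero.
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
m = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))

# A modest p-value with a point estimate in the expected direction: not enough
# to reject the null at conventional levels, but it makes you go hmm.
print(f"estimate = {m:.2f}, p = {p_value:.3f}, "
      f"95% CI = [{m - 1.96*se:.2f}, {m + 1.96*se:.2f}]")
```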