Call me a killjoy. Say that I don't like to have fun. That's fine with me, but I couldn't get behind the stupid Paul the Octopus movement (even though he has an iPhone app). Frankly, the obsession with Paul was just the latest demonstration of how little the worldwide population understands statistics and how readily it looks for meaning in expected outcomes.
One of the core concepts in statistics is that for a model, data set, or set of predictions to be significant, it must differ substantially from a baseline assumption (the null hypothesis). To pull a quote from an earlier blog post:
"Six of the seven past World Cup winners were governed by dictators in the last 80 years, England being the exception. That means around 86% of World Cup winners have had dictators during the World Cup’s eight-decade run. Fine. But that number is only interesting if it’s significantly higher than the percentage of all competing countries to have had dictators during the same period; otherwise it’s just an artifact of the historical reality that most countries have had dictators in the last 80 years."
Paul the Octopus and his followers suffer from the same problem. Let me explain...
In this case, a rational person looks at Paul and realizes that he should be no better than flipping a coin: he's got a 50/50 chance of getting things right, as he has no real knowledge of the match or its participants. The one caveat is that we're assuming no bias is being introduced by his handlers, the arrangement of the boxes, where he is in the tank when the boxes are dropped, or any of another dozen or so factors that might influence which box he opens. For the sake of the exercise, let's assume for now that there is no bias and come back to it later.
If we start with the assumption that Paul is no better than a coin flip, we can evaluate how good he was at picking the winners and whether or not he was better than what we could expect from flipping a coin.
Many people have focused on the fact that he got eight match outcomes correct. The chance of doing this by coin flip is 0.5^8, or about 0.4%. While impressive, the statistics suggest this is only marginally better than what a run of coin flips could produce.
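For anyone who wants to check that arithmetic, here is a minimal Python sketch. It assumes each pick is an independent 50/50 guess, which is the null hypothesis above; the function name `prob_exactly` is mine, not something from Minitab.

```python
from math import comb

def prob_exactly(k: int, n: int, p: float = 0.5) -> float:
    """Probability of exactly k correct picks out of n independent picks,
    each correct with probability p (binomial distribution)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Paul's record: 8 correct out of 8, assuming every pick is a fair coin flip.
all_eight = prob_exactly(8, 8)
print(f"8 of 8 by coin flip: {all_eight:.4f}")  # 0.0039, i.e. about 0.4%
```

With k equal to n the formula collapses to 0.5^8, the same number quoted above.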
Statisticians have the ability to determine how powerful a data set is compared to a known phenomenon. They can also do the inverse: define a desired power and the difference from the null hypothesis they would like to detect, then work out what trend must be observed in the data to demonstrate that relationship. This can be done many ways, but I used Minitab's "Power & Sample Size" function.
When performing such power and sample size analyses, statisticians often use a power of 0.9. This means there is no more than a 10% chance of missing a statistically significant shift in the data set when one really is present. I used this assumption when looking at Paul's picks. I took that power and the fact that he picked eight times, and asked Minitab what proportion of his picks had to be correct to prove he was better than the average outcome of picking winners by coin flip (i.e. 0.50). The outcome of that analysis is in Figure 1.
Figure 1 shows the power one can derive from an observed population based upon the proportion of correct predictions in a sample of eight. In the case of Paul's eight guesses, he needed a greater than 90% success rate (the red dot) to prove that he was better than flipping a coin. That is to say that with such a small sample size, seeing a coin accurately predict the outcome of 7 of 8 matches is not outside reasonable statistical expectations. Paul had to nail all eight predictions to be better than a coin flip. One screw-up and he was meaningless.
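If you don't have Minitab handy, the same kind of calculation can be roughed out in Python. This is only a sketch under assumptions of mine: a one-sided test (better than 0.50), a significance level of 0.05, and the usual normal approximation for a proportion. I don't know exactly which settings or formula Minitab applied, so don't expect the numbers to match Figure 1 to the decimal. The function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1

def approx_power(true_rate, n, p0=0.5, alpha=0.05):
    """Approximate power of a one-sided test of H0: p = p0 against p > p0
    when the true success rate is true_rate (normal approximation)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    se_null = (p0 * (1 - p0) / n) ** 0.5
    se_true = (true_rate * (1 - true_rate) / n) ** 0.5
    return STD_NORMAL.cdf((true_rate - p0 - z_alpha * se_null) / se_true)

def required_rate(n, target_power=0.9, p0=0.5, alpha=0.05):
    """Bisect for the true success rate at which the test reaches target_power."""
    lo, hi = p0 + 1e-9, 1 - 1e-9
    for _ in range(100):
        mid = (lo + hi) / 2
        if approx_power(mid, n, p0, alpha) < target_power:
            lo = mid  # not enough power yet, need a higher rate
        else:
            hi = mid
    return hi

rate_for_eight = required_rate(8)
print(f"Success rate needed with 8 picks: {rate_for_eight:.3f}")
```

Under these assumptions the answer for eight picks lands a little above 0.9, which is in the same neighborhood as the red dot in Figure 1.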
My analysis and argument would have been much easier if Paul had screwed up one of the final match predictions. (Side note: I would also have been happier, because it would have meant the Dutch won the World Cup.) Nevertheless, examining what happens when we increase the sample size, and the precarious position Paul was in, provides a great example of why statisticians must see large sample sizes to feel comfortable with their analyses.
Let's say Paul hadn't retired from making soccer match predictions, and he could live long enough to see several World Cups. He'd have to restrict himself to predicting knockout round matches, as he'd surely get burned by the ties in the group stage. This would give him 16 matches every four years for which he could provide predictions. Running the "Power & Sample Size" function in multiples of 16 yields Figures 2 and 3.
One can see that as sample size increases, the required proportion of correct guesses drops. If Paul had provided predictions for all 16 of the 2010 World Cup knockout matches, he could have made only two incorrect predictions before he was no better than a coin flip. After two World Cups, he could get only 8 wrong (out of 32) and still be better than flipping a coin. Twelve World Cups in, he would have to maintain better than a 60% success rate to be better than flipping a coin. My money would have been on him screwing up more than two picks in the eleven 2010 World Cup knockout matches he didn't predict.
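To see the sample size effect from the other direction, here is a rough Python sketch that fixes a success rate and asks how often a test would actually flag it as better than a coin flip. The assumptions are mine: a purely hypothetical octopus who is truly right 60% of the time, a one-sided exact binomial test at the 0.05 level (not the Minitab calculation behind Figures 2 and 3), and 16 knockout matches per World Cup.

```python
from math import comb

def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def critical_count(n, p0=0.5, alpha=0.05):
    """Smallest number of correct picks that a p0 coin reaches with probability <= alpha."""
    for k in range(n + 1):
        if upper_tail(k, n, p0) <= alpha:
            return k
    return n + 1  # no outcome is extreme enough to reject the coin-flip hypothesis

def exact_power(n, true_rate, p0=0.5, alpha=0.05):
    """Chance that a predictor with the given true_rate clears the critical count."""
    return upper_tail(critical_count(n, p0, alpha), n, true_rate)

HYPOTHETICAL_RATE = 0.6  # assumed true skill, for illustration only
power_by_matches = {16 * cups: exact_power(16 * cups, HYPOTHETICAL_RATE)
                    for cups in (1, 2, 4, 8, 12)}
for n, power in power_by_matches.items():
    print(f"{n:3d} matches: power = {power:.2f}")
```

With only one tournament of picks, a genuinely 60% octopus would rarely be distinguishable from a coin; it takes something like a dozen World Cups before the test catches him most of the time.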
Beyond Paul's followers essentially having faith in nothing more than a coin flip, there's the question of bias in Paul's predictions. If you go back and watch the videos, you see that he unquestionably favors the box on the right no matter where he is in the tank. Others have theorized that he favors reds and yellows. Whatever the bias, it would have been discovered if Paul's owners had done a proper measurement system analysis (MSA). An MSA helps the statistician understand how much of the variation in observations comes from the measurement system itself versus actual variation in the measured quantity. This process helps identify and quantify bias, which is especially important with inferred measurement systems like Paul's predictions. I am sure that if they had done an MSA they could have discovered where the bias was coming from, if there was one.
Paul's fame forced me to tear him down. He's earned my BS Stat of the Day. While my dislike for him may be a bit much for some of my readers, it apparently pales in comparison to the death threats the whack job leader of Iran has issued against him. This is soccer and statistics, not a geopolitical power struggle.