Clearly, my problem was with a "garbage model".
The other night I was checking a few figures for an upcoming post that were based upon my binary logistic regression (BLR) models built from EPL match data, and I saw a number of counterintuitive trends. That got me to check my data for a fourth time, and sure enough... I had fat-fingered a formula! It turns out that formula was used throughout the data set, and thus all of my analysis using DogFace's data had been using incorrect numbers. In terms of my blog's header quote, my model was really wrong and wasn't really useful. My post-match analysis of Arsenal/Blackburn and my quantification of Phil Dowd's officiating of Arsenal's matches were now both in question. I had to re-run the analysis, which totaled several hours of statistical work and about twice as long checking my numbers.
It's a bit frustrating, as my blog relies on the accuracy of my numbers to drive its content. I am very systematic in making sure they are right, and this is the first time I have found such an error. However, more important than being right the first time is correcting mistakes when I find them. This post is such a correction.
The Impact of Cards and Phil Dowd on Arsenal
After re-crunching the numbers, I found that the significant factors in the binary logistic regressions (BLR) turned out to be a bit better. My erroneous calculations had resulted in my elimination of fantasy league points for cards from consideration in the BLR. This was unfortunate because I had already done several posts on officiating at Arsenal's matches using that metric, and had hoped to re-use it here. It turns out that when the numbers are correctly calculated, such a term is statistically significant. Note that I have used Yahoo's Fantasy Premier League scoring system, which gives 3 points for a yellow card and 6 points for a red card. All other terms from the BLR in the previous post - venues and differentials of shots, shots-on-goal, corners, and fouls - were included in the analysis. The graphs below show the true relationship between those terms and fantasy point differential.
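As a minimal sketch of the card metric described above, the function below turns yellow and red card counts into fantasy points using the 3 and 6 point values quoted in this post. The function names and the sign convention for the differential (own points minus the opponent's) are assumptions for illustration, and may not match the convention used in the actual models.

```python
YELLOW_POINTS = 3  # per yellow card, as quoted in the post
RED_POINTS = 6     # per red card, as quoted in the post


def card_points(yellows: int, reds: int) -> int:
    """Fantasy points a team accumulates from its cards in one match."""
    return YELLOW_POINTS * yellows + RED_POINTS * reds


def card_point_differential(own: tuple[int, int], opponent: tuple[int, int]) -> int:
    """Own card points minus the opponent's (assumed sign convention).

    Each argument is a (yellows, reds) pair.
    """
    return card_points(*own) - card_points(*opponent)


if __name__ == "__main__":
    # One yellow for one side versus two yellows and a red for the other.
    print(card_point_differential((1, 0), (2, 1)))  # 3 - 12 = -9
```

Folding both card types into one number is what lets a single BLR term stand in for yellow and red cards together.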
The updated models show that Arsenal is less impacted by cards at home than the average team in the Premier League, while away from home they are impacted almost exactly as much as the average team. Quite a different conclusion from my previous post, which relied on erroneous data! However, a number of the general conclusions regarding their overall odds didn't change much from the original post with bad data, especially the finding that their best away odds don't even match their odds in a home match where they experienced a fantasy point deficit of 5 points.
What about the effect of referees on Arsenal's matches? The plots below show the impact each referee has on the odds that Arsenal wins a match, compared to the odds of Arsenal winning had the referee handed out the average number of fouls and cards Arsenal saw over the five-year period. Only data from the 2006/07 through 2009/10 seasons was used, as those were the four years in which each of the eight referees had officiated at least one match.
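The referee comparison can be sketched roughly as follows: evaluate the logistic model once with a referee's typical foul and card differentials and once with the long-run averages, then take the difference. Everything here is an assumption for illustration: the function names, the term names, and especially the coefficients and inputs, which are invented placeholders rather than values from my fitted models.

```python
import math


def win_probability(intercept: float, coefs: dict[str, float], x: dict[str, float]) -> float:
    """Logistic model: P(win) = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = intercept + sum(coefs[name] * x[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-z))


def referee_effect(intercept: float, coefs: dict[str, float],
                   average_match: dict[str, float],
                   referee_match: dict[str, float]) -> float:
    """Change in P(win) when a referee's foul and card differentials
    replace the long-run averages, with the same model in both cases."""
    return (win_probability(intercept, coefs, referee_match)
            - win_probability(intercept, coefs, average_match))


if __name__ == "__main__":
    # Placeholder numbers, invented for illustration only.
    intercept = 0.0
    coefs = {"fouls": 0.0, "card_points": -0.1}
    average_match = {"fouls": 0.0, "card_points": 0.0}
    referee_match = {"fouls": 2.0, "card_points": 3.0}
    effect = referee_effect(intercept, coefs, average_match, referee_match)
    print(f"change in win probability: {effect:+.3f}")
```

A negative result would correspond to a referee whose typical match costs the team some of its chance of winning relative to an average-officiated match.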
Again, we see Phil Dowd lead the pack in odds differential penalty, although it is smaller than originally estimated (3% now vs. 4% with erroneous data). Webb's and Halsey's numbers round out those who have the greatest impact on Arsenal's odds of winning. These corrected numbers will become more valuable in my next post on this topic, where I will compare the bias of Dowd, Webb, and Halsey to their records officiating other clubs' matches.
Finally, there's the small matter of Phil Dowd's officiating at Arsenal's recent match against Blackburn. We Gooners still can't blame the loss on Phil Dowd - his officiating certainly helped the Gunners' odds. But their play didn't help nearly as much as my original post indicated, and in fact played into Blackburn's statistics a good bit. The table below summarizes the match statistics, and shows the likelihood of winning by each club.
That's not a typo - Blackburn did actually have higher odds of winning the match based upon the way the statistical models work. Let me explain:
- Arsenal's coefficient for the constant term in the BLR is significant, and it is negative. This means that before anything else is known about Arsenal's match, they start out with less than a 50% chance of winning. This is not the case with Blackburn, whose constant term for their BLR is non-significant and is thus treated as a coefficient of zero, with no effect on their odds of winning.
- Arsenal's odds certainly increase playing at home and having a greater number of shots on goal, but their BLR coefficient for shots is not statistically significant, so there is no significant contribution to their odds of winning from their 15-shot advantage.
- Blackburn, on the other hand, does have a statistically significant coefficient for the shots term of the BLR, but it is negative. That means as the opposition's shot differential increases, Blackburn's odds of winning go up. This can make some sense, if one thinks in terms of accuracy. A shot doesn't really matter unless it is on target. Blackburn's coefficient for shots on goal (SOG) is statistically significant as well, and is positive, which makes sense. This means the more SOGs the opposition gets, the more Blackburn's odds of winning go down. However, they actually benefit when teams take wild shots with little likelihood of scoring a goal. This is all reflected in Arsenal's need to take so many shots to get roughly the same percentage of shots on goal as Blackburn. Had Arsenal had better accuracy, Blackburn's odds would have been much lower.
- Arsenal's odds suffer, and Blackburn's benefit, from Arsenal's corner differential. This is a trend seen throughout the data set, both in the total league data and individual team data. This possibly counter-intuitive trend will be explored in a later post.
- The foul differential is zero, but even if it weren't, it would not matter. Neither team's BLR coefficient for fouls is statistically significant.
- Clearly Arsenal benefits from only having a single yellow card vs. two yellows and a red card for Blackburn. This raises Arsenal's odds and lowers Blackburn's due to their statistically significant coefficients for fantasy points.
In all, it gives both teams high odds of winning the match. Perhaps a draw was the most appropriate outcome according to the statistics.
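The mechanics behind the list above can be sketched in a few lines: each club has its own BLR, any term that is not statistically significant is treated as a zero coefficient, and the two resulting win probabilities are computed independently, so nothing forces them to sum to 100%. This is only a sketch under stated assumptions: the 0.05 threshold, the function names, and every coefficient, p-value, and match input below are invented placeholders that merely mirror the signs described in the list, not the fitted values from my models.

```python
import math

ALPHA = 0.05  # assumed significance threshold


def effective_coefficients(fitted: dict[str, tuple[float, float]],
                           alpha: float = ALPHA) -> dict[str, float]:
    """Keep a coefficient only if its p-value is below alpha; otherwise use 0.

    `fitted` maps a term name to a (coefficient, p_value) pair.
    """
    return {name: (coef if p < alpha else 0.0) for name, (coef, p) in fitted.items()}


def win_probability(fitted: dict[str, tuple[float, float]],
                    match: dict[str, float],
                    alpha: float = ALPHA) -> float:
    """P(win) from one team's own logistic model for one match."""
    coefs = effective_coefficients(fitted, alpha)
    z = coefs.get("const", 0.0) + sum(
        coefs[name] * value for name, value in match.items() if name in coefs
    )
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Placeholder models: (coefficient, p_value) per term, invented for illustration.
    team_a = {"const": (-0.5, 0.01), "shots": (0.05, 0.40), "sog": (0.2, 0.01)}
    team_b = {"const": (0.3, 0.50), "shots": (-0.05, 0.02), "sog": (0.2, 0.01)}
    # Differentials from each team's own point of view (also placeholders).
    p_a = win_probability(team_a, {"shots": 15, "sog": 2})
    p_b = win_probability(team_b, {"shots": -15, "sog": -2})
    print(f"team A: {p_a:.3f}, team B: {p_b:.3f}, sum: {p_a + p_b:.3f}")
```

Because the two models are fitted separately, it is entirely possible for both clubs to come out with high odds in the same match, which is the situation the table describes.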
Conclusion
The beautiful thing about blogging is that mistakes can be corrected instantly, unlike books, where one must wait months or years until the next edition is published, or magazines, where a retraction can only be made next month. The instantaneous nature doesn't negate the challenges associated with errors, but it does make the communication more honest and more open in a quicker manner. I hope that being honest about my mistake reassures you that I am always striving for quality data. I'm redoubling my efforts to check all my numbers before I post any new material. Catching the error and correcting it has turned out for the better when it comes to the flexibility of the data set and the conclusions that can be drawn. Most importantly, I have shown that fantasy points for yellow and red cards are statistically significant, which will enable the use of a single metric to capture the effect of two related events in a match, one of which is especially rare. Stay tuned for the follow-up post...
