Saturday, April 17, 2010

Explaining regression through an enhanced Soccernomics analysis


The full results of the Soccernomics pay-for-play regression, including the actual equation and the distribution that accompanies it.

In this post I will attempt to tackle the much-abused and little understood topic of regression theory by using one of the more famous models in the soccer community: Soccernomics' infamous Figure 3.1 showing the pay-to-win regression of the top two English soccer leagues. This post will focus on the technical aspects of regression theory, walking through the step-by-step process likely used by the authors of the book. Occasionally I will go beyond what the authors showed, just to provide something more than a regurgitation of their study and hopefully provide greater insight into the general theory behind the analysis.

Before beginning, I would like to point out that the original inspiration for this post came via a request from one of my first tweeps. It just goes to show you that if you reach out to me on Twitter or via the open thread on the blog, I will respond with the requested analysis. I hope you enjoy this post, jblock49!

Background

In their seminal work Soccernomics, authors Simon Kuper and Stefan Szymanski lay out a very intuitive yet startling correlation: to finish higher in the tables of the top two English soccer leagues, a team must spend more money than its opponents. Their analysis of the data, the results of which are shown in the graph below, shows that team payroll, expressed as a multiple of the leagues' average payroll, explains a whopping 88.7% of the variation in finishing position within the tables.

Figure 1: Regression analysis from Soccernomics

The authors' use of transformed data (see the "log" denotations on each axis), and their lack of discussion around the uncertainty inherent in any regression analysis, provides fertile ground for a case study in regression theory. Too often we equate regression analysis with dropping two data sets into Excel, plotting them with a fitted line, and hoping that the R-squared value comes out good enough to justify a relationship. What many people don't realize is that a regression study must satisfy many more requirements before its conclusions can be accepted. I will explain those requirements here.

The prerequisite: a normally distributed response variable

In the regression world, there are two types of variables. If we imagine an equation in the form of y = mx + b, the variables are named as follows:
  • y = response variable
  • m = regressor coefficient
  • x = regressor variable
  • b = regression constant
In this case, y and x are the data sets used to develop m and b, which together provide the regression equation we are used to seeing. Before beginning any regression analysis, we'd like to see a normal distribution in the response data set y. In the case of the Soccernomics study, the regressor was the multiple of league-average pay while the response was finishing position.

The trick with any analysis of league finishes is that the variable of interest is not the actual finish position, but rather how you finish relative to everyone else. The logic is similar to that used for wages - you don't need to spend a certain amount to win, just more than your opponents. That's where the first transform of the original data found in Soccernomics comes in. Instead of looking at the raw finish position, the authors looked at a relative finish position that provides a rough indication of how frequently one team would finish ahead of another. They did this by using the transformed data set of:

p/(45-p)

It appears they used 45 as a normalizing value because of the combined number of positions available to teams in the top two leagues, including relegation and promotion. This transform meant teams at the top of the table would have low values, while those at the bottom would have high values. However, this transform presented some challenges, which you can see in Figure 2 - the transformed data wasn't normal, as indicated by a normality-test p-value below 0.05.

Figure 2: Graphical Summary of p/(45-p)

The authors were on the right track, but they needed to perform an additional transform to make the data normal. This is when they would have turned to their commercial statistics program and asked it to run several common transforms (logarithmic, natural log, etc.) to help identify one that produced a normal distribution. They settled on the natural log transform, with a -1 coefficient in front of it. A natural log transform takes the teams at the top of the table - which now have small values under the p/(45-p) transform - and gives them large negative numbers. Using a -1 coefficient to invert the data set makes sense - higher pay should correspond to a higher finish position, not the other way around - and it doesn't affect the normality of the data set. The authors' suspicions were correct, and they were rewarded with a normal data set. See Figure 3 below, where the p-value is greater than 0.05 and thus we cannot reject the assumption that the data is normally distributed.

Figure 3: Graphical Summary of -ln[p/(45-p)]
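To make the two transforms concrete, here is a minimal sketch in Python of the same steps run on a handful of made-up finishing positions (the book's club-level data is not reproduced here). It uses SciPy's Shapiro-Wilk test in place of the Anderson-Darling test reported in the graphical summaries, so the p-values will not match Figures 2 and 3; the point is the mechanics of the transform and the check.

```python
import numpy as np
from scipy import stats

# Hypothetical average finishing positions across the top two divisions
# (1 = top of the table, 44 = bottom); not the book's actual club data.
p = np.array([1.5, 3.0, 4.2, 6.8, 9.5, 12.0, 15.5, 18.0,
              21.4, 24.9, 28.3, 31.0, 34.6, 38.2, 41.5, 43.0])

ratio = p / (45.0 - p)       # first transform: relative finish position
response = -np.log(ratio)    # second transform: -ln, so top teams get large values

# Normality check (Shapiro-Wilk; the graphical summaries above use Anderson-Darling)
for name, data in [("p/(45-p)", ratio), ("-ln[p/(45-p)]", response)]:
    stat, p_value = stats.shapiro(data)
    verdict = "looks normal" if p_value > 0.05 else "not normal"
    print(f"{name}: p-value = {p_value:.3f} ({verdict})")
```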

Transforming the Regressor

To preserve any chance of a linear relationship between the response variable and the regressor, the authors then likely set out to transform the wage data by a similar method. A natural log transform produces the normal data set shown in Figure 4.

Figure 4: Graphical Summary of ln(wage multiple)

Prior to regression: determining if correlation exists

Prior to beginning regression modeling, the authors would have performed a correlation study, and this is where we start getting into the claims of the book. Completing a correlation study is the first step because it helps us understand the total amount of variation explained by the relationship between the data sets, and it provides a good statistical test of whether that value is high enough to justify a regression analysis and equation. No more guessing at Excel-based R-squared values!

Correlation is measured via a correlation coefficient, calculated per the formula below. It essentially measures the joint scatter of the two data sets about their means (the x(i) - x(bar) and y(i) - y(bar) terms) in relation to the overall scatter of the data (s(x) and s(y), the sample standard deviations):

r = Σ[(x(i) - x(bar)) * (y(i) - y(bar))] / [(n - 1) * s(x) * s(y)]

A graphical representation of this equation can be found in Figure 5.

Figure 5: Graphical representation of correlation measurement
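For those who want to see the arithmetic, here is a small sketch that computes the coefficient directly from the formula above and checks it against SciPy's built-in pearsonr, which also returns the significance p-value discussed next. The wage and finish values are illustrative stand-ins, not the book's data.

```python
import numpy as np
from scipy import stats

def correlation(x, y):
    """Pearson's r per the formula above: joint scatter about the means divided
    by (n - 1) times the two sample standard deviations."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    return ((x - x.mean()) * (y - y.mean())).sum() / ((n - 1) * x.std(ddof=1) * y.std(ddof=1))

# Illustrative transformed data (not the book's values)
ln_wage = np.log([0.4, 0.6, 0.8, 1.0, 1.3, 1.8, 2.5, 3.4])      # ln(wage multiple)
finish = np.array([-1.9, -1.2, -0.7, 0.1, 0.6, 1.4, 2.0, 2.8])  # -ln[p/(45-p)]

r_manual = correlation(ln_wage, finish)
r_scipy, p_value = stats.pearsonr(ln_wage, finish)
print(f"r (by hand) = {r_manual:.3f}, r (scipy) = {r_scipy:.3f}, p-value = {p_value:.4f}")
```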

Once a correlation coefficient has been calculated, it can be compared to critical values assigned to different risk levels based upon the number of samples in the data set. In the case of the English league data set of 58 samples, the authors would need a correlation coefficient of roughly 0.2948 to 0.3218 or greater (depending on the lookup table) to conclude there was less than a 1% risk of incorrectly concluding that a significant enough relationship exists between the two variables to proceed with a regression study. Instead of using the lookup tables, I used a statistical software package. Running the authors' data through a correlation study yields the results in Figure 6.

Figure 6: Correlation study of -ln[p/(45-p)] and ln (wage multiple)

The results of the study clearly indicate a low risk in concluding that a correlation exists (p-value = 0.000), with a correlation coefficient of 0.94 between the two variables. Squaring that coefficient implies that roughly 88% of the variation in finishing position is explained by the relationship, which is a bit lower than the claim in Soccernomics of 92% of the variation in league position being explained by team expenditure. I triple-checked the data I copied from Figure 3.2 in the book and could find no errors, so I don't know whether the difference is a typographical error in the book or the result of some other analysis. Nonetheless, there is a strong enough relationship between the two variables to warrant a regression analysis.

Checking the Regression Results

The step of producing the regression equation and plot is, at this point, a formality for most of us. It should be noted that most regression algorithms are based upon the least squares method: out of all the candidate equations that could describe the behavior of the system, the algorithm finds the one that minimizes the sum of the squares of the errors between the regression equation and the original data.
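As an illustration of what "minimizing the sum of squared errors" means, the sketch below fits a line to some simulated data with NumPy's polyfit, which performs an ordinary least squares fit, and then evaluates the quantity being minimized.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.5, 1.5, 40)                   # stand-in for ln(wage multiple)
y = 0.5 + 1.5 * x + rng.normal(0, 0.3, x.size)   # noisy linear response

# Ordinary least squares: choose m and b to minimize sum((y - (m*x + b))**2)
m, b = np.polyfit(x, y, deg=1)
sse = np.sum((y - (m * x + b)) ** 2)
print(f"fitted slope = {m:.3f}, intercept = {b:.3f}, sum of squared errors = {sse:.3f}")
```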

The trick isn't in the regression equation itself, but in the results that it produces. In general, regression analyses must have:
  • A normally distributed response variable data set
  • A statistically significant correlation coefficient
  • Residuals that meet five basic requirements
Residuals are the difference between each response variable data point and the corresponding predicted response value from the regression equation. Residuals represent the error in the statistical model. For a regression equation to be accepted as statistically valid, the following five requirements must be met:
  • Residuals are normally distributed with a mean of zero
  • Residuals are random and show no pattern
  • Residuals have constant variance
  • Residuals are independent of the values of the regressor variables
  • Residuals are independent of each other
Meeting these requirements ensures that the relationship between the two data sets is real, and not the effect of an unseen factor, confounding variable, or test procedure.

When a regression analysis is carried out on the transformed finishing position and wage data, Figure 7 is produced. The two plots on the left suggest that the residuals are normal, and the normality test in Figure 8 confirms they are (barely) normally distributed, with a p-value of 0.063. The two plots on the right address the other four requirements, and there does seem to be some trouble at either end of the data set. The upper-right plot shows increasing spread (non-constant variance) toward either end of the data set, and the lower-right plot indicates consistent under- or over-prediction by the model at the extremes. I don't know if this would have been enough to conclude that the regression analysis was invalid, but it would definitely have caused me to look into why the ends are so skewed.

Figure 7: Four-in-one plot for regression residuals

Figure 8: Graphical summary of residuals
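A rough, code-based version of those residual checks might look like the sketch below. The data is simulated to stand in for the transformed wage and finish variables, so the numbers will not match Figures 7 and 8; the point is simply which quantities get examined for each requirement.

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)
x = np.linspace(-1.5, 1.5, 58)                    # stand-in for ln(wage multiple)
y = 0.55 + 1.49 * x + rng.normal(0, 0.3, x.size)  # stand-in for -ln[p/(45-p)]

model = sm.OLS(y, sm.add_constant(x)).fit()
resid = model.resid

# Normally distributed residuals with a mean of zero
_, norm_p = stats.shapiro(resid)
print(f"residual normality p-value = {norm_p:.3f}, mean residual = {resid.mean():.4f}")

# Constant variance (roughly): compare the spread in each half of the regressor range
print(f"residual std, low x vs high x: {resid[x < 0].std():.3f} vs {resid[x >= 0].std():.3f}")

# Independence of successive residuals (Durbin-Watson near 2 is good)
print(f"Durbin-Watson = {durbin_watson(resid):.2f}")
```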

The Model's Results

Now that all of the assumptions have been checked, it is time to move on to analyzing the model's results. We are used to seeing regression presented simply as a line and an R-squared value, but there is much more going on behind the scenes. Since we are all familiar with the regression graph in Figure 3.1 of Soccernomics, I have instead focused on the statistical output found in Figure 9 below.

Figure 9: Output from regression analysis

The first set of data to focus on is the R-Sq and R-Sq(adj) values. Notice that the R-Sq value is different from the correlation coefficient we calculated earlier. The R-Sq value measures the proportion of variation that is explained by the regression model that has been generated, not the overall variation explained between the two data sets. R-Sq is simply the SS (sum of squares) value from the "Regression" row in the Analysis of Variance subsection divided by the SS value from the "Total" row. Hence, 88.3% of the variation is explained by the regression model.

The R-Sq(adj) term is a modified form of R-Sq that takes into account the number of terms in the model. In this case, the linear regression performed has only one term. If one were to try to predict the behavior with a cubic regression (i.e. an equation of the form y = ax^3 + bx^2 + cx + d), the equation would have three terms plus the constant. R-Sq(adj) is a way of telling whether the terms you add by moving to more complex regression equations are actually improving the fit - higher R-Sq(adj) values mean better regression bang-for-the-buck. In the case of the Soccernomics study, no regressions were run beyond the linear one.
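The sketch below illustrates the distinction on simulated data with a genuinely linear relationship: a cubic fit nudges R-Sq upward slightly, but R-Sq(adj) penalizes the extra terms. It also confirms that R-Sq is just the regression sum of squares divided by the total sum of squares.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = np.linspace(-1.5, 1.5, 58)                    # stand-in for ln(wage multiple)
y = 0.55 + 1.49 * x + rng.normal(0, 0.3, x.size)  # a genuinely linear relationship

def fit_polynomial(degree):
    """OLS fit of y on x, x**2, ..., x**degree plus a constant."""
    X = sm.add_constant(np.column_stack([x ** d for d in range(1, degree + 1)]))
    return sm.OLS(y, X).fit()

linear, cubic = fit_polynomial(1), fit_polynomial(3)

# R-Sq = "Regression" sum of squares / "Total" sum of squares
print(f"SS check: {linear.ess / linear.centered_tss:.3f} vs R-Sq {linear.rsquared:.3f}")

for name, res in [("linear", linear), ("cubic", cubic)]:
    print(f"{name}: R-Sq = {res.rsquared:.3f}, R-Sq(adj) = {res.rsquared_adj:.3f}")
```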

With such a high R-Sq value, we can also look at the statistical tests of the predictors. Both the constant and the regressor have p-values below 0.05, meaning there is very little risk in concluding that each term belongs in the model.

Finally, we can move on to the equation itself which was not included in the book. The equation relating average finish position to wages is:

-ln[p/(45-p)] = 0.5465 + 1.487*ln(wage multiple)

For all you EPL fans out there, this is the equation that matters. If you were ever able to get your hands on a list of team payrolls at the beginning of a season, you would be able to project the average outcome of the season if it were played repeatedly at those pay ratios for several years in a row.
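Solving that equation for p gives p = 45 / (1 + e^(0.5465 + 1.487*ln(w))), where w is the wage multiple. A small sketch of the projection is below; the wage multiples are illustrative, not actual club payrolls, and the output is the long-run average finish one would expect at that spending level.

```python
import numpy as np

def projected_finish(wage_multiple):
    """Invert -ln[p/(45-p)] = 0.5465 + 1.487*ln(wage multiple) to solve for p,
    the long-run average finish position across the top two English divisions."""
    rhs = 0.5465 + 1.487 * np.log(wage_multiple)
    return 45.0 / (1.0 + np.exp(rhs))

for w in (0.3, 1.0, 2.0, 3.0):
    print(f"{w:4.1f} x average payroll -> projected position {projected_finish(w):4.1f}")
```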

Statistics are always about distributions

Finally, I'd like to close the discussion of regression by extending the authors' analysis a bit further. For simplicity's sake, they presented what I would call a simplified model of regression. It works, and it gets the point across. But to statisticians, it's all a bit too simple. Statistics are always about distributions - nearly every test involves calculating means and variances or standard deviations. Sometimes these quantities are used as checks and prerequisites before beginning tests; other times they are critical elements within the tests. More importantly, the output of the tests always has a distribution associated with it.

In the case of regression, the equation the line represents describes the mean value of the relationship. In reality, the likely distribution of the response at any one regressor value can be calculated. The distribution of the predicted response varies along the length of the regression line and across the data set, and it depends on the underlying data sets used in the regression. Thus, what we're really calculating when we use the regression equation is the likely average of the individual outcomes that could occur over time at the selected regressor value. In reality, there is a range of outcomes. This range of outcomes is called a prediction interval (PI), and the tighter band inside of it is the confidence interval (CI) for the predicted mean value.
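For anyone who wants to reproduce those intervals, a fitted model in statsmodels will compute both at any regressor value. The sketch below uses simulated stand-in data, so the interval widths are illustrative rather than those shown in Figure 10.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = np.linspace(-1.5, 1.5, 58)                    # stand-in for ln(wage multiple)
y = 0.55 + 1.49 * x + rng.normal(0, 0.3, x.size)  # stand-in for -ln[p/(45-p)]

res = sm.OLS(y, sm.add_constant(x)).fit()

# Predict at one new point (a club paying twice the league average): the CI bounds
# the predicted mean, while the wider PI bounds a single future observation.
new_x = np.array([[1.0, np.log(2.0)]])            # [constant, ln(wage multiple)]
frame = res.get_prediction(new_x).summary_frame(alpha=0.05)
print(frame[["mean", "mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])
```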

Figure 10 is the same as the figure at the beginning of this blog post. It is the same regression analysis performed throughout this study, but I have turned on the 95% CI and PI lines. The outer PI lines represent the range within which we are 95% certain individual future observations will fall at each regressor value on the x-axis, while the inner CI lines bound where we are 95% certain the mean of those observations will fall.

Figure 10: Regression analysis with PI and CI included.

As one can see, much of the observed variation in the data falls within the PI lines. This graph gives a much more complete picture of the expected behavior given the data, and it becomes far more useful than the standard single regression line when one wants to understand whether a single season's finish in the English leagues is expected or an anomaly.

Conclusion

This was, believe it or not, a brief treatment of regression theory. I hope it has provided you with a much better understanding of all the calculations and checks that must go on when performing regression studies. The next time someone shows you an Excel graph with a line and an R-squared value on it, ask them if they have checked their residuals and the p-values associated with the terms in the regression equation. Ask them if they know what the correlation coefficient for their data is, or if they are sure the response variable data is normally distributed. Until they can show you the data confirming all the critical checks of a regression analysis, it's just a pretty picture.
