Monday, May 3, 2010

Predicting MLS table finishing position through regression


This is why the Sounders are already in jeopardy of missing the playoffs.

As a Seattle Sounders fan, I have been spoiled. The team made the playoffs in its first year despite some wildly inconsistent play, and we supporters have come to expect much more from the team in its second year. We simply will not accept poor play, which has made the start of this season a bit frustrating. Matches against Real Salt Lake, FC Dallas, and the Columbus Crew that ended in ties because of stoppage-time goals have cost the Sounders 6 points in a still-young season. In my frustration, I thought about how I might understand this rough start statistically, and I have come up with a composite method for gauging how teams are doing at any point in the season.

Background

In trying to understand how the Sounders' start compares with the rest of the league's performance, one immediately runs into the challenge that teams have played anywhere between four and seven games to date. This affects the maximum points available to each team, and may skew any perceptions taken from the raw data. See Figure 1 for the current team standings, points, games played, and goal differential.


Figure 1: MLS Standings as of May 3, 2010


To understand how a team is doing so far, I have built statistical models that project a team's finish from its play to date, using historical data as the inputs to the models. In this case, the response variable is finishing position, while the predictors are the team's goal differential and points. I have also studied a third relationship - points vs. goal differential - which can be used as a check against the assumptions regarding projected goal differential.


There are some simplifications in these models. Namely, they assume each team keeps performing as it has so far, whereas one would expect teams like the LA Galaxy and DC United to regress a bit toward the mean. Nonetheless, one can account for such gross over- and under-performance by placing bounds on the projections that correspond to historical limits.


I have also justified this approach by planning to update the projections on a monthly basis. I made this initial projection once the majority of the teams had completed nearly 20% of their season. Similar projections are likely being made inside these clubs right now, as the management teams try to get their first read on the adjustments they must make to improve their teams.


The Input Data


Like the previous analysis involving team payroll and finish position, I used a normalized value of the teams' finishing positions. However, in this analysis I did not use an average finish position, but rather each team's individual finish position from each season. This was done to facilitate regression analyses that were used to study the relationship between goals, points, and finishing position.


Another wrinkle in this study was that I did not use a single normalization value. Instead, I used the general normalization equation below to account for the changing number of teams in the league.


normalized finish = -ln[p / (number of teams + 1 - p)], where p is the team's finishing position

Data from the 2005 through the 2009 seasons were used in the analysis, with the following number of teams.
  • 2005: 12
  • 2006: 12
  • 2007: 13
  • 2008: 14
  • 2009: 15
The 66 data points, when normalized per the formula above, provide a normally distributed data set. See Figure 2 below for a graphical summary of the data set.
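A minimal Python sketch of the normalization, assuming p = 1 means first place (the post does not say so explicitly). The function names are illustrative only, and the inverse function is an addition of this sketch for turning a normalized value back into a league position.

```python
import math

def normalize_position(p, n_teams):
    """Apply -ln[p / (n_teams + 1 - p)] to finishing position p (assumed 1 = first)."""
    return -math.log(p / (n_teams + 1 - p))

def denormalize_position(t, n_teams):
    """Invert the transform: recover a (possibly fractional) finishing position."""
    return (n_teams + 1) / (1 + math.exp(t))

# Team counts per season, taken from the list above.
teams_per_season = {2005: 12, 2006: 12, 2007: 13, 2008: 14, 2009: 15}

# One observation per team per season.
n_observations = sum(teams_per_season.values())
print("observations:", n_observations)

for season, n in teams_per_season.items():
    values = [normalize_position(p, n) for p in range(1, n + 1)]
    # First place maps to ln(n) and last place to -ln(n), so every season
    # is centered on zero regardless of league size.
    print(season, round(values[0], 3), round(values[-1], 3))
```

Because the transform is symmetric about zero for any league size, seasons with 12 and 15 teams can be pooled into a single data set.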



Figure 2: Graphical summary of finishing position transformation


Regression of Finishing Position and Points


The first study involved the correlation between finishing position and points. The first check produced a Pearson correlation coefficient of 0.915, which indicates a very strong linear relationship. (The share of variation explained is the square of this coefficient: 0.915 squared is about 0.837, or 83.7%.) See Figure 3 for the results of the correlation study. This is an intuitive relationship - it would not be good for the game if there weren't such a correlation.

Figure 3: Correlation study results - finishing position vs. points



Once a statistically significant correlation was established, a regression analysis could take place. Figure 4 shows the regression analysis fitted line plot, confidence intervals, and prediction intervals.



Figure 4: Fitted line plot - finish position vs. points


The fitted line plot shows that the R-squared value is 83.7% - that means 83.7% of the variation in the data is explained by the regression model. This model is a good fit, and will be used later in this post to project teams' finishing positions based upon their performance to date in the 2010 season.
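The R-squared of 83.7% is simply the square of the 0.915 Pearson coefficient, which holds for any one-predictor least-squares line. The sketch below checks that identity in plain Python, not in whatever statistics package produced the figures. The numbers in it are made up for illustration and are NOT the MLS data set.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

def r_squared(xs, ys, intercept, slope):
    """Share of the variation in ys explained by the fitted line."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Made-up illustrative values, NOT the historical MLS data.
points = [25, 31, 38, 40, 45, 52, 55]
normalized_finish = [-2.1, -1.0, -0.4, 0.2, 0.3, 1.4, 2.0]

r = pearson_r(points, normalized_finish)
intercept, slope = fit_line(points, normalized_finish)
r2 = r_squared(points, normalized_finish, intercept, slope)
print(f"r = {r:.3f}, r squared = {r * r:.3f}, R-squared = {r2:.3f}")
```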


Regression of Finishing Position and Goal Differential


The second study involved the correlation between finishing position and goal differential. The first check produced a Pearson correlation coefficient of 0.824, which is still a strong linear relationship. (Squaring it gives the share of variation explained: 0.824 squared is about 0.679, or 67.9%.) See Figure 5 for the results of the correlation study. This is an intuitive relationship, but apparently a less direct one than points. This makes sense, though - points translate more clearly to finishing position, while a team can rack up a 3 or 4 goal differential in a single match and still earn the same 3 points as a team that wins by a single goal.


Figure 5: Correlation study results - finishing position vs. goal differential



Once a statistically significant correlation was established, a regression analysis could take place. Figure 6 shows the regression analysis fitted line plot, confidence intervals, and prediction intervals.



Figure 6: Fitted line plot - finish position vs. goal differential


The fitted line plot shows that the R-squared value is 67.9% - that means 67.9% of the variation in the data is explained by the regression model. This model is a decent fit, but not as good as the regression of finishing position vs. points. Nonetheless, it will be used later in this post to project teams' finishing positions based upon their performance to date in the 2010 season. This regression will be used in conjunction with others to produce an average finishing position from multiple calculation methods.

Regression of Points and Goal Differential

The third study involved the correlation between points and goal differential. The first check produced a Pearson correlation coefficient of 0.912, which indicates a very strong linear relationship. (Squaring it gives the share of variation explained: 0.912 squared is about 0.832, or 83.2%.) See Figure 7 for the results of the correlation study. Apparently the relationship between points and goal differential is nearly as strong as the one between finishing position and points.


Figure 7: Correlation study results - points vs. goal differential


Once a statistically significant correlation was established, a regression analysis could take place. Figure 8 shows the regression analysis fitted line plot, confidence intervals, and prediction intervals.



Figure 8: Fitted line plot - points vs. goal differential


The fitted line plot shows that the R-squared value is 83.2% - that means 83.2% of the variation in the data is explained by the regression model. This model is a good fit, and will be used in conjunction with the previous two models to project teams' finishing positions based upon their performance to date in the 2010 season.


Projected League Finish


Using the three regression models developed above, I projected each team's finishing position from its play to date. Each team's season-ending points and goal differential are projected by extrapolating its performance to date according to how many games it has played. Those projected season-ending goal differentials and points are then used as inputs to the three regression equations. See Figure 9 for a summary of the results.
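The extrapolation step can be sketched as a simple pro-rating of a team's totals to a full season. This is one reading of the description above, not necessarily the exact calculation behind Figure 9. The 30-game season length and the example team are assumptions of this sketch.

```python
SEASON_GAMES = 30  # assumption: regular-season games per team in 2010

def project_season_totals(points, goal_diff, games_played,
                          season_games=SEASON_GAMES):
    """Scale points and goal differential to date up to a full season."""
    scale = season_games / games_played
    return points * scale, goal_diff * scale

# Hypothetical team: 7 points and a +1 goal differential after 6 games.
proj_points, proj_gd = project_season_totals(7, 1, 6)
print(proj_points, proj_gd)  # 35.0 5.0
```

The projected totals would then be fed into the fitted regression equations. Their coefficients appear only in the fitted line plots, so they are not reproduced here.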




Figure 9: Projected team finish position based upon play through May 3rd, 2010

Each model's results are tabulated to project a team's finish position. The average of the three models' projected finish positions is then used to rank the teams in order - the projected average is less important than the order itself. Teams marked in green are the Top 8, the teams most likely to qualify for the playoffs. Teams marked in red are the bottom 8, the teams most likely to miss the playoffs.
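The averaging and ordering step might look like the sketch below. The team names and per-model finish positions are hypothetical placeholders, not values from Figure 9.

```python
def rank_by_average(projections):
    """Order teams by the mean of their per-model projected finishes.

    projections maps team name -> list of projected finish positions,
    one per model; lower numbers are better.
    """
    averages = {team: sum(vals) / len(vals)
                for team, vals in projections.items()}
    ordered = sorted(averages, key=averages.get)
    return [(rank, team, averages[team])
            for rank, team in enumerate(ordered, start=1)]

# Hypothetical projected finishes from three models.
example = {
    "Team A": [3.0, 5.0, 4.0],
    "Team B": [9.0, 7.0, 8.0],
    "Team C": [1.0, 2.0, 3.0],
}

for rank, team, avg in rank_by_average(example):
    print(rank, team, round(avg, 2))
```

Only the resulting order is used to split the table into a Top 8 and a bottom 8. The averages themselves are just a means of sorting.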

Going back to my original inspiration - the Sounders' three ties - I found some interesting results. The Sounders are currently projected to finish 11th in the league. The season is still early enough that converting one of those ties to a win - i.e. 2 additional points and one additional goal in the goal differential category - moves the Sounders into the Top 8. Converting all three ties to one-goal wins moves them into the Top 4. The effect is large because it is assumed that they would perform in a similar manner the rest of the season - perhaps not a safe assumption given the erratic play to date.
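To see why a single result swings the projection so much this early, here is a rough sketch of how 2 extra points today scale up under pro-rating. It assumes a 30-game season and covers the four-to-seven games played range mentioned in the Background section. The function name is illustrative.

```python
SEASON_GAMES = 30  # assumption: regular-season games per team in 2010

def projected_gain(extra_points, games_played, season_games=SEASON_GAMES):
    """Season-end points added when extra_points are pro-rated forward."""
    return extra_points * season_games / games_played

# Turning one tie into a win adds 2 points to the current total.
for games in range(4, 8):
    print(games, "games played:", round(projected_gain(2, games), 1))
```

The fewer games a team has played, the more each result is magnified, which is why the projection will settle down as the season goes on.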

This analysis does show how far behind the eight ball the Sounders are. Teams like Real Salt Lake, Toronto FC, and the Philadelphia Union are expected to be in the bottom 8. The Sounders are not. They have their work cut out for them.

Further Studies

The regression models I have created can be used in a variety of studies. Subsequent blog posts will explore the following:
  • The statistical minimum number of points and goal differential required to finish in the Top 8 in 2010.
  • The number of points and goal differential required in 2011 to make the playoffs when the league expands to 18 teams.
  • How much the LA Galaxy are outperforming the historical norm, and how much DC United are under-performing it.
Additionally, I will keep track of how the projections change on a monthly basis through the end of the season.
