visit
I almost like these better.
The top four teams, by win-loss record, of the 2017 NFL Season were:
And so, people come up with ranking systems. The Ringer, FiveThirtyEight, those dumbasses who make a job out of yelling on First Take, all have developed a system to evaluate and rank NFL teams based on data outside their records.
These vary in quality. As anyone with half a brain could imagine, Skip Bayless’s reasoning is god-awful. ESPN’s are pretty good, in my opinion, and I think FiveThirtyEight’s ELO-based system is both really interesting and gives consistently reasonable predictions. Today, I’m going to ignore all of those systems and make my own, based on fundamentally different reasoning. Why? Because I can.Let’s start with an example. Imagine a league with three teams: X, Y, and Z. And let’s say this series of results happened:
Our objective, then, is to assign each team a rating, R, that represents the true strength of that team.
First, let’s impose a constraint on every team’s rating:That is, the probability that team X beats team Y is equal to its share of the total rating between those two teams. Then, we can arrive at the following:
The above states that the total probability of this season’s results is equal to the product of the probability of the outcome of each individual game. But, we know that these games resulted in the given outcomes: we observed them ourselves! So, the ratings that maximize the probability of our observed outcomes are the “true” ratings of each team.
That is, we want to find values for each R such that Pr(schedule) is maximized: the maximum likelihood estimate for the team ratings.
Actually finding this takes some math, which we’ll go through in the next section, but first, let’s think about this optimization problem intuitively. Imagine that you have three levers, each representing a team’s rating. We start with all three levers at 1, and then slowly start increasing X’s lever. You can see that X is both in the numerator and the denominator of the fraction we’re optimizing, so increasing it only increases Pr(schedule) to a point. We then jump to Y’s lever, and start increasing it until the value again begins to decrease. We then jump back to X’s, or move forward to Z’s, and so on, until we find the perfect balance between the three that maximizes the equation.
But of course, reality has no levers, and calculating our maximum takes math. We’re essentially going to derive gradient descent for this equation (and eventually, program it and run it!), which starts with finding a gradient. Again, this is our equation:The first thing you’ll notice is that this would be a frustrating expression to take the derivative of. We have division and multiplication, which will lead to some painful algebra. But, as it turns out, we don’t have any particular allegiance to this specific equation: we’d be fine optimizing an equivalent equation that’s easier to derive but behaves the same way as this one (that is, has the same extrema as this one). I’m arbitrarily choosing the natural log, because it’s a monotonically increasing function that preserves the global extrema of the original equation, and it turns this complex blob of math into easier-to-handle additions and subtractions. Then:
With some algebra, this becomes:
The next step is to take the partial derivative with respect to each R in turn. I’m just going to give you these equations; verify my math if you’d like!
Now, we set each derivative to 0 (because we want to find the point at which the function is not changing, which in this function, is the maxima), and shuffle our equations into better forms:
And now we’re left with a system of three equations that represent the ratings for each of our teams. The obvious next step is solving this system, but even a cursory glance tells you that finding an algebraic solution would be quite difficult to arrive at, if not impossible. Matrix methods also won’t work here, as this is a nonlinear system that can’t be massaged into a better form.
So let’s step back for a second, and look at this differently. What if we just arbitrarily chose values for each R as an initial estimate? For example, let each rating be 1.
Then let’s plug those values back into each equation. That gives you:
What does this tell us? Well, it tells us that after one iteration of the above equations, we find that this is the current estimate of the maximum likelihood. And though it’s not a particularly good estimate, it already has expresses a lot of the intuitive ideas we discussed earlier. Y won against both X and Z, so it has the highest rating. X beat Y, but it also lost to Y, and it lost to Z, a team that Y beat, so it’s below Y. And Z is above X because it beat X and lost to Y, which X also lost to.
Now, we could run this same process again, plugging in these new R values and finding an updated estimate. There’s a whole century of academic literature, dating back to the 1920s, that states that eventually, these values will converge and you’ll find your maximum likelihood estimate.
Congratulations! You just ran gradient descent. Now, do the same thing, but for a full NFL season of 32 teams and 15 * 17 = 180 games. Our math above solved for a crippled and abridged NFL season, but it’s easily extensible to a full season. Here are our final equations again:Now, if you stare at this for a bit, you’ll be able to convince yourself that the general form of the above equations is:
where R_A is the rating of team A, W_A is the total number wins of team A, and G_Ω is the number of games A played against team Ω.
With our math above solved, we can finally write code to find maximum likelihood estimators for some real data. I’m going to use the 2017 NFL season as our example, and I’m going to write everything from scratch because again, I can.
Let’s start by first implementing the generalized rating equation that we want to optimize:Here, current_weights
is an array of length 32, initialized to 1 for our first iteration, where current_weights[i]
is the current estimate for the rating of team i, games_matrix
is a two-dimensional array such that games_matrx[i][j]
is the number of times team i played team j, and wins_array
is a one-dimensional array where wins_array[i]
is the number of wins of team i.
This function is fairly self-explanatory: it takes the games_matrix
and wins_array
params from above, initializes an current_weights
array, and then while there’s still change between iterations, it travels along the gradient curve (that is our optimization_function
) towards a local maxima.
where D is the score differential and S is the total number of points scored. What this essentially says is that every win is automatically given 1/2, and then, we add a factor proportional to the difference in points. If D is close to S, meaning the winning team won by a lot, we end up with a number close to 1/2, which leads to the total win weight approaching 1. If D is much less than S, however, we end up with a win weight approaching 1/2.
Every win will end up somewhere on this spectrum, but it’s worth will be proportional to the amount the winner won by. Here’s how it looks in code: Here, game is an object from the nflgame module, which contains data about a particular NFL game.If you remember, we counted our wins in the generate_matrices
function. Here it is again, but calling the wins_update_formula
function instead of simply adding 1 for each win:
Otherwise, you can find me on Twitter at @AakashJapi, email at aakashjapi at gmail dot com, and Facebook at the above name. Feel free to contact me with ideas/thoughts/nfl memes/Patriots trash talk, or anything at all.