Part 1: Introduction
Part 2: Conceptual Framework
Part 3: High-Level Results
Part 4: Formulas
Part 5: Empirical Data for AL 2000
Part 6: Example: David Wells in AL 2000
Part 7: Yearly Results for 1978-2001
Part 8: Top Stars
Part 9: Concluding Remarks
Formulas
In this section I will present the formulas underlying the Win Values system.
There are actually only a few formulas. Each formula is conceptually simple
but may look complex.
We seek the formula for a starting pitcher's Win Value for a game in which
he pitched Z innings,[1] the score at the conclusion of the Zth inning was
RS to RA,[2] and the game was played in a ballpark with a park factor of PF.
Let me write this as WinVal(RS,RA,Z,PF).
From the conceptual framework described above, we know that Win Value is
the difference between the team's expected probability of winning a game given
that the score is RS to RA at the conclusion of the Zth inning, in a PF ballpark,
and the team's expected probability of winning a game given that the team
has scored RS runs at the conclusion of the Zth inning, with league average
pitching, in a PF ballpark.
Let me write the first win probability as WinProb1(RS,RA,Z,PF) and the second
win probability as WinProb2(RS,Z,PF). WinProb1 uses the pitcher's actual
RA information whereas WinProb2 assumes average pitching. So we have:
[Eq.1]   WinVal(RS,RA,Z,PF) = WinProb1(RS,RA,Z,PF) - WinProb2(RS,Z,PF)
Let's first turn to how park effects are handled. This requires a couple of
simplifying assumptions, since no league-season has enough games played at
the same park factor to estimate WinProb1 and WinProb2 separately for
different PFs.
It won't be immediately clear why we would want to do so, but let's rewrite
Equation 1 as follows:
[Eq.2]   WinVal(RS,RA,Z,PF) = WinProb1(RS,RA,Z,PN) - WinProb2(RS,Z,PN)
            + [WinProb1(RS,RA,Z,PF) - WinProb1(RS,RA,Z,PN)]
            + [WinProb2(RS,Z,PN) - WinProb2(RS,Z,PF)]
where PN denotes a park neutral setting. The reason why we write the equation
this way is so that we can consider each of the terms in brackets. What does
each of these terms represent? The first term reflects how the win probability
of a team that has scored RS runs and allowed RA runs at the conclusion of
the Zth inning is affected by the park factor. The second term reflects how
the win probability of a team that has scored RS runs at the conclusion of
the Zth inning with average pitching is affected by the park factor.
Empirically, I have found that the first term is typically small and can
safely be ignored. Getting slightly ahead of ourselves, I have found that
WinProb1 can be expressed in terms of the game's deficit (RA-RS) rather than
having to consider RS and RA separately. By analyzing the inning-by-inning
runs scored distributions at various parks, I have found that the probability
that a team overcomes a deficit of a given size does not depend significantly
upon the park factor.[3]
On the other hand, I have found that the second term is potentially significant,
and merits special treatment. Dropping the first term in brackets in Equation
2, and defining a new variable, we have:
[Eq.3]   WinVal(RS,RA,Z,PF) = WinProb1(RS,RA,Z) - WinProb2(RS,Z)
            + ParkAdder(RS,Z,PF)
where, for convenience, I have dropped the PN labels in WinProb1 and WinProb2,
and where ParkAdder(RS,Z,PF) = WinProb2(RS,Z,PN) - WinProb2(RS,Z,PF).
I will now describe the formulas for each of these three terms. WinProb1
involves two concepts. The first is how RS and RA interact in the formula,
and the second is how we "smear" the run support probabilities.
As described above, WinProb1 will be expressed in terms of the deficit that
the team faces, RA-RS, rather than through a different formula for every
(RS,RA) pair. There are simply not enough games with the same score in any
league-season to estimate each pair separately. Fortunately, I have found
empirically that the probability of overcoming a given deficit does not
depend significantly upon the actual score. For example, the probability
that a team trailing 6-3 comes back to win the game is similar to the
probability that a team trailing 8-5, say, will come back to win the game.[4]
We have previously motivated the notion of "could have been" runs scored
possibilities for a pitcher's run support. The idea is that a pitcher should
be evaluated based not only on how many runs he allowed but also on how many
runs he received that game in run support. However, we do not want to be
fanatical about fixing his run support. This would lead to evaluations that
I am not in favor of.[5] Thus, we seek a middle ground, and this is where
the "could have been" run support smeared probabilities come in.
Putting these two ideas together, then, we have:
[Eq.4]   WinProb1(RS,RA,Z) = Σ_m Smear(m;RS,Z) * DWin(RA-m;Z)
where the summation is over m, the run support possibilities (m goes from
0 to 25, say), Smear(•) is the "could have been" run support probabilities
and DWin(•) is the probability that a team trailing by a specific number of
runs will come back to win the game.
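As a minimal sketch of Equation 4 in Python, assume the Smear and DWin probabilities for a fixed inning Z are already in hand. The function name `win_prob1`, the list layout of `smear_row`, and the deficit-keyed dictionary `dwin` are hypothetical conveniences, and the numbers are made up for illustration rather than taken from any league-season.

```python
from typing import Mapping, Sequence


def win_prob1(ra: int, smear_row: Sequence[float], dwin: Mapping[int, float]) -> float:
    """Eq. 4: sum over m of Smear(m; RS, Z) * DWin(RA - m; Z).

    smear_row[m] is Smear(m; RS, Z) for the RS and Z of this start.
    dwin[d] is the probability of winning when trailing by d runs after
    Z innings (a negative d means the team is ahead).
    """
    return sum(p * dwin[ra - m] for m, p in enumerate(smear_row))


# Toy inputs (illustrative only, not empirical).
smear_row = [0.0, 0.2, 0.6, 0.2]  # "could have been" support of 0..3 runs
dwin = {-3: 0.95, -2: 0.90, -1: 0.75, 0: 0.50, 1: 0.25, 2: 0.10, 3: 0.05}

example = win_prob1(2, smear_row, dwin)  # the pitcher allowed 2 runs
```

Each possible run support m is weighted by its smeared probability, and the deficit RA-m is looked up in the DWin table for the same inning.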
The DWin probabilities can be derived empirically for each inning Z (from
1 to 9) using all the inning-by-inning scoring data from all games in the
league-season under study.[6]
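One way such an empirical tally could look is sketched below. It assumes the scoring data has already been reduced, for one inning Z, to records of (team runs through Z, opponent runs through Z, whether the team won); that record layout and the name `tally_dwin` are assumptions made for the illustration, and the records shown are invented.

```python
from collections import defaultdict


def tally_dwin(records):
    """Raw (unsmoothed) win frequency by deficit for one inning Z.

    records: iterable of (rs, ra, won) tuples, where rs and ra are the runs
    scored and allowed through inning Z and won is True if the team won.
    Returns {deficit: win frequency}, with deficit = ra - rs.
    """
    games = defaultdict(int)
    wins = defaultdict(int)
    for rs, ra, won in records:
        deficit = ra - rs
        games[deficit] += 1
        wins[deficit] += int(won)
    return {d: wins[d] / games[d] for d in games}


# Invented records for a single inning.
records = [(3, 5, False), (1, 3, True), (4, 4, True), (2, 2, False), (6, 2, True)]
raw_dwin = tally_dwin(records)
```

Note that the 5-3 and 3-1 games fall into the same deficit bucket, which is exactly the pooling by deficit described above.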
Note that I smooth the DWin probabilities to remove any effect of small samples
or weird games. That is, I ensure that the probability of winning increases
if the lead increases (holding the inning constant), and I ensure the probability
of overcoming a deficit decreases as the inning increases (holding the deficit
constant).
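The text does not spell out how the smoothing is carried out. One simple way to enforce a constraint of this kind, shown here purely as an assumption, is pool-adjacent-violators averaging: whenever a raw value breaks the required ordering, it is averaged with its neighbour until the sequence is monotone. The function name and the sample values are hypothetical.

```python
def enforce_non_increasing(values):
    """Pool adjacent violators so the result never increases left to right.

    Intended use (an assumption): values[d] is the raw win frequency when
    trailing by d runs in a fixed inning, which should not rise as d grows.
    """
    blocks = []  # each block holds [sum of values, number of values]
    for v in values:
        blocks.append([v, 1])
        # Merge while the newest block's mean exceeds the previous block's mean.
        while (len(blocks) > 1
               and blocks[-1][0] / blocks[-1][1] > blocks[-2][0] / blocks[-2][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    smoothed = []
    for s, c in blocks:
        smoothed.extend([s / c] * c)
    return smoothed


raw = [0.50, 0.27, 0.30, 0.10]  # made-up frequencies for deficits 0..3
smoothed = enforce_non_increasing(raw)
```

The same routine could be applied along the inning dimension for a fixed deficit. A fuller version would probably weight each value by its sample size, which this sketch ignores.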
For the derivation of the Smear probabilities, we will need two new variables.
Let R(x;Z) denote the probability that a team has scored x runs at the conclusion
of the Zth inning. Let S(w,y;Z,C) denote the probability that a team that
has scored y runs at the conclusion of the Cth inning will have scored w runs
at the conclusion of the Zth inning. For our purposes, C will be less than or
equal to Z. For example, we may be interested in knowing the "propagation"
probabilities of a team that has scored 4 runs at the conclusion of the 6th
inning. That is, how likely is such a team to wind up with 7 runs at the
conclusion of the 8th inning, say?
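To make R and S concrete, here is a hedged sketch of how both could be tallied from line scores for one (Z, C) pair. The input layout (one list of runs per inning for each team-game), the function name `tally_r_and_s`, and the nested-dictionary output are assumptions for the illustration; the four line scores are invented.

```python
from collections import Counter, defaultdict
from itertools import accumulate


def tally_r_and_s(line_scores, z, c):
    """Estimate R(x; c) and S(w, y; z, c) from per-inning run lists.

    line_scores: iterable of lists, one per team-game, giving the runs
    scored in each inning. Requires 1 <= c <= z.
    Returns (r, s) with r[y] ~ R(y; c) and s[y][w] ~ S(w, y; z, c).
    """
    r_counts = Counter()
    s_counts = defaultdict(Counter)
    for innings in line_scores:
        if len(innings) < z:
            continue  # this team-game did not reach inning z
        cumulative = list(accumulate(innings))
        y, w = cumulative[c - 1], cumulative[z - 1]
        r_counts[y] += 1
        s_counts[y][w] += 1
    total = sum(r_counts.values())
    r = {y: n / total for y, n in r_counts.items()}
    s = {y: {w: n / sum(ws.values()) for w, n in ws.items()}
         for y, ws in s_counts.items()}
    return r, s


# Invented line scores: runs by inning for four team-games.
games = [[0, 1, 0, 2], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
r, s = tally_r_and_s(games, z=4, c=2)
```

Here three of the four teams had 1 run through the 2nd inning, and of those, two still had 1 run through the 4th, so s[1][1] is 2/3.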
I will not repeat the description of the derivation of these smearing probabilities
via backwards Bayesian bootstrapping. The interested reader should revisit
the derivation described above in the section on the system's conceptual framework.
Suffice it to say here that the backwards Bayesian bootstrapping method requires
me to designate how far back in a game to go for the "could have been" phenomenon
to kick in.
Remember that Wolverton's system essentially says to ignore the game's actual
run support; this is equivalent to pushing the "could have been" experiment
all the way back to the beginning of the game. I have described why I do
not want to follow that approach.[7]
The further back towards the beginning of the game I choose for my "could
have been" cutoff, the more my method looks like Wolverton's support-neutral
method. The closer to the end of the game I choose for my "could have been"
cutoff, the more my method reflects the pitcher's W-L record.
I have experimented with different cutoffs and have reviewed the distribution
of runs scored by inning in great detail. I have settled upon the following
"could have been" inning cutoffs: 6th inning for a 9-inning outing, 5th inning
for a 7- or 8-inning outing, 4th inning for a 5- or 6-inning outing, 3rd inning
for a 3- or 4-inning outing, 2nd inning for a 2-inning outing, and 1st inning
for a 1-inning outing.
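These cutoffs amount to a small lookup table. The sketch below encodes exactly the pairs listed above; the function name is hypothetical, and rejecting outings outside 1 to 9 innings is my own assumption, since the text gives no cutoff for them.

```python
# Cutoff inning C for each outing length Z, as listed in the text.
CUTOFF_BY_OUTING = {1: 1, 2: 2, 3: 3, 4: 3, 5: 4, 6: 4, 7: 5, 8: 5, 9: 6}


def could_have_been_cutoff(z: int) -> int:
    """Return the 'could have been' cutoff inning C for a Z-inning outing."""
    if z not in CUTOFF_BY_OUTING:
        # Assumption: the text defines cutoffs only for Z from 1 to 9.
        raise ValueError(f"no cutoff defined for a {z}-inning outing")
    return CUTOFF_BY_OUTING[z]
```

For every listed outing length the cutoff is no later than the last inning pitched, which matches the requirement that C be less than or equal to Z.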
To save on some notation, let C denote the inning at which we allow the
"could have been" smearing to begin. C will depend upon Z, but I will suppress
that in the formula below.
By backwards Bayesian bootstrapping, we have:
[Eq.5]   Smear(m;RS,Z) = Σ_n {[R(n;C) * S(RS,n;Z,C)] / Σ_j [R(j;C) * S(RS,j;Z,C)]} * S(m,n;Z,C)
where the two summations, over n and j respectively, go from 0 to 25, say.
We can derive all the required R and S probability distributions empirically
for each inning Z (1 to 9) using all the inning-by-inning scoring data from
all games in the entire league-season under study. Therefore, we are able
to derive the required WinProb1 probabilities.
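Equation 5 reads as two steps: a Bayes update that infers where the team probably stood at the cutoff inning C given that it ended inning Z with RS runs, followed by a forward propagation from each of those cutoff scores back out to inning Z. The sketch below follows that reading. The function name, the dictionary layouts for R and S, and the toy distributions are all assumptions for the illustration.

```python
def smear(rs, r_c, s, max_runs=25):
    """Eq. 5: 'could have been' run support distribution given actual rs.

    r_c[n]  ~ R(n; C): probability of n runs through the cutoff inning C.
    s[n][w] ~ S(w, n; Z, C): probability of w runs through inning Z given
              n runs through inning C.
    Returns a list whose entry m is Smear(m; rs, Z), for m in 0..max_runs.
    """
    runs = range(max_runs + 1)
    # Backward step: posterior over the score at the cutoff, given rs at Z.
    joint = [r_c.get(n, 0.0) * s.get(n, {}).get(rs, 0.0) for n in runs]
    norm = sum(joint)
    if norm == 0.0:
        raise ValueError("rs has zero probability under the supplied R and S")
    posterior = [j / norm for j in joint]
    # Forward step: propagate every possible cutoff score out to inning Z.
    return [sum(posterior[n] * s.get(n, {}).get(m, 0.0) for n in runs)
            for m in runs]


# Toy distributions (illustrative only).
r_c = {0: 0.5, 1: 0.5}
s = {0: {0: 0.6, 1: 0.4}, 1: {1: 0.5, 2: 0.5}}
dist = smear(1, r_c, s, max_runs=3)
```

With these toy inputs a team that actually scored 1 run is credited with some chance of having scored 0 or 2 instead, and the resulting probabilities sum to 1.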
Let's return to the second term of Equation 3, WinProb2(RS,Z). Remember
this is the probability that a team that scores RS runs in Z innings will
win the game with average pitching. There are two elements to consider.
First, for any number of runs scored, we estimate the probability that a team
scoring that many runs at the conclusion of the Zth inning will win the game
with league average pitching. Second, as above, we smear the runs scored
probabilities based upon RS and the "could have been" smearing probabilities.
Thus, we have:
[Eq.6]   WinProb2(RS,Z) = Σ_m Smear(m;RS,Z) * AWin(m;Z)
where the summation is over m, the run support possibilities (m goes from
0 to 25, say), and AWin is the probability that a team that scores a specific
number of runs in Z innings will win the game with league average pitching.
Equation 5 previously gave the formula for Smear(•), so all we need
is AWin(•).
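Equation 6 uses the same Smear weights as Equation 4 but looks up AWin by runs scored instead of DWin by deficit. A minimal sketch, with a hypothetical function name and invented numbers:

```python
from typing import Sequence


def win_prob2(smear_row: Sequence[float], awin: Sequence[float]) -> float:
    """Eq. 6: sum over m of Smear(m; RS, Z) * AWin(m; Z).

    smear_row[m] is Smear(m; RS, Z); awin[m] is the probability of winning
    with m runs through inning Z and league average pitching.
    """
    return sum(p * awin[m] for m, p in enumerate(smear_row))


# Toy inputs (illustrative only, not empirical).
smear_row = [0.1, 0.3, 0.4, 0.2]      # support of 0..3 runs
awin = [0.05, 0.20, 0.40, 0.60]       # win probability by runs scored
average_pitching = win_prob2(smear_row, awin)
```

Because the pitcher's RA never enters this function, the result depends only on his run support and the inning, which is what makes it a baseline for average pitching.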
AWin is estimated empirically for each inning Z (1 to 9) using all the inning-by-inning
scoring data from all the games played in the league-season under study.[8]
Note that I smooth the AWin probabilities to remove any effects of small samples
or weird games. That is, I ensure that the probability of winning increases
if more runs are scored (holding the inning constant). I also ensure that
the win probability when scoring any number of runs at the conclusion of Z
innings decreases with Z (holding the number of runs constant).
The remaining term of Equation 3 is the Park Adder. There is insufficient
data to estimate a separate park adder for every possible park factor and
every possible number of runs scored. I therefore pool the park data of each
league-season into three groups: hitters' parks, neutral parks, and pitchers'
parks. This allows me to estimate the park adder as a percentage of the park
factor of the home park for each of the possible runs scored figures (say
from 0 to 25). I then smooth these park adders to remove any effect of small
samples or weird games.
An example will help here. Suppose that I find that the home park affects
the probability of winning a game when scoring exactly 5 runs by .009 per
percentage point of the home park's park factor. For concreteness, let's
use the Oakland Coliseum in 2000, which had a Total Baseball Park Factor of
97. Because only half of a team's games are played at home, this implies
that runs were 6% less prevalent in games at the Oakland Coliseum compared
to a league neutral park. Multiplying the 6 by the .009 yields an estimate
that a team scoring 5 runs at Oakland Coliseum had a .054 higher chance of
winning the game than a team that scored 5 runs in a league neutral park.
The last step in deriving the Park Adder is to pro-rate the change in win
probability by the number of innings the pitcher pitched in the game. Algebraically,
then, we have:
[Eq.7]   ParkAdder(RS,Z,PF) = PAddPct(RS) * (2*(100-PF)) * (Z/9)^+
where PAddPct(RS) is the change in the probability of winning with RS runs
per percentage point of park effect (in the above example this was the .009).
The middle term reflects the effect of the ballpark on runs scored and the
fact that only half a team's games are played at home (in the example above
PF was 97, so that the middle term is 6). (Z/9)^+ is Z/9 capped at 1 to
properly handle partial games and pitchers who pitch more than 9 innings.
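Equation 7 is short enough to check by hand. The sketch below implements it exactly as written and reproduces the Oakland example; the function and argument names are hypothetical, and the .009 is the supposed value from the example, not a measured one.

```python
def park_adder(p_add_pct: float, pf: float, z: float) -> float:
    """Eq. 7: PAddPct(RS) * (2 * (100 - PF)) * (Z/9 capped at 1)."""
    innings_share = min(z / 9.0, 1.0)
    return p_add_pct * (2.0 * (100.0 - pf)) * innings_share


# Oakland Coliseum, 2000: PF = 97, with the example's .009 for 5 runs scored.
full_game = park_adder(0.009, 97, 9)     # .009 * 6 * 1
six_innings = park_adder(0.009, 97, 6)   # .009 * 6 * (6/9)
extra_innings = park_adder(0.009, 97, 10)
```

A complete game gives the full .054 from the example, a 6-inning outing gives two thirds of it (.036), and a 10-inning outing is capped at the complete-game value.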
Finally, we have completed our formulaic journey. Equations 3-7 provide
the formulas underlying the Win Values system. Since formulas can be rather
dry and imposing, I next turn to giving numerical examples of all of the terms
appearing in these formulas.