The Snow Index Project, Part 2
The development of a new statistic, step-by-step.
A 162-game season begins this week. Each team will play through more than 10,000 at-bats between offense and defense, enough that one would think any “sample size” problems would sort themselves out, yet every year we seem to find a team or two commonly believed to be “lucky.”
I know this territory isn’t exactly uncharted, and I’m certain others smarter than me have taken a swing at it. I’m also aware that if anyone properly quantified luck in this situation, they could probably make themselves quite wealthy in Vegas. I don’t think I’m in a position to write the “ultimate quantification of luck in baseball,” but I would like to offer my own take on it, the RPSR.
The actual formula I’m starting with is rather simple:
(Runs - Home Runs) / (Hits + Walks - Home Runs)
I’m calling this formula the Run Production Success Rate, or RPSR. Simply, it measures how often a runner who gets on base, but doesn’t drive himself in, gets driven in. Several factors could make this number go up or down, such as:
Extra base hitting: Home runs are eliminated from the equation in the beginning, but doubles and triples are still in. A runner who starts his trek towards the plate from second or third has a better chance of getting there.
Speed: Cecil Fielder would be a little harder to score from first than Carl Crawford. Hell, in a race from first, Cecil Fielder might cross home plate behind Rocco Baldelli on crutches. Base stealing is an obvious and quantifiable factor; the ability to take the extra base or stay out of the double play is a little harder to lock down.
Power: On a team like the 2004 White Sox, which hit a bunch of home runs, getting on base at the top of the order was sometimes all that was necessary, as it gave Lee, Ordonez and Co. the chance to mash non-solo home runs.
Lineup balance: Tony Womack might hit .300 in the 9 hole for the Yankees this season. The Royals may have three regulars with OBP under .300. (3 players had better than 200 PA and OBP that low last season.) So when the middle of the Yankees lineup gets on base, there’s a chance the bottom of the order won’t kill the rally. When Mike Sweeney got on base for the Royals, the odds of him being driven in were minimal at best.
“Little ball”: Getting a runner on first and sacrificing him into scoring position increases the odds of getting one run, and simultaneously eliminates the chance of getting anyone else on base, thereby affecting both sides of the equation.
Luck: Sometimes the dice just fall on the right side. Bloop singles, errors and poor decisions by your opponent happen, and while capitalizing on them takes some skill, having them happen at the right time can cause more success than a good decision could have.
Extra base hitting is easy to measure, and can be eliminated from the equation as it is in the table below. The extra base hit ratio (XBH) is also pretty simple:
(Doubles + Triples) / (Hits + Walks - Home Runs)
This gives you a percentage of times on base that aren’t home runs, but are extra base hits. If a team has a high XBH ratio, it makes logical sense that they should also have a high RPSR. In the cases where there are large differences between the two, one has to take a look at the other factors listed above. If the difference can’t be explained by one of those factors, it’s possible you have a luck situation on your hands.
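As a rough illustration, the two ratios defined so far can be computed from season totals. This is only a sketch of the formulas as printed above: the function names and the sample team totals are hypothetical and are not taken from any real team.

```python
def rpsr(runs, home_runs, hits, walks):
    """Run Production Success Rate: runs not scored on a home run,
    divided by times on base that were not home runs."""
    return (runs - home_runs) / (hits + walks - home_runs)


def xbh_ratio(doubles, triples, hits, walks, home_runs):
    """Share of non-home-run times on base that were doubles or triples."""
    return (doubles + triples) / (hits + walks - home_runs)


# Hypothetical season totals, chosen only to exercise the formulas.
team = {
    "runs": 800, "home_runs": 200, "hits": 1500,
    "walks": 500, "doubles": 300, "triples": 30,
}

team_rpsr = rpsr(team["runs"], team["home_runs"], team["hits"], team["walks"])
team_xbh = xbh_ratio(team["doubles"], team["triples"],
                     team["hits"], team["walks"], team["home_runs"])

print(f"RPSR: {team_rpsr:.3f}")
print(f"XBH Ratio: {team_xbh:.4f}")
```

Both ratios share the same denominator, which is what makes comparing them team by team straightforward.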
The first number in the table below is RPSR; the second is XBH Ratio. The next two numbers are the respective team rankings in each statistic. As mentioned above, a team with a high number in one column should have a similar number in the other column. “Diff” is the XBH rank minus the RPSR rank: a positive number means a team did better than its XBH Ratio would imply, and a negative number implies underachievement.
Team | RPSR | XBH Ratio | RPSR Rank | XBH Rank | Diff | Comments
ANA | 0.356 | 0.2144 | 3 | 28 | 25 | Above-average team speed (6 players with 15+ SB) certainly helps, as does a team strategy which seems to favor little ball.
CHW | 0.358 | 0.2446 | 2 | 21 | 19 | Tied for AL lead in HR; Aaron Rowand and Willie Harris both show good speed and R/on base ratios.
BAL | 0.341 | 0.2332 | 7 | 26 | 19 | Led AL in hits, but appear to have gotten a disproportionate amount of luck driving runs home as well.
NYY | 0.343 | 0.2425 | 5 | 23 | 18 | $200 million payroll, balanced lineup with no dead spots means fewer runners left on base at the bottom of the order.
KCR | 0.327 | 0.2262 | 13 | 27 | 14 | Extraordinarily weak bottom of lineup means most scoring opportunities come from the top of the order, where the chance of scoring is better.
DET | 0.339 | 0.2541 | 9 | 17 | 8 | Led AL in triples, but that’s not enough to explain this big of a gap.
STL | 0.341 | 0.2579 | 6 | 13 | 7 | Biggest NL overachiever, but good team speed and strong lineup throughout make them only a small surprise.
LAD | 0.313 | 0.2053 | 23 | 30 | 7 | Even with good team speed and some good luck, the Dodgers only scored 761 runs, 9th of 16 in the NL.
FLA | 0.317 | 0.2363 | 20 | 25 | 5 | Speed is the obvious reason they’re above break-even, but Juan Pierre’s 24 CS may be part of the reason they aren’t farther above the line.
SDP | 0.323 | 0.2431 | 18 | 22 | 4 | While unspectacular, lineup was solid throughout with no black holes, causing a production boost.
PIT | 0.316 | 0.2379 | 21 | 24 | 3 | Despite receiving some luck, still only scored 680 runs. To get to 85 wins with this lineup and defense, they’d need a team ERA of 3.70.
SEA | 0.296 | 0.2102 | 26 | 29 | 3 | Similar to Pittsburgh. Overachieving makes this lineup only slightly less pathetic.
TBD | 0.327 | 0.2549 | 14 | 16 | 2 | Carl Crawford scored 104 runs despite just 210 times on base. Not too many hitters can match that.
TEX | 0.359 | 0.2822 | 1 | 2 | 1 | Ten regulars with 10+ HR, lots of doubles, and a lot of runs driven in.
PHI | 0.323 | 0.2527 | 17 | 18 | 1 | Scored about as often as they deserved to, but didn’t provide as many opportunities as they should have.
TOR | 0.318 | 0.2506 | 19 | 19 | 0 | Twelfth in AL in team OBP and slugging, average luck.
HOU | 0.331 | 0.2596 | 11 | 10 | -1 | Below-average speed without Beltran, average power, average result.
ATL | 0.327 | 0.2574 | 15 | 14 | -1 | Unspectacular again, but solid across the board.
MIN | 0.324 | 0.2563 | 16 | 15 | -1 | If pitching and defense are what wins games in October, why aren’t we celebrating the four-time defending champion Twins?
BOS | 0.355 | 0.2861 | 4 | 1 | -3 | Honestly, can you really complain that much about this offense?
CLE | 0.339 | 0.2708 | 8 | 4 | -4 | Some speed, but didn’t score as often as would be expected from a team that led the AL in doubles.
SFG | 0.330 | 0.2635 | 12 | 8 | -4 | Barry Bonds got on base 367 times by himself, 45 times via a home run. In those other 322 chances, he was driven in just 26% of the time.
COL | 0.333 | 0.2746 | 10 | 3 | -7 | Plenty of extra base hits, but no speed and 1153 K’s left lots of runners on, too.
MON | 0.284 | 0.2504 | 28 | 20 | -8 | Brad Wilkerson scored 112 runs last year; second was Endy Chavez with 65. Thirteenth or worse in every offensive category. And it’s possible they should have been worse.
OAK | 0.308 | 0.2588 | 25 | 12 | -13 | Well balanced, solid at every position but overachieving nowhere.
CHC | 0.314 | 0.2647 | 22 | 7 | -15 | Seemed to find a new way to make outs on the basepaths every day.
NYM | 0.293 | 0.2594 | 27 | 11 | -16 | Seeing the 2004 Mets listed as underachievers should come as a shock to no one.
CIN | 0.311 | 0.2656 | 24 | 6 | -18 | Narrowly avoided more strikeouts (1335) than hits (1380). It’s hard to move runners over like that.
ARI | 0.281 | 0.2630 | 30 | 9 | -21 | .310 OBP assured that when runners got on, ensuing hitters quickly made outs to ensure they’d get stranded.
MIL | 0.283 | 0.2674 | 29 | 5 | -24 | Ahead of only Arizona in runs scored, last in batting average, and second to last in home runs, but Lyle Overbay can really crack a double.
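The rank and difference columns can be reproduced with a few lines of code. The sketch below is an assumption about how the columns were built (rank each statistic from highest to lowest, then subtract the RPSR rank from the XBH rank), which matches every row of the table. It ranks only four teams, so the ranks it prints are positions within that small subset, not the league-wide ranks shown above.

```python
def rank_desc(values):
    """Map each team to its 1-based rank, highest value first.
    Ties keep their input order; the table presumably broke ties
    on unrounded values, which are not available here."""
    ordered = sorted(values, key=values.get, reverse=True)
    return {team: position + 1 for position, team in enumerate(ordered)}


# RPSR and XBH Ratio for four teams, copied from the table.
rpsr_by_team = {"TEX": 0.359, "BOS": 0.355, "ANA": 0.356, "MIL": 0.283}
xbh_by_team = {"TEX": 0.2822, "BOS": 0.2861, "ANA": 0.2144, "MIL": 0.2674}

rpsr_rank = rank_desc(rpsr_by_team)
xbh_rank = rank_desc(xbh_by_team)

# Positive: scored runners more often than the XBH Ratio alone would suggest.
diff = {team: xbh_rank[team] - rpsr_rank[team] for team in rpsr_by_team}

for team in sorted(diff, key=diff.get, reverse=True):
    print(team, rpsr_rank[team], xbh_rank[team], diff[team])
```

Even within this subset the pattern from the full table holds: Anaheim lands on the positive side and Milwaukee and Boston on the negative side.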
Reader Comments and Retorts
Statements posted here are those of our readers and do not represent the BaseballThinkFactory. Names are provided by the poster and are not verified. We ask that posters follow our submission policy. Please report any inappropriate comments.
1. VeteranPresence.com Posted: April 12, 2005 at 03:44 AM (#1249707)
So what you're telling me is...put the lime in the coconut?
Could you explain that adjustment?
Singles and home runs will help drive runners in, yes, but that wasn't why I separated out doubles and triples. I separated them out because a runner who gets on base by a double or a triple is more likely to score, and therefore a team's RPSR would rise in that situation. So if you double/triple a lot, your RPSR would rise. Home runs were taken out there because they've been removed from the entire equation in the beginning.
In response to post 6:
The concept I used for this one is pretty simple. The 2004 league RPSR was .324, and the league XBH Ratio was .252. .324/.252 = 1.2871. So while a team's actual runs scored are shown by this formula:
RPSR*(times on base-HR)+HR
A team's runs scored, if extra base hitting was the only factor involved, would in fact be shown by this one:
(XBH*1.2871)*(times on base-HR)+HR
This obviously isn't the best method to predict a team's success, and just because a team doesn't match this result doesn't mean they're lucky/unlucky, but it does remove the factor of extra base hitting, so you can look at the result and try to determine why the actual result happened. For example, by this system, the Yankees scored 58 runs more than they should have in 2004, which I'm attributing to a lineup with fewer dead spots than most teams have.
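The two run formulas in this comment can be sketched in code. The league factor of 1.2871 is the value quoted in the comment; the function names and the sample inputs are hypothetical and do not describe any real team.

```python
# Value quoted in the comment. Dividing the rounded league figures
# (.324 / .252) gives roughly 1.2857, so the quoted factor was presumably
# computed from unrounded inputs; that is an assumption.
LEAGUE_FACTOR = 1.2871


def actual_runs(rpsr, times_on_base, home_runs):
    """Runs scored, rebuilt from RPSR: RPSR*(times on base-HR)+HR."""
    return rpsr * (times_on_base - home_runs) + home_runs


def xbh_only_runs(xbh_ratio, times_on_base, home_runs, factor=LEAGUE_FACTOR):
    """Runs if extra base hitting were the only factor:
    (XBH*factor)*(times on base-HR)+HR."""
    return (xbh_ratio * factor) * (times_on_base - home_runs) + home_runs


# Hypothetical team: 2100 times on base, 200 of them home runs.
scored = actual_runs(0.340, 2100, 200)
expected = xbh_only_runs(0.250, 2100, 200)
surplus = scored - expected

print(f"scored {scored:.0f}, XBH-only estimate {expected:.0f}, "
      f"surplus {surplus:.0f}")
```

A positive surplus corresponds to a positive "Diff" in the article's table: the team scored more than its extra base hitting alone would predict.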
To take the final step, I used the Pythagorean formula to determine what each team's win percentage would be if their runs allowed had remained the same, but their runs scored had been adjusted. It's worth noting that while my formula says the Yankees only should have won 84 games, their Pythagorean record from last season (before this change) was 89-73. So the change, from my perspective, was only 5 games.
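For readers who want to follow that last step, here is a minimal sketch of a Pythagorean win estimate. The comment does not say which exponent was used; the exponent of 2 below is the classic Bill James form and is an assumption, as are the function names and the sample run totals.

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Expected winning percentage from runs scored and runs allowed.
    exponent=2.0 is the classic form; the comment's choice is unknown."""
    scored = runs_scored ** exponent
    allowed = runs_allowed ** exponent
    return scored / (scored + allowed)


def expected_wins(runs_scored, runs_allowed, games=162, exponent=2.0):
    """Expected wins over a season of the given length."""
    return games * pythagorean_win_pct(runs_scored, runs_allowed, exponent)


# Hypothetical team: same runs allowed, runs scored before and after
# an adjustment like the one described above.
runs_allowed = 800
before = expected_wins(880, runs_allowed)
after = expected_wins(830, runs_allowed)

print(f"before: {before:.1f} wins, after: {after:.1f} wins")
```

Holding runs allowed fixed isolates the effect of the offensive adjustment, which is the comparison the comment describes.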
If a team has a high XBH ratio, it makes logical sense that they should also have a high RPSR. In the cases where there are large differences between the two, one has to take a look at the other factors listed above. If the difference can’t be explained by one of those factors, it’s possible you have a luck situation on your hands.
I didn't say doubles were the only factor involved, in fact I listed most, if not all the factors involved. I eliminated extra base hitting from the equation because it's the easiest to quantify. Then, from there, you can look at the numbers again, with one factor removed.
Primates in your last thread offered quite a bit of constructive commentary on the work you had done. They were fairly negative (i.e. we generally said "You aren't doing this well, here are things to look at in order to learn how to do this better") and a bit patronizing. You completely ignored the substance of those comments. I don't really understand why you are posting this work if you are not looking for feedback on it (even if that feedback is negative).
I'm being a bit of an ####### here, but this sort of stuff (bad, yet pretentious) hits my buttons.
If you find my work pretentious, I'm sorry, but if you're going to hit me for work I haven't even written yet, I'd encourage you to wait.
My take on the difference is pretty simply that pitchers kill rallies (lowering RPSR), but they don't have much effect on XBH ratio. If that's the case, it backs Nick S's observation that XBH and RPSR aren't well correlated, and indicates that the difference is largely due to lineup composition and doesn't measure luck.
Now, it's possible, then, that the difference in rankings that you're observing is interesting and does tell us something useful, but I don't think what it's telling us has been well-defined, and I certainly don't think you've eliminated much of anything nor made strides in isolating luck.
Your stats do show chances of driving in runs with singles and walks, but because HRs are so common and such an important part of run producing up and down the lineup, I'm not sure what the use of this stat is. Still, even though these stats aren't perfect, I found your article interesting, so thanks.
There are a lot more factors to be considered, though, before you are left with "luck" or random fluctuations. Eg -- sacrifice fly ability, bunting ability, baserunning ability esp stealing, etc. True Primates could probably help you out here. I hope they do.
Willie Harris and Aaron Rowand stole 19 and 17 bases, respectively, and were caught just 12 times combined, for a gain of 14 bases. This high of a success rate would imply a) good speed, and b) only going when you're sure you'll make it.
My bet is, though, that if they hadn't had 5 20+ HR hitters behind them (and Frank Thomas), they'd have been running more. A fast player on a home run hitting team will run less to prevent outs, because his chances of being driven in are better than average. Their odds of being able to get from first to third on a single, though, are also better than average. So I'm reluctant to measure speed as the opposite of power. I'm also reluctant to measure speed purely by stolen bases and success rate for much the same reason.
That's a 75% success rate, which is actually about break even. Someone has posted break even stealing percentages, broken down by no. of outs and whether stealing 2nd or 3rd. I think the range is from about 69% to 89% if stealing 3rd with 2 outs. Therefore, Harris and Rowand probably would have scored about as often if they hadn't tried to steal. They scored more often in the 36 times they advanced successfully, but they lost some scoring opportunities by being caught 12 times.
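The arithmetic behind that comment can be sketched as follows. The success rate uses the 36 steals and 12 times caught quoted above. The break-even formula follows from setting the expected gain to zero; the run values passed to it are hypothetical placeholders chosen for illustration, not measured run values.

```python
def steal_success_rate(stolen, caught):
    """Fraction of steal attempts that succeeded."""
    return stolen / (stolen + caught)


def break_even_rate(run_gain_per_steal, run_cost_per_caught):
    """Success rate p at which p*gain - (1-p)*cost equals zero."""
    return run_cost_per_caught / (run_gain_per_steal + run_cost_per_caught)


# Harris and Rowand combined: 19 + 17 steals, caught 12 times.
rate = steal_success_rate(19 + 17, 12)

# Hypothetical run values: a caught stealing costing three times
# what a successful steal gains.
threshold = break_even_rate(run_gain_per_steal=0.2, run_cost_per_caught=0.6)

print(f"success rate {rate:.0%}, break-even {threshold:.0%}")
```

The break-even point depends only on the ratio of cost to gain, which is why it shifts with the number of outs and the base being stolen.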
So I'm reluctant to measure speed as the opposite of power. I'm also reluctant to measure speed purely by stolen bases and success rate for much the same reason.
You can measure speed in any way you want, for example, number of times advancing first to third on a single, as you said, and call it baserunning ability (I would include # of SB and success rate also), but you need to add these types of factors into the equation, because you need to control for these factors. You've found a difference between teams, and based on the comments you made you seem to understand some of the reasons for the differences. But because your comments are observation based and not statistics based, the stat in its present form doesn't allow us to draw any meaningful conclusions or allow us to utilize it in any meaningful way.
A fast player on a home run hitting team will run less to prevent outs, because his chances of being driven in are better than average
Not sure about that. Be careful about such statements until you research them. A home run hitting team may not have a higher chance of driving someone in, depending on many other factors such as batting average, OBP, number of strikeouts, etc. And a manager's preference for hit and run, sacrifices (rare now, I know) versus stealing also plays a role.
Got to go. Good luck.
It's well documented how we react to traditionalists who try to create bogus stats, like Productive Out Percentage. Someone out there is watching us now to see what we'll do with an equally bogus and pointless stat being created, except from the other direction.
And if this is for real, which I'm seriously skeptical about, then what exactly is the point? The Snow Index Part I attempted to reinvent Linear Weights, and screwed up unbelievably badly while doing so.
Now we have the Snow Index Part 2, theorizing about run scoring efficiency, and not really providing any insight.
Mr. Snow, please don't take this personally. The issues that I have with this research are fairly simple:
1. You are covering pretty well charted ground. It's not that this sort of thing isn't worth trying in a vacuum, but what does this teach us that we don't already know?
Bill James has complained about the number of new stats being created to measure things which we've already figured out, for little purpose. I never gave his comments much weight in this regard, but I'm rapidly reconsidering things.
2. In covering this well charted territory, you're making some pretty serious errors based on assumptions that you take as fact. Furthermore, you seem to be spending your time running correlations on the wrong things.
3. Don't name a stat after yourself.
Mr. Snow, have you considered controlling for the number of outs at the time the runner gets on base? That would seem like a huge factor that affects whether the runner scores or not. Not controlling for it would mean that it's somehow included in the RPSR, introducing biases into the "luck" component you are trying to capture.
The correlation between RPSR and XBH Ratio is .21. It's not clear to me why the latter term becomes the gold standard for a new statistic to measure up to. That said, there's very little meaningful variance shared by the two.
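A correlation like this can be checked against the published table. The sketch below computes a plain Pearson correlation from the rounded RPSR and XBH Ratio values in the article's table. Because those values are rounded, the result may not match the .21 quoted above exactly, so no expected output is claimed here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# (RPSR, XBH Ratio) per team, copied from the article's table.
TABLE = {
    "ANA": (0.356, 0.2144), "CHW": (0.358, 0.2446), "BAL": (0.341, 0.2332),
    "NYY": (0.343, 0.2425), "KCR": (0.327, 0.2262), "DET": (0.339, 0.2541),
    "STL": (0.341, 0.2579), "LAD": (0.313, 0.2053), "FLA": (0.317, 0.2363),
    "SDP": (0.323, 0.2431), "PIT": (0.316, 0.2379), "SEA": (0.296, 0.2102),
    "TBD": (0.327, 0.2549), "TEX": (0.359, 0.2822), "PHI": (0.323, 0.2527),
    "TOR": (0.318, 0.2506), "HOU": (0.331, 0.2596), "ATL": (0.327, 0.2574),
    "MIN": (0.324, 0.2563), "BOS": (0.355, 0.2861), "CLE": (0.339, 0.2708),
    "SFG": (0.330, 0.2635), "COL": (0.333, 0.2746), "MON": (0.284, 0.2504),
    "OAK": (0.308, 0.2588), "CHC": (0.314, 0.2647), "NYM": (0.293, 0.2594),
    "CIN": (0.311, 0.2656), "ARI": (0.281, 0.2630), "MIL": (0.283, 0.2674),
}

rpsr_values = [rpsr for rpsr, _ in TABLE.values()]
xbh_values = [xbh for _, xbh in TABLE.values()]

r = pearson(rpsr_values, xbh_values)
print(f"correlation between RPSR and XBH Ratio: {r:.2f}")
```

Squaring the coefficient gives the share of variance the two statistics have in common, which is the quantity the comment is pointing at when it says they share little meaningful variance.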
First and foremost, I could say "Mr. Snow is my father," but that's not even true. Please, call me KL.
Sometimes I waste time running correlations on something where there's no correlation at all. But you usually don't see that in my work. And, I guess, in this process, a "serious error," by my definition, would be something that causes the ceiling in my apartment to cave in, and while it is doing that, I'm pretty sure it's not because of this project. Everything else I do I define as exploring a possibility. I don't expect most of these possibilities to be accepted, and in fact neither of the published ones have been. But largely, when I put a prospect out there, I'm not putting it out there as "This is the way it is." I'm putting it out there as "Here's what I'm thinking, would anyone care to offer an alternative suggestion?" My first article was buried under alternative suggestions, and I appreciated that. I spent most of a week digging through them and deciding what I could use. And you'll see the difference when I get back to linear weights.
In response to post 18: I guess, in the article, I spent a bit too much time playing up the "what would happen if RPSR was replaced by XBH Ratio" argument, such that people now think I'm putting too much weight on it. I'm not arguing that this is the only factor, I'm simply arguing that it's an easy factor to eliminate. And the concept I'm working on now to eliminate it will probably look considerably better than the one you just read. Unless, of course, someone provides me a better suggestion along the way.
More work on any topic is good (marketplace of ideas) but the value of this work to the greater community seems pretty low. K.L. Snow may have value as he may be learning something from this (both from the process of thinking about it and doing it and from the comments he's receiving). I agree that more references to existing work would be helpful.
I licky boom boom down!
I've been playing with baseball stats for 20 years, and have invented many dozens of stats, all of which seemed logical at the time. Really, only one of those stats has stood the test of time and is still considered the state of the art by the best analysts.
IOW, the end result is all that counts....