Baseball for the Thinking Fan


Primate Studies
— Where BTF's Members Investigate the Grand Old Game

Tuesday, April 12, 2005

The Snow Index Project, Part 2

The development of a new statistic, step-by-step.

A 162-game season begins this week. Each team will play out more than 10,000 at-bats between offense and defense, enough that one would think any “sample size” problems would sift themselves out, yet every year we seem to find a team or two commonly believed to be “lucky.”

I know this territory isn’t exactly uncharted, and I’m certain others smarter than me have taken a swing at it. I’m also aware that if anyone properly quantified luck in this situation, they could probably make themselves quite wealthy in Vegas. I don’t think I’m in a position to write the “ultimate quantification of luck in baseball,” but I would like to offer my own take on it, the RPSR.

The actual formula I’m starting with is rather simple:

    RPSR = (Runs - Home Runs) / (Hits + Walks - Home Runs)

I’m calling this formula the Run Production Success Rate, or RPSR. Simply, it measures how often a runner who gets on base, but doesn’t drive himself in, gets driven in. Several factors could make this number go up or down, such as:

Extra base hitting: Home runs are eliminated from the equation in the beginning, but doubles and triples are still in. A runner who starts his trek towards the plate from second or third has a better chance of getting there.

Speed: Cecil Fielder would be a little harder to score from first than Carl Crawford. Hell, in a race from first, Cecil Fielder might cross home plate behind Rocco Baldelli on crutches. Base stealing is an obvious and quantifiable factor; the ability to take the extra base or stay out of the double play is a little harder to lock down.

Power: On a team like the 2004 White Sox, which hit a bunch of home runs, getting on base at the top of the order was sometimes all that was necessary, as it gave Lee, Ordonez and Co. the chance to mash non-solo home runs.

Lineup balance: Tony Womack might hit .300 in the 9 hole for the Yankees this season. The Royals may have three regulars with OBP under .300. (3 players had better than 200 PA and OBP that low last season.) So when the middle of the Yankees lineup gets on base, there’s a chance the bottom of the order won’t kill the rally. When Mike Sweeney got on base for the Royals, the odds of him being driven in were minimal at best.

“Little ball”: Getting a runner on first and sacrificing him into scoring position increases the odds of getting one run, and simultaneously gives away the bunter’s own chance of getting on base, thereby affecting both sides of the equation.

Luck: Sometimes the dice just fall on the right side. Bloop singles, errors and poor decisions by your opponent happen, and while capitalizing on them takes some skill, having them happen at the right time can cause more success than a good decision could have.

Extra base hitting is easy to measure, and can be eliminated from the equation as it is in the table below. The extra base hit ratio (XBH) is also pretty simple:

    XBH Ratio = (Doubles + Triples) / (Hits + Walks - Home Runs)

This gives you a percentage of times on base that aren’t home runs, but are extra base hits. If a team has a high XBH ratio, it makes logical sense that they should also have a high RPSR. In the cases where there are large differences between the two, one has to take a look at the other factors listed above. If the difference can’t be explained by one of those factors, it’s possible you have a luck situation on your hands.
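To make the two ratios concrete, here is a minimal Python sketch of both formulas. The function names and the team line are hypothetical and made up for illustration; they are not real 2004 figures.

```python
def rpsr(runs, hits, walks, home_runs):
    """Run Production Success Rate: share of non-HR baserunners who score."""
    return (runs - home_runs) / (hits + walks - home_runs)


def xbh_ratio(doubles, triples, hits, walks, home_runs):
    """Share of non-HR times on base that were doubles or triples."""
    return (doubles + triples) / (hits + walks - home_runs)


# Hypothetical team line (illustrative numbers only, not real data).
team = {
    "runs": 800, "hits": 1500, "walks": 500,
    "home_runs": 200, "doubles": 300, "triples": 30,
}

team_rpsr = rpsr(team["runs"], team["hits"], team["walks"], team["home_runs"])
team_xbh = xbh_ratio(team["doubles"], team["triples"],
                     team["hits"], team["walks"], team["home_runs"])

print(f"RPSR = {team_rpsr:.3f}, XBH Ratio = {team_xbh:.3f}")
```

Both ratios share the same denominator, so a team's two numbers can be compared directly against the league's.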

The first number in the table below is RPSR, the second is XBH Ratio. The next two numbers are the respective team rankings in each statistic. As mentioned above, a team with a high number in one column should have a similar number in the other column. A positive number under “Difference” means a team did better than their XBH Ratio would imply. A negative implies underachievement.




[Table columns, per the description above: RPSR, XBH Ratio, RPSR Rank, XBH Rank, Difference, Notes. The team names and figures did not survive in this copy; the notes for each team, in their original order, follow.]

Above average team speed (6 players with 15+ SB) certainly helps, as does a team strategy which seems to favor little ball.

Tied for AL lead in HR; Aaron Rowand and Willie Harris both show good speed and R/on base ratios.

Led AL in hits, but appear to have gotten a disproportionate amount of luck driving runs home as well.

$200 million payroll, balanced lineup with no dead spots means fewer runners left on base at the bottom of the order.

Extraordinarily weak bottom of lineup means most scoring opportunities come from the top of the order, where the chance of scoring is better.

Led AL in triples, but that’s not enough to explain this big of a gap.

Biggest NL overachiever, but good team speed and strong lineup throughout make them only a small surprise.

Even with good team speed and some good luck, the Dodgers only scored 761 runs, 9th of 16 in the NL.

Speed is the obvious reason they’re above break-even, but Juan Pierre’s 24 CS may be part of the reason they aren’t farther above the line.

While unspectacular, lineup was solid throughout with no black holes, causing a production boost.

Despite receiving some luck, still only scored 680 runs. To get to 85 wins with this lineup and defense, they’d need a team ERA of 3.70.

Similar to Pittsburgh. Overachieving makes this lineup only slightly less pathetic.

Carl Crawford scored 104 runs despite just 210 times on base. Not too many hitters can match that.

Ten regulars with 10+ HR, lots of doubles, and a lot of runs driven in.

Scored about as often as they deserved to, but didn’t provide as many opportunities as they should have.

Twelfth in AL in team OBP and slugging, average luck.

Below average speed without Beltran, average power, average result.

Unspectacular again, but solid across the board.

If pitching and defense are what wins games in October, why aren’t we celebrating the 4-time defending champion Twins?

Honestly, can you really complain that much about this offense?

Some speed, but didn’t score as often as would be expected from a team that led the AL in doubles.

Barry Bonds got on base 367 times by himself, 45 times via a home run. In those other 322 chances, he was driven in just 26% of the time.

Plenty of extra base hits, but no speed and 1153 K’s left lots of runners on, too.

Brad Wilkerson scored 112 runs last year, second was Endy Chavez with 65. Thirteenth or worse in every offensive category. And it’s possible they should have been worse.

Well balanced, solid at every position but overachieving nowhere.

Seemed to find a new way to make outs on the basepaths every day.

Seeing the 2004 Mets listed as underachievers should come as a shock to no one.

Narrowly avoided more strikeouts (1335) than hits (1380). It’s hard to move runners over like that.

.310 OBP assured that when runners got on, ensuing hitters quickly made outs to ensure they’d get stranded.

Ahead of only Arizona in runs scored, last in batting average, and second to last in home runs, but Lyle Overbay can really crack a double.





Admittedly, not every large differential is based on luck; many if not most of them have logical reasons. However, it is worth wondering: what would happen if the other factors went away?

Let’s assume for a second an ideal world, where RPSR is based directly on a team’s XBH Ratio. This would eliminate speed and lineup balance from the equation, but would also take away luck. Using the Pythagorean formula and adjusting teams’ runs scored for this new system, here’s what the 2004 standings would’ve looked like:

AL East

Team              W    L   GB  Change
Boston           99   63    -       1
NY Yankees       84   78   15     -17
Baltimore        75   87   24      -3
Toronto          72   90   27       5
Tampa Bay        69   93   30      -1

AL Central

Team              W    L   GB  Change
Minnesota        88   74    -      -4
Cleveland        83   79    5       3
Detroit          78   84   10       6
Chicago          77   85   11      -6
Kansas City      58  104   30       0

AL West

Team              W    L   GB  Change
Oakland          90   72    -      -1
Texas            88   74    2      -1
Anaheim          76   86   14     -16
Seattle          64   98   26       1

NL East

Team              W    L   GB  Change
Atlanta          95   67    -      -1
Philadelphia     87   74    8       1
NY Mets          83   78   12      12
Florida          80   82   15      -3
Montreal         74   88   21       7

NL Central

Team              W    L   GB  Change
St Louis         98   64    -      -7
Chicago          98   64    -       8
Houston          92   70    6       0
Milwaukee        79   83   19      12
Pittsburgh       72   90   26       0
Cincinnati       72   90   26      -4

NL West

Team              W    L   GB  Change
San Francisco    90   72    -      -1
San Diego        85   77    5      -2
Los Angeles      80   82   10     -13
Colorado         77   85   13       8
Arizona          64   98   26      13


Kyle Lobner Posted: April 12, 2005 at 01:19 AM | 27 comment(s)

Reader Comments and Retorts


Statements posted here are those of our readers and do not represent the BaseballThinkFactory. Names are provided by the poster and are not verified. We ask that posters follow our submission policy. Please report any inappropriate comments.

   1. Posted: April 12, 2005 at 03:44 AM (#1249707)
   2. Kyle Lobner Posted: April 12, 2005 at 03:53 AM (#1249727)
Well, y'know, thanks for being constructive.
   3. StopBooingAbreu Posted: April 12, 2005 at 04:07 AM (#1249747)
Wait, Wait, Wait...

So what you're telling me is...put the lime in the coconut?
   4. Posted: April 12, 2005 at 04:23 AM (#1249765)
I was sent by a messenger named JD. Unlike the JD you know, though, his knees were a lot stronger. As is his radio voice.
   5. dcsmyth1 Posted: April 12, 2005 at 11:13 AM (#1250262)
Why are you focusing on 2b+3b? Singles and homers will help score baserunners, too.
   6. xdog Posted: April 12, 2005 at 01:11 PM (#1250311)
"Using the Pythagorean formula and adjusting teams’ runs scored for this new system"

Could you explain that adjustment?
   7. Kyle Lobner Posted: April 12, 2005 at 02:08 PM (#1250383)
In response to post 5:

Singles and home runs will help drive runners in, yes, but that wasn't why I separated out doubles and triples. I separated them out because a runner who gets on base by a double or a triple is more likely to score, and therefore a team's RPSR would rise in that situation. So if you double/triple a lot, your RPSR would rise. Home runs were taken out there because they've been removed from the entire equation in the beginning.

In response to post 6:

The concept I used for this one is pretty simple. The 2004 league RPSR was .324, and the league XBH Ratio was .252. .324/.252 = 1.2871. So while a team's actual runs scored are shown by this formula:

RPSR*(times on base-HR)+HR

A team's runs scored, if extra base hitting was the only factor involved, would in fact be shown by this one:

(XBH*1.2871)*(times on base-HR)+HR

This obviously isn't the best method to predict a team's success, and just because a team doesn't match this result doesn't mean they're lucky/unlucky, but it does remove the factor of extra base hitting, so you can look at the result and try to determine why the actual result happened. For example, by this system, the Yankees scored 58 runs more than they should have in 2004, which I'm attributing to a lineup with fewer dead spots than most teams have.

To take the final step, I used the Pythagorean formula to determine what each team's win percentage would be if their runs allowed had remained the same, but their runs scored had been adjusted. It's worth noting that while my formula says the Yankees only should have won 84 games, their Pythagorean record from last season (before this change) was 89-73. So the change, from my perspective, was only 5 games.
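For readers who want to follow the arithmetic, the two steps above can be sketched in Python. This is only an illustration under stated assumptions: the function names and the sample team are hypothetical, and the exponent of 2 is the classic form of the Pythagorean formula, which the post does not specify.

```python
LEAGUE_SCALE = 1.2871  # league RPSR / league XBH Ratio, as quoted in the post


def adjusted_runs(xbh_ratio, times_on_base, home_runs, scale=LEAGUE_SCALE):
    """Runs scored if RPSR were exactly scale * XBH Ratio."""
    return (xbh_ratio * scale) * (times_on_base - home_runs) + home_runs


def pythagorean_wins(runs_scored, runs_allowed, games=162, exponent=2.0):
    """Expected wins; exponent=2 is an assumption, not stated in the post."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return games * rs / (rs + ra)


# Hypothetical team (illustrative numbers only, not real 2004 data).
xbh = 0.250           # team XBH Ratio
times_on_base = 2000  # hits + walks
home_runs = 200
runs_allowed = 750

adj = adjusted_runs(xbh, times_on_base, home_runs)
wins = pythagorean_wins(adj, runs_allowed)
print(f"adjusted runs = {adj:.0f}, adjusted wins = {wins:.1f}")
```

Holding runs allowed fixed means any change in the win total comes entirely from the adjusted runs scored.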
   8. Nick S Posted: April 12, 2005 at 03:05 PM (#1250488)
Your first equation (RPSR) is an important factor to look at in terms of run production (along with OBP and HR, that's pretty much all there is.) In trying to explain why teams vary in the stat, though, you should look at the correlations between RPSR and a number of factors (e.g. slugging percentage.) After that you should conclude what RPSR is primarily a function of, as opposed to what you've done here, which is concluding ahead of time that percent of doubles and triples is the primary factor. What you've done is akin to saying "Teams that hit a lot of doubles score a lot of runs, therefore I can factor out luck by normalizing each team's run scored to their doubles rate." It does not make sense.
   9. Kyle Lobner Posted: April 12, 2005 at 03:33 PM (#1250576)
Let me refer you back to this paragraph:

If a team has a high XBH ratio, it makes logical sense that they should also have a high RPSR. In the cases where there are large differences between the two, one has to take a look at the other factors listed above. If the difference can’t be explained by one of those factors, it’s possible you have a luck situation on your hands.

I didn't say doubles were the only factor involved, in fact I listed most, if not all the factors involved. I eliminated extra base hitting from the equation because it's the easiest to quantify. Then, from there, you can look at the numbers again, with one factor removed.
   10. Nick S Posted: April 12, 2005 at 04:11 PM (#1250687)
You haven't "eliminated extra base hitting from the equation". What you have done is guess that XBH ratio correlates well to RPSR. In fact, going by the numbers in your chart the two stats show almost no correlation (i.e. XBH ratio appears to have little predictive value for RPSR.)

Primates in your last thread offered quite a bit of constructive commentary on the work you had done. They were fairly negative (i.e. we generally said "You aren't doing this well, here are things to look at in order to learn how to do this better") and a bit patronizing. You completely ignored the substance of those comments. I don't really understand why you are posting this work if you are not looking for feedback on it (even if that feedback is negative.)

I'm being a bit of an ####### here, but this sort of stuff (bad, yet pretentious) hits my buttons.
   11. Kyle Lobner Posted: April 12, 2005 at 05:26 PM (#1250869)
Admittedly parts 1 and 2 of the project are somewhat unrelated, but part 3 ties them together. I read every comment, took notes on some things, and you'll see a notable difference in my theory on linear weights in Part 3. This article had nothing to do with linear weights, so the past comments didn't change my angle on it.

If you find my work pretentious, I'm sorry, but if you're going to hit me for work I haven't even written yet, I'd encourage you to wait.
   12. Andy Aymeloglu Posted: April 12, 2005 at 07:03 PM (#1251073)
Looking at the chart, it appears AL teams rank near the top, NL teams rank near the bottom. You don't address why this would be, and it looks like what you should probably be doing is splitting the chart in two, and ranking within the league.

My take on the difference is pretty simply that pitchers kill rallies (lowering RPSR), but they don't have much effect on XBH ratio. If that's the case, it backs Nick S's observation that XBH and RPSR aren't well correlated, and indicates that the difference is largely due to lineup composition and doesn't measure luck.

Now, it's possible, then, that the difference in rankings that you're observing is interesting and does tell us something useful, but I don't think what it's telling us has been well-defined, and I certainly don't think you've eliminated much of anything nor made strides in isolating luck.
   13. BTL: Lesser Primate, 4th Class Trainee Posted: April 12, 2005 at 10:42 PM (#1251522)
My least primate brain's take: there is probably (almost certainly) an inverse correlation between no. of HRs hit and speed (easy enough to measure roughly by comparing SB, CS, triples). Teams with more HRs hit will likely have a lower RPSR.
Your stats do show chances of driving in runs with singles and walks, but because HRs are so common and such an important part of run producing up and down the lineup, I'm not sure what the use of this stat is. Still, even though these stats aren't perfect, I found your article interesting, so thanks.
There are a lot more factors to be considered, though, before you are left with "luck" or random fluctuations. Eg -- sacrifice fly ability, bunting ability, baserunning ability esp stealing, etc. True Primates could probably help you out here. I hope they do.
   14. Kyle Lobner Posted: April 12, 2005 at 10:54 PM (#1251551)
Your first point (determining powerful teams and speedy teams) is interesting, and I'll look into it, but here's what I'm scared of:

Willie Harris and Aaron Rowand stole 19 and 17 bases, respectively, and were caught just 12 times combined, for a gain of 14 bases. This high of a success rate would imply a) good speed, and b) only going when you're sure you'll make it.

My bet is, though, that if they hadn't had 5 20+ HR hitters behind them (and Frank Thomas), they'd have been running more. A fast player on a home run hitting team will run less to prevent outs, because his chances of being driven in are better than average. Their odds of being able to get from first to third on a single, though, are also better than average. So I'm reluctant to measure speed as the opposite of power. I'm also reluctant to measure speed purely by stolen bases and success rate for much the same reason.
   15. BTL: Lesser Primate, 4th Class Trainee Posted: April 12, 2005 at 11:14 PM (#1251589)
Willie Harris and Aaron Rowand stole 19 and 17 bases, respectively, and were caught just 12 times combined, for a gain of 14 bases. This high of a success rate would imply a) good speed, and b) only going when you're sure you'll make it.

That's a 75% success rate, which is actually about break even. Someone has posted break even stealing percentages, broken down by no. of outs and whether stealing 2nd or 3rd. I think the range is from about 69% to 89% if stealing 3rd with 2 outs. Therefore, Harris and Rowand probably would have scored about as often if they hadn't tried to steal. They scored more often the 36 times they advanced successfully, but they lost some scoring opportunities by being caught 12 times.

So I'm reluctant to measure speed as the opposite of power. I'm also reluctant to measure speed purely by stolen bases and success rate for much the same reason.

You can measure speed in any way you want, for example, number of times advancing first to third on a single, as you said, and call it baserunning ability (I would include #of SB and success rate also), but you need to add these types of factors into the equation, because you need to control for these factors. You've found a difference between teams, and based on the comments you made you seem to understand some of the reasons for the differences. But because your comments are observation based and not statistics based, the stat in its present form doesn't allow us to draw any meaningful conclusions or allow us to utilize it in any meaningful way.

A fast player on a home run hitting team will run less to prevent outs, because his chances of being driven in are better than average

Not sure about that. Be careful about such statements until you research them. A home run hitting team may not have a higher chance of driving someone in, depending on many other factors such as batting average, OBP, number of strikeouts, etc. And a manager's preference for hit and run, sacrifices (rare now, I know) versus stealing also play a role.

Got to go. Good luck.
   16. bibigon Posted: April 13, 2005 at 06:26 AM (#1252392)
I swear, this whole Snow Index Project thing is part of some psychological study to see how Primates react to things that bother them from the statistical side.

It's well documented how we react to traditionalists who try to create bogus stats, like Productive Out Percentage. Someone out there is watching us now to see what we'll do with an equally bogus and pointless stat being created, except from the other direction.

And if this is for real, which I'm seriously skeptical about, then what exactly is the point? The Snow Index Part I attempted to reinvent Linear Weights, and screwed up unbelievably badly while doing so.

Now we have the Snow Index Part 2, theorizing about run scoring efficiency, and not really providing any insight.

Mr. Snow, please don't take this personally. The issues that I have with this research are fairly simple:

1. You are covering pretty well charted ground. It's not that this sort of thing isn't worth trying in a vacuum, but what does this teach us that we don't already know?

Bill James has complained about the number of new stats being created to measure things which we've already figured out, for little purpose. I never gave his comments much weight in this regard, but I'm rapidly reconsidering things.

2. In covering this well charted territory, you're making some pretty serious errors based on assumptions that you take as fact. Furthermore, you seem to be spending your time running correlations on the wrong things.

3. Don't name a stat after yourself.
   17. Snoopy (#3 All-time in home runs) Posted: April 13, 2005 at 09:48 AM (#1252485)
I thought the point of the Snow Index was to show us the process by which statistics evolve. If Mr. Snow was trying to do something really groundbreaking, I doubt he would be showing us his notes while he was attempting it. After all, someone could steal his ideas. By attempting to reinvent the wheel, so to speak, he can at least pinpoint which direction he wants to go. I think this whole thing is just a heuristic process to show us how most ideas start out as crap and then evolve through refinements into something much better. That's why he's asking for suggestions, right? Having said that, here's a suggestion:

Mr. Snow, have you considered controlling for the number of outs at the time the runner gets on base? That would seem like a huge factor that affects whether the runner scores or not. Not controlling for it would mean that it's somehow included in the RPSR, introducing biases into the "luck" component you are trying to capture.
   18. Too Much Coffee Man Posted: April 13, 2005 at 12:46 PM (#1252518)
Part of this discussion reminds me of my wife's family. They will argue for an hour whether or not a light is on in the other room without any of them getting up to check.

The correlation between RPSR and XBH Ratio is .21. It's not clear to me why the latter term becomes the gold standard for a new statistic to measure up to. That said, there's very little meaningful variance shared by the two.
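For anyone who wants to get up and check the light themselves, a correlation like this takes only a few lines of Python. The sketch below is a plain Pearson correlation; the function name and the five sample teams are hypothetical, not the figures from the article's table. For reference, an r of .21 squares to about .04, so the two stats would share roughly 4% of their variance.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical team values (illustrative only, not the article's data).
xbh_ratios = [0.24, 0.25, 0.26, 0.27, 0.23]
rpsr_values = [0.33, 0.31, 0.32, 0.34, 0.32]

r = pearson_r(xbh_ratios, rpsr_values)
print(f"r = {r:.2f}, shared variance (r squared) = {r * r:.2f}")
```

Running it over all 30 teams, and ideally over several seasons, would show how much of RPSR the XBH Ratio accounts for.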
   19. Kyle Lobner Posted: April 13, 2005 at 04:24 PM (#1252821)
Ok, I've got 20 minutes before my lunch is done, but I want to respond to each of the last three posts before I'm busy for the rest of the afternoon, so here goes:

First and foremost, I could say "Mr. Snow is my father," but that's not even true. Please, call me KL.

Sometimes I waste time running correlations on something where there's no correlation at all. But you usually don't see that in my work. And, I guess, in this process, a "serious error," by my definition, would be something that causes the ceiling in my apartment to cave in, and while it is doing that, I'm pretty sure it's not because of this project. Everything else I do I define as exploring a possibility. I don't expect most of these possibilities to be accepted, and in fact neither of the published ones have been. But largely, when I put a prospect out there, I'm not putting it out there as "This is the way it is." I'm putting it out there as "Here's what I'm thinking, would anyone care to offer an alternative suggestion?" My first article was buried under alternative suggestions, and I appreciated that. I spent most of a week digging through them and deciding what I could use. And you'll see the difference when I get back to linear weights.

In response to post 18: I guess, in the article, I spent a bit too much time playing up the "what would happen if RPSR was replaced by XBH Ratio" argument, such that people now think I'm putting too much weight on it. I'm not arguing that this is the only factor, I'm simply arguing that it's an easy factor to eliminate. And the concept I'm working on now to eliminate it will probably look considerably better than the one you just read. Unless, of course, someone provides me a better suggestion along the way.
   20. Nick S Posted: April 13, 2005 at 06:47 PM (#1253262)
If you haven't previously, read "How Runs are Really Created" on, particularly part 2. What you seem to be doing here is looking for a theoretical justification for Tango's 'experimentally' determined linear formula for the "b" component of BaseRuns. You present a hypothesis (which base runners start at is an important factor in percent of baserunners who score.) You create two stats in an attempt to quantify the two variables in the hypothesis. And then stop. There should be some attempt to evaluate the relationship between your two variables, and from that some conclusion drawn. For instance, you could show the correlation between the two stats, preferably over a large sample size of teams. You'd show that there is surprisingly little correlation, particularly considering that XBH ratio probably has a positive correlation to slugging, which almost certainly has a positive correlation to RPSR. You might then conclude that XBH has little to do with RPSR and that is evidence that the hypothesis was wrong. Better yet, you could look at play-by-play data ( and calculate for each team the percent of runners who start at 1b, 2b, and 3b, and the respective scoring percentages. How large are the differences? How much do 2bs and 3bs affect the weighted average (RPSR)? Does this vary much from team-to-team? It is great that you want to put time and thought into questions like this.
   21. Michael Posted: April 13, 2005 at 11:16 PM (#1254123)
Does anyone else think of JT Snow when they see the snow index?

More work on any topic is good (marketplace of ideas) but the value of this work to the greater community seems pretty low. K.L. Snow may have value as he may be learning something from this (both from the process of thinking about it and doing it and from the comments he's receiving). I agree that more references to existing work would be helpful.
   22. Cabbage Posted: April 14, 2005 at 02:59 AM (#1255264)
I've said it before, and I'll say it again. This should really be called "The Informer Index"
   23. Mike Piazza Posted: April 14, 2005 at 05:05 PM (#1256121)
I've said it before, and I'll say it again:

I licky boom boom down!
   24. dcsmyth1 Posted: April 14, 2005 at 11:22 PM (#1257128)
As far as the idea of looking at the 'process', I think that is bogus, or certainly not worth the time if the end result is not enlightening.

I've been playing with baseball stats for 20 years, and have invented many dozens of stats, all of which seemed logical at the time. Really, only one of those stats has stood the test of time and is still considered the state of the art by the best analysts.

IOW, the end result is all that counts....
   25. Schtoopo Posted: April 15, 2005 at 11:44 PM (#1260321)
Gotta agree with #16. I'm betting this is a hoax. The fact that the author called it by his name after the whole Gleeman thing is just too much. Moreover, would the lords of PTF really let anything of this quality on the site if it were really a serious piece of work?
   26. Schtoopo Posted: April 15, 2005 at 11:51 PM (#1260342)
Oh, I know what it reminds me of now - the Sokal Affair.
   27. Mike Maddux Mike Posted: April 27, 2005 at 09:16 PM (#1293705)
And it reminds me of HEQ.
