Baseball for the Thinking Fan

Dialed In

Sunday, June 26, 2005

Discussing the Fog

Bill James and Phil Birnbaum discuss research - I stay out of the way with my mouth shut.

Bill James wrote an article in the SABR publication The Baseball Research Journal (Number 33) called “Underestimating the Fog”.

As we discussed here, this is an important piece of work for what James says: “What I am saying in this article is that the fog may be many times more dense than we have been allowing for.  Let’s look again; let’s give the fog a little more credit.  Let’s not be too sure that we haven’t been missing something important.”

That is a very important reminder that often gets lost in baseball statistical analysis and other fields as well.

However, there is the issue of clutch hitting that James used in “Underestimating the Fog”.

In the latest issue of By the Numbers, the newsletter of the SABR Statistical Analysis Committee, there were rebuttals to James’ piece: one by Jim Albert, a college professor who wrote a book called Curve Ball that covers many aspects of baseball mathematically, and one by Phil Birnbaum, the chair of the Committee.

As a member of the SABR Statistical Analysis Committee, I subscribe to an email list where researchers post questions regarding research ideas and ask for help, or just discuss another point.

Bill James sent along his response to Jim Albert and Phil Birnbaum entitled “Mapping the Fog”.  Birnbaum wrote a response.  With permission from both, I am re-printing them here, in an effort to broaden the discussion.  If Jim Albert wants to chime in, I’ll provide him the floor as well.

Without further rambling from me, here’s Bill James’ “Mapping the Fog”:

Mapping the Fog

This article has not been copyrighted, and is not intended to benefit from copyright protections.  Please feel free to share it with anyone who might be interested. 

1.  My model

In issue number 33 of the Baseball Research Journal, I published an article entitled “Underestimating the Fog”.  The thesis of this article is that we in sabermetrics have been relying on a method which doesn’t actually work, under closer scrutiny, and we should stop relying on this method.  “This method” is the practice of attempting to determine whether some characteristic within the game is “real” or a statistical artifact by comparing whether the players who do well in this area in one year also do well in the same performance category the next year, as one would expect them to if the skill under study was “real”.  I hope that made sense. . . .I’m a little confused myself, and, speaking of myself, I certainly was not suggesting that other researchers were guilty of this but I wasn’t.  I was more guilty than anyone.  I had misled the public on a series of issues due to my own failure to think clearly about this one matter, and I felt it was important for me to stand up and take responsibility for that. 

What I did not do in that article, however, was to establish that what I was saying was true.  I argued that this was true, and I discussed the implications of that truth, but I did not attempt, in that forum, to demonstrate that this method does not in fact work reliably.  This is the first order of business in this article:  to demonstrate that what I was saying before was true (or, more carefully stated, to show you how you can demonstrate this to your own satisfaction.)

Let us take the issue of clutch hitting, which is the most controversial of the many peripheral subjects entangled in the debate.  Dick Cramer argued the following in 1977:

1)  If clutch hitting really exists, one would expect that the players who were clutch hitters in 1969 would be clutch hitters again in 1970.
2)  I have studied who was a clutch hitter in 1969 and who was a clutch hitter in 1970.
3)  The lists do not correlate to any notable extent.
4)  Ergo, clutch hitting does not exist.

I accepted this argument for about a quarter of a century, but eventually it began to trouble me.  When it began to trouble me enough, I posed a counter question to myself:  is it possible to create a model in which clutch hitting clearly exists, but goes undetected by this type of analysis?

It is, in fact, possible.  Let us create a “model league” based on the following assumptions:
1.  The league consists of 100 batters.
2.  Each batter has 600 at bats.
3.  Of those 600 at bats, 150 are in clutch situations, 450 are not. 
4.  The average hitter will hit .270.
5.  Individual batting averages can range from .170 to .370, but are normally distributed (bell-shaped curve) and are clustered around .270. 
6.  Eighty percent of the players will have the same expected batting average in clutch situations as in ordinary situations
7.  However, the other 20% may hit significantly better or significantly worse in clutch situations than they do overall. 

In clutch situations, the batting average of the other twenty percent was re-calculated as

Their regular batting average,
Minus 50 points,
Plus a random number between zero and one, divided by 10. 

Thus, a .280 hitter in non-clutch situations can be a .230 hitter in clutch situations, or a .330 hitter in clutch situations, or anywhere in between, and any one figure is as likely as any other—for those players who did have a “clutch element” in their makeup.  The average clutch effect, for those players who have one, is 25 points positive or negative.
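For readers who want to check the recalculation rule, here is a minimal Python sketch of it. The function name `clutch_average`, the seed, and the number of draws are illustrative choices, not part of the original spreadsheet.

```python
import random

def clutch_average(regular_avg, u):
    """Clutch batting average for a player who has a clutch element.

    regular_avg: the player's ordinary batting average (e.g. 0.280)
    u: a random draw between zero and one
    """
    return regular_avg - 0.050 + u / 10

# The two extremes for a .280 hitter.
low = clutch_average(0.280, 0.0)   # u = 0
high = clutch_average(0.280, 1.0)  # u = 1

# Monte Carlo estimate of the average size of the clutch effect.
rng = random.Random(2005)  # arbitrary seed, used only for repeatability
sizes = [abs(clutch_average(0.280, rng.random()) - 0.280) for _ in range(100_000)]
mean_abs_effect = sum(sizes) / len(sizes)

print(round(low, 3), round(high, 3), round(mean_abs_effect, 3))
```

The estimated average size should land close to .025, in line with the 25-point figure quoted above.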

You may or may not agree that this model represents a fair test of the clutch thesis.  If you agree that it does, end of subject.  If you would argue that it does not. …Dick Cramer, in his 1977 article, stated that “I have established clearly that clutch-hitting cannot be an important or general phenomenon.”  I would argue that if 20% of the hitters have clutch effects averaging 25 points, that is quite certainly an important and general phenomenon.  Further, in several respects, this model exaggerates the impact of clutch hitting, which should make it easier to detect whether or not a clutch hitting ability is an element of the mix.  In this league there were 60,000 at bats, which were neatly divided into 600 at bats each for 100 players.  In the real American League in 1969—one of the leagues included in Cramer’s study—there were 65,536 at bats, but there were only 25 players who had 550 or more at bats, the rest of the at bats being messily distributed among players who had 350, 170, 80 and 4 at bats.  This would make it much easier to detect the presence of clutch hitters in the model than in real life. 

In the real leagues studied by Cramer, there were many players who had 520 at bats one year but 25 the next, making those players—and those at bats—essentially useless as a basis for year-to-year comparison.  In my model, all 100 players had 600 at bats each year, with no one dropping out or coming in.  This, again, would make it vastly easier to have meaningful year-to-year comparisons, in my model, than it would be in real life.

In my model, one-fourth of all at bats are designated as “clutch” at bats.  In real life, it seems unlikely that the number of true “clutch” at bats would be that large.  In real life, a player probably has 50 or 75 high-pressure at bats in a season.  In my model, he had 150.  This would make it vastly easier to detect clutch performers in the model than it would be in real life.

In my model, all at bats are cleanly delineated as “clutch” or “non clutch”.  In real life, it is extremely difficult to say to what extent any at bat is “clutch” or “non clutch”.  Again, this would make it much, much easier to detect the presence of clutch hitters in this model than it would be in real life. 

If you object to the fact that only 20% of the players in this study had some clutch ability:
a)  what if only 20% of players in real life have some clutch ability?, and
b)  it isn’t crucial anyway.  The conclusion wouldn’t change if it was 40% or 50%. 

Having constructed this model, I then simulated on a spreadsheet 600 at bats for each player—450 in non-clutch situations and 150 under clutch conditions—and figured for each player his batting average in “clutch” situations and his batting average in non-clutch situations.  I did this for two seasons for each of the 100 players, creating a “clutch differential” for each player in each season.  Each player’s intended batting average changed from season to season, but his “clutch differential” remained the same. The spreadsheet on which this experiment was conducted is named “Clutch Consistency.XLS”, and I will e-mail a copy of this spreadsheet to anyone who asks.  At first glance it just looks like a vast collection of random numbers, but I think you can figure it out with a little effort.

This method does not exactly mirror Cramer’s method, in his 1977 article which I was using as a kind of whipping boy in Underestimating the Fog.  What I have described as “Cramer’s method” is in fact two methods—an (a) method which was used to determine whether a player was a clutch hitter in any given season, and a (b) method which was used to determine whether those players identified as clutch players were consistent from season to season.  I was interested entirely in the questions raised by the (b) method.  The subject of my article could be stated as “Will Cramer’s (b) method work reliably under real-life conditions, if we assume that his (a) method works?”  The (a) method I never discussed at all, for three reasons—
1.  That this was not relevant to my article,
2.  That his (a) method is much more complicated, and much harder to replicate in a model, than the method I preferred, and
3.  I’ll tell you later. 

Anyway, in my model, we know that clutch hitting does exist, and that it does exist at what seems to me a very significant level.  Yet when I compared the “clutch differentials” of the 100 players in the two seasons, the year-to-year consistency was far, far below the level at which any conclusion could be drawn from the data.  Despite all of the steps I took to make clutch ability easier to spot in the model than it would be in real life, it remains essentially invisible. 

In the study, a player’s clutch contribution was labeled as “consistent” if he hit better in clutch situations than he did overall in both simulated seasons, or if he hit worse in both seasons.  His clutch contribution was labeled as “inconsistent” if he was better one year and worse the other. 

As you would expect, 50% of the players who had no actual clutch differential were consistent, and 50% were inconsistent.  But of the players who did have actual clutch differentials, 62.2% were consistent, while 37.8% tested as inconsistent, given these conditions. 
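The experiment can be approximated outside a spreadsheet. The Python sketch below is not the “Clutch Consistency.XLS” workbook and will not reproduce its numbers exactly. It assumes a standard deviation of .030 for batting averages (the text gives only the range and the mean), treats every at bat as an independent trial, and counts a season in which the clutch and overall averages tie exactly as inconsistent. All function names are hypothetical. The league is enlarged to 4,000 players, half of them with a clutch effect, purely to cut down on noise in the two rates.

```python
import random

def simulate_hits(at_bats, avg, rng):
    """Count hits in `at_bats` independent trials with hit probability `avg`."""
    return sum(rng.random() < avg for _ in range(at_bats))

def season_differential(base_avg, clutch_effect, rng):
    """Observed clutch average minus observed overall average for one season."""
    clutch_hits = simulate_hits(150, base_avg + clutch_effect, rng)
    other_hits = simulate_hits(450, base_avg, rng)
    return clutch_hits / 150 - (clutch_hits + other_hits) / 600

def draw_average(rng, sd=0.030):
    """Normal around .270, redrawn until inside .170-.370 (sd is an assumption)."""
    while True:
        avg = rng.gauss(0.270, sd)
        if 0.170 <= avg <= 0.370:
            return avg

def consistency_rates(n_players=100, clutch_share=0.20, seed=1):
    """Return (rate among players with a clutch effect, rate among the rest)."""
    rng = random.Random(seed)
    n_clutch = round(n_players * clutch_share)
    tallies = {True: [0, 0], False: [0, 0]}  # has_effect -> [consistent, total]
    for i in range(n_players):
        has_effect = i < n_clutch
        effect = -0.050 + rng.random() / 10 if has_effect else 0.0
        # The batting average is redrawn each season; the clutch effect is not.
        d1 = season_differential(draw_average(rng), effect, rng)
        d2 = season_differential(draw_average(rng), effect, rng)
        tallies[has_effect][0] += d1 * d2 > 0  # same sign in both seasons
        tallies[has_effect][1] += 1
    return tuple(c / t if t else float("nan")
                 for c, t in (tallies[True], tallies[False]))

with_effect, without_effect = consistency_rates(n_players=4000, clutch_share=0.5)
print(f"with a clutch effect: {with_effect:.3f}, without: {without_effect:.3f}")
```

Calling `consistency_rates()` with its defaults runs the 100-player league described above, where a single run is far noisier.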

Overall, then, 52.44% of the players in the study showed consistency in their clutch contribution (80% of the players at 50%, plus 20% of the players at 62.2%).  If 52.44% of the players in a group are consistent from year to year and there are 100 players in the group, what is the random chance that 50 of them or fewer will show up as consistent in one test? 

It’s 35%.  Thus, no conclusion whatsoever can be drawn from the apparent lack of consistency in the data.  Even when we know that the clutch effect does exist within the data, even when we give that effect an unreasonably clear chance to manifest itself, there is still a 35% chance that it will entirely disappear under this type of scrutiny. 

What if 40% of the players have an “actual clutch effect”, rather than 20%?
At 40%, there is still a 14% chance that 50 or fewer of the 100 players will have positive year-to-year consistency—which means that we are still in a position where no conclusion can be drawn from the lack of documented consistency.  Even if 50% of the players have an actual clutch effect, there remains a 9% chance that this would not show up in a test of 100 players. 
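The 35% figure can be checked with an exact binomial sum. This is a sketch: `prob_at_most` and `blended_rate` are illustrative names, and the blend assumes that players with a real effect are consistent 62.2% of the time and everyone else 50% of the time.

```python
from math import comb

def prob_at_most(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def blended_rate(clutch_share, clutch_rate=0.622, baseline=0.50):
    """Overall consistency rate when only some players have a real effect."""
    return clutch_share * clutch_rate + (1 - clutch_share) * baseline

p20 = blended_rate(0.20)  # 0.5244, the figure used in the text
chance_hidden = prob_at_most(50, 100, p20)
print(round(p20, 4), round(chance_hidden, 2))
```

The same two functions can be rerun with a share of 0.40 or 0.50. The result is sensitive to the consistency rate assumed for the players with an effect and to whether one counts “50 or fewer” or “fewer than 50”, so this sketch is not guaranteed to land on the 14% and 9% quoted above.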

2.  Random Observation

Part of the problem with measuring “agreement” is that “agreement” narrows the odds, and thus profoundly changes the percentages.  Suppose that half of the players in a group are good clutch hitters, and half are poor clutch hitters.  Suppose that you have a test of clutch ability which is 80% accurate.  Under those conditions, how many players will measure as consistent, meaning that they measure the same both years? 

68%.  64% will measure as “consistent” accurately (.80 times .80), and 4% will measure as “consistent” due to a repeated inaccuracy (.20 times .20).  If the measurement is 80% accurate, in a two-year period 64% of the players will have two accurate measurements, and 4% will have two inaccurate measurements.
If the test of clutch ability is 70% accurate, then, it will show 58% agreement (.49 plus .09).  If the test of clutch ability is 60% accurate, it will show 52% agreement (.36 plus .16).

Thus, in order to achieve 62% agreement, as we did in the model above, you have to have a test which is 75% accurate.  This is actually more of a problem in the catcher-ERA studies than it is in the clutch hitting studies. 
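The relationship between accuracy and agreement is a one-line formula, and it can be inverted to recover the accuracy behind a given agreement rate. A small Python sketch, with `agreement` and `accuracy_needed` as illustrative names:

```python
from math import sqrt

def agreement(accuracy):
    """Share of players measured the same way twice: both right or both wrong."""
    return accuracy**2 + (1 - accuracy) ** 2

def accuracy_needed(target_agreement):
    """Invert agreement() for accuracies of at least 50%."""
    return (1 + sqrt(2 * target_agreement - 1)) / 2

for acc in (0.80, 0.70, 0.60):
    print(f"{acc:.0%} accurate -> {agreement(acc):.0%} agreement")
print(f"62% agreement needs about {accuracy_needed(0.62):.1%} accuracy")
```

The inverse comes from solving 2p² - 2p + 1 = a for p and keeping the root above one half.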

3.  Reaction to Underestimating the Fog

In the first few weeks after “Underestimating the Fog” was published, I got reactions which were all over the map.  However, the one thing that nobody said, in the first few weeks—at least, nobody said it where I happened to see it—was that what I was saying was not correct.  Thus, I felt no pressure, in those opening weeks, to demonstrate that what I was saying was correct.

However, in the February, 2005 edition of By the Numbers—which I think came out in June, 2005, go figure—there were two articles which touched on the veracity of my central claim, and thus prompted me to put my supporting work on record. 

These two articles tend to broaden the debate, and raise a number of points that I wanted to comment on.  In the first of those two articles (Comments on “Underestimating the Fog”), Jim Albert writes:

I was interested in a statement that James made in this article regarding the existence of individual platoon tendencies.  This was counter to the general conclusions Jay Bennett and I made in Chapter 4 of Curve Ball. 

However, Dr. Albert doesn’t say what the statement was that he disagreed with, and, pardon my obtuseness, but I’m not able to figure it out.  I’ve read his comment three or four times, but my math skills are limited, and I just can’t figure out what it is I said that he disagrees with.  My ability to respond is thus impaired. 

With this exception, Dr. Albert’s comments, including those critical of the article, seem to me to be fair and well-considered, and I have no response to them. 

The following article, however, the Phil Birnbaum article entitled “Clutch Hitting and the Cramer Test”, contains a number of statements that I wanted to comment on.
1)  For the sake of clarity, the issue that I was discussing in Underestimating the Fog is peripheral to Birnbaum’s article, and the issue that Birnbaum is discussing in his article was on the periphery of my article.  I was writing about whether Cramer’s (b) method works.  Birnbaum is writing about whether Clutch Hitting could exist.  These are not articles discussing the same subject. 

2)  I don’t think that Birnbaum himself is confused about this (point 1), but he appends to his article a head-note which seems to suggest that he is responding directly to my article, and follows this by quoting two or three things I had said and responding to them.  This creates the impression, to the reader, that we are writing about the same central issue.  The longer his article goes, the more it drifts away from being a response to Underestimating the Fog. 

3)  In my article I had written that “random data proves nothing—and it cannot be used as a proof of nothingness.  Why?  Because whenever you do a study, if your study completely fails, you will get random data.  Therefore, when you get random data, all you may conclude is that your study has failed.”

In response to this, Birnbaum says that “This is certainly false.  It is true that when you get random data, it is possible that ‘your study has failed.’  But it is surely possible, by examining your method, to show that the study was indeed well-designed, and that the random data does indeed reasonably suggest a finding of no effect.”

Reasonably suggests?  We’re not talking about reasonable suggestions here; we’re talking about valid inferences from the data.  Cramer didn’t say that his data “reasonably suggests” the absence of clutch hitters; he said—incorrectly—that his data “established clearly that clutch hitting cannot be an important or general phenomenon.”  Joe Morgan, Tim McCarver, and generations of sportscasters before them have reasonably suggested that some players may have a special ability to rise to the occasion.  The task in front of us is not to reasonably suggest the opposite, it is to find clear and convincing evidence one way or the other.

In the process of doing this, studies resulting in random data show only that the study has failed to identify clutch hitting ability.  I stand by my statement without any reservation.
4)  What Birnbaum means by “The Cramer Test” in his title is also Cramer’s (b) method; he doesn’t actually use Cramer’s (a) method, either.  In point of fact, nothing in Birnbaum’s article examines the effectiveness either of Cramer’s (a) method OR his (b) method.  Birnbaum’s article examines not whether Cramer’s method works, but whether his conclusion—that clutch hitting doesn’t exist at a significant level—is true.    He poses this question:

Bill James disputes this result, writing that “it is simply not possible to detect consistency in clutch hitting by the use of this method.”  Is he correct?  If clutch hitting were a consistent skill, would the Cramer test have been powerful enough to pick it up?

But he never actually addresses this question.  His subsequent research has to do with whether Cramer is correct, and has nothing at all to do with whether his method works.  He drops Cramer’s (a) method, and performs a test of statistical significance on the (b) method, the results of which, in my opinion, he misinterprets.

5)  For the sake of clarity, I take no position whatsoever about whether clutch hitting exists or does not exist.  I simply don’t have any idea.

6)  Either Birnbaum or myself is profoundly confused about the difference between “no evidence of effect” and “evidence of no effect.”  Birnbaum writes:

The results:  a correlation coefficient (r) of .0155, for an r-squared of .0002.  These are very low numbers; the probability of an f-statistic (that is, the significance level) was .86.  Put another way, that’s a 14% significance level, far from the 95% we usually want in order to conclude that there’s an effect. 

But this data—and all of Birnbaum’s data—actually doesn’t indicate that there is no effect.  In fact, it shows that there is some evidence that there may be such an effect, but that this evidence merely is far too weak to say for sure one way or the other.  This is a very, very different thing—and one absolutely may not segue from one into the other in the way that Birnbaum is attempting.

Why?  For this reason.  Suppose that you took a ten-at-bat sample of Stan Musial’s career, and asked “does this ten at bat sample provide clear and convincing evidence that Musial was an above-average hitter?” 

Of course the answer would be “no, it doesn’t.”  In the ten at bats Musial might go 4-for-10 with 2 homers, but in a ten-at-bat sample, A. J. Hinch might go 4-for-10 with 2 homers.  You would conclude, by Birnbaum’s method, that this provided very, very little evidence that Musial was in fact an above-average hitter.

Suppose that you broke Musial’s 1948 season down into a series of 61 ten-at-bat sequences, and tested each one for evidence that Musial was an above-average hitter.
You would certainly fail, 61 times in a row—indeed, in many of those ten-at-bat samples, Musial would appear to be a below average hitter.

By Birnbaum’s logic, this would provide overwhelming evidence that Stan Musial in 1948 was not really an above-average hitter, since he had failed 61 straight significance tests.

But wait a minute. . .the real-life problem is worse than that.  Suppose that you took each ten-at-bat sample of Musial’s season, and you buried it in a pile of one thousand at bats by ordinary hitters, and you then tested the significance of the 1010-at-bat composite.  This would make the f-statistic (significance level) much higher, while making the correlation coefficient even lower.  You quite certainly would find no evidence whatsoever that Musial was pushing the group to be above average.
This is the real-life problem that we confront here.  The clutch hitting contribution, if it does exist, is buried in large piles of random and confusing data, with very little marking the clutch contribution to enable us to dig it out and examine it.
I’m not saying it can’t be done; there are lots of clever people in the world, and it probably can be done, eventually.  But the problem is a hell of a lot harder than Birnbaum realizes.
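The dilution described above is easy to put a number on. The sketch below assumes a .376 average for the ten-at-bat sample (Musial's 1948 batting average, which is not stated in the text) and .270 for the ordinary hitters (the model league's average). The function name is hypothetical.

```python
def composite_average(star_avg, star_ab, pool_avg, pool_ab):
    """Expected batting average of a small sample mixed into a larger pool."""
    hits = star_avg * star_ab + pool_avg * pool_ab
    return hits / (star_ab + pool_ab)

# Assumed inputs: .376 for the ten-at-bat sample, .270 for the other 1,000.
mixed = composite_average(0.376, 10, 0.270, 1000)
lift_in_points = (mixed - 0.270) * 1000
print(f"composite {mixed:.4f}, lift of about {lift_in_points:.1f} points")
```

Under these assumptions the ten at bats raise the expected composite average by roughly one point, which is far smaller than the random variation in 1,010 at bats.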
7)  Birnbaum writes “Let’s suppose a clutch hitting ability existed, that the ability was normally distributed with a standard deviation (SD) of .030 (that is, 30 points of batting average.)”

But the scale proposed here is massive.  The standard deviation of batting average itself isn’t thirty points.  The standard deviation of batting average, for all players qualifying for the batting title in the years 2000 to 2004, is 28 points (.0277).

Birnbaum’s argument is “if a clutch hitting ability existed on this scale, this analysis would find it.”  But if a clutch hitting ability existed on anything remotely approaching that scale, Stevie Wonder could find it.  If a clutch hitting ability existed on anything like that scale, we wouldn’t be having this discussion.

If the standard deviation of clutch ability was 30 points, there would be a very significant number of players who hit 50 points better in clutch situations, throughout their careers.  If that was the case, we would have known it 20 years ago.  If the standard deviation of clutch ability was 30 points, there would be one or two players in each generation who would improve their performance in clutch situations by 100 points.  We could find that without doing any of this stuff.
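To see what a 30-point standard deviation implies, one can read the tail areas off a normal curve. This sketch uses Python's standard-library `NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

clutch = NormalDist(mu=0.0, sigma=0.030)  # the SD Birnbaum proposes

share_above_50 = 1 - clutch.cdf(0.050)    # players 50+ points better in the clutch
share_above_100 = 1 - clutch.cdf(0.100)   # players 100+ points better in the clutch

print(f"{share_above_50:.1%} above +50 points, {share_above_100:.3%} above +100")
```

Multiplying these shares by the number of players one has in mind turns them into head counts.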
8)  No one has ever suggested that clutch hitting operates on that scale.  Listen to the things that Tim McCarver says about clutch hitting, or Joe Morgan, or any of those druids.  What they are saying is not that EVERYBODY has some huge clutch effect, but rather, that there are some few players—some tough, veteran players who have real character, and who might someday even go on to become TV broadcasters—who are able to come through in the clutch.  Sometimes.
In my model of the problem, I envisioned this as 20% of the players, having a clutch effect of 25 points (.025).  That creates a standard deviation, for the group as a whole, of eleven points.

Maybe it’s not eleven; maybe it’s 12, or 14, or 6, or 2.  It sure as hell isn’t 30.
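The group-wide standard deviation can be worked out directly. The sketch below shows two readings of the model: one where every affected player is exactly 25 points better or worse, and one where the effects are spread evenly between minus 50 and plus 50 points. Which reading produced the eleven-point figure is an assumption here.

```python
from math import sqrt

share = 0.20  # fraction of players with a clutch effect

# Reading 1: every affected player is exactly 25 points better or worse.
sd_fixed = sqrt(share * 0.025**2)

# Reading 2: affected players are spread evenly between -50 and +50 points;
# a uniform range of width w has variance w**2 / 12.
sd_uniform = sqrt(share * 0.100**2 / 12)

print(f"{sd_fixed * 1000:.1f} points, {sd_uniform * 1000:.1f} points")
```

Both readings come out well under 30 points.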

9)  Let us talk for a moment about Cramer’s (a) method.

Cramer’s (a) method—his method of determining whether a player was or was not a clutch hitter—was to contrast two measurements.  One was an estimate of the player’s presumptive win contribution, based on his total batting statistics.  A home run is a home run.  If a player hit a home run in the ninth inning of a 12-1 ballgame, that was the same as if he hit a walk-off homer in the bottom of the ninth.  The other was an event-by-event assessment of what the player had contributed to his team’s wins.  If a player hit a home run in the ninth inning of a 12-1 ballgame, that would essentially be a non-event, whereas if a player hit a David Ortiz shot, that might be worth 100 times as much.
If a player ranked much better in the second evaluation than in the first, Cramer’s (a) method designated him a clutch hitter.  If he ranked much better in the first evaluation, Cramer designated him as a non-clutch player.

Neither Birnbaum nor I, in discussing Cramer’s article, made any effort to replicate or to examine this method, what I have been calling Cramer’s (a) method.  We both tested his (b) method, but replaced his (a) method with something more straightforward.  I had three reasons for not using his (a) method, two of which I explained before.
My third reason for skipping this system is that I wanted a system which I knew would work.  I wanted to test whether or not Cramer’s (b) method would work if we assumed that his (a) method worked reasonably well.  I therefore substituted an (a) method that I knew would work, demonstrated that it did work, and moved forward from there.

This leaves unexamined the question of whether or not Cramer’s (a) method would work.  Could one, in fact, identify clutch hitters by contrasting a player’s overall offensive work with his win contribution, figured from the sequence of events?

I don’t know.  I’m skeptical.  I doubt that it would work.  The problem, it seems to me, is that the method might be heavily liable to random influences.
Here’s how we could tell if the method works or not. . ..I’ll get around to doing this eventually, I suppose, if nobody beats me to it.  Construct a “model universe”, as I did in my study, and designate 15 or 20% of the players as clutch hitters, as I did in my study.  Then simulate games, and evaluate the output by the method Cramer used to evaluate the real-life events.
One would then be in a position to ask “do the players who are ACTUALLY clutch hitters, in the underlying codes, show up as clutch hitters in the output?”  By random chance, 50% of them would show as better clutch hitters than neutral-case hitters.  By the method I used, 75% of the clutch hitters were identified as clutch hitters.  I would be very, very surprised if Cramer’s (a) method would match that.  I would guess you would get 53, 55% accuracy, somewhere in there.

Why?  Too much weight on too few outcomes.  I am guessing, but I don’t really know, that in Cramer’s (a) method, 50% of the variance between the player’s situation-neutral win contribution and his situational win contribution will be determined by 30 at bats or fewer (if the player plays regularly).  Thus, the player’s ranking in this system would seem to be heavily influenced by random deviations in performance in a small number of at bats, and thus the players who were “truly” clutch hitters, in the model, might very often not be identified as clutch players.

10)  Again for the sake of clarity, I am not suggesting that my “clutch indicator” system works, either.  My system worked, in my model, only because I set up the model to enable it to work within the model.  It wouldn’t work worth a crap in real life.
Also, my system was 75% accurate only in the sense of agreeing that a clutch hitter was a clutch hitter if we already knew that he was.  But my system would also identify as clutch hitters a large number of players who actually weren’t coded to hit well in the clutch, but who had merely done so at random.
Ultimately, what we need is a system which can reliably identify a clutch hitter, if one exists.  That doesn’t seem to me like an impossible problem.  But we’re nowhere near to having such a thing.
11)  Birnbaum did attempt to demonstrate that his (a) method worked; he just did a couple of things that, in my opinion, undermine his attempt.

Look, what I was trying to say in Underestimating the Fog is “You can’t assume that your system works.  You have to prove that it works.  You have to demonstrate that it works, detail by detail.”

We are no closer to that now than we were a year ago.  Cramer’s article remains immensely important, for this reason:  that it proposed a road map through a wilderness.  That was a wonderful thing; I appreciated that 28 years ago, and I appreciate it now.
But the first maps drawn of America showed huge waterways cutting through the Rocky Mountains—and that was after the explorers finally realized they weren’t in India.  Maps drawn of the moon even fifty years ago were comically inaccurate.
I don’t know how accurate Cramer’s (a) method really is.  But the limitations of his (b) method are such that, even if his (a) method was 100% accurate, that might not be enough to justify the conclusions he thought he had reached. . .the conclusions that we thought he had reached.  I doubt that the (a) method works, either. 

It is my opinion that there is an immense amount of work to be done before we really begin to understand this issue. 

And Birnbaum’s response:

Response to “Mapping the Fog”

In a famous 1977 clutch-hitting study, Dick Cramer took 122 players who had substantial playing time in both 1969 and 1970. He ran a regression on their 1969 clutch performance versus their 1970 performance. Finding a low correlation, he concluded that clutch performance did not repeat, and that, therefore, this constituted strong evidence that clutch ability did not exist.

Bill James, in his recent essay “Underestimating the Fog,” disputes that the Cramer study did indeed disprove clutch hitting.

My essay, “Clutch Hitting and the Cramer Test,” disagreed with Bill. In “Mapping the Fog,” Bill James criticized aspects of my essay, and reasserted that his position was correct.

But I still believe that Bill is not correct.

Bill’s position can be summarized by these two quotes, from “Underestimating the Fog:”

“… even if clutch-hitting skill did exist and was extremely important, [Cramer’s] analysis would still reach the conclusion that it did not, because it is not possible to detect consistency by the use of this method [regression on this year’s clutch performance against next year’s].”

“… random data proves nothing – and it cannot be used as proof of nothingness. Why? Because whenever you do a study, if your study completely fails, you will get random data. Therefore, when you get random data, all you may conclude is that your study has failed.”

To which I respond:

1. Yes, random data on its own proves nothing. But combined with evidence that your test would have found an effect if it existed, the random data is evidence that the effect doesn’t exist.

2. It is possible to detect clutch-hitting consistency (at reasonable, non-trivial levels) by the use of the Cramer test.

3. It is possible to show what effects the Cramer test is capable of finding, and, therefore, to what extent a “finding of no effect” disproves clutch hitting.

On number 1, Bill charges me with a fallacy – the fallacy of believing that, if a test finds no evidence of clutch hitting, this means that clutch hitting does not exist. I agree with Bill that this logic would be seriously incorrect – but I neither stated it nor implied it. My point was that if a test finds no evidence of clutch hitting, and you can show that the test would have found clutch hitting if it existed, well, then, and only then, are you entitled to draw a conclusion about the non-existence of clutch hitting.

Either Bill misread what I said, or I didn’t say it clearly enough.
Point number 2 is the most important, because it’s the point of greatest contention between Bill and myself. Bill thinks you can’t detect clutch hitting by the use of the Cramer test. I believe you can.

The reason for the difference is that we’re using different tests.

Bill’s test, in essence, consists of looking at players in consecutive years, and assigning each player one of four symbols. He gets a “+ +” if he was a clutch hitter both years; “- -” if he was a choke hitter both years; and “- +” or “+ -” if he was split. Bill then counts the number of consistent players (+ + or - -), and compares it to the number of inconsistent players (+ - or - +). If clutch hitting existed, there would be significantly more consistent players than inconsistent.

My test, which is the same test that Cramer used (but with Bill’s measure of clutch rather than Cramer’s “(a)” measure, as Bill calls it), uses the actual numbers, and runs a regression. So if player A was 50 points higher in the clutch one year and 10 points higher the next, I add the pair (+50, +10) to my sample. I then run a regression (standard STAT101) on all the pairs, and look for a significance level.

The point is that Bill’s test is much, much weaker than mine. I think Bill is correct that with his test, “even if clutch-hitting skill did exist and was extremely important,” the test would be incapable of finding it.

By using only the signs, Bill lost a huge amount of information – he kept the consistency aspect, but not the amount of consistency. To James, a hitter who hits one point better in both years gets the same weight as a player who hits 50 points better in both years.

(As an aside, I’d bet that if Bill threw out all datapoints except those where the absolute value of clutch hitting was over 25 points both seasons, the test would be much more likely to find significance. But that’s not important right now.)

By analogy, suppose that team A wins three games against the Brewers all by scores of 5-4, while team B wins three games against the same Brewers all by scores of 10-1. Bill’s test treats the teams the same, scoring them both as “+ + +”, and is incapable of noticing that team B is actually much better than team A.

But to my test (and Cramer’s), the amount of clutch hitting is considered. And so the Cramer test is capable of finding significant clutch effects.
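To make the difference between the two tests concrete, here is a minimal Python sketch that computes both statistics on the same simulated pairs. It is not the simulation from either essay: the player count, the noise level of 45 points, and the function names are all assumptions chosen for illustration, and only the 30-point talent SD comes from the discussion.

```python
import math
import random

def simulate_pairs(n_players, talent_sd, noise_sd, rng):
    """Return (year1, year2) observed clutch differentials, in points.

    Each hypothetical player has a fixed clutch talent plus independent
    luck in each season. All parameters are illustrative assumptions.
    """
    pairs = []
    for _ in range(n_players):
        talent = rng.gauss(0.0, talent_sd)
        pairs.append((talent + rng.gauss(0.0, noise_sd),
                      talent + rng.gauss(0.0, noise_sd)))
    return pairs

def sign_test_z(pairs):
    """z-score for the count of same-sign pairs against a 50/50 coin."""
    same = sum(1 for a, b in pairs if (a > 0) == (b > 0))
    n = len(pairs)
    return (same - n / 2) / math.sqrt(n / 4)

def regression_t(pairs):
    """t-statistic of the slope when regressing year 2 on year 1."""
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    r = sxy / math.sqrt(sxx * syy)
    return r * math.sqrt((n - 2) / (1 - r * r))

rng = random.Random(2005)
pairs = simulate_pairs(n_players=150, talent_sd=30.0, noise_sd=45.0, rng=rng)
print("sign test z: ", round(sign_test_z(pairs), 2))
print("regression t:", round(regression_t(pairs), 2))
```

Repeating the last three lines over many seeds, and counting how often each statistic clears a chosen threshold, would give a rough power estimate for each test under these assumed settings.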

The first of my tests asked this question: if clutch hitting were normally distributed with standard deviation of 30 points, would the Cramer test find it?

It would and it did. The second row of my table (at the top of page 10 of “Clutch Hitting and the Cramer Test”) contains the results of 14 simulations of a season where clutch hitting was normally distributed with an SD of 30 points. The Cramer test found the effect, with statistical significance, in 11 of those 14 simulated seasons. Seven of the 14 were extremely significant, rounding to .00.

Now, you could argue that 11 out of 14 isn’t enough – the test is only powerful enough 79% of the time. 21% of the time, the test will fail.

And that’s true if you only run the test on one season’s worth of data. But I ran it on 14 seasons. If clutch hitting at the .030 level should be caught 11 out of 14 times, and the real-life data (top row of the same table) showed significance 0 out of 14 times, does that not “reasonably suggest” (Bill doesn’t like this expression) that clutch hitting at .030 does not exist?

In my essay, I stopped there, but I could have done a more formal calculation. It looks like there’s about a 21% chance of failing to find significance for a single season. Let’s up that to 30% just to be conservative. We found 14 of those in a row. What’s the chance of a 30% shot happening 14 times in a row? 1 in 21 million.
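The arithmetic here can be checked with a short Python sketch. It assumes, as the paragraph above implicitly does, that the 14 seasons are independent trials; the variable names are made up for illustration.

```python
# Observed miss rate in the simulations: 3 failures out of 14 seasons.
observed_miss_rate = 3 / 14          # about 0.214

# Conservative miss rate used in the text.
conservative_miss_rate = 0.30
seasons = 14

# Chance that a real 30-point effect is missed in every one of 14
# seasons, assuming the seasons are independent trials.
p_all_missed = conservative_miss_rate ** seasons
one_in = 1 / p_all_missed

print(f"miss rate per season: {observed_miss_rate:.3f}")
print(f"P(all 14 missed) = {p_all_missed:.3e}, about 1 in {one_in:,.0f}")
```

Using the observed miss rate of 3/14 instead of the padded 30% would make the result smaller still, so the 30% figure is the cautious choice.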

That’s highly significant.

What’s Bill’s response to this test in “Mapping the Fog”? He doesn’t dispute the method or conclusion. Rather, he argues that .030 is a massive SD for clutch hitting (I implied that it was moderate; Bill is correct – it is massive). Of course this method can find an SD of 30 points, Bill says. “Stevie Wonder could find it.”

Bill writes, “maybe [the SD is] … 12, or 14, or 6, or 2. It sure as hell isn’t 30.”

Which is fair enough. But my original essay actually does go on to repeat the same test for 20 points, then 15 points, then 10 points, then 7.5 points – using exactly the same method, which Bill doesn’t dispute (and uses himself, as we will see shortly).

Bill does not mention these subsequent tests at all – nor does he mention my conclusion that the Cramer test (with 14 seasons of data) is “doubtful” with a standard deviation of 10 points, and that I agree with him that it “fails” if the SD of clutch hitting is actually only 7.5 points.

In “Mapping the Fog,” Bill suggests a different distribution – he supposes 80% of the population has no tendency for clutch hitting whatsoever, and the 20% vary uniformly (i.e., a flat curve rather than a bell curve) between -50 points clutch and +50 points clutch. He goes on to do a very similar simulation to what I did. He finds (and I agree) that this test will almost always fail to find an effect.

But Bill used his “signs” test rather than the Cramer regression, and that’s why he failed to find any effect.

To prove that, I repeated my regression, but used the James distribution rather than my normal distribution. (James says the SD of his distribution is .011, but I found .013.)
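The two SD figures can be compared against a direct calculation. The sketch below assumes the distribution exactly as described above (80% of hitters at zero, 20% uniform between -50 and +50 points) and uses the standard variance of a uniform distribution, width squared over 12. The variable names are illustrative only.

```python
import math

share_with_clutch = 0.20   # fraction of hitters with any clutch tendency
half_width = 50.0          # uniform between -50 and +50 points

# Variance of a uniform distribution on [-50, +50]: width^2 / 12.
uniform_variance = (2 * half_width) ** 2 / 12

# The mixture has mean zero, and the 80% at exactly zero add no variance,
# so the overall variance is just the clutch share times the uniform part.
mixture_variance = share_with_clutch * uniform_variance
mixture_sd = math.sqrt(mixture_variance)

print(f"SD of the mixture: {mixture_sd:.1f} points")
```

Under these assumptions the result is close to 13 points, which agrees with the .013 figure.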

My results: out of my 56 simulated seasons, 11 showed statistical significance at the .05 level in a positive direction. If the data were random, it should have been 2.5% of 56, or 1.4.

Again I didn’t do this in the essay, but what is the probability of getting exactly 11 positives out of 56, where the chance of each positive is 2.5%? If I’ve done the calculation right, it’s about 1 in 8.6 million. We really want “11 or more”, rather than exactly 11, but I’m too lazy to run the normal approximation to binomial right now. It’s definitely less than 1 in a million, in any case. (By the way, I think the 11 successes might have been a random fluke. But even if we got only 6 successes, I (lazily) believe that would still be significant at the 1% level.)
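For readers who would rather not rely on the normal approximation, the exact binomial figures are easy to compute. This is a minimal sketch using only Python's standard library; the helper names are made up here, and it assumes the 56 simulated seasons are independent, each with a 2.5% chance of a false positive.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binom_tail(k, n, p):
    """Probability of k or more successes in n independent trials."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

n_seasons = 56
p_false_positive = 0.025

exactly_11 = binom_pmf(11, n_seasons, p_false_positive)
at_least_11 = binom_tail(11, n_seasons, p_false_positive)
at_least_6 = binom_tail(6, n_seasons, p_false_positive)

print(f"P(exactly 11)  = {exactly_11:.3e}")
print(f"P(11 or more)  = {at_least_11:.3e}")
print(f"P(6 or more)   = {at_least_6:.4f}")
```

The exact-11 term lands in the same range as the figure quoted above, the "11 or more" tail stays well under 1 in a million, and the "6 or more" tail comes in under 1%.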

In point form, then:

—Under Bill’s distribution, the simulated Cramer Test succeeded in finding positive significance about 19% of the time in 56 tries.

—Random data would, by definition, find positive significance 2.5% of the time.

—The chance of the 19% happening by chance in 56 tries, where the real probability is 2.5%, is less than 1 in a million.
On that basis, I would conclude that the Cramer test over 14 single seasons “reasonably suggests” that a level of clutch hitting as described by Bill’s distribution does not exist.

But I guess there are really two conclusions:

—With 14 separate seasons worth of data, the Cramer test “works” in that it identifies the existence of clutch hitting at the Bill James distribution;

—As an aside, the real-life data do provide reasonable basis to conclude that if clutch hitting does indeed exist, it does so at a lower level than the Bill James distribution.

So, now going back to Bill’s original two quotes:

1. “… even if clutch-hitting skill did exist and was extremely important, [Cramer’s] analysis would still reach the conclusion that it did not, because it is not possible to detect consistency by the use of this method [regression on this year’s clutch performance against next year’s].”

It seems to me that Bill believes this because he used a much weaker signs test, rather than a full regression. (Although, to be fair, I don’t know whether the Cramer test succeeds using Cramer’s own measure of clutch hitting. It might, or it might not.) I believe that the data and logic fully support the conclusion that for a large enough effect (such as Bill’s distribution) and enough seasons of data (say, the 14 that I used), the Cramer test quite easily detects consistency.

2. “… random data proves nothing – and it cannot be used as proof of nothingness. Why? Because whenever you do a study, if your study completely fails, you will get random data. Therefore, when you get random data, all you may conclude is that your study has failed.”

As I argued earlier, I believe this is not true – random data, combined with a powerful enough test, is legitimate evidence of “nothingness” – or at least, a small and bounded amount of somethingness.

And, judging by Bill’s response, I don’t think he believes this second quote himself. His own test of whether the signs test would pick up an effect proves that. If he really believed that random data proved nothing, what would be the point of checking if the test could produce non-random data? Bill’s test only makes sense if he really means that random data proves nothing only if random data would come out in any case.

And so I wonder if by this quote, Bill actually agrees with me, but originally just overstated his case.

Having said all this, my overall impression is that James and I do, in fact, substantially agree, and that a large part of our disagreement stems from the fact that James used a test that doesn’t work, whereas I used a test that does work. James correctly concludes that you can’t disprove clutch hitting from his test, and I (believe I) correctly conclude that you can disprove a certain level of clutch hitting from my test.

James writes that “I take no position whatsoever about whether clutch hitting exists or does not exist.” But he does acknowledge that if clutch hitting exists, it must have a standard deviation that doesn’t even approach 30 points. My position is similar – I don’t know whether clutch hitting exists or not either—but I believe that if it does exist, the Cramer test simulations prove that the SD must be 10 points or less.

Our only large disagreement, I think, is that Bill argues very strongly, in absolute terms, that the Cramer method can’t work. I argue that the absolutist formulation is wrong. The Cramer method is as legitimate as any other statistical method. With enough data – exactly how much data depends on the size of the effect you’re looking for—the test is powerful enough to provide good evidence for the lack of the effect.

And now, the sales pitch

This discussion took place in SABR.  You almost missed it.  Even some SABR-ites will miss it – and that’s no fun.

SABR is a fantastic organization.  For the membership you get assorted journals, newsletters, mailing lists, use of ProQuest (which is H.G. Wells-ian time travel), statistical research, historical research and the opportunity to learn from nearly 7000 individuals who love baseball as much as you do.

Your spouse doesn’t understand your passion for Debs Garms?  Well, I guarantee you can find someone in SABR that will.

SABR is NOT about numbers.  For me that is a fantastic part, but it’s a small part. 

It’s about the history of uniforms.  It’s about the odd plays you find at Retrosheet (the home of SABR luminaries David W. Smith and Tom Ruane, and in the back of the store you’ll find many others sitting around the pot-bellied stove, whittling and discussing a great many things).  It’s about reminiscing about the 1983 White Sox, or the 1959 White Sox, the Go-Go Sox, and less reminiscing about, and more wondering about, the Hitless Wonders, the 1906 White Sox.  Every time I look at the 1983 season, I can only figure that Sox team was the “Go Wonder Sox”.  Plus 12 wins from 1982 and minus 25 wins in 1984.  But I digress.

SABR is about listening and learning.  The range of experts on baseball things – umpires, women in baseball, the third baseman on the $100,000 Infield, baseball poetry and prose – is covered because SABR is a collective.  Everybody shares because the ultimate goal is to make baseball knowledge available and documented.

Don’t get me wrong, membership has its privileges, but many things SABR are available to non-members (browse the site!) and it grows every day.

Have a grandfather that played?  You can contribute to the BioProject – an effort to get a short biographical entry on every player.  Don’t have a relative that played but know about a player that went to your high school?  You can contribute to the BioProject.  Just like reading about players and want to help?  You can contribute to the BioProject.

In the end, SABR is about loving baseball, and enhancing the quality of our knowledge of it.

Then there is the SABR Convention.  You get to hang out with people you always wanted to meet: me, Furtado, Forman, Mike Emeigh, Aaron Gleeman, Jon Daly, Dan Szymborski, Eric Enders, Hall of Merit’s Joe Dimino, Chris Jaffe, Anthony Giacalone, Mike Webber, Vinay, Rauseo, Burley, MGL, Bob T, Cyril Morong, Mark Stallard (just off the top of my head).

Then there are others, mostly individuals who write about baseball in some form, that would love to stand around and listen to your ideas on who the Blue Jays should trade for and why:

Rob Neyer, Alan Schwarz, David W. Smith, Tom Ruane, Tom Tippett, Scott Fischthal, Clay Davenport, Chris Kahrl, Maury Brown, Bill James, Phil Birnbaum, Jim Albert, Will Carroll, Clem Comly, Cliff Blau, Bill Nowlin, Dan Levitt.

And for me, this year in Toronto, I will get to meet Ron Johnson, a writer/analyst I greatly admire.  I can’t tell you how much that means to me.

Chris Dial Posted: June 26, 2005 at 01:01 PM | 48 comment(s)

Reader Comments and Retorts

Statements posted here are those of our readers and do not represent the BaseballThinkFactory. Names are provided by the poster and are not verified. We ask that posters follow our submission policy. Please report any inappropriate comments.

   1. misterdirt Posted: June 26, 2005 at 04:42 PM (#1431545)
This is an extremely interesting discussion between two of baseball statistics' bright minds, and I think both make valid points. I share Bill James' frustration with the Cramer method of "disproving the existence of clutch hitting," and yet Phil Birnbaum's criticisms of the flaws in James' analysis are equally valid. They would be great support if what Cramer was measuring by his correlation study was actually clutch hitting. Here is where I have a criticism of Cramer's study: I don't think it measures clutch hitting at all. If I am reading Cramer's study correctly, he took all of the hitters that showed a tendency to clutch hit in one year and correlated their tendency to clutch hit in the following year, and when he found a statistically insignificant correlation he concluded clutch hitting probably didn't exist.

This seems to me to be very poor methodology. It assumes that the observed tendency of clutch hitting shown in the first year was actually due to a player's true ability to clutch hit and not due to random variation. This is a particular problem due to the admitted (by both James and Birnbaum) small sample size of actual clutch hitting situations in a given year. If the population thus selected includes many players who aren't actually clutch hitters, but just got lucky that year, then a low correlation with the following year only proves that not everyone who APPEARS to be a clutch hitter actually is.

This may have been what James was trying to get at by his original article. And I guess my response to Birnbaum would be that Cramer's study was not well designed so a finding of no effect is, as James points out, meaningless in this case.

But instead of proposing a hypothetical study as James does in his response "Mapping the Fog", I would have proposed a different method. If I am remembering correctly back to long distant statistics courses, the best method of showing that something might exist, in this case clutch hitting, is to show that the assumption that it does not exist does not explain the data, i.e. disproving the null hypothesis. For clutch hitting this would mean setting up a study to examine each hitter's year by year performance in clutch situations versus his normal performance. If you could find hitters who consistently show an improved performance in clutch situations, then you would test to see whether those performances exceed what would be found by normal variations in performance if clutch hitting didn't exist at all. If they did, you would have successfully rejected the null hypothesis and would have to conclude that clutch hitting might be an explanation for those hitters' consistently improved performances in clutch situations. If your inference was later supported by being able to predict that those players would continue to show superior performance in clutch situations in the future, then you would have a powerful argument that clutch hitting does exist and also would be able to say who has it.
   2. dcsmyth1 Posted: June 26, 2005 at 05:32 PM (#1431651)
This whole 'fog' series is interesting, but in the end it's "much ado about little", IMO.
   3. dlf Posted: June 26, 2005 at 06:34 PM (#1431836)
Chris, I think the link to is broken.
   4. ChuckO Posted: June 26, 2005 at 06:37 PM (#1431846)
In my opinion, the whole discussion is "foggy" since neither man gave an operational definition of clutch. I have yet to see one that is convincing. I'm immediately suspicious of James' study because, in his model, he assumes that one-quarter of a player's at-bats are clutch situations. I don't buy it. That's way too many. For example, if a team gets ahead 2-0 in the first inning and the opposing team never threatens, there won't be a single clutch situation for the team that is ahead in innings two through eight, since my "fuzzy" definition of clutch is a matter of coming through when the team really needs it. Here's a thought experiment. Suppose you're in the dugout with a major league manager and suppose that, before each plate appearance, you ask him if this is a clutch at-bat. I'd guess that the manager would only say that there were a couple of clutch at-bats. In many games there would be none. Now, others may disagree with my definition of clutch, but until a widely agreed upon operational definition of a clutch hitting situation is constructed, I find these studies pointless.
   5. dlf Posted: June 26, 2005 at 06:47 PM (#1431881)
Chuck -- I think you missed the point of James' study. He isn't trying to prove or disprove "clutch" but instead approve or disapprove of the method of correlating year-to-year performance. He indicates that his model grossly overestimates how many people are clutch, how large the clutch effect is, and how often clutch situations exist. By intentionally doing so, he is attempting to measure whether Cramer's approach can find the elephant in the living room. Failing to see the elephant (in James' view; disputed by Birnbaum) means that Cramer's method can't find the mouse that may or may not exist.
   6. Chris Dial Posted: June 26, 2005 at 06:47 PM (#1431882)
thanks - fixed.
   7. GGC Posted: June 26, 2005 at 07:02 PM (#1431925)
Thanks, Chris. I'll have to bookmark this one to reread later.
   8. ChuckO Posted: June 26, 2005 at 09:21 PM (#1432335)
dlf -- I see your point, and perhaps I was unclear. I guess I get a little chagrined when people discuss clutch hitting when there isn't any agreed-upon operational definition of the concept. I personally am of the opinion that it's extremely difficult, if not impossible, to determine whether or not a clutch hitting ability exists because of the small sample problem. From my understanding of what constitutes clutch hitting, I don't believe that clutch hitting situations occur often enough to get a handle on them.
   9. Scoriano Flitcraft Posted: June 26, 2005 at 09:45 PM (#1432373)
Foggy notions aside, I agree with Chuck, the definitions have to be good for the studies to matter. And IMO they should be adjusted by the quality of the opposition in such situations. Context matters a lot given the small sample sizes.

I'd also be curious if anyone has studied or thought to study whether some players can be shown to underperform or overperform in non-clutch situations, e.g., when their teams are very unlikely to win the game (down five in the bottom of the 9th), or very likely to win (up five in the top of the 9th versus a weak-hitting team), perhaps using some sort of win probability, again adjusted for opposition, etc. I am not proposing that my definitions satisfy Chris, but just wondering aloud (I type loudly) about this sorta thing generally.
   10. Chris Dial Posted: June 26, 2005 at 09:46 PM (#1432376)
that's pretty much where I fall in that part of the discussion.
   11. Robert in Manhattan Beach Posted: June 27, 2005 at 04:49 AM (#1433712)
Clutch hitting is just the proxy here. The real issue is whether the year to year correlation method works. It's become the pet method for people trying to make a splash in the community, using it to claim the nonexistence of everything from control of balls in play to lefty mashing, you know the list.

As Mr. James tries to show here, a year to year correlation test will be inconclusive on even a fairly strong effect. This is well known among statistics profs and geeks, but apparently not among baseball analysts.
   12. WilyD Posted: June 27, 2005 at 11:14 AM (#1433860)
Problems you will encounter if you become a baseball analyst:

1. Your sample sizes will always be too small.
2. Even when your sample size is so large it would be victorious in the movie "Godzilla vs. the Sample Size", no one will believe you anyways, and traditionalists will want to spit on your neck.
3. Profits!!!!
   13. Tango Tiger Posted: June 27, 2005 at 12:07 PM (#1433868)
Scoriano, clutch data can be found here, so you can try to answer that question.


Andy Dolphin's clutch study lives in relative obscurity (for the moment... he's rewriting it for The Book).


The number of clutch PAs that we can all agree with is about 7%. This works out to about 50 PA per player for a 162 game season, so score one for James. (All based on LI).

You can lower the bar to an LI of almost 1.5, and that gives you 20% of all PAs. That would also be a decent clutch level.
   14. Mike Emeigh Posted: June 27, 2005 at 02:51 PM (#1434072)
As Mr. James tries to show here, a year to year correlation test will be inconclusive on even a fairly strong effect. This is well known among statistics profs and geeks, but apparently not among baseball analysts.

Right - which is one reason why baseball stat analysis has little credibility in the statistical community at large.

-- MWE
   15. Mike Emeigh Posted: June 27, 2005 at 03:01 PM (#1434090)
The other thing to consider is that clutch performance might well be defined as not performing worse in clutch situations. IIRC, in many so-called clutch situations, the typical player actually performs below his normal performance - so a player whose performance doesn't change, or whose performance gets worse but by less than the expected amount, might actually be a clutch player.

-- MWE
   16. Los Angeles Waterloo of Black Hawk Posted: June 27, 2005 at 05:36 PM (#1434408)
I will get to meet Ron Johnson

Recruit him to the world of Primates.
   17. Dufmeister Posted: June 27, 2005 at 06:17 PM (#1434488)
Why are we considering only year-to-year variation? If a player is "clutch", then there should be evidence over the whole career.

In one passage, James looks at the Musial 10 at-bat sequences. On any two consecutive 10 at-bat sequences, he could be, as Birnbaum relatedly notes, "++", "--", "+-" or "-+". If we look at an aggregate of "clutch" PAs over a career, perhaps some of the fog would lift.

This would force a definition of clutch situations, then a baseline of what all players do in such situations, then you can test individual players to determine if they had a clutch year or career. The obvious is that a clutch year can be a statistical anomaly.

Looking at performance versus a baseline for each year may show if player is consistently clutch, chokes or merely within statistical bounds.
   18. Mike Emeigh Posted: June 27, 2005 at 06:24 PM (#1434508)
Looking at performance versus a baseline for each year may show if player is consistently clutch, chokes or merely within statistical bounds.

This is what I think you really have to do in order to measure clutch performance - define the situation and define the baseline for performance, then compare players to the baseline.

But you still have to account for other possible effects, too - primarily quality of opposition. A Boston hitter would be likely to face Mariano Rivera in a fair number of ninth-inning clutch situations, where a St. Louis hitter would be likely to face Ryan Dempster (or, previously, LaTroy Hawkins or Joe Borowski) in a similar situation.

-- MWE
   19. bsball Posted: June 27, 2005 at 08:36 PM (#1434867)
Clutch chances are not just clutch for the batter, they are clutch for the pitcher and fielders as well (and sometimes the fans). It seems like these would also help to hide any clutch ability that a batter has. And, as MWE points out, the batter is more likely to be facing a good pitcher, or at least an unfavorable platoon situation and defensive replacements in the field, and so his expected performance is below what it is for a "normal" AB.

And I think curses are stronger in clutch situations, but I haven't tested that yet.
   20. Chris Dial Posted: June 27, 2005 at 08:59 PM (#1434921)
Recruit him to the world of Primates.

Oh, he's aware.
   21. Chris Dial Posted: June 27, 2005 at 09:03 PM (#1434937)
But you still have to account for other possible effects, too - primarily quality of opposition

As bsball points out, this is a very good point.

Mike has often commented (as he did in 15) that "normal" might be "below regular average", so "clutch" *could be* performing at one's average.

He's said that before, but the way it is stated here:
"the batter is more likely to be facing a good pitcher, or at least an unfavorable platoon situation and defensive replacements in the field, and so his expected performance is below what it is for a "normal" AB."

is clear and concise.

Excellent possibility that, to my knowledge, hasn't been discussed.

Now wrt the generic, but important point, if year-to-year correlations do not describe a "successful" methodology, what exactly does?
   22. misterdirt Posted: June 27, 2005 at 09:14 PM (#1434958)
In post 1 I described the design of a study where if significance were found would indicate clutch hitting did exist. The absence of significance wouldn't prove clutch hitting doesn't exist, that is a much more difficult task. Was I unclear in my earlier post?
   23. WilyD Posted: June 28, 2005 at 12:49 AM (#1435450)
I think also a lot of the statistical analysis gives a little too much credit to how much events in baseball can be viewed as isolated. The true extent is probly somewhere around 'largely', but it certainly falls short of 'completely'. And I can't imagine a study where you really 'account for everything'. Just trying to figure out how you do it could cause your eyeballs to turn black.
   24. Dufmeister Posted: June 28, 2005 at 02:49 AM (#1435744)
A year-to-year correlation may need a little explaining. In the original study, Cramer looked at 1969 and 1970. What if he looked at 1969-72?

Did a player tend to perform better, worse or average in clutch situations versus a standard over the study period? If you limit to looking at only two seasons or individual seasons, some random variation may take a Player A from +.010 OPS clutch player in 1969 into a -.001 OPS average player in 1970.

What if in 1971 and 1972 he is +.008 and +.005? In the Cramer study and by year-to-year correlation, this player shows no tendency for clutch from 1969 to 1970. Yet his yearly average clutch performance from 1969-72 is +.0055. Yes, I realize the study is for a spectrum of players and not a single player, but aren't we looking for the supposedly few Clutch performers?

This all comes back to deciding a standard, adjusting it for context (season, ballpark and whatever else you like), and looking at trends over significant samples of PA's and seasons.

I would throw out for discussion that the standard should be the league average OPS of any PA in the 7th or later where the ultimate result (a homerun) would increase the win expectancy significantly.

The situation would dictate the clutchness of the PA, while OPS+ in these situations would value the actual result of the PA.

There would need to be some resolution as to what significant change in win expectancy is and if that should change by inning.
   25. fret Posted: June 28, 2005 at 03:39 AM (#1435804)
the batter is more likely to be facing a good pitcher, or at least an unfavorable platoon situation and defensive replacements in the field, and so his expected performance is below what it is for a "normal" AB.

Not only that, but different types of batters may be affected in different ways by the pitching matchup. Andy Dolphin's study (linked in #13) finds that on the average, batters hit worse in clutch situations -- but the difference in performance between clutch and non-clutch situations is greater for sluggers than for singles hitters.

That could be interpreted to mean that singles hitters as a group are "clutch" and sluggers as a group are "non-clutch." But the real question is, does that difference disappear when you take into account the opposing pitcher? My guess is yes, but it could be the other way.

In any case, it would follow that linear weights-based measures overvalue sluggers and undervalue singles hitters by some small amount. Equivalently, low-SLG teams should outperform their pythag, and high-SLG teams should underperform it. I doubt that the size of the effect is very large, but it would be nice to see an estimate.
   26. fret Posted: June 28, 2005 at 06:59 AM (#1436071)
Equivalently, low-SLG teams should outperform their pythag, and high-SLG teams should underperform it.


I checked this out using the Lahman database, and any possible effect was totally overwhelmed by the noise. In fact, the (extremely weak) correlation went the opposite direction.

Let's see what the effect should be under Andy Dolphin's model. He writes:

clutch OBA - OBA = -0.007 - 0.10*(SLG-avgSLG),
clutch SLG - SLG = -0.017 - 0.11*(SLG-avgSLG).

If we use RC/AB = OBA*SLG, then approximately, for every additional 10 points of SLG, the gap between your overall RC/AB and clutch RC/AB increases by 0.001.

On average, there are 10 runs per win. Let's say that in the typical "clutch" situation, the equivalent is 5 runs per win. (In other words, the typical clutch situation is twice as important as the average.)

Now imagine that your SLG goes up by 10 points, while your OBA goes down so that you maintain the same RC/AB. By Andy Dolphin's definition, about 30% of at-bats are clutch. In those situations your RC/AB decreases by 0.0007. In the other 70% of situations your RC/AB increases by 0.0003.

With the weightings we have, the typical non-clutch situation should correspond to about 17 runs per win. So in those situations, your wins created per AB goes up by 0.00002. In clutch situations, your wins created per AB goes down by 0.00014.

Now apply this to a whole team with 5600 AB, of which 1680 are clutch and 3920 are non-clutch. If the team SLG increases by 10 points and the team OBA decreases to keep RC constant, total wins should drop by 0.16.

Or, you can look at it this way. If the league average SLG is .430, and your team slugs .370, you can expect to beat pythag by one game. If your team slugs .490, you can expect to underperform pythag by one game.
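The chain of arithmetic above can be retraced in a few lines of Python. This is only a sketch of one reading of the post: the harmonic weighting used to back out the non-clutch runs-per-win figure is an assumption about how the "about 17" was derived, and every variable name is made up for illustration.

```python
RUNS_PER_WIN_OVERALL = 10.0
RUNS_PER_WIN_CLUTCH = 5.0
CLUTCH_SHARE = 0.30          # share of at-bats treated as clutch

# Assumed weighting: wins per run must average out across situations, so
# CLUTCH_SHARE / 5 + (1 - CLUTCH_SHARE) / x = 1 / 10. Solve for x.
runs_per_win_non_clutch = (1 - CLUTCH_SHARE) / (
    1 / RUNS_PER_WIN_OVERALL - CLUTCH_SHARE / RUNS_PER_WIN_CLUTCH
)

team_ab = 5600
clutch_ab = team_ab * CLUTCH_SHARE
non_clutch_ab = team_ab - clutch_ab

# Per-AB changes in RC for +10 points of SLG at constant overall RC/AB,
# taken from the figures quoted in the post.
rc_drop_clutch = 0.0007
rc_gain_non_clutch = 0.0003

wins_change_per_10_points = (
    non_clutch_ab * rc_gain_non_clutch / runs_per_win_non_clutch
    - clutch_ab * rc_drop_clutch / RUNS_PER_WIN_CLUTCH
)

# A team 60 points below league-average SLG (.370 against .430).
wins_gain_low_slg_team = -wins_change_per_10_points * 6

print(f"non-clutch runs per win: {runs_per_win_non_clutch:.1f}")
print(f"wins change per +10 SLG: {wins_change_per_10_points:.3f}")
print(f"expected edge over pythag at -60 SLG: {wins_gain_low_slg_team:.2f}")
```

Without the intermediate rounding, the per-10-points figure comes out slightly larger in magnitude than 0.16, but it still works out to roughly one game for a 60-point gap in SLG.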

So (unless I messed up the math, which is very possible) this is the prediction. Now let's look at the data.

From 1920-2004, the worst 10% of teams in slugging (a total of 177 teams) had an aggregate SLG about 40 points worse than the league average. (Figures are crudely park-adjusted.) We would expect those teams to beat pythag by an average of 0.004 points (which is .64 wins in 162 games). In fact, they trailed pythag by an average of 0.00004 points.

The top 10% of teams in slugging had an aggregate SLG about 40 points better than the league average. We would expect those teams to trail pythag by an average of 0.004 points. In fact, they trailed it by an average of 0.0007 points.

So, if Andy Dolphin's model is correct, we should see an effect that in fact is not there.

Either that, or I made a mistake in the calculations. :)
   27. Chris Dial Posted: June 28, 2005 at 11:54 AM (#1436094)
that's very interesting.

Andy Dolphin's study (linked in #13) finds that on the average, batters hit worse in clutch situations -- but the difference in performance between clutch and non-clutch situations is greater for sluggers than for singles hitters.

First let me re-iterate that Andy's work is excellent, and I'm very impressed by his update due to peer review input (even though results are unchanged).

Now, Andy posits in his article that power hitters "swing for the fences" and thus slug lower.

How about this: they strike out more.

Singles hitters put the ball in play, generating more SFs, ROEs, FCs. I guess that wouldn't appear in their RC/AB, but it could help outproduce their Pythags.

I think singles-hitting teams won't outpace Pythags because they have more IF singles, which produce less baserunner advancement, and power-hitting teams will because they'll have more baserunner advancement (than the average of the two sets of teams).
   28. Chris Dial Posted: June 28, 2005 at 11:55 AM (#1436095)
Now - back to yr-to-yr correlations:

One thing that has been dismissed (IIRC) by MGL is that there aren't any real hitters that destroy LHP. More that there is a significant platoon advantage.

I think this was based on yr-to-yr correlations - am I misremembering?
   29. Mike Emeigh Posted: June 28, 2005 at 03:53 PM (#1436375)
Now wrt the generic, but important point, if year-to-year correlations do not describe a "successful" methodology, what exactly does?

1. Define the situation.

2. Define an expected performance baseline for the situation (appropriately adjusted for park and so forth).

3. Look for hitters who consistently perform at a level above or below the baseline over a period of years. That identifies the subset of hitters who could potentially be defined as clutch/choke.

4. For each hitter in the set of clutch/choke hitters, determine whether there may be other explanations that fit the data.

I think we tend not to concern ourselves with item 4, but I think it's important in any study to consider whether there may be other explanations for the pattern that you see.
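Steps 1 through 3 above could be sketched roughly like this. The data, the hitter names, the three-season minimum, and the "same side of the baseline every season" rule are all assumptions made for illustration; the steps as written do not specify how "consistently" should be defined.

```python
# Sketch of steps 1-3: flag hitters who land on the same side of their
# expected clutch baseline in every season. All data below is made up.

def consistent_performers(seasons_by_hitter, min_seasons=3):
    """seasons_by_hitter maps a name to a list of (actual, expected) pairs,
    one per season. Both values are clutch batting averages, and `expected`
    is the already-adjusted baseline from step 2."""
    flagged = {}
    for name, seasons in seasons_by_hitter.items():
        if len(seasons) < min_seasons:
            continue  # too little history to call anything consistent
        diffs = [actual - expected for actual, expected in seasons]
        if all(d > 0 for d in diffs):
            flagged[name] = "clutch candidate"
        elif all(d < 0 for d in diffs):
            flagged[name] = "choke candidate"
    return flagged

sample = {
    "Hitter A": [(0.310, 0.280), (0.295, 0.275), (0.301, 0.282)],
    "Hitter B": [(0.250, 0.270), (0.262, 0.268), (0.240, 0.265)],
    "Hitter C": [(0.290, 0.280), (0.260, 0.281), (0.285, 0.279)],
    "Hitter D": [(0.320, 0.270)],
}
print(consistent_performers(sample))
```

Step 4 is not automated here. With only three seasons, a hitter with no clutch tendency at all would land on the same side of the baseline every year about a quarter of the time by chance, so anything flagged is only a candidate for closer inspection.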

-- MWE
   30. Xenophon Posted: June 28, 2005 at 06:00 PM (#1436589)
One thing that has been dismissed (IIRC) by MGL is that there aren't any real hitters that destroy LHP. More that there is a significant platoon advantage.

I think this was based on yr-to-yr correlations - am I misremembering?

It may have been the basis in part. But when Karros was signed, MGL went so far as to say he hadn't run into a sample size large enough where a RHB's platoon split had predictive value.
   31. Robert in Manhattan Beach Posted: June 28, 2005 at 06:07 PM (#1436606)
I think this was based on yr-to-yr correlations - am I misremembering?

That was the foundation of his study, and the number one reason his conclusions shouldn't be taken seriously.
   32. Robert in Manhattan Beach Posted: June 28, 2005 at 06:26 PM (#1436653)
I want to add that Mr. James is incorrect when he says:

But if a clutch hitting ability existed on anything remotely approaching that scale, Stevie Wonder could find it. If a clutch hitting ability existed on anything like that scale, we wouldn’t be having this discussion.

If the standard deviation of clutch ability was 30 points, there would be a very significant number of players who hit 50 points better in clutch situations, throughout their careers.
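As a rough check on that claim, here is a minimal sketch that assumes true clutch ability is normally distributed around zero. The normal shape is an assumption; nothing in the discussion specifies the distribution.

```python
from statistics import NormalDist

# Assumption: true clutch ability is normally distributed around zero.
clutch_sd = 30      # points of batting average, the figure under discussion
threshold = 50      # "50 points better in clutch situations"

share = 1 - NormalDist(mu=0, sigma=clutch_sd).cdf(threshold)
print(f"Share of hitters with true clutch ability above +{threshold} points: {share:.1%}")
```

Under that assumption the share comes out a little under 5 percent, which across several hundred regulars would be a few dozen hitters.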

With respect to MGL's platoon advantage work, there are a very significant number of players who performed far better than a standard platoon advantage over their career. (Reggie Sanders is working on his 14th straight season, check the thread for a nice list of active players) But yet we still had the discussion, and many people accepted his work as correct.

I suppose it's possible, although very unlikely IMO, that he's correct, but using the statistical methods he used (year to year correlations being chief among them), there was simply no way to reach the conclusions he reached with any degree of certainty.
   33. Xenophon Posted: June 28, 2005 at 11:12 PM (#1437208)
Reggie Sanders is working on his 14th straight season, check the thread for a nice list of active players

No, he isn't. Check his 2004 season again.
   34. Robert in Manhattan Beach Posted: June 28, 2005 at 11:52 PM (#1437360)
No, he isn't. Check his 2004 season again.

Yup, you're right. I guess that broke the string. Too bad, that was a fun stat to have around.
   35. Dag Nabbit at Posted: June 29, 2005 at 06:16 PM (#1439178)
Then there is the SABR Convention. You get to hang out with people you always wanted to meet: me, Furtado, Forman, Mike Emeigh, Aaron Gleeman, Jon Daly, Dan Szymborski, Eric Enders, Hall of Merit’s Joe Dimino, Chris Jaffe, Anthony Giacalone, Mike Webber, Vinay, Rauseo, Burley, MGL, Bob T, Cyril Morong, Mark Stallard (just off the top of my head).

IIRC, Forman said he can't go. The wife's expecting. But others are going or intend to go. Larry M., and . . . um, others.
   36. Too Much Coffee Man Posted: June 30, 2005 at 06:26 PM (#1441920)
A couple of points. Mike is probably right, in many instances, if there IS clutch hitting, it is hitting to expected levels of performance as others drop off. I believe that there is research in other fields that shows that instead of athletes performing better in the clutch, it is the case that most athletes perform worse, and some perform normal (e.g., hitting free throws).

Second, wanted to share this. With respect to "clutch hitting," we're really interested in "ability to perform in the clutch," a latent or hypothetical variable that we (or more precisely those studying it) operationalize by a measured variable - batting average in certain situations. This is a pretty good proxy, but it's not perfect. A sacrifice fly or "productive out" could be a good outcome, and a "clutch performer" could slam the ball over the fence only to have Andruw Jones jump and bring it back in play. The point is that if we want to understand the relationship between Clutch Performance Y1 and Clutch Performance Y2, we're dependent on the correlation between two measures BA-Y1 and BA-Y2. Assuming that there's a true relationship between the latent variables, that true relationship will be attenuated by unreliability in both our batting average variables. This is a point that James tries to make when he talks about the accuracy of the correlation. You could estimate the population correlation (rho) if you knew reliabilities of the batting averages:
Rho = (obtained r) / ((sq-root rxx)*(sq-root rxx))

Where rxx is the reliability of the batting average measure. I am not sure how you'd estimate it, but, we could, for example, look at the consistency of monthly or yearly batting averages of well-established hitters.

Here's one way of understanding this attenuation effect. Let's say I KNOW that the accuracy of predicting batting average in clutch situations in year 2 from batting averages in year 1 is 40% (whatever that means). By analogy, let's say I'm standing on a spot (year 1) and throwing darts at a target (year 2) and I hit it 40% of the time. Now suppose my target is on a platform that wobbles, caused by unreliability (of measurement in the analogy). Even though I'm every bit as accurate, I am not going to hit 40% because the target wobbles. Now imagine that I'm on a platform that also wobbles (because of measurement unreliability on my end). So, my accuracy drops even more.
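The wobbling-platform picture can be simulated. This is a sketch under stated assumptions: the "40%" is treated as a latent year-to-year correlation of .40, skill and measurement noise are both normal, and the reliability of each season's measure is set to .70. None of those values comes from real data.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate(true_r=0.40, reliability=0.70, n=20000, seed=1):
    rng = random.Random(seed)
    # Noise variance chosen so that var(true) / var(observed) = reliability.
    noise_sd = math.sqrt((1 - reliability) / reliability)
    skill1, skill2, obs1, obs2 = [], [], [], []
    for _ in range(n):
        s1 = rng.gauss(0, 1)
        # Year-2 skill correlates with year-1 skill at true_r.
        s2 = true_r * s1 + math.sqrt(1 - true_r ** 2) * rng.gauss(0, 1)
        skill1.append(s1)
        skill2.append(s2)
        obs1.append(s1 + rng.gauss(0, noise_sd))   # the thrower's platform wobbles
        obs2.append(s2 + rng.gauss(0, noise_sd))   # the target's platform wobbles
    return pearson(skill1, skill2), pearson(obs1, obs2)

latent_r, observed_r = simulate()
print(f"latent r = {latent_r:.3f}, observed r = {observed_r:.3f}")
```

With these settings the observed correlation should land near .40 x .70 = .28, which is the shrinkage the formula above is meant to undo.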
   37. The Hammer Posted: June 30, 2005 at 07:28 PM (#1442260)
the batter is more likely to be facing a good pitcher, or at least an unfavorable platoon situation and defensive replacements in the field, and so his expected performance is below what it is for a "normal" AB

IMO these factors are undervalued. A late-inning clutch at-bat will commonly find a batter facing a relief monster/LOOGY with a full set of defensive replacements in place behind him.

Since comparisons are being made based on results, the quality of pitching and the quality of the defense are important variables to be considered in this sort of study IMO.
   38. Tango Tiger Posted: June 30, 2005 at 08:28 PM (#1442540)

The variance of the observed is equal to the sum of the variance of the true of everything that exists plus the error based on the binomial.

In your case, var(observed BA)
= var(hitters' true BA)
+ var(pitchers' true BA)
+ var(parks' true BA)
+ var(fielders' true BA)
+ var(base/out state true BA)
+ .. whatever else
+ var(luck)

From the perspective of the batter, the variance of the pitchers is typically zero. That is, the kind of pitcher a batter faces is pretty random. However, in the case of clutch situations, that wouldn't be the case. Nonetheless, it will be pretty small. All those variables will be pretty close to zero.

What you are left with is:
= var(hitters' true BA)
+ var(luck)

You know var(BA), and you know var(luck). You solve for the remaining variable. You can knock it down slightly for all the other terms that I set to zero.
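A minimal sketch of that last step, assuming every hitter has the same number of at-bats and that var(luck) is the binomial variance p(1-p)/n. The input numbers are hypothetical.

```python
import math

def true_talent_sd(observed_sd, mean_avg, at_bats):
    """Solve var(observed) = var(true) + var(luck) for the spread in true
    talent, with var(luck) taken as the binomial variance p(1-p)/n."""
    var_luck = mean_avg * (1 - mean_avg) / at_bats
    var_true = observed_sd ** 2 - var_luck
    # If luck alone explains the observed spread, report zero rather than
    # taking the square root of a negative number.
    return math.sqrt(max(var_true, 0.0))

# Hypothetical inputs: hitters averaging .270 over 500 AB each, with an
# observed standard deviation of 30 points.
print(round(true_talent_sd(0.030, 0.270, 500), 4))
```

The smaller the sample, the larger var(luck) becomes. With clutch at-bats numbering only in the low hundreds, luck can account for most or all of the observed spread.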
   39. Tango Tiger Posted: June 30, 2005 at 08:29 PM (#1442547)
There's also a set of covariance terms, which in all likelihood will also be close to zero.
   40. Richard Gadsden Posted: June 30, 2005 at 09:38 PM (#1442767)
Couple of points here: The Dolphin study suggests that better players tend to have worse clutch performance.

This is what we'd expect because of the selection effect of only studying the major leagues - average players who choke are not going to spend long enough in the majors to get the at-bats to get into the Dolphin study (which required 1250 PAs minimum). Average players who hit in the clutch are going to get more PAs and so are more likely to get into the study. Good players will get those PAs even if they're not able to hit in the clutch.

Of course, this assumes that ML managers, GMs and scouts have a meaningful ability to detect clutch ability. Actually, that's probably measurable.
   41. Too Much Coffee Man Posted: July 01, 2005 at 01:34 AM (#1443134)
I'm not sure we're talking about the same thing, but we could be.
rxx (actually, r with a sub xx) is the reliability of the measurement.
It's also equal to the ratio of true score variance to observed score variance,
or var(hitters' true BA)/var(BA)

So, you're right you could solve for the missing term and then calculate rxx.

I'd suspect that you could supply the missing values a lot quicker than I could.

Let's assume though that the reliability is .70 (70% of the variance in observed batting averages in clutch situations is due to underlying skill).

If you found a correlation between two years of .20, the true relationship between skill levels would be
.2/[sqr(.7)*sqr(.7)] or .29,
still relatively small, but better.
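The same correction as a small function. This is a sketch: the .70 reliability is the assumed figure from above, and the function name is made up for illustration.

```python
import math

def disattenuate(observed_r, rxx, ryy=None):
    """Correction for attenuation: observed r over sqrt(rxx) * sqrt(ryy).
    If only one reliability is given, it is used for both seasons."""
    if ryy is None:
        ryy = rxx
    return observed_r / (math.sqrt(rxx) * math.sqrt(ryy))

print(round(disattenuate(0.20, 0.70), 2))
```

The correction grows quickly as reliability falls: at a reliability of .40, the same observed .20 would imply a true relationship of .50.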

My point (and one of James') is that the use of a variable like BA in clutch situations will contain some error, and that error further obscures the likelihood of finding a relationship when one exists.
   42. Too Much Coffee Man Posted: July 01, 2005 at 01:42 AM (#1443147)
There are a couple of other suggestions I'll toss out for how this could be handled, which someone with more time (and, in the latter case, access to sophisticated software) could follow up on.

One is the use of banding techniques to study this. Banding is a tool used by organizations - usually cities - to examine relationships between selection tests - like a personality test or interview - and measures of job performance later on. They'd like to know if the test predicts performance - similar to the year-to-year correlation problem, but they recognize that at least one of the measures - performance - is measured imperfectly. Banding builds on the standard error of one or both variables to create bands around scores that are considered "equal". In this case, it seems a compromise between the 1/-1 system of James and the point predictions of Birnbaum. For example, you might wind up concluding that if someone hit .320 in clutch situations in year 1 and hit anywhere between .300 and .340 in year two, their performance was "identical."
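A rough sketch of the banding idea. It uses the binomial standard error of a batting average as the band width, which is an assumption; banding in practice is usually built on the standard error of measurement or of the difference between two scores. The at-bat count is hypothetical.

```python
import math

def band(avg, at_bats, z=1.0):
    """Return (low, high): avg plus or minus z binomial standard errors."""
    se = math.sqrt(avg * (1 - avg) / at_bats)
    return avg - z * se, avg + z * se

def same_band(year1_avg, year2_avg, at_bats, z=1.0):
    """True if the year-2 average falls inside the band around year 1."""
    low, high = band(year1_avg, at_bats, z)
    return low <= year2_avg <= high

# Hypothetical: a .320 clutch average over 540 at-bats gives a band of
# roughly .300 to .340 at one standard error.
low, high = band(0.320, 540)
print(f"{low:.3f} to {high:.3f}")
print(same_band(0.320, 0.305, 540), same_band(0.320, 0.290, 540))
```

A single season of clutch at-bats would be a much smaller sample than 540, so a band built this way would be considerably wider than the .300 to .340 example.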

The second method is latent growth curve analysis. It's a pretty sophisticated statistical technique that's being used by social scientists and economists to study relative changes over time in performance. It allows you to model not only linear effects, but cubic and quadratic ones as well. For example, you might predict that persons with average to above average "clutch skills" show no strong relationship year to year, but persons who are very strong or very weak are very consistent. LGC analysis would allow you to set up and test such a model w/ the type of data that we're talking about.
   43. Slivers of Maranville descends into chaos (SdeB) Posted: July 11, 2005 at 01:01 PM (#1463360)
Two points. First, I think we are dealing with two different questions.

The first question is: does clutch hitting exist?

Now, James may or may not be right that a study like Cramer's is not adequate to analyze the question. More importantly, one cannot prove the non-existence of something. Rather, we should ask: what is the evidence for the existence of clutch hitting? And we must then admit that the evidence at present is pretty weak.

The second question is: given the state of our knowledge about clutch hitting, how should it affect in-game decisions?

Here the answer is easier. Since we really don't know anything, it should not affect in-game decisions.
   44. Neal Traven Posted: July 13, 2005 at 11:37 PM (#1470040)
Hey Chris, you've produced a wonderful advertisement for SABR. I hope whichever Primates join up because of your testimonial will mention your name when they do ... maybe you'll get some nifty t-shirt or a SABR-logo cap for bringing in so many!

I'd like to reiterate how excellent a resource the Statistical Analysis Committee email list has become in the mere months that it's been in operation. All kudos go to Dan Levitt for finally putting into operation what a lot of us had long been saying we should do but never seemed to get around to.

Unfortunately, Jim Albert's potential contribution to the issue kinda got lost in the Bill-Phil conversation. Jim's a real statistician, so perhaps he could have helped sort out James's epistemological difficulties with "not proof of anything" versus "proof of not anything" or whatever. As Bill often says, he has little in the way of formal statistical skills ... when he tries to deal with the deeper aspects of inference and hypothesis testing, it shows.

Finally, I can understand why your top-of-the-head list of SABRites didn't include me. Were I compiling such a list, I probably wouldn't include myself either. I've rather fallen behind in baseball-related activities over the past couple of years, as larger issues of national and global (as well as block-by-block, precinct-by-precinct, etc.) impact have consumed my attention. BP didn't come around asking me for an article to introduce the annual STATLG-L Hall of Fame balloting, but I didn't go to them offering to write it either ... so it just didn't happen. Doesn't seem like too many people were broken up about its absence, however; I don't think I received more than a couple of messages asking what happened to it.

None of this means I'll skip the convention or anything, of course. Along with the great days and nights of baseball talk, it'll be refreshing and relaxing to get out of the country for a few days. See you there!
   45. Chris Dial Posted: July 19, 2005 at 02:22 AM (#1482076)
I don't think anyone thinks to add that they joined "because of my recommendation".

And I didn't mean to leave you out. I'm sure you know the list could go on and on. I left out Rod Nelson too.

For some reason, I think BTF and STATLG-L don't overlap too much. Dunno why. Maybe we can get something other than Yankee/Red Sox fans on there.

We would love for you to write up and draw the vote for STATLG-L HoF here. Unless BPro has some proprietary rights to it.

We have the Hall of Merit, and I hope you are a regular contributor there - you have a ton of knowledge for that group (Me, I'm not that smart).

But Jim did send something and I'll add it (although I should have added it before).
   46. Chris Dial Posted: July 19, 2005 at 02:23 AM (#1482083)
Jim Albert sent me a note regarding this discussion, and with his permission, I am adding it to the discussion (albeit belatedly):

"Since there seems to be a lot of discussion on statistical issues (Bill
James article and Phil Birnbaum's response), I think I should add some

When I do statistical work in baseball, I think of plausible simple models
for data. Based on my earlier work and work by others, I believe that there
is limited evidence for clutch ability in hitting. So I would start with a
model or hypothesis that says that the probability that a player gets a hit
(or some other batting measure) doesn't change across non-clutch or clutch
situations. Then I think of some way of detecting this clutch effect. One
way, as Cramer suggests, is to look at the correlation between clutch
effects for two seasons. I will reject my "no clutch effect" model if the
chance of observing this correlation is very small assuming my model.

What if I don't reject my model -- does this mean that there is no clutch
ability in baseball? NO! It just means that there is insufficient evidence
to reject my model. There is insufficient evidence for a couple of possible
reasons:

1. Maybe I'm using a poor measure of the clutch effect so my test has
little power to pick up clutch ability.

2. Maybe there is a clutch ability, but it is a small effect that is
difficult to pick up.

To be honest, I think there is much agreement between James and Birnbaum.
By simulating data assuming some clutch ability, they are learning about the
power of procedures to pick up this effect. If clutch ability does exist,
it probably only exists for a small proportion of players and the size of
the effect would be small. I agree with James that Birnbaum begins with
some ridiculous assumptions about the size of this clutch effect and that
makes his article less persuasive.

Personally, I am not that interested in clutch ability as defined in these
articles. As Bill James said in some earlier work, why should a
professional hitter do better in clutch as opposed to nonclutch situations?
When one says A-Rod is a great clutch hitter, I would rephrase this to say
that A-Rod is a great hitter who performs well in all situations, clutch and
nonclutch.

To sum up, in statistics, we use models that are not true, but are
reasonable approximations to the data that we see. The model of "no clutch
ability" may not be exactly true, but you can use it to predict baseball
performance that mimics real data. The question "is there clutch ability?"
isn't that important if the size of the clutch effects are small.

Jim Albert"
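The test Albert describes, rejecting the "no clutch effect" model when the observed year-to-year correlation would be very unlikely under it, can be approximated with a permutation test. This is a swapped-in technique for illustration, not necessarily the procedure Albert or Cramer used, and the clutch-effect numbers below are invented.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(year1, year2, trials=5000, seed=0):
    """Share of random re-pairings whose correlation is at least as large
    as the one actually observed (one-sided). Re-pairing hitters at random
    mimics a world where year 1 tells you nothing about year 2."""
    rng = random.Random(seed)
    observed = pearson(year1, year2)
    shuffled = list(year2)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if pearson(year1, shuffled) >= observed:
            hits += 1
    return observed, hits / trials

# Made-up clutch effects (clutch minus non-clutch average, in points).
year1 = [12, -8, 25, -15, 3, -20, 18, -5, 9, -11]
year2 = [-4, 10, 6, -9, -14, 2, 11, -17, 20, 5]
r, p = permutation_p_value(year1, year2)
print(f"r = {r:.2f}, p = {p:.3f}")
```

As Albert notes, a large p-value here would only mean the data fail to reject the model. It says nothing about how much power the test had to detect a small effect.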
   47. Dizzypaco Posted: July 19, 2005 at 04:25 PM (#1483084)
The last sentence of Jim Albert's contribution may not be true. He states, "The question 'is there clutch ability' isn't that important if the size of the clutch effects are small."

This would be true if the clutch effect were small for every player. However, this isn't what people claim to be the case: they say that a few people have a significant ability to hit in the clutch, while most others have no ability one way or another. If this last statement is true, it would have important implications for in-game management.

Let's assume for a moment that this is true - there are a few players that have a rather large ability to hit in the clutch while most players do not have this ability. As James notes, using the Cramer method would most likely fail to identify this effect.

If a few players do have this ability (which has been neither proven nor disproven), then it would be very helpful information in game situations, and therefore, very important. I understand that there would be a real question about whether a perceived clutch hitter actually had this ability, but I would still argue that the possibility that a player has an ability to hit in the clutch, backed by a history of doing well in that situation, can be helpful in making decisions.

Let's say there was a player who after 10 years in the Majors was hitting .300 in clutch situations, however you define them, but only .270 in other situations. In deciding whether to use that player in a clutch situation, would you assume that he is a .270 hitter or a .300 hitter? Given that we have no definitive evidence that clutch hitting exists at all, some people would say we should assume he is a .270 hitter. I would disagree - I think it would be reasonable to suggest that he might be better than that, and that would have an effect on whether to use him.
   48. Phil Birnbaum Posted: July 21, 2005 at 04:09 AM (#1487718)

I agree. If there are only a few players with clutch ability, the Cramer test is nearly worthless. You have to do a different kind of test, like Pete Palmer's (see page 6), or Tom Ruane's update of Palmer's.

Phil Birnbaum
