Primate Studies: Where BTF's Members Investigate the Grand Old Game

Thursday, February 12, 2004

Breaking the Law of Averages

Jim outlines how we can improve on statistics based on averages.

"You know what the difference between hitting .250 and .300 is? It’s 25 hits. Twenty-five hits in 500 at-bats is 50 points…. That means if you get one extra flare a week, just one, a gork. You get a ground ball. You get a ground ball with eyes. You get a dying quail. Just one more dying quail a week and you’re in Yankee Stadium." - Crash Davis
So-called "averaged," rate, or ratio statistics such as Ave, OBP, SLG, ERA, K:BB, etc. dominate the baseball statistics landscape to such a large degree that they are often viewed as a player’s ticket to "the show." In the following piece I will argue that in baseball these measures are inappropriately applied and statistically flawed, and I will introduce rather simple residualized measures that are more valid, flexible, and statistically meaningful alternatives. Residualized measures have already been adopted successfully in other research fields where ratio statistics have been misapplied (e.g., social and behavioral sciences). Perhaps this introduction will help residuals find their place on the landscape of baseball research as well.

What are residualized measures?
A residualized score is, basically, a measure of how far a performance is above or below some prediction. In other words, if we predict a player’s performance using an average score or a replacement-level value and compare that prediction with what he actually did, we have produced a type of residual score that represents how far that player’s performance deviated from the typical performance. The residualized measures presented here result from using multiple regression methods to make precise predictions. Specifically, we regress some variable (e.g., the numerator of an averaged stat, like Hits) upon a predictor variable (e.g., the denominator of an averaged stat, like ABs), which in turn produces a prediction line. When we evaluate where and by how far the actual numerator value is mispredicted by this prediction line, a precise "residual" score remains.
Throughout the history of baseball, hits, walks, total bases, strikeouts, etc. and their respective opportunity statistics have been collected in the hope of evaluating performance. There are a few different defensible ways of dealing with, for example, Hits and AB data (with a non-defensible way being subtracting ABs from Hits): (1) use Hits alone, which is great for some evaluative purposes, but does not remove the influence of opportunity (e.g., more opportunities = more hits); (2) standardize Hits by dividing by AB, as is done for most common and uncommon statistics; or (3) standardize Hits by regressing it onto ABs. Since we are all familiar with the first two options, this piece will present the third option in depth and discuss why it represents a statistically sound and simple alternative to averaging.

Let’s illustrate this procedure by standardizing the 2003 season data by regressing Hits onto ABs, or differently stated, predicting Hits from ABs. This process spits out a single prediction line that best captures the relationship: predicted Hits, or ^Hits = -6.5 + .29(AB). The slope term (.29) can be interpreted as meaning that for every AB the number of predicted Hits will increase by .29, and the y-intercept term (-6.5) is the expected number of Hits with 0 ABs. Next, we can employ this equation to predict Hits from some number of ABs by simply plugging that AB number into the equation. Where our actual Hits score deviates from this numerical prediction, a residual score remains. The scatterplot below displays Hits and ABs data for 2003 and their prediction line. Residual scores are represented by the vertical distance that each point differs from the line.

The calculation of a residualized score is easier than it appears. You don’t need to understand multiple regression to calculate a residual; you need only find the appropriate formula (examples below) and plug in the appropriate number (e.g., ABs).
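As a minimal sketch of that plug-in step, the prediction line can be wrapped in two small functions. The function names are hypothetical, and the intercept is taken as -6.5, the value consistent with the article's own arithmetic (a prediction of 22.5 Hits for 100 ABs); coefficients for other seasons or other statistics would have to come from their own regressions.

```python
def predicted_hits(ab, intercept=-6.5, slope=0.29):
    """Predicted Hits from the 2003 prediction line quoted in the article."""
    return intercept + slope * ab


def residual_hits(hits, ab, intercept=-6.5, slope=0.29):
    """rHits: actual Hits minus predicted Hits for the same number of ABs."""
    return hits - predicted_hits(ab, intercept, slope)


# A batter with 30 hits in 100 ABs (a .300 hitter).
print(predicted_hits(100))
print(residual_hits(30, 100))
```

A positive result means more hits than the line predicts for that many ABs; a negative result means fewer.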
For example, let’s calculate a residual Hits score for a player that had 30 hits in 100 ABs during the 2003 season (i.e., a .300 hitter). Using our ^Hits equation we find that he should have -6.5 + .29(100) = 22.5, or 22.5 Hits. Where this ^Hits score deviates from the player’s actual Hits score of 30, we have a residual score: residualized Hits or rHits = 30 - 22.5 = +7.5. Overall, our example player had 7.5 more hits than we would have predicted given his ABs. Here is a brief table of prediction equations from 2003 season data that can be used to find expected values for batting statistics.
There are two notably important features of residualized scores: (1) a set of residualized scores will always have a mean of zero, and (2) residuals will always be uncorrelated with their original denominator (or predictor) statistic. The first point increases the intuitive nature of the statistic because it reveals positive scores of some magnitude for players that perform above expectations, negative scores for those below expectations, and scores at or near zero for those who perform at the expected level. The second point is important because it indicates that we’ve likely removed the influence of the denominator or the opportunity stat. That is, we’ve appropriately standardized the measure in order to evaluate and compare performances divorced from total number of opportunities to perform. This is noteworthy because if we find that a reasonably strong correlation exists between the new measure and its original denominator term, we must conclude that we’ve compromised the meaning of the new statistic as an arguably opportunity-free or standardized metric. For example, if a positive correlation is found between Ave and ABs (and this correlation actually does exist), the relative meaning of, say, a .280 Ave will change based upon ABs: a .280 Ave with 100 ABs will be relatively more impressive than a .280 Ave with 600 ABs.

Comparing averaged to residualized statistics
In this piece I will spend less time restating the many arguments against averaged, rate, and ratio statistics (called "averaged" from this point forward) that have been presented quite cogently in the past (e.g., not defense-independent, park factors, luck, etc.) and emphasize comparisons between averaged statistics and residualized measures using the following non-exhaustive criteria:
Intuitive nature of the statistic
Averaged statistics have an immediate appeal unlike many other statistics. This may be due to their ubiquity in our day-to-day lives (e.g., cooking recipes, budgets) or their mathematical simplicity. Averaged statistics are used in arenas beyond baseball, like the arts (e.g., perspective), the sciences (e.g., means), and the rest of the sporting world (e.g., free-throw % in basketball, yards per rush in football). Needless to say, residualized measures do not share the same place in our common vocabulary, although the idea of comparing a performance to our expectations is far from a foreign concept.
In baseball, few statistics are as intuitively appealing as Ave. For most baseball fans and little leaguers alike, Ave is probably the first statistic that we understand and master. Because this prototypical averaged measure has such intuitive appeal and clarity, baseball statisticians and researchers have modeled many other statistics after the simplicity of Ave (e.g., OBP, SLG). In fact, it’s not uncommon to find baseball researchers attempting to translate unintuitive or untraditional measures into a more digestible form by scaling them to appear like Ave! Although these newer statistics share a general metric with Ave, they often lack Ave’s intuitive appeal. Let’s use SLG as an illustration: Do many fans actually comprehend what total bases per at-bat means? They may be able to surmise that a base per AB (SLG = 1.000) is good, but are clueless about the bounds of a great, decent, or bad SLG. Residualized measures avoid some of these problems by having three primary and easily interpreted results: above-prediction (positive numbers), below-prediction (negative numbers), and at-prediction scores (near-zero numbers).
Another detractor from the intuitive appeal of averaged measures is that none of these stats are expressed in their natural or numerator units like Hits, Ks, etc. Averaged measures have sacrificed the intuitive appeal of their original units to gain the familiar and standardized appeal of an average like Hits per AB, Ks per Inn, etc. Residualized measures are more complex in their calculation when compared to averaged measures, yet residuals have not had to sacrifice their natural units. Indeed, rHits is still strictly interpreted as a number of Hits. Because they retain their natural units, residualized measures should be as intuitively appealing as the original metrics from which they were born. The only caveat with these measures (and it can be a big impediment to intuitive appeal) is that we must qualify our natural units statement as meaning the number "above or below expectation given opportunities to perform." This qualification can be cumbersome, but in return we retain the intuitive charm of a measure in its natural units.

Ease of calculation
Of all averaged statistics in baseball, Ave has to be the easiest to calculate as it employs two easily found terms in its composite (i.e., Hits & AB). In fact, at the low end of the AB continuum (e.g., <50 AB) one can easily calculate a batting average without need of a calculator. At the higher end of the continuum we usually have to put pen to paper or finger to calculator. Other averaged measures either employ various steps (e.g., ERA, K per 9 Inn) or have more than two figures in their composites (e.g., OBP, SLG, IsoP), and thus increase in complexity as the numerator or denominator terms become more complex. Residualized measures follow a similar pattern to all of the aforementioned statistics but have an additional level of complexity due to the two steps that precede their final calculations. Regardless, residuals are simply the product of regressing traditional "numerator" stats (e.g., Hits) onto traditional "denominator" stats (e.g., ABs). Given that many of baseball’s statistical calculations are fairly automated and easily accessible at this point in time (see just about any sports website for remarkably complex, yet automated stats), I believe that the issue of calculation complexity is becoming, notably, more moot than meaningful.

Information usage
For most averaged categories in baseball, a player can only "qualify" for meaningful comparisons among the baseball intelligentsia, not to mention batting titles, if that player has had at least 3 or so plate appearances for every game that his team has played. If we qualify batters across the 2000-2003 seasons being discussed here, we will have reduced our samples dramatically. Indeed, in each season we wipe out or ignore valuable information for essentially 70% of the sample, which equates to disregarding about 400+ non-qualifying players per season! Qualifying our measures is, at its best, a sad attempt to correct a statistic that is flawed in its ability to deal with low-opportunity figures. I find it hard to believe that anyone interested in measurement would agree to scrap so much data. The single argument that I have heard in favor of these "qualifying cuts" entails that non-qualifying players just don’t represent a large enough sample of opportunities to be valid for such comparisons in the first place… my response: Why not develop measures (hint: residualized statistics) that allow for the usage of all data, because they correctly and appropriately deal with players in the high-sampling and low-sampling ends of the continuum?
Intuitive nature of results
When comparing the intuitive nature of the results of these different approaches, we stumble upon two significant problems with averaged statistics that are not found in residualized measures. First, all averaged statistics tend to amplify results at the low end of their denominator’s continuum. For example, the Ave statistic can produce results that range from 0 to 100% at the low end, or unstable end, of the AB continuum (more on this later). Accordingly, many researchers, writers, and MLB itself employ the aforementioned qualified statistics that ignore these arguably unstable results. That said, without qualifying cuts in our data, we are left with results that are inappropriately suited to our purposes. Here are some Ave results from the 2003 season.
Contrast the above to findings for residualized Hits in the 2003 season using all of the available data:
It is important to note that even with qualifying cuts in place, the top performances in the AL differed when using averaged measures (Mueller) versus residualized measures (Manny Ramirez). In other words, differences between measures are found both at the low end and high end of the opportunity continuum.
The second problem found with averaged measures is their inability to reflect era-specific fluctuations. Ask yourself, do averaged measures transcend changes in hitting and pitching across generations (e.g., power vs. dead-ball eras)? The answer is "yes" and "no." Averaged statistics literally mean the same thing regardless of what’s happening in an era, which seems to be a good thing. Yet we all know that a batter who posted a .300 batting average during the 1970s probably performed at a higher level than one who posted the exact same .300 average during the 2003 season, or during a hitter’s era. In comparison, residualized versions of these measures do an outstanding job of accounting for these era-specific fluctuations by producing different prediction slopes for different eras. For example, the following table displays the top and bottom 10 all-time single-season hitting performances from 1900 through 2003.
	
A further example of this approach can be demonstrated by comparing the top 10 HR hitters of all time.
From the above tables it is apparent that the residualized statistics have corrected for era-specific performances. Although Ty Cobb would have ranked 3rd on the all-time list if we used qualified Ave as our guide, the remarkable nature of his accomplishment in 1911 emerges when we examine his performance relative to his contemporaries: 71.4 hits beyond expectations specific to that season! The opposite can be said for the remarkably poor performances by Frankie Crosetti in ’39, ’40, and ’37. In terms of HR seasons, we find that Bonds is at the top in all three approaches, yet Ruth’s relatively meager 60 HRs in ’27 now appropriately ranks second among all players with 53 HRs above his era’s statistical expectations. Comparing these numbers against the HRs per AB ratio, we find a rank order that is slightly better than the raw HR numbers, but has difficulty making fine distinctions between performances.

Cross-time consistency comparisons
Another important test for performance statistics has to do with their reliability, consistency, or ability to predict themselves across time. When we directly compare the statistical features of averaged measures to residualized measures (as is the purpose of this paper), the latter are clearly the better statistics to use if one is interested in season-to-season predictive ability. The cross-time consistency correlations for Ave, OBP, and SLG, plus a couple of Bill James statistics, reveal a mean cross-time correlation for the averaged measures of r = .37 (r^2 = .15) and a mean cross-time correlation for the residualized versions of r = .65 (r^2 = .42). In other words, the residualized measures explain, on average, a remarkable 25% more of the variance in future scores than do commonly used averaged measures. Furthermore, and quite surprisingly, all of the residualized measures out-predict their averaged counterparts in predicting future averaged scores (e.g., SLG’01 predicts SLG’02 at r = .31, but rTB’01 predicts SLG’02 at r = .35)!
In order to illustrate that even statistics born from wisdom are not immune to the problems of averaged measures, let’s take a look at a couple of Bill James’ stats that use averaging: isolated power (IsoP) and secondary average (SecA). The SecA stat is defined as (TB - H + BB + SB - CS)/AB. We can residualize the numerator by regressing it onto ABs. The two leftmost matrices of the below table reveal that the residualized version (rSecA) is more consistent across time. The rSecA stats are also stronger predictors of future SecA stats (e.g., SecA’01 predicts SecA’02 at r = .52, but rSecA’01 predicts SecA’02 at r = .56).
Since the isolated power measure is defined as being equal to SLG minus Ave, we can approximate this stat with residualized measures using rTB minus rH (call it rIsoP). The two rightmost matrices in the above table reveal that the rIsoP measure is, once again, both more consistent across time and a better predictor of future IsoP stats (e.g., IsoP’01 predicts IsoP’02 at r = .45, but rIsoP’01 predicts IsoP’02 at r = .48).
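For readers who want the two averaged Bill James stats spelled out, here is a small sketch of their standard definitions: SecA = (TB - H + BB + SB - CS)/AB and IsoP = SLG - Ave. The function names and the sample stat line are hypothetical and serve only to illustrate the arithmetic; the residualized versions would additionally need regression equations that are not reproduced here.

```python
def secondary_average(tb, h, bb, sb, cs, ab):
    """SecA: extra bases, walks and net steals per at-bat."""
    return (tb - h + bb + sb - cs) / ab


def isolated_power(tb, h, ab):
    """IsoP: slugging percentage minus batting average."""
    return tb / ab - h / ab


# Hypothetical stat line: 500 AB, 150 H, 250 TB, 60 BB, 10 SB, 5 CS.
print(round(secondary_average(250, 150, 60, 10, 5, 500), 3))
print(round(isolated_power(250, 150, 500), 3))
```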
If we employ qualifying cuts for both the averaged and residualized measures, their respective mean cross-time correlations are exactly the same (r = .63). Thus, although residualized measures certainly outperform averaged stats at the low end of the opportunity continuum, both appear to be equally strong predictors when isolated to high-end opportunity performances.

Resulting statistical relationship with opportunity stats
Recall that residualized measures are always uncorrelated with their denominator statistic, meaning that we have appropriately removed the influence of opportunity from the residual measures. However, the denominator of averaged stats (e.g., Ave, OBP, SLG, etc.) has quite a large impact upon these supposedly "opportunity-free" measures. As can be seen in the following tables, all of the relationships between these averaged stats and their denominators are positive, linear, and substantial enough to be alarming.
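The two algebraic properties recalled here (residuals average zero and are uncorrelated with the predictor) hold for any least-squares line fitted with an intercept, and can be checked on made-up data. The sketch below uses NumPy and an entirely synthetic league; the talent distribution and the playing-time rule are assumptions for illustration, not the 2003 data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_players = 500

# Synthetic league (assumption): true talent varies, and better hitters
# tend to be given more at-bats.
talent = rng.normal(0.260, 0.030, n_players).clip(0.150, 0.350)
ab = (300 + 4000 * (talent - 0.260) + rng.normal(0, 150, n_players))
ab = ab.clip(10, 700).astype(int)
hits = rng.binomial(ab, talent)

# Regress Hits onto ABs (least-squares line with an intercept).
slope, intercept = np.polyfit(ab, hits, 1)
r_hits = hits - (intercept + slope * ab)
ave = hits / ab

print("mean rHits:", round(float(r_hits.mean()), 6))
print("corr(rHits, AB):", round(float(np.corrcoef(r_hits, ab)[0, 1]), 6))
print("corr(Ave, AB):", round(float(np.corrcoef(ave, ab)[0, 1]), 3))
```

The first two printed values should be zero up to floating-point error on any data set; the third depends on how playing time is handed out in the simulated league.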
Residualized measures are more statistically valid than averaged measures due to the latter’s troubling correlations with their denominators; yet we need to explore whether the former are entirely unrelated to opportunity. As can be seen in the following scatterplots (Ave & rH related to ABs in 2003), we find that both averaged and residualized measures have variances that are conditional upon level of opportunity. Stated differently, batters that have relatively few ABs because of injury, September call-ups, demotion, etc. (e.g., 2003 Ave leaders Jesse Garcia, Kit Pellow) can have Aves that range from zero to perfection with an average variability (SD) of about 180 points. Yet as opportunities increase for healthy players, veterans, etc., their averages actually narrow in range considerably with an average variability (SD) of about 25 points. The exact opposite trend can be found with residualized measures. Players with very few ABs tend to cluster around the mean of zero and have little variability (SD ≈ 2 hits), while players become increasingly differentiated from the mean (SD ≈ 17 hits) as their opportunities to perform increase.
If we consider the AB or opportunity continuum to be some proxy measure for the confidence that we have in our results (i.e., confidence increases as our sampling of ABs, PAs, Inns, etc. increases), then two very important trends emerge: (1) with averaged measures, variance decreases as you advance from the low end to the high end of the opportunity continuum, thus making distinctions between ability more difficult at the reliable end of the opportunity continuum and easier at the unreliable end; (2) with residuals, variances increase as you advance from the low end to the high end of the opportunity continuum, thus making distinctions between ability easier at the reliable end of the continuum and more difficult at the unreliable end. Needless to say, I believe that the second trend is far more appropriate for performance evaluation.
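The opposite variance trends can be reproduced with a toy simulation. This is only a sketch under a strong assumption: every simulated batter has the same true average (.270, a made-up figure), so all of the spread comes from chance. The prediction line reuses the article's 2003 coefficients with the intercept taken as -6.5.

```python
import numpy as np

rng = np.random.default_rng(1)
TRUE_AVE = 0.270                 # assumed identical talent for every batter
INTERCEPT, SLOPE = -6.5, 0.29    # article's 2003 Hits-on-AB prediction line


def simulate_sds(ab, n_batters=20000):
    """Return (SD of Ave, SD of rHits) for batters who all have `ab` at-bats."""
    hits = rng.binomial(ab, TRUE_AVE, size=n_batters)
    ave_sd = float((hits / ab).std())
    r_hits_sd = float((hits - (INTERCEPT + SLOPE * ab)).std())
    return ave_sd, r_hits_sd


low_ave_sd, low_r_sd = simulate_sds(20)      # September call-up territory
high_ave_sd, high_r_sd = simulate_sds(550)   # everyday player

print("SD of Ave:   20 AB vs 550 AB ->", low_ave_sd, high_ave_sd)
print("SD of rHits: 20 AB vs 550 AB ->", low_r_sd, high_r_sd)
```

With these settings the spread of Ave shrinks as ABs grow while the spread of rHits widens, which is the pattern described above.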
One further yet critical difference between averages and residuals is how they answer the question: What should we do with those low-opportunity players? Residualized measures begin by putting everyone at zero, or firmly at the group mean, and then modify as we acquire more information. Guessing zero (or the mean) for a player that we have little or no information about is perfectly reasonable, for rudimentary statistical logic dictates that if we know little or nothing about some value, our best guess is the mean for that group of values. Think about this in terms of a typical September call-up. If we don’t know much about him, what should our best guess be about his likely performance in the majors? Should we guess that he’ll perform at some nonspecific level between a .000 and a 1.000 Ave? Should we flip a coin? If so, we’re doing exactly what all averaged measures do. Residualized stats start players at the league mean because it is our best possible guess without any other info. Thus, unlike averages, residuals take the proper theoretical and statistical approach to solving the low-AB problem.

Applicability
One can make just about any statistic into an average, and thus one can make just about any statistic into a residualized measure. Specifically, as long as the average or ratio in question has a numerator and a denominator, a residualized measure can be easily developed using those exact same numbers. For example, we can translate Nomar’s 2003 stats from .301/.345/.524 into 13/-6/25. Residuals can also be used to estimate a player’s exact number of hits, walks, etc. given park factors, age curves, etc. (i.e., park, age, etc. can be included in the regression equation directly, or used to correct raw data prior to the regression procedure). Here is a table of the top 10 slugging performances in 2003 which includes a column of figures corrected for ballpark; the largest difference, for obvious reasons, can be seen in Todd Helton’s numbers.
Further, because residuals do a great job with low-opportunity performances, these statistics are preferable when using "splits" (e.g., pitcher success vs. lefties or righties), minor-league data, or when evaluating September call-ups. For example, the following table includes results for the 2003 AAA International and Pacific Coast Leagues. Note that the residualized figures are now directly comparable across leagues.

Reader Comments and Retorts
1. John M. Perkins Posted: February 12, 2004 at 03:08 AM (#614635)
Thanks for an interesting article. I have a couple of comments.
1. Does it really make sense to fit intercepts in your models? In effect, a part-time player gets compared to other part-time players, and a full-time player gets compared to other full-time players. But we actually have good reason to think that full-time players are better as a group.
2. If batting titles were awarded on the basis of residual hits, then there would be an inherent advantage not only for playing more (which is reasonable) but also for accumulating at bats as opposed to walks and for batting earlier in the order.
3. I'm not sure that looking at residual hits really adjusts much for era. Residual hits do clearly adjust for the league-wide mean, but isn't adjusting for the spread of, for example, batting averages perhaps the more important issue? In your list of the all-time top 10, for example, only 2 of them come after 1924.
However, you cannot use AB. The opportunity is PA. There is probably a relationship linking higher AB to lower BB to higher batting average. That is, a guy who has lots of AB might have a bit less BB than he "should" for a guy at that talent level. But the only way he got that many PAs was because he was getting lots of hits. The better terms to use are: times on base, PA, and OBA.
As well, using the residuals here to construct league leaders is confusing. The better your true talent, the more PAs you get, and therefore, the higher your expected OBA. If a player's OBA matches this expected higher OBA, then he comes out as zero. In fact, anyone who's exactly on that line comes in at zero, even though we know that they got the extra PAs specifically because they were better.
I don't have a problem with the residuals, but rather applying them as league leaders to demonstrate something.
Finally, and clearly, we don't have a linear relationship. The intercept should be zero, and it should be curved at the extreme. Since the players we are most interested in are at the extremes, care should be taken here.
batting average (PA) = (PA ^ 0.25) * .06
You can use something similar for OBA as well.
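As a quick sketch of what that curve implies, the formula can be evaluated at a few PA totals. The function name is hypothetical; the constants (.06 and the 0.25 exponent) are the ones given in the comment. The curve passes through zero at zero PA and flattens as PA grows.

```python
def expected_ba(pa, scale=0.06, exponent=0.25):
    """Commenter's suggested curve: expected batting average as a function of PA."""
    return scale * pa ** exponent


for pa in (16, 100, 300, 625):
    print(pa, round(expected_ba(pa), 3))
```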
Fixing the intercept at zero does mean that the residuals won't sum to zero, but I don't see why that would be a desirable quality for these models anyways.
As for the heteroscedasticity problem, you could easily fix it by using weights = inverse of AB (or PA or whatever). But I think I agree with Dan, maybe in this case we don't care. We're not using the standard errors for anything anyways.
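A rough sketch of the weighting idea, assuming NumPy: weighted least squares with weight 1/AB on each squared residual, implemented by scaling every row by the square root of its weight before an ordinary least-squares solve. The function name and the sample data are hypothetical, and the sketch requires every AB to be greater than zero.

```python
import numpy as np


def weighted_line_fit(ab, hits):
    """Fit hits = intercept + slope * AB, weighting each squared error by 1/AB."""
    ab = np.asarray(ab, dtype=float)
    hits = np.asarray(hits, dtype=float)
    root_w = np.sqrt(1.0 / ab)                       # sqrt of the 1/AB weights
    design = np.column_stack([np.ones_like(ab), ab])
    coef, *_ = np.linalg.lstsq(design * root_w[:, None], hits * root_w, rcond=None)
    intercept, slope = coef
    return float(intercept), float(slope)


# Hypothetical data lying exactly on the article's line, so the fit is easy to check.
sample_ab = [50, 100, 200, 400, 600]
sample_hits = [-6.5 + 0.29 * a for a in sample_ab]
print(weighted_line_fit(sample_ab, sample_hits))
```

On real data the weighted and unweighted lines would generally differ, since the weighting gives low-AB players more say per at-bat than ordinary least squares does.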
A side question: is it really true that heteroscedasticity makes the estimator inconsistent? Because it's still unbiased, so I would think that it would still be consistent. Just curious.
Anyways, great, fun work. I do like the fact that the numbers are easily interpretable (despite what I said earlier about not caring if the residuals sum to zero, although I still think you should fix the intercept), because it relieves me from having to remember what the league OBP or SLG was or whatever. Like in your A-Rod example, I wouldn't have known off the top of my head that .345 was a subpar OBP.
It seems to me that you would still need qualifier "cutoffs" to create any sort of leaderboard. Last season, Aaron Sele was 1 for 3 with a single in 3 plate appearances. His expected hits, given the equation above, were -5.597. The fact that he got one hit means he had over six more hits than expected, despite the fact he only batted three times. He also had 13.85 more total bases than expected, a higher number than Troy Glaus amassed in 319 at-bats.
Is there something I am completely missing?
The slope term (.29) can be interpreted as meaning that for every AB the number of predicted Hits will increase by .29, and the y-intercept term (-6.5) is the expected number of Hits with 0 ABs. [emphasis mine]
In what practical or theoretical sense can someone be "expected" to have -6.5 Hits with zero AB? Wouldn't the expectation be zero hits? I completely understand the first half of the explanation, that each at-bat raises expected hits by .29. So, going back to the Aaron Sele 1 for 3 example I raised a few posts up, wouldn't it make more sense to say that he had .13 hits above expectation (1 - [.29 x 3])?
If we consider the AB or opportunity continuum to be some proxy measure for the confidence that we have in our results (i.e., confidence increases as our sampling of ABs, PAs, Inns, etc. increases), then two very important trends emerge: (1) with averaged measures, variance decreases as you advance from the low end to the high end of the opportunity continuum, thus making distinctions between ability more difficult at the reliable end of the opportunity continuum and easier at the unreliable end; (2) with residuals, variances increase as you advance from the low end to the high end of the opportunity continuum, thus making distinctions between ability easier at the reliable end of the continuum and more difficult at the unreliable end. Needless to say, I believe that the second trend is far more appropriate for performance evaluation.
Unfortunately, points (1) and (2) are wrong. It only appears to be easier at the points of higher variance because there's more variance. However, in the presence of more variance, large differences are more likely due to chance.
Residuals have their own standard errors. One of the assumptions of linear regression is that the error variance is constant (called homoscedasticity). That's not the case here. Technically this means the standard errors are off, but that usually only matters when it comes to testing our coefficients (not really of interest here). But it also means that you need to standardize your residuals to compare them. A 5 hit residual in 100 AB is a much different animal than a 5 hit residual in 600 AB.
To put that in plainer terms, here are the SDs of the residual (with intercept, although I agree we shouldn't have one) at different AB levels for 2002:
1-50: 1.7
51-100: 4.3
101-200: 5.1
201-300: 7.7
301-400: 8.2
401-500: 12.9
501-600: 12.9
601+: 15.4 (but only 27 cases)
So to be statistically significantly different (at .05 level) at the 400+ PA level, you need a gap of about 25 hits between two players. Big differences at that level look not so big once you take that into consideration. For example, Manny Ramirez led the AL in rH with 26, meaning he was barely statistically significantly better than an average hitter (with that many ABs).
And those two variances (i.e. BA vs. resid) are essentially just flip sides of the same coin:
For a .280 average and 75 ABs, the expected SD (using the binomial) is about +/- 3.9 hits whereas our estimated one (in a fairly small sample) is around +/- 4.3. We can convert both of those back into points of average by dividing by 75, giving us +/- .052 and +/- .057.
For a .280 average and 500 ABs, the expected SD is about +/- 10 hits, whereas our estimated one is about +/- 12.9. Converting back to points of average, we get +/- .020 and +/- .026.
My point being that BA and residual hits are pretty much telling you the same thing. In both cases you have to keep in mind that comparisons are tough without taking opportunity into account. With BA, seemingly big differences don't mean much at the low end; with residual hits, seemingly big differences don't mean much at the high end.
Which you prefer to use is not a big deal. Obviously if you're looking at value, the residual is more useful ... though note, it's essentially a measure against average, not replacement, so there's still the question of whether a guy with 100 AB and 2 hits above average is more/less valuable than a guy with 600 AB and 2 hits below average.
An alternative method, using a bit more math but less dependent on using single-year samples, is to take advantage of the binomial distribution. You can get the mean and the expected or population variance (or at least a darn good approximation) in the number of hits for any given BA and # of AB using the following formulas:
mean = BA*AB
VAR = BA*(1-BA)*AB
So you could plug in the league-average BA and an individual player's AB to get the expected number of hits and the variance in the number of hits for an average hitter with that many AB. You subtract the expected number of hits from the batter's actual number of hits.
Now the tricky part is you have the variance and you want the standard deviation. So you need the square root of the variance. But your windows calculator or excel spreadsheet can do that for you.
Now compare that batter's hits above average to the standard deviation and you've got a meaningful measure of whether that batter was substantially better than average.
The advantage of this method is that your estimate of the variance/standard deviation is not based on a particular year's sample of data. The real variance is always better than the estimated variance.
Returning to our Manny comparison, he had 185 hits in 569 AB (.325 BA) in 2003. The AL league BA was .267.
.267*569 = 152 expected hits
.267*.733*569 = 111.4 = VAR
sqrt (111.4) = 10.6 = SD
So Manny had 33 more hits than an average hitter in the same number of ABs. Compared to a SD of 10.6, that's a z-score of over 3. (Note this is, of course, also a comparison to average, not replacement. But here you could plug in replacement-level BA instead of average BA if you wanted to make that comparison.)
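The binomial steps above can be collected into one small function. This is just a sketch of the arithmetic in the comment; the function name is hypothetical, and the inputs are the Manny Ramirez figures quoted above (185 hits, 569 AB, league BA of .267).

```python
import math


def binomial_z_score(hits, ab, baseline_ba):
    """Hits above a baseline hitter, scaled by the binomial SD for that many ABs."""
    expected = baseline_ba * ab                     # mean = BA*AB
    variance = baseline_ba * (1 - baseline_ba) * ab # VAR = BA*(1-BA)*AB
    sd = math.sqrt(variance)
    z = (hits - expected) / sd
    return expected, sd, z


expected, sd, z = binomial_z_score(185, 569, 0.267)
print(round(expected), round(sd, 1), round(z, 1))
```

Passing a replacement-level BA as `baseline_ba` instead of the league average gives the comparison to replacement mentioned in the comment.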
Other problems with using residuals remain (and they apply to the alternative method as well). At least in the 2002 sample that I looked at, the mean of the residuals is related to AB  which is really just telling us that aboveaverage hitters are more likely to get ABs (the average for players with fewer than 400 AB is 248, for over 400 it's 277). But what it also means is that the slope on AB in the regression equation is not really constant across AB.
This means the relationship between hits and ABs is not exactly linear. You can barely see this in the first scatterplot. Note how at the high end of ABs, there appear to be about twice as many points above the predicted line as below. Although you can't really detect it given the small scale, there are also more points below the line at the low end of ABs.
For 2002, I ran a regression of hits on AB and AB-squared. Although the coefficient on the squared term is rather teeny, it is highly significant (t>7).
Back to a technical point. The author claims that one advantage of residuals is that they are uncorrelated with ABs (or whatever the independent variable is). The answer to that is yes and no. I mentioned that the error variance is not constant (called heteroscedasticity). One possible cause of heteroscedasticity is omitted variables. In this case, one of those omitted variables appears to be the square of AB. So although the residual is uncorrelated with AB, correlation is a linear measure. It is correlated (ever so slightly) with AB-squared, which means the residual is not _independent_ of AB. (Note neither using averages nor the binomial I used above avoids this).
Another advantage of the model with AB and AB-squared is that the intercept is no longer statistically significant, not even close in 2002. If you run the model without an intercept, the average residual is a meager .03. Given we know that you get 0 hits with 0 AB, this would seem to be the preferred model, though I'd run it on more years before coming to that conclusion.
Residual scores are fine. In some small ways they are preferable to using averages or the binomial distribution, and in some small ways they aren't. But they're hardly a magic bullet, and for the most part they suffer from the same limitations as the more traditional way of doing things.
This is incorrect, though a common misunderstanding of regression models.
OLS regression requires absolutely no assumptions about the distribution of the variables. In fact, variables which are coded just 0 or 1 are quite common in regressions.
OLS does assume that the ERROR is normally distributed. However, even this assumption is necessary only for getting correct estimates of the standard errors. The coefficients are still unbiased no matter how screwy the error is. And this means that the residual estimate should also be unbiased.
Moreover, there are tons and tons of studies that show the normality of the error is not a particularly important assumption. OLS' standard errors are quite robust even for quite non-normal error distributions (assuming the model is correctly specified).
The misunderstanding comes from the fact that the sum/difference of normal variables is a normal variable. So rearranging your standard regression equation:
e = Y - b0 - b1*X1 - b2*X2 - ...
So if your Y and all your X's are normally distributed your error term is guaranteed to be normally distributed. So you'd like normal variables but they are by no means required.
The reason to be concerned about non-normality of your variables and/or errors, not to mention heteroscedasticity, is that they are often a sign that you either have the wrong functional form (i.e. the relationship isn't linear) or you have omitted variables. The assumptions that you have the right functional form and no omitted variables are far more important assumptions of OLS regression; if they're violated, your coefficient estimator is biased.
but i hope my students are reading this site, that's question #2 of the stats homework due tomorrow.
But that's not true. Every player is being compared to the overall average (with the intercept correction thrown in to even things out a bit). The average residual in 2002 for players with 51-100 AB was -2.8 (that's negative 2.8 if it doesn't parse right), so a player who was +5 hits was nearly 8 hits better than his "lesser competition". Meanwhile, the mean residual for 501-600 AB was .41, so a +5 residual there is only 4.6 hits better than his "greater" competition.
The heteroskedasticity problem renders the OLS estimates inconsistent (they will not converge to the true value as the sample size increases).
This is NOT correct unless the heteroscedasticity is due to omitted variables. All I've got handy is a 1983 edition of Neter, Wasserman, etc. Applied Linear Regression, but I quote:
When heteroscedasticity prevails but the other [assumptions - originally "conditions of model 3.1"] are met, the estimator obtained by ordinary least squares procedures are still unbiased and consistent. They are no longer minimum variance unbiased estimators.
And they should have added that your standard error estimates are biased so your tests of significance are no good.
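The unbiasedness claim can be checked with a small simulation. This is only a sketch under stated assumptions: every simulated hitter has the same true BA (so the linear model is correctly specified), the noise is a normal approximation to binomial noise with variance BA*(1-BA)*AB (so it grows with AB), and all function names and parameter values are made up for illustration.

```python
import math
import random

def ols_slope_intercept(xs, ys):
    """Ordinary least squares fit of y = b0 + b1*x; returns (b1, b0)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def average_fitted_slope(true_ba=0.270, players=200, seasons=200, seed=0):
    """Average OLS slope over many simulated seasons with heteroscedastic noise."""
    rng = random.Random(seed)
    slopes = []
    for _season in range(seasons):
        at_bats = [rng.randint(1, 600) for _player in range(players)]
        # Noise SD grows with sqrt(AB), so the error variance is not constant.
        hits = [
            true_ba * ab + rng.gauss(0.0, math.sqrt(true_ba * (1 - true_ba) * ab))
            for ab in at_bats
        ]
        slope, _intercept = ols_slope_intercept(at_bats, hits)
        slopes.append(slope)
    return sum(slopes) / len(slopes)

print(f"average fitted slope: {average_fitted_slope():.4f} (true BA = 0.2700)")
```

The averaged slope should land very close to the true BA even though the error variance changes with AB. What the simulation does not show is the other half of the point: the usual standard error formula is no longer trustworthy under these conditions.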
Now, why oh why can't we make article comments work the same as CH comments. Why can't we preview? Anyway, here's a second stab at the residual variance by AB table I tried to put in my first note:
1-50: 1.7
51-100: 4.3
101-200: 5.1
201-300: 7.7
301-400: 8.2
401-500: 12.9
501-600: 12.9
601+: 15.4 (but only 27 cases)
The estimated equation is:
H = b0 + b1*AB + e
where e is the residual number of hits.
Now divide both sides by AB:
H/AB = b0/AB + b1 + e/AB
So batting average equals:
b1: the "average" BA (but not quite)
e/AB: the amount that the hitter is below/above the "average" BA
b0/AB: an adjustment factor that helps us correct for the fact that low-AB hitters (on average) are worse than high-AB hitters.
This is going to lead you to something not that much different than looking at the average for a series of AB cutoffs, and comparing a player to that average. For batters with at least 100 AB, this difference ranges between 17 and 19 points of BA. If I did this right, in 2002, 58% of the time the less complicated method was closer to the observed value.
Sigh ... try this
b1: the "average" BA (but not quite) and I should have said not really
e/AB: the amount that the hitter is below/above the "average" BA
b0/AB: an adjustment factor that helps us correct for the fact that low-AB hitters (on average) are worse than high-AB hitters.
I meant to include the rough estimates of the size of that adjustment factor (all negative):
100 AB: about 60 points of BA
200 AB: about 30 points of BA
300 AB: 18 points
400 AB: 15 points
500 AB: 12 points
600 AB: 10 points
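A small Python sketch of where numbers like these come from: the adjustment is just b0/AB, converted to points of batting average. The intercept used here (-6.0 hits) is an assumption, chosen because it roughly reproduces most of the rough estimates listed; it is not a reported regression estimate, and the function name is made up.

```python
def adjustment_points(intercept_hits, at_bats):
    """b0/AB expressed in points of batting average (1 point = .001)."""
    return 1000.0 * intercept_hits / at_bats

# Assumption: an intercept of about -6 hits, inferred from the rough list above.
ASSUMED_B0 = -6.0

for ab in (100, 200, 300, 400, 500, 600):
    print(f"{ab} AB: {adjustment_points(ASSUMED_B0, ab):.0f} points of BA")
```

Because the adjustment is b0 divided by AB, it shrinks quickly as AB grows, which is why it matters mostly for part-time players.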
To me that's another minor problem with the residualized measure, in this particular case at least. The reported coefficient for AB is .29, but the average BA in any range of ABs is a good bit below .290. Even including the non-qualified batters (all the way down to 1 AB, including pitchers), only 147 players out of 1134 hit .290 or better. Just 13% of batters, and even just 36% of qualified batters, got .29 or more hits per AB.
So by the equation, the marginal return on an AB is .29 hits, or 29 hits over 100 AB. But that's not really the return over 100 AB for an average hitter, that's the effect of better hitters getting more playing time. That is, your typical 100 AB hitter might hit about .240 or get 24 hits. Your typical 200 AB hitter might hit about .265 or get 53 hits, which looks like 29 hits in 100 marginal ABs. But give that typical 100 AB hitter an extra 100 ABs and he's gonna keep hitting .240. Give that typical 200 AB hitter another 100 ABs and he's gonna keep hitting .265.
Unfortunately, although the marginal returns aren't constant in the quadratic model, you have the same basic problem that leads to odd looking marginal return values.
I'm still not sure that the best way to derive residualized hits isn't to use the average BA and the binomial distribution. You could use hits for value purposes and z-scores for quality.
But Eric, your logic applies to every single regression which has ever been estimated (except maybe the one we're discussing).
If you've got omitted variables or the wrong functional form, then your regression coefficients are biased and inconsistent, so who cares about the standard errors. Since, as you state, it is impossible to know whether you've got the correct form and variables, then no regression results should ever be used.
Now I'll agree that since heteroscedasticity is a common symptom of misspecification, one should take efforts to find a way to correct it - you're probably better off. However, homoscedasticity is in no way a guarantee of correct functional form. In practice, your coefficients are always biased no matter what your error looks like, so if we're going to throw out all results with biased coefficients, we'll throw out all regression results.
In practice, as long as you've included all the variables known to be important and you've made the effort to find the functional form that fits your sample best (or from theory/previous research) and you've run some standard diagnostics, then you've done all you can do and you should interpret your results as if all the assumptions are met. But heteroscedasticity of the error term is, in my opinion, near the bottom of things you worry about.
And my comment about not mentioning biased standard errors was directed at Neter and Wasserman, not you. Sorry for any misunderstanding.
Because most of the people who are doing statistical analysis don't know how to use anything beyond OLS, have no idea how to interpret the results, and (probably most importantly) wouldn't know how to explain the process in terms that can be easily understood.
Heteroskedasticity abounds in baseball. The reason is very simple: players who aren't good enough to play at a certain level won't continue to get opportunities, and players who are among the very best will get a disproportionate number of opportunities. Pick a stat (any stat) and you will almost always find that the variance among players gets smaller as the number of opportunities grows. At 100 PAs, the population includes good players who were hurt, some LH-hitting specialists, late-season and injury-replacement callups, terrible players who got replaced, and pinch-running/defensive specialists; there's a potential for a very wide range of performances in that group. At 600 PAs, you have guys who were regulars for the entire season, and who are generally very good players, with the occasional exception of a guy who plays because his team just doesn't have anyone better at a position. The minimum performance necessary to achieve 600 PAs is (in general) higher than the minimum performance necessary to achieve 100 PAs, which restricts both the range of performance and the variance, and also restricts the extent to which that performance can deviate from the regression line.
Residual numbers have the opposite problem; if the range of normal performance is, say, .220-.380 for hitters with 100 PAs, and .250-.350 for hitters with 600 PA (with a .300 midpoint in each case, for the sake of argument), the residuals will be no larger than 8 hits for the hitters with 100 PA but can reach 30 hits for players with 600 PA, and the deviations from the regression line will increase as PAs increase.
One "omitted variable" here, therefore, is the minimum performance level necessary to sustain playing time at a certain number of plate appearances. As you go further up the chain, in terms of increasing numbers of plate appearances, the minimum level that you have to sustain in order to get additional opportunities also increases (up to a point - there's probably not a lot of difference between 500 PAs and 600 PAs, but there's likely to be a large difference between 100 PAs and 600 PAs).
I'm hardly an expert in statistical methods; certainly, much of what Walt and Eric Young are talking about is beyond me. I have no idea how to apply the methods that account for this problem (although given time I could probably figure it out). But I know that arbitrary opportunity cutoffs are generally the wrong approach for evaluating baseball players, as I noted when discussing platoon advantages for RHB, because the nature of baseball statistics is that players who aren't well-rounded generally get fewer opportunities than players who are more capable across the board, and by removing low-opportunity players from the analysis you may very well be removing precisely those players who are *most* likely to have the characteristics that you are trying to evaluate, in which case it would hardly be surprising that you find no evidence of them.
- MWE
Well, if you know basic calculus, you're perfectly comfortable with regression. The regression equation in the article is essentially:
y = b0 + b1*X + e
Now b0 and b1 are constants. E is a random error term that is uncorrelated with X. E also has a mean of zero. Taking the derivative with respect to X, we get:
dy/dx = b1
The slope is nothing but the expected change in y for a one-unit change in X.
But for a lot of folks, their basic calculus is even rustier than mine. Still, it's not that hard to calculate the predicted value of Y (often called Yhat) for a given value of X (we're using i to subscript the individual) and remembering that the expected value of E is zero:
Yhat(i) = b0 + b1*X(i)
Note this is very similar to the above equation and we can rearrange:
E(i) = Y(i) - b0 - b1*X(i) = Y(i) - Yhat(i)
So for a given player, plug in their number of ABs for X and you'll get their predicted number of hits. Subtract that from their observed number of hits and you get their residual hits: how many more hits did they get than expected.
Really, this is not much different than taking the league average BA, multiplying it by the number of ABs, then subtracting that from actual hits. Surely most baseball fans can understand the notion that in 500 ABs, a .300 hitter will have 15 more hits than a .270 hitter (not that they can necessarily handle the multiplication). The residual scores are just a slightly more accurate way of getting at that number.
Simple regression (i.e. just one variable on the right-hand side of the equals sign) as we have here is also equivalent to correlation. Correlation tells you how closely related two variables are in a linear fashion; regression just gives you a way to estimate that linear fit.
Now explaining regression is a bit trickier. The easiest thing is to say that it gives you the best fitting line for a set of points. Look at the first scatterplot of hits by AB and look at the predicted line that goes through the middle of it. That's the regression line, or Yhat = b0 + b1*X. You've got to say it does a darn good job of summarizing those data.
The "tricky" part is how do you decide which line is best. And for that, we minimize the sum of the squared errors.
Now back to complicated stuff. Did you take matrix algebra on your way to calculus? If so, here's the equation and the estimator in matrix form:
Y = XB + e
with the OLS assumptions, we can solve for B as:
Bhat = inv(X'X)*(X'Y)
Yhat = XBhat
e = Y - Yhat
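For the one-predictor case the matrix estimator can be written out by hand, since X'X is only 2x2. The following Python sketch does that without any libraries. The AB and hit totals are made-up toy numbers, not real players, and the function names are just labels for illustration.

```python
def ols_simple(xs, ys):
    """Solve Bhat = inv(X'X) * (X'Y) for X = [1, x], using the 2x2 closed form."""
    n = len(xs)
    sum_x = sum(xs)
    sum_xx = sum(x * x for x in xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # X'X = [[n, sum_x], [sum_x, sum_xx]] and X'Y = [sum_y, sum_xy]
    det = n * sum_xx - sum_x * sum_x
    b0 = (sum_xx * sum_y - sum_x * sum_xy) / det
    b1 = (n * sum_xy - sum_x * sum_y) / det
    return b0, b1

def residuals(xs, ys, b0, b1):
    """e = Y - Yhat, where Yhat = b0 + b1*x."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Toy data (hypothetical): at-bats and hits for five imaginary hitters
at_bats = [50, 150, 300, 450, 600]
hits = [10, 38, 80, 125, 175]

b0, b1 = ols_simple(at_bats, hits)
print(f"b0 = {b0:.2f}, b1 = {b1:.3f}")
for ab, h, e in zip(at_bats, hits, residuals(at_bats, hits, b0, b1)):
    print(f"{ab} AB, {h} hits, residual hits = {e:+.2f}")
```

With an intercept in the model the residuals sum to zero, which is the sense in which the regression line "goes through the middle" of the points.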
One thing that keeps averages in the mainstream is that laypeople can figure out how they're calculated
Maybe. I think what keeps them in the mainstream is their use by the media. Nobody knows how to figure out quarterback ratings, computer rankings for college football and basketball, or the RPI for basketball, but laypeople talk about these things all the time. You don't know how to run a regression to make sense of residual statistics. Jim and others will run the regressions, report the residual statistics, and even give you the values for b0 and b1 if you want to calculate them yourself using the formula above. Can mainstream fans really not understand "Manny Ramirez got 26 hits more than a typical hitter would in the same number of ABs and that was the biggest differential in the AL?"
Frankly, if one can't understand at least the concept of residual statistics, I don't see how one could possibly understand anything about sabermetrics, at least at any level beyond "Neyer told me that high values on this number were good." Sure, the ordinary fan can calculate OPS, but why would they want to if they don't know what it means to say that it correlates highly with runs and that it's a better predictor than BA and HR? Imagine the following conversation:
Fan 1: I love that Tony Batista, lots of HRs and RBI.
Fan 2: But he's got a lousy OPS and doesn't get on base.
Fan 1: What's OPS?
Fan 2: It's OBP plus Slugging.
Fan 1: Who cares about those things?
Fan 2: Well, it's been shown that OPS does a better job of predicting team scoring than BA and HR.
Fan 1: What do you mean it does a better job of predicting?
Fan 2: I have no idea.
Fan 1: So why do you use OPS?
And I completely 100% disagree with everything you wrote. The importance of considering variance is to tell whether or not two things REALLY ARE DIFFERENT. Your "easily differentiated" high end numbers are, in fact, mostly not different from one another. Some of your "not easily differentiated" low-end numbers are different from one another. To claim that it is easier to see differences at the high end than the low end with residual measures is just plain wrong.
I'm sorry but it's just a fundamental of statistics that if comparing a variable across different populations (and that's what we have), you have to take their population-specific variances (or your sample estimate thereof) into account. It's wrong to say that a guy who hit .350 in 100 AB is as good a hitter as a guy who hit .350 in 500 AB and it's wrong to say that a guy with a +5 residual in 500 AB is a better hitter than a guy with a +5 residual in 100 AB (except that we know that better hitters get more AB, so AB here is a proxy for talent). Again, if you want to make a value argument, that's fine, but residuals have problems there too unless you adjust for AB.
If you want to put that in real-world terms, using residual hits to compare across AB groups is like saying that a family that makes $55,000 in SF has a higher standard of living than a family that makes $50,000 in Des Moines even though the median HH income in SF is $55,000 but only $38,000 in Des Moines (1999 #s). Or if you want to put that in cost-of-living terms, $50,000 in Des Moines is the equivalent of $85,000 in SF (those are probably 2003 $, from one of those online COL calculators).
The problem with AVG and the problem with residual hits ARE THE SAME. These two are just the flipside of one another. The differentiation in average for high AB players is no better or worse than the differentiation in rH for high AB players. It's just that the variance is standardized by AB in the first case and not in the second.
You talk about specification, but we already know that the binomial distribution gives us a darn good approximation of the variance and its specification is known. And what does the binomial tell you? As ABs increase, variance increases for hits and decreases for average. Let's take .280 and 500 AB vs 100 AB:
SD(h | 500 AB) = 10.0 = .020 in average
SD(h | 100 AB) = 4.5 = .045 in average
It's not that hard a conversion. And since when would a 20 point difference in average (just 1 sd remember) over 500 AB be hard to differentiate? Even over 1500 AB that SD is a 12 point difference in average.
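The conversion can be sketched in a few lines of Python, assuming a true .280 hitter and the binomial variance BA*(1-BA)*AB. The function names are made up for illustration.

```python
import math

def sd_hits(ba, at_bats):
    """Binomial standard deviation of the hit total."""
    return math.sqrt(ba * (1.0 - ba) * at_bats)

def sd_average(ba, at_bats):
    """The same spread expressed in batting average (SD of hits divided by AB)."""
    return sd_hits(ba, at_bats) / at_bats

for ab in (100, 500, 1500):
    print(f"{ab} AB: SD = {sd_hits(0.280, ab):.1f} hits = {sd_average(0.280, ab):.3f} in average")
```

The SD of hits grows with the square root of AB while the SD of the average shrinks with it, which is the flip side being described here.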
The point is that to assess either average or residuals, you have to take into account the number of ABs. Making such comparisons without considering ABs is misleading in either case.
Note, this isn't just a problem with your work, it's a problem with almost all the work in sabermetrics. This is why people use cutoffs based on AB and PA or try to standardize based on AB or 162 games, in an attempt to make sure that they're comparing similar populations. That's hardly a perfect solution because PA are always in part a function of talent.
To try to put this more formally (not sure this will be right) - T is true talent.
trueBA = f(T)
AB = g(T,health,etc.) - but let's ignore all that other stuff for now
hits = trueBA*AB + e
hits = f(T)*g(T) + e
Alas, e is distributed something close to binomial and therefore VAR(e) is a function of ABs or g(T). That is:
VAR(e) = h(AB)
So we have to get AB out of the error term somehow or we're doomed to heteroscedasticity.
As it just so happens, using the natural log seems to come pretty close in my subsample, especially if you limit it to players with 50+ ABs:
ln(h) = -1.92 + 1.1 ln(ab) + e
Of course nobody's gonna understand log-residuals and when you convert back to regular units, the spread in residuals returns. There's also the depressing fact that when we look at the studentized (i.e. standardized in a fancy way) residuals from this regression, there are only a couple guys who stand out as statistically significantly better hitters than expected given their AB. There are a number significantly worse at the low AB end.
On another technical point, the coefficient greater than 1 on the ln(ab) term is interesting. You can interpret coefficients of logged variables using % changes (or the elasticity interpretation for the economists in the crowd). So a 1% increase in the number of ABs leads to a 1.1% increase in the number of hits. No, that doesn't mean that you're batting over 1.000; it would work like this:
Suppose the average BA at 100 AB is .250. So you'd have 25 hits.
Increase AB by, say, 10% to 110. Hits would increase by 11% or by 2.75 hits, meaning 110 AB hitters are .252 hitters. That's the same thing that we see elsewhere, but this equation seems to make it clearer.
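A short Python sketch of that elasticity arithmetic. The "11%" figure is the usual small-change approximation; the exact multiplier implied by a log-log fit is (new AB / old AB) raised to the coefficient. The function name and the 25-hit starting point are illustrative.

```python
def hits_after_ab_change(base_hits, base_ab, new_ab, elasticity=1.1):
    """Scale hits by (new_ab / base_ab) ** elasticity, as a log-log fit implies."""
    return base_hits * (new_ab / base_ab) ** elasticity

base_hits, base_ab, new_ab = 25, 100, 110

approx_hits = base_hits * (1 + 1.1 * 0.10)   # the "11% more hits" shortcut
exact_hits = hits_after_ab_change(base_hits, base_ab, new_ab)

print(f"shortcut: {approx_hits:.2f} hits, BA = {approx_hits / new_ab:.4f}")
print(f"exact:    {exact_hits:.2f} hits, BA = {exact_hits / new_ab:.4f}")
```

Either way the implied average at 110 AB comes out a couple of points above .250, so the shortcut is plenty accurate for changes of this size.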
As for the discussion between Jim and Walt, it seems like part of the issue is the question, What are we trying to do with this statistic? If I want to answer the question, who had the most hits over expected last year, then we don't need to standardize the residuals. And that would be a very valid question to ask, and in my opinion, an interesting one. But if we're trying to decide whether player A is "significantly" better than player B, using some set of ABs and hits, then we may have to use a different method like the binomial method Walt proposes. That too would be a valid question.
Because the value of a player who plays not at all isn't the same as the value of a player who plays 600 AB and has a residual of 0 (i.e., the "average" player).
Let's look at three lines of players:
Player A: 0 hits, 0 AB, rHits = +6.47
Player B: 16 hits, 100 AB, rHits = -6.63
Player C: 161 hits, 600 AB, rHits = -7.16
So our numbers say that player A who never plays is worth more than player B who is in turn worth more than player C. It is odd, but possible, that players who don't play are worth more than players who do, if the players who do play are so bad that they are worse than replacement level. Are players B and C worse than replacement level?
Player B had a batting average of .160. Most people would probably agree that B is below replacement level. Great. How about player C? Player C had a batting average of .268. Few people would think that a guy with a batting average of .268 is a below replacement level player. But yet this residual method (using the intercepts from last year) says that B and C are both worse than someone who doesn't play and that C, the guy with the not too shabby .268 batting average is actually the worst of the three players!
I think this suggests something is a little different about this system. Is this wrong? Probably. The one thing that is interesting about this system is that, because of the negative y-intercept and the aggressive linear slope (.291), what ends up happening is you essentially get a different "replacement level" depending on how many AB you are looking at. I know this is something people have thought about and written about in the past, where if someone is filling in for 5 AB because the starter is temporarily out for one game then they might add value when they suck if they are slightly above this temporary replacement level. But if a team knows the starter is out for 4 months then the replacement level might be different because the team ought to be able to do different things (promote people from AAA, make a trade, etc.) so the replacement level might be a little higher.
This residual methodology gives you that, although I think it gets too high. In this methodology if you hit less than .291 you score as more valuable the fewer AB you get.
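The three player lines can be reproduced with a few lines of Python. This is a sketch that assumes an intercept of -6.47 (the comment describes the y-intercept as negative, and Player A's line implies its size) and the quoted slope of .291. With the slope rounded to three decimals, Player C comes out near -7.13 rather than the -7.16 given above, so the fitted slope was presumably slightly higher than .291.

```python
B0 = -6.47   # assumption: intercept implied by Player A's line (0 hits, 0 AB)
B1 = 0.291   # slope as quoted in the comment, rounded to three decimals

def residual_hits(hits, at_bats, b0=B0, b1=B1):
    """Residual hits: actual hits minus the regression prediction b0 + b1*AB."""
    return hits - (b0 + b1 * at_bats)

players = {"A": (0, 0), "B": (16, 100), "C": (161, 600)}
for name, (hits, at_bats) in players.items():
    print(f"Player {name}: rHits = {residual_hits(hits, at_bats):+.2f}")
```

Because the prediction climbs by .291 hits per AB, any hitter below .291 loses ground with every additional AB, which is the behavior this comment is objecting to.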
As an aside, there is an advanced metric that is sort of residual-like in that it measures its value in a similarly quasi-independent way for PA or AB, i.e., it isn't a rate stat. And that is VORP. But I think VORP is a strong measurement as it uses runs as currency, which is intuitively satisfying because runs make sense to think about and value and can be converted easily to wins and are at approximately the correct granularity to tell distinctions between players without being too precise (i.e., without having insignificant digits appear significant). In addition VORP correctly adjusts for position, park, and replacement level. That makes, IMHO, VORP an extremely valuable statistic. (and I'm sure there are other such statistics like UZR, etc. that do this as well).
But the reason this is too high, and the point I think you are missing, is that this is what you expect the *average* player who received that amount of AB to get. But by making the average 0 and giving people negative scores for less than average you lose the replacement level concept. This is grossly unfair to the players who are slightly below average while above replacement. A player who hits .269 in 600 AB has positive value for a lot of teams even though he may be a below average player. Your baseline misses this.
"The slope term (.29) can be interpreted as meaning that for every AB the number of predicted Hits will increase by .29, and the y-intercept term (-6.5) is the expected number of Hits with 0 ABs."
The second clause has been discussed already with regard to the appropriateness of the model, so leave that be. Please allow me to instead call your attention to the focus of the sentence, which completely fails to relate the equation back to baseball: the slope has the wrong number of significant figures and should be written .290, and number of hits (predicted or otherwise) per AB is AVG.
Once you recognize .290 is an AVG, you can look at re-expressing the equation in a way that will make sense in baseball (relating it to league average, or the typical average for a full time player or whatever). The resulting formula will be algebraically more complicated, but simpler in the context of baseball.
"For example, if a positive correlation is found between Ave and ABs (and this correlation actually does exist), the relative meaning of say a .280 Ave will change based upon ABs; for a .280 Ave with 100 ABs will be relatively more impressive than a .280 Ave with 600 ABs."
Same problem here: there is no baseball meaning of "relatively more impressive" that orders the performances this way. The hitter with 600 AB contributed more to winning with his AVG than the bench warmer, and he likely was contributing at a higher rate outside of AVG as well (else the bench warmer would have gotten the extra 500 AB). There are groupings where "relative" might make more sense (position adjustments and so forth), but AB really isn't one of them.
Similarly, referring to players as "above average" because their residual is positive abuses the common understanding of the term in precisely the same way. The September callups who got fewer than 22 AB were not from Lake Wobegon.
Looking at the model... I really don't see the point.
For instance, if you feel that residualized hits is more intuitive than average, that's fine. But why bother with the fancy stuff? Hits - League Average * AB has the same intuitive feel, AND produces zero residue for the total population, AND conforms to the intuition that hitters with fewer AB tend to be "below average", AND avoids the silly zero intercept issue.
Batting average was an excellent choice to use, both simple and familiar, as a demonstration of the technique. But the article failed to demonstrate that the technique was superior; perhaps there should have been a harder problem following, or more focus on illustrating the improvements with graphs rather than tables.