— Where BTF's Members Investigate the Grand Old Game
Thursday, February 12, 2004
Breaking the Law of Averages
Jim outlines how we can improve on statistics based on averages.
"You know what the difference between hitting .250 and .300 is? It’s 25 hits. Twenty-five hits in 500 at-bats is 50 points…. That means if you get one extra flare a week, just one, a gork. You get a ground ball. You get a ground ball with eyes. You get a dying quail. Just one more dying quail a week and you’re in Yankee stadium." – Crash Davis
So-called "averaged" rate, or ratio statistics such as Ave, OBP, SLG, ERA, K:BB, etc. dominate the baseball statistics landscape to such a large degree that they are often viewed as a player’s ticket to "the show." In the following piece I will argue that in baseball these measures are inappropriately applied and statistically flawed, and I will introduce rather simple residualized measures that are more valid, flexible, and statistically meaningful alternatives. Residualized measures have already been adopted successfully in other research fields where ratio statistics have been misapplied (e.g., social and behavioral sciences). Perhaps this introduction will help residuals find their place on the landscape of baseball research as well.
What are residualized measures?
A residualized score is, basically, a measure of how far a performance is above or below some prediction. In other words if we try to predict a player’s performance using an average score or a replacement-level value, we will have produced a type of residual score that represents how far that player’s performance deviated from the typical performance. The residualized measures presented here result from using multiple regression methods to make precise predictions. Specifically, we regress some variable (e.g., the numerator of an averaged stat, like Hits) upon a predictor variable (e.g., the denominator of an averaged stat, like ABs), which in turn produces a prediction line. When we evaluate where and by how far the actual numerator value is mispredicted by this prediction line, a precise "residual" score remains.
Throughout the history of baseball, hits, walks, total bases, strikeouts, etc. and their respective opportunity statistics have been collected in the hope of evaluating performance. There are a few different defensible ways of dealing with, for example, Hits and AB data (with a non-defensible way being subtracting ABs from Hits): (1) use Hits alone, which is great for some evaluative purposes, but does not remove the influence of opportunity (e.g., more opportunities = more hits); (2) standardize Hits by dividing by AB as is done for most common and uncommon statistics; or (3) standardize Hits by regressing it onto ABs. Since we are all familiar with the first two options, this piece will present the third option in-depth and discuss why it represents a statistically sound and simple alternative to averaging. Let’s illustrate this procedure by standardizing the 2003 season data by regressing Hits onto ABs, or differently stated, predicting Hits from ABs. This process spits out a single prediction line that best captures the relationship: predicted Hits or ^Hits = -6.5 + .29(AB). The slope term (.29) can be interpreted as meaning that for every additional AB the number of predicted Hits will increase by .29, and the y-intercept term (-6.5) is the predicted number of Hits with 0 ABs. Next, we can employ this equation to predict Hits from some number of ABs by simply plugging that AB number into the equation. Where our actual Hits score deviates from this numerical prediction, a residual score remains. The scatterplot below displays Hits and ABs data for 2003 and their prediction line. Residual scores are represented by the vertical distance that each point differs from the line.
The calculation of a residualized score is easier than it appears. You don’t need to understand multiple regression to calculate a residual; you need only find the appropriate formula (examples below) and plug in the appropriate number (e.g., ABs). For example, let’s calculate a residual Hits score for a player who had 30 hits in 100 ABs during the 2003 season (i.e., a .300 hitter). Using our ^Hits equation we find that he should have -6.5 + .29(100) = 22.5 Hits. Where this ^Hits score deviates from the player’s actual Hits score of 30, we have a residual score: residualized Hits or rHits = 30 - 22.5 = +7.5. Overall, our example player had 7.5 more hits than we would have predicted given his ABs. Here is a brief table of prediction equations from 2003 season data that can be used to find expected values for batting statistics.
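The worked example above (30 hits in 100 ABs) can be sketched in a few lines of Python. The intercept and slope are the 2003 values quoted in the text; the function names are made up for illustration.

```python
# 2003 prediction line quoted in the article: ^Hits = -6.5 + .29(AB)
INTERCEPT = -6.5
SLOPE = 0.29

def predicted_hits(at_bats):
    """Expected Hits for a given number of ABs under the 2003 line."""
    return INTERCEPT + SLOPE * at_bats

def residual_hits(hits, at_bats):
    """rHits: actual Hits minus the Hits predicted from ABs."""
    return hits - predicted_hits(at_bats)

# The article's .300 hitter: 30 hits in 100 ABs.
print(round(predicted_hits(100), 1))     # 22.5
print(round(residual_hits(30, 100), 1))  # 7.5
```

A positive result means the player finished above the prediction line, a negative result below it.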
There are two notably important features of residualized scores: (1) a set of residualized scores will always have a mean of zero, and (2) residuals will always be uncorrelated with their original denominator (or predictor) statistic. The first point increases the intuitive nature of the statistic because it reveals positive scores of some magnitude for players that perform above expectations, negative scores for those below expectations, and scores at or near zero for those who perform at the expected level. The second point is important because it indicates that we’ve likely removed the influence of the denominator or the opportunity stat. That is, we’ve appropriately standardized the measure in order to evaluate and compare performances divorced from total number of opportunities to perform. This is noteworthy because if we find that a reasonably strong correlation exists between the new measure and its original denominator term, we must conclude that we’ve compromised the meaning of the new statistic as an arguably opportunity-free or standardized metric. For example, if a positive correlation is found between Ave and ABs (and this correlation actually does exist), the relative meaning of say a .280 Ave will change based upon ABs; for a .280 Ave with 100 ABs will be relatively more impressive than a .280 Ave with 600 ABs.
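Both properties follow from how a least-squares line is fitted, and they can be checked numerically. The following is a minimal sketch using invented (AB, Hits) pairs, not real 2003 data; the helper names are hypothetical.

```python
from math import sqrt

def fit_line(x, y):
    """Ordinary least-squares (intercept, slope) for y regressed onto x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def residuals(x, y):
    """Vertical distance of each point from the fitted line."""
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)

# Invented (AB, Hits) pairs, for illustration only.
at_bats = [12, 85, 150, 240, 330, 410, 520, 600]
hits = [2, 19, 41, 60, 95, 104, 160, 171]

r_hits = residuals(at_bats, hits)
mean_residual = sum(r_hits) / len(r_hits)
corr_with_ab = pearson_r(at_bats, r_hits)
# Both values are zero up to floating-point rounding.
print(mean_residual, corr_with_ab)
```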
Comparing averaged to residualized statistics
In this piece I will spend less time restating the many arguments against averaged, rate, and ratio statistics (called "averaged" from this point forward) that have been presented quite cogently in the past (e.g., not defense-independent, park factors, luck, etc.) and emphasize comparisons between averaged statistics and residualized measures using the following non-exhaustive criteria:
Intuitive nature of the statistic
Averaged statistics have an immediate appeal unlike many other statistics. This may be due to their ubiquity in our day-to-day lives (e.g., cooking recipes, budgets) or their mathematical simplicity. Averaged statistics are used in arenas beyond baseball, like the arts (e.g., perspective), the sciences (e.g., means), and the rest of the sporting world (e.g., free-throw % in basketball, yards per rush in football). Needless to say, residualized measures do not share the same place in our common vocabulary, although the idea of comparing a performance to our expectations is far from a foreign concept.
In baseball, few statistics are as intuitively appealing as Ave. For most baseball fans and little leaguers alike, Ave is probably the first statistic that we understand and master. Because this prototypical averaged measure has such intuitive appeal and clarity, baseball statisticians and researchers have modeled many other statistics after the simplicity of Ave (e.g., OBP, SLG). In fact, it’s not uncommon to find baseball researchers attempting to translate unintuitive or untraditional measures into a more digestible form by scaling them to appear like Ave! Although these newer statistics share a general metric with Ave, they often lack Ave’s intuitive appeal. Let’s use SLG as an illustration: Do many fans actually comprehend what total bases per at bat means? They may be able to surmise that a base per AB (SLG = 1.000) is good, but are clueless about the bounds of a great, decent, or bad SLG. Residualized measures avoid some of these problems by having three primary and easily interpreted results: above prediction (positive numbers), below prediction (negative numbers), and at-prediction scores (near zero numbers).
Another detractor from the intuitive appeal of averaged measures concerns the fact that all of these stats are not in their natural or numerator units like Hits, Ks, etc. Averaged measures have sacrificed the intuitive appeal of their original units to gain the familiar and standardized appeal of an average like Hits per AB, Ks per Inn, etc. Residualized measures are more complex in their calculation when compared to averaged measures, yet residuals have not had to sacrifice their natural units. Indeed, rHits is still strictly interpreted as a number of Hits. Because they retain their natural units, residualized measures should be as intuitively appealing as the original metrics from which they were born. The only caveat with these measures–and it can be a big impediment to intuitive appeal–is that we must qualify our natural units statement as meaning the number "above or below expectation given opportunities to perform." This qualification can be cumbersome, but in return we retain the intuitive charm of a measure in its natural units.
Ease of calculation
Of all averaged statistics in baseball, Ave has to be the easiest to calculate as it employs two easily found terms in its composite (i.e., Hits & AB). In fact, at the low-end of the AB continuum (e.g., <50 AB) one can easily calculate a batting average without need of a calculator. At the higher end of the continuum we usually have to put pen to paper or finger to calculator. Other averaged measures either employ various steps (e.g., ERA, K per 9 Inn) or have more than two figures in their composites (e.g., OBP, SLG, IsoP), and thus increase in complexity as the numerator or denominator terms become more complex. Residualized measures follow a similar pattern to all of the aforementioned statistics but have an additional level of complexity due to the two steps that precede their final calculations. Regardless, residuals are simply the product of regressing traditional "numerator" stats (e.g., Hits) onto traditional "denominator" stats (e.g., ABs). Given that many of baseball’s statistical calculations are fairly automated and easily accessible at this point in time (see just about any sports websites for remarkably complex, yet automated stats) I believe that the issue of calculation complexity is becoming, notably, more moot than meaningful.
For most averaged categories in baseball, a player can only "qualify" for meaningful comparisons among the baseball intelligentsia – not to mention batting titles – if that player has had at least 3 or so plate appearances for every game that his team has played. If we qualify batters across the 2000-2003 seasons being discussed here, we will have reduced our samples dramatically. Indeed, in each season we wipe out or ignore valuable information for essentially 70% of the sample, which equates to disregarding about 400+ non-qualifying players per season! Qualifying our measures is, at its best, a sad attempt to correct a statistic that is flawed in its ability to deal with low-opportunity figures. I find it hard to believe that anyone interested in measurement would agree to scrap so much data. The single argument that I have heard in favor of these "qualifying cuts" entails that non-qualifying players just don’t represent a large enough sample of opportunities to be valid for such comparisons in the first place… my response: Why not develop measures (hint: residualized statistics) that allow for the usage of all data, because they correctly and appropriately deal with players in the high sampling and low sampling ends of the continuum.
Intuitive nature of results
When comparing the intuitive nature of the results of these different approaches, we stumble upon two significant problems with averaged statistics that are not found in residualized measures. First, all averaged statistics tend to amplify results at the low-end of their denominator’s continuum. For example, the Ave statistic can produce results that range from 0 to 100% at the low-end, or unstable end, of the AB continuum (more on this later). Accordingly, many researchers, writers, and MLB itself employ the aforementioned qualified statistics that ignore these arguably unstable results. That said, without qualifying cuts in our data, we are left with results that are inappropriately suited to our purposes. Here are some Ave results from the 2003 season.
Contrast the above to findings for residualized Hits in the 2003 season using all of the available data:
It is important to note that even with qualifying cuts in place, the top performances in the AL differed when using averaged measures (Mueller) versus residualized measures (Manny Ramirez). In other words, differences between measures are found both at the low-end and high-end of the opportunity continuum.
The second problem found with averaged measures is their inability to reflect era-specific fluctuations. Ask yourself, do averaged measures transcend changes in hitting and pitching across generations (e.g., power vs. dead ball eras)? The answer is "yes" and "no." Averaged statistics literally mean the same thing regardless of what’s happening in an era, which seems to be a good thing. Yet we all know that if a batter posted a .300 batting average during the 1970’s that he probably performed at a higher level than a person who posted the exact same .300 average during the 2003 season, or during a hitter’s era. In comparison, residualized versions of these measures do an outstanding job of accounting for these era-specific fluctuations by producing different prediction slopes for different eras. For example, the following table displays the top and bottom 10 all-time single season hitting performances from 1900 through 2003.
A further example of this approach can be demonstrated by comparing the top 10 HR hitters of all time.
From the above tables it is apparent that the residualized statistics have corrected for era-specific performances. Although Ty Cobb would have ranked 3rd on the all-time list if we used qualified Ave as our guide, the remarkable nature of his accomplishment in 1911 emerges when we examine his performance relative to his contemporaries: 71.4 hits beyond expectations specific to that season! The opposite can be said for the remarkably poor performances by Frankie Crosetti in ’39, ’40, and ’37. In terms of HR seasons, we find that Bonds is at the top in all three approaches, yet Ruth’s relatively meager 60 HRs in ’27 now appropriately ranks second among all players with 53 HRs above his era’s statistical expectations. Comparing these numbers against the HRs per AB ratio, we find a rank order that is slightly better than the raw HR numbers, but has difficulty making fine distinctions between performances.
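The era correction comes from fitting a separate prediction line for each season, so the same raw line earns a different residual depending on the hitting environment. Here is a rough sketch of that idea; the two "seasons" and all of their numbers are invented for illustration.

```python
from collections import defaultdict

def fit_line(x, y):
    """Ordinary least-squares (intercept, slope) for y regressed onto x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Invented rows of (season label, AB, Hits); not real data.
rows = [
    ("low-offense", 100, 24), ("low-offense", 300, 74),
    ("low-offense", 500, 126), ("low-offense", 600, 150),
    ("high-offense", 100, 28), ("high-offense", 300, 86),
    ("high-offense", 500, 146), ("high-offense", 600, 174),
]

# Group the rows by season, then fit one prediction line per season.
by_season = defaultdict(list)
for season, ab, h in rows:
    by_season[season].append((ab, h))

lines = {}
for season, pairs in by_season.items():
    abs_, hits_ = zip(*pairs)
    lines[season] = fit_line(abs_, hits_)

def residual_hits(season, at_bats, hits):
    """Hits above or below that season's own prediction line."""
    intercept, slope = lines[season]
    return hits - (intercept + slope * at_bats)

# The same raw line (150 hits in 500 ABs) judged against each season.
same_line = {s: round(residual_hits(s, 500, 150), 1) for s in lines}
print(same_line)
```

In this toy data the identical 150-for-500 season grades out much higher in the low-offense season than in the high-offense one.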
Cross-time consistency comparisons
Another important test for performance statistics has to do with their reliability, consistency, or ability to predict themselves across time. When we directly compare the statistical features of averaged measures to residualized measures (as is the purpose of this paper), the latter are clearly the better statistics to use if one is interested in season-to-season predictive ability. The cross-time consistency correlations for the Ave, OBP, SLG, plus a couple of Bill James statistics, reveal a mean cross-time correlation for the averaged measures of r = .37 (r² = .15) and a mean cross-time correlation for the residualized versions of r = .65 (r² = .42). In other words, the residualized measures explain, on average, a remarkable 25% more of the variance in future scores than do commonly used averaged measures. Furthermore, and quite surprisingly, all of the residualized measures out-predict their averaged counterparts in predicting future averaged scores (e.g., SLG’01 predicts SLG’02 at r = .31, but rTB’01 predicts SLG’02 at r = .35)!
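For readers who want to run this kind of consistency check themselves, a cross-time correlation is simply the Pearson r between the same players' values in two seasons, and r² is the share of variance explained. This sketch uses invented numbers for six hypothetical players.

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)

# Invented values of one statistic for the same six players
# in two consecutive seasons (same player order in both lists).
year_1 = [12.0, -4.5, 3.0, -9.0, 20.5, -1.0]
year_2 = [8.0, -6.0, -2.5, -3.0, 15.0, 4.0]

r = pearson_r(year_1, year_2)
r_squared = r ** 2  # proportion of year-2 variance explained by year 1
print(f"r = {r:.2f}, r^2 = {r_squared:.2f}")
```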
In order to illustrate that even statistics born from wisdom are not immune to the problems of averaged measures, let’s take a look at a couple of Bill James’ stats that use averaging: Isolated power (IsoP) and secondary average (SecA). The SecA stat is defined as (TB-H+BB+SB-CS)/AB. We can residualize the numerator by regressing it onto ABs. The two left-most matrices of the below table reveal that the residualized version (rSecA) is more consistent across time. The rSecA stats are also stronger predictors of future SecA stats (e.g., SecA’01 predicts SecA’02 at r = .52, but rSecA’01 predicts SecA’02 at r = .56).
Since the isolated power measure is defined as being equal to SLG minus Ave, we can approximate this stat with residualized measures using rTB minus rH (call it rIsoP). The two right-most matrices in the above table reveal that the rIsoP measure is, once again, both more consistent across time and a better predictor of future IsoP stats (e.g., IsoP’01 predicts IsoP’02 at r = .45, but rIsoP’01 predicts IsoP’02 at r = .48).
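The two constructions above can be sketched as follows: regress the SecA numerator onto ABs to get rSecA, and subtract rH from rTB to get rIsoP. The stat lines are invented and the helper names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares (intercept, slope) for y regressed onto x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def residualize(numerator, opportunities):
    """Residuals from regressing a numerator stat onto an opportunity stat."""
    intercept, slope = fit_line(opportunities, numerator)
    return [n - (intercept + slope * o) for n, o in zip(numerator, opportunities)]

# Invented stat lines: (AB, H, TB, BB, SB, CS). Not real players.
players = [
    (120, 28, 40, 9, 1, 1),
    (310, 85, 140, 30, 5, 2),
    (450, 120, 210, 55, 12, 4),
    (520, 160, 250, 40, 20, 8),
    (600, 170, 330, 90, 3, 1),
]
ab = [p[0] for p in players]
h = [p[1] for p in players]
tb = [p[2] for p in players]

# SecA numerator as defined in the text: TB - H + BB + SB - CS.
sec_num = [t - hh + bb + sb - cs for (_, hh, t, bb, sb, cs) in players]
sec_a = [n / a for n, a in zip(sec_num, ab)]  # averaged version
r_sec_a = residualize(sec_num, ab)            # residualized version

# rIsoP as described in the text: rTB minus rH.
r_iso_p = [rt - rh for rt, rh in zip(residualize(tb, ab), residualize(h, ab))]
print([round(v, 1) for v in r_sec_a])
print([round(v, 1) for v in r_iso_p])
```

Because least-squares residuals are linear in the numerator, rTB minus rH gives the same numbers as regressing extra bases (TB - H) onto ABs directly.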
If we employ qualifying cuts for both the averaged and residualized measures, their respective mean cross-time correlations are exactly the same (r = .63). Thus, although residualized measures certainly out-perform averaged stats at the low-end of the opportunity continuum, both appear to be equally strong predictors when isolated to high-end opportunity performances.
Resulting statistical relationship with opportunity stats
Recall that residualized measures are always uncorrelated with their denominator statistic, meaning that we have appropriately removed the influence of opportunity from the residual measures. However, the denominator of average stats (e.g., Ave, OBP, SLG, etc.) has quite a large impact upon these supposedly "opportunity-free" measures. As can be seen in the following tables, all of the relationships between these averaged stats and their denominators are positive, linear, and substantial enough to be alarming.
Residualized measures are more statistically valid than averaged measures due to the latter’s troubling correlations with their denominators; yet we need to explore whether the former are entirely unrelated to opportunity. As can be seen in the following scatterplots (Ave & rH related to ABs in 2003), we find that both averaged and residualized measures have variances that are conditional upon level of opportunity.
Stated differently, batters that have relatively few ABs because of injury, September call-ups, demotion, etc. (e.g., 2003 Ave leaders Jesse Garcia, Kit Pellow) can have Aves that range from zero to perfection with an average variability (SD) of about 180 points. Yet as opportunities increase for healthy players, veterans, etc., their averages actually narrow in range considerably with an average variability (SD) of about 25 points. The exact opposite trend can be found with residualized measures. Players with very few ABs tend to cluster around the mean of zero and have little variability (SD ≈ 2 hits), while players become increasingly differentiated from the mean (SD ≈ 17 hits) as their opportunities to perform increase.
If we consider the AB or opportunity continuum to be some proxy measure for the confidence that we have in our results (i.e., confidence increases as our sampling of ABs, PAs, Inns, etc. increases), then two very important trends emerge: (1) with averaged measures, variance decreases as you advance from the low-end to the high-end of the opportunity continuum, thus making distinctions between ability more difficult at the reliable end of the opportunity continuum and easier at the unreliable end; (2) with residuals, variances increase as you advance from the low-end to the high-end of the opportunity continuum, thus making distinctions between ability easier at the reliable end of the continuum and more difficult at the unreliable end. Needless to say, I believe that the second trend is far more appropriate for performance evaluation.
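The two opposing variance trends can be reproduced with a small simulation. This is only a sketch under a strong simplifying assumption: every simulated batter has the same true hit probability (0.27, an invented figure), and the residual is measured against the known expectation rather than a fitted regression line.

```python
import random
from statistics import pstdev

random.seed(2003)  # fixed seed so the run is repeatable
TRUE_RATE = 0.27   # assumed identical true hit probability for every batter

def simulate_hits(at_bats):
    """Number of hits in a given number of independent at-bats."""
    return sum(1 for _ in range(at_bats) if random.random() < TRUE_RATE)

def spread(at_bats, n_players=400):
    """SD of Ave and SD of residual hits among simulated players with equal ABs."""
    hits = [simulate_hits(at_bats) for _ in range(n_players)]
    aves = [h / at_bats for h in hits]
    # Simplification: deviation from expected hits at the known true rate.
    resid = [h - TRUE_RATE * at_bats for h in hits]
    return pstdev(aves), pstdev(resid)

low_ave_sd, low_resid_sd = spread(20)     # low-opportunity group
high_ave_sd, high_resid_sd = spread(550)  # high-opportunity group

print(f"Ave SD:   {low_ave_sd:.3f} (20 AB) vs {high_ave_sd:.3f} (550 AB)")
print(f"Resid SD: {low_resid_sd:.1f} (20 AB) vs {high_resid_sd:.1f} (550 AB)")
```

Even with identical talent, chance alone makes Ave spread out at low ABs and narrow at high ABs, while residual hits do the reverse.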
One further yet critical difference between averages and residuals is how they answer the question: What should we do with those low-opportunity players? Residualized measures begin by putting everyone at zero – or firmly at the group mean – and then modify as we acquire more information. Guessing zero (or the mean) for a player that we have little or no information about is perfectly reasonable; for rudimentary statistical logic dictates that if we know little or nothing about some value, our best guess is the mean for that group of values. Think about this in terms of a typical September call-up. If we don’t know much about him, what should our best guess be about his likely performance in the majors? Should we guess that he’ll perform at some non-specific level between a .000 and a 1.000 Ave? Should we flip a coin? If so, we’re doing exactly what all averaged measures do. Residualized stats start players at the league mean because it is our best possible guess without any other info. Thus, unlike averages, residuals take the proper theoretical and statistical approach to solving the low-AB problem.
One can make just about any statistic into an average, and thus one can make just about any statistic into a residualized measure. Specifically, as long as the average or ratio in question has a numerator and a denominator, a residualized measure can be easily developed using those exact same numbers. For example, we can translate Nomar’s 2003 stats from .301/.345/.524 into 13/-6/25. Residuals can also be used to estimate a player’s exact number of hits, walks, etc. given park factors, age curves, etc. (i.e., park, age, etc. can be included in the regression equation directly, or used to correct raw data prior to the regression procedure). Here is a table of the top 10 slugging performances in 2003 which includes a column of figures corrected for ballpark; the largest difference, for obvious reasons, can be seen in Todd Helton’s numbers.
Further, because residuals do a great job with low-opportunity performances, these statistics are preferable when using "splits" (e.g., pitcher success vs. lefties or righties), minor-league data, or when evaluating September call-ups. For example, the following table includes results for the 2003 AAA International and Pacific Coast Leagues. Note that the residualized figures are now directly comparable across leagues.