Don’t Fear the Reaper
A Response to Steve Treder’s “What Pitch Counts Hath Wrought”
At The Hardball Times, Steve Treder recently published an article looking at the history of pitch counts. Agreeing with Bill James and Don Malcolm, Treder argues that the reliance on pitch counts has been misguided. Fortunately, his main concern is not sabermetric navel-gazing, so he does not critique the work done by, for example, Baseball Prospectus.
Instead, his main concern is that top starters are used less now than they used to be, and whether this leads to fewer injuries. Using Tangotiger's pitch count estimator, he looks at league leaders in pitch counts over the post-war period. Unfortunately, Treder takes aim at the wrong target, and he overlooks some of the potential victims of high pitch counts. His analysis also leaves a bit to be desired, but since I'm not going to improve on it here, I am equally guilty.
Pitchers ain’t what they used to be
Treder’s main finding is that starting in the mid-80s, there was a significant drop in the total number of pitches thrown in a season by the league leader.
There were roughly three periods. From 1946 to 1969, the heaviest-worked pitcher in the majors was estimated to throw between about 4,500 and 4,900 pitches in a season. [Warning: almost every pitch count referenced in this article is estimated, not actual, and I will mostly drop the "estimated" qualifier from here on out. If we had more actual, game-by-game pitch count data, we'd probably have settled these arguments by now.] From 1970 through the early 1980s, the league leaders tended to throw between 4,900 and 5,500 pitches in a season. Starting in 1984, that dropped substantially: between 1984 and 2003, the league leader topped 4,300 pitches only twice, and in 2003 the leader dipped below 4,000.
Along with this, of course, came a drop in innings pitched. Innings not pitched by the best starters leave more innings to be pitched by other pitchers. Some of these innings are picked up by 5th starters, and some by the bullpen (some of whom are probably more effective in their short stints than the starters would be).
Minor problems with Treder’s analysis
The first problem is that he has to rely on Tangotiger's pitch count estimator. The estimator probably works quite well at the league level and probably quite well overall, but how well it does at predicting extreme observations like the league leaders is, to my knowledge, unknown. [See the posted "Research Note" for more information.]

Unfortunately, right now I have access only to the pitch count data available on ESPN. For 2003, the pitch count estimator names Roy Halladay the league leader with 3,950 pitches. In actuality, Halladay was 4th with just 3,630 pitches. By the estimate, Halladay threw 110 pitches per start; he actually threw just 101. That seems to be an extreme case. Randy Johnson was estimated to throw 4,116 pitches in 2002 but actually threw 3,996, a gap of 120 pitches, or about 3.5 pitches per start. And in 2001, the estimator underestimates Johnson's total by 58 pitches. That's as far back as ESPN goes. Who knows how well the estimator has predicted the league leaders in earlier eras.

Treder also uses Tangotiger's "basic" estimator rather than the "expanded" one, and an estimate of the number of batters faced (3*IP + H + BB) rather than actual batters faced (which is now available in the Lahman database). There are some fairly large differences between the two estimators. For example, for Bob Feller in 1947, the basic estimator gives him 4,712 pitches while the expanded one gives him just 4,509, a difference of 203 pitches. The difference between the two estimators is not constant across eras: over the last 10 seasons' league leaders, the basic estimator runs from 1% to 4.7% higher than the expanded estimator; over the first 10 seasons, it runs from 2.6% to 5.9% higher. The basic estimator always produces a higher pitch count in this sample (and in general).

But I don't want to overstate this. Assuming Tangotiger is correct that the expanded estimator is more accurate, the basic estimator has overstated the number of pitches thrown in every era, slightly more so in the earlier era. The difference across eras is not extreme, though, and the two estimators are highly correlated, so using the expanded estimator would not change any conclusions about trends. Consequently, for ease and for consistency with Treder's article, I will stick with the basic estimator.
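For reference, here is a minimal sketch of the basic estimator as I'll be using it. The coefficients (3.3 pitches per batter faced, plus 1.5 per strikeout and 2.2 per walk) are the commonly cited values for Tangotiger's basic estimator rather than anything taken from Treder's article, and Feller's 1947 line below is filled in from memory; reassuringly, the combination reproduces the 4,712 figure cited above.

```python
def estimated_bfp(ip: float, h: int, bb: int) -> float:
    """Estimated batters faced, the estimate Treder uses: 3*IP + H + BB."""
    return 3 * ip + h + bb

def basic_pitch_estimate(bfp: float, so: int, bb: int) -> float:
    """Tangotiger's basic pitch count estimator (commonly cited coefficients):
    3.3 pitches per batter faced, plus extra for strikeouts and walks."""
    return 3.3 * bfp + 1.5 * so + 2.2 * bb

# Bob Feller, 1947 -- season line assumed here: 299 IP, 230 H, 127 BB, 196 SO.
bfp = estimated_bfp(299, 230, 127)                 # 1,254 estimated batters faced
print(round(basic_pitch_estimate(bfp, 196, 127)))  # -> 4712, matching the text
```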
Just keep in mind that one should never compare, for example, actual pitch counts from today with estimated pitch counts from earlier eras, and that those pitchers of yore may not have been quite as studly as the numbers in Treder's article suggest.
The second problem is using league leaders to display a trend, unfortunately a pretty common practice in baseball analysis. Pretty much by definition, the league leader is the most extreme outlier. Often the leader isn't all that extreme and will give you a good idea of the general trend, but sometimes he won't. For example, in 1972, Wilbur Wood led the majors with 376.7 IP, about 10% more than the #2 finisher, and was one of just 4 pitchers with more than 300 IP. In 1946, Bob Feller led the league with 371 IP, nearly 80 (>25%) more than the #2 and over 100 IP more than the NL leader. In contrast, in 2003, while Halladay still led the #2 pitcher by 10%, there were at least 19 pitchers within about 25% of his total. In short, Roy Halladay is a much better example of the current trend than Wood or Feller were of the trends of their day.
Or to put it another way, in 2003, Halladay led the majors with 266 IP. In 1946, Howie Pollet led the NL with 266 IP and Tex Hughson was 3rd in the AL with 278. Those numbers don't look so different from Halladay's.
To investigate this, I looked at the #5 and #10 pitcher in terms of the number of pitches in each year, and all starters with at least 25 starts. Table 1 displays the mean number of pitches in the “medium” pitch count era (1946-1969) and the “low” pitch count era (1984-2003) for the #1, #5, and #10 finishers and all pitchers with at least 25 starts.
It would seem that using league leaders, which puts the difference at 12%, overstates it. Keeping in mind that starters in those days often pitched in relief, the difference in pitches between "average" starters was about 9%, and the difference between "top" starters somewhere around 6% to 9%. Those are still sizeable differences, of course.
The Wrong Target
There’s an interesting line in Steve’s article:
Whatever the case, it’s certain that what pitch count limits (and their first cousin, the five-man rotation) have created is a situation in which the very best pitchers of the current day ply their trade quite a bit less frequently than did their predecessors.
There’s no disagreeing with that statement, except that it’s the first cousin that’s to blame.
I looked at the league leaders in the low pitch count era (1984-2003), ignoring the strike years of 1994 and 1995. Those 18 league-leading seasons totaled 75,435 pitches, 4,773 innings pitched, 638 starts and 3 relief appearances. That’s an average of 4,191 pitches, 265 innings pitched, and 35.4 starts.
I looked at the league leaders in the medium pitch count era (1946-1969). Those 24 league-leading seasons totaled 114,483 pitches, 7,518 innings pitched, 933 starts and 59 relief appearances. That’s an average of 4,770 pitches, 313 innings, 38.9 starts and 2.5 relief appearances.
Now in comparing these, one problem is the relief appearances. I decided to go with the rule that 3 relief appearances is the equivalent of one start, giving the low era 639 starts and the medium era 953 starts. But I’ll also take a look at it assigning just 1 IP per relief appearance.
I then formed low era to medium era ratios for games started, # of pitches thrown, and innings pitched:
Games started ratio: .894
Pitches ratio: .879
Innings pitched ratio: .847
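All three ratios follow directly from the era totals above; here is a quick check (a sketch in Python using the article's own numbers, with relief appearances already converted at 3 per start):

```python
# Era totals from the text: 1984-2003 excluding strike years, and 1946-1969.
low = {"seasons": 18, "pitches": 75_435, "ip": 4_773, "gs": 639}
med = {"seasons": 24, "pitches": 114_483, "ip": 7_518, "gs": 953}

# Ratio of per-season averages, low era over medium era.
for key in ("gs", "pitches", "ip"):
    ratio = (low[key] / low["seasons"]) / (med[key] / med["seasons"])
    print(f"{key} ratio: {ratio:.3f}")  # gs .894, pitches .879, ip .847

print(low["pitches"] / low["gs"])  # ~118 pitches per game started, low era
print(med["pitches"] / med["gs"])  # ~120 pitches per game started, medium era
```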
So what we see here is that the lower number of pitches thrown in the modern era is a nearly perfect match for the lower number of games started. Another way to look at it is pitches per game started: in the low era, it's 118 p/gs; in the medium era, 120 p/gs. The gap is a bit smaller if we use the expanded pitch count estimator.
So maybe pitchers are what they used to be after all
In the last few years, we've been debating pitch count limits as they apply to pitches in a game. This clearly shows that it's not that starters aren't worked as hard on average; it's that they're getting fewer starts. The culprit behind lower seasonal pitch totals is not the enforcement of in-game pitch limits; it's that the orthodoxy of the 5-man rotation has led to fewer starts, and naturally fewer innings and fewer pitches in a season.
Note: if we limit relief appearances to 1 inning each, the gap widens, but not hugely. Pitches per game started would still be 118 in the low era but rises to 122 in the medium era. The games started ratio rises to .91, the pitches ratio (counting starts only) to .885, and the innings ratio to .853.
We see, however, that the innings pitched ratio is a good bit lower than the games started and pitches ratios. Treder downplays the notion that pitching today is more stressful, but I think this shows that pitchers have to work a bit harder to get through an inning now. Using these top-starter numbers, the average inning in the low era is 15.8 pitches and in the medium era 15.2, or about 4 extra pitches per 7 innings.
These are small differences, but small differences can add up. Using the pitches per inning numbers, 118 modern pitches are the equivalent, in innings, of about 113 pitches in the earlier period. The difference between those 113 pitches and the 120-122 that earlier pitchers threw per start is about half an inning. Half an inning per start is 81 IP per year for the team, or the equivalent of one reliever. Note that about half of that difference is due to the larger number of pitches per inning.
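To spell out that arithmetic (a sketch using the per-inning figures above; I take 121 as the midpoint of the 120-122 range):

```python
low_ppi, med_ppi = 15.8, 15.2  # pitches per inning, low era vs. medium era
low_pgs = 118                  # pitches per game started, low era

# Convert a modern 118-pitch start into medium-era pitches via innings:
innings_per_start = low_pgs / low_ppi        # ~7.47 innings per start
equiv_pitches = innings_per_start * med_ppi  # ~113.5, the "about 113" above

# Shortfall versus the 120-122 pitches per start that medium era leaders threw:
shortfall_innings = (121 - equiv_pitches) / med_ppi  # ~0.5 innings per start
print(shortfall_innings * 162)  # ~80 IP over a season -- about one reliever
```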
These differences are about the same if we look beyond the league leaders. The gap in pitches per game started actually appears to widen slightly as you move toward the "typical" starter: looking at mean pitches per game started for pitchers with 25+ starts in each era, the gap is 107 (medium era) to 102 (low era). With the difference in pitches per inning, this still adds up to about half an inning, and here too the increase in pitches per inning is responsible for about half of that difference.
I think we can all agree that today’s pitchers could handle another 2 to 5 pitches per start. But that’s not really the source of the “problem” that Treder is concerned with. The questions are: (1) can we reduce the number of pitches per inning and (2) can we get rid of the 5th starter?
The missing piece
The underlying question is whether high pitch totals lead to injuries. Treder doesn't have the data to look at this (neither do I, and as far as I know no one has a good historical database of either actual pitch counts or injuries). Unfortunately, this doesn't keep him from claiming that pitchers of yesteryear were no more likely to be injured than today's. No data are presented to support or contradict this view. And keep in mind that we are talking about very small sample sizes here – even if there is something like a 10% difference in injury rates between eras, that's an injury here or an injury there. We should not expect casual observation to detect such small differences (not that this will stop me).
More importantly, as the above shows, the differences in seasonal pitch counts across eras are a function of fewer starts, not of in-game pitch limits. As such, this study adds little or nothing to the current debate about in-game limits.
However, bearing in mind the caveats above, his study might shed some light on whether the use of 5th starters has reduced injury risk. So let's look more closely at the data that Steve does present and the specific injury claims he makes. I've kept Treder's designation of the medium era (1946-1969), but it's clear that the low era goes back to at least 1984. Note that this "helps" his point, as it moves Fernando Valenzuela's injury into the low era.
I’m going to ignore the high era. First, after omitting 1981 due to the strike, it’s only 13 seasons long. Second, knuckleballers Wilbur Wood and Phil Niekro led the league 5 times. Steve Carlton led the league three times at ages 35, 37, and 38 – long after we care about impacts on long-term health. Add two seasons by super-freak Nolan Ryan and this really isn’t an interesting group of pitchers for our purposes. Finally, neither Treder nor I consider a return to this type of pitcher usage to be likely.
In discussing the medium era, Treder notes that only 3 of the league leaders over this 24-year period had their careers ended by major arm injuries. However, those 24 seasons were produced by only 15 different pitchers (no knuckleballers, as far as I know). 3 out of 15 isn't that high, but it's higher than 3 out of 24 makes it sound.
The three pitchers cited are Sandy Koufax, Vern Bickford, and Denny McLain. Koufax was finished after his age-30 season. McLain was essentially finished after his age-25 season and completely finished after age 28. Bickford didn't make the majors until age 27 and had his big year at age 29 (312 IP), but he threw just 165 and 161 IP over the next two years and just 62 IP over the two seasons after that.
However, Steve leaves out Don Drysdale, who was done at age 31 (he pitched 62 IP at age 32 with an ERA+ of 75). According to dondrysdale.com, this was due to a torn rotator cuff.
He also skips over Johnny Sain, who led the league in 1948 with 4,757 pitches (315 IP and an ERA+ of 147). Sain did follow that with a more than respectable 243 IP, but with an ERA+ of just 79, and he had his last 200-IP season in 1950 at age 32. Sain's career consists of a great 1946-1948 stretch and a nice 1953 as a swingman; in short, he was not the same after 1948.
Robin Roberts could be another. He remained a durable starter, but he was a much less effective pitcher after age 28, his last of four seasons leading the league in pitches. Bob Feller likewise was not the same pitcher after age 30.
So the tally of “diminished careers” could go as high as 7 out of the 15 pitchers.
Potentially more interesting: 12 pitchers in this era led the league at least once before the age of 30. Four of those 12 had their careers essentially end by age 32 (Koufax, Drysdale, McLain, Bickford); Robin Roberts was durable but less effective after age 28; and Bob Feller was both less durable and less effective after age 30. None of the twelve had a good late career, though Bob Friend certainly has nothing to be ashamed of.
Is that Typical?
Of course, lots of pitchers don't make it past age 32, and lots more decline in their 30s (then again, most pitchers aren't Hall of Fame caliber, as Koufax, Drysdale, McLain, Feller, and Roberts clearly were). Is there any evidence that the eras are different in this regard?
Without an historical database of injuries, that’s impossible to answer with certainty. I hope to follow this piece up with another that tries to find a proxy for injury and thereby analyze data beyond this small sample. But for now, we are limited. Note that no matter what we do, we’ll always have the problem that injuries that used to end careers often no longer do. Another problem is that teams carry more pitchers, so it’s probably easier for older, declining pitchers to hang on. Expansion might have had a similar effect, but I think the talent pool has expanded at least as much as the total number of roster spots.
The most obvious thing to do is to look at the low era (1984-2003, dropping the strike years). Ignoring Charlie Hough because he's a knuckleballer and Roy Halladay because it's too soon to tell, eight pitchers have led the league in this period. Six of them (Roger Clemens, Mike Moore, Randy Johnson, Kevin Brown, Fernando Valenzuela, Pat Hentgen) did so before age 30. None of those pitchers had his career literally end by age 32, though Valenzuela and Hentgen had severe arm problems and were pretty much finished by that age. Kevin Brown has a reputation for being injury prone, but he was in fact quite durable and effective until age 36 (note: he wasn't a really good pitcher until age 30). Mike Moore was never consistently effective (before or after), but he was durable through age 33 and pitched through age 35.
Given the sample size, we can't say anything with certainty. Clearly the differences aren't huge and are open to interpretation, but I'd argue that the 5th-starter era comes out the better of the two: all of the latter pitchers survived past 32, and 2-3 of them have had very successful late careers (depending on how you feel about Brown).
Of course, you can argue that the eras are equal by considering Hentgen and Valenzuela similar to, say, Bickford and Drysdale. Clemens and Johnson both had seasons with arm trouble. And I would argue that even if there is a difference, there’s no reason to ascribe it to having fewer starts – there are too many potentially confounding variables.
Does this have any implications for the in-game pitch count controversy?
As I noted, Treder's focus is on the historical trend, which is perfectly proper. As such, it doesn't address the research done (and the unsubstantiated claims made) over the last few years by Baseball Prospectus. However, James' and Malcolm's focus was on that Baseball Prospectus research, and Treder's work gives us one interesting insight here.
All disciplines eventually become more focused on their internal squabbles than on the subject they originally formed to study. An important finding in Steve's analysis is that the current low seasonal pitch count trend began in 1984, twenty years ago – long before anyone came up with PAP (Pitcher Abuse Points). And unless there was some secret sabermetric revolution 20 years ago, the move to low seasonal pitch counts was the result of "old school" thinking (or "new old school" thinking).
So while there’s much to criticize in BP’s work on pitch counts, it was folks like Whitey Herzog, Tony LaRussa, Chuck Tanner, etc. who created the low seasonal pitch count environment.
The current debate on pitcher injuries has centered on the impact that high pitch count games have on starters. The most controversial measures have been Baseball Prospectus' "PAP", "PAP3", and "stress". Leaving aside Prospectus' methods, claims, etc., can seasonal pitch counts help us evaluate the impact of these measures?
PAP was based on raw pitch counts. As such, seasonal pitch counts and/or pitches per start probably correlate well with PAP and might be useful as a proxy to test whether higher PAP is associated with higher injury risk. However, even Prospectus has junked PAP.
PAP3 is a non-linear function intended (or claimed) to measure an increasing impact as the in-game pitch count climbs higher – e.g., such that pitch 115 is potentially more harmful than pitch 114.
It is defined as:
PAP3 = (# pitches – 100)³ for starts of more than 100 pitches, and 0 otherwise
Prospectus showed that a single game with very high PAP3 led to a small, short-term decline in effectiveness.
In that same article, they introduced “stress” which they defined as PAP3 divided by the total number of pitches, both summed over some period of time (they used career-to-date in the article). They showed that high stress pitchers were more likely to suffer an injury of at least one month (though they provided no significance test).
Unfortunately, seasonal pitch counts cannot capture the non-linear nature of PAP3 or stress. Even pitches per start is not a good proxy for PAP3. For example, 30 starts of exactly 110 pitches each add up to 30,000 points of PAP3; 15 starts of 120 pitches and 15 starts of 100 pitches produce the same mean of 110 pitches per start, but 120,000 PAP3 points.
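A minimal sketch of that calculation (my own illustration; the cubic form follows the definition given above):

```python
def pap3(pitches: int) -> int:
    """PAP3 for one start: (pitches - 100) cubed, zero at or below 100 pitches."""
    return max(0, pitches - 100) ** 3

even = [110] * 30                 # 30 starts of exactly 110 pitches
uneven = [120] * 15 + [100] * 15  # same 110-pitch mean, less evenly distributed

print(sum(map(pap3, even)))    # 30,000 PAP3 points
print(sum(map(pap3, uneven)))  # 120,000 PAP3 points -- four times as many
```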
“Stress” complicates things further.
If a low era starter has 30 starts of exactly 110 pitches each, his stress score for that season is about 9 (30,000 PAP3 divided by 3,300 pitches). If a medium era starter has 35 starts of exactly 110 pitches each, his stress score is also about 9 (35,000 PAP3 divided by 3,850 pitches). If Prospectus is correct, there's no reason to expect the first starter to have a lower injury rate.
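The same point in code (again a sketch; "stress" here is season PAP3 divided by season pitches, per the definition above):

```python
def pap3(pitches: int) -> int:
    return max(0, pitches - 100) ** 3

def stress(starts: list[int]) -> float:
    """Season 'stress': total PAP3 divided by total pitches."""
    return sum(map(pap3, starts)) / sum(starts)

print(stress([110] * 30))  # low era:    30,000 / 3,300  ~ 9.09
print(stress([110] * 35))  # medium era: 35,000 / 3,850  ~ 9.09
```

Fewer starts at the same per-start workload leave the stress score unchanged.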
If we had found substantially different values for mean pitches per start across eras, it would be highly likely that the low era pitchers had lower values of stress, and we could therefore use differences in injury rates to assess Prospectus' finding. But mean pitches per start are very similar in the low and medium eras, so there's no immediate reason to think that one era had higher "stress" than the other.
Note that there are reasons to suspect that, even with similar pitches per start, medium era pitchers had more high pitch count starts: they had more complete games, which suggests more high pitch count games. If that's so, then to maintain the same mean they must also have had more low pitch count games. As we saw above, between two pitchers with the same number of pitches per start, the one with more high pitch starts will have the higher PAP3. But we can't know this, nor its magnitude or impact on "stress", without game-by-game pitch count data.
Treder is analyzing the historical change to 5th starters (i.e., the reduction in the number of starts among the very top pitchers), not a historical change in pitches per start. He advocates pitcher usage more similar to that of the medium era.
In conclusion, let me note that Rany Jazayerli, who came up with the idea of “pitcher abuse points”, is an advocate of a return to a “4-man” rotation – i.e. more starts for your best starters. The data in Treder’s article suggest that if today’s top starters returned to starting 38-40 games a year, they’d throw as many total pitches as their forebears.
This suggests that "both sides" agree that a return to at least 1946-1969 usage levels is possible.
Posted: August 10, 2004