Tuesday, November 04, 2003

Defensive Regression Analysis

Michael introduces his new method for evaluating defensive play.  Part 1 of 3.

INTRODUCTION

"As late as June 4, 2002, . . . there were still big questions about baseball crying out for answers; a baseball diamond was still a field of ignorance.  .  .  .  No one had established to the satisfaction of baseball intellectuals exactly which part of defense was pitching and which fielding, and no one could say exactly how important fielding was.  No one had solved the problem of fielding statistics."  Michael Lewis, Moneyball, p. 98 (W. W. Norton & Co. 2003).

Thanks to the efforts of many people, a solution to the problem of estimating the run-impact of fielding using traditional fielding and pitching statistics (and evaluating the run-impact of pitching independent of fielding) may be within reach.  In this article, I would like to introduce a new system that provides runs-saved estimates for pitchers and fielders using traditional statistics publicly available throughout the history of baseball; in other words, without recourse to non-public "zone" data, which has been compiled since the late 1980s.

The system is called Defensive Regression Analysis ("DRA").  (The acronym conveniently rhymes with ERA; fans unfamiliar with regression analysis can think of it as Defensive Runs Analysis.  I'll provide a simple description of regression analysis in Part I below.)  To the best of my knowledge, DRA is the first defensive system that

(i) deals with pitching and fielding simultaneously, i.e., on a fully integrated basis,

(ii) does not rely on any subjective weights or factors,

(iii) uses only publicly available statistics in existence throughout the history of baseball (e.g., no Caught Stealing or Sacrifice Hits Allowed), and

(iv) estimates, using simple “linear weights” equations (similar in form to Linear Weights equations used for evaluating hitters in the official baseball encyclopedia, Total Baseball), the runs saved (allowed) by pitchers and fielders (a) relative to the league average and (b) independently of each other.

A team’s pitcher and fielder DRA ratings add up to the team DRA rating, i.e., an estimate of the number of runs the team should have allowed.  The standard error of that estimate in the 1974-2001 study used to create DRA, 19.7 runs, is less than the standard error for regression analysis models for team runs scored, as well as those of all other systems for evaluating offense described in the recent book co-authored by two former Chairs of the Sports Section of the American Statistical Association.  See Albert and Bennett, Curve Ball: Baseball, Statistics, and the Role of Chance in the Game (Copernicus Books 2001) (hereinafter, "Curve Ball").  In other words, the "parts" under DRA "add up" as well as or better than the "parts" under Linear Weights or Runs Created or any other system for batting and baserunning of which I am aware.

What is probably most exciting about DRA is that it provides individual fielder runs-saved estimates that match up well with runs-saved ratings derived from proprietary zone data, such as Mitchell Lichtman's Ultimate Zone Rating ("UZR") system (available at baseballprimer.com) and zone-based evaluations posted by Diamond Mind ("DM") on its website.  (Disclaimer: I have neither sought nor received any endorsement or any other support from Diamond Mind.  The DM evaluations that I cite are all publicly available on its webpage.  My interpretations of DM evaluations are entirely my own.)  DRA ratings have the same "scale" as UZR ratings, and an over 0.8 correlation with UZR ratings, when the latter are adjusted to incorporate DM evaluations.

Zone-type ratings, as you're probably aware if you're reading this, are based on highly detailed records of the actual number of balls hit into each of approximately 80 "zones" on the field.  Zone data began to be collected in the 1980s because of the fundamental problem with traditional fielding statistics: they provide no direct record of fielding opportunities.  Judging fielders based on gross plays made (e.g., the total number of fly balls caught by an outfielder) was therefore no more reliable (actually far less reliable) than judging batters by the gross number of their hits and walks.  Zone ratings essentially compare the number of plays made by the fielder with the number the league-average fielder would make, given the same number and pattern of balls hit into his zones.  UZR converts the "plays made" numbers into runs saved, based on the change in expected runs allowed if a given play is made or not made; i.e., depending on whether the play made, on average, prevents a single or extra-base hit.
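
The arithmetic behind a zone-type rating can be sketched in a few lines of Python.  This is only an illustration of the general idea described above, not the actual UZR calculation: the zone names, the league-average conversion rates, and the run value per play are all made-up assumptions.

```python
# Hypothetical illustration of the zone-rating idea; every number is made up.

def zone_runs_saved(balls_in_zone, plays_made, league_out_rate, run_value_per_play):
    """Return (plays above average, runs saved) for one fielder.

    balls_in_zone:      {zone: balls hit into that zone while he was fielding}
    plays_made:         {zone: plays he actually made in that zone}
    league_out_rate:    {zone: share of such balls a league-average fielder converts}
    run_value_per_play: {zone: assumed change in expected runs when the play is made}
    """
    plays_above_avg = 0.0
    runs_saved = 0.0
    for zone, balls in balls_in_zone.items():
        expected = balls * league_out_rate[zone]   # what an average fielder would make
        extra = plays_made.get(zone, 0) - expected
        plays_above_avg += extra
        runs_saved += extra * run_value_per_play[zone]
    return plays_above_avg, runs_saved


# Made-up season for a shortstop, using two invented zones.
balls = {"hole": 100, "up_middle": 80}
made = {"hole": 45, "up_middle": 50}
rate = {"hole": 0.40, "up_middle": 0.60}
value = {"hole": 0.75, "up_middle": 0.75}   # assumed runs per extra play

plays, runs = zone_runs_saved(balls, made, rate, value)
print(round(plays, 2), round(runs, 2))
```

The point of the sketch is that opportunities enter the calculation directly, which is exactly what traditional putout and assist totals cannot supply.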

Provided in Part II are position-by-position charts of DRA and UZR runs-saved ratings for all 82 players who played at least two seasons full-time (130 or more games) at one position during the three-year time period (1999-2001) for which I had access to both UZR and DRA data.  I'm not aware that any other non-zone fielder rating system has ever been compared with zone ratings in as systematic a way.

Although you will have to be the ultimate judge, my sense is that the average rating per player over that time period under DRA essentially matches up with the UZR rating and/or DM evaluation close to 95% of the time.  In other words:

Given two or three years of publicly available data, DRA provides a reliable estimate of whether a fielder is basically average (+/- half a dozen runs per season), meaningfully above or below average (+/- a dozen runs a season), or exceptionally above or below average (+/- two dozen runs per season).

I am not promising exact matches.  For example, DRA rates Nomar Garciaparra as a +5 (runs saved) shortstop per season in 1999-2000, whereas UZR rates him at -6.  Given the imperfection in statistics, including UZR data, as well as the practical significance of runs-saved numbers of that relatively modest magnitude, I view that as an acceptable match, meaning that both systems identify Nomar as an essentially average-fielding shortstop.  The relevant findings under DRA (and supported by UZR and DM) would include, for example, that Derek Jeter is costing the Yankees enough runs to justify moving him to third base, and that Pokey Reese was saving his teams so many runs at second base (at least in 1999-2000) that it made sense to let him play, even though he is a weak hitter.

I think it fair to say that out of the 35 players with three years of data, there is only one clear and significant error, and out of the 47 players with only two years of data, there are only four clear errors.  In addition, the errors are "conservative".  By that I mean there are no false "positives" or "negatives" in the study; DRA might fail to identify a good fielder (in the study, it did not fail to identify any bad fielder), but it doesn't rate fielders in the study as meaningfully good or bad who aren't good or bad.  Though DRA might be slightly more conservative than UZR, the runs saved (allowed) ratings under DRA nevertheless have approximately the same "scale" of impact as UZR ratings: the highest DRA ratings are as high as (and no higher than) the highest UZR ratings; the lowest DRA ratings are as low as (and no lower than) the lowest UZR ratings.

One simple way of quantifying the overall match between DRA and UZR/DM is to "regress" the UZR average ratings per player onto the corresponding DRA ratings.  Regression analysis reveals that the average DRA rating of players in the study has, on average, almost precisely the same "scale" as the average UZR rating (the "coefficient" for DRA under the regression equation is nearly 1.0, actually, 0.95), and a correlation of 0.7.  When, for reasons explained in Part II.A.4, UZR ratings are adjusted to account for DM commentary, admittedly in a somewhat subjective manner, the correlation improves still further, to slightly over 0.8.  (All of the above results are provided in detail at the very end of Part II.)
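
For readers who want to see what "regressing" one set of ratings onto another involves, here is a minimal Python sketch that computes the slope (the "coefficient") and the correlation for a simple one-variable regression.  The two lists of ratings are invented purely for illustration; they are not the Part II data, and the function name is my own.

```python
from statistics import mean


def slope_and_correlation(x, y):
    """Regress y onto x: return (slope of the fitted line, correlation r)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx                 # runs of y per run of x
    r = sxy / (sxx * syy) ** 0.5      # strength of the linear match, -1 to +1
    return slope, r


# Made-up average runs-saved ratings for seven hypothetical fielders.
dra = [-20, -10, -5, 0, 5, 10, 20]
uzr = [-15, -14, -2, 3, 1, 12, 16]

slope, r = slope_and_correlation(dra, uzr)
print(f"coefficient = {slope:.2f}, correlation = {r:.2f}")
```

A coefficient near 1.0 means the two systems use the same "scale" (a ten-run fielder in one is, on average, a ten-run fielder in the other), while the correlation measures how tightly the individual ratings cluster around that line.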

Unlike zone systems, DRA can provide ratings for players throughout the history of baseball.  Part III (and the Appendix to Part III posted as an Excel file) provides DRA ratings of all players who played at least 130 games at a single position for at least five seasons anytime between 1974 and 2001, the time period for which I had convenient access to the relevant data.  I hope you will spend some time looking over these ratings, as they demonstrate even better than the three-year DRA-UZR-DM comparison the basic reliability of the system.  In the majority of cases, players peak in their youth.  The year-to-year ratings show remarkable consistency, including when players change teams.  DRA ratings drop, sometimes sharply, after players are injured.  There are almost no weird run-saving values: historically high assists and putout totals do not result in absurdly high ratings.  Great as he was, and notwithstanding his historic assists totals, Ozzie Smith was never "saving" 50 runs a season, even at his peak; 20 runs a year was more like it.  What made Ozzie probably the greatest shortstop in history (and, therefore, probably the greatest fielder in history) is that he consistently performed at or near that level for about ten years, before declining to an average level of performance.

Although I will provide a description of the principles and methodology of DRA, as well as the results of several diagnostic tests, the "linear weights" equations per position are not provided.  (The format of the equations, however, is shown in Part I.A.)  I am currently approaching several major league organizations regarding DRA, which can serve as a simple tool for double-checking zone ratings (which, due to their computational complexity, have something of a "black box" quality) and for evaluating minor league fielders (for whom zone data is unavailable).  When I say that DRA is a simple tool, I mean that it is far, far simpler than any other pitching and fielding system (zone or non-zone) that has been described in print or a public Internet forum, either fully or in general terms.  All of the equations can fit on one page, and although most elements of the equations are completely novel, they would make immediate, intuitive sense to any serious baseball fan after a brief explanation.

I can appreciate that it is somewhat difficult to place much faith in a system when not all of the details about it are available, although most fans seem to feel comfortable accepting UZR and DM ratings without having access to the data and all of the calculations under those systems.  In the case of UZR, Mitchell Lichtman has done an excellent job of explaining the core "concept" behind zone ratings and the factors that must be considered in using zone data intelligently.  Fans, therefore, feel comfortable "buying into" the system, even though the underlying data is proprietary.  Tom Tippett at Diamond Mind is also very clear about the factors he considers in evaluating fielders although, again, his data is not publicly available.  My hope here is that

(i) &#9;the description of the general principles and methodology of DRA will reassure you that the core concepts of the system are sound and that great care has been taken in making the system work,

(ii)&#9;the results under DRA, especially how the ratings in the 1999-2001 data set compare with UZR and DM, will instill confidence in the system and pique interest in learning more about it, and

(iii)&#9;the imperfection of the output will reassure the skeptics among you that the ratings were not just cooked up.

Should it be the case that DRA is of more interest to fans than major league organizations, I would welcome the opportunity to publish the DRA equations in a book that rates the greatest fielders throughout major league history.

Were I to reveal the DRA equations here, this article could be condensed to about ten pages, and anyone who reads Bill James, Pete Palmer, Dick Cramer or TangoTiger would "get it" and endorse it immediately.  However, I'm trying to make the case for DRA indirectly, so I have to bring more subtle and lengthier arguments to bear.  As I'm using statistical techniques and approaches never tried before, some amount of background information about them is in order.  Most importantly, I'm ultimately relying on the output of the system to make the case for the system, and discussing that output is a lengthy task.

* * *

ACKNOWLEDGEMENTS

Before we go any further, I'd like to thank several people.

Dick Cramer reviewed several earlier versions of DRA, and was the first person to appreciate the underlying logic of the method.  Without his encouragement, I doubt I would have had the patience to keep working on DRA.  In addition, Dick's research supports certain key assumptions under DRA.  Finally, as a founder of STATS, Inc., Dick made possible the compilation of the data against which DRA could be tested.  I'd like to thank Neal Traven at SABR for forwarding my first article about DRA to Dick.

Chris Dial, Mike Emeigh, Mitchell Lichtman ("MGL"), Charlie Saeger, and Tom ("TangoTiger") at baseballprimer.com have all provided key information and insights.  Charlie developed years before anyone else many fundamental ideas for fielding evaluation.  Throughout this article, I frequently use Charlie's terminology of "context-adjusted" fielding plays, although our systems are different.  MGL has provided an invaluable service to the sabermetric community by publishing the results and basic methodology of his UZR system.  Mike Emeigh's fielding articles helped me see a way of improving UZR, and MGL graciously accepted the suggested changes, which Chris and TangoTiger had previously suggested as well.  Although there will be times when I will point out certain UZR ratings that might still be a little "off", UZR is fundamentally reliable, DRA isn't perfect either, and without the work and insights of Chris, Mike, MGL and Tango, there wouldn't be a system against which I could test DRA.  Tango also provided team-level UZR data that helped me make the most recent and very significant improvement in DRA.  I'd also like to thank whoever is or was responsible for putting together the baseball-reference.com website, which is extremely well-designed.

I owe an enormous debt to all of the people who have contributed to Retrosheet, as well as to John Jarvis, both for his creative articles and for putting together team-level Retrosheet data in an easy-to-use format on his website, which was the primary source of data for DRA.  As will be explained in Part IV.C, DRA would not have been possible without Retrosheet data, but can nevertheless be applied to seasons in baseball history for which we as yet do not have such data.

Pete Palmer is the class act of sabermetrics, and has been an invaluable help to me in understanding the quality and scope of statistics throughout baseball history.

Steve Pappaterra is a great and good friend who first introduced me to the work of Bill James twenty years ago.

Bill James wrote me a kind and encouraging e-mail when I sent to him an over-long and under-organized letter in which I tried to describe an early version of the DRA method.  Although DRA differs significantly from Bill's defensive Win Shares system, I was inspired to work on this project after reading his Win Shares book.  Moreover, I would never have been able to develop DRA had I not learned from Bill that ".  .  .  fielding statistics make vastly more sense if you look at them from the top down than they do if you look at them from the bottom up. * * *  To make sense of fielding statistics, sometimes you have to start with what the team accomplished, then ask how they accomplished that, and only then work toward the question of which player gets credit for that success."  Win Shares, p. 11.  DRA is founded upon the basic principle, introduced in Win Shares, that everything has to add up.

Michael Humphreys Posted: November 04, 2003 at 06:00 AM
1. MGL Posted: November 05, 2003 at 03:55 AM (#613826)
My appetite is whetted as well! Looking forward to the articles...
2. The Other Kurt Posted: November 05, 2003 at 03:55 AM (#613827)
Sounds great. I cannot wait to see!

The fact that you won't reveal the underlying equations brings to mind a dichotomy I have been mulling over recently, that of collaboration vs. corporation. While I hope you are well rewarded for your work (assuming it is as easy and accurate as you claim), at the same time, I wish DRA would be released to the public (the SABR community in this case) in the interest of growing baseball knowledge and understanding. What use is a "simple" defensive rating system if it is just as inaccessible to "laymen" as the complicated systems?

That said, I honestly wish you the best of luck selling the idea. Thanks Michael, and I'll be on the edge of my seat waiting for the rest of the articles.
3. Michael Humphreys Posted: November 05, 2003 at 03:55 AM (#613833)
Guys,

Thanks for your interest. I've been told by Dick Cramer and Michael Lewis that I'll be facing quite a challenge in selling the idea to a team, so I expect that I will turn it into a book about the all-time greatest fielders that will also provide the formulas and explain their derivation. I have been in contact with a number of authors of sabermetric books and have already begun to put together a book proposal with sample chapters. My goal would be to provide fans with something that Bill James alluded to in a recent Internet "chat" (copied below): "a simple system to summarize fielding ability".

"Alpharetta, GA: I'm surprised there is not a composite scorecard-like statistic that better measures a defensive player's ability. Perhaps it would have to be altered by position (i.e. range is more important at SS than at 1B). Talk about what has kept this from happening and when it might happen.

"Bill James: It's a personal failure. No, seriously, it's a complicated issue, the factor you cited--we have to have different standards everywhere--is one of the complications. But still. . .I made mistakes, in trying to think through this issue 25 years ago, which have contributed to the fact that we're not further ahead than we are. And John Dewan and I, when we were running STATS Inc., should have sat down and done exactly what you said: come up with a simple system to summarize fielding ability. But John had one way to approach the problem and I had a different one, and we had a million other projects to work on, and we just never got it done."

I'm not exactly sure how Primer will split up the article, or how frequently the installments will be posted, but the following is a summary of what's to come:

Part I provides an overview of the principles and methodology of DRA, including many new insights about Defense Independent Pitching Statistics ("DIPS") (the relative impact of pitchers and fielders on whether batted balls fall in as hits). Part II compares 1999-2001 DRA results with zone-based UZR ratings and Diamond Mind evaluations. Part III provides historical DRA results from 1974-2001. Part IV addresses various miscellaneous issues, including how DRA ratings can complement and improve zone ratings, the practical relevance of applying DRA to evaluate minor league fielders, DRA's role in stirring and settling various Hall of Fame debates, how and why DRA can adapt to changing pitching and fielding dynamics and record-keeping over the course of major league history, as well as how DRA ratings (combined with Linear Weights batting and baserunning ratings) could be converted into replacement-level Win Shares and Loss Shares.

4. User unknown in local recipient table (Craig B) Posted: November 05, 2003 at 03:55 AM (#613835)
I agree, this is going to be awesome. What is so exciting about the new methods of rating fielders (Mike Emeigh's work, ZR, UZR, DM, Win Shares to a lesser extent) is that they do seem to be somewhat more in agreement than measures of the past were, indicating that we are really on to something, finally, after all this time.
5. Andrew Edwards Posted: November 05, 2003 at 03:55 AM (#613842)
Awesome, I'm very excited. It sounds like you're coming at it exactly the right way - I'll be really interested to read the next installment.

Also, if it ends up being a book, let us know - I'll want to buy it for sure.
6. studes Posted: November 05, 2003 at 03:55 AM (#613843)
This is the greatest appetite-whetter in Primer history! Sounds wonderful, Michael.

Add me to the list of people who fervently hope you go the book route and who promise to buy (at least) one copy.
7. bob mong Posted: November 05, 2003 at 03:55 AM (#613849)
Fantastic beginning! I, like everyone else, eagerly await what is to come.
8. Michael Humphreys Posted: November 05, 2003 at 03:55 AM (#613850)
Thanks again for the kind words.
9. strong silence Posted: November 06, 2003 at 03:55 AM (#613857)
Reel me in!! I'm hooked!

I am curious and look forward to reading the articles.

I am skeptical, as I have come to feel that the only way to evaluate defensive performance is with a system no one has talked about, as far as I know.

My proposed system could resolve much of the disagreement I see about defensive performance. For example, James Click in his three wonderful essays at Baseball Prospectus ("ATAD Here or There") comes to the conclusion that the Houston Astros are the best fielding team. As I write this essay the Gold Glove winners have been announced and the Astros had zero Gold Glove winners. Meanwhile, the Mariners had four Gold Glovers but were ranked below average using Click's system.

What I propose is a modified-ZR system.

Indulge me as I am fairly new and have not given a lot of thought to this. This system that I propose here uses an official scorer (independent and not affiliated or loyal to a team) to make a judgment on a fielding play. For example, if an OF snags a line drive with the bases loaded that an average OF would not have caught, the OF is credited with 3 runs saved (or 2) by the official scorer.

I realize that there are problems with this system:

1. The scorer would have to know what an average fielder would have done.

2. The scorer would have to make a judgment. And we all know what controversy that can elicit.

3. The collection of the data presents a problem

But it also has advantages:

1. This system does not rely on estimates or macro-level factors such as pitching or park. For example, we don't need to know whether a pitching staff is a groundball or fly ball staff because it doesn't matter. No credit is given to the pitcher in this proposed system because the actual play made by the fielder is compared to league average. It uses the actual play made by fielder as its only data. My main criticism of a system such as Win Shares is that it chooses an arbitrary number for the weight given the pitching. We don't need to determine the percentage of fielding success given the pitching staff because it doesn't matter when you look at the actual play.

Back to my original example of a line drive caught by an OF. How can one ever know if the pitcher "held" a hitter who might otherwise have hit a home run on the pitch or whether the pitcher made a mistake because normally the ball would have been a ground ball single?

2. The scorer can determine if the player was positioned correctly by the coaching staff. With this, the scorer may be able to judge that the coaching staff made the error if the ball was not caught.

3. The scorer can determine the effect of weather (sun, rain, etc.) and fielding conditions on the player. I am reminded of Jose Cruz Jr. slipping on the grass on the ball he missed at PacBell in the Marlins series. In my view, it appeared the grass had gotten wet from the field crew watering the warning track because when you view the soil in the skid mark it was dark. Thus, the wet field caused the slip and resulting error.

4. The biggest advantage in my view is that this system gives us the information all of us really want to know -- how many runs were prevented by each fielder. As you know putouts and assists don't tell the whole story. The relationship of put outs and assists to runs prevented is unknown. But the actual play made to runs prevented is almost perfectly known.

This is the point that I may try to prove one day -- It is only by knowing the actual game situation and seeing the actual play will we know how many runs were prevented. Then we can evaluate defensive performance.
10. tangotiger Posted: November 06, 2003 at 03:55 AM (#613858)
If you look back to some of the large presentations at Primer, you will find that the best commentary is generated with multi-day presentations.
11. tangotiger Posted: November 06, 2003 at 03:55 AM (#613859)
"It is only by knowing the actual game situation and seeing the actual play will we know how many runs were prevented."

But why not try to quantify all this? Put a god-damn tracker on every player's belt buckle, and you know exactly where he was. Put the "FoxPuckTrax" in the baseball, so you know exactly where the baseball is at all times. You can probably even measure the spin of the ball. Measure the rainfall every minute, and measure how much rain gets absorbed by each surface. Measure where the sun is, and how much cloud covering there is. We already know what the inning/score/base/out situations are.

You don't have to see the game to make these determinations (though you need to see the game to appreciate the game itself). You can establish the exact context of everything by technology.

How much will this cost? I figure you can take Rey Ordonez' $16 million contract for 4 years and take that money and send Rey Rey to the minors, and have a perfect system. There is so much wasted money on misevaluating player performance, that this system will pay for itself.
12. studes Posted: November 06, 2003 at 03:55 AM (#613861)
strongsilence, I'm not impressed with Click's work. All he's doing is documenting regression to the mean. Astro fielders were bound to look good in his system, because Astro pitchers had unfavorable DER variances in the past, and they regressed toward the mean. I believe the two best teams in his system were the two with the worst expectations beforehand.
13. tangotiger Posted: November 06, 2003 at 03:55 AM (#613865)
For instance, if the official scorer thinks that only 10% of outfielders would be able to make that catch and save 2 runs, the OF gets credit for 90% of the 2 runs, or 1.8.

This is the concept behind UZR.
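
As a rough sketch of that arithmetic in Python: the credit for a play made is the share of fielders who would NOT have made it, times the runs at stake. The debit for a play not made (the share who WOULD have made it, times the runs at stake) is my assumption of the symmetric convention, not something spelled out above, and the function name is hypothetical.

```python
def play_credit(prob_average_makes_play, runs_at_stake, play_was_made):
    """Runs credited (+) or debited (-) to a fielder for a single batted ball.

    prob_average_makes_play: estimated share of fielders who make this play
    runs_at_stake:           runs saved if the play is made
    play_was_made:           True if the fielder made the play
    """
    if play_was_made:
        # Credit only for the part an average fielder would not have delivered.
        return (1.0 - prob_average_makes_play) * runs_at_stake
    # Assumed symmetric debit: the part an average fielder would have delivered.
    return -prob_average_makes_play * runs_at_stake


# The example above: 10% of outfielders make the catch, 2 runs at stake.
print(round(play_credit(0.10, 2.0, True), 2))   # 1.8
```

Under this convention a perfectly average fielder nets out to zero over many such plays, which is what a rating "relative to league average" requires.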
14. Michael Humphreys Posted: November 07, 2003 at 03:55 AM (#613868)
Gerry and Rally Monkey,

Yes, the article will have ratings for Jeter, Concepcion and Ozzie Smith. The article will show the DRA ratings for all players who played at least five full-time (130+ games) seasons at a position anytime between 1974-2001. An Excel spreadsheet will be attached that shows the ratings per season. Ozzie was the best, overall, in the 1974-2001 sample; Jeter was the worst shortstop. Concepcion was the second best shortstop, and I suggest in the article that he should get more consideration for the Hall of Fame.

Ozzie's single season ratings (in chronological order, with gaps for seasons of less-than-130-game play):

19, 13, 27, 13 (1981 strike), 31, 13, [ ], 18, 10, 20, 20, 10, -11, -13, 9, 12.

Dave's:

27, 38, 23, 13, 5, 18, -7, 11 (1981 strike), 4, 4, -9.

Jeter's:

-3, -13, -9, -21, -29, -26.
15. Charles Saeger Posted: November 07, 2003 at 03:55 AM (#613870)
There will be three parts? I didn't think there was that much stuff.

Incidentally, I can independently verify the exceptionalness of Dave Concepcion's fielding using two different methods than Mike's. Concepcion's 1975 season is probably one of the greatest defensive seasons ever.

As for Ozzie ... Mike may well be overrating Ozzie's 1980 season. The Padres were a bad team with a groundball pitching staff. When I do a simple SS assists as a percentage of team A+H-HR (my quick-and-dirty measure), he comes out a couple of plays worse than the man for whom he would be traded, Garry Templeton. Though Ozzie's 1980 was good, as was his unlisted 1984.
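
A sketch of that quick-and-dirty measure in Python, for anyone who wants to try it. I am assuming here that "A+H-HR" means team assists plus hits allowed minus home runs allowed; the function names and the sample numbers are made up for illustration and are not the actual 1980 totals.

```python
def ss_assist_share(ss_assists, team_assists, hits_allowed, hr_allowed):
    """Shortstop assists as a share of team A + H - HR (assumed definition)."""
    opportunities = team_assists + hits_allowed - hr_allowed
    return ss_assists / opportunities


def plays_vs_baseline(ss_assists, team_assists, hits_allowed, hr_allowed, baseline_share):
    """Plays above (+) or below (-) what a shortstop with baseline_share would record."""
    opportunities = team_assists + hits_allowed - hr_allowed
    return ss_assists - baseline_share * opportunities


# Made-up totals for two hypothetical shortstops and their teams.
share_a = ss_assist_share(520, 1900, 1500, 100)
share_b = ss_assist_share(500, 1800, 1450, 110)

# How many plays shortstop A is ahead of (or behind) shortstop B's rate.
diff = plays_vs_baseline(520, 1900, 1500, 100, share_b)
print(round(share_a, 4), round(share_b, 4), round(diff, 1))
```

Because the denominator grows with a groundball staff's assists, the measure gives a crude adjustment for team context without any play-by-play data.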
16. Michael Humphreys Posted: November 08, 2003 at 03:55 AM (#613871)
Charlie,

Thanks for the comment. The article, as originally written, consists of an Introduction, four Parts, and a Conclusion. I believe that Primer will post the four Parts and Conclusion in two installments, probably Part I and II on Monday, and then Parts III, IV and the Conclusion the following Monday.

I'm glad you also have independent reasons to support Concepcion's 1975 rating, which I would have normally considered to be a little too good to be true. Regarding Ozzie's +27 rating in 1980 with the Padres, yes, it could be slightly too high.

All of the ratings are only estimates, but they're estimates based solely on statistically significant relationships (determined using regression analysis) between traditional, publicly available pitching and fielding statistics and the actual number of runs allowed by a team. That, in a nutshell, is what DRA is about, and what makes DRA different from other non-PBP or non-zone systems.
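
To give a feel for what such a team-level regression looks like, here is a generic least-squares sketch in plain Python. It is not the DRA model (those equations are not given here): the two predictors, the five team-seasons, and every number are invented, and fit_ols is simply a textbook solution of the normal equations.

```python
def fit_ols(rows, targets):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    X = [[1.0] + list(r) for r in rows]          # prepend intercept column
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Invented team-seasons: (strikeouts vs. league average, fielding plays vs. expected).
teams = [(40, -10), (-25, 15), (10, 30), (-30, -20), (5, -5)]
runs_allowed = [695, 698, 684, 717, 702]         # made-up season totals

intercept, per_strikeout, per_play = fit_ols(teams, runs_allowed)
print(round(intercept, 1), round(per_strikeout, 3), round(per_play, 3))
```

Each fitted coefficient is read as the number of runs allowed associated with one more unit of that statistic, which is what lets team-level results be carried down to individual pitchers and fielders as "linear weights".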

Looking forward to hearing more of your comments, especially when the historical ratings are posted. You probably have a better sense of what they should be than anyone else.

Thanks again.
17. Michael Humphreys Posted: November 09, 2003 at 03:55 AM (#613876)
Strong Silence,

"As you know putouts and assists don't tell the whole story. The relationship of put outs and assists to runs prevented is unknown."

That is precisely what DRA does. We can pick up this theme in the next installments.


