 
Primate Studies — Where BTF's Members Investigate the Grand Old Game

Thursday, November 20, 2003

Defensive Regression Analysis, Part 3: The Results
1999–2001 DRA/UZR/DM Ratings, Position by Position

I will go through the positions in descending order of skill/importance, what Bill James long ago described as the Defensive Spectrum: shortstop, second base, center field, third base, right field, left field and first base. As promised, I will end with a review of the career DRA ratings (through 2001) for I-Rod and Piazza, the best and worst fielding full-time catchers over the past decade or so, as I lack the most up-to-date UZR ratings at catcher. UZR infielder ratings include "DP" ratings; UZR outfielder ratings do not include "Arm" ratings. In Part III, in the context of the discussion of historical outfielder ratings from 1974–2001, I will discuss DRA arm ratings in the outfield. All numerical ratings are denominated in terms of runs saved or allowed relative to a league-average fielder; e.g., +25 means 25 runs "saved"; −12 means 12 runs "allowed".
The "Notes" column addresses examples where DRA and UZR seem to be reaching meaningfully different results. The following "code" of comments applies: "dm=dra" means that DM information strongly supports DRA; "dm~dra" means that DM information is mixed, but on balance, appears to favor DRA over UZR; "?" indicates it is unclear whether DM supports DRA or UZR; "dm~uzr" means that DM information is mixed, but on balance, appears to favor UZR; "dm=uzr" means that DM information strongly supports UZR. The one reference to "dial=dra" refers to an instance in which Chris Dial?s zone rating matches better with DRA. The one reference to "park" refers to a (Fenway) park effect.
DM commentary comes from three separate sources: team essays for 1999 and 2000, which contain capsule summaries of individual player performance, and the "Gold Glove" essay for 2001. DM's webpage does not provide team comments for 2001 or Gold Glove essays for 1999 and 2000. This mix of essays generally does not provide commentary for average or below-average fielders in 2001. The team comments for 1999 and 2000 more than make up for the lack of "Gold Glove" essays for those years.

Shortstop
DRA and UZR basically agree that Aurilia, Clayton, Renteria, Tejada, Deivi Cruz, Nomar, and Alex S. Gonzalez were basically average over the '99–'01 period. DRA and UZR agree that Rey Sanchez was outstanding and that Neifi Perez was pretty good, at least in 2000. DRA and UZR basically agree that Jeter, Guzman and Alex Gonzalez were clearly below average, with the DRA ratings for Guzman being less extreme and the UZR rating for Jeter being less extreme. We could quibble about a few single-season ratings: UZR shows Clayton as a viable Gold Glove candidate in 2001; DM's 2001 Gold Glove Review ("DM GG") does not mention Clayton. On the other hand, DRA shows Renteria as a viable Gold Glove candidate in 2001, and DM GG doesn't mention him either.
The significant differences are over Vizquel, Ordonez, and, possibly, A-Rod. DM GG had this to say about Omar, my nomination for the most overrated fielder in history:
"[Vizquel] was one of three Cleveland infielders to be rewarded with Gold Gloves [in 2000]. But that infield was below the league average in turning ground balls into outs. And according to the STATS Major League Handbook, they were fourth worst in the league in converting double plays when grounders were hit in doubleplay situations. "The bottom line is that somebody isn’t making nearly as many plays as people think . . . . "[In 2001], Cleveland’s infield was 13th in the league in the percentage of ground balls turned into outs. And they were only a hair above the league average in doubleplay percentage. "You could argue that the infield looks bad because the corner guys—Jim Thome at first, Travis Fryman and Russ Branyan at third—don’t cover much ground, and you’d be correct. Problem is, there’s absolutely no evidence that their middle infielders are doing more than their share, either . . . .
"Suffice it to say that Vizquel’s range wasn’t all that good this year." UZR rates Vizquel above average; DRA rates him below average. Rey Ordonez has a historically high UZR rating for 1999: +39. DM does not seem to suggest that Ordonez was having a historically outstanding season at short. "Error totals aren’t usually a good indication of fielding prowess, but the four errors charged against Ordonez were impressive nonetheless." DM says nothing about his range. DRA rates Ordonez?s 1999 season at +10. Regarding ARod, DM seems to take a middle position between the moderately high rating he has under UZR and the barely above average rating he has under DRA. In 2000, DM?s team comment for Seattle describes ARod?s fielding in a manner that supports DRA: "While ARod lacks the great range of some other AL shortstops, he does rate aboveaverage and has very good hands." UZR rates ARod?s 2000 season at +18; DRA rates it +9. DM has nothing to say about ARod in 2001. UZR rates him slightly above average (+8); DRA rates him as slightly below average (6). All in all, DRA appears to have "worked" in evaluating fulltime shortstops during the 19992001 period. Second Base
DRA and UZR basically agree at second base. In its 2000 Florida Marlins comment, DM classifies Luis Castillo among young players with "great speed and defense", so it's probably the case that UZR has measured his fielding better than DRA, though DM does not elaborate at all regarding Castillo's defense, and does not mention Castillo at all in its 2001 Gold Glove review. If DRA has failed to recognize his talent, it's not a talent of significant magnitude. DRA, UZR and DM all agree that Pokey Reese was outstanding and that Adam Kennedy was very good, particularly in 2001.

Center Field

Reader Comments and Retorts
1. Bill Posted: November 20, 2003 at 03:58 AM (#614036)
I hope to do a 2002–03 DRA article, in which I use the 1974–2001 regression weights "out of sample". Perhaps Chavez will show up better then.
Sweet,
Thank YOU for reading the article.
Gilbert,
After I wrote the article, I realized I should have written something about Junior, who has been given a lot of Gold Gloves, and then came across this email thread by MGL:
"Posted 5:46 a.m., June 21, 2003 (#1)  MGL
Griffey was always VERY overrated in CF and slightly overrated as a hitter, and of course his injuries and perhaps lack of conditioning have caused him to "age" faster than he should have."
I said it to you before, and I'll say it again: I watched Greg Gagne about as much as Bill James did, and his assessment is too positive. Your ratings have him right. Not only that, but his backups in Minnesota (usually Al Newman) were generally pretty good.
I hope you have enough Snickers handy...
Brilliant work Michael.
Call me an official skeptic on the magnitude of some of these numbers. I have a hard time believing that Pokey Reese saved 39 runs above average in 1999, or that Chipper cost his team 28 runs in the same year.
I looked at David Pinto's recent fielding analysis, particularly focusing on Mark Ellis. Ellis wasn't too far off from Reese: ZR of .890 (Reese's was an awesome .905), about the same number of innings, and similar peripheral stats. And Pinto says that Oakland second basemen recorded thirty more outs than expected, given where the ball was hit.
I'm guessing that it takes two to three "outs made above average" to save a run. That would mean that Reese would have had to prevent about 100 hits to save that many runs. Oakland second basemen, arguably the best in the majors last year, prevented 30 according to Pinto.
I know that you add in a ton of other factors, so a straight comparison isn't possible. But how is it possible for any second baseman to save his team 39 runs in a single season?
No. Saving an out is worth about 0.8 runs. I had a long discussion/proof on this. I'll see if I can dig it up.
============
Suppose a team with Ozzie at SS gives up on average 12 non-HR hits and 2.6 walks every game (a game, of course, being 27 outs). Applying .50 runs per non-HR hit (I know it should be closer to .55, but I just want to keep it basic), .30 runs per BB, and −.10 runs per out, we get 4.08 runs scored per game. And per game, we see that Ozzie's team faces 41.6 batters (again, let's not worry about DPs, etc.).
Now, let's say Ozzie was traded for Spike, and let's say for every 41.6 batters faced, there is one ball that Ozzie gets to that Spike doesn't. So, for those 41.6 batters, Spike's team records 13 non-HR hits (1 more than Oz), 2.6 walks, and 26 outs (1 less than Oz). However, there's still one more out to go! Since Spike's team gives up 13 non-HR hits per 26 outs, we can estimate that this team will give up 13.5 non-HR hits, 2.7 walks, and 27 outs per game (a total of 43.2 batters, a remarkable 1.6 MORE batters than Oz). Applying our LW constants, we see that Spike's team gives up 4.86 runs per game.
This number is .78 runs MORE than Ozzie's. This is the result of Ozzie getting to one more ball than Spike: .50 runs for the hit and about .30 runs for the out gives you the ~.80 runs.
====================
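Tango's arithmetic above can be verified with a short script. This is just a sketch using his stated simplified run values (+.50 per non-HR hit, +.30 per walk, −.10 per out), not full linear weights:

```python
# Reproduce Tango's Ozzie/Spike example with his simplified run values.
def runs_per_game(hits, walks, outs):
    return 0.50 * hits + 0.30 * walks - 0.10 * outs

# Ozzie's team: 12 non-HR hits, 2.6 walks, 27 outs per game.
ozzie = runs_per_game(12, 2.6, 27)

# Spike records one fewer out per 41.6 batters: 13 hits, 2.6 walks,
# 26 outs. Scale those rates up to a full 27-out game.
scale = 27 / 26
spike = runs_per_game(13 * scale, 2.6 * scale, 27)

print(round(ozzie, 2), round(spike, 2), round(spike - ozzie, 2))
# 4.08 4.86 0.78
```

The .78-run gap per extra ball reached is exactly the figure Tango quotes.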
Thanks, tango.
Does the value of this system (as compared to MGL's UZRs or Pinto's new system) come simply from the fact that it can be used to evaluate the seasons before play-by-play data, or do you think there's more than that?
Charles, thanks for the insight re: Gagne.
Repoz, thanks for the compliment, though I'm not sure I'm getting the "Snickers" comment.
Studes, several points. (1) I hope to apply the weights out of sample for 2002–03 ratings. (2) Pokey's +39 runs could easily be an excessive estimate, but I believe his two-year average is pretty close to UZR. (3) Chipper's −28 rating is almost exactly the same as Chris Dial's zone rating for Chipper. It could be wrong (any estimate can be wrong), but I'm pretty sure the overall 1999–2001 rating for Chipper is right. After all, if he only cost a handful of runs a season, why would you move him to left? And I tend to trust Atlanta's judgment here. They have the best DRA team fielding rating in the '90s and the highest rating this year under David Pinto's model.
Tango, thanks for the runssaved analysis. Spot on.
Tim M, I'd like to do a pitcher rating article (Dick Cramer has made the same suggestion), but it may be some time before I can get to it. DRA is a bit like DIPS, except that (I think) DIPS allocates all BIP outcomes to fielders, whereas DRA allocates estimated infield fly outs to pitchers and the remaining BIP to fielders. Also, DIPS yields a *rate* number ("DIPS" ERA); DRA yields runs-saved numbers. DRA would probably provide a few meaningful adjustments to our all-time ratings for pitchers. I believe it would probably address two issues in Win Shares: the overestimation of pitcher value in the dead-ball era and the underestimation of pitcher value for the real outliers such as Pedro and The Big Unit. That said, I think we all pretty much know who the best pitchers have been. Providing a better estimate of whether Mays' glove enabled him to match Mantle's peak value is something I think a lot of fans feel they don't know and would be interested in.
Joe M., thanks for taking the time with the article. I too was surprised that Manny was OK under DRA. It's certainly possible that he was worse than DRA has determined. It's also possible that he has declined a bit in the past couple of years, which would be consistent with the pattern for almost all outfielders at his age. I agree that he is a bad fielder: I saw him misplay about half a dozen fly balls (without being charged with an error) in two Yankee games this season. I think his Pinto rating this year is not good, but not terrible, either.
J. Cross, DRA's relevance is in (a) providing better historical ratings, (b) evaluating minor-leaguers for whom we lack zone data and (c) providing a "back of the envelope" second opinion for surprising zone ratings. As mentioned in Part IV of the article above, zone ratings have the best data, but it's actually a very big challenge to figure out how best to use it, and sometimes you get ratings that are surprising. When a surprising rating pops up, an analyst could use the DRA rating as a reference point in the course of analyzing, play-by-play, why the zone rating is the way it is. Then the analyst can determine whether it might be appropriate to adjust the zone rating. It's all about trying to answer the same question using *different methods*: UZR, Pinto's probabilistic model, DRA, etc.
I've kept Pete Palmer in the loop regarding DRA, but he said he didn't have the chance to try incorporating the DRA method (as described here) into his latest edition, which will be coming out soon. One possible way of getting DRA before the public would be as part of a future edition of one of Pete's books. We'll see. In the meantime, I hope to provide 2002–03 DRA ratings for Primer readers and will probably reveal all in a book, assuming no major league team is interested.
AED,
If you can point us to any publication (in print or online) that uses regression analysis as fully as it is used under DRA, I think we'd all appreciate knowing about it. Regression analysis *has* been used to make certain ad hoc adjustments for certain defensive statistics; I acknowledge in one of the threads to the second installment that I actually got the idea for DRA from a (presumably) regression-based adjustment under one of Bill James' formulas (see p. 222 in Win Shares). However, I'm not aware that anyone has used regression analysis (i) to find (or attempt to find) *all* of the statistically significant relationships between publicly available pitching and fielding statistics, in order to develop better estimates of context-adjusted fielding plays made, and (ii) to determine the statistically significant weight, in runs, for each context-adjusted pitching and fielding event.
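For readers trying to picture how regressing team runs allowed onto event counts can yield per-event run weights, here is a toy sketch. It is not the author's actual DRA model; the event categories, team totals, and "true" weights are all invented for illustration:

```python
# Toy sketch: regress team runs allowed onto counts of pitching and
# fielding events to recover a run weight for each event.
import numpy as np

rng = np.random.default_rng(0)
n_teams = 200

# Hypothetical per-team event counts: strikeouts, walks allowed,
# infield fly outs, ground outs, outfield fly outs.
means = np.array([1000.0, 550.0, 120.0, 1700.0, 1500.0])
spreads = np.array([80.0, 50.0, 15.0, 100.0, 100.0])
X = rng.normal(means, spreads, size=(n_teams, 5))

# Assumed "true" run weights used to generate fake runs-allowed totals.
true_w = np.array([-0.10, 0.32, -0.11, -0.09, -0.08])
runs_allowed = 700 + X @ true_w + rng.normal(0, 10, size=n_teams)

# Ordinary least squares with an intercept recovers the run weights.
A = np.column_stack([np.ones(n_teams), X])
coef, *_ = np.linalg.lstsq(A, runs_allowed, rcond=None)
print(np.round(coef[1:], 3))  # approximately true_w
```

With enough team-seasons, the fitted coefficients converge on the per-event run values, which is the intuition behind letting the regression determine the weights rather than assigning them subjectively.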
As far as accuracy for shortstop ratings, that's great if your system has a high correlation coefficient with UZR. But that's not the whole story.
First, the sample size here is fifteen. When I was analyzing the results under DRA, I compared the r-squareds for DRA, Win Shares, and Davenport Fielding Translation ("DFT") ratings against updated UZR ratings for all shortstops evaluated in Mike Emeigh's Jeter series (the sample size was possibly somewhat bigger). DRA was significantly better than Win Shares, somewhat better than DFT. Guess who came out best? The completely non-empirical Total Baseball Fielding Linear Weights rating. As Tango says, "Sample size, sample size, sample size."
Second, it might be worth checking the shortstop ratings against the UZR ratings adjusted for Diamond Mind commentary. I don't know what the correlation coefficient for shortstop ratings is, but the overall correlation coefficient at all positions is slightly over 0.8.
Third, almost as important as the correlation is the *coefficient* of the regression result; i.e., getting the "scale" of ratings correct. In Mike Emeigh's "Jeter" sample, Win Shares ratings were too "dampened"; DFTs were about right (as are DRA ratings); Fielding Linear Weights were too "big" (too much spread between high and low ratings). In general, DRA manages to match (almost perfectly) the average "scale" of fielding impact independently determined under UZR.
Fourth, getting ratings approximately correct at *all* positions, including pitcher, and having the ratings add up to the team runs allowed, is also very important. Win Shares does the latter by definition, as does DFT. Fielding Linear Weights does not. DRA doesn't "force" the ratings to add up to team runs allowed, but the DRA estimate of team runs allowed is more accurate than Linear Weights or Runs Created estimates of team runs *scored*.
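The distinction in the third point, between correlation (getting the ordering right) and the regression coefficient (getting the scale right), can be illustrated with made-up ratings for two hypothetical systems scored against the same UZR values:

```python
# All numbers are invented purely to illustrate the correlation/scale point.
import numpy as np

uzr = np.array([-20., -10., -5., 0., 5., 10., 20.])
dampened = 0.4 * uzr                       # right ordering, spread too small
matched = uzr + np.array([1., -2., 0., 2., -1., 1., -1.])  # right scale

def agreement(system):
    r = np.corrcoef(uzr, system)[0, 1]     # ordering: correlation
    slope = np.polyfit(uzr, system, 1)[0]  # scale: regression coefficient
    return round(r, 3), round(slope, 2)

print("dampened:", agreement(dampened))    # perfect r, slope only 0.4
print("matched: ", agreement(matched))     # r near 1 AND slope near 1
```

The "dampened" system correlates perfectly with UZR yet understates every fielder's impact by more than half, which is exactly the Win Shares problem described above.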
I'm sure that Primer readers would be very interested in learning more about your system, and it wouldn't be necessary to reveal all of the details. I seem to recall Mike Emeigh writing that certain key details in the DFT methodology are still proprietary, as is Tom Tippett's system at Diamond Mind. I also believe that Tango has described the basic principles of his Leverage Index for relief pitchers without revealing the mechanics.
Went back and checked the DRA ratings for Manny Ramirez. They stop in 1999, his last 130+ game season in right field. So we're missing the last four years of ratings. His Range Factors as reported here at baseball-reference.com dropped sharply after 1999 and stayed consistently low. (I know, I know, not the most reliable stat.) Nevertheless, it seems he did really decline after 1999, as evidenced by the fact that his teams wouldn't *let* him play full-time in right field.
Though I'm sure it was just a minor factor, I would not be at all surprised if the sabermetrics-savvy Red Sox have figured out that Manny does enough damage in the field to cause his *overall* value to be meaningfully less than commonly perceived.
Studes,
Yes, the run weights for context-adjusted plays made at the various positions do differ, and, I think, teach us something.
The run weights for plays made at second and first were slightly higher than at short and third. I think the reason is that singles (and doubles) on the right side of the diamond have more baserunner advancement value. If you prevent a hit with a higher value, you save more runs.
The run weight at center was consistently lower than in right and left. Two possible reasons. One is that there may be more doubles and triples "prevented/not prevented" down the lines than through the gaps. The other is that a ball *caught* in centerfield probably has a lower value to a defense, because it is likelier that a runner can tag up and advance. Why? Well, centerfield is deeper than right and left. In addition, *assist* rates at *both* corners are always higher than in center, in spite of the fact that centerfielders field many more BIP.
This is another example of how insights from DRA can complement and improve zone ratings. As far as I know, UZR tracks whether a BIP that falls in for a hit turns into a single, double or triple, as well as the average value of an out. But I don't believe that it tracks the different *baserunner advancement* value of the hit prevented or the play made, which *differs* by position. Yet another variable that could be added to a state-of-the-art zone system.
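The baserunner-advancement idea can be sketched with a base-out run expectancy calculation. The RE values and first-to-third rates below are rough assumptions for illustration only, not measured figures:

```python
# Sketch: why a single to the right side can cost the defense more
# than a single to the left, with a runner on first and nobody out.
RE = {
    "runner on 1st, 0 out": 0.85,   # illustrative run-expectancy values
    "1st and 2nd, 0 out": 1.45,
    "1st and 3rd, 0 out": 1.75,
}

def single_cost(p_first_to_third):
    """Run cost of a single with a runner on first, given how often
    the runner advances to third."""
    after = (p_first_to_third * RE["1st and 3rd, 0 out"]
             + (1 - p_first_to_third) * RE["1st and 2nd, 0 out"])
    return after - RE["runner on 1st, 0 out"]

left = single_cost(0.30)   # assumed first-to-third rate on a single to left
right = single_cost(0.55)  # assumed (higher) rate on a single to right
print(round(left, 3), round(right, 3))  # the right-side single costs more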
It hadn't occurred to me that this could be used to evaluate minor league defense. To my mind that's a more important implication than evaluating all-time greats. Of course, it is good to have multiple methods to evaluate defense, since none of them is as yet that dependable.
AED, if we don't get to know how your system works, do we get to know who AED is, or do you, like the system, need to be cloaked in mystery? It's not that ridiculous to think that whoever comes up with the best new tool to evaluate defenses will be hired by an MLB club. It is the next big thing, right?
Thanks. Neither Edmonds nor The Big Hurt played 5 seasons of 130 or more games during the 1974–2001 period of the historical study.
Edmonds played four such seasons: '95 and '98 with Cal/Ana; '00 and '01 with St. Louis. The team ratings in CF for those seasons were +14, +7, 0, and −3. I believe he has battled a fair amount of injuries.
The Big Hurt had only three 130+ seasons: '92, '93, and '96. The ratings are −10, −12, and −4. The first two seasons he played 158 and 150 games; the third season he played 139. His backups probably raised the team rating slightly.
As explained in the article, the ratings at first are based only on context-adjusted assists. I explain in the article why (Saeger/James) Estimated Unassisted Putouts at First Base ("EUPO3") are probably not reliable enough for evaluating *good* or *adequate* first basemen, but *are* useful for evaluating *terrible* first basemen, and I suspect that Frank Thomas' rating would be appropriately reduced below its already poor level if his EUPO3 were calculated.
In addition, neither DRA nor Win Shares addresses the important factor of catching throws from infielders (i.e., preventing "infielder" (throwing) errors).
J. Cross,
Thanks for your support. DRA can, in theory, be applied to minor league stats. The extent to which minor league DRA ratings would "translate" into major league performance is another question. I tend to think they will, at least as well as batting statistics. I think that is the biggest potential benefit of DRA to a major league team, though I also think it is useful as a "back of the envelope" double-check on zone ratings.
I actually do think that good fielding systems (UZR, DRA, Diamond Mind) are actually pretty reliable now. The year-to-year correlation in such ratings for individual fielders is at least as high as the BABIP for *hitters*, and possibly as high as batting average (not sure which). Even if the "persistency" is only the same as BABIP for hitters, people routinely accept the significance of the BABIP component of batting performance without batting an eye (pun intended). It will take time, but fans will eventually accept well-designed fielding ratings, just as they have begun to accept OPS.
I'm not sure what you (MH) mean by "whether UZR incorporates into the value of a hit, baserunner advancement." Remember that UZR is context-neutral: it only cares about the average value of a single, double, etc., including the average baserunning value of those hits. It does not measure a fielder's actual performance vis-à-vis what the baserunner and out state is when a ball is hit. Does DRA account for this? If yes, then they are fundamentally different, although the differences will diminish with larger samples.
I am confused as to exactly how DRA can "add" to a UZR rating when that UZR rating is questionable for whatever reasons. I am not implying that it can't. It's just that, as I'm sure you know, it is hard to conceptualize a system that is based on a multiple regression analysis. In fact, it is hard to intuitively grasp how it can actually work and how it can be so precise.
While RF and third base UZR ratings may have the lowest year-to-year correlations (and hence, may be the most "unreliable" UZR ratings), I think that by far and away the least "accurate" are the OF ratings in general, and CF in particular, for obvious reasons. The principal reason is that many, if not most, fly balls in the OF are easily caught, and many of them can be caught by more than one OF'er (and CF bears the biggest brunt of that problem), as well as by an infielder. UZR ratings in the IF are very straightforward, I believe.
Does DRA "pick up" the valu eof the first baseman taking throws from the IF'ers? UZR does not of course. Also, I don't know if you have them (if I made them publicy available), but it is useful, for various reasons, to have the breakdowns in UZR: the "error" portion of UZR as well as the "range" portion of UZR. I think you said you did not use the "arm" potion of UZR for outfielders. Did you uze the DP portion of UZR for IF'ers? Does DRA include arms for OF'ers and Dp's for IF'ers? I assume DRA includes errors as well as range for all fielders and they are lumped together (inextricable?).
Also, my catcher ratings are straightforward and are included in Superlwts. I probably should include them in the UZR ratings, although they have nothing to do with "zones." They include SB/CS, errors, and WPs and PBs I think (maybe not WPs, I'm not sure). Do you have my catcher ratings from my UZR files? What does DRA "pick up" as far as the catcher ratings?
In conclusion, this is one of the greatest innovations in sabermetrics of all time, at least the equivalent of Palmer's linear weights, probably better. Assuming one has the requisite data (but not the zone data) and is able to use the formulas, no one should ever talk about anything but DRA when they are discussing a player's defensive skill and/or performance. Heck, it might even turn out that DRA is better than UZR even if you have the zone data. I am blown away!
I would love to see you do a quick and dirty estimate of how DRA changes as a player ages (you have to use the "delta" approach, of course, to avoid selective sampling), since you can use so much data. My research with limited data points suggests that SS, 2B and CF lose 2 runs per year of age almost from the get-go (an early age), sort of like a player's triples age curve, and that all other positions other than 1B lose 1 run per year (also from the get-go). 1B'men appear to get better with age (1 or 2 runs per year), peak around age 30-something (I forget exactly), and then decline, but there could be much sample error in this assessment.
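The "delta" approach MGL mentions can be sketched as follows: for every player with ratings at consecutive ages, take the year-over-year change, then average the changes at each age. Using matched pairs avoids the selection bias of comparing different pools of players at each age. The player-season ratings below are fabricated for illustration:

```python
# Sketch of the "delta" approach to building an aging curve.
from collections import defaultdict

# (player, age, rating) triples -- fabricated sample data.
ratings = [
    ("A", 25, 8), ("A", 26, 6), ("A", 27, 5),
    ("B", 25, 2), ("B", 26, 1),
    ("C", 26, 10), ("C", 27, 7),
]

by_player = defaultdict(dict)
for player, age, r in ratings:
    by_player[player][age] = r

# Collect each player's consecutive-season changes, keyed by ending age.
deltas = defaultdict(list)
for seasons in by_player.values():
    for age in sorted(seasons):
        if age + 1 in seasons:
            deltas[age + 1].append(seasons[age + 1] - seasons[age])

# Average change at each age = the aging curve.
curve = {age: sum(d) / len(d) for age, d in sorted(deltas.items())}
print(curve)
```

With real DRA season ratings in place of the fabricated triples, `curve` would give the average runs lost (or gained) per year of age at a position.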
Let me try to address your last post in detail.
AED: Michael, the fact that few publish their systems with enough detail to really evaluate them (you and me included) makes it hard to tell. Personally, I can't really imagine what else these systems would be doing except for some sort of corrections to the stats from regression. But who knows, maybe we're the first!
MAH: Though I haven't provided all of the details of DRA, I have provided the basic "idea" behind DRA (which I have never seen in print or online before) as well as several key insights behind DRA (infield fly outs for pitchers). In addition, this article is, as far as I know, the first direct comparison of non-zone ratings with zone ratings at all positions (Mike Emeigh's article only covered results at short). In that sense at least, I've provided more "detail" than *any* other non-zone system has ever provided (including Win Shares, DFT, you name it), enough for people to evaluate the basic plausibility of the system.
I think everyone here is completely in the dark about what your system is. You say that your rating system "actually sound[s] very much like" DRA, and that it is "also based on season stats rather than play-by-play data," but you don't really come out and say what, if any, role regression analysis has in your system. I don't understand your comment about how "these systems" (yours? mine? somebody else's?) could be a "correction to the stats from regression." There *aren't* any fielding "stats from regression" out there to "correct".
AED: "Perhaps I wasn't explicit, but I was comparing the perseason stats rather than the 3year average, so the sample size is much larger (N=36). I have gone ahead and done the comparison with all infield positions. For the singleseason infield stats (N=124), my correlation with UZR is 0.657 and yours is 0.613. So I think the shortstop numbers I initially checked (I had them handy because I had recently checked against the Jeter data Mike Emeigh wrote up a while back) showed a higher correlation with my rating than is representative, but there is still a difference between the two and I still find that we correlate with each other (0.798) way better than either of us correlates with UZR."
MAH: If anyone knowledgeable about sabermetrics had achieved such results, it seems very surprising that they wouldn't have realized just how much better such results were than Win Shares or DFT, and rushed to publish at least a summary description and results. (After all, Bill James and Clay Davenport make a lot of money selling their books.) Or already sold it to a team.
When you say (in your prior post) that you haven't used PBP data, do you mean that you haven't used zone data, or that you haven't used Retrosheet data? As I mention in the discussion of third base ratings, there is Retrosheet (non-zone) data that I could have used that would probably have significantly increased the accuracy of DRA, particularly at third and right field. I didn't because I wanted to use only stats available throughout the history of baseball.
In addition, I might have tried further adjustments to team DRA ratings at all of the positions to *force* them to add up to team runs allowed (as Bill James does with his Runs Created estimates in Win Shares), but I wanted to stay true to the principle of no subjective weights or factors, beyond the disclosed assumptions, or, as David Smyth calls them, "practical approximations".
Are your ratings denominated in *runs*? Or is it a "rate" stat? If I had focused just on providing a better "rate" or "plays made" estimate *without* having to make the system yield run weights through a global regression of runs allowed onto pitcher and fielder plays made, I already know of another method (also not used before) to accomplish this. Although I haven't tested that approach, I'm pretty sure that it would provide a rate stat with even more accuracy, as measured by correlation. But the point of DRA was to provide run estimates at each position that add up to a team runs allowed estimate. Does your system do that?
AED: The DM'd UZRs are a nice concept, but setting the "wrong" UZR values equal to your rating undermines its value as a comparison for your rating system. Since those values are set by definition to equal your own rating, they give a spuriously high correlation.
MAH: The purpose of the UZR/DRA/Diamond Mind comparison was to provide readers with the best information available with which to assess DRA. There were a few UZR ratings that seemed "off" without any reference to DRA ratings. Given that, it made sense to look at the results of another good system to resolve differences between UZR and DRA. When Diamond Mind supported UZR, the "DM'd" UZR rating is left entirely alone. When Diamond Mind supports DRA, I'm not just setting UZR equal to my own rating; I'm effectively setting UZR equal to *Diamond Mind's* rating.
There was a part of the article left out by the editors in which I explained all of this. The key sentence was as follows:
"I?m just consulting one well thoughtout, empirically supported, publicly available set of defensive evaluations, the content of which I must accept on faith (DM), in order to evaluate surprising results under *another* well thoughtout, empirically supported, publicly available set of defensive ratings, the content of which I must accept on faith (UZR)."
AED: I agree that the scale and "adding up" properties are both important. But regardless, UZR is a sufficiently superior system that I never thought there would be much interest in "obsolete" systems using only traditional fielding stats. (You've proven me wrong in that regard!)
MAH: Thanks for acknowledging the "scale" and "adding up" points. I agree that UZR is a terrific system, but there are a lot of baseball fans who buy books (Win Shares/Abstract) because they'd like to know about pre-zone-rating fielders. In addition, I think the article demonstrates that DRA is a good "back of the envelope" system for highlighting certain UZR ratings that are worth a second look.
Again, I'm simply baffled that someone who would go to the trouble of developing such a seemingly accurate system as yours would (a) not have determined (before the DRA article came out) how much more accurate it is than other non-zone systems, and (b) have no idea that it would be of any interest to fans. That, and the fact that (a) your "regression" point above doesn't make sense, (b) you're still an anonymous poster, (c) you haven't begun to describe even the basic approach of your system and (d) you haven't provided your ratings, very much makes me wonder. As I mention in the Introduction to the article, one of the key principles of DRA is that "everything has to add up." Something doesn't add up here.
You're right. It was Dwight's longevity that brought his overall rating down a little. Here are his 130+ game season ratings:
+13, [gap for 1977], +11, +18, +4, +15, −5, [gap for 1983: was he hurt?], −7, −2, −4.
Yes, one of the things I'm happy with is that DRA yields very few weird single-season results. When the Appendix of 1974–2001 single-season ratings is posted, you'll see further evidence of this.
Jason,
Haven't had the opportunity to run DRA on minor league data and test predictability of major league performance. I think it will work, because minor league BABIP is fairly consistent (though still lower than major league BABIP), and because hitting can be projected to some degree.
Thanks! It has been a lot of work, and appreciation from leading analysts such as you is much appreciated in return. As I hope I've made clear, DRA will never displace UZR; we need UZR to capture the many factors that can't be tracked using traditional data.
Going through your many excellent questions:
(1) Baserunner advancement value. I think what UZR does is track the singles, doubles and triples "allowed" by a fielder in his zones, as well as the outs recorded by a fielder in his zones. I may be misunderstanding the UZR method, but in translating hits allowed/outs created into *runs*, the *average* value of a single / double / triple / out *anywhere* in the field is used. What I was suggesting is that a single hit through the hole on the right side of the field is probably slightly more damaging to a defense than a single hit through the hole on the left side, because of the greater likelihood that a runner can go from first to third. The only reason this issue even occurred to me is that the average value of a play made on the right side of the field has a slightly (just slightly) higher regression weight than a play made on the left. Similarly, the average DRA value of an out recorded in center field might be less than in the corners because baserunners are more likely to advance. To consider an extreme situation, a very deep high fly to center field in a zone that a good fielder might reach but a poor fielder might not would cause baserunners to "hold back" in anticipation of the ball being caught (thus decreasing the *negative* impact if the ball "falls in"), but also to be able to tag up (and advance) if the ball *is* caught (thus decreasing the *positive* impact if the ball is caught). The "spread" of impact per potential play (the "stakes", as it were) might be lower, thus explaining why regression weights are lower in center than in the corners.
(2) How DRA can complement UZR ratings. In the absence of reliable non-zone systems, it is difficult to assess very high or low UZR ratings. DRA provides a simple alternative rating. If the two ratings differ, it might suggest that it's worth double-checking how the many factors tracked under UZR impacted the rating. Maybe, after such a second look, the park factor, or the interaction between outfielders (shared zones), or something else might be worth adjusting. Or it might be the case that the UZR rating, after such a close second look, appears sound, in which case the DRA rating is probably incorrect, as it is based on limited data.
(3) How DRA works. Maybe the way to think about it is that defense is the mirror (a very imperfect mirror) of offense. Regression analysis provides good estimates of the value of "positive" events (singles, doubles (well . . .), triples, home runs, etc.). DRA provides estimates of the value of "negative" events: each play made is a hit prevented. DRA uses regression analysis to find how things such as left-handed pitching, GB/FB pitching, baserunners, etc., impact plays made at each position, after an adjustment is already made for BIP. DRA then regresses runs allowed onto *all* context-adjusted plays made (including pitcher "plays made" such as SO, BB, HR, and estimated infield fly outs) to find the average value, in runs, of each such context-adjusted play made. Now there are a number of techniques not described in the article that make the system work, which may explain why it's hard to see exactly how the system can be so precise. But all of these techniques are also simple and theoretically sound. I *greatly* appreciate that readers have been willing to consider the merits of the system without the benefit of all of the details. If I ever write the book (and I do think that is much more likely to happen than my doing something with a major league team), I'm sure that people will be pleasantly surprised by the precise mechanics.
(4) UZR ratings. As I mention in the article, the fact that RF and third base UZR ratings had the lowest year-to-year correlations might just be the result of the small sample. I agree that outfield ratings are probably the hardest to get right, because there are many more outfield plays that could be made by two outfielders. UZR infield ratings are good, and are in most cases better at third base in the 1999-2001 study than DRA ratings.
(5) Specific ratings points. DRA doesn?t capture the ability of first basemen to prevent throwing errors, as mentioned in the survey of historical first base ratings.
Maybe we should talk about the need to track errors and plays not made separately. Tango's run weight for getting on base on an error is only .02 runs greater than a single, so I have a hard time seeing the need to track errors if you have a complete UZR record of plays made and total opportunities. Errors (except at pitcher and right field) had no statistically significant relationship with runs allowed (per the global DRA regression) *after* taking into account context-adjusted plays made, so they are *not* included in ratings. This makes sense, to me at least, because errors are *already* accounted for in terms of reduced context-adjusted plays made. Errors are, for the most part, a duplicative stat. However, outfield throwing errors are probably worse than not making the throw at all.
I do use the DP portion of UZR for infielders. I didn't use outfielder arm ratings in the 1999-2001 survey because I thought (though I never checked it out) that they would likely add too much "noise", as the ratio of standard deviation to mean for outfield assists is enormous compared with outfield putouts and infield assists. Outfielder arm ratings are included in the historical ratings, and don't appear to make much difference, except for Barfield.
I didn't have your catcher ratings in the UZR files forwarded to me from Tango. DRA catcher ratings shown in the article are based only on context-adjusted assists, stolen bases allowed, and WP/PB/BK. They don't pick up anything that isn't already better measured under UZR and Tango's system. I just wanted to show how accurate ratings could potentially be for pre-1970 ratings.
(6) Effect of age on DRA ratings. When the Appendix is released, it will show the single-season ratings for players in 1974-2001. Whether that sample of players will be large enough to do even a quick-and-dirty aging analysis, I don't know. At a glance, however, you can see that fielders decline with age, but a fair number of infielders seem to maintain their value, and, yes, more than a few first basemen seem to get better with age, though that might be because they get too lazy and make more Buckner elections.
Many thanks again.
MGL: Essentially, MAH is saying to use the linear weight value by zone. That is, what's the change in run expectancy on 1b/2b to 9-deep, or 7-short, or 78M, etc.? We obviously know that an IF single is worth far less than an OF single. Well, how about a LF single compared to a RF single?
It's kind of trivial to figure out, if you know how often a runner goes from 1b to 3b on a single. For example, it's about 30-35% of the time overall. I'll guess it's 20% to LF and 50% to RF. So, a single to LF is worth:
.15 less than average times
30% of the time that a runner is on 1B
or about .045 runs less than average.
So, if an OF single is worth .50 runs, then a LF single is worth about .45 runs.
For an IF single, it would be:
.30 less than average (i.e., no runner goes to 3b on an IF single) times
30% of the time that a runner is on 1b
or .09 runs less than average.
So, an IF single is worth .41 runs.
In any case, going back to the OF, if we are talking about, say, 100 singles to LF, then we would be off by about 4 or 5 runs by using an average run value for the OF, as opposed to one specific to LF.
You would gain further accuracy by figuring out where in LF the ball was hit, and what the lwt run value is.
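To make the arithmetic above concrete, here is a tiny sketch in Python. Every number in it (the .50 average single, the 30% runner-on-first rate, the advancement penalties) is just the illustrative guess from the comment above, not a measured value:

```python
# Sketch of MGL's zone-specific single values. All numbers are the
# illustrative guesses from the post above, not measured values.

AVG_OF_SINGLE = 0.50   # assumed average run value of an outfield single
P_RUNNER_ON_1B = 0.30  # share of singles hit with a runner on first

# Runs lost (relative to the average single, runner on 1b) reflecting how
# often that runner reaches third base; no runner reaches 3b on an IF single.
ADV_PENALTY = {"LF": 0.15, "IF": 0.30}

def single_value(zone):
    """Run value of a single to the given zone (unknown zone -> average)."""
    return AVG_OF_SINGLE - ADV_PENALTY.get(zone, 0.0) * P_RUNNER_ON_1B

# single_value("LF") -> ~0.455 (rounded to .45 in the post)
# single_value("IF") -> ~0.41
```

The same structure extends naturally to RF (where the penalty would be negative, since runners advance *more* often than average) or to finer-grained zones.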
The reason why it is helpful to separate "errors" UZR and "range" UZR is that, one, it gives a person a better notion of a fielder's defense, e.g., "he has good range but unsteady hands," etc. Two, if you want to incorporate or augment a UZR rating with, say, a subjective evaluation, like one from DM, it is also helpful to know what agrees with what. In other words, if DM says that Ordonez has only slightly above average range at SS, and Ordonez's UZR is +30 runs, it might seem as if UZR is wrong. If, however, those +30 runs are +20 in "errors" (he made very few errors) and +10 in range, then the +30 sounds more reasonable. Finally, it is possible that errors and range have different skill/luck components, such that in taking a sample UZR (or DRA) and translating it into a UZR projection or estimate of true defensive talent, one might need to use a different regression coefficient for the error component than for the range component of UZR (or whatever metric). My suspicion is that there is more of a "luck" component in a fielder's error rate, thus a "higher" regression. An anecdote illustrating this notion is R. Ordonez's UZR jumping all over the place (his error rate is the one jumping; his range UZR stays fairly constant).
Last question. Do you think that deriving the linear coefficients in your as-yet-unseen formula (I assume it is similar to Palmer's offensive linear weights formula, with more terms) from empirical data or from simulations would yield more accurate results? In fact, isn't that usually the case when there are some cross-correlations in the variables? For example, Palmer supposedly used a simulation to come up with his coefficients (the values of each of the offensive events). Jarvis uses a regression analysis, and others, including myself, use an empirical analysis, using the RE tables to calculate average changes in state. Others use a Markov chain model. It appears to me that a regression analysis like Jarvis's yields the least accurate coefficients for various reasons, one of which is the problem of cross-correlation and nonlinearity of the regression lines, as you alluded to in your first installment, I think...
Right now all I have to evaluate you on are Defensive Win Shares, and Ozzie beats you at short, at least on a career basis. But you were without any doubt one of the top five players of all time, and I agree that you certainly outshone your peers more than Mike (Schmidt) did, and would have acquitted yourself quite well today if you had the same lifelong dietary, medical and conditioning advantages that today's players do.
Depot,
Again, the part of the article that explained why I consulted Diamond Mind was deleted by the editor. In brief, it's the old idea of "two heads are better than one", and Diamond Mind is, I think, probably the best evaluator of fielding excluding UZR. As I mentioned in the deleted part of the article (and repeated to a prior poster):
"I?m just consulting one well thoughtout, empirically supported, publicly available set of defensive evaluations, the content of which I must accept on faith (DM), in order to evaluate surprising results under *another* well thoughtout, empirically supported, publicly available set of defensive ratings, the content of which I must accept on faith (UZR)."
Tango,
Thanks for the comment. I agree that weighting hits by zone won't make a huge difference, but the idea is a good example of how DRA can yield new insights that can improve zone ratings.
MGL,
I've posted my email address above. Unfortunately, I won't have time to do more catcher ratings anytime soon. Take a look at the IRod / Piazza comparison and let me know what you think.
The regression coefficients under DRA could be subject to slight distortion from cross-correlation, but as explained in the second installment, the adjustments I've made have reduced cross-correlations to very low levels: never more than .2 (except for one stat not impacting ratings) and usually .1. Unadjusted pitching and fielding stats often had correlations above .6. The standard errors in the run weight regression coefficients were all less than .03.
The format of the DRA equations for infielders is the following:
[run weight] * [Assists +/- BIP adjustment +/- (regression weight)*LHP adjustment +/- (regression weight)*GB/FB adjustment +/- (regression weight)*baserunner variable adjustment]. The resulting rating is centered to the league average to yield runs saved/allowed relative to the league average. The LHP, GB/FB and baserunner variables all come from publicly available data and involve simple add/subtract/multiply/divide arithmetic formulas.
For outfielders:
[run weight] * [Putouts +/- BIP adjustment +/- (regression weight)*LHP adjustment +/- (regression weight)*GB/FB adjustment] + [run weight] * [Assists]. The resulting ratings for putouts and assists are each centered to the league average to yield runs saved/allowed relative to the league average.
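For readers who find the bracket notation hard to parse, here is a rough Python sketch of the infielder format. The article does not publish the fitted coefficients, so every weight below is a made-up placeholder, and the signs on the adjustments are my assumption about how the "+/-" terms resolve:

```python
# A rough sketch of the infielder DRA rating format described above.
# Every coefficient is a made-up placeholder (the article does not
# publish the fitted run weights or regression weights), and the signs
# on the adjustments are assumptions.

def dra_infield_rating(assists, bip_adj, lhp_adj, gbfb_adj, runner_adj,
                       run_weight, w_lhp, w_gbfb, w_runner,
                       league_avg_runs):
    """Runs saved (+) or allowed (-) relative to a league-average fielder."""
    context_adjusted_assists = (assists
                                - bip_adj
                                - w_lhp * lhp_adj
                                - w_gbfb * gbfb_adj
                                - w_runner * runner_adj)
    raw_runs = run_weight * context_adjusted_assists
    # Center to the league average to get runs saved/allowed:
    return raw_runs - league_avg_runs

# Hypothetical illustration only:
rating = dra_infield_rating(assists=450, bip_adj=10, lhp_adj=5,
                            gbfb_adj=8, runner_adj=3,
                            run_weight=0.75, w_lhp=0.4, w_gbfb=0.5,
                            w_runner=0.3, league_avg_runs=320.0)
```

The outfielder format is the same shape, with putouts in place of assists, no baserunner term, and a separately centered assists term added on.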
DRA uses regression analysis precisely because change-in-state Markov models can't be constructed for pre-zone periods.
Change-in-state models are much easier for offense, because the number of categories of events *triggering* the change in state is so small: walk, single, double, triple, home run, stolen base, etc. But the number of separate BIP categories is enormous (potentially in the hundreds) and sample size issues become a problem.
One of the things I've never quite understood is how Pete ran his simulations and why they should be used rather than empirical data. To use offense as an example, shouldn't one simply look at the sample of all base-out situations before and after every double we have a record of, and then calculate the average expected value based on the average change in base-out state?
By looking only at the change in the base-out state (and *not* the average number of runs actually scored after the double) I think you preserve the Markov property. This gets a little tricky, as I mention in the article, when we evaluate steals of second base. I assume that Pete and others just look at the simple, plain change in state if one successfully steals second or is thrown out. But basestealing is *not* a Markov process: stolen base attempts are *not* independent of the preceding and succeeding states in the way that *non-elective* hits, strikeouts and unintentional BB are. Teams are more likely to steal when they are ahead (and vice versa), and being ahead is significantly associated with hitting better, and better-hitting teams will likely drive the runner home anyway to some non-trivial extent. I theorize in the article that this may be why John Jarvis and I get weights for stolen bases of about half the UZR weight.
Getting back to a somewhat less theoretical level, developing a change-in-state defensive model is sorta like David Pinto's *probability* model, which tracks the probability of out conversion given the parameters (direction, trajectory, speed, batter and pitcher handedness) of every ball.
Here you would have to determine Run Expectation, not Out Probability. The run expectation would take into account, among other things, where a single was hit. Calculating the run expectation would involve collecting all the data (perhaps there is enough UZR data now) to get a good sample of changes in base-out "states" before and after each category of BIP.
As I understand it, UZR is sorta like a change-in-state model already; perhaps you could explain how that is. I admit that I get confused sometimes trying to work through the math.
Basically, once you determine the "expected value" of each type of BIP, then you would calculate the *change* in expectation for each BIP effected by the fielder. For example, if the expected value of a BIP the moment it leaves the bat is +.3 (a likely but uncertain hit), and the fielder makes the play such that the run expectation for the inning has dropped .3 (compared to what it was the moment *before* the BIP), he would get +.6 runs credited. If the ball drops in such that the run expectation for the inning goes up by +.6 runs (compared to what it was *before* the BIP), the fielder would get charged .3 runs. Such a system would require that all *fielding* changes get charged to the fielder, and *all* zones would have to be counted. But I *think* that the pitcher would still get the .3 "PZR" debit and the fielder would get the +.6 UZR credit or .3 UZR debit. And in "impossible" zones, the run expectation would effectively all get charged to the pitcher.
Of course, this leaves out a huge problem: who gets blamed when a ball falls in a "shared" zone? Only one player can *make* a play and get credit as above, but two (or three) can *fail* to make a play. My best guess for now is that failures should be charged in proportion to league-average success rates per position for the BIP in question.
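The crediting scheme in the +.3 example above can be sketched in a couple of lines. This is a hypothetical illustration of the proposed Run Expectation approach, not a description of how UZR actually assigns credit:

```python
# Run-expectation fielding credit, per the example above: the fielder is
# credited with the gap between the change in run expectation that was
# expected the moment the ball left the bat and the change that actually
# occurred. (Illustrative sketch only.)

def fielder_runs(expected_delta, actual_delta):
    """Positive = runs saved by the fielder, negative = runs allowed."""
    return expected_delta - actual_delta

# A BIP worth +.3 expected runs to the offense as it leaves the bat:
made_play = fielder_runs(0.3, -0.3)  # play made, RE dropped .3: +0.6 credited
fell_in = fielder_runs(0.3, 0.6)     # ball fell in, RE rose .6: -0.3 charged
```

Under this convention the pitcher's "PZR" debit (the +.3 expectation off the bat) and the fielder's credit or debit add up to the actual change in run expectation.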
I think what I've proposed would probably help outfield ratings. Let me know what you think.
By the way, what I've described is, I think, the "AVM" system that Moneyball reports was sold to (and then replicated by) the A's.
Regarding errors, I suppose it's worth "decomposing" the error/range aspects. However, because errors only cost .02 runs more than plays "missed", the two components should add up to a run rating that would, to revert to "zone" (rather than Run Expectation) terminology, equal the difference in total plays successfully made relative to the league average out of total chances in a player's zone, multiplied by the average run savings per play made.
Thanks again.
Thanks for your interest. Unfortunately, I have the feeling it will be a while before I get to pre1974 data. I hope to do a 200203 report in a couple of months.
As for your comment:
"To use offense as an example, shouldn't one simply look at the sample of all baseout situations before and after all doubles we have records for have been hit, and then calculate the average expected value based on the average change in baseout state? "
No, because the sample size may not be enough. I find that you need 1 MILLION start base/out states to get a good number.
"By looking only at the change in the baseout state (and *not* the average number of runs actually scored after the double) I think you preserve the Markov property."
Yes, for the most part, which is why this is good.
As for your comment on steals, the effect is not as big as you think.
I published on my site somewhere the actual number of runs from the time of a steal to the end of the inning, and it is REMARKABLY close to the change-in-state model that you are discussing (both somewhere in the .18 to .20 range).
Could you explain how one can run simulations? Wouldn't you need some empirical data on changeinstate per BIP (for pitching/fielding) or per hit (for offense), at least to begin the simulation? I'm left with an impression that the simulations are somehow "bootstrapped" out of thin air. (Though I'm sure that's not the case.)
Thanks for the info on steals. When I ran DRA using CS as well as SB data, I got more normal weights for SB. For pre-Retrosheet years, what I might want to do is use Pete Palmer's estimates for CS. Perhaps his new Baseball Encyclopedia will have the numbers.
By the way, best of luck to you and MGL with Win Advancements and UZR. That AVM may have "invented" the approach we've been talking about is of no use to fans, and I'm sure that whatever you and MGL come up with will be greeted with enthusiasm.
From 1999 to 2002, there were 2634 SB with a runner on 1b and 1 out, and 2027 runs scored from the time of the SB to the end of the inning, or a run expectancy of 0.770.
Now, not all SB result in a runner at 2b and 1 out (throwing errors, etc.). And having a runner at 2b following a SB probably means that: a) the runner is above average speed, and b) the runner probably hits at the top of the order, and therefore is followed by good hitters.
If instead we look at all situations with a man on 2b and 1 out, the RE is 0.725. That's a pretty substantial .045 run difference.
Let's instead try to isolate a little of that. Let's look at only SB where the runner ends up at 2b, so that at least we are talking about the same end states. In that case, we have 2490 SB, with 1806 actual runs scored to the end of those innings, or ..... 0.725.
Hmmmmmm..... seems that we pretty much isolated the reason, haven't we?
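Tango's counts reduce to simple division; this sketch just reproduces the arithmetic with the 1999-2002 figures quoted above:

```python
# Empirical run expectancy to the end of the inning: runs scored after
# the event divided by the number of opportunities. The counts are the
# 1999-2002 figures quoted in the post above.

def run_expectancy(runs_to_end_of_inning, opportunities):
    return runs_to_end_of_inning / opportunities

all_sb = run_expectancy(2027, 2634)    # all SB with runner on 1b, 1 out: ~0.770
sb_to_2b = run_expectancy(1806, 2490)  # only SB ending with runner on 2b: ~0.725
```

Restricting to the same end state (runner on 2b) closes almost the whole .045 gap, which is the point of the comparison.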
***************
As to your other question, you need to derive statetransition rates from empirical data.
As for the book, Win Advancement and UZR will NOT be a part of it. (Though, maybe part of a second book, and I'd probably want you and/or Charlie to contribute your fielding stuff.)
The book that we are working on is a cross between The Hidden Game of Baseball and STATS Scoreboard.
Good catch. I did miss Beltre. He played enough in '99 and '00 to be included, and I just missed him.
Unfortunately, he seems to be another example of a third baseman with only two years of data that DRA has underrated.
His UZR ratings in '99 and '00 were +17 and +16. Diamond Mind strongly agrees with this assessment. In its '99 Dodgers team comment, they say, "[Beltre] showed very good range in the field. He made a lot of errors (29), but that's not unusual for a young player." In '00, Diamond Mind said, "He has excellent range at third [and has improved his fielding percentage from .931 to .934]."
DRA rates him as essentially average in '99 (-3), but does get his rating up to +12 in '00. Not quite as bad a miss as Brosius, Chavez, and Koskie, but I hope that his 2002 and 2003 DRA ratings will provide a better match.
Thanks.
As for the Lahman database, the "noncommercial" aspects will probably change, since its administration is handled under the Yahoo group baseballdatabank. I invite you to join that group to keep up.
Let's try to clear the air so that we can address your criticisms and research. This email is long, and I'd have preferred to send it to you directly, but I don't have your email address.
Imagine you're back in your Ph.D. program defending your dissertation. Let's say you're working in a field that is striving to address problems that are difficult to resolve to a level of certainty or accuracy normally sought in the sciences. Let's be even more specific: your field is Political Science, which is applying more statistical models these days, and has always been ambitious to be considered a science, to such an extent that it has persisted in literally calling itself a science when many scientists derided such pretensions.
You've completed the coursework for the Ph.D., including studies in statistics and modeling. You've been working on your dissertation for a year and a half. Now the two requirements for a dissertation are originality and relevance. You've been working with your advisor, reading all the literature in the field, and consulting with many other people in the field to make sure that your work *is* original and relevant. In fact, the problem you're working on has been identified as perhaps the most significant problem in the field, one for which solutions have been sought for decades, and for which very imperfect solutions have resulted in well-received dissertations. You've been in constant communication with all the people in the field, so they know you and your work and respect you as a near-peer even though you don't yet have your Ph.D.
You begin to present your dissertation and address the comments and questions of the people on the review committee, which includes the leading figures in your specialty, who, again, all know each other and you. You acknowledge the limitations of your work, but the audience can see that the work is original and does represent a significant advance in the field.
Suddenly someone no one in the room knows or knows anything about bursts into the room and announces, "Like you, I entertain the fantasy of obtaining a tenure-track position at a major university . . . . However, the fundamental [thesis] of your dissertation is hardly novel . . . . I think the 'completely new' and 'truly novel' rhetoric [needs to] be toned down. [Furthermore,] your dissertation thesis model actually sounds very much like my own . . . . Checking the correlations between [your model and mine], I find that my system correlates significantly better with [the data] than does yours."
How would you feel at that moment? Might you feel "insulted"?
Trying your best to contain your emotions, you explain to the speaker that you are not aware that anyone has done anything remotely like your dissertation. You ask the New Person in the Room (let's call him "NPR") to provide any example of work in the public realm that is remotely similar to your work. You ask NPR follow-up questions about the basic approach of his model, the degree of its accuracy, and whether it addresses certain important elements of the dissertation topic.
NPR provides no example of other work in the public realm. Instead, he says, "Personally, I can't really imagine what else these systems (such as yours and mine) would be doing except for some sort of corrections to the [models such as ours that are already out there]. But who knows, maybe we're the first!"
Now you're really confused. NPR has accused you of making a false claim of originality, won't provide any information to support the claim, and then says the only use for your model is as a correction to models such as yours that are already out there, but that his model, which he still hasn't explained or even described, is better at that anyway.
NPR also describes your test results as "spurious", even though you took pains in your dissertation to *highlight* that you were taking an "admittedly subjective" approach to cope with the fact that the data you were testing your system against was itself also subject to error.
Perhaps to soften the blow, NPR acknowledges that certain features of your model are "important", but doesn't explain how his system addresses them. He concludes: "But regardless, the [XYZ model, which can't be applied to as broad a range of situations as your model is attempting to address,] is a sufficiently superior system that I never thought there would be much interest in 'obsolete' systems [such as yours. But, judging by the response from the other people here, y]ou've proven me wrong in that regard!"
Again, how would you feel? "Insulted"? Exasperated? Totally confused?
You try to address NPR's contentions and close with the following:
"I'm simply baffled that someone who would go to the trouble of developing such a seemingly accurate system as yours, NPR, would (a) not have determined before [I presented this dissertation] how much more accurate it is than [prior models that have been published], and (b) have no idea that it would be of any interest to people in the field. That, and the fact that (a) your [claim that my model is only useful to validate similar models, which you still haven't shown exist] doesn't make sense, (b) [nobody here knows who you are], (c) you haven't begun to describe even the basic approach of your system and (d) you haven't provided your [data], very much makes me wonder. As I mention in the Introduction to the [dissertation], one of the key principles of [my model] is that 'everything has to add up.' Something doesn't add up here."
Inflamed with indignation, NPR selectively repeats your closing statement to create the impression that you're being totally unreasonable, leaving out the key point that the *reason* things "don't add up" is that "(a) NPR's [claim that your model is only useful to validate similar models, which NPR *still* hasn't shown exist] *doesn't make sense* [because there *are* no such other models], (b) [nobody here knows who NPR is], and (c) NPR *still* hasn't begun to describe even the basic approach of his system."
But NPR has just begun: "Now come on. To assume that the reason I haven't published my system is that I'm just making all this up is insulting. Or to claim that my [not identifying myself] discredits my work is equally absurd. I didn't see any mini-resume posted in any of your articles, so I likewise have no basis on which to judge your competence as a statistician." NPR grandly announces that he has a Ph.D. in a more rigorous field and proceeds to explain in greater detail why your dissertation is shoddy, and, with a sigh, adds, "if you really must know [about my system], I estimate [variable X] based on a chi-squared fit to the [Y and Z] stats and then run a regression analysis that is probably very much like yours, except for using the additional estimate of [variable X]. . . . Anyway, enough of that. Again, I'm sorry if you thought I was bragging about my system; I personally don't think any systems [such as yours] are all that useful, mine included."
Then, without missing a beat, NPR innocently continues, "A question and then a suggestion . . . ."
AED,
I hope you're smiling at this point, even if it's with condescension over the temerity of analogizing my little project to a dissertation. I wouldn't have bothered writing the story above if I didn't believe you when you say you are a Ph.D. scientist/statistician.
Allow me to make some important points a little more directly.
Sabermetrics is not Physics. It's probably a lot more like Poli Sci. There's a lot of missing and incomplete data and room for a lot of judgment.
Nevertheless, DRA is at least the first published attempt to design a fielding evaluation system based solely on empirically supported assumptions and statistically significant relationships between traditional pitching and fielding statistics, and between such statistics and runs allowed. That doesn't mean the resulting ratings can be presented as per se "correct" (it's not even possible to establish confidence intervals), but I at least think they look pretty good, based on the most comprehensive comparison ever performed between a non-zone system and a zone system.
If, per your suggestion, one deletes from the UZR sample the two/three-year ratings that are coded as "?", "dm~dra", "dm=dra", "dial=dra" and "park", i.e., the UZR ratings that don't match with other zone systems (and one data point with a park effect), the Pearson's r is .801. (In case you're checking, I might be performing the test on the most recent data set, which the editors mistakenly did not include, but which is not materially different. Before attacking the results of my prior comparison as "spurious", you might have taken the trouble to perform this test yourself, particularly as you suggested it and are very quick with generating correlation results.)
I very much appreciate your point about providing "sufficient details [so] that an equally knowledgeable person c[an] reproduce" results that are claimed in an article. I hope to publish a book that will provide all of the details, so that fans can test the formulas themselves. But in the meantime, I'm providing several ideas that no one has published before and with which other analysts could begin developing their own DRA-type systems. Currently *all* good systems for evaluating fielding are, to one degree or another, unverifiable by the average fan. Though MGL provides a full description of the UZR method, you'd have to spend thousands of dollars to buy the data and untold hours developing the software programs from scratch. I trust MGL, partly because he has the guts to show the results of his system, even when a few of his ratings are probably incorrect. Diamond Mind's zone-based system is completely proprietary, and they don't even give actual runs-saved estimates. Some aspects of Clay Davenport's system remain shrouded in mystery. But fans still appreciate UZR, Diamond Mind and DFT ratings. It seems that some appreciate DRA ratings as well.
Again, sabermetrics is not Physics. The people on this site enjoy bringing more rigor to the analysis of baseball, but we recognize sabermetrics for what it is. We also know each other, not by resume, sometimes not even by name, but by sharing ideas over the course of many months and (sometimes) years. Your anonymity was disconcerting not because I didn't know your name, but because neither I nor anyone else here had ever communicated with you before. A record of helpful communication is the only credential anybody here cares about.
To answer your questions:
As explained in the article (shortly before the "Mickey Mantle" discussion), I got the LF/CF/RF outfield data from John Jarvis's baseball website (he'll show up on Google), and he doesn't put any limitations on the use of his data.
It is curious (no value judgment there) that you get a better match with UZR if you include infield fly outs, because they *are* ignored by UZR. I encourage you to write more about this, as people have had a hard time correlating infield fly outs with skill. It sounds like they correlate with ground outs; otherwise you wouldn't get better results. Is there any chance that they help "dampen" or "adjust" ratings for the GB/FB tendency of a team's pitchers? In other words, is there any chance they act as an effective proxy for adjusting assists for the ground ball / fly ball tendency of the pitching staff?
Please write more on this topic, so those of us who now see only through a glass, dimly, can come to see what you're doing more clearly. What basic statistical concepts do you think your audience ought to know? What's a good introduction to those concepts? (A book, a website?)
I want to start from the beginning and read again, but I want more ammunition this time.
Thanks for the compliment. I think the best place to start would be by picking up a copy of "Curve Ball" by Jim Albert and Jay Bennett, which has a good description of the application of regression analysis to evaluating offense. Barnes and Noble always seems to have a paperback copy or two on its shelves. There are probably some decent statistics books out there as well that are not too academic.
Then you might try reading again the second installment of this article, which goes step-by-step through the DRA method and how that method differs from regression analysis for offense.
I've posted my email so that you can write to me directly with followup questions.
As I mentioned in a prior post to MGL, I haven't described all of the techniques used under DRA, but the general principles. At this point I need to be in a position to represent to a major league team that the complete methodology is still proprietary, so they can feel comfortable that it will provide them with an advantage. Though I believe that DRA would be very valuable to a team, the chances are probably fairly low that any team will actually buy it. That's why I'm also pursuing the book project. Any book I publish will provide the formulas right up front, and then provide in an Appendix a technical explanation that should be satisfactory to academic statisticians.
Don't forget that Saeger's or MAH's method, even though it doesn't specifically use zone data, essentially estimates the zone data through interpolation. If we know how many balls in play the Yankee pitchers allow, AND we know the G/F ratio of the Yankee pitchers, AND we know the L/R distribution of the Yankee pitchers (therefore the approximate L/R distribution of their opponent hitters), AND we know, either from league zone data OR from traditional assist data, what percentage of ground balls hit by RHB's (or given up by RHP's or LHP's) go to the 3rd baseman, SS, etc., on average, then we can approximate the zones of all batted balls, albeit those zones are large and they are not as reliable as real zone data. But since they are close, it is not surprising that a defensive metric that relies on estimating zone data from traditional stats, including R/L splits and G/F ratios of a pitching staff, will come up with similar results to a zone-based system like simple ZR or UZR. In fact, I think that a system like MAH's or Saeger's (the more that I think about it, I think they are exactly the same) will capture probably 90% of what a zone-based system will capture. In fact, if you look at metrics from range factor to MAH's and Saeger's DRA, and then ZR, and UZR, I think you have a continuum of defensive metrics in increasing order of accuracy. They all essentially measure the same thing, with each one using more and more data, therefore being more and more accurate (not counting the inaccuracies that often come with more complicated systems, even though there is more relevant data to work with). Here are my WAG estimates of how much of the "ultimate" defensive metric each of those metrics captures:
1) range factor: 25%
2) DRA: 75%
3) ZR: 75%
4) UZR: 90%
The reason I made ZR and DRA equal, even though ZR uses more detailed data, is that some of what ZR ignores or can't handle, DRA actually "picks up" (by accident) in the regression analysis (like balls hit out of zone, or the fact that balls given up by LHP's and RHP's may be hit more or less hard to various positions).
Since I don't have any idea what win shares does, AND because it is more of a strict "value" metric (like clutch hitting gets more value than non-clutch hitting, even given the same raw stats), I think, than an ability metric, or at least a context-neutral performance metric like the above, I didn't include it. I also didn't include regular fielding percentage, since it measures one thing and one thing only (and does a darn good job at that, except that what it measures doesn't really mean much), and the thing that it measures is one small part of fielding talent, and we don't even know how much of a fielder's actual "skill" or talent is in his fielding percentage and how much fielding percentage is luck (such as bad hops and arbitrary scoring by official scorers)...
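The zone-interpolation idea a few paragraphs up can be put in rough code. Every number below is a made-up placeholder (staff totals, league shares), purely to show the shape of the estimate, not real league figures:

```python
# Estimating expected ground-ball chances at one infield position from
# traditional staff totals, per the interpolation idea described above.

def expected_assists(bip, gb_ratio, pct_rhb, share_vs_rhb, share_vs_lhb):
    """Approximate ground-ball chances for one position.

    bip          -- balls in play allowed by the staff
    gb_ratio     -- fraction of BIP that are ground balls
    pct_rhb      -- fraction of opposing batters who hit right-handed
    share_vs_rhb -- league-avg share of RHB grounders fielded at this position
    share_vs_lhb -- same, for LHB grounders
    """
    grounders = bip * gb_ratio
    return grounders * (pct_rhb * share_vs_rhb + (1 - pct_rhb) * share_vs_lhb)

# Hypothetical staff: 4500 BIP, 48% grounders, 60% RHB faced; shortstops
# assumed to field 28% of RHB grounders and 22% of LHB grounders (invented).
chances = expected_assists(4500, 0.48, 0.60, 0.28, 0.22)
print(round(chances))  # roughly 553 expected SS ground-ball chances
```

Comparing a shortstop's actual assists to an estimate like this is the "large zone" approximation; real zone data just shrinks the zones.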
Perhaps readers should look at Charles' own articles about his system, "Context-Adjusted Defense" ("CAD"), as well as the email threads, so they can decide for themselves how similar CAD is to DRA. Provided below is the web address of his most recent CAD article. Readers can also go to the "Authors" section of Primer, then click on Charles Saeger, then look for that as well as other articles he has written.
I sent a copy of my article to Charles (as well as you, Tango, and Mike Emeigh) about two-and-a-half months ago. Charles and I exchanged about a dozen emails about various ratings. He never suggested or implied that he felt my system was the same as (or even similar to) his.
http://www.baseballprimer.com/articles/charlie_saeger_20020921_0.shtml
Also, I didn't know (I don't think) that you had sent me an advance copy of your article. I changed my email address about 4 months ago. Did I respond? If not, I apologize. I rarely check my old email address. I usually scan it and delete most of the emails, which are 98% spam...
CAD and DRA are similar in that they try to adjust traditional fielding statistics for various pitching variables, such as BIP (or SO), LHP, estimates (or indirect adjustments) for GB/FB pitching, runners on base, etc. By that standard, Win Shares, Total Baseball Fielding Linear Weights, DFT, CAD and DRA are *all* similar to some extent. None is "exactly the same" as DRA. Offhand, I can't say which of those systems is most similar to DRA. Perhaps some of our readers can offer their opinions.
What's new about DRA is that it uses regression analysis in a systematic way to *measure* how various pitching variables impact fielding, as well as the value, in runs, of each pitching and regression-adjusted fielding event. In this respect, *DRA* is "the empirical counterpart" to the *other* systems, which *impute* values for various defensive events based on data derived from *offensive* models (or informed guesses).
I think the other major innovation / advantage of DRA is that the format of the equations (shown in the article, as well as in one of my emails to you in this thread) is the simplest of any system out there, with the exception of the format of Total Baseball Fielding Linear Weights, which as yet are not empirically derived (though Pete may have some improvements in store).
Thanks for the correction. Has anyone heard from David since he left me the excellent "practical approximations" comment?
Thanks for writing. I agree that the regression techniques used here, which another poster characterized as "instrumental variables regression" (see the thread to the second installment), are actually not that complicated, and would be considered very simple by someone engaged in sophisticated statistical work.
Most of the time spent on the project was devoted to making the resulting model simpler, as well as more accurate. Don't worry, though: I haven't spent literally a year-and-a-half of my life working on this. I've had plenty of other things to do.
I didn't explain this so well in my last post, but I think there may be two factors that explain why your ratings are better when estimated infield fly outs per infielder are included.
First, it stands to reason that an infielder with additional speed that would help his range on ground balls would also be able to exploit that speed to catch more short flies that might otherwise drop in as hits. Other analysts have had a hard time quantifying this, so I didn't focus on it (perhaps One Simplification Too Far).
Second, infield fly outs at the level of the individual fielder might provide an additional "proxy" adjustment for GB/FB pitching, and therefore provide an improved indirect adjustment for *ground* ball *opportunities*.
Both factors may explain why Pete Palmer's non-empirical Fielding Linear Weights equation (which gives infielders half credit for putouts compared to assists) had a higher correlation with UZR in the shortstop sample in Mike Emeigh's article than DFT, DRA or Win Shares. However, Pete's formula leaves too much variance in ratings (the "scale" is too big, though I think he is working on that). It sounds like you've found a way to take care of that.
I may adjust DRA to take into account your insight, and will be sure to give you credit for the idea if I do so. I've posted my email address in case you'd like to write.
Third attempt to get this right.
Infielder putouts (and, even better, your estimates for perinfielder fly outs), provide additional evidence of the *ability* to reach *ground* balls.
Perinfielder fly outs may provide a good indirect adjustment for *ground* ball *opportunities*.
Regular ERA mixes BB and SO and HR's, which have low or moderately low regression rates, with $H, which has a very high regression rate. That is a no-no in one metric (unless you somehow weight them differently).
Is anyone working on weighting them differently? K% shouldn't be weighted the same as HR%, after all. Also, shouldn't this be done for hitters' projections too? Wouldn't it make sense to regress a hitter's BABIP more than his K rate, HR rate or BB rate? Is this, in effect, what PECOTA is doing? Does ZiPS regress the different components of performance to varying degrees? Sorry to go off track here.
You are absolutely correct. If you've got about 1500 to 2000 PAs from a hitter, you would regress his K, HR, and BB rates by about 15 to 20%. For non-HR hits, it's more like 30 to 40%.
In the case of strikeout rates, a pitcher who has averaged 9 strikeouts per 9 innings over the past three seasons is likely to strike out something like 9 per 9 IP next season.
All samples are regressed towards the mean they were drawn from. Try it out, though I suggest you use PAs and not IP. Select the last 50 pitchers who struck out at least 25% of their BFP over a 3-yr period, facing at least, say, 2000 batters each. Without running any numbers, if let's say these 50 pitchers averaged 28% K/BFP, and if the league mean is 18%, then I would expect their fourth year's K/BFP to be 26.5%.
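The arithmetic in that example can be checked directly. The rates below are the hypothetical ones quoted above, not measured data:

```python
# Regress the group's strikeout rate about 15% toward the league mean,
# per the hypothetical example above (28% group K rate, 18% league mean).
group_k, league_k, regress_amount = 0.28, 0.18, 0.15
projected = group_k - regress_amount * (group_k - league_k)
print(round(projected, 3))  # 0.265, i.e. the 26.5% figure quoted above
```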
In the case of BABIP rates, a pitcher whose BABIP is 0.250 over the past three seasons will probably be around 0.270 next season.
Assuming 1 year, and assuming that the league mean is .280, then this is fairly accurate.
So you need to treat each stat separately in the projection scheme, and then combine them into single performance metrics to project overall performance.
Agreed, and correct. There is some interdependence going on, but for the most part this is correct.
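A minimal sketch of the "regress each stat separately, then combine" idea: each component is pulled toward its league mean by its own amount. The league rates and regression amounts below are placeholders in the spirit of the figures quoted in this thread, not fitted values:

```python
# Per-component regression toward the mean: K/BB/HR rates regress lightly
# (~15-20%), BABIP regresses more heavily (~30-40%), as discussed above.
LEAGUE = {"k": 0.17, "bb": 0.09, "hr": 0.03, "babip": 0.300}   # invented
REGRESS = {"k": 0.17, "bb": 0.17, "hr": 0.17, "babip": 0.35}   # invented

def regress_components(observed):
    """Regress each observed rate toward its league mean by its own amount."""
    return {stat: rate - REGRESS[stat] * (rate - LEAGUE[stat])
            for stat, rate in observed.items()}

hitter = {"k": 0.25, "bb": 0.12, "hr": 0.05, "babip": 0.340}
print(regress_components(hitter))
```

The regressed components would then be recombined into a single line (OBA, OPS, or whatever overall metric you prefer).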
My understanding of PECOTA is that the baseline projection is done without any attempt at regression.
I think Nate does use regression.
I'm not sure that is the case, though; hitter BABIP tends to regress less than pitcher BABIP.
That's for sure. The year-to-year r for hitters is about .50 and for pitchers is about .20 (with an average PA of around 500 for hitters and 700 for pitchers). You would regress at 1 - r (for that level of PA).
Thanks for the regression-to-the-mean points. As I mentioned in a prior post, DRA and UZR individual ratings have about the same year-to-year Pearson's "r" as hitter BABIP, with the UZR "r" being slightly higher than the DRA "r". Thus, if all you have is a one-year DRA or UZR rating for a fielder, you should probably cut the rating in half for purposes of deriving a better "ability" measure.
Would you happen to know how one should regress a *two*-year DRA or UZR rating? Would one, for example, calculate the "r" between two-year ratings and third-year ratings, and use *that* "r" (or a weighted "r" for the first- and second-year ratings) to provide the ability estimate?
I think, though don't quote me, that it doesn't matter how many PAs you have in year 1 or year 2, as long as the number of players (or the sum of the PAs) in year 2 is "substantial".
Again, no quoting me, but if you take a player's 5-yr (2500 PA) OBA and look at his April OBA (100 PAs of 200 hitters), you would get a similar r between the 5-yr and the full-year OBA of a lesser number of players (625 PAs of 8 hitters).
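The "cut the rating in half" rule of thumb a few comments up is just regression by 1 - r, with the year-to-year r of about .5 quoted in this thread:

```python
# One-year fielding ratings regress toward the league mean of zero runs.
# r = 0.5 is the approximate year-to-year correlation cited in this thread.

def ability_estimate(one_year_rating, r=0.5):
    """Regress a single-season runs-saved rating toward the mean of zero."""
    return r * one_year_rating  # same as: rating - (1 - r) * rating

print(ability_estimate(20))   # a +20 season suggests about +10 of "ability"
print(ability_estimate(-12))  # a -12 season suggests about -6
```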
btw, I think DRA takes on extra importance in light of MGL's report that 2B, SS, and CF defense peaks in a player's first years in the majors. With play-by-play data, we couldn't get a sense of an up-the-middle defender's value until they'd already passed their peak.
The short answer is I don't know, but it's an excellent question. Just eyeballing the year-by-year 1974-2001 historical ratings, which I hope Primer will be able to post soon, the year-to-year patterns for players with five years of data (the population shown in the spreadsheet), particularly the outstanding fielders (who are the only ones people really care about anyway), seem to be very consistent, Ozzie Smith and Mike Schmidt being notable examples.
As mentioned in the article (towards the end of this installment), I completely agree with you that DRA's practical relevance could be in identifying outstanding fielders while they're still in the minor leagues, so that you'll know to promote them and be able to exploit their value before it declines.
If CF/SS have a huge "pure speed" component to them, I would not be surprised at all that they would peak rather early (there's only so much "smarts" can make up for lack of speed, unlike say the smarts to take a walk can offset slower wrist action).
From that standpoint, the ability to spot a top fielder early would be very important.
Thanks for sharing your expertise. Would the first formula require DRA/UZR to be expressed as a "rate" stat (such as an "84%" zone rating, compared with a league-average "80%" zone rating)? If that is the case, the empirical approach you've described would appear to be the one to use, particularly as the league-average number is set by definition to zero, so that the regression equation would reduce to:
"Ability" or "Expected" DRA/UZR for current year ("Y") = (regression coefficient)*(PriorTwoYearAverage DRA/UZR), or possibly
"Ability" or "Expected" DRA/UZR = (regression coefficient)*(Prior Year DRA/UZR) + (regression coefficient)*(Preceding Year DRA/UZR), etc.
And yes, the rigorous way to do a projection is to apply different regressions to each component and then to combine them, which is exactly what I do. I'm not sure that PECOTA does this, unless they do it implicitly by comparing a player's profile to another, older player and then using that older player's stats as a proxy for the younger player's future stats. Unfortunately I was not able to get my projections to Tango this year. Maybe next year.
Regarding the aging question, I find that the overall defensive effectiveness seems to peak around 27.
Wow, I think this statement is completely untrue! We just got done explaining why at most positions it is probable that a player's peak defensive value is at age 22 or 23, since most of a player's defensive ability is predicated on speed and agility, which appear to peak at a young age. My research indicated that by the time a player reaches the majors, his UZR seems to decline every year at every position other than first base. MAH stated that that was true for DRA as well. AED, where is the evidence for your statement about overall defensive ability peaking at age 27? While I don't doubt that "error defense" may peak at a later age or even remain fairly steady over most of a player's career, I highly doubt that overall defense peaks anywhere near age 27, but I'm willing to be convinced otherwise if you have any evidence...
Sorry for asking some questions you answered in your article. I didn't make it all the way to the end my first time through.
I think AED has developed a chi-squared model that somehow isolates the infielder skill-at-catching-Texas-Leaguers component of infielder putouts. Since I haven't revealed all of the techniques of DRA, I'm hardly in a position to ask AED to reveal his proprietary technology, but it's clear that he knows what he's doing.
I tend to agree with you about the aging analysis. "Pure" range simply has a much greater impact on overall fielding effectiveness than erroravoidance. But maybe there's something we're missing.
The issue of whether and by how much a player's true defensive ability changes from one year to the next, independent of a normal aging process, is a separate issue. This can be handled by simply using a weighting process, like we do with offensive predictions (e.g., 5/3/2). The weighting process, again, separate from an age adjustment, is only to account for the fact that a player's true ability may change over time, due to injury, learning, attitude, salary, drugs, steroids, unique aging, etc. If that were not true, i.e., if all players retained essentially the same ability over time, except for a uniform (the same for all players) aging process, then there would be no need for a weighting process (i.e., no need to care about how a player's historical stats were distributed over time, other than for purposes of doing the age adjustment).
Now, it is possible that a player's true defensive ability does not change as much over time as a player's offensive ability (again, not counting the effects of normal aging). In that case, we would not even want to weight different years. We would simply take a player's last 3 years' (or last 5 years', or 2 years'; like park factors, the more the better) defensive UZR or whatever, adjust each year for age, apply a regression coefficient to account for the sample size of those historical stats (how many "chances"), and voilà, we have a prediction (or a true ability) for next year!
OTOH, even though there might be more potential "learning" with time for offense as opposed to defense, a player's unique aging (especially weight gain) might affect defense much more than offense (like A. Jones), such that we might want to use as much or even more aggressive weighting in combining historical defensive stats than we do with offensive ones. In any case, the final regression (after combining the historical stats) should be around the same, since the amount of that regression is mainly (not completely) a function of the size of the historical sample we are using to estimate true ability (or predict future performance), regardless of how many "years" that historical sample encompasses.
The footnote to that last sentence is that, as AED explained, technically the regression is based on 2 things: one, the sample size, and two, the exact distribution of players' true abilities in the entire population (but that is a constant for all players, so we don't really care about that, although that distribution varies among positions and among different types of players). The real method for a projection is actually a Bayesian (and arithmetic) one using those two variables: one, sample size, and two, the distribution of true abilities in the population. Using regression coefficients derived from regression analyses is a "shortcut" method that yields a less precise (as compared to the Bayesian analysis) and "one-size-fits-all" answer, yet it also allows us to avoid the difficult process of estimating that distribution of true abilities (the only way to do that is to look at players with very long careers, assume that the spread in sample results for players with very long careers is just a little wider than the spread of true abilities, and then also make some intuitive assumptions, like that the sample of players with very long careers is probably heavily biased in favor of players with very good abilities, etc.)...
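The weight-then-regress recipe described above can be sketched as follows. The 5/3/2 weights and the regression coefficient are illustrative, not fitted, and the age adjustment is omitted for brevity:

```python
# Blend a fielder's recent seasons with heavier weight on recent years,
# then apply one regression toward the league mean (zero runs saved),
# per the weighting-then-regressing recipe described in this thread.

def project_defense(ratings_recent_first, weights=(5, 3, 2), r=0.6):
    """Weight per-year runs-saved ratings, then regress the blend toward zero."""
    total_w = sum(weights)
    blended = sum(x * w for x, w in zip(ratings_recent_first, weights)) / total_w
    return r * blended

# Made-up three-year rating history, most recent season first.
print(round(project_defense([12, 8, 20]), 2))  # about +7.4 projected runs
```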
Thanks, that helps.
Excellent!
In one of his posts, AED indicated that it was incorrect to include "DM'd" UZR ratings in calculating the correlation. Instead, UZR ratings that differ from Diamond Mind appraisals (and the corresponding DRA ratings) should be deleted from the sample before calculating the correlation.
My response:
"If, per [AED's] suggestion, one deletes from the UZR sample the two/threeyear ratings that are coded as ???, ?dm~dra?, ?dm=dra?, ?dial=dra? and ?park?, i.e., the UZR ratings that don?t match with other zone systems (and one data point with a park effect), the Pearson?s r is .801. (In case you?re checking, I might be performing the test on the most recent data set, which the editors mistakenly did not include, but which is not materially different. . . .)"
In other words, the correlation between DRA and UZR is still .8, whether or not calculated per AED's suggestion.
Excellent!
I am going to backpedal a little! Isn't that only true if the distribution of talent in a league is roughly normal? I think that the distribution of talent (i.e., the distribution of true BA's, true OPS's, true UZR's, etc.) is nowhere near normal, as James has been pointing out for years. The distribution of talent in MLB is taken from the far right end of a roughly normal distribution of talent from all young adult males, is it not? Which makes it heavily right-skewed and chopped off at the left, with a big chunk taken off the top of the left as well (the only poor hitters in MLB are good defenders and vice versa).
Given that, I don't think it is all that easy to look at the distribution of sample metrics (again, say BA or OPS) in any given period of time and infer the distribution of true values for that metric. We know that the sample distribution is going to be roughly normal (I'm not even sure about that, since the distribution of the underlying "true values" for each player is not), but since what we are trying to figure out (the distribution of true values) is not nearly normally distributed, I don't think your formula will come close to working, but I'm not sure, as I am nowhere near a statistician....
I think that AED measures *team* performance at each position, which generally is close to normally distributed. He then uses a different type of regression analysis and related techniques to determine what is called the "extra-binomial variation" at each position, which is presumed (not unreasonably) to be the "skill" component. It's a very elegant model, and one that I had considered using, as I alluded to in a prior post.
I have the feeling that individual fielder performance for fielders who play full-time is also much closer to a normal distribution than individual batter performance for batters who play full-time.
If I'm guessing correctly what AED's method is, it is probably actually a better model for determining the *number* of skill plays made at each position, and may be particularly well-suited to estimating "skill" infielder putouts.
There were two reasons that I went with the better-known form of linear regression.
First, the resulting equations under AED's model have a more complicated format that would be more difficult for the average fan to appreciate in an intuitive sense. One of my goals for DRA was to create, as mentioned in a Bill James quote in a thread to the first installment, a simple fielding statistic for each position that fans could just look at and immediately apprehend, even if they might not understand or care about all the fancy math that went into deriving the equation.
Second, I'm not sure that *run values* per skill play can be derived from AED's model, because I'm not sure that *runs allowed* can be treated as a binomial count, i.e., I'm not sure that one can say that runs scored out of total outs is a "random draw" to the same extent that one can say that strikeouts out of total batters faced, or shortstop assists out of total BIP, are "random draws". In other words, I'm not sure that one could run a "global" regression of "binomial count" runs allowed onto all the "extra-binomial-variation-adjusted" skill plays made at pitcher and at each of the positions, though I may be wrong on that point.
I may try experimenting with that approach to see what comes up.
While it is a given that the talent distribution at the MLB level is the far right of a bell curve, it is also a given that the playing time is much higher for the players at the far right as well. The combination of the two results in a rather typical looking distribution. (Not normal, but close enough.)
See the above link to see this in graphical action.