Hall of Merit — A Look at Baseball's All-Time Best
Monday, October 11, 2004
Battle of the Uber-Stat Systems (Win Shares vs. WARP)!
Don’t ever say that I never gave you anything! :-)
John (You Can Call Me Grandma) Murphy
Posted: October 11, 2004 at 09:46 AM | 347 comment(s)
Reader Comments and Retorts
Since players aren't replaced solely on the basis of their fielding and since the data set for major league players at one position in any given year is so small, how do you analytically determine the relationship between "replacement level" and "average" on a season-by-season basis?
is the major issue
A) The WARP league strength / timeline adjustment from WARP1 to WARP2?, or
B) the whole WARP1 calc to begin with?
I assume the length-of-schedule adjustment from WARP2 to WARP3 is not a biggie for anybody - maybe exceptions for the real short-league 1880s guys.
I for one don't have a big problem with WARP1, except for maybe a bit of extra fielding credit for SS, for example.
As for the timeline adjustments, if you put 40 of us in a room, you'll soon get 45 opinions :)
WARP3 is a complete waste of time, IMO.
WARP1 is OK except that it keeps changing and we don't entirely know why.
WARP2 is somewhere in between ;-)
Still, sunnyday, the only difference between WARP2 and 3 is schedule adjustment, so by any reasonable standard, WARP3 > WARP2, except perhaps in extreme 1870s circumstances.
I think what you mean to say is WARP2 is a complete waste of time, while WARP3 is a slightly better complete waste of time. ;)
Equally seriously, I know that TPR has been largely discredited. But that mostly relates to pegging a hypothetical zero value at the average. Aside from that, which each of us can adjust for as easily as we can schedule adjust WARP1, where are people at on TPR? Considering I get all my OPS and ERA+ numbers out of the Palmer-Gillette Encyclopedia, and considering the TPR is right there on the same page, I admit to glancing at it now and again. I am not sure it is not a useful number.
I used to look at WS, WARP and TPR and throw out the one that is not like the others. Now I don't look at any of them very much.
is the major issue
A) The WARP league strength / timeline adjustment from WARP1 to WARP2?, or
B) the whole WARP1 calc to begin with?
I answered this obliquely on the Lombardi thread, but I'll present the arguments explicitly here.
The biggest issue with WARP is the WARP2 all-time context adjustment to fielding value. I think that's simply a mistake.
The second big issue with WARP is the league strength adjustments. Too much of a black box to be trustworthy, especially when the results seem counterintuitive on the visible surface (as in the competition adjustments during WWII).
The third big issue with WARP is FRAR in WARP1. It's another big black box that has tremendous effects on the rankings and that gives results not intuitively reliable.
Because of these issues, I don't rely on WARP2 or WARP3 at all, and I don't rely heavily on WARP1 as a comprehensive metric.
That said, I find several components of WARP to be quite valuable. EQA is a handy improvement, imo, on OPS+ as a batting value rate stat. DERA is a very substantial improvement on ERA+ as a pitching rate stat, with results that can be cross-checked by studies of team fielding in relation to pitching. FRAA are at least as reliable as fws at assessing fielding quality.
In general, I think WARP does a much better job than win shares of handling the changing relationship of pitching and fielding in the creation of defensive value as the conditions of the game change (especially before 1900!). However, there are problems in the ways they turn their rate measures into comprehensive metrics that prevent me from using WARP as the foundational system for my rankings.
Now until recently, I was working with much older editions of TB (1996), but I also haven't yet checked the "new" 2004 edition to see how it handles fielding. Does anyone know how much the FR calculations have changed? Do they better reflect team-contexts? Or the historical relationships of the positions and of the pitcher/defense balance?
(Mattingly 7, of course.)
I do use WARP3 because unlike Win Shares it has a time line and schedule adjustment. There are players for whom I trust WARP over WS (Bobby Veach, Earl Averill), but usually if push comes to shove I will use WS with a few random adjustments for mistakes I believe it makes.
At the same time I love Eqa and RARP (VORP a little less so) and think that they really have the cutting edge on modern baseball statistics in every category but fielding, with UZR far better than FRAR or FRAA. It seems to me that they have these nice metrics for modern baseball and decided to trace them back through time. Maybe if a guy with a deeper understanding of baseball history went through and re-did WARP it would look much better. Or maybe not.
The differences lie in how the base rate is determined, and the exact adjustments made.
I do look at WARP1 relative to peers in the same era, and look at FRAA, but that's it for BP measures.
Now until recently, I was working with much older editions of TB (1996), but I also haven't yet checked the "new" 2004 edition to see how it handles fielding. Does anyone know how much the FR calculations have changed? Do they better reflect team-contexts? Or the historical relationships of the positions and of the pitcher/defense balance?
Fielding Runs have been reworked quite extensively, and seem to be much better. I think if you take FWAA, Win Shares/1000 Innings Fielding, and Fielding Runs all together, you can get a good handle on a player's fielding abilities.
This is true, and it is nearly impossible (I think) to correlate FWS with FRAR.
Take a fielder (like Lou Boudreau) with 54 FRAR in 1943. Using a very simplified model, suppose 9 runs = 1 win (which is basically what WARP does). That means Lou's fielding was 6 wins above replacement. In WS lingo, that's 18 FWS. That's just for fielding and that's supposed to be above replacement. We don't know what FWS replacement for a shortstop is, but if it is anything north of 0 FWS, then Boudreau would be entitled to more than 18 FWS to make it equivalent to the WARP number. The highest number of FWS ever recorded for a shortstop was 12.83 (it wasn't Boudreau).
Let's take the reverse. Boudreau had 8.8 FWS in 1943, which is 2.9 wins, which is 26.4 runs. So to equate that to an FRAR of 54, you'd have to assume that replacement level in FWS was negative 27.6 runs, which is -3.07 wins, which is -9.2 FWS, which of course, is impossible.
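The two conversions above can be sketched in a few lines. This is only a rough illustration that assumes the simplified rates used in the post (9 runs per win, 3 Win Shares per win); the function and variable names are made up for the example and are not part of either system.

```python
# Simplified rates from the post: 9 runs = 1 win, 3 Win Shares = 1 win.
RUNS_PER_WIN = 9.0
WS_PER_WIN = 3.0

def runs_to_ws(runs):
    """Convert runs to Win Shares by way of wins."""
    return runs / RUNS_PER_WIN * WS_PER_WIN

def ws_to_runs(ws):
    """Convert Win Shares to runs by way of wins."""
    return ws / WS_PER_WIN * RUNS_PER_WIN

frar = 54.0  # Boudreau's 1943 FRAR, as quoted in the post
fws = 8.8    # Boudreau's 1943 fielding Win Shares, as quoted in the post

# Forward direction: FRAR expressed on the FWS scale.
fws_equivalent = runs_to_ws(frar)

# Reverse direction: FWS expressed in runs, and the replacement level
# that would be needed on the FWS scale to reach 54 runs above it.
fws_in_runs = ws_to_runs(fws)
implied_replacement_runs = fws_in_runs - frar
implied_replacement_fws = runs_to_ws(implied_replacement_runs)

print(round(fws_equivalent, 1), round(fws_in_runs, 1),
      round(implied_replacement_runs, 1), round(implied_replacement_fws, 1))
```

A negative value for implied_replacement_fws is the point of the argument: the Win Shares scale has no room below zero for a replacement level that low.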
Another way to look at this, using a method tailored to the number of games/innings. The average shortstop in 1943 AL had 3.89 FWS in 91.7 games. Using 8.8 innings per game (just for kicks), you get 807 defensive innings, which means the FWS rate was about 4.82 FWS per 1000 innings, for an average SS.
Boudreau played in 152 games, or 1338 innings, which means the average shortstop in the FWS system would have had 6.45 FWS (4.82*1.338), or 2.14 wins, or 19.34 runs.
BP has Boudreau with 54 FRAR and 22 FRAA, which means the average player with the same playing time as Boudreau saved 32 more runs than a replacement player with the same playing time as Boudreau. If an average shortstop with the same playing time is 32 runs better than replacement with the same playing time, and the average player has 19.34 runs derived from the FWS system, then replacement must be -12.6 runs, or negative 1.4 wins, or -4.2 FWS. Again, that's impossible in the WS system.
Also, if the average shortstop with Boudreau's playing time would have had 6.45 FWS and he had 8.8, then he has FWSAA of 2.35, which is .78 wins, which is 7 runs above average (as opposed to a FRAA of 22).
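For anyone who wants to rerun the playing-time version of this comparison, here is a small sketch. It again assumes the post's simplified rates (9 runs per win, 3 Win Shares per win) and its rough 8.8 innings per game; all names are invented for illustration.

```python
RUNS_PER_WIN = 9.0
WS_PER_WIN = 3.0
INNINGS_PER_GAME = 8.8  # the post's rough figure, "just for kicks"

def ws_to_runs(ws):
    """Convert Win Shares to runs by way of wins."""
    return ws / WS_PER_WIN * RUNS_PER_WIN

# Average 1943 AL shortstop, per the post: 3.89 FWS in 91.7 games.
avg_rate_per_1000 = 3.89 / (91.7 * INNINGS_PER_GAME) * 1000

# An average shortstop given Boudreau's playing time (152 games).
boudreau_innings = 152 * INNINGS_PER_GAME
avg_fws_same_time = avg_rate_per_1000 * boudreau_innings / 1000
avg_runs_same_time = ws_to_runs(avg_fws_same_time)

# BP's gap between average and replacement at this playing time: FRAR - FRAA.
avg_minus_replacement = 54.0 - 22.0

# Where replacement would have to sit on the FWS-derived run scale.
implied_replacement_runs = avg_runs_same_time - avg_minus_replacement

# Boudreau above average on the FWS scale, converted to runs.
runs_above_avg = ws_to_runs(8.8 - avg_fws_same_time)

print(round(avg_rate_per_1000, 2), round(avg_fws_same_time, 2),
      round(avg_runs_same_time, 2), round(implied_replacement_runs, 2),
      round(runs_above_avg, 2))
```

Note that the innings-per-game figure cancels out of the playing-time adjustment, so the choice of 8.8 only affects the per-1000-innings rate, not the final comparison.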
Although they certainly don't equate, you can come a lot closer to deriving FRAA from FWS than you can FRAR. BP says it sets replacement as the lowest runs calculated at the position for that season, but of course, we don't know how to calculate it.
Certainly BP is using some sort of linear weights system for fielding, with tweaks, which is by its nature going to produce different results than FWS Claim Points.
I think if you take FWAA, Win Shares/1000 Innings Fielding, and Fielding Runs all together, you can get a good handle on a player's fielding abilities.
That's about all you can do, and compare them to comparables at the same position (within the same WS or WARP system).
"The Win Shares system is vastly more conservative in measuring the differences among third basemen, or players at any defensive position, than is Linear Weights."
"Linear Weights in 1952 rates Ed Yost at -35 runs, while rating Fred Hatfield of Detroit...at +19 runs. I think it is better not to assert that there is a 54-run difference between two third basemen -- a 40 homer difference -- without very solid evidence that such a gulf actually exists."
"But a 54-run difference would be equivalent to a swing of about 17 Win Shares. I do not now believe that this is a realistic estimate of the defensive impact of a third baseman. Our system would normally evaluate the difference between the league's best defensive third baseman and the worst at something more like four to five Win Shares." ...or 12-15 runs.
Three more things I noticed about WARP today, relating to hitting:
1. No positive credit is given for sacrifice hits, unlike RC which gives .50 credit. However, WARP does include sacrifice hits in the number of outs (and so does RC). That would seem to lower the BRAR and BRAA of the "little" hitters.
2. Grounding into a double play does not count as an additional out in the WARP system, but it does in runs created. That would seem to increase the BRAR and BRAA of the slow/ground ball hitters.
3. WARP uses the same formula for hitting regardless of the era, so hits, total bases, walks and steals are all worth the same throughout eras. At first I thought maybe they make up for that in moving from WARP1 to WARP2, but I don't think so, because with the translation to WARP2 they are only taking into account league difficulty.
Runs Created has some modifications tailored to era, which is why there are 24 different formulas.
"But a 54-run difference would be equivalent to a swing of about 17 Win Shares. I do not now believe that this is a realistic estimate of the defensive impact of a third baseman. Our system would normally evaluate the difference between the league's best defensive third baseman and the worst at something more like four to five Win Shares." ...or 12-15 runs
In the "new" Fielding Runs Yost is -30 with a 78 (100=ave) Fielding Rating, while Hatfield is a +12 with a 109 Fielding Rating.
Back in January, I examined the 2004 Hall of Fame ballot through the lens of Baseball Prospectus' Davenport Translated player cards. The idea was to establish a new set of sabermetric standards which could help us separate the Cooperstown wheat from the chaff, especially since Bill James' Hall of Fame Standards and Hall of Fame Monitor tools have reached their sell-by date. After all, the Hall has added 26 non-Negro League players since James last revised those tools in 1994's The Politics of Glory, and we've learned a lot since then.
These new metrics enable us to identify candidates who are as good or better than the average Hall of Famer at their position. By promoting those players for election, we can avoid further diluting the quality of the Hall's membership. Clay Davenport's Translations make an ideal tool for this endeavor because they normalize all performance records in major-league history to the same scoring environment, adjusting for park effects, quality of competition and length of schedule. All pitchers, hitters and fielders are thus rated above or below one consistent replacement level, making cross-era comparisons a breeze. Though non-statistical considerations--awards, championships, postseason performance--shouldn't be left by the wayside in weighing a player's Hall of Fame case, they're not the focus here.
Since election to the Hall of Fame requires a player to perform both at a very high level and for a long time, it's inappropriate to rely simply on career Wins Above Replacement (WARP, which for this exercise refers exclusively to the adjusted-for-all-time version, WARP3). For this process I also identified each player's peak value as determined by the player's WARP in his best five consecutive seasons (with allowances made for seasons lost to war or injury). That choice is an admittedly arbitrary one; I simply selected a peak value that was relatively easy to calculate and that, at five years, represented a minimum of half the career of a Hall of Famer.
This oversimplification of career and peak into One Great Number isn't meant to obscure the components which go into that figure, nor should it be taken as the end-all rating system for these players. We're looking for patterns to help us determine whether a player belongs in the Hall or doesn't and roughly where he fits. Though this piece is founded on the sabermetric credentials of Hall of Fame candidates, I've also taken the trouble to wrangle together traditional stat lines for each one, including All-Star (AS), MVP and Gold Glove (GG) awards as well as the hoary but somewhat useful Jamesian Hall of Fame Standards (HOFS) and Hall of Fame Monitor (HOFM) scores.
The career and peak WARP totals for each Hall of Famer and candidate on the ballot were tabulated and then averaged [(Career WARP + Peak WARP) / 2] to come up with a score which, because it's a better acronym than what came before, I've very self-consciously christened JAWS (JAffe WARP Score). I then calculated positional JAWS averages and compared each candidate's JAWS to those enshrined.
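As a rough sketch of the calculation just described: JAWS is the average of career WARP and the best five consecutive seasons. The season values below are invented for illustration, the function names are hypothetical, and the sketch ignores the allowances for seasons lost to war or injury mentioned earlier.

```python
def best_consecutive_peak(season_warp, window=5):
    """Highest total WARP over any `window` consecutive seasons."""
    if len(season_warp) <= window:
        return sum(season_warp)
    return max(sum(season_warp[i:i + window])
               for i in range(len(season_warp) - window + 1))

def jaws(season_warp, window=5):
    """(Career WARP + Peak WARP) / 2, with peak as best consecutive run."""
    career = sum(season_warp)
    peak = best_consecutive_peak(season_warp, window)
    return (career + peak) / 2

# Hypothetical season-by-season WARP3 values (not a real player).
seasons = [2.0, 5.0, 8.0, 9.0, 10.0, 9.0, 7.0, 4.0, 3.0, 1.0]

career = sum(seasons)
peak = best_consecutive_peak(seasons)
score = jaws(seasons)
print(career, peak, score)
```

Because the score is a straight average, a positional JAWS average is the same as averaging the positional career and peak averages, up to rounding.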
It should be noted that I simply followed the Hall's own system of classifying a player by the position he appeared at the most. Thus, for example, Rod Carew is classified as a second baseman, and all of his numbers count towards establishing the standards at second, even though he spent the latter half of his career at first base. This is something of an inevitability within such a system, but if the alternative is going nuts resolving the Paul Molitors and the Harmon Killebrews into fragmentary careers at numerous positions, we'll never get anywhere.
By necessity I had to eliminate not only all Negro League-only electees, who have no major league stats, but also Satchel Paige and Monte Irvin, two great players whose presence in the Hall is largely based on their Negro League accomplishments. Other Negro Leaguers, such as Jackie Robinson, Roy Campanella and Larry Doby have been included. While their career totals are somewhat compromised by not having crossed the color line until relatively later in their careers, their peak values--especially Robinson's--contribute positively to our understanding of the Hall's standards.
Here are the positional averages, the standards, to which I'll refer throughout the piece.
POS        #   BRAR  BRAA  FRAA   WARP   PEAK   JAWS
C         13    406   197    61   94.8   41.3   68.1
1B        18    717   465     2   98.2   43.1   70.7
2B        16    558   255    70   99.0   41.9   70.4
3B        10    594   322    48  100.2   42.2   71.2
SS        20    411   136    77  100.5   43.2   71.9
LF        18    730   462    -8  103.8   42.8   73.3
CF        17    694   445    14  108.8   46.5   77.6
RF        22    754   482    33  110.2   43.3   76.8
CI        28    673   414    18   98.9   42.8   70.8
MI        36    476   189    74   99.8   42.6   71.2
IF        64    562   287    49   99.4   42.7   71.1
OF        57    729   465    15  107.8   44.1   75.9
Middle    66    519   257    56  101.1   43.3   72.2
Corners   68    714   449    16  103.9   42.9   73.4
Hitters  134    618   354    36  102.5   43.1   72.8
A quick breeze through the other abbreviations: BRAR is Batting Runs Above Replacement, BRAA is Batting Runs Above Average; both are included here because they make good secondary measures of career and peak value. FRAA is Fielding Runs Above Average, which is a bit less messy and more meaningful to the average reader than measuring from replacement level.
It's worth noting that these figures have changed somewhat since the last time around, as Davenport has continued to revise his system--particularly the defensive elements--and adjust appropriately for the way the game has changed over 135 years of major-league history. Most notably, the spread between the average JAWS scores at various positions has been cut in half, which I interpret as a sign that the system's biases have been reduced. So without further ado, we'll move on to the 2005 Hall of Fame ballot.....
Anyone care to do it for our guys . . . if I get free time I'll take a stab at it, but I have no idea where I'd be.
It'd be interesting to see our average electee according to that, and maybe an average of our bottom 3, since we don't have the mistakes of the Hall of Fame. Also it's cool to see both the peak and career numbers . . .
should be:
"but I have no idea when that'd be"
very strange when the subconscious mind takes over for the conscious one . . .
Anyone care to do it for our guys . . . if I get free time I'll take a stab at it, but I have no idea where I'd be.
I have this data somewhere for WS and WARP1. It is the baseline for my WARP and WS evaluations. I'll try to find the spreadsheet and --- and what? How do I post it?
*A caveat with the WARP stuff in the spreadsheet. I did this spreadsheet before the first election because establishing the HoF baseline underlies about 7/8 of my HoM rating system. BP has changed the WARP calculations several times since I first did the sheet a couple of years ago, so I suspect the numbers in the sheet are not perfectly in accord with WARP anymore.
Here's what the sheet has for the HoF hitters (and Ripken), listed by position:
1. RC/27 LgRC/27 RangeFactor LgRangeFactor
2. Win Shares: 3-year peak, 5-year consec peak, 7-year peak, total and per 162 games
3. WARP1: 3-year peak, 5-year consec peak, 7-year peak, total and per 162 games
4. HOF Standards and HOF Monitor scores
Here's what it has for the HoF pitchers:
1. Win Shares: 3-year peak, 5-year consec peak, 7-year peak, total and per 100 IP
2. WARP1: 3-year peak, 5-year consec peak, 7-year peak, total and per 100 IP
3. HOF Standards and HOF Monitor scores
4. Linear Weights: 3-year peak, 5-year consec peak, 7-year peak, total and per 100 IP
5. Wins Above Team
Also, I just added significant relievers in all of the above pitching categories except #4 and #5. Only Eck and Fingers are in the HoF, but I included 15 other recognizable names (Sutter, Face, Rivera, etc.).
You will also see a little "grade sheet" at the end of each positional category. No need to pay attention to that, but it creates a little hall of fame report card using the averages in the various categories and standard deviations in those categories. The spreadsheet currently just does the calcs for the career numbers, but you can easily fix the formulas yourselves to apply them to 3-year peak or some other category...if you are so inclined. I rarely use those numbers "as is". I first eliminate the extreme cases (like Ruth's and Young's career WS on the high end, and Hafey's and Haines' career WS on the bottom end). Anyway, who cares about that.
So, does anyone want this thing? Should I e-mail it to those who request it, or can someone instruct me how to send it to the group through the HOM group on Yahoo?
I was leafing through Win Shares, and Bill James says he thinks it would be interesting to see how much "star power" a team has by taking each player's WS for that year and multiplying it by his career WS.
Maybe the same would work for a unified peak/career number. Multiply a player's WS/162 (or WARP/162) times his career WS (or WARP). Maybe take the square root of that to get it to a manageable number.
So Bobby Doerr with 281 WS and 22.61 per 162 would get a score of 79.7. Joe Gordon with 243 WS and 25.13 per 162 would get 78.2. So they are about even, despite Doerr's longer career and Gordon's higher peaks.
Using WARP1, Doerr would get 30.6 and Gordon would get 27.0, which accords with how much more favorably WARP views Doerr.
You'd have to season-length adjust and make whatever other adjustments you want before using this formula.
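The proposed score is easy to try out. A minimal sketch, assuming only the formula described above (square root of career total times the per-162 rate); the function name is made up, and small differences from the figures quoted are presumably down to rounding in the inputs.

```python
import math

def peak_career_score(career_total, per_162):
    """Geometric-mean style blend of career total and per-162 rate."""
    return math.sqrt(career_total * per_162)

# Win Shares inputs as quoted in the post.
doerr = peak_career_score(281, 22.61)
gordon = peak_career_score(243, 25.13)

print(round(doerr, 1), round(gordon, 1))
```

Taking the square root makes this the geometric mean of the two inputs, so a player cannot score well on career length alone or on rate alone.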
I think I may have proposed this in the past (without the Bill James backup and without the square root) and it did not take off, but I can't remember. :)
I was trying to rank center fielders the other day, and was using WARP fielding ratings. What to do about Max Carey? His FRAA was 32, and his FRAR was 556. Now, the difference between an average hitter and a replacement hitter during his time was only 303 runs, so how can the fielding difference by a (mostly) CF be so much more?
If you look at all NL CFs in 1917, you'll find that Edd Roush was the worst, with a fielding rate of 87. He was still well above replacement. His main backup had a rate of 91. Most players with very few games at a position get a rate of 100. But I'm wondering, if Roush is well above replacement, and he's the worst CF, who exactly is this replacement level player who would fill in for him? Some people take replacement level to be equivalent to the worst regular in a league. Looking at the 1917 AL, I find that rates for CFs go as low as 81 for Clyde Milan (still slightly above replacement.) Fielding rates over 110 and below 90 are pretty rare, whereas hitting rates commonly vary by much more than that. So how can the differences between average and replacement fielders be greater than the difference between an average and replacement hitter?
Part of the problem is that WARP treats fielding as of more importance in earlier years, when there were more balls in play. This makes sense, but the degree of difference it uses seems excessive. For Willie Davis, who played about the same number of games, the difference between FRAA and FRAR is 377.
Another problem appears in 1927. Carey has a 108 fielding rate in CF, but a 93 in RF. The other CF on the Dodgers that year also has a high rate, while the other LFs and RFs also have low rates. This looks like a matter of distribution of batted balls more than an accurate assessment of fielding performance.
Anyway, you can see why I don't submit ballots, when I can't even rank the center fielders.
Sadly, that's one of the several reasons I've chosen to eschew WARP (esp WARP2/3) in creating my own rankings.... Much as WS may have its limitations, they are known and adjustable.
The way I see it, I would value a player like this, theoretically:
1) Batting runs over replacement level at position (I would take the average of the bottom 15% of regulars as replacement level for all positions except pitcher, where I'd use the league average).
2) Fielding runs over average at position.
3) Pitching runs over replacement (I would take the average of the bottom 15% of pitchers in the same role with about 100 IP for starters, maybe 50 for relievers, give or take the length of the schedule, etc. - I'm open to ideas here though).
I could see splitting 1 and 3 into - 1a) Batting value over replacement level hitter (generally .350 or so, could see the case for anywhere from .300 - .400) 1b) Defensive constant, based on position played.
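To make step 1 concrete, here is one way the "average of the bottom 15% of regulars" baseline could be computed. This is only a sketch under stated assumptions: the run totals are invented, everyone is treated as having equal playing time, and the rounding rule for how many players make up the bottom 15% (round down, but at least one) is my own choice, not something specified above.

```python
import math

def replacement_level(values, share=0.15):
    """Mean of the lowest `share` of the values (at least one value)."""
    ordered = sorted(values)
    n = max(1, math.floor(len(ordered) * share))
    bottom = ordered[:n]
    return sum(bottom) / len(bottom)

def batting_runs_over_replacement(player_runs, regulars_runs):
    """Player's batting runs minus the positional replacement baseline."""
    return player_runs - replacement_level(regulars_runs)

# Hypothetical batting runs for 16 regulars at one position (made up).
regulars = [95, 88, 84, 80, 77, 75, 72, 70, 68, 66, 63, 60, 58, 55, 50, 44]

baseline = replacement_level(regulars)
print(baseline, batting_runs_over_replacement(84, regulars))
```

A real version would need to prorate by plate appearances or outs before comparing players, which is where having a full database of seasonal run values would come in.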
What I'm still trying to figure out is a simple way to do this for WARP, without having a database of all their values. I'd pay decent money for a workable complete all-time database of their run values for offense/defense/pitching for each season. Would make it much easier to figure out league norms, etc..
Still pondering the general issue of pitching versus fielding. Going back to the Bob Lemon discussion, if pitching was a bigger part of defense in 1949 than it was in 1925, and fielding a smaller part, it seems to me that the standard deviation of ERAs would be higher in 1949 than in 1925. So, I did the following. Using all pitchers in 1924, 1925, 1948, and 1949 with over 50 innings pitched, I took the standard deviation of ERA for each team. Then I compared the average for 1924/5 with the mean for 1948/9.
In 1924/5, the average standard deviation was 0.84. For the 1948/9 period, it was 0.87. I also divided the STD by the average ERA, and came up with a figure of 0.20 for the earlier years and 0.21 for the later.
This would suggest that there wasn't much of a change in the relative importance of pitching and fielding in this period. For the 1925 NL, BP has the difference between an average and replacement level CF as 36 runs. In the 1949 AL, it shows a 19 run difference. My study seems to show that the change should be on the order of 2 or 3 runs, not 17. Of course, I could be wrong.
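For anyone who wants to repeat this kind of check with real data, the calculation might look like the sketch below. The ERA values are invented, the helper names are hypothetical, and I am assuming a population standard deviation, since the post does not say whether a population or sample figure was used.

```python
from statistics import mean, pstdev

def team_era_spread(eras):
    """Return (standard deviation, standard deviation / mean) for one team."""
    sd = pstdev(eras)
    return sd, sd / mean(eras)

def period_average(teams):
    """Average the per-team spreads across all teams in a period."""
    spreads = [team_era_spread(eras) for eras in teams.values()]
    avg_sd = mean([s[0] for s in spreads])
    avg_relative = mean([s[1] for s in spreads])
    return avg_sd, avg_relative

# Made-up ERAs for pitchers with 50+ IP on two hypothetical teams.
teams = {
    "Team A": [3.0, 4.0, 5.0],
    "Team B": [2.5, 4.0, 5.5],
}

avg_sd, avg_relative = period_average(teams)
print(round(avg_sd, 2), round(avg_relative, 2))
```

Running period_average once per period (1924/5 and 1948/9) and comparing the two results reproduces the shape of the comparison described above.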
Win Shares Walk Thru Part I
....the system is giving out absolute wins on the basis of marginal runs. 50% of the league average in runs scored, with a Pythagorean exponent of 2, corresponds to a W% of .200. It is for this reason that in old FanHome discussions myself and others said that WS had an intrinsic baseline of .200 (James changed the offensive margin line to 52%, which corresponds to about .213).
In an essay in the book, James discusses this, and says that the margin level (i.e. 52%) “is not a replacement level; it’s assumed to be a zero-win level”. This is fine on its face; you can assume 105% to be a zero-win level if you want. But the simple fact is that a team that scored runs at 52% of the league average with average defense will win around 20% of their games. Just because we assume this to not be the case does not mean that it is so.
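The .200 and .213 figures follow directly from the Pythagorean formula with an exponent of 2, as a quick sketch shows. Runs are expressed as fractions of the league average, and the function name is just for illustration.

```python
def pythagorean_wpct(runs_scored, runs_allowed, exponent=2.0):
    """Expected winning percentage from runs scored and allowed."""
    rs = runs_scored ** exponent
    return rs / (rs + runs_allowed ** exponent)

# Offense at 50% and 52% of league average, with league-average defense.
at_50 = pythagorean_wpct(0.50, 1.0)
at_52 = pythagorean_wpct(0.52, 1.0)

print(round(at_50, 3), round(at_52, 3))
```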
Win Shares would not work for a team with a .200 W%, because the team itself would come out with negative marginal runs. If it doesn’t work at .200, how well does it work at .300, where there are real teams? That’s a rhetorical question; I don’t know. I do know that there will be a little bit of distortion everywhere.
In discussing the .200 subtraction, James says “Intuitively, we would assume that one player who creates 50 runs while making 400 outs does not have one-half the offensive value of a player who creates 100 runs while making 400 outs.” This is either true or not true, depending on what you mean by “value”. The first player has one-half the run value of the second player; 50/100 = 1/2, a mathematical fact. The first player will not have one-half the value of the second player if they are compared to some other standard. From zero, i.e. zero RC, one is valued at 50 and one is valued at 100.
By using team absolute wins as the unit to be split up, James implies that zero is the value line in win shares. Anyone who creates a run has done something to help the team win. It may be very small, but he has contributed more wins than zero. Wins above zero are useless in a rating system; you need wins and losses to evaluate something. If I told you one pitcher won 20 and the other won 18, what can you do? I guess you assume the guy who won 20 was more valuable. But what if he was 20-9, and the other guy was 18-5?
You can’t rate players on wins alone. You must have losses, or games. The problem with Win Shares is that they are neither wins nor wins above some baseline. They are wins above some very small baseline, re-scaled against team wins. If you want to evaluate WS against some baseline, you will have to jump through all sorts of hoops because you first must determine what a performance at that baseline will imply in win shares. Sabermetricians commonly use a .350 OW%, about 73% of the average runs/out, as the replacement level for a batter. A 73% batter though will not get 73% as many win shares as an average player. He will get less than that, because only 21% (73% - 52%) of his runs went to win shares, while for an average player it was 48%. So maybe he will get .21/.48 = 44%. I’m not sure, because I don’t jump through hoops.
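The .21/.48 estimate can be written out as a tiny sketch. It assumes, as the passage does, that offensive win shares are proportional to runs above the 52% margin at equal outs; the names here are invented for illustration and this is not James's actual procedure.

```python
OFFENSE_MARGIN = 0.52  # offensive zero-win line, as a share of league average

def marginal_share(rate_vs_league, margin=OFFENSE_MARGIN):
    """Portion of league-average production that counts as marginal runs."""
    return rate_vs_league - margin

replacement_marginal = marginal_share(0.73)  # the .350 OW% batter
average_marginal = marginal_share(1.00)      # a league-average batter

# Replacement batter's win shares relative to an average batter's.
ratio = replacement_marginal / average_marginal
print(round(replacement_marginal, 2), round(average_marginal, 2), round(ratio, 2))
```

So a batter at 73% of league production lands at roughly 44% of an average batter's offensive win shares under this assumption, not 73%.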
Bill could use his system, and get Loss Shares, and have the whole thing balance out all right in the end. But to do it, you would have to accept negative loss shares for some players, just as you would have to accept negative win shares for some players. Since there are few players who get negative wins, and they rarely have much playing time, you can ignore them and get away with it for the most part. But in the James system, you could not just wipe out all of the negative loss shares. Any hitter who performed at greater than 152% of the league average would wind up with them, and there are (relatively) a lot of hitters who create seven runs a game.
James writes in the book that with Win Shares, he has recognized that Pete Palmer was right after all in saying that using linear methods to evaluate players would result in only “limited distortions”. And it’s true that a linear method involves distortions, because when you add a player to a team, he changes the linear weights of the team. This is why Theoretical Team approaches are sometimes used. But the difference between the Palmer system and the James system is that Palmer takes one member of the team, isolates him, and evaluates him. James takes the entire team.
So while individual players vary far more in their performance than teams, they are still just a part of the team. Barry Bonds changes the linear weight values of his team, no doubt; but the difference might only be five or ten runs. Significant? Yes. Crippling to the system? Probably not. But when you take a team, particularly an unusually good or bad team, and use a linear method on the entire team, you have much bigger distortions.
Take the 1962 Mets. They scored 617 and allowed 948, in a league where the average was 726. Win Shares’ W% estimator tells me they should be (617-948+726)/(2*726) = .272. Pythagoras tells us they should be .304. That’s a difference of 5 wins. WS proceeds as if this team will win 5 fewer games than it probably will. Bonds’ LW estimate may be off by 1 win, but that is for him only. It does not distort the rest of the players (they cause their own smaller distortions themselves, but the error does not compound). Win Shares takes the linear distortion and thrusts it onto the whole team.
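Here is a sketch of the two estimators side by side for the Mets example. The linear formula is the one written out above; the Pythagorean exponent is left as a parameter because the passage does not say which exponent or variant produced .304 (with a plain exponent of 2 this sketch comes out a little lower, around .298). Function names are illustrative only.

```python
def linear_wpct(runs_scored, runs_allowed, league_avg_runs):
    """Linear (marginal-runs style) estimate, as written in the passage."""
    return (runs_scored - runs_allowed + league_avg_runs) / (2 * league_avg_runs)

def pythagorean_wpct(runs_scored, runs_allowed, exponent=2.0):
    """Pythagorean estimate; the exponent is an assumption of this sketch."""
    rs = runs_scored ** exponent
    return rs / (rs + runs_allowed ** exponent)

# 1962 Mets figures as given in the passage.
linear = linear_wpct(617, 948, 726)
pythag = pythagorean_wpct(617, 948)

print(round(linear, 3), round(pythag, 3))
```

Whatever exponent is used, the linear estimate sits below the Pythagorean one for a team this far from .500, which is the distortion being described.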
Finally, the defensive margin of 152% corresponds to a W% of about .300, compared to .213 for the offense. The only possible cutoffs which would produce equal percentages are .618/1.618 (the Fibonacci number). That is not to say that they are right, because Bill is trying to make margins that work out in a linear system, but we like to think of 2 runs and 5 allowed as being equal to the complement of 5 runs and 2 allowed. In Win Shares, this is not the case. And it could be another reason why pitchers seem to rate too low with respect to batters (and our expectations).
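The .618/1.618 claim can be checked numerically. The sketch below assumes an exponent of 2 and assumes the two margins are placed symmetrically around 100% (offense at m, defense at 1 + m), as with 52% and 152%. The two zero-win lines then imply the same W% only when 1 + m = 1/m, which gives m = (sqrt(5) - 1) / 2, about .618. Names are made up for the example.

```python
import math

def offense_line_wpct(m):
    """W% of a team scoring m times league average with average defense."""
    return m ** 2 / (m ** 2 + 1)

def defense_line_wpct(d):
    """W% of a team allowing d times league average with average offense."""
    return 1 / (1 + d ** 2)

# James's margins: 52% on offense, 152% on defense.
james_offense = offense_line_wpct(0.52)
james_defense = defense_line_wpct(1.52)

# Symmetric margins that give equal percentages.
m = (math.sqrt(5) - 1) / 2
balanced_offense = offense_line_wpct(m)
balanced_defense = defense_line_wpct(1 + m)

print(round(james_offense, 3), round(james_defense, 3),
      round(balanced_offense, 3), round(balanced_defense, 3))
```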
Couldn't you argue that a team that scores 50% of the league average number of runs but with league average defense and pitching would win 20% of its games based on their defense and pitching? Isn't that where their WS would go?
What would the winning percentage be if a team scored 52% of all runs (isn't it 48%?) and allowed 152% of league average? I know it would be greater than zero, but then again, are there numbers that can actually get a team to 0 wins in a Pythagorean system?
And how does one calculate loss shares? Couldn't it be argued that a loss is just the absence of a win? You lose if you aren't doing the things (scoring runs, preventing runs) that it takes to win.
pkw (Indy, IN): With a new book out this spring, I presume we'll have to wait until at least Fall '06 for the WARP Encyclopedia to come to bookshelves. Barring that, would it be possible for a series of "basics" articles showing how WARP is calculated, the whys and whynots all explained, Win Shares-style? Why FRAR is used instead of FRAA is one question I'd like to see discussed. Thanks for all the great work.
Clay Davenport: If you used FRAA, then an average SS and an average 1B would have an equal rating, zero. You would need to introduce a positional adjustment, which most people calculate by using the average batting performance at a position.
I really, really don't like the idea of using batting performance to measure a fielding performance. However, assuming reasonably intelligent management, the difference in offensive level between positions should be roughly equal to the defensive difference between positions. If it wasn't - if everybody overstated the fielding value of a shortstop, for instance - the a team who used a better-hitting, poor-fielding SS would gain an advantage. Assuming the advanatge led to wins, everybody would copy them (because even an assumption of reasonable intelligence leaves us at the monkey-see monkey-do level) and the difference in fielding would decline. Anyway, FRAR essentially mimics using FRAA + fielding adjustment, but only uses fielding stats to do it.
I think it is reasonable to treat each position on the field as being roughly equal in importance, and FRAR is the vehicle I use to make it so.
I am not sure we really looked at this topic from the above angle. Though th epossibility of a WARP book is pretty exciting.
I asked him about how really low replacement levels for defense and this was his response...
jschmeagol (new york, ny): Hey Clay, I have a WARP question. How exactly do you find replacement level for defense. The reason that I ask is that it seems to be really really low. For instance, over the course of Max Carey's career the difference between FRAA and FRAR is larger than the difference between BRAA and BRAR. This doesn't seem possible, but I would think that you have a godo reason for it. Can you elaborate? Thanks
Clay Davenport: Replacement level for defense primarily depends on how many balls get hit to a given position, and what happens to them when they get there. Generally speaking, more balls in play = more FRAR for all positions, which is a lot of what's going on for Max Carey and other deadball era players. There weren't many homers, there weren't many walks or striekouts, there were lots of errors, although not as many as a generation earlier. All of those tilt the share of total runs from the pitchers to the fielders, and it enhances the FRAR.
The reason it so low ties in with this question -
The last line leads into the first question.
I also want to point out that while Davenport hasn't really put his system up for scrutiny in the way that Bill James was with WS, he does seem very open to answering emails and always answers a few WARP questions in his chats.
Also, I found this exchange in a Clay Davenport chat...
"pkw (Indy, IN): With a new book out this spring, I presume we'll have to wait until at least Fall '06 for the WARP Encyclopedia to come to bookshelves. Barring that, would it be possible for a series of "basics" articles showing how WARP is calculated, the whys and whynots all explained, Win Shares-style? Why FRAR is used instead of FRAA is one question I like to see discussed. Thanks for all the great work.
Clay Davenport: If you used FRAA, then an average SS and an average 1B would have an equal rating, zero. You would need to introduce a positional adjustment, which most people calculate by using the average batting performance at a position.
I really, really don't like the idea of using batting performance to measure a fielding performance. However, assuming reasonably intelligent management, the difference in offensive level between positions should be roughly equal to the defensive difference between positions. If it wasn't - if everybody overstated the fielding value of a shortstop, for instance - then a team who used a better-hitting, poor-fielding SS would gain an advantage. Assuming the advantage led to wins, everybody would copy them (because even an assumption of reasonable intelligence leaves us at the monkey-see monkey-do level) and the difference in fielding would decline. Anyway, FRAR essentially mimics using FRAA + fielding adjustment, but only uses fielding stats to do it.
I think it is reasonable to treat each position on the field as being roughly equal in importance, and FRAR is the vehicle I use to make it so. "
I am not sure we really looked at this topic from the above angle. Though the possibility of a WARP book is pretty exciting.
I asked him about the really low replacement levels for defense, and this was his response...
"jschmeagol (new york, ny): Hey Clay, I have a WARP question. How exactly do you find replacement level for defense? The reason that I ask is that it seems to be really really low. For instance, over the course of Max Carey's career the difference between FRAA and FRAR is larger than the difference between BRAA and BRAR. This doesn't seem possible, but I would think that you have a good reason for it. Can you elaborate? Thanks
Clay Davenport: Replacement level for defense primarily depends on how many balls get hit to a given position, and what happens to them when they get there. Generally speaking, more balls in play = more FRAR for all positions, which is a lot of what's going on for Max Carey and other deadball era players. There weren't many homers, there weren't many walks or strikeouts, there were lots of errors, although not as many as a generation earlier. All of those tilt the share of total runs from the pitchers to the fielders, and it enhances the FRAR.
The reason it is so low ties in with this question - "
The last line leads into the first question.
I also want to point out that while Davenport hasn't really put his system up for scrutiny in the way that Bill James was with WS, he does seem very open to answering emails and always answers a few WARP questions in his chats.
Hopefully that works. I don't know how to use the bold or italics or anything like that. The parts in quotations are from the chat, the rest are my comments.
James didn't claim his system was perfect. Quoting from p. 2 of his book, he says "If one player in this system is credited with 20 Win Shares and another with 18, we can state with a fair degree of confidence that the one player has contributed more to his team than the other...not that we are always right; there will always be anomalies and there will always be limitations to the data, but I would be confident that we had it right a high percentage of the time."
I think WS generally meets that standard. It has its faults. I certainly wouldn't recommend using it alone or without checking the data or questioning its results if they seem anomalous, but in general I think it does a very good job of bringing together lots of information on batting (including pieces that are missing from OPS, such as double plays), fielding (much more sophisticated than anything available 10 or 15 years ago), and pitching and boiling them down to a meaningful integer.
Bingo, I agree with that 100%.
Also, assuming a normal run environment of 3.5 to 5 runs a game, for example . . . A player that creates 1 run does not necessarily contribute to winning, from the 'net' perspective. If that player also created 27 outs, he contributed much more to the losses than the wins, to the point where his net win contribution is less than zero.
That team would win 11% of its games, using PythagenPat and a run environment of 5 R/G. The team would score 2.6 R/G and allow 7.6. Real teams rarely vary by more than 75-125% of the league. The worst of both worlds would be a team that scored 562 runs and allowed 938 in a 750 run environment. That team would play .283 ball, or win 46 games.
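Here is a rough sketch of the 52%/152% case, for anyone who wants to try other cutoffs. It assumes the PythagenPat exponent is (runs per game, both teams) ** 0.287; the 0.287 constant and the function name are my assumptions, and a different constant will move the answer a little.

```python
def wpct_from_league_pct(offense_pct, defense_pct, league_rpg, k=0.287):
    """W% for a team scoring offense_pct and allowing defense_pct of league R/G.

    Uses a PythagenPat-style exponent; k = 0.287 is an assumed constant.
    """
    scored = offense_pct * league_rpg
    allowed = defense_pct * league_rpg
    exponent = (scored + allowed) ** k
    return scored ** exponent / (scored ** exponent + allowed ** exponent)


# Team at both Win Shares zero points in a 5 R/G league: 2.6 scored, 7.6 allowed.
extreme_team = wpct_from_league_pct(0.52, 1.52, 5.0)
print(f"{extreme_team:.3f}")
```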
I think the key point that James makes is that extreme teams do little to help us understand the other 99.99% of teams. The 1898 Spiders, the 1915 Athletics, and the 1962 Mets are such utter anomalies that they tell us almost nothing.
Do I want a system that gets it all right? Sure. But my preference is for a system that hits 98% of the time and accepts the distortions at the left end of the curve rather than having no system at all.
One thing I would point out, however, is that NA Win Shares don't work for me because the run environment is so weird. When I figured them for a few of the NA Boston teams (replicating Chris Cobb's work), I realized the run environment is so extreme that it leaves several regulars looking like non-contributors.
I use win shares, but the problems of extreme teams shouldn't be minimized. The distortions may be worse at the left end of the curve due to the impact of the zero point, but there are also distortions on the right end of the curve as well.
As I understand the argument, by giving zero offensive WS (rather than a negative number) to players who produce below the marginal run cutoff you are artificially reducing the amount of offensive WS available for all the players who produced positive marginal runs. (If I understand their adjustments correctly, the Hardball Times folks adjust for this in their present-day WS calculations.)
One question that we should probably look at empirically is whether this distortion is the main reason why hitters on poor teams seem to be earning less WS than hitters with similar performance levels on good teams (e.g., Medwick outperforming Sisler in their best seasons, Medwick outperforming Johnson over their careers). It stands to reason that there is a correlation between teams that are bad and teams that give AB to sub-marginal hitters.
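A toy illustration of the zeroing-out effect, under heavy assumptions: the marginal-run figures and the team total below are invented, and the split is a bare proportional allocation rather than the full Win Shares procedure.

```python
def split_offensive_ws(marginal_runs, team_offensive_ws, clamp_negatives=True):
    """Divide a fixed team total in proportion to each player's marginal runs.

    With clamp_negatives=True, sub-marginal hitters are set to zero first,
    which is the behavior being criticized above.
    """
    claims = [max(m, 0.0) if clamp_negatives else m for m in marginal_runs]
    total = sum(claims)
    return [team_offensive_ws * c / total for c in claims]


marginal_runs = [50.0, 30.0, 20.0, -10.0]  # hypothetical four-man lineup
team_ws = 90.0                             # hypothetical offensive WS to hand out

clamped = split_offensive_ws(marginal_runs, team_ws, clamp_negatives=True)
unclamped = split_offensive_ws(marginal_runs, team_ws, clamp_negatives=False)

for runs, with_zero, with_negative in zip(marginal_runs, clamped, unclamped):
    print(f"{runs:6.1f} marginal runs -> {with_zero:5.1f} WS clamped, {with_negative:5.1f} WS unclamped")
```

In this made-up case the best hitter gets 45 WS with the zero floor and 50 without it; the sub-marginal hitter's shortfall has been spread across his teammates.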
I use win shares, but the problems of extreme teams shouldn't be minimized. The distortions may be worse at the left end of the curve due to the impact of the zero point, but there are also distortions on the right end of the curve as well.
I think the "problem" is that Win Shares is not just slightly inaccurate due to extreme teams - there are a whole series of "small" inaccuracies that, added together, make the system as a whole inferior to many other ways to measure "value."
James didn't claim his system was perfect. Quoting from p. 2 of his book, he says "If one player in this system is credited with 20 Win Shares and another with 18, we can state with a fair degree of confidence that the one player has contributed more to his team than the other...not that we are always right; there will always be anomalies and there will always be limitations to the data, but I would be confident that we had it right a high percentage of the time."
I don't see this as much of a defense. I mean, I could use RBIs to measure hitters and wins to measure pitchers, and say 'my system's not perfect but my system would be right a high percentage of the time.' I would not recommend using such a system...
I thought it was bad when pitchers were zero-ed out as well. Is the team's offense/defense split done before these corrections are done? Does that mean that a horribly bad-hitting pitcher will take bWS away from his non-pitching teammates and give them to his better-hitting pitching teammates as pWS?
Plus it complicates the DH-league/non-DH league situation. I'm not exactly sure who is favored there. On one hand, NL batters are lowered due to this zero-ing out issue; on the other hand they're raised up because it's effectively eight line-up slots competing for bWS instead of nine.
How many pitchers are below the zero win level offensively?
Can you really be a negative amount of wins? I guess you can but at the same time you either win the game or you don't, you can't lose games already won.
Doesn't James say that AL hitters are screwed because of the DH and that he can't really do anything about it?
If that last one is correct, and it most likely is, then how should we adjust WS so as to not penalize AL players? Should we even do this? I mean, offense is less valuable in the AL because there are nine hitters as opposed to 8.
Most of them.
Using the Hardball Times's modified calculations...
They list 126 pitchers in the 2005 NL.
11 are above 0.0
11 are equal to 0.0
102 are below 0.0
No one wins games by themselves. Win Shares is just divvying out a team's wins among its players. A negative number implies you are cancelling out someone else's positive contribution.
If you look at all NL CFs in 1917, you'll find that Edd Roush was the worst, with a fielding rate of 87. He was still well above replacement. His main backup had a rate of 91. Most players with very few games at a position get a rate of 100. But I'm wondering, if Roush is well above replacement, and he's the worst CF, who exactly is this replacement level player who would fill in for him?
"Most players with very few games at a position get a rate of 100."
Precisely 100? What about players with few games but not "very few"? Is it possible that the measure involves derivation from per-game or per-inning defensive data plus "regression" (the wrong term here) toward 100 based on playing time?
I don't believe it can be the main reason for the phenomenon (granted for the sake of argument). The correlation between team quality and pitcher batting quality must be low. If statistically significant, I guess it is baseballistically insignificant. But it's worth checking, if anyone has data in the right format.
I thought it was great that 2 of the first 3 questions in Clay's chat were from the HOM group.
To me, his answer to my question indicates how much WARP and WS have in common at the big picture level. Hitters are hitters when they're at bat, not shortstops or right fielders. The added value for being a better defensive player should stay on the defensive side of the ledger.
I would be kinda curious how closely BRARP+FRAA corresponds to BRAR+FRAR, but those who advocate for FRAA should have the same problems with WS that they do with WARP. BP's replacement level might be too low, but average is not the answer.
For a while after Bill James presented Win Shares (July 2001?), I thought of it and described it as fundamentally different from other measures in that the players on each team are assigned positive scores that sum to the number of team wins. The name "win shares" aptly summarizes those two features and the first, positive scores, fits the chief prior criticism of the Total Baseball Ratings, focusing on its zero sum (for all players each league season).
When I read the book, I was surprised (shouldn't have been) to learn that the second feature, the sum to number of wins (for all players on each team season), is superficial: a late step in calculation and easy to undo, except for rounding to integers.
Win Shares Part 7 - Conclusion
It is obvious from the results that this is a major problem with Win Shares.
I've brought this up before (see the SS thread). There is a discussion (around post 86 and thereabouts) of Bobby Wallace and the relative distribution of "All-Stars" (both WARP and WS) during his career.
If Win Shares was fair to each of the 8 everyday positions, there would be an approximately equal amount of value created at each position (pitchers excepted), if one totaled it over a long enough period of time so that local talent gluts even out. The OF/IF imbalance during the Deadball era is particularly dramatic, but it appears to be present in other eras also. I have seen no evidence of a corresponding era where IFers (2B,3B,SS) receive more of the total value than the OFers.
Therefore it appears that a strong bias towards OFers over IFers is built into the Win Shares system. There are two places where this could occur. Either the fielding intrinsic weights are wrong -- too many fielding WS go to OFers, not enough to IFers. Or the offensive/defensive balance is wrong -- too many Win Shares go to offense, not enough to defense, resulting in too many Win Shares being awarded to the hitting end of the defensive spectrum. Either situation will result in too many OF "All-Stars".
It's good practice for the future, but for now I give a heads up to everyone that the entire backlog needs their seasons adjusted to 162.
. . .
The biggest obstacle is to remember to do it at all. I mean, it's just so quick and easy to look at the career total for career and line up seasonal WS totals to look at peak. It's those seasonal line-ups I'm most worried about. A single WS difference between each season on those lines often makes a player look quite a bit better.
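The adjustment itself is one line; the hard part, as noted, is remembering to do it. A minimal sketch with invented seasons (the WS totals and schedule lengths below are hypothetical, and straight proportional scaling is the assumed method):

```python
def schedule_adjust(win_shares, team_games, target_games=162):
    """Scale a seasonal counting stat to a common schedule length."""
    return win_shares * target_games / team_games


# (raw Win Shares, team games that season) -- hypothetical peak seasons
seasons = [(30, 154), (28, 140), (25, 132)]

for raw_ws, games in seasons:
    adjusted = schedule_adjust(raw_ws, games)
    print(f"{raw_ws} WS in {games} games -> {adjusted:.1f} per 162")
```

Note that the 28 WS season in a 140-game schedule comes out ahead of the 30 WS season in a 154-game schedule once both are scaled, which is exactly the kind of peak reordering that gets missed when eyeballing raw totals.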
Not only Win Shares but all counting rather than rate statistics. Not only season stats but career stats.
Career Win Shares per 162 games is in print, organized rather conveniently for the HOM project, in one or two books by Bill James. But that is unreliable. Hard as it is to believe in this day, HOMeboys have discovered numerous arithmetic errors.
I hope that that is not worth saying. Comparison of rates per 162 is technically pertinent regardless of the length of the season, and I guess everyone here knows that. But I fear that the virtual transition to a 162-game era makes those published rates a more attractive nuisance.
Cyril Morong's Sabermetric Research
A note on error-percentage. The error rating in Win Shares is ratio-based. Commit twice as many errors as the average and you get no credit; no errors and you get full credit. It's grading on a curve, with little or no relationship to the number of runs prevented or allowed.
To me, this makes as much sense as rating HRs on a ratio. Then Tommy Leach's 6 HRs in 1902 are more impressive than Bonds' 73 HRs because the avg position player in 1902 hit about 1 HR, while the avg position player in modern times hits about 20. Ty Cobb might think this is a good way to rate HRs but I think most of us would reject any offensive rating system like that out-of-hand.
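To make the objection concrete, here is a sketch of a ratio-based scale as described above (full credit at zero errors, none at twice the league rate). It is my reading of the description, not the actual Win Shares formula, and the error rates and chance totals are invented.

```python
def ratio_error_credit(player_error_rate, league_error_rate):
    """Linear credit: 1.0 with no errors, 0.0 at twice the league rate or worse."""
    return max(0.0, 1.0 - player_error_rate / (2.0 * league_error_rate))


CHANCES = 500  # hypothetical chances per season

# Two fielders, each committing errors at half his league's rate.
early_credit = ratio_error_credit(0.05, 0.10)   # high-error era (hypothetical rates)
modern_credit = ratio_error_credit(0.01, 0.02)  # low-error era (hypothetical rates)

early_errors_saved = (0.10 - 0.05) * CHANCES
modern_errors_saved = (0.02 - 0.01) * CHANCES

print(f"credit: {early_credit:.2f} vs {modern_credit:.2f}")
print(f"errors saved vs average: {early_errors_saved:.0f} vs {modern_errors_saved:.0f}")

# The home run analogy: ratio to the average position player.
leach_ratio = 6 / 1     # 1902
bonds_ratio = 73 / 20   # modern
print(f"HR ratio: Leach {leach_ratio:.2f}, Bonds {bonds_ratio:.2f}")
```

Both fielders get the same credit on the ratio scale even though one saved five times as many errors in absolute terms.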
IMO, the fielding ratings in Win Shares appear to attempt to translate the stats into a modern setting and evaluate them on a modern scale. They do NOT appear to be evaluating them in the context of the game that was actually being played then.
Jaws_WARP3_Career_Peak_etc_Method
>I used the third version of WARP, which is WARP3. This version is adjusted for difficulty and for playing time, so it levels the playing field for different eras.
The WARP3 translation is quite explicitly meant NOT to level the playing field, but is meant to make sure that modern players come out ahead as "we" intuitively think they should.
Or I'm missing something.
Systems like WARP and OWP and linear weights assume standard rules: 1 out into a hit = .74 runs = .074 wins.
For Win Shares, this works for a .500 club. For a team that wins 70% of the time, no matter how good you are, you can't generate many extra wins, and so when they are apportioned out, the great players get shafted, by the principle of diminishing returns.
For a losing club, it's tough to raise the bar (too many more runs needed to make another win), so again these players are underrated.
So, theoretically, I conclude it is NOT true that players on great teams are overrated by WS. Our perception might be such BECAUSE most single-season teams that won many games did so in part by being 'lucky' (winning close games), and in the WS system, these wins are credited to the players.
Zat make sense?
The argument is that players on great teams get a benefit in WS because they don't have to play teams as good as themselves. I think this goes a little far. I think that playing LF on a team that has great pitching and defense may give you a lift because you don't have to face your pitching and defense. I don't think it matters to you if you are a LFer on a great team and the opposite team has Jimmie Foxx or George Burns (the 1B) hitting cleanup because you don't play against the other team's lineup, you play against their pitching and defense. The reverse would be true for pitchers.
There will be more win shares to go around to everyone. And any team that is great enough for their win shares to be inflated by their not having to play themselves is going to be well above average on both offense and defense.
The principle of diminishing returns kicks in at a higher level of team performance than the excess win shares for not having to play yourself. That starts to show up in an 8-team league, IIRC, at about .630. The effect isn't large at that level, but it is noticeable.
Conversely, how about someone like any of the Yankees' recent first basemen? How much were Tino's stats recently helped by hitting with guys like Sheff, Bernie (until recently), Jeter, etc.?
However, I'm not sure it should. Most studies on the value of 'protection' concluded that there is little noticeable effect on the hitter's quality. Yes, it may affect his stats in that he gets more walks and fewer HR without any good bat behind him, but overall his contributions remain the same (see Babe Ruth, before and after Lou Gehrig arrives).
As to WS and good teams/bad teams, yes, WS does not account for not having to face your own teammates. But this is certainly not a unique issue to WS; many metrics 'suffer' from the same issue.
Guys like Bob Johnson are hurt by using Win Shares because they have TWO effects against them: diminishing returns (bad team, needs more runs to make a win) and poor teammates (doesn't get to face his own pitching).
When does the diminishing returns kick in on the low end? My back-of-the-envelope guess from taking the derivative of the OWP formula shows a *maximum* at around 0.250, which is pretty darn low. That implies that a great player can help a bad team easier than he can help a good team. Only when the team becomes quite brutal, does the effect of one player start returning to zero (sharply so, I'll admit). Anyone look at this effect with the real Win Shares calculation? Is my guess way too low?
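The back-of-the-envelope guess can be checked numerically. This sketch assumes the "OWP formula" is the plain Pythagorean form with exponent 2 and that runs allowed are held fixed while runs scored vary; it says nothing about the real Win Shares calculation.

```python
def wpct(run_ratio):
    """Pythagorean W% (exponent 2) for a team scoring run_ratio times what it allows."""
    return run_ratio ** 2 / (run_ratio ** 2 + 1.0)


def marginal_gain(run_ratio):
    """Derivative of wpct with respect to run_ratio: 2r / (r^2 + 1)^2."""
    return 2.0 * run_ratio / (run_ratio ** 2 + 1.0) ** 2


# Scan run ratios from 0.001 to 3.000 for the steepest point on the curve.
ratios = [i / 1000 for i in range(1, 3001)]
best_ratio = max(ratios, key=marginal_gain)
best_wpct = wpct(best_ratio)

print(f"steepest at run ratio {best_ratio:.3f}, W% {best_wpct:.3f}")
print(f"gain per unit of run ratio at .250: {marginal_gain(best_ratio):.3f}, at .500: {marginal_gain(1.0):.3f}")
```

Setting the second derivative to zero gives r^2 = 1/3, so the peak sits at a W% of exactly .250 under these assumptions, which matches the guess.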
and poor teammates (doesn't get to face his own pitching).
This effect is mentioned all the time, but it only exists indirectly in the Win Shares calculation. For the most part with Win Shares, you are competing against your own teammates for value.
The first place where other teams come into play is in the total Win Shares available -- the team's overall record. There has been some talk of needing to add 11 wins and 11 losses to each team and rescaling back to 154 to get some sort of adjusted W/L record (using a 154 G season example here). I don't know if that's a valid tweak or not, but I've seen it mentioned here several times.
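As I read it, the tweak works like this (a sketch only; the function name is made up and I am not vouching for the method). In an 8-team, 154-game league each opponent is played 22 times, so a team is given 22 imaginary games against itself, split 11-11, and the resulting percentage is scaled back to 154.

```python
def self_play_adjusted_wins(wins, losses, season_games=154, teams=8):
    """Add a .500 block of games 'against yourself', then rescale to the season."""
    games_per_opponent = season_games / (teams - 1)   # 22 in an 8-team league
    added_wins = games_per_opponent / 2               # 11 wins (and 11 losses)
    adjusted_pct = (wins + added_wins) / (wins + losses + games_per_opponent)
    return adjusted_pct * season_games


great_team = self_play_adjusted_wins(100, 54)
bad_team = self_play_adjusted_wins(54, 100)
print(f"100-54 -> {great_team:.1f} adjusted wins")
print(f"54-100 -> {bad_team:.1f} adjusted wins")
```

The adjustment pulls every team toward .500, so great teams lose a few wins to divide up and bad teams gain a few.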
The second place where other teams come into play is the Park Factor. Win Shares uses the straight PF, not the BPF or PPF that was designed to account for not facing your own teammates. This affects the way that Win Shares splits between offense & defense and any discrepancy would only come into play when the team in question is very unbalanced (e.g. bad-pitching/decent-hitting) and then it will have the effect of dampening some of that unbalance. A decent offense with a crummy staff will already get a larger segment of the Win Shares, but by not having to face that crummy staff their context is going a bit off and they'll get a slightly smaller cut of WS (but still certainly larger). If a team has offense/defense that is equally bad (or good), then the effect of Park Factors on this split goes away.
After that, I believe it's all competition within your own team for value.
As to Win Shares using the 'unscaled' PF, I'd have to think more about that one. While you want to calculate the effects of not facing your own teammates if you wish to translate stats into normalized stats, I'm not positive that by using team Wins this doesn't already do some of that - you have a great team, you get to not face your mates, but then it becomes harder to generate an extra win since you are already at a WPCT where it takes more runs to get a marginal win.... need a study that I shan't take time to do right now.
Background:
WARP generates FRAR by taking FRAA and adding an amount of FRAR per game that is fixed for each position. This amount changes over time to reflect 1) shifts in defensive responsibility between pitchers and fielders and 2) shifts in defensive responsibility between positions.
For example, in 1895, an average defensive player at each position (FRAA = 0) would receive something very near to the following FRAR for a full season (132 games):
P - 9 (obviously no pitcher would play 132 games, but this shows the fielding importance of the position relative to other positions)
C - 44
1B - 16
2B - 43
3B - 32
SS - 47
LF - 30
CF - 31
RF - 14
In 1965 the FRAR for an average defensive player at each position for a 162 game season would be very near to these amounts
P - 8
C - 29
1B - 13
2B - 33
3B - 23
SS - 34
LF - 17
CF - 25
RF - 15
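Since the two seasons have different lengths (132 vs. 162 games), it may help to rank the positions and convert the figures above to shares of the non-pitcher total. A small sketch using the approximate FRAR values just listed; the helper names are mine.

```python
FRAR_1895 = {"P": 9, "C": 44, "1B": 16, "2B": 43, "3B": 32, "SS": 47, "LF": 30, "CF": 31, "RF": 14}
FRAR_1965 = {"P": 8, "C": 29, "1B": 13, "2B": 33, "3B": 23, "SS": 34, "LF": 17, "CF": 25, "RF": 15}


def spectrum(frar):
    """Non-pitcher positions ordered from most to least FRAR."""
    fielders = {pos: runs for pos, runs in frar.items() if pos != "P"}
    return sorted(fielders, key=fielders.get, reverse=True)


def shares(frar):
    """Each non-pitcher position's percentage of the non-pitcher FRAR total."""
    fielders = {pos: runs for pos, runs in frar.items() if pos != "P"}
    total = sum(fielders.values())
    return {pos: 100.0 * runs / total for pos, runs in fielders.items()}


for label, frar in (("1895", FRAR_1895), ("1965", FRAR_1965)):
    order = spectrum(frar)
    pct = shares(frar)
    print(label, " ".join(f"{pos} {pct[pos]:.0f}%" for pos in order))
```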
This readiness to shift fielding value around is one of WARP's potential points of superiority to win shares, which sticks to a constant set of "inherent weights" to distribute fielding value among the positions. The one change James acknowledges in the defensive spectrum involves the 2B and 3B, which he sees as switching places on the defensive spectrum. The "inherent weights" in the fielding win share system are
C - 19%
1B - 6%
2B - 16%
3B - 12%
SS - 18%
OF - 29% (James treats OF as one position and then uses the distribution of win shares among individual players to sort out the relative value of each outfield position, but we can estimate that CF will typically land between 2B and 3B and that LF and RF will fall between 3B and 1B).
Before 1920, the weights for 2B and 3B are reversed.
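For anyone applying these weights by hand, the one era switch can be written down as a sketch. The function name is hypothetical, and treating 1920 itself as the first "modern" season is my reading of "before 1920".

```python
def inherent_weights(year):
    """Fielding Win Shares inherent weights (percent), with the 2B/3B swap before 1920."""
    weights = {"C": 19, "1B": 6, "2B": 16, "3B": 12, "SS": 18, "OF": 29}
    if year < 1920:
        weights["2B"], weights["3B"] = weights["3B"], weights["2B"]
    return weights


for year in (1905, 1965):
    w = inherent_weights(year)
    print(year, w, "total:", sum(w.values()))
```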
WARP shows us a shifting defensive spectrum over the history of the game, where win shares does not.
However, jimd's study of average OPS+ by position also suggests a shifting defensive spectrum over the history of the game, one in which the shifts are rather different from the ones WARP presents. I'll reproduce his famous table once again:
Decade   1B   LF   RF   CF   3B   2B   Ca   SS  Pit
1870's   +1   +4   -1   +4   +2   +2   +0   +1  -13
1880's  +13   +6   +1   +5   +1   -1   -7   -2  -17
1890's   +6   +9   +7   +7   +0   -2   -6   -2  -22
1900's   +6  +10   +9   +8   +0   +2   -9   -1  -29
1910's   +6   +7   +9  +10   +1   +1   -7   -4  -31
1920's   +9  +10  +10   +8   -3   +1   -4   -7  -32
1930's  +13   +8  +10   +5   -1   -3   -3   -4  -36
1940's   +8  +11   +9   +7   +2   -3   -4   -4  -37
1950's   +9  +10   +7   +7   +4   -3   -1   -5  -40
1960's  +11   +9  +11   +7   +4   -5   -3   -6  -46
1970's  +10   +8   +8   +5   +3   -5   -2  -11  -45
1980's   +8   +6   +6   +2   +3   -4   -4   -8  -48
1990's   +9   +4   +6   +1   +1   -3   -4   -7  -50
Mean     +9   +8   +7   +6   +1   -2   -4   -5  -36
The premise here is that the defensive importance of a position is suggested by the amount of offense the management is willing to give up at a position in order to play a competent defender there.
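Under that premise, a decade's defensive spectrum is just the positions sorted from lowest to highest offensive level. A sketch using the 1890s and 1960s rows of the table (pitchers left out, "Ca" written as "C"); the function name is mine.

```python
OPS_PLUS_OFFSETS = {
    "1890s": {"1B": 6, "LF": 9, "RF": 7, "CF": 7, "3B": 0, "2B": -2, "C": -6, "SS": -2},
    "1960s": {"1B": 11, "LF": 9, "RF": 11, "CF": 7, "3B": 4, "2B": -5, "C": -3, "SS": -6},
}


def defensive_spectrum(offsets):
    """Positions from most to least defensive; positions with equal offsets are grouped."""
    by_offset = {}
    for position, offset in offsets.items():
        by_offset.setdefault(offset, []).append(position)
    return [sorted(by_offset[offset]) for offset in sorted(by_offset)]


for decade, offsets in OPS_PLUS_OFFSETS.items():
    groups = defensive_spectrum(offsets)
    print(decade, " ".join("/".join(group) for group in groups))
```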
Avoiding, for the moment, any question of the overall weight given to fielding value in any system, let me line up the defensive spectrum for the 1890s and the 1960s as represented by WARP, WS, and the OPS+ study. (Pitchers will be left out.)
1890s
W1 -- SS C 2B 3B CF LF 1B RF (Top three spots are 3+ times more valuable than bottom spot, 3B is twice as valuable as 1B)
WS -- C SS 3B CF 2B LF/RF 1B (Top 3 spots are 2 2/3+ times more valuable than bottom spot, 2B is twice as valuable as 1B)
OPS+ -- C SS/2B 3B 1B CF/RF LF (Top 3 spots have OPS+ below avg., 3B is 0, 1B-LF +6 to +10)
1960s
W1 -- SS 2B C CF 3B LF RF 1B (Top three spots are 2.2+ times more valuable than bottom spot, 3B & CF are almost twice as valuable as 1B)
WS -- C SS 2B CF 3B LF/RF 1B (Top 3 spots are 2 2/3+ times more valuable than bottom spot, 3B is twice as valuable as 1B)
OPS+ -- SS 2B C 3B CF LF RF/1B (Top 3 spots have OPS+ below avg., 3B is +4, RF/1B +11)
Parallel representations of the proposed defensive spectrums for other decades would show different discrepancies. One I am especially concerned about right now is 2B/3B pre 1930. OPS+ has these two positions always close in average offense, and they shift back and forth as to which is slightly higher or lower decade to decade, but WARP _always_ gives second base more defensive value. The treatment of pre-1930 first base is also a fraught issue, as is the relative importance of infield vs. outfield positions.
The big questions:
Where there are disagreements in these lists, which assessment should one accept, and why?
If one wants to use WARP or win shares, but trusts the OPS+ assessment more, how might one adjust the results of these systems?
If WARP's calculation of FRAA is run-based, is their estimate of FRAR also run-based, or is it offense-based (like the OPS+ study), or theoretical (like win shares)? If it is offense-based, what measure does it use and how does it get from offensive value to defensive value? If it is theoretical, what is the theory?
For that reason, I don't use WARP3, although I do look at the competition-strength adjustments in WARP2. For that reason, my ranking system always compares a player first to his contemporaries and only second to all other eligible players. What I care about in a comprehensive metric, therefore, is the extent to which it gives an accurate representation of value in context.
On the subject of the defensive spectrum, I have available to me three different views of the value of the defensive positions in context--WARP1, win shares, and OPS+ by position, and I am looking for reasons to accept or modify these views and their results.
I have been using win shares, with modifications built in to adjust the pitching/fielding division of defensive value, and I think that system has worked pretty well. But as we are having to make finer distinctions in the backlog, I am concerned, as in the case of Sisler/Beckley and Elliott/Boyer, that errors in the treatment, not of fielding as a whole but of the value of particular positions at particular times in the history of the game, may lead to the overrating or underrating of particular players. Win shares argues that the defensive spectrum has changed very little over time, but the OPS+ study argues otherwise. I know that WARP1 is designed to be more flexible on this point than win shares, but some study of WARP1 suggests that its representation of the defensive spectrum, although it is changing, does not agree with the findings of the OPS+ study.
I don't treat the OPS+ study as gospel, as its findings are influenced by the (variable) level of talent available at a position during a given decade, but still, overall, it tells a different story about the defensive spectrum than WARP does. I trust the OPS+ story as having significant validity, however, because I know how it is grounded in actual data. I'd like to know whether the WARP story is as well. If it is really grounded in the data, I could accept WARP1's results and weigh them equally with win shares'. Or, I could modify WARP's fielding assessments to fit more closely with OPS+, but that would be a lot of work, so I don't want to start in on that project unless I have a clear sense that it is warranted. Or I could just stick with win shares exclusively, and try to find ways to fine-tune its handling of the defensive spectrum or just make subjective adjustments where I think they are needed. I see how I could adjust the fielding spectrum pretty neatly in WARP1, but for win shares I don't.
Any thoughts?
1. I start with batting win shares, and move on from there to EQA, WARP1, OPS+ to get a feel for the player as a hitter.
2. I factor in intangibles (missed wartime, racism, minor league credit) and lump fielding and MLEs in with this. For fielding, I look at both WARP and WS, but do not rely on them, as they often are very different.
3. I compare the player to his contemporaries.
4. I compare the player to the remaining eligibles at his position (or positions, for guys like Tommy Leach).
5. I rank the highest ranked players at the positions against each other, paying particular attention to the question (If the HOM ended without this guy in, how would I feel about that?)
My system started much more sabermetric-based and has now returned to much more subjective analysis.