## Monday, November 30, 2020

#### Bill James: The Biggest Problem With WAR

The player’s WAR can vary by 150%, based on 3 errors of 3% each.  THAT is the basic problem with WAR.  I mean, we argue about all kinds of things.  We argue about whether clutch data should be included in runs created estimates; we argue about whether the fielding estimates are consistent with external evidence, etc.  But the REAL problem is that:

1)    Estimates are never exactly right; they are always just estimates, and

2)    WAR uses an analytical system to process those estimates which has the potential to enormously magnify whatever inaccuracies are included.
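
A rough sketch of the magnification being described, using made-up numbers rather than anything taken from an actual WAR calculation: when value is the difference between two large estimates, a small percentage error in each input becomes a large percentage error in the difference. The function name `margin_range` and the inputs (100 runs, a 90-run baseline, 3% error) are assumptions chosen purely for illustration.

```python
def margin_range(estimate, baseline, rel_err):
    """Return (low, nominal, high) for estimate - baseline when each
    input may be off by +/- rel_err (expressed as a fraction)."""
    nominal = estimate - baseline
    # Worst case low: estimate too high by rel_err, baseline too low.
    low = estimate * (1 - rel_err) - baseline * (1 + rel_err)
    # Worst case high: the errors run the other way.
    high = estimate * (1 + rel_err) - baseline * (1 - rel_err)
    return low, nominal, high

# Hypothetical player: 100 estimated runs against a 90-run baseline,
# each estimate good to within 3%.
low, nominal, high = margin_range(100.0, 90.0, 0.03)
print(low, nominal, high)
```

With these invented inputs the margin runs from about 4.3 to about 15.7 around a nominal 10, so two 3% errors turn into a swing of roughly 57% in the number that actually gets reported.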

In a WAR estimate, there are dozens and dozens of internal estimates—estimates of runs created, estimates of runs saved by fielding, estimates of the run value of a single, a double, a triple or a double play, estimates of the park effect, etc.

The problem is more serious than that.  First of all, as I said, the replacement level is not really an estimate.  It’s just a made-up number.  It could be off by 20 or 25%—by itself, before it is magnified.

But that understates the problem, too.  WAR assumes that the Replacement Level is a constant.  It is NOT a constant; it’s a variable.  Some teams, an outfielder gets hurt, it doesn’t really matter because they’ve got a fourth outfielder who is about as good as the starters.  Other teams, it matters a lot because their fourth outfielder is a pair of stuffed pajamas.  The actual replacement level is specific to the locale.  Rather than trying to estimate what the replacement level actually is in this case, WAR simply assumes that it is always the same.  To return to the analogy of the wheat farmer, this is like assuming that all trucks weigh the same.  It leads to large inaccuracies.

1. Joyful Calculus Instructor Posted: November 30, 2020 at 01:51 PM (#5991623)
The biggest problem with WAR is that nobody cares about Win Shares anymore. And my historical abstract was based on a model that used Win Shares.
2. Mefisto Posted: November 30, 2020 at 01:51 PM (#5991624)
WAR is the worst of systems, except for all the others.
3. Matt Welch Posted: November 30, 2020 at 02:15 PM (#5991628)
I have never learned not to love Bill James, but I do not see how most of his arguments against WAR could not also be made against Win Shares, and in fact against his methods for comparing Craig Biggio to Barry Bonds or whoever back in the day.
4. Ron J Posted: November 30, 2020 at 02:19 PM (#5991630)
This is not James at anything close to his best. WAR ain't perfect but it does create a well thought out framework for any discussion or decision making. The problems aren't in the metric but more in the way people perceive it.
5. Eric J can SABER all he wants to Posted: November 30, 2020 at 02:33 PM (#5991638)
WAR assumes that the Replacement Level is a constant.  It is NOT a constant; it’s a variable.  Some teams, an outfielder gets hurt, it doesn’t really matter because they’ve got a fourth outfielder who is about as good as the starters.  Other teams, it matters a lot because their fourth outfielder is a pair of stuffed pajamas.  The actual replacement level is specific to the locale.  Rather than trying to estimate what the replacement level actually is in this case, WAR simply assumes that it is always the same.

This is, um, a frustrating thing to read from someone as generally respected as Bill James. WAR is intended to be context neutral, but that doesn't mean individual teams have to use it that way to make decisions. If a team has a really good fourth outfielder, they can use the difference between his expected WAR and the expectation for one of their starters to assess how much value they want back in a trade to shore up a position of weakness, for instance. The benefit of a (hopefully) context-neutral measure is that you can always add context back in.

By comparison, if you start off by comparing a player to the actual backups on his actual team, you get some undesirable results and no way to avoid them. (Yogi Berra looks WAY worse after 1955 than before; in 1958, he has a 119 OPS+, significantly above average for a catcher, but would still be below "actual replacement" level because Elston Howard had a 130.)
6. What did Billy Ripken have against ElRoy Face? Posted: November 30, 2020 at 02:35 PM (#5991640)
Some teams, an outfielder gets hurt, it doesn’t really matter because they’ve got a fourth outfielder who is about as good as the starters. Other teams, it matters a lot because their fourth outfielder is a pair of stuffed pajamas. The actual replacement level is specific to the locale. Rather than trying to estimate what the replacement level actually is in this case, WAR simply assumes that it is always the same.
Um...isn't the entire point to try to isolate the player's performance from its context? Except for the contextual factors that directly influence the performance like park factors, of course.

EDIT: 100 Coke Shares to Eric.
7. Curse of the Graffanino (dfan) Posted: November 30, 2020 at 02:38 PM (#5991641)
Win Shares are inherently tied to team wins, which loosely tethers them to reality somewhat in a way that WAR doesn't; it makes it a bit harder for a player's calculated value to go flying off into space due to a series of compounded noisy extrapolations.

Which is not to say that Win Shares are a perfect solution either. But I think that they do explicitly try to deal with this sort of issue by design.
8. Sweatpants Posted: November 30, 2020 at 02:38 PM (#5991642)
Also, as best I understand this—which is poorly—one of the WAR systems introduces another potential error for pitchers by using a number that represents how many runs the pitcher SHOULD HAVE allowed, based on his strikeouts and walks and home runs allowed, rather than how many runs he ACTUALLY allowed. The system says "this pitcher actually allowed 100 runs, but, because he had really good strikeouts and walks, we’ll treat him as if he allowed only 87 runs." That is introducing yet another potential error, by substituting an estimate for a hard fact. That may be what causes them to conclude that the American League’s best player in 1966 was not Frank Robinson, who won the Triple Crown and was the unanimous MVP, but Earl Wilson, a pitcher whose ERA was not much better than the league average. And the people who believe in WAR will look at that and say, "Oh, well, if that’s what the numbers show, that’s what they show," rather than saying what they should say, which is "You know, that’s really a stupid thing to say."
Yikes. First of all, he blames a Baseball-Reference WAR result (Earl Wilson beating Frank Robinson in WAR in 1966, by less than 0.1) on Fangraphs WAR methodology. Second of all, as Matt Welch pointed out, when his own systems spit out a result like that, such as Craig Biggio as the best player in baseball or Eric Hosmer being more valuable than Aaron Judge in 2017, he does NOT look at it as a stupid thing to say. His belief in his systems leads to him explaining how they got there and defending the outcome.

I don't think that WAR is some sort of correct answer, or anything more than an estimate, either. It wouldn't surprise me if in 50 years it ends up viewed as a relic of its time. It's just that James has had his say on this subject many times before. That he continues to explain why WAR isn't the answer, occasionally sneaking in comparisons to his own system that never took off, makes it hard not to think that there's some truth behind comment #1.
9. Rally Posted: November 30, 2020 at 02:41 PM (#5991643)
Read it, and not convinced. The truck and grain example doesn't seem all that relevant either.

It would if you want WAR to be interpreted very literally. For example let's compare the value of Gary Sanchez to what replacements were available to the Yankees and Yadier Molina to what replacements the Cardinals had. Then you've got a lot of uncertainty on both ends. But WAR is not meant to compare to specific replacements, but a very stable replacement level. That replacement level is set around 20 runs below the league average.

League average shifts a bit every season, but that's the benchmark. It really doesn't matter if a truer value of replacement level is 17 runs, or 23 runs worse than average. That's just a shortcut to having a stable comparison for players that gives them credit for average play, since a league average player has value.

Also, as best I understand this—which is poorly—one of the WAR systems introduces another potential error for pitchers by using a number that represents how many runs the pitcher SHOULD HAVE allowed, based on his strikeouts and walks and home runs allowed, rather than how many runs he ACTUALLY allowed. The system says "this pitcher actually allowed 100 runs, but, because he had really good strikeouts and walks, we’ll treat him as if he allowed only 87 runs." That is introducing yet another potential error, by substituting an estimate for a hard fact. That may be what causes them to conclude that the American League’s best player in 1966 was not Frank Robinson, who won the Triple Crown and was the unanimous MVP, but Earl Wilson, a pitcher whose ERA was not much better than the league average. And the people who believe in WAR will look at that and say, "Oh, well, if that’s what the numbers show, that’s what they show," rather than saying what they should say, which is "You know, that’s really a stupid thing to say."

The only part I can agree with here is the first sentence. Wilson's Fangraphs WAR was 4.2 that year, and his Baseball Reference WAR was 5.9, at least for pitching. Baseball Reference is the one that likes Wilson more that year, but that has nothing to do with components: BBref is the one that uses actual runs for pitchers as the base.

Using the more favorable version of WAR for Wilson's 1966 season, it doesn't say he was better than Robinson that year, but that they were tied at 7.7. Bill wants us to say that's really stupid, but he's either reacting without trying to understand what's going on or intentionally obfuscating. Earl Wilson was quite a bit better than "some guy whose ERA was not much better than league average".

Wilson in 1966 did indeed have a league average ERA in 100 innings with the Red Sox, but was traded to the Tigers mid season and gave them 163 innings of a 2.59 ERA (134 ERA+). In addition Wilson allowed only 4 unearned runs, so he was better than his ERA says in run prevention. All in all, that's 264 innings of a 118 ERA+. He was third in the league in innings pitched behind Jim Kaat and Denny McLain.

Still, how does he get from 5.9 pitching WAR up to a tie with the MVP at 7.7? It would not have happened if the DH had been around a decade earlier, but thankfully it was not. Wilson was a damn good hitting pitcher. He hit .240 with 7 homers, a .500 slugging percentage. So he picks up 1.9 WAR for being better than the typical pitcher.
10. Rally Posted: November 30, 2020 at 02:49 PM (#5991647)
I'm certainly not suggesting Robinson's MVP award be taken away but Bill seems to be suggesting it's wrong to consider that Frank and Earl had similar value. Bill lived through that season, I didn't. I know quite a bit about Frank anyway since he's an inner circle guy, and relatively little about Earl Wilson. Maybe if I had been around then and remembered the triple crown season I would have reacted the way Bill thinks people should react.

Frank Robinson was quite obviously the best hitter in the league in 1966. What would it take to equal that value in other ways? I guess if Ozzie Smith was playing short and hit like, well, better than Ozzie generally did but less than Frank Robinson, that would be one way to get there. I am not going to reflexively dismiss the idea that a guy who pitches 264 innings with excellent run prevention and also slugs .500 is another way it could be done.
11. bfan Posted: November 30, 2020 at 02:55 PM (#5991651)
the American League’s best player in 1966 was not Frank Robinson, who won the Triple Crown

Well, in 2012, Miguel Cabrera won the triple crown and he certainly was not the best player or hitter in the league. Interestingly enough, Cabrera had a higher WAR in the year before and the year after his triple crown year.
12.  Posted: November 30, 2020 at 02:55 PM (#5991652)
So the basic problems with WAR are that people take "replacement value" too literally, don't know how it's calculated, and expect too much out of a single-number value metric :)
13. Matt Welch Posted: November 30, 2020 at 02:56 PM (#5991653)
Poor Earl Wilson! This comparison should be used mostly to marvel at a mostly forgotten, underappreciated-at-the-time great season by one of the top 20 or so starting pitchers of the '60s (who was also called the greatest hitting pitcher of the decade by...Bill James!).

14. Srul Itza Posted: November 30, 2020 at 03:48 PM (#5991666)
Well, at least it's not quite as embarrassing as defending Pete Rose when his guilt was always obvious to anyone with half a brain.

15. SoSH U at work Posted: November 30, 2020 at 04:16 PM (#5991673)
Well, at least it's not quite as embarrassing as defending Pete Rose when his guilt was always obvious to anyone with half a brain.

And that was still better than "In my day, men and boys showered together all the time."

16. sunday silence (again) Posted: November 30, 2020 at 04:55 PM (#5991683)
I was digging around trying to find some other stoopid thing that Bill James had said and wound up finding this audio clip. It's Glenn Beckert and Billy Williams interviewing a very old Woody English about his recollection of the "called shot":

https://chicagobaseballmuseum.org/chicago-baseball-history-news/chicago-baseball-history-feature/glenn-beckert-provoked-smiles-in-his-on-and-off-the-field-chicago-cubs-exploits/

English was at third base at the time and like a number of players says Ruth was holding up his fingers. In fact English says Ruth was holding up two fingers (probably to indicate the two strikes). Ruth if I recall had said he might have made the gesture twice so this sort of suggests that.
17. bfan Posted: November 30, 2020 at 05:21 PM (#5991691)
Has there been some attitudinal shift about Bill James? I regard him as the patron saint of baseball analytics, but I sense a disturbance in the force here.
18. Walt Davis Posted: November 30, 2020 at 05:22 PM (#5991693)
So the basic problems with WAR are that people take "replacement value" too literally, don't know how it's calculated, and expect too much out of a single-number value metric :)

If only they had Win Shares!

I mean, can you show me -1 apples? Don't get me started on imaginary numbers!!

Anyway, it's fair enough to say that, statistically speaking, WAR is a bit of a mess for the reasons James states. An estimate plus an estimate plus an estimate ... does blow out a standard error quite quickly. (Standard errors would be useful to know.) Generally taking aggregate-level coefficients and applying them to individuals risks being an ecological fallacy. And rather obviously a walk to Rickey Henderson with nobody on base is more likely to result in a run than a walk to a base-clogger like a Molina with nobody on base. And calibrating to known totals is a perfectly sensible thing to do. (Doing it by deciding by hand on individual player adjustments is totally daft though.) And at the end of the day, all any of us can do is explain the past then assume that the same rules will apply in the near future ... in a relative but not absolute sense cuz you never know when the league is gonna introduce a rabbit ball.

But it's also true that statistics relies day after day, model after model, on the "plug-in principle" -- i.e. that there are parameters in our equations that we will never "know" the value of so we plug in estimates of those parameters. Those estimates are derived using statistical principles (i.e. they come from some other model) and yes, it is important that we incorporate the uncertainty of those estimates as best we can and they add to the uncertainty of our final model. Your choice is to do that or to make #### up. (Let your prior dominate your posterior for you Bayesians out there.)

But all we've got that's concrete is runs scored. Nearly concrete are RBI, singles, doubles, etc. but those involve some rules, scoring decisions, random fluke-y stuff (he pulled a hamstring on what would have been a double) and non-random flukey stuff (Pesky's pole, vines). Strikeouts and walks are similarly nearly concrete but, we learn, are a combination of pitcher, catcher, batter, umpire and apparently park. And probably time of day and wind and whether The Mick popped one or two greenies that morning.

And until we estimate the run value of a double relative to a single, all we've got is common sense telling us that a double is obviously more valuable ... but that doesn't get us anywhere once we have to consider whether 2 doubles is good/better/worse than a HR and a single ... much less the "fact" that the guy with 2 doubles seems to be the better defender.

I mean what is the concrete answer to the question: was 680 PA with 122 RS, 122 RBI, 97 singles, 34 doubles, 2 triples, 49 HRs, 8 SB, 5 CS, 87 BB, 90 K, 24 GDP, 10 HBP, 7 SF, 11 IBB while playing 1181 innings in RF, 167 in LF and 26 at 1B while playing home games in Baltimore's Memorial Stadium in the 1966 AL more valuable than 264 innings pitched (1,059 BF, >50% more PA participated in), 94 RA (90 "earned"), 214 HA including 30 HR, 74 BB (3 IBB), 200 Ks, 6 HBP, 9 WP, 27 GDP plus 113 PA, 20 RS, 22 RBI, 14 singles, 2 triples, 7 HR, 8 BB, 36 K, 0 GDP, 1 HBP, 6 SH, 2 SF split across one very bad team (home field Fenway Park, reportedly a good-hitting park) and a good team that finished a distant third (home field Tiger Stadium, reportedly a good pitchers' park)?

We might refer to rarity here. Robinson's totals were certainly unusual -- do we need to compare just to his specific year? His "era"? How do we compare across? -- while Wilson's pitching totals weren't so much and neither were his hitting totals, but we note that it's very unusual for an individual to combine such hitting and pitching totals in a single season. I'm not sure which achievement is more rare.

Do we need to factor in that the Tigers had to trade actual players to obtain Wilson and the O's had to trade Pappas to obtain Robinson? Does it matter that Wilson was the only Tigers' starter who didn't suck that year (by ERA at least)? Does it matter that the O's 4th OF, Russ Snyder, had a 126 OPS+? Does it matter that Robinson usually had Aparicio and Blefary hitting in front of him, Brooks and Boog behind him? Does the bullpen support that Wilson received in his 24 incomplete games matter? What about his catchers and defense and umpires? Does the quality of Robinson's defense matter? Does it matter less because he had the reportedly great Paul Blair in CF for half the season?

We quickly realize we are nuts for asking the question and insane to think we could possibly ever come up with a reliable answer. And some of that is the crazy stuff WAR tries to answer and some of it is the even more finely grained detail James thinks we need to take into account (while somehow overcoming all the other sources of uncertainty).

James doesn't like context-neutrality. That's fair enough -- obviously a single with 2 outs and a runner on 3B contributes a lot more to a win than a single with 2 outs and nobody on ... unless of course you are ahead/behind by 6+ runs at the time in which case nothing you do really matters. It's reasonable to say that WPA (or whatever) should be incorporated into a backward-looking, here's what happened measure of value. But how much of that extra value goes to the guy on third and how much to the guy at the plate? Do we need a "mutual value" measure? I also suspect that if you took that seriously, you'd find that "great" seasons come down to about 150-200 PAs. Sure, the Babe seemed pretty awesome but how many of those hits and HRs came when the game was already out of reach? I strongly suspect that a heavily contextualized measure will lead to a lot more WTFs and more extreme WTFs than bWAR and fWAR.

Contextual value in baseball is a funny thing. In 1966, Robinson went hitless in 46 games, 30% of his starts. He failed to score a run in 67 of them, had no RBI in 83 of them, had neither in 53 of them. So in 30-40% of his games he was useless. In at least 23 of the 72 games (I counted by hand) in which he had at least 1 RBI, the game was decided by 5+ runs so maybe nearly half the time his contribution was either "negative" or didn't really matter. Now in those other 75-80 games, he may have hit something like 600/800/1200 which probably helped his team win. :-) It seems pretty clear that the only way you can come up with a sensible estimate of his value there is to acknowledge that he actually hurt his team's chances of winning in about 1/3 of his games, had an average-ish contribution 1/3 of his games and was massively valuable in the other 1/3.

Baseball is not, say, basketball. It's a really, really hard game where the PA by PA outcome for hitters is usually utter failure. That has to be acknowledged and accounted for. Any batter except maybe the craziest Bonds and Ruth seasons is more likely to hurt than help his team's chances of winning when he steps into the box. It is obviously possible to hurt so much more often than help as to have negative value.

Speaking of "opportunity" and "value", I note that in 1966 Denny McLain went 20-14 on a 89 ERA+ (Lolich 14-14 on a 73).
19. Hank Gillette Posted: November 30, 2020 at 05:30 PM (#5991695)
I think Bill James has turned into the kind of cranky old man that he used to ridicule 40 years ago. He has said he doesn’t read other people’s research and analysis, which may make sense if you are planning to publish your own research and want to make sure you are not unconsciously plagiarizing someone else, but it leaves him behind when it comes to current research.

He doesn’t seem to understand why an open framework like WAR won out over a proprietary closed system like Win Shares, which has only one person working on it, and not very often at that.

That doesn’t negate his pioneering work on baseball analysis, and showing that someone could make a good living at analyzing baseball.
20. What did Billy Ripken have against ElRoy Face? Posted: November 30, 2020 at 05:33 PM (#5991697)
Has there been some attitudinal shift about Bill James?
I think it's more of an attitudinal shift by Bill James.
21. Walt Davis Posted: November 30, 2020 at 05:39 PM (#5991699)
Has there been some attitudinal shift about Bill James? I regard him as the patron saint of baseball analytics, but I sense a disturbance in the force here.

He remains the patron saint and everybody still loves the early books. But he has made it clear over the years in his comments that he makes no effort to keep up with other people's research. You can see it here where he can't be bothered to even find out which is fWAR and which is bWAR, claims he didn't even really know that fWAR uses FIP, etc. The "shift" may have begun with Win Shares which he sort of promoted as being the end-all of such measures but it was clear he hadn't bothered with anybody else's work on the topic and made some cavalier decisions (no negative value; calibrating to win totals; achieving that calibration via idiosyncratic adjustments; arbitrary multiplication by 3) without really thinking them through. Here we are a couple of decades later and he still seems to think he cracked the code with win shares.

He's a grumpy Paul McCartney -- he did some absolutely groundbreaking stuff, he remains a solid writer but he hasn't tried to keep up (probably a good thing in McCartney's case) and he hasn't done anything impactful in 30(?) years. Unlike McCartney who only seems to have nice things to say about other musicians, James occasionally resurfaces to re-fight the battles of 2000.

Or James is like Joe Morgan -- a decade of utter greatness, a big chunk of other good work, then cranky old-manliness. In short a damn fine life well-lived while most of us only have the old-man crankiness part.
22. What did Billy Ripken have against ElRoy Face? Posted: November 30, 2020 at 05:43 PM (#5991702)
(no negative value; calibrating to win totals; achieving that calibration via idiosyncratic adjustments; arbitrary multiplication by 3)
Also, "I reserve the right to arbitrarily adjust a player's total as I see fit to match what I think his value really was."
23. Eric J can SABER all he wants to Posted: November 30, 2020 at 05:46 PM (#5991703)
And rather obviously a walk to Rickey Henderson with nobody on base is more likely to result in a run than a walk to a base-clogger like a Molina with nobody on base.

This difference should largely be addressed by including baserunning value, which WAR does. Comparing, say, Tim Raines to Yadier, B-R has a 150-run difference in their baserunning value, Fangraphs has it as 180. (Not sure why there's a gap of 30 runs there; B-R has Molina at -35 compared to -80 for Fangraphs.) The walk is counted the same, but what happens after is addressed separately.

The rest of your comment I generally agree with.
24. Hank Gillette Posted: November 30, 2020 at 05:50 PM (#5991704)
…and made some cavalier decisions (no negative value; calibrating to win totals; achieving that calibration via idiosyncratic adjustments; arbitrary multiplication by 3)

That’s something that has always bugged me about some of James’ studies: he will use some seemingly random number as a multiplier or divisor and say that the results don’t work without using that specific number. That may actually be true, but it gives the impression that he is massaging the numbers to get the results he is looking for. He is still fun to read, but I am no longer surprised at how obvious his conclusions are once you hear them. Maybe all the obvious stuff has been figured out.
25. kcgard2 Posted: November 30, 2020 at 05:55 PM (#5991705)
James has had this agenda for a long time. Most of the time, for discussion in which WAR is used, the people having the discussion want context neutrality. Think of how ridiculous it would be to base "replacement level" on the actual replacements available on the specific team that specific year. You would frequently have above league average starters below "replacement level" because the backup on the team was slightly better in less playing time. What useful information does a stat like that tell you? To address Bill's actual complaint about a fixed replacement level, that's a stupid complaint too, because it neither helps nor hurts ANY player you'd compare to any other player at any time, and as far as I can tell, the point of WAR (the point of any stat ever used, actually) is simply to compare players. Should replacement level change over long periods of time? That's a defensible position, but even then, the replacement level for any individual season would be a fixed value. Unless you want a stat that tells you that Babe Ruth's career value was less because he was on a team stacked with talent (replacement level on his teams was high), and Appier's value was more for the opposite reason. What Bill's stat tells us is that a player is worth more Win Shares by virtue of having great teammates. I wonder why your metric isn't preferred Bill - is it because the majority of the sabermetric community is ignorant and misguided except for you?

And how exactly does Win Shares avoid the problem of inaccurate inputs? It has literally the exact same issue, to the extent it's an issue. "The real problem is ... estimates are never exactly right; they are always just estimates" is one of the dumbest pieces of argumentative rhetoric I've ever heard. Every criticism leveled against WAR here applies to Win Shares, and then Win Shares earns some additional critiques if we're being honest.

If you want a context-dependent value stat, hey, that's a preference. Simply say that's your preference. Explain the pros and cons of it (in an honest way). See who you convert. Or do this, and get called out for your obvious agenda and bias.
26. Zach Posted: November 30, 2020 at 05:56 PM (#5991707)
A better word than "comparison derivative" would be "stacking tolerances." You don't want to measure a small thing by measuring two big things with independent measurement errors -- when you do that, you add the two measurement errors in quadrature. So if you want to measure whether a kid is taller than the "you must be this tall to ride" sign, you wouldn't want to use a tape measure to measure the kid, then a ruler to measure the sign. You line the kid up next to the sign and measure the distance from the top of the skull to the bottom of the line.
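
A minimal sketch of what "adding in quadrature" means for the kid-and-sign example. The error sizes (0.5 cm for each separate measurement, 0.2 cm for the direct one) and the 2 cm gap are invented for illustration, and the function name `quadrature` is just a label used here.

```python
import math

def quadrature(*errors):
    """Combine independent measurement errors: sqrt(e1^2 + e2^2 + ...)."""
    return math.sqrt(sum(e * e for e in errors))

# Hypothetical: tape measure on the kid and ruler on the sign,
# each good to +/- 0.5 cm, then subtract the two readings.
two_step_error = quadrature(0.5, 0.5)

# Hypothetical: one direct measurement of the gap, good to +/- 0.2 cm.
direct_error = 0.2

gap_cm = 2.0  # assumed true difference between kid and sign
print(two_step_error / gap_cm, direct_error / gap_cm)
```

Under these assumptions the two-step route carries an error of about 0.71 cm on a 2 cm gap (roughly 35%), while the direct measurement carries 10%. The absolute errors are small either way; it is the small size of the thing being measured that makes the first one hurt.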

However, I'm not sure that measuring the run value of a player is really different from measuring the run value of a replacement player. You're going to do that with linear weights, which are calculated league wide with good statistics. There is no measurement error in counting at bats, singles, doubles, etc. There's quite a lot of measurement error in measuring defense (which is what I hate about WAR -- get those dirty defensive stats away from the nice clean offensive stats!). The run environment of the league as a whole should be accounted for when you're calculating the linear weights.

So the way to avoid stacking tolerances is to use the same value for a single, double, etc for every player and make sure that you don't ever add offensive value and defensive value.

(There's also some error due to park factors, but both systems use those in the same way, so it's not relevant to this discussion)

27. Eric J can SABER all he wants to Posted: November 30, 2020 at 05:59 PM (#5991708)
"The real problem is ... estimates are never exactly right; they are always just estimates" is one of the dumbest pieces of argumentative rhetoric I've ever heard.

"The real problem with WAR is that people use it wrong." This is a problem with people, not a problem with WAR. I don't have any particular suggestions on how to fix it, but it'd be nice to see the blame placed where it belongs.
28. Zach Posted: November 30, 2020 at 06:01 PM (#5991709)
For a grain elevator: make sure you measure the truck on the same scale both times! For a Hooke's law spring, F=kx

W_t = k * (x_1)

(W_t + w_g) = k * (x_2)

so w_g = k * (x_2 - x_1)

Assuming no measurement error in where the needle ends up pointing, the percent error is delta k/k, which is the parameter inspected by the state.

If you use two scales, you run into the problem Bill's dad was noticing.
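
A small sketch of the one-scale versus two-scale point, with invented weights and calibration errors. It models each scale as reading the true weight times (1 + calibration error), which is an assumption made here for illustration; the names `reading` and `measured_grain` are hypothetical.

```python
def reading(true_weight, cal_error):
    """What a scale reports if its spring constant is off by cal_error
    (a fraction, e.g. 0.01 for 1%)."""
    return true_weight * (1.0 + cal_error)

def measured_grain(truck, grain, err_empty, err_loaded):
    """Grain weight inferred as loaded reading minus empty reading."""
    return reading(truck + grain, err_loaded) - reading(truck, err_empty)

TRUCK, GRAIN = 20000.0, 5000.0  # made-up weights in pounds

# Same scale both times: the truck's share of the error cancels.
same_scale = measured_grain(TRUCK, GRAIN, 0.01, 0.01)

# Two different scales, one reading 1% low and one 1% high.
two_scales = measured_grain(TRUCK, GRAIN, -0.01, 0.01)

print(same_scale, two_scales)
```

With these numbers the same-scale answer is off by about 50 pounds on 5,000 (1%, the same as the calibration error), while the two-scale answer is off by about 450 pounds (9%), because the error on the heavy truck no longer cancels.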
29. Walt Davis Posted: November 30, 2020 at 06:35 PM (#5991713)
"I reserve the right to arbitrarily adjust a player's total as I see fit to match what I think his value really was."

In polite company, this is referred to as "calibration via idiosyncratic adjustments." :-)

30. Starring Bradley Scotchman as RMc Posted: November 30, 2020 at 06:54 PM (#5991714)
Poor Earl Wilson!

Fun fact: Wilson was the first black player signed by the Tigers, the last of the then-16 MLB teams to do so.
31. Monty Posted: November 30, 2020 at 07:06 PM (#5991716)
Also, as best I understand this—which is poorly

Good reason to stop writing about it, Bill.
32. Zach Posted: November 30, 2020 at 07:18 PM (#5991717)
Basic error analysis absolutely is something that should be incorporated at the design level in things like WAR that lump together several types of measurement. It's very easy to measure things carelessly and get way more error than you could have gotten by planning more carefully.

It hasn't historically been used, because
1) Sabermetricians have tended to be hobbyists doing the math by hand or with spreadsheets
2) The other source of sabermetricians is big data, where errors are statistical and quickly average down to zero.
and
3) Nobody has been badly burned yet.

Back in the day, Baseball Prospectus kept doing article after article "evaluating" how well PECOTA had done over the last season. This for a forecasting system that claimed to include error bars! Five minutes plotting Z scores (Z score = (prediction - observed)/standard deviation) would have shown more than every article combined. That finally seems to have caught up with Nate Silver this year, as people noticed that he was calling most states correctly but the winning margins were well outside of the predicted ranges.
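
For anyone who wants to try the Z score check, here is a rough sketch. The projections, observed values and standard deviations are invented for illustration; they are not PECOTA output, and the function names are hypothetical.

```python
from statistics import mean

def z_scores(predicted, observed, sigma):
    """Z score = (prediction - observed) / standard deviation, one per forecast."""
    return [(p - o) / s for p, o, s in zip(predicted, observed, sigma)]

def coverage(zs, k=1.0):
    """Fraction of forecasts that landed within k standard deviations."""
    return sum(abs(z) <= k for z in zs) / len(zs)

# Made-up home run projections with claimed error bars (assumption, not real data).
predicted = [30, 22, 15, 40, 8]
observed = [24, 25, 15, 31, 10]
sigma = [6, 5, 4, 6, 4]

zs = z_scores(predicted, observed, sigma)
print("z scores:", [round(z, 2) for z in zs])
print("mean z:", round(mean(zs), 2))
print("within 1 sigma:", coverage(zs, 1.0))
print("within 2 sigma:", coverage(zs, 2.0))
```

If the error bars are honest and roughly normal, the Z scores should center near zero, with about 68% inside one sigma and about 95% inside two. Five forecasts is far too few to judge anything; the point is only the shape of the check.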
33. michaelplank has knowledgeable eyes Posted: November 30, 2020 at 07:27 PM (#5991719)
The Denver Broncos yesterday would have *killed* for a replacement level quarterback.
34. Zach Posted: November 30, 2020 at 07:46 PM (#5991723)
Yeah, I can't remember any MLB games ending up as farcical as an NFL game with a practice squad WR playing QB.
35. bfan Posted: November 30, 2020 at 07:48 PM (#5991725)
Okay, an argument against how WAR is used and why Bill James would have a problem with it.

People use WAR to stack up and compare MVP candidates all the time. Using Bill’s problem with the baseline being replacement level: it is unlikely that the next person who would show up in the Dodgers lineup would be merely replacement level if Mookie Betts had to sit out. However, if Freddie Freeman sat out, Johan Camargo becomes the next man up and he is replacement level or worse. So Mookie’s value in the lineup is overstated and Freeman’s is understated. It doesn’t change what an incredible player Mookie is or how great he was in 2020, but it does reflect on his value to the team.
36. Joyful Calculus Instructor Posted: November 30, 2020 at 08:00 PM (#5991729)
That’s something that has always bugged me about some of James’ studies: he will use some seemingly random number as a multiplier or divisor and say that the results don’t work without using that specific number.

Ah yes, there's a term in statistics called "p-hacking" where you look through tons of data until you find something statistically significant, but it's really a spurious correlation, because if you look hard enough for a small p-value you will eventually find one. That leads to "overfitting," where you design a model based on spurious correlations.
37. bookbook Posted: November 30, 2020 at 08:04 PM (#5991730)
I think the ultimate answer is that Bill James is a gifted narcissist. His high opinion of himself led him to defy conventional wisdom and make meaningful breakthroughs. As soon as anyone besides himself was having insights, sometimes overriding his own progress, he got nasty. It turned out that it was never about understanding baseball as well as possible: it was always about Bill James.

His true crime work, and his dedicated anti-professionalism, is also utter crap.
38. What did Billy Ripken have against ElRoy Face? Posted: November 30, 2020 at 08:17 PM (#5991735)
In polite company, this is referred to as "calibration via idiosyncratic adjustments." :-)
Who the #### said we was polite around here, ashhole?

I thought you were referring specifically to adjustments in calibrating the scale to win totals - didn’t he also just have some sort of “manual override” for a player’s final number?
39. Walt Davis Posted: November 30, 2020 at 08:21 PM (#5991736)
Huh? Robinson had 680 PA, almost exactly 6 times Wilson's 113:

FR 66: 122 R, 122 RBI, 97 singles, 34 doubles, 2 triples, 49 HRs in 680 PA
EW 66: 120 R, 132 RBI, 84 singles, 0 doubles, 12 triples, 42 HRs

So is 264 IP of good pitching worth 5/6 of Robinson? We could think of that as 1,059 BF vs 567 PA so each Wilson PAA only has to be slightly more than half as far below average as Frank's PAs were above average.

Now some of you clever snots will suggest that Wilson's batting wasn't really worth 1/6 of Frank's. You'd have a point as WAR credits Robinson with 71 Rbat and Wilson with just 5. Darn OBP it seems. Yet Wilson scored and knocked in runs at an equal or better rate so are we sure Frank's meager 111-point OBP edge was really worth all that much? (Better get an estimate!) Or do we need to adjust for their on-base contexts -- surprisingly for Frank almost exactly league-average while Wilson had a lot more?

Now, if Frank had pitched to about 500 batters, we'd have an easy comparison.
40. Sweatpants Posted: November 30, 2020 at 08:34 PM (#5991738)
I think the ultimate answer is that Bill James is a gifted narcissist. His high opinion of himself led him to defy conventional wisdom and make meaningful breakthroughs. As soon as anyone besides himself was having insights, sometimes overriding his own progress, he got nasty. It turned out that it was never about understanding baseball as well as possible: it was always about Bill James.

His true crime work, and his dedicated anti-professionalism, is also utter crap.
This is nastier than anything in the linked article (not that James has never been nasty).

I don't think that the part about him pushing back against other people's work fully holds up. He praised DIPS and accepted it as true, mentioning it in his 2001 book and crediting the guy who came up with it. He just really seems to have a hang-up about WAR.
41. John DiFool2 Posted: November 30, 2020 at 08:39 PM (#5991739)
That’s something that has always bugged me about some of James’ studies: he will use some seemingly random number as a multiplier or divisor and say that the results don’t work without using that specific number. That may actually be true, but it gives the impression that he is massaging the numbers to get the results he is looking for.

Recall he assailed Linear Weights years ago for doing exactly that.

[Trying to remain silent about his politics, but I can't deny that his blinders there have had a retroactive effect on how I now view his older writings]
42. Mefisto Posted: November 30, 2020 at 09:43 PM (#5991751)
@34: Al Travers begs to differ.
43. snapper (history's 42nd greatest monster) Posted: November 30, 2020 at 09:48 PM (#5991753)
Has there been some attitudinal shift about Bill James? I regard him as the patron saint of baseball analytics, but I sense a disturbance in the force here.

It's funny. I came to Sabermetrics without reading James at all. Pete Palmer's The Hidden Game of Baseball introduced me to it, then it was Neyer on ESPN.com, and so forth.
44. John Northey Posted: November 30, 2020 at 09:51 PM (#5991754)
I think the big challenge is to define what you are trying to accomplish with a statistic. WAR is great for stuff like judging how well a draft or trade went (as you get an assortment of players at different positions, eras, etc.) but isn't the best for single seasons as you get massive variables (defense the primary one) which don't work the best in one year measures. Also it isn't really made to work at the decimal level - a 10.1 isn't statistically better than a 9.9 player (or a 9.5 I suspect). Win Shares is useless imo as it is impossible to figure out on your own and has some random adjustments, it seems. I like formulas that can be looked at and torn apart to see what makes them work.
45. the Hugh Jorgan returns Posted: November 30, 2020 at 09:52 PM (#5991755)
He just really seems to have a hang-up about WAR.

Hey, if you thought you'd discovered a better system in Betamax and everyone went VHS, you'd be angry too!
46. snapper (history's 42nd greatest monster) Posted: November 30, 2020 at 09:52 PM (#5991756)
Trying to remain silent about his politics, but I can't deny that his blinders there have had a retroactive effect on how I now view his older writings

I thought he was relatively liberal?
47. Lowry Seasoning Salt Posted: November 30, 2020 at 10:16 PM (#5991762)
I came to Sabermetrics without reading James at all. Pete Palmer's The Hidden Game of Baseball introduced me to it, then it was Neyer on ESPN.com, and so forth.

Similar path here. The other big influence—not sabermetrics—was Murray Chass, who introduced me to and educated me on the business side of the game. While for me all three of James, Neyer, and Chass went from groundbreaking to somewhere near unbearable, I'm grateful for their peaks and the trails they blazed.

Nowadays, specifically when it comes to James, I keep in mind what Tango has said about him, something like, "Bill often identifies a problem without offering a solution, which he leaves us to work on." Tango's point—and as far as I'm aware he and James get along fine—is that it's best to think of James's work more as a conversation starter than a complete and rigorous analysis.
48. Ron J Posted: November 30, 2020 at 11:18 PM (#5991773)
#47 James at his best asks good questions and again at his best tries to devise methods to objectively look at these questions. But he has a preferred toolkit and sometimes ... let's just say the studies function better as conversation starters and sometimes they're downright questionable.

Still he very rarely starts a study with a desired conclusion in mind and that's huge.

Plus he's a better writer than almost anybody in the field.
49. John DiFool2 Posted: November 30, 2020 at 11:21 PM (#5991774)
Trying to remain silent about his politics, but I can't deny that his blinders there have had a retroactive effect on how I now view his older writings

I thought he was relatively liberal?

Nope. I've read sufficient comments in his weekly column that demonstrate otherwise. Old heroes fall hard. [Tho I still acknowledge #47's 1st point]
50. vortex of dissipation Posted: November 30, 2020 at 11:35 PM (#5991775)
He's a grumpy Paul McCartney -- he did some absolutely groundbreaking stuff, he remains a solid writer but he hasn't tried to keep up (probably a good thing in McCartney's case) and he hasn't done anything impactful in 30(?) years. Unlike McCartney who only seems to have nice things to say about other musicians, James occasionally resurfaces to re-fight the battles of 2000.

If by "impactful" you mean a hit single, or something that has had any impact on the current direction of pop music, I'd agree with you. But McCartney's work from 1997's "Flaming Pie" onwards has featured quite a few magnificent songs. He's actually had a remarkable late career renaissance as far as the quality of his music is concerned, whether it gets played on the radio or not. I recently put together a two-hour playlist of post-1997 McCartney music, and it was wonderful.
51. BrianBrianson Posted: December 01, 2020 at 01:20 AM (#5991779)
That finally seems to have caught up with Nate Silver this year, as people noticed that he was calling most states correctly but the winning margins were well outside of the predicted ranges.

Ha, No. Silver has done Z-score analyses of his models in the past, and they do quite well. The people criticising him this year have no idea what Z-scores, normal distributions, sampling errors, systematic error, math, etc. are. A few recognise one or two of the words, and know just enough to misapply them. He had some bad luck getting all 50 presidential states right in 2008, and winning over people too much by dumb luck.
52. bookbook Posted: December 01, 2020 at 01:20 AM (#5991780)
This is nastier than anything in the linked article (not that James has never been nasty).

In fairness, groundbreaking revolutionary work deserves massive accolades. And his writing really was compelling.
53. Esoteric Posted: December 01, 2020 at 07:26 AM (#5991785)
I have never learned not to love Bill James
Okay, so how many people caught Matt's ultra-obscure Beach Boys reference here?
54. Never Give an Inge (Dave) Posted: December 01, 2020 at 08:13 AM (#5991787)

However, if Freddie Freeman sat out, Johan Camargo becomes the next man up and he is replacement level or worse. So Mookie’s value in the lineup is overstated and Freeman’s is understated. It doesn’t change what an incredible player Mookie is or how great he was in 2020, but it does reflect on his value to the team.

Rosters aren’t static, the Braves can trade for or sign a better replacement if they want to. It’s not an efficient market like the systems imply, but it doesn’t make sense to think about value as dependent on the quality of one’s teammates (or GM). And likewise, if Mookie was unavailable his replacement could get hurt the next day and then they are using more of a replacement-level replacement.

That being said, I think James’ critiques of WAR are partially correct. WAR would be better expressed as a range but it doesn’t seem like anyone has attempted to quantify what the standard error is around the metric. And on the pitching side, there are some big assumptions underlying WAR when it comes to defensive support. Those should be more prominently explained and it should be easier to identify the magnitude of their effects. But Win Shares has very similar problems, as Matt Welch noted in #3.
55. Captain Joe Bivens, Elderly Northeastern Jew Posted: December 01, 2020 at 08:30 AM (#5991790)
I recently put together a two-hour playlist of post-1997 McCartney music, and it was wonderful.

Is any of it rock and roll?
56. Rally Posted: December 01, 2020 at 09:02 AM (#5991792)
Fun fact: Wilson was the first black player signed by the Tigers, the last of the then-16 MLB teams to do so.

Wilson was signed by the Red Sox and traded to the Tigers in 1966 (signed in 1953, MLB debut in 1959 - same year as Pumpsie Green). At that point they already had Willie Horton and Gates Brown, who were signed in 1961 and 1960. I don't know if any of these guys was the first black Tiger, but it definitely was not Wilson.
57. Rally Posted: December 01, 2020 at 09:09 AM (#5991794)
He had some bad luck getting all 50 presidential states right in 2008, and winning over people too much by dumb luck.

That got him a massive advance on a book deal and the backing/funding to do what he wanted in a media site. If that's bad luck, sign me up.
58. BrianBrianson Posted: December 01, 2020 at 09:30 AM (#5991799)
Yeah, okay, contextual. It was good luck then, but now it's saddled him with people proclaiming he doesn't know what he's doing because he's off in two states by more than two sigma.
59. sunday silence (again) Posted: December 01, 2020 at 09:42 AM (#5991802)

I think Bill James has turned into the kind of cranky old man that he used to ridicule 40 years ago.

I hate to single anybody out, cause this is really a sort of minor chord that's been running through this whole discussion: the notion that Bill James has suddenly gotten old. That he no longer has it. That he's cranky.

Emphatically: NO! This is a ridiculous notion.

This has always been Bill James's thing. He takes on controversy, he's a maverick, he finds things that no one else finds because he wants to go against the grain. This is also the beauty of Bill James, so before I lambast James, let me just say that almost all of us have cut our teeth on Bill James, and his body of work looms over this entire field of Sabermetrics.

It's probably the main reason most of us are here.

OK, having said that, let me just say that like Lindsay Lohan saying something stoopid, or Bobby Bonilla booting a groundball, or Patton slapping a soldier, Bill James is likely to say something ridiculous at any time, any place.

OK? That's just who he is. I could cite numerous obvious examples where he defends Pete Rose's gambling, or drops hints that a baseball player is gay, or that Hal Chase was a serial philanderer, or decides that Dick Allen never helped any ball club he was on. No. I'm going to cite something more prosaic, something right in Bill James's wheelhouse.

Baseball research 101. Here's the quote; it's the fifth excerpt on the page from this archive site:

http://baseballanalysts.com/archives/2004/07/abstracts_from_12.php

Take, for example, the Royals' third baseman George Brett. Brett in 1976 handled 501 chances, which is a lot, but made 26 errors, which is also a lot...

How many times do you think Sal Bando made 20 errors at 3b? Like four times, with a high of 24. Here's some others:

Graig Nettles one of the greatest fielding 3b: 5 times, a season high of 26
Ken Boyer 7x, 3x 24 or more errors (did you know James ranked Boyer 12th all time at 3b and also plumped for him in the HoF at the same time he was plumping Santo?)

Darrell Evans, 7 times, each time 25 or more, high of 36
Buddy Bell, 4 times, yeah Bell was really good.

Ron Santo. 11 TIMES! 6x 25 or more errors. Effin Ron Santo committed 25 errors at 3b 6 times. His high is like 31 or something.

The point is not that committing 20 or so errors at 3b is unremarkable; well, it is unremarkable. Making errors at 3b comes with the territory. And Bill James should know that cause you know BILL JAMES IS A RESEARCHER. It's not the main point. The point is that this is an easy Look Up.

Bill James plumped for Ron Santo for the HoF. He also plumped for Boyer. James is supposed to know baseball statistics. His statement above is him mailing it in. Just saying something off the cuff. Here's another quote from the same excerpt:

A lot of his throws to first base don't go to the first baseman. Of those 26 errors, I'd bet 20 were on throws...

I'm not even gonna research this one. I'm gonna go out on a limb and say I am 85% certain that that statement is also utter bullsheet. Why? Cause I know two things:

1. Throwing the ball away with runners on base is just about the worst thing you can do as a 3b. It's usually at least -1.2 weighted runs with just one man on base right there.

2. Even Dick Allen for all his defensive woes didn't throw the ball away all that much. He knew they will yank you off 3b in a heartbeat if you do that often enuf.

Oh I also know that Bill James is doing his Bill James thing right here.

60. sunday silence (again) Posted: December 01, 2020 at 09:47 AM (#5991804)
I don't know if any of these guys was the first black Tiger, but it definitely was not Wilson.

Ozzie Virgil, 1958.
61. sunday silence (again) Posted: December 01, 2020 at 10:08 AM (#5991807)
Now some of you clever snots will suggest that Wilson's batting wasn't really worth 1/6 of Frank's. You'd have a point as WAR credits Robinson with 71 Rbat and Wilson with just 5. Darn OBP it seems.

The mistake here is comparing Earl Wilson's bat directly to Frank Robinson's. Presumably the Tigers have a RF who can hit (that would be Northrup, OPS+ of 120). Technically we (you) should be comparing Robinson to the avg AL or MLB RFer in terms of hitting/off, since every position has to be filled by someone who bats.

Wilson OPS+ of 129 which is probably 100pts? more than the avg pitcher but he's only batted 113 times so maybe 15 runs created more than the average pitcher.

While some primates may disagree, I think the better way is to compare a man's off contribution vs the guys playing the same position, instead of these bizarre and highly theoretical positional adjustments. I mean, if you want to do it that way, then I guess you'd have to make Frank Robinson pitch every fifth day and make Wilson play RF 4 out of every 5 days. Who wins that match up?

So that would be how you do it. Going from memory I think Frank is about 70 runs above avg on off, so vs the average RF he's what, +50 or so, in terms of offense.

Wilson maybe +15 runs on off. So that leaves Earl with 35 runs to catch up to Frank in our theoretical MVP revisited race.

Anyone know if there is an equivalent "runs saved" for pitching? I'm sure there is.
62. sunday silence (again) Posted: December 01, 2020 at 10:18 AM (#5991809)
Actually we should probably do this as a fun exercise to really scrutinize Frank vs Earl Wilson for the 1966 MVP. Use any method you want but try to be logical and complete. I'll do it my way, comparing each player vs similar players at the same position.

We haven't even talked about Frank's fielding, which we know is pretty bad.

How bad is Frank's arm really? How many guys take the extra base on him?

And how bad is Wilson's fielding? Is he capable of the "throw the ball into the stands while trying to make the play at 1b with runners on base?"

Who is more clutchy? Who costs more money? Tune in tomorrow!
63. Joyful Calculus Instructor Posted: December 01, 2020 at 10:19 AM (#5991810)
[47] Pretty sure that’s the first time that I’ve seen someone on this site say something nice about Murray Chass.
64. Ron J Posted: December 01, 2020 at 11:05 AM (#5991815)
#54 Raises hand. I have been talking that stuff off and on for ... well since before there was WAR. Again, a lot of my research went with one of my computer moves but I do have a decent sense of the range for any of the estimates which is why I'm happy that we have both fWAR and bWAR.

And why I'm likely to go grumpy old man when I see discussions that assume first decimal place precision. The standard error on the offensive components is not smaller than 14 runs (EDIT: per year) for a full time player. And that's the area that I'm pretty sure has the highest precision.

Similarly I'm confident that the standard error at the career level is not smaller than 4 wins (probably more like 5) for anybody involved in a HOF discussion.

EDIT: I should point out that Walt Davis may not raise these issues explicitly but he is good at asking method related questions and the like.
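
To put the first-decimal-place point in concrete terms, here is a back-of-the-envelope sketch. It assumes about 10 runs per win (a common rule of thumb, not a figure from this thread), treats the 14-run figure as the whole standard error even though it is described above as a floor for the offensive components alone, and assumes independent, normally distributed errors. The function name is hypothetical.

```python
import math

RUNS_PER_WIN = 10.0  # assumption: rule-of-thumb conversion, not from the thread

def prob_first_is_better(war_a, war_b, se_runs_a=14.0, se_runs_b=14.0):
    """P(true value of A exceeds true value of B) under independent normal errors."""
    se_a = se_runs_a / RUNS_PER_WIN
    se_b = se_runs_b / RUNS_PER_WIN
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)  # errors add in quadrature
    z = (war_a - war_b) / se_diff
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # normal CDF at z

p_close = prob_first_is_better(10.1, 9.9)  # a 0.2 WAR gap
p_wide = prob_first_is_better(10.0, 6.0)   # a 4 WAR gap
print(f"10.1 vs 9.9: {p_close:.2f}")
print(f"10.0 vs 6.0: {p_wide:.2f}")
```

Under those assumptions a 0.2 WAR gap is barely better than a coin flip, while a 4 WAR gap is a fairly safe call.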
65. Rally Posted: December 01, 2020 at 11:09 AM (#5991817)
We haven't even talked about Frank's fielding, which we know is pretty bad.

Are we sure about this? Was he limited by injury or something? His -8 fielding for the year is not an outlier or anything. He was -4 the year before, then averaged -8 for 1967-68. After that he's got one good year, one bad year, 2 average years, then became a DH.

Actually we should probably do this as a fun exercise to really scrutinize Frank vs Earl Wilson for the 1966 MVP. Use any method you want but try to be logical and complete. I'll do it my way, comparing each player vs similar players at the same position.

Great idea. I like that so much better than what James is doing here. Robinson obviously is so much better than Wilson, ergo, WAR is stupid. Hard to believe that unJamesian sentiment is coming from the same guy who wrote the Mattingly/Clemens and Rice/Guidry comparisons.
66. Rally Posted: December 01, 2020 at 11:21 AM (#5991819)
I see Win Shares had it 38-24 in favor of Robinson, going by the WS book and not any updates.
67. What did Billy Ripken have against ElRoy Face? Posted: December 01, 2020 at 11:29 AM (#5991820)
Robinson obviously is so much better than Wilson, ergo, WAR is stupid. Hard to believe that unJamesian sentiment is coming from the same guy who wrote the Mattingly/Clemens and Rice/Guidry comparisons.
Exactly. It's no different than "Defensive metrics say Jeter is bad, so they're wrong!!"
68.  Posted: December 01, 2020 at 12:15 PM (#5991831)
So I looked up Earl Wilson's record against the Orioles in 1966. He went 4-2 against them, not bad against eventual World Champions.

He did get blown out a couple of times; on one occasion (19 July) Frank Robinson chased him out of the box in the first inning with a 3-run homer.

But on 18 May (with Boston), he held Robinson to 1-for-5 with a GDP. In the tenth inning, Wilson homered off Jim Palmer to put the Red Sox ahead, and then got Frank on a foul pop to end the game in the bottom of the 10th.

On 14 July, Frank hit a first-inning solo HR to put Baltimore ahead (of the Tigers, now). Wilson settled down and drove in the decisive run with a SF in the sixth, Tigers won.

The next night, Wilson hit a 3-run pinch homer in the bottom of the 13th to beat the Orioles again.

That's pretty good :)
69. sunday silence (again) Posted: December 01, 2020 at 12:19 PM (#5991833)

Are we sure about this? Was he limited by injury or something? His -8 fielding for the year is not an outlier or anything. He was -4 the year before, then averaged -8 for 1967-68. After that he's got one good year, one bad year, 2 average years, then became a DH.

I feel strongly that he is, but I can't prove it with data. A -8 on TZ is quite bad. For one thing, the TZ ratings, or whatever the 1966 equivalent is, just seem to be attenuated in both the negative and positive directions. [NOTE: these claims are disputed by certain primates]

Also I think one of those years you're referring to is 1970 in LAD, and I feel that the half season or whatever he played is never enuf data pts.

We talked about this in re: Rabbit Maranville. Who was obviously really good. He held down SS for like 20 years. It would follow then that he must have had some sort of peak, some time in which he was really stellar; how else do you hold on for 20 years? Everyone ages, so he must have been coming down from a really high defensive peak. But you can't see that from TZ; I forget what the numbers were.

I just looked up Tris Speaker's numbers. Speaker was considered the greatest fielding CF before 1950. We could argue that. I know Dom DiMaggio was really good at his peak but I don't think he had a long career. I know Bob Meusel was really good for a while. Can we agree Speaker is near the top of his profession?

OK, TZ gives him 10+ def WAR 4x in his career. His high is 14 (all the rest are 10). That can't be a true evaluation of his talent? You can look at Statcast and see the best guys are hitting like 18 OAA (Outs Above Average), and we're not counting assists and holding runners (which probably max out at 10 runs/season and 5/season respectively). Betts has hit 30 def runs against replacement. Perhaps that's the outer limit (but Betts is in RF; what of Mays in CF?). Would 20-25 def runs for 12 seasons be reasonable for Speaker? I think so.

There's some other stuff I've bookmarked that we can reference later. For example, someone did a study on Jeter: how many balls he didn't get to vs the best SS that year. Jeter is almost -40 vs avg and the other guy +30 or so. I question how they value those in weighted runs, but it gives a real indication of what we would expect the spread to be for SS. 3b seems to be the other key inf. position where the spread is that large.

It follows that the same problem exists going in the negative direction. A -8 on TZ would likely be a lot worse on DRS/BIS, which is a more recent development.

Great idea. I like that so much better than what James is doing here. Robinson obviously is so much better than Wilson, ergo, WAR is stupid.

Right. I was looking at this a little while ago and there is lots of room for disagreement/areas to be plowed. The biggest problem with Wilson is that FanGraphs seems to ascribe BABIP as a near constant measure of defense. So Earl Wilson's 6.6 WAR or whatever on Baseball Reference is being fueled by a .250 BABIP! It was a dead ball era (I think BABIP is around .280 back then) but Wilson's BABIP is freakin awesome.

Is Wilson inducing weak contact or is he really lucky? FanGraphs says he's lucky, so that .250 BABIP gets adjusted into a FIP that is higher than his ERA+. Thing is, Wilson also produced a similar BABIP in 1967, so I don't think it's just luck. We had this discussion a few days ago in the Andy Pettitte thread.

The other guy who's getting jobbed by BABIP/FanGraphs in 1966: Juan Marichal. He's hitting 10 WAR and 9 WAR in 1965-66, fueled by BABIPs of .238 and .223! That gets cut by FanGraphs to 6.8 and 6.3. Gee, I don't know what to think. Conversely, his teammate Gaylord Perry is going in the other direction. His FIP skews equally in the opposite direction, suggesting he's better than that same SFG defense that is helping Marichal.

I don't have any conclusion other than it would be interesting to study teammate situations, and two or three year trends, to see which method is better. It's just so weird that the SFG defense is so good for Marichal and so leaky for Perry. You can also see BABIP trending upward in the careers of both Wilson and Marichal and probably many others. hmm

Other food for thought:

How do folks feel about positional adjustments for this sort of thing? Should a pitcher get more credit for eating innings?

We might as well throw Tommy Agee and Yaz into the 1966 discussion. These guys field better than Frank and we might as well be thorough if we're going this far.

Could throw in Jim Kaat (6.4 WAR on FG) and Sam McDowell in there (4.4 FG vs 5.0 BRef) as well.

Certainly if modern day infielders can save 30 runs vs replacement we should find some AL infielders who are being jobbed by hackneyed TZ. Tom Tresh is one candidate, or Dick McAuliffe or Clete Boyer might be possible. I'm not sure who the catcher would be.

70. Rally Posted: December 01, 2020 at 12:42 PM (#5991840)
Frank was +71 runs batting. Agee was +20, Yaz +15. So they’d have to be 50 runs better with the glove to make it a contest. Both rate very well by TZ, +23 and +18. That’s about the limits of what a good fielding outfielder can do based on Statcast Outs above average. You really can’t get much more out of the ratings than what Tommy and Carl are getting, so to make it work we’d have to suggest that instead of -8, Frank was really an Adam Dunn level disaster out there. I’m not going there. He was only 30, though I’ve heard he was an old 30.
71. sunday silence (again) Posted: December 01, 2020 at 12:50 PM (#5991843)
No no, he can't be that bad. He could be -15 bad. So that leaves Tommy and Yaz 15 runs down. What about baserunning? Can we milk a few more runs out that way? Positional adjustment? Can Yaz play SS or something?

How did Agee get to +18 def. runs or whatever anyhow? His range seems about average and he has 12 assists. That's good, but it doesn't seem to be outstanding.

The NL seems to have all the talent in 1966, both pitching and hitting. It seems really strange. Frank came over and just lit up the league and as a fan you'd probably believe that the AL really is inferior.
72. vortex of dissipation Posted: December 01, 2020 at 01:10 PM (#5991846)
I recently put together a two-hour playlist of post-1997 McCartney music, and it was wonderful.

Is any of it rock and roll?

Yes.
73. sunday silence (again) Posted: December 01, 2020 at 01:10 PM (#5991847)
Here's the comparison of Jeter to Everett, concluding about 70 balls difference.

http://fieldingbible.com/jeter.asp

My only question is his conclusion that this represents only about 30 runs (5th para. from the end). Don't you have to add .45 runs (value of hit) vs .23 runs (the value of an out) to get .68? Also I guess we are assuming that none of these are throwing errors with men on base, which will kill you; I have no reason to think Jeter was exceptionally bad at this, though there might be some runs here.
74. Rally Posted: December 01, 2020 at 01:26 PM (#5991851)
Yeah, 70 plays by SS will mean about 50-55 runs.
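
The arithmetic behind that, as a quick sketch. The .45 and .23 run values are the ones quoted in #73, not official linear weights, and the function name is made up for illustration.

```python
def runs_from_plays(plays, hit_value, out_value):
    """Each missed play turns an out into a hit, so the swing per play is the sum of both values."""
    return plays * (hit_value + out_value)

PLAYS = 70        # Jeter vs. Everett gap, from the linked study
HIT_VALUE = 0.45  # run value of a hit, as quoted in #73
OUT_VALUE = 0.23  # run value of an out, as quoted in #73

runs = runs_from_plays(PLAYS, HIT_VALUE, OUT_VALUE)
print(f"{PLAYS} plays at {HIT_VALUE + OUT_VALUE:.2f} runs each: {runs:.1f} runs")

# Per-play swing implied by an estimate of 50 to 55 runs.
low, high = 50 / PLAYS, 55 / PLAYS
print(f"implied swing per play: {low:.2f} to {high:.2f} runs")
```

The .68 figure gives a bit under 48 runs. Getting to 50-55 takes a swing of roughly .71 to .79 runs per play, which is what you get with slightly larger values for the hit and the out.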

On Agee, it should generally correlate to range factor but not always. The method used for 1966 is to allocate all the hits when he was in the field to the defensive positions. So if a hitter has 20% of his outs on balls in play made by the CF, then if he gets a hit the CF is charged with 0.2 of a hit. Add it up, compare to league average, throw in a park adjustment and you get his TZ.

I used different methods based on what data was available, that method is roughly what was used from the fifties to the 80’s.
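
A toy version of the hit-allocation step described above, for anyone who wants to see the bookkeeping. This is only a sketch: the batters, shares, hit counts and league-average figure are invented, the function name is hypothetical, and both the park adjustment and the conversion from hits to runs are left out because their details aren't given here.

```python
from collections import defaultdict

def charged_hits(out_share, hits_allowed):
    """Split each batter's hits across fielding positions by his share of ball-in-play outs.

    out_share:    {batter: {position: fraction of that batter's BIP outs made there}}
    hits_allowed: {batter: hits he got while this defense was in the field}
    """
    charged = defaultdict(float)
    for batter, hits in hits_allowed.items():
        for position, share in out_share[batter].items():
            charged[position] += share * hits
    return dict(charged)

# Invented example data (assumption, for illustration only).
out_share = {
    "batter_a": {"CF": 0.2, "other": 0.8},
    "batter_b": {"CF": 0.3, "other": 0.7},
}
hits_allowed = {"batter_a": 10, "batter_b": 5}

charged = charged_hits(out_share, hits_allowed)
LEAGUE_AVG_CF = 4.0  # hypothetical league-average charge for the same opportunities
hits_saved_cf = LEAGUE_AVG_CF - charged["CF"]
print("hits charged to CF:", charged["CF"])
print("hits saved vs league average:", hits_saved_cf)
```

A batter who makes 20% of his ball-in-play outs to center costs the CF 0.2 of a hit each time he gets one, which is the example given above.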
75.  Posted: December 01, 2020 at 01:31 PM (#5991853)
I have no reason to think Jeter was exceptionally bad at this though there might be some runs here.

Jeter's fielding percentages were actually quite good. Per BB-Ref, his career fielding percentage (.976) was better than league-average over those seasons (.972) and he even led the league in the stat twice (2009, 2010) with six other top-5 finishes. It's one of the explanations for his Gold Gloves.
76. Lowry Seasoning Salt Posted: December 01, 2020 at 04:54 PM (#5991900)
63. Joyful Calculus Instructor Posted: December 01, 2020 at 10:19 AM (#5991810)
[47] Pretty sure that’s the first time that I’ve seen someone on this site say something nice about Murray Chass.

He was ahead of his time covering the business of baseball in a mainstream publication. He brought that side of the sport to large swaths of people just like how Neyer later did with sabermetrics. In terms of broad reach, the lasting effects of those two can be seen in so much of the baseball content we read these days. I'm sure other people eventually would have done similar, wide-reaching work as those guys, but credit to them for leading the way and I'm grateful for both writers' work.
77.  Posted: December 01, 2020 at 05:05 PM (#5991903)
How It Started: You know, if you look deeply into the statistics, you'll learn things that you weren't expecting. Like, for example, even though everyone thinks of Jim Rice as a superstar, Roy White is just as good!

How It's Going: These statistics are telling me things that I didn't expect. These statistics are bad.
78.  Posted: December 01, 2020 at 05:17 PM (#5991904)
Risk-free replacement level production readily and immediately available simply does not exist in the marketplace (*) and thus the whole enterprise collapses.

(*) Nor does anything remotely approximating it. Quick, who are the CFs who are going to give you replacement level CF production next year? How exactly are you determining that, and how big are your error bars? Are you adjusting the quality of the replacement player down to account for the risk inherent in these error bars? (**) If you can't go into the marketplace and get predictable replacement player production -- and you can't do anything close -- the concept Challenger-space-station fails.

(**) If this isn't happening, stated WAR is significantly understated.
79. bookbook Posted: December 01, 2020 at 05:23 PM (#5991905)
#78, there’s a real paucity of friction-free surfaces in the world. Yet, simplified classical physics is still extremely useful.
80. Howie Menckel Posted: December 01, 2020 at 05:26 PM (#5991906)
just stumbled across the 1995 Bill James Baseball Abstract among my belongings.

this is coming off the debacle that was 1994, of course.

anyone interested in his take on specific players - no request too small!

:)
81.  Posted: December 01, 2020 at 05:28 PM (#5991907)
#78, there’s a real paucity of friction-free surfaces in the world. Yet, simplified classical physics is still extremely useful.

It really isn't wins over any kind of player or even player archetype -- that's basically false advertising at this point -- but instead "Wins Over Baseline," only the baseline is completely arbitrary and makes no inductive or deductive sense and follows from no sensible premise or set of premises. "Mike Trout hit .... and fielded ... and baseran ....; OK, let's say you had a CF who hit ... and fielded ... and baseran ..., Mike Trout was ... wins better than this player." There's no reason to use the imaginary player's production as anything that means anything as a baseline. That production isn't available in the marketplace and doesn't have any independent tangible meaning or significance. Yeah, if you could actually go into the marketplace and get assured replacement level production it would make sense. But you can't. You can't even approximate the concept.

Win shares actually does make sense -- Team X won 92 games, which of its players contributed how many of these wins to the total? That endeavor makes perfect sense.
82. Zach Posted: December 01, 2020 at 05:36 PM (#5991909)
Ah yes, there's a term in statistics called "p-hacking" where you look through tons of data until you find something statistically significant, but it's really a false correlation because if you look hard enough to find a small p-value you will eventually. That leads to "overfitting" where you design a model based on spurious correlations.

More like the opposite of that, to be honest. Read through the Win Shares book and you'll see that he's constantly coming up with reasonability tests checking to see that similar players are evaluated similarly. That's his process, and I don't think it works for anybody who doesn't have strong opinions about the relative value of, say, Luis Aparicio and Rabbit Maranville. But James is much more resistant to simply applying a statistical technique and trusting the results than average, not less resistant.

His whole issue here is that he wants a rating system to be completely logically coherent, and he thinks that WAR is cutting corners.
83. sunday silence (again) Posted: December 01, 2020 at 05:42 PM (#5991913)
The method used for 1966 is to allocate all the hits when he was in the field to the defensive positions. So if a hitter has 20% of his outs on balls in play made by the CF, then if he gets a hit the CF is charged with 0.2 of a hit.

Rally: I wonder if you've seen this extended debate about Dave Winfield's fielding here. It starts with Bill James article but the comments are real interesting, using I believe the very same method you are describing:

https://www.billjamesonline.com/winfield_and_evans/

what do you think about the conclusions and how much error do we expect in this method?
84. snapper (history's 42nd greatest monster) Posted: December 01, 2020 at 05:52 PM (#5991916)
It really isn't wins over any kind of player or even player archetype -- that's basically false advertising at this point -- but instead "Wins Over Baseline," only the baseline is completely arbitrary and makes no inductive or deductive sense and follows from no sensible premise or set of premises.

Incorrect. Replacement level is calibrated so that an all-replacement-level team would win about 40 games, which is the worst result ever observed.
85. sunday silence (again) Posted: December 01, 2020 at 05:52 PM (#5991917)

Jeter's fielding percentages were actually quite good. Per BB-Ref, his career fielding percentage (.976) was better than league-average over those seasons (.972) and he even led the league in the stat twice (2009, 2010)

Kiko: Yes, but the issue isn't his fielding range. It's: Did he make bad throws an inordinate amount of time? Do we have any data on that? I can't imagine that he would, but I didn't really watch him much.

Another question: you have worked on another component of OF defense, and that is getting to balls and turning doubles into singles and such. Would this sort of thing be reflected in the baserunner advances data? If someone does give up more baserunner advances, would that not reflect his ability to get to base hits? How much more defensive value can be added or subtracted by studying this?
86. sunday silence (again) Posted: December 01, 2020 at 05:58 PM (#5991918)

Incorrect. Replacement level is calibrated so that an all-replacement-level team would win about 40 games, which is the worst result ever observed.

Actually I think the number is 48 wins but you have the right idea. It's like a .300 team. How many teams have you seen play .300 ball? Actually I guess the expansion Mets played .250 ball so I guess the assumption is that they were worse than replacement level.
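
For reference, the winning percentages being tossed around can be checked with a few lines of Python. The 162-game schedule is an assumption for the 40- and 48-win figures; the 1962 Mets line uses their actual 40-120 record.

```python
def win_pct(wins, games):
    """Winning percentage as a plain fraction of games played."""
    return wins / games

SCHEDULE = 162  # assumed modern schedule length

pct_40 = win_pct(40, SCHEDULE)       # the "about 40 games" figure
pct_48 = win_pct(48, SCHEDULE)       # the "48 wins" figure
pct_mets = win_pct(40, 40 + 120)     # 1962 Mets, 40-120

for label, pct in [("40 wins", pct_40), ("48 wins", pct_48), ("1962 Mets", pct_mets)]:
    print(f"{label}: {pct:.3f}")
```

So 48 wins is the figure that matches "a .300 team", while 40 wins over 162 games is roughly the .250 pace of the expansion Mets.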
87.  Posted: December 01, 2020 at 06:06 PM (#5991920)
Incorrect. Replacement level is calibrated so that an all-replacement-level team would win about 40 games, which is the worst result ever observed.

Right -- that's "Wins over Baseline" as I said. And as I've also said, the "calibration" is impossible in the real world. If the BTF community spent each winter running a parallel project to the Hall of Merit wherein they predicted which five players at each position would generate replacement level production the next season, they'd probably be off by 50% or more when those players' actual production is compared to predicted. I could let you pick five "replacement teams" for next year and you could take the closest one at each position and your projected error bar would have to be at least 20% and that's being generous.

You can't obtain guaranteed replacement level production in the real marketplace, in the way you can obtain risk-free returns in the financial marketplace. You can't even approximate the concept. That collapses the entire endeavor. All that's left is gut feeling a baseline of a "kinda passable player but not very good and in some ways kinda shitty." That's a meaningless baseline.
88. Rally Posted: December 01, 2020 at 06:14 PM (#5991923)
If the BTF community spent each winter running a parallel project to the Hall of Merit wherein they predicted which five players at each position would generate replacement level production the next season, they'd probably be off by 50% or more when those players' actual production is compared to predicted.

What would happen is:

1. Some would play better than replacement level
2. Some would play worse than replacement level
3. Many wouldn't play at all, because what team actually goes into a season trying to find playing time for replacement level talent?

But if you add up the sum of value for the ones who do play, as a group they would be very close to replacement level.
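
The claim about the group versus the individuals can be illustrated with a small simulation. This is only a sketch under stated assumptions: each player's performance is drawn from a normal distribution centered exactly on replacement level (0.0) with a made-up spread of 1 win, and none of these numbers come from real data.

```python
import random
import statistics

def simulate_group(n_players, spread=1.0, seed=0):
    """Draw each player's wins relative to replacement level.
    Assumes the true mean is exactly 0.0 (replacement level)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, spread) for _ in range(n_players)]

def summarize(performances):
    """Return (group mean, typical individual miss in absolute terms)."""
    group_mean = statistics.fmean(performances)
    typical_miss = statistics.fmean(abs(p) for p in performances)
    return group_mean, typical_miss

group_mean, typical_miss = summarize(simulate_group(20))
print(f"group mean: {group_mean:+.2f} wins, typical individual miss: {typical_miss:.2f} wins")
```

Under these assumptions the group mean sits much closer to zero than any typical individual does, which is the point being made about the group. A team that can only play one of those players at a time is exposed to the larger individual spread, which is the objection raised in reply.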
89. Rally Posted: December 01, 2020 at 06:15 PM (#5991925)
Rally: I wonder if you've seen this extended debate about Dave Winfield's fielding here. It starts with Bill James article but the comments are real interesting, using I believe the very same method you are describing:

I kind of remember it happened but have forgotten anything about it.
90. SoSH U at work Posted: December 01, 2020 at 06:26 PM (#5991928)
Did he make bad throws an inordinate amount of time?

It looks like about half his errors in his career-high 24-error 2000 season were on throws, but I don't know what the typical figure is.

If I had to guess, I'd say that absent any specific trend, he would be more likely to get charged with a throwing error than the average player, if only because fielding errors can be scored hits, and a longtime superstar such as Jeter would get that benefit more frequently than a mediocre player. Throwing errors are much harder to write off that way.

But that wouldn't speak to any particular trend he had on miscues fielding v. throwing.
91.  Posted: December 01, 2020 at 06:31 PM (#5991930)
What would happen is:

1. Some would play better than replacement level
2. Some would play worse than replacement level
3. Many wouldn't play at all, because what team actually goes into a season trying to find playing time for replacement level talent?

Exactly -- but replacement level isn't adjusted down for that uncertainty of performance as it should be to make any sense.

But if you add up the sum of value for the ones who do play, as a group they would be very close to replacement level.

Be happy to wager on the validity of that one, but even assuming it's true, the real world doesn't let you play 20 CFs at a time and aggregate their performance.
92. Never Give an Inge (Dave) Posted: December 01, 2020 at 07:02 PM (#5991936)
SBB, you're basically just arguing that the replacement level is too high. Fair enough, but that's just an opinion that needs support, not something that follows from any sort of principles that you've established here. There still has to be a baseline of some sort in order to compare players with different amounts of playing time.

Exactly -- but replacement level isn't adjusted down for that uncertainty of performance as it should be to make any sense.

Be happy to wager on the validity of that one, but even assuming it's true, the real world doesn't let you play 20 CFs at a time and aggregate their performance.

Not sure what your point is. Sometimes the replacement-level guy plays at a below replacement level. Sometimes Edwin Diaz or Andrew Benintendi does the same thing. So what, that doesn't change the fundamental exercise.
93. snapper (history's 42nd greatest monster) Posted: December 01, 2020 at 07:16 PM (#5991938)
Exactly -- but replacement level isn't adjusted down for that uncertainty of performance as it should be to make any sense.

Why should it be. If the mean is zero the positive variance offsets the negative.

You can't obtain guaranteed replacement level production in the real marketplace, in the way you can obtain risk-free returns in the financial marketplace.

You can't actually do that either, except for a very short time. One month T-bills might be risk free, but 10 yr. Notes have a lot of interest rate risk.
94.  Posted: December 01, 2020 at 07:16 PM (#5991939)
SBB, you're basically just arguing that the replacement level is too high.

I'm arguing that there's no such thing as replacement level. As to the "too high" point, the only thing I'm arguing is that the replacement level that's wrongly assumed to exist pegs the level too high. But that's neither here nor there; the choice of level is completely arbitrary in the first instance. It's a level. So are a bunch of other levels. There are a lot more variables and what not, but at the end of the proverbial day there's no more substance to WAR than there is to noting that a .300 batting average is []% above a .275 average and []% above a .250. Or that Mike Trout played 25% better than a player who played 25% worse than him and 45% better than a player 45% worse than him. That's what it reduces to.

Not sure what your point is.

That there's no such thing as replacement level.
95.  Posted: December 01, 2020 at 07:22 PM (#5991941)
Why should it be.

Because it was premised on the idea of freely available talent. Once that premise fades away, it's just a performance level just like any other performance level. You might just as well use "performance above really good player" or "performance above really shitty player." It's entirely arbitrary.

If the mean is zero the positive variance offsets the negative.

But if you get the negative, you haven't obtained freely available replacement level performance.

One month T-bills might be risk free, but 10 yr. Notes have a lot of interest rate risk.

Not if you hold them to maturity. If you hold them to maturity, and they don't default, you know your return exactly. Compare and contrast the "returns" of baseball players. Kinda ... not really that, right? (The risk free rate is a well-known feature of modern finance; no need to relitigate that one. The only weirdo quibble one could make is that US debt isn't cosmically 100% assured against default, but it's 99.999% and that's good enough.)
96.  Posted: December 01, 2020 at 07:41 PM (#5991944)
The last sentence of 94 could use a bit of refinement. There IS such a thing as "the level of performance that freely-available talent has established in the succeeding year over the past X years of baseball history." That's empirically measurable. That could be a useful proxy for at least the level of performance being contemplated by the WARriors, but to get it right, you'd have to know the standard deviation of the performance as well. But even if you measured that right, which is doable, you couldn't go into the freely-available real world performance market and purchase that performance.
97. snapper (history's 42nd greatest monster) Posted: December 01, 2020 at 09:35 PM (#5991953)
Not if you hold them to maturity. If you hold them to maturity, and they don't default, you know your return exactly.

Nonsense. Locking in a return of 0.91% is not risk free. If rates rise significantly over the period, you have lost substantially. If you want risk-free you have to have near zero-duration.

The one month Tsy is at 0.07%, the 10 year is at 0.91%. The only reason you get 84 extra bps is because you're taking duration risk. If rates rise 2%, the 10-Yr Note will lose something like 15-20% of its value. That's risk.

Holding to duration only spreads the loss over multiple years. In a trading portfolio, you have to mark the loss to market immediately.

The "risk-free rate" in WACC calculations (for example) is every bit as theoretical as replacement level.
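
The 15-20% figure above can be sanity-checked with a simple bond-pricing sketch in Python. Assumptions, none of which are in the original comment: annual coupons (actual Treasury notes pay semiannually), a note issued at par with a 0.91% coupon, and an immediate 2 percentage point rise in yield.

```python
def bond_price(coupon_rate, yield_rate, years, face=100.0):
    """Price of an annual-pay bond: discounted coupons plus discounted face value."""
    coupon = face * coupon_rate
    pv_coupons = sum(coupon / (1 + yield_rate) ** t for t in range(1, years + 1))
    pv_face = face / (1 + yield_rate) ** years
    return pv_coupons + pv_face

COUPON = 0.0091          # 10-year yield quoted in the comment
SHOCK = 0.02             # assumed instant rise of 2 percentage points

price_before = bond_price(COUPON, COUPON, 10)          # priced at par
price_after = bond_price(COUPON, COUPON + SHOCK, 10)   # same note, higher yield
loss = 1 - price_after / price_before

print(f"before: {price_before:.2f}, after: {price_after:.2f}, loss: {loss:.1%}")
```

Under these simplified assumptions the mark-to-market loss lands inside the 15-20% range quoted above.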
98. Never Give an Inge (Dave) Posted: December 01, 2020 at 10:25 PM (#5991960)

Replacement level is just one component of WAR, and it’s pretty much irrelevant when comparing players of equal playing time. So for things like MVP discussions it’s largely moot. You can compare them against average (WAA) or whatever baseline you want to choose.

But if you do want to compare guys who played different amounts of time, where you set the baseline is important. You can’t simply throw your hands up and complain there’s no such thing as freely available talent. I mean you can, but it doesn’t get you any closer to the truth.
99. Walt Davis Posted: December 02, 2020 at 01:25 AM (#5991972)
I think the ultimate answer is that Bill James is a gifted narcissist. His high opinion of himself led him to defy conventional wisdom and make meaningful breakthroughs. As soon as anyone besides himself was having insights, sometime overriding his own progress, he got nasty.

Hey, that's my schtick.

On Robinson vs Wilson hitting -- chill man, I was just having fun with the fact that Wilson had (nearly) exactly 1/6 of Robinson's PAs and if you multiplied his numbers by 6, he (nearly) matched Robinson on R, RBI and HR. The proper reaction to that is "damn, Earl Wilson could hit!" not "the proper way to make this comparison ..."

Somebody mentioned Wilson's fielding. bWAR (and near as I can tell fWAR) just assumes its effects are captured in RA which, on average in large samples, it will be. But pitchers have the luxury that fielding miscues that don't actually result in runs essentially don't count (but then the ones that do, count in full) while WARpos gets dinged regardless of the result. TZ isn't even calculated for pitchers; DRS is, but it only appears on b-r under a pitcher's advanced fielding stats.
100. Ron J Posted: December 02, 2020 at 07:07 AM (#5991979)
#99 I've made the point often enough that Wilson had top tier power. Considering the time, 35 HR in 838 is very good. Not Killebrew and company but ...

A one trick pony, but it's a good trick.

Terrible PH. Not that you really want a .195 hitter who didn't walk much in most PH scenarios.

