Baseball Primer Newsblog— The Best News Links from the Baseball Newsstand
Sunday, November 04, 2012
Nate Silver, Cosh, Tango, PECOTA, Marcel the Monkey!...It’s almost like the golden days of Primer! Backlasher, RossCW come on down!
The whole world is suddenly talking about election pundit Nate Silver, and as a longtime heckler of Silver I find myself at a bit of a loss. These days, Silver is saying all the right things about statistical methodology and epistemological humility; he has written what looks like a very solid popular book about statistical forecasting; he has copped to being somewhat uncomfortable with his status as an all-seeing political guru, which tends to defuse efforts to make a nickname like “Mr. Overrated” stick; and he has, by challenging a blowhard to a cash bet, also damaged one of my major criticisms of his probabilistic presidential-election forecasts. That last move even earned Silver some prissy, ill-founded criticism from the public editor of the New York Times, which could hardly be better calculated to make me appreciate the man more.
...For most players in most years, Silver’s PECOTA worked pretty well. But the world of baseball research, like the world of political psephology, does have its cranky internet termites. They pointed out that PECOTA seemed to blunder when presented with unique players who lack historical comparators, particularly singles-hitting Japanese weirdo Ichiro Suzuki. More importantly, PECOTA produced reasonable predictions, but they were only marginally better than those generated by extremely simple models anyone could build. The baseball analyst known as “Tom Tango” (a mystery man I once profiled for Maclean’s, if you can call it a profile) created a baseline for projection systems that he named the “Marcels” after the monkey on the TV show Friends—the idea being that you must beat the Marcels, year-in and year-out, to prove you actually know more than a monkey. PECOTA didn’t offer much of an upgrade on the Marcels—sometimes none at all.
PECOTA came under added scrutiny in 2009, when it offered an outrageously high forecast—one that was derided immediately, even as people waited in fear and curiosity to see if it would pan out—for Baltimore Orioles rookie catcher Matt Wieters. Wieters did have a decent first year, but he has not, as PECOTA implied he would, rolled over the American League like the Kwantung Army sweeping Manchuria. By the time of the Wieters Affair, Silver had departed Baseball Prospectus for psephological godhood, ultimately leaving his proprietary model behind in the hands of a friendly skeptic, Colin Wyers, who was hired by BPro. In a series of 2010 posts by Wyers and others called “Reintroducing PECOTA”—though it could reasonably have been entitled “Why We Have To Bulldoze This Pigsty And Rebuild It From Scratch”—one can read between the lines.
Reader Comments and Retorts
Statements posted here are those of our readers and do not represent the BaseballThinkFactory. Names are provided by the poster and are not verified. We ask that posters follow our submission policy. Please report any inappropriate comments.
1. villageidiom Posted: November 04, 2012 at 08:59 AM (#4292491)
That is a very good reason for shunning PECOTA in favor of systems made by people trying to advance the state of knowledge instead. Black boxes are inherently less useful than models where you can see all the gears, maybe even tweak if you feel like tinkering. Of course people should be told that there are things just as good out there that are transparent and free.
Next year Pecota saddled me up with a whole host of bums..weird..like it KNEW how to screw up a team..not just the rooks, injury cases, and psychotic maniacs it routinely overrates..It picked them ALL.
No more Pecota.
This piece does a good job of arguing that Silver's baseball projections, like his political projections, aren't notably better than the projections put together by other smart folks in the field. In 2008 and 2010, Silver's projections did fine, but not notably better than other folks in the field. This seems like a good and important point - Silver isn't a "wizard", he's a good writer with a good model that spits out results of a quality similar to the models of other folks who aren't as good at writing.**
Then parts of the article seem to hint at something much worse, that Silver is actually terrible at projecting things. The huge blockquote from that Wyers article is ironic - he says we can't take Silver's projections at face value given that they're black-boxed, but he takes Wyers' criticism at face value even though Wyers doesn't actually open up the box for anyone to test his critiques either. That part of the article has no actual data behind it, and is entirely unconvincing.
**I wouldn't be surprised at all if Sam Wang's simpler, open-source model is better at projections than Silver's black-boxed, likely overfitted model, but Wang isn't close to Silver's class as a writer and I'm going to keep reading Silver regardless of which models overperform his.
At least it made the Wieters run on time.
I first did projections in 2006. The first year wasn't that good. From 2007-2010 I had no problem beating PECOTA. Since then I haven't published anything in the public domain. After 2007 frankly I was more worried about competing with ZIPS than PECOTA. Nate's last year running it was 2008 I think.
Saw Silver on CBS Sunday morning today. His fans for the most part are as clueless as his detractors. Despite the press, this doesn't require anything high tech or need advanced statistical techniques. You just need the data. Take a simple average of the polls within each state, add up the electoral votes, and you get the same results that Silver is touting. 4th grade math will suffice, if you have the data. And your predictions are only as good as the data. As Nate has said himself, Romney's chance at a win pretty much comes down to the polls being wrong in a systematic fashion.
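The "simple average, then add up the electoral votes" recipe can be sketched in a few lines. This is only an illustration: the state names, poll margins, and electoral-vote counts below are made up, and `tally` is a hypothetical helper, not anyone's actual model.

```python
from statistics import mean

# Hypothetical poll margins (candidate A minus candidate B, in points) and
# hypothetical electoral-vote counts. None of these are real polling data.
polls = {
    "State X": [2.0, 3.0, 1.0],
    "State Y": [-4.0, -2.0],
    "State Z": [0.5, -0.5, 1.5],
}
electoral_votes = {"State X": 18, "State Y": 11, "State Z": 29}


def tally(polls, electoral_votes):
    """Average each state's polls and award its electoral votes to the leader.

    An exact tie in the average goes to B here, purely as a simplification.
    """
    totals = {"A": 0, "B": 0}
    for state, margins in polls.items():
        winner = "A" if mean(margins) > 0 else "B"
        totals[winner] += electoral_votes[state]
    return totals


result = tally(polls, electoral_votes)
print(result)
```

Note that this version throws away the size of each lead, so a 5 point lead and a .5 point lead count the same.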
The probabilities can matter. A 5 point lead in Ohio, for example, would make for a very different race than a .5 point lead.
…Like a Bizarro-world subway system where texting while drunk is mandatory for on-duty drivers, there were many possible points of derailment, and diagnosing problems across a set of busy people in different time zones often took longer than it should have. But we plowed along with the system with few changes despite its obvious drawbacks; Nate knew the ins and outs of it, in the end it produced results, and rebuilding the thing sensibly would be a huge undertaking. We knew that we weren’t adequately prepared in the event that Nate got hit by a bus, but such is the plight of the small partnership.
This is a point that I was trying to make in the other thread, but got sidetracked away from emphasizing. Part of the reason you want to use a mathematically sound approach is so that you can be really rigorous about what the model is actually telling you and what you're putting in yourself. If you don't have a really good idea of how you're describing a system, you'll tend to write a program that grows exponentially as you try different things, fix some bugs, introduce some others...
If you have a really good idea about the mathematical relationships between different working parts, you can zero in on problems without introducing new complexity. It might not show up the first time you try to test your results, but in the long run a simple model that's mathematically sound will tell you a lot more than a complex model that might seem to fit historical data better.
stat folks should be celebrating these inroads versus sounding like a bunch of pouty pollies claiming this and that about his approach
sounds small and if i was in a harsher mood might write the word pathetic
edit
to be clear if folks have an honest intellectual critique that is fine. but this harping claiming the guy is doing something anyone could do rings hollow. plenty of folks 'could' have done it. and some have tried to compete
have not seen anyone do it like silver. so if it's so easy where are the folks rushing to be the next cool thing? that is how markets work. there is no cost to entry.
oh wait, that's right, it's a lot of work.
not directed at you. just wanted to clarify
anyone can be a critic
In contrast, something like Linear Weights is trivial, both to calculate and to improve upon. You can add in timeline or park adjustments, calculate new weights for different years or run scoring environments. You can do it in any language you want, and the computer time will be trivial on a modern computer. Rigor isn't just about mathematical pissing contests -- it's a way to save effort and increase reliability.
It's the same as all the developers who talk about how trivial Twitter is and how they could have done it. Yep, could have. Didn't.
well, he did the grunt work. sorry lazy people.
part marketing the idea in a way that appeals to the public
This was part of my problem with BPro -- it existed before Silver/PECOTA so I see no reason to blame Nate for it. But every edition of the annual bragged about all the stuff they got right the year before, oddly not mentioning all the stuff they'd gotten wrong.
In fairness, marketing anything statistical in an honest fashion is nearly impossible because it's all about probability and uncertainty. We laugh about it, but Wieters at his 5th percentile is not an incorrect prediction -- the model is only wrong if the Wieters of the world end up at their 5th percentile or worse substantially more than 5% of the time. When you evaluate a proposed model like this in the real world, you run thousands of simulations from a pre-determined distribution and see if you reproduce that distribution. In baseball, you get to run that experiment once for Wieters and maybe a dozen times for players like Wieters.
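A rough way to put a number on "substantially more than 5% of the time" is a binomial tail: if the 5th percentiles are honest, how often would at least k of n comparable players land below theirs by chance alone? The sketch below assumes the players' outcomes are independent, which is a simplification, and `tail_probability` is a hypothetical helper name.

```python
from math import comb


def tail_probability(n, k, p=0.05):
    """P(at least k of n independent outcomes fall below the p-quantile)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# One player missing low once: exactly what a calibrated model allows 5% of the time.
one_player = tail_probability(1, 1)

# A dozen comparable players, three of them below their 5th percentile.
dozen_players = tail_probability(12, 3)

print(f"1 of 1 below the 5th percentile:  {one_player:.3f}")
print(f"3 of 12 below the 5th percentile: {dozen_players:.3f}")
```

A single miss says almost nothing; several misses in a small class of similar players is where the evidence starts to pile up.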
Take Jason Heyward. He's had OPS+s of 131, 93 and 117. Of course no model is going to come close to pegging those on the money. (Neither is any scout or manager or GM.) That's the kind of input data baseball projections work with -- way more noise than signal. Calling any of these guys out for getting one player spectacularly wrong is pointless -- fun, but pointless. The most you can hope a model would do is to look at Heyward's 131 and say "he probably is (or is not) that good." You've got to find entire types that they get wrong or you have to find the model that produces a smaller MSE (or less bias or something) before you can start saying the model is "wrong."
But similarly every projection should be saying more than "Heyward is projected to a 133 OPS+". So PECOTA published 5th and 95th percentiles. These tended to be quite wide (nature of the data) but this was also Silver saying quite clearly "even for this broad range, I know I'm still going to be dead, dead wrong 10% of the time." Ten percent is not a small percentage really. ZiPS is projecting something like 1500 players a year and 150 are going to be really, really bad projections. Look back at the old TO Swisher-Betemit post. Swisher's 90% confidence interval for OPS+ was about 75 to 145. All told there was a 15-20% chance Swisher was going to repeat his White Sox season. That's as close as ZiPS could get its prediction (at the time, things may have improved). That's not a bad model, that's highly variable data with low signal.
Take a simple average of the polls within each state
You can do better than this. First, the means should be weighted -- a 2 point lead in a poll of 1000 is worth more than a 3 point lead in a poll of 500. Well, kinda and maybe, because it's also easy to calculate a standard error on that. Each sample is independent, so your weighted mean is the mean of a sample equivalent to the full sample size -- assuming reasonably similar methodology, and they are all using reasonably similar methodology. It's quite easy to calculate the standard error of a proportion: sqrt(P(1-P)/n). The standard error is about 1.6% at n=1000 (or a margin of error of about +/- 3%) and you can divide that by 1.4 for each doubling in sample size.
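A minimal sketch of the sample-size weighting and the sqrt(P(1-P)/n) standard error, assuming the polls are independent simple random samples with similar methodology. The poll shares and sizes are invented for illustration, and the function names are hypothetical.

```python
import math


def se_proportion(p, n):
    """Standard error of a proportion p estimated from a sample of size n."""
    return math.sqrt(p * (1 - p) / n)


def pooled_estimate(polls):
    """Sample-size-weighted mean share, its standard error, and the pooled n.

    polls: list of (share, sample_size) pairs, treated as one combined sample.
    """
    total_n = sum(n for _, n in polls)
    p = sum(share * n for share, n in polls) / total_n
    return p, se_proportion(p, total_n), total_n


# Made-up example: 51% in a poll of 1000 and 51.5% in a poll of 500.
p, se, n = pooled_estimate([(0.51, 1000), (0.515, 500)])
print(f"pooled share {p:.4f}, n={n}, SE {se:.4f}, 95% margin +/- {1.96 * se:.4f}")

# Doubling the sample size shrinks the standard error by sqrt(2), about 1.4.
ratio = se_proportion(0.5, 1000) / se_proportion(0.5, 2000)
print(f"SE at n=1000: {se_proportion(0.5, 1000):.4f}, ratio to n=2000: {ratio:.3f}")
```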
If there's a tricky bit it's in how far back timewise you go. Still this is why Obama "appears" safe (in the RDP meaning of the term!) -- average the last 4-5 polls in any state and you're talking about a sample size of about 4,000 which means a margin of error down to 1.5%. While no single poll may give Obama a comfy lead in Ohio, the pooled analysis puts him outside the margin of error.
Assuming the polls are of sufficient quality. The response rate in phone surveys is pretty abysmal.
Now take that back to baseball. For a sample of 4,000, you can get about a +/- 1.5% margin of error on a proportion (anywhere near 50%). That's 6-7 full seasons' worth of PAs and let's say the proportion we are interested in is OBP. Well, 1.5% is 15 points of OBP so your best guess is that the player's true talent OBP is somewhere in the 320-350 range (for example). Now, in the upcoming season, that 320-350 OBP gets to play out across 600 random PAs which would probably give us a range something like 300-370 if not wider.
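The back-of-envelope range above can be checked with the same binomial arithmetic. This sketch assumes each PA is an independent trial with a fixed on-base probability, and that the uncertainty about true talent and next season's sampling noise simply add in variance. It ignores aging, park, and everything else a real projection would handle. The function name and the .335 OBP are illustrative only.

```python
import math


def observed_obp_interval(obp, past_pa, future_pa, z=1.96):
    """Approximate 95% range for next season's observed OBP.

    Combines uncertainty about true talent (estimated from past_pa)
    with binomial noise over future_pa plate appearances.
    """
    talent_se = math.sqrt(obp * (1 - obp) / past_pa)
    season_se = math.sqrt(obp * (1 - obp) / future_pa)
    total_se = math.sqrt(talent_se**2 + season_se**2)
    return obp - z * total_se, obp + z * total_se


# Margin of error on true talent from 4,000 past PAs: roughly 15 points of OBP.
talent_margin = 1.96 * math.sqrt(0.335 * 0.665 / 4000)

# Range for the observed OBP over the next 600 PAs.
low, high = observed_obp_interval(0.335, past_pa=4000, future_pa=600)
print(f"talent margin +/- {talent_margin:.3f}; next-season range {low:.3f} to {high:.3f}")
```

Under these assumptions the range comes out a little wider than 300-370, in line with the "if not wider" caveat.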
That, effectively is what happened. Granted, there's only one Wieters, but a whole class of people got bad predictions (principally) because BPro used an insanely optimistic EL league factor (see post 3). It failed the laugh test before and after the season.
Anyway, I think MCoA (among others) is mostly correct on Silver - for presidential elections, I'm not sure the model has much value added, but the mind behind it does and I go to his site daily.
What did Baseball Prospectus and Nate come along and do that Shandler at BBHQ wasn't doing sooner, in terms of basic philosophy?
DVOA and DYOA are:
1. wonderfully self-consistent
2. utterly useless in the real world
Bingo
Brian Burke has a really good post today about it.
You're velcome! ;-)
Full credit to JE for dropping it in the political thread. I should have said that in the first post. My bad.
I think one reason that Silver is getting so much flak in recent weeks is that unlike other notable polling aggregators like Real Clear Politics, he's just one person. It's so much easier to ding a singular person than to ding Generic Polling Aggregator because you disagree with them. It's his model, it's his forecast, it's his credibility on the line. He has few safety nets and buffers if he is wrong. If RCP is wrong, they can conceivably fire people and get back on track. If Nate is wrong, his career is in doubt.
Essentially, he got crowned as a genius by a bunch of stupid people, for saying a bunch of stuff that was patently obvious to anyone not invested in all the horserace BS that dominates political coverage in major news orgs. That's not Silver's fault, obviously, and as HW said above, it's hard to begrudge the guy for taking advantage of the situation he found himself in. But I think it's worth keeping in mind who his audience really is - not the people who understand basic statistical concepts, and certainly not the people who are creating advanced statistical models themselves. No, his audience is the morons who employ him at the NYT and the braindead children in the media who get distracted by every poll that's released during an election season.
And frankly, I think it's awesome that someone like him, with the patience to explain how these things work, actually finds himself with a prominent voice. Because he's singlehandedly raising the IQ of polling discussions a few desperately needed points.
Yes he has. He's done a great job. The impatience occasionally shows through, but he's awfully good at explaining what he's doing.
thankfully burke is not nitpicky like so many others.
but it does baffle me about stats people looking to undermine a guy successfully pimping the value of their field. that's some crazy stuff.
It happens all the time with popularizers. Ask an evolutionary biologist about Stephen Jay Gould back in his heyday and all you would hear is complaints about how he simplifies things, and he acts as if debunked principles are still known to be true because the new theory is too hard to turn into catchy metaphors, and he promotes the work of his allies at the expense of his rivals, etc.
Mass confusion and delusion is typically most effectively countered by simple truths.
he may be wrong. he may be right. but compared to tons of people in the political arena - in fact, virtually everyone no matter what their political beliefs - he looks like a genius.
i'm not comparing him to someone like sam wang, who i only recently discovered. and i'm sure there are others. but compared with anyone on cnn / msnbc / fox, nate comes out far ahead.
i'm reading his book, and i find it an interesting read. he even admits there's a huge opportunity in the political field and he just happens to fill that niche. i have a feeling that when the 'better' punditry is taking place, nate will have moved on to another area - just as he has moved on from poker and baseball.
Did someone at BPro enter bad MLE translations for Wieters' minor leagues after Nate left, or did Silver leave them with a Pecota containing bad MLE translations?
If it's the latter, then Silver should take a (minor) hit for the Wieters forecast error.
From the description of Pecota, it seems like Nate built a rat's nest that's difficult to maintain, let alone continue to develop, and difficult to verify as working as designed (that the inputs to every sub-part and the whole produce the outputs you expect based on your model). I don't have a great deal of experience with Excel, but I do have experience building complex applications and simulations in C, C++, Java, and even Objective C. It sure seems to me that Excel is a terrible choice for building complex systems. I've built some large Excel spreadsheets and had some basic (non-automated) means of ensuring sub-sheet calculations are verified, so maybe a stud Excel jockey like Nate had no problems ensuring all the parts of the model were correctly calculated (with the exception of some bad inputs like the Wieters MLE translations data).
Making Pecota proprietary probably added another barrier to verification of its operation. If Pecota could have been open sourced (and I understand it probably couldn't and still have the value it had, esp. given Nate's proprietary ideas and insights), it would have benefited from many eyes finding flaws that Nate and his partners might have overlooked.
I think it was the size of the miss that made it example worthy, it didn't seem like a 1 or even 2 standard deviations error. The problem is for us statistics 101 dropouts no one is quantifying how much error margin is reasonable, and how much sample size is necessary to prove Nate's predictions are good or bad. I believe Cosh is saying that many more elections are needed to know, but how many?
I agree with Harvey's well written post BTW. I more than admire Nate Silver for "getting into the ring", the fact is I'm jealous. Not of the fame he's gained, but because he has worked on some really cool ideas that other people (like me) just spout off about "I could do this, if I found the time". He found the time, he did it, and must have had a great deal of fun when it was finally up and running and he could just add more and more ideas to it.
But I'm not jealous of him having to do all that work in excel. Ugh.
As someone who has developed a ton of code: even if he did, it may not be his fault. There are many factors that go into making a good design. From the outside, trying to figure out who to blame for a single prediction is kind of silly (though he likely gets credit beyond what is deserved, so I guess OK).
I'm reminded of this, from Calvin Coolidge:
There almost had to be a mistake with the inputs and not the calculations.
I like Burke, and he's done some great work with regard to the NFL, win probability, and decision-making. But I also think he displays some professional jealousy at times towards those whose work has gone more mainstream.
He's quick to criticize Football Outsiders (sometimes in nitpicky ways(*)), and I get the same feeling here. He also tends to only see the numbers side of things when it comes to NFL decision-making. Now, he's absolutely right in the aggregate, but his criticisms of individual coaching decisions - which are affected by plenty of factors besides strict win probability - are often over-the-top.
(*) The "Curse of 370" is the best example, here. Now, his criticisms are certainly valid - there's nothing magical about the 370th carry that will cause a RB to get hurt - but they miss the forest for the trees. FO has often pointed out the very same fact, and really uses "370" as more of an indicator that some sort of regression is coming and that general RB overuse (could be 371 carries, could be 419) should be avoided.