Baseball Primer Newsblog — Wednesday, April 30, 2014
Dan Szymborski: 10 Lessons I Have Learned about Creating a Projection System
Repoz
Posted: April 30, 2014 at 08:56 PM | 68 comment(s)
Tags: sabermetrics, site news
Reader Comments and Retorts
1. Greg K Posted: April 30, 2014 at 10:12 PM (#4697978) I remember Chris Dial telling me around this time that you were literally living in your mother's basement. Do I have that right?
Part of this is that we often forget that the guy who puts up a really good year at 21 or 22 is likely not that good yet, he's just had a plus season. ZiPS and I were about equally "pessimistic" about Jason Heyward -- probably for different reasons but the initial reaction to his rookie year 131 OPS+ probably should have been something like "he's probably a true 115-120 OPS+ who got a bit lucky" as opposed to "OMG, he can only get better from here." Not that a true 115-120 OPS+ at age 20 isn't sufficiently impressive.
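A toy sketch of the regression-to-the-mean idea in the comment above -- purely illustrative, not how ZiPS actually weights anything: treat the observed OPS+ as a noisy sample and pull it toward a league baseline in proportion to how much evidence you have. The 600 PA and the 300-PA prior weight below are made-up numbers.

```python
# Toy regression-to-the-mean: pull an observed rate toward a prior
# in proportion to sample size. Purely illustrative; not ZiPS's method.
def shrink(observed, n, prior, prior_weight):
    """Weighted average of the observation and a prior expectation."""
    return (observed * n + prior * prior_weight) / (n + prior_weight)

# Hypothetical numbers: a 131 OPS+ rookie year over ~600 PA, shrunk toward
# a 100 OPS+ baseline with a prior worth ~300 PA of evidence.
print(round(shrink(131, 600, 100, 300)))  # ~121 -- "a true 115-120 who got a bit lucky"
```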
Speaking of Heyward ... WTF Mr Peabody? bWAR already has him at +13 Rfield for the year. In 24 games. He's got 1.4 WAR, 1 WAA despite his 60 OPS+. He's on pace for more Rfield in one season than JD Drew had in his career. Is DRS headquarters in one of those states that legalized marijuana?
Willie Mays never was significantly better than he was at age 23.
He also wasn't any better than he was at 34-35. Crazy.
Heyward is the only reason Tulowitzki isn't leading the league in both oWAR and dWAR.
I wonder how widespread this is. Isn't there something about how there's a point in the season (and not really late, like 150 games) that pythag is less predictive than the team's actual won-loss record? Maybe I have that wrong.
Until 1999. (Actually, a detached converted carriage house) I was still 21!
[Edit: 2000, actually]
I get what you mean here, but I definitely got the wrong impression. I had to look up Heyward's numbers because I thought he was off to a slow start (he is).
Monthly dWar numbers are pretty close to completely meaningless.
Yeah, it's a mistake to get hyper about WAR too early, given how volatile defensive data is in small sample sizes.
The same holds true on September 30th.
How many consecutive years does Mike Trout have to do it before we decide that yeah, he really is that good?
Let's see him win an MVP before we get too excited.
Do Bugs Bunny's ringzzz contain lots of carrots?
A certain so-and-so said "It would never be as good as Bill James." I still enjoy that.
Did we get that far? I didn't think we did, but I guess memory from 15 or 16 years ago can be kind of spotty. I didn't add the first bit until the night before it went live, so I didn't have time to check with you.
I think I do remember now being skeptical that we could beat James.
Or, I got PECOTA. Which projection system are you?
I really don't want to find out who my comps are as a human being.
Is it any better?
It was part of the practical reasons, but I was running long, so I gave the abridged version.
"10 Simple Tricks About Projections Systems Discovered by a Mom"
Just wanted to express appreciation for the effort, as this helped further the cause of baseball analytics and how folks should think about player-related forecasting.
Good work, and thanks.
Can you please make sure that article is set up as a slide show?
Like PECOTA was for the Baseball Prospectus crew, turning ZiPS into a program rather than a bunch of gigantic, interlocking spreadsheets would make things run far more smoothly. Unfortunately, I only took my required Computer Science courses in college and did just enough to get a passing grade.
This might be something worth reexamining at some point. The payoff to time investment ratio of some of the more modern languages has really changed since I was in college (I'm a few years younger than Szym). A good scripting language like Python is pretty easy to pick up and has lots of libraries for interacting with databases, etc. Doing everything in spreadsheets isn't really avoiding the need to write a program; at some point, you're just writing a program in a language (Excel) that's badly suited to it.
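To make the spreadsheet-versus-script comparison concrete, here's a minimal sketch of the kind of thing a scripting language buys you: pull season lines out of a database and compute a derived rate in a few lines. The database file, table, and column names (batting, playerID, yearID, AB, HR -- loosely Lahman-flavored) are assumptions for illustration, not anyone's actual setup.

```python
# Minimal sketch: query a SQLite database of season batting lines and
# compute home-run rate per at-bat. Table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect("baseball.db")  # hypothetical database file
query = """
    SELECT playerID, yearID, AB, HR
    FROM batting
    WHERE yearID = 2013 AND AB >= 400
"""
for player_id, year, ab, hr in conn.execute(query):
    hr_rate = hr / ab  # the derived column you'd otherwise drag-fill in Excel
    print(f"{player_id} {year}: HR/AB = {hr_rate:.3f}")
conn.close()
```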
R seems to be the wave of the future at my university, but it's just brutal to learn unless you have a strong background in object-oriented programming, plus the help files are borderline useless. The unfortunate thing with a free, open source program is that there's little financial interest in making it more user friendly or generating quality learning tools.
I didn't think the list could get any more annoying when broken into a slideshow. But now there are slideshows that make you "like" their facebook page to see #1.
But then I'm used to just diving into something I know nothing about.
Hoopsworld (a now defunct basketball site) used to make you answer a poll question to read any of their articles. And if you answered too fast you had to answer a second one. I'm not surprised they went under.
How about when something that could even be presented as a slideshow is instead presented as a video?
But then I'm used to just diving into something I know nothing about.
r is pretty slick
So you're saying this Willie Mays-guy was pretty good?
I'll also echo Zach's comments in #29 about the increasing ease of higher level programming languages, especially Python with its vast array of both useful libraries and high quality (and often free) tutorials. Any mathy person with solid general computer literacy should be able to start doing useful things with Python after a pretty reasonable investment of time and energy.
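As a rough illustration of how quickly useful work happens in Python, here's a sketch using pandas to compute OBP/SLG/OPS from a CSV of batting lines and rank the leaders. The file name and column names are hypothetical.

```python
# Sketch: compute simple rate stats from a CSV of batting lines with pandas.
# File and column names (playerID, AB, H, BB, HBP, SF, TB) are hypothetical.
import pandas as pd

df = pd.read_csv("batting_2013.csv")
df["OBP"] = (df["H"] + df["BB"] + df["HBP"]) / (df["AB"] + df["BB"] + df["HBP"] + df["SF"])
df["SLG"] = df["TB"] / df["AB"]
df["OPS"] = df["OBP"] + df["SLG"]
print(df.sort_values("OPS", ascending=False).head(10)[["playerID", "OPS"]])
```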
Who will Mike Trout finish 2nd to in the MVP vote this year?!
Josh Donaldson.
That's actually a great idea. Even as someone who uses R semi-regularly, it can be good to get a refresher and pick up some new tricks.
I enrolled a few days ago to audit it. Need to learn SQL better for work, and I hope that framing the exercises within the context of Retrosheet or other baseball data will be useful. And if not, it's still baseball and will be worthwhile from that perspective.
Any mathy person with solid general computer literacy should be able to start doing useful things with Python after a pretty reasonable investment of time and energy.
I do most of my analysis in Excel VBA (yeah yeah, I know.... but it's what I know). My job is transitioning me out of software test and into C# development and I finished my Masters last year in Applied Mathematics. Sounds like Python is something I definitely need to learn more about. Thanks for the tip.
1. Query languages (SQL, etc.) all have limitations compared to programming languages. They are designed for people who are close to being decent programmers, but not quite there. If you're going to use one, first make sure that it can do all the things you need it to do. One thing that it will almost certainly NOT do is allow you to write anything on the database. This is by design. Query languages are supposed to provide a safeguard against users who are not exactly computer programmers messing up the database.
2. If you're going to use a programming language, find one that is compatible with your database or spreadsheet. Languages are often designed to work with one DBMS and don't work as well with others. Visual Basic, for example, is easy to learn, but it really wants you to use Access to store and organize data. Also, avoid any form of COBOL. It takes forever to write decent code in COBOL. It was originally designed for bank tellers, back in the days when people thought that you'd have to be a rocket scientist to actually program computers. So its statements don't individually do much, and you have to essentially rewrite your program in the Data Division.
3. A spreadsheet and a Database Management System (DBMS) are not quite the same thing. If you're going to use a spreadsheet, make sure that it will do everything you need it to do (just like a query language).
4. Avoid object-oriented languages if you're worried about the up-front time it will take to write your program. OOL are designed to be hard and expensive to write up front, but easy and cheap to maintain because of the class hierarchy.
Also, doing something like ZiPS, when you're just one man, is a TON of work no matter what tools you use. Dan, you're a genius to have gotten this done in any reasonable time frame. Also, living in your family's guest house is not anything at all like living in your mother's basement. It's completely respectable, and you should not take any grief over it. I mean, that's where Jane Austen lived for much of her life - her brother's guest house. - Brock Hanke (geezer)
I do not ....... agree with this.
I learned SQL on the job, and it's pretty easy to make SQL statements that just spiral out of control with joins and subqueries that propagate like mushrooms. It seems to be more of an art than a science to be able to figure out when you need to nest a subquery, and when you should break out a temporary table. Baseball queries are especially prone to this, when you're joining a master player table to a season stats table to a team season table to a manager table to a postseason table...
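For what it's worth, here's a toy sketch of the two styles described above -- one query with a nested, correlated subquery versus staging the intermediate result in a temporary table -- run through Python's sqlite3. The tables and columns (batting, players, teams) are hypothetical; which style reads better is, as the comment says, more art than science.

```python
# Toy illustration of nested-subquery vs. temporary-table styles in SQL,
# run through Python's sqlite3. Table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect("baseball.db")

# Style 1: one query, with the threshold computed in a correlated subquery.
nested = """
    SELECT p.name, b.yearID, b.HR, t.name AS team
    FROM batting b
    JOIN players p ON p.playerID = b.playerID
    JOIN teams t   ON t.teamID = b.teamID AND t.yearID = b.yearID
    WHERE b.HR >= (SELECT AVG(HR) * 2 FROM batting WHERE yearID = b.yearID)
"""

# Style 2: break the threshold out into a temporary table first.
conn.executescript("""
    CREATE TEMP TABLE hr_threshold AS
        SELECT yearID, AVG(HR) * 2 AS min_hr FROM batting GROUP BY yearID;
""")
staged = """
    SELECT p.name, b.yearID, b.HR, t.name AS team
    FROM batting b
    JOIN hr_threshold h ON h.yearID = b.yearID
    JOIN players p ON p.playerID = b.playerID
    JOIN teams t   ON t.teamID = b.teamID AND t.yearID = b.yearID
    WHERE b.HR >= h.min_hr
"""
# Both queries return the same rows; the staged version is easier to debug.
for row in conn.execute(staged):
    print(row)
```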
I'm inclined to agree with this, but I don't necessarily know what I'm talking about.
I had a two-day course on OOP once upon a time - seemed totally foreign to my way of thinking (whereas SQL seems really intuitive). I don't think I'll ever be proficient in it.
***
Nice job, Dan.
Yeah, that (release 4, anyway) was my intro to OOP (expert reasoning systems, back in the day). I have been informed that I need to "unlearn" before I can really tackle the kludged versions of OOPs such as C++, etc.
Re: SQL --- Even with the INSERT INTO option, working with a query language can be a pain if your tables aren't set up quite right. Yeah, you can get pretty much anything out, but the gymnastics required can be appalling (again, assuming you didn't get to set up your data tables optimally structured). If you have a good handle on a variety of data structures, you can usually do things more quickly *IF* you are also willing to take the time to structure and populate your own database (I tinkered with writing a scripting language for storing play-by-play scoring events for our softball team --- to handle things like runners advancing on errors, or interference, etc. ... had it laid out, was putting together the processor and then we had our twins....).
Oh, and I'll give another plug for R...my grad students are starting to use it quite heavily.
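On the roll-your-own play-by-play idea a couple of comments up, here's a minimal sketch of what parsing a homemade scoring notation might look like. The notation (loosely Retrosheet-flavored, e.g. "S8.1-3" for a single to center with the runner on first going to third) is invented here for illustration and isn't anyone's actual format.

```python
# Minimal sketch of parsing an invented, Retrosheet-flavored scoring notation:
#   "S8.1-3"     single to center field; runner on first advances to third
#   "E6.2-H;1-2" error on the shortstop; runner on second scores, runner on first to second
def parse_event(event: str) -> dict:
    play, _, advances = event.partition(".")
    result = {"play": play, "advances": []}
    for adv in advances.split(";") if advances else []:
        start, _, end = adv.partition("-")
        result["advances"].append((start, end))
    return result

print(parse_event("S8.1-3"))      # {'play': 'S8', 'advances': [('1', '3')]}
print(parse_event("E6.2-H;1-2"))  # {'play': 'E6', 'advances': [('2', 'H'), ('1', '2')]}
```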
How does it compare to SAS? I'm pretty much in SAS all day. The hipsters are trucking over to Stata. Those don't seem too different, and they're vast improvements over the shitshow that is SPSS.
SAS is good at:
* Working with large datasets;
* Repeated analyses that are run on different datasets of similar structures
SAS is not good at:
* interactive data analyses (including plotting)
* bleeding edge statistical techniques
R is basically the reverse -- less good at managing large datasets and not great at doing batch runs of repeated analyses, but it is a good choice for visualizing data and THE BEST choice for doing interactive analyses, and the stuff that will be coming out in journals in a year or two is already available in R packages.
Stata is less good at handling large datasets and does not have the bleeding edge stuff available, but it is really the easiest of the 3 for mortals (i.e. anything below an MSc in statistics) to use and it makes *the best* plots/figures for publications and distribution.
SPSS is overpriced and irrelevant; it is basically only used in institutions that aren't nimble enough to cut the cord and move to either Stata or R for interactive data analysis and modelling.
1. I will say that if you can use SAS and R, you can basically use anything else out there. Stata is much easier to work with than R or SAS, so I find it to be somewhat redundant as a skill if someone can use both R and SAS.
2. R is the current fave among academics for 2 reasons: (1) it's free; (2) academics ain't got no sense.
3. R Studio seems pretty much a straight clone of SAS/IML Studio which (near as I can tell) was around about 5 years ahead of R Studio.
4. You can call R from within SAS/IML.
The issue with R is that the routines are roll-your-own. They're public domain and used heavily, so there's certainly some quality assurance behind at least the popular ones, but you will never have the quality control that SAS does. (Stata is somewhere in the middle.) This would be less of an issue if all of the folks who were writing R packages were well-trained programmers, but they aren't -- most of them are academic statisticians who at best dabble in programming and at worst are just typing in matrix formulas. SAS has also added a lot of the "fancy" models lately, and even in the old days it could handle most of the fancy stuff, just not in obvious ways.
SAS/IML is every bit as interactive as R and the Studio seems to handle the graphics as well as R but, being in academia, I really have more experience with R Studio than IML.
If you've got access to SAS and are mainly interested in data management and statistical analysis (n > p), there's not much reason to move away from it, except that everybody will think you old-fashioned. If you're just starting out, probably learn one of the newer packages. But really, if you want a successful career as a stat analyst of some variety, learn a bit about them all. At a minimum you should probably be able to at least muddle through effectively in SAS, Stata, SPSS (still popular in many research outfits) and R, maybe Python now too. Or GenStat, EPI-Info, etc. if you're on the health side. Some specialty packages too, although I'm not sure how much future those have.
But the greater connectivity among packages should be helping -- pick whatever interface you like that gives you the data management capability in a language that fits your logic then call what you need from other packages. Like I said, from SAS you can call R and it's had SQL, Oracle, etc. interfaces forever. There are a number of other things you can call from SAS now (incl Python I think). Granted I haven't tried any of that yet -- my freaking uni is 2-3 versions of SAS behind -- but I am starting to think that I just might make it to the end of my career without having to learn 5 new packages.
Python is one I need to check out. I thought it was a programming language but I see there are a number of statisticians using it these days so obviously there's more to it than that.
By the way, I think R is great and it's what I'd use if I didn't work places that have SAS licenses (which are ridiculously expensive usually). The R team deserves kudos. And there are tons of people who work very hard at providing better interfaces, friendlier packages, better help manuals, etc. But there are now hundreds, probably thousands, of user-written procedures out there and (as with any open source) no way that quality can be assured ... and nobody to sue if it all goes wrong. :-)
This becomes more and more problematic as stats models move deeper into areas where equations can't be worked through by hand and there aren't classic test data available. When some academic develops a new model and associated R script, there's often no way for anybody to check whether it produces the correct results. Is it computationally efficient? What's its level of numeric accuracy? Does it get the right answer when the data are near-singular, or the function is at its asymptote or near a boundary? Are we sure that's a global max and not a local one?
I've seen some code written by academics and it's often a nightmare -- n-by-n matrices when they aren't needed, inverting matrices instead of using the more stable linear solver, etc. 32-bit machines and more memory are making some of that stuff less of a concern, but I'd be a lot more comfy with trained statistical programmers and a quality/testing department.
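To make the "explicit inverse vs. linear solver" point concrete, here's a small NumPy sketch of the two approaches. For well-behaved matrices they agree closely, but the solver route is cheaper and holds up better as the matrix gets ill-conditioned, which is the failure mode being described.

```python
# Solving A x = b: explicit inversion vs. a linear solver.
# np.linalg.solve factors A and back-substitutes, which is cheaper and
# numerically more stable than forming the inverse and multiplying.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))
b = rng.standard_normal(500)

x_inv = np.linalg.inv(A) @ b      # the pattern the comment is complaining about
x_solve = np.linalg.solve(A, b)   # the preferred route

# Small here, but the gap grows as A approaches singularity.
print(np.max(np.abs(x_inv - x_solve)))
```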
On the other hand, I am so old that, when I first encountered computers in college (Vanderbilt had just built a computer center in 1965, when I was a freshman), there were only five computer languages in the Western world: Machine, written in hex; Assembler, which is just machine code with mnemonics; Fortran (FORmula TRANslator), which was for rocket scientists and had only the most primitive of print statements; COBOL (COmmon Business-Oriented Language), which was for bank tellers and was so careful about preventing programmers from making mistakes that you had to define every variable in the Data Division -- and a money variable was NOT the same as a number variable, so you had to load money numbers into number variables just to do arithmetic, and you got writer's cramp from trying to use the damfool thing; and Algol (ALGOrithmic Language), which was used in Europe. There were rumors of people working on languages called "C" and "Basic", but those languages weren't finished yet. ISAM (Indexed Sequential Access Method) was the only thing even approaching a DBMS, and grad school in computers involved being able to program a 7-tape sort -- one input, one output, five working tapes -- because disc technology was just coming into the field, and tape storage doesn't lead to good experiences when trying to extract a single record from an unsorted tape. - Brock
I tried to pick up Python a while back, but then the C++ integration with Rcpp made pretty much everything I might use Python for somewhat redundant. However, if you're doing a lot of data-processing, it still seems like the best route to go (and probably has the highest upside for handling large datasets this side of Matlab).
I'm on the health side, and I can say with confidence that no epidemiologist under 45 uses EPI-Info anymore, and epidemiologists over 45 aren't getting their hands dirty doing statistical analysis themselves anyway. EPI-Info has been dead for years - it's mostly SAS/Stata/R, often depending on what part of the country you're in and whatever your data people like using.
Statistical genomics (and all the big, high-dimensional omics data) is changing so insanely quickly that a lot of the most important new methods are coming out in R. There's been an explosion of data in the past few years, but our current statistical methods haven't quite caught up to make sense of much of it, and the flexibility of R makes it great for distribution of new methods. Of course, you have to trust that the people who wrote the packages did it properly, but it wouldn't be bleeding edge if everything were crosschecked a dozen times by a team of computer scientists and came out years later.
R is what a lot of statistics departments are teaching young statisticians to use, so I assume that's what we'll be seeing a lot of for the next few years. People tend to prefer their first programming language.
Anyone who is using Python for playing around with numbers should install IPython. It's a great interactive environment, and it has a notebook feature that's a bit like Mathematica.
But not many statistical analysts are statisticians. They are probably more likely to come out of economics, health or even the social sciences. And very few people should be doing genuine statistical programming -- if you are a research outfit, you want your analysts relying almost entirely on procedures written by others. Their job is primarily data management, exploratory/descriptive analysis, choosing the appropriate model, and interpreting results. It's also true that R is increasingly being used outside of statistics and in statistics service teaching (see "free"), so its future is bright. Everybody should have a version on their personal computer.
So yes, the stats grads who go on to become applied stats academics or the occasional high-end private sector statistician can go ahead and program stuff up in R or Python or whatever. The other 99% of the world will want to analyze data in whatever package offers the best combo of interface, data management, graphics, etc. If a stage is reached where you can call R, Python, C++, etc. from any of those packages, it won't necessarily matter very much what they're written in and you will only need to know rudimentary levels of R, etc.
Which isn't to say that the combo package of choice won't be or shouldn't be R or IPython or some Google analytics interface or whatever.
It's mainly the idea that "everybody" should be doing statistical programming that scares me. R is an excellent statistical programming tool ... and, what, maybe 10,000 people in the world should be using it for that. Everybody else should be using it for analysis using existing R packages.
By the way, it's becoming increasingly hard to publish a new model, even in theoretical statistics, without submitting code. That code may not be well-written but there's rarely any need for an analyst to roll their own, even on the bleeding edge.
1. Variances on individual season projections are huge (Swisher -- Betemit trade) ... with partial credit to PECOTA
2. You can get a start on a decent MLE by whacking about 30 points of BA (and the associated SLG) off minor-league stats (a crude sketch of this adjustment appears after the list)
3. After a player has been away from a position for a couple of years, their quality at the position declines substantially
4. Late bloomers fade early
5. ZiPS is not a playing time projection system
6. Comps are even more fun than I thought and maybe even a bit useful
7. 99 times out of 100, last year's AA break-out is nowhere near ready to produce in the majors (prospects are doomed)
8. Most of my seat-of-the-pants career projections are not insane
9. Components are fun and maybe even a bit useful
10. Howie Kendrick was a true 370 BABIP hitter ... oh wait ... :-)
(Kendrick is a very good BABIP hitter. But his 2009 ZiPS projection had him at a 372 BABIP.)
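As promised in lesson 2 above, here's a back-of-envelope sketch of the crude MLE starting point: knock roughly 30 points off the minor-league batting average and scale slugging along with it. The input numbers are made up, and a real MLE would adjust for league, park, and age rather than apply a flat haircut.

```python
# Crude MLE starting point: subtract ~30 points of batting average and
# reduce slugging proportionally. Inputs are invented for illustration;
# this is the rough rule of thumb from lesson 2, not a full MLE calculation.
def crude_mle(avg, slg, haircut=0.030):
    new_avg = avg - haircut
    new_slg = slg * (new_avg / avg)  # scale SLG by the same hit reduction
    return round(new_avg, 3), round(new_slg, 3)

print(crude_mle(0.310, 0.520))  # a .310/.520 AA line comes out roughly .280/.470
```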