Baseball for the Thinking Fan

Baseball Primer Newsblog
— The Best News Links from the Baseball Newsstand

Wednesday, April 30, 2014

Dan Szymborski: 10 Lessons I Have Learned about Creating a Projection System

How did ZiPS come about? The genesis of what later became ZiPS stems from conversations I had over AOL Instant Messenger, with Mets fan, SABR social gadfly, and pharmaceutical chemist Chris Dial during the late(ish) 1990s. I knew Chris from Usenet, a now mostly-dead internet distributed discussion system.

Usenet was my introduction into the wider sabermetrics community, full of lots of other names you would recognize, like Keith Law, Christina Kahrl, Voros McCracken, Sean Forman, and scads of others. Chris and I talked about making a basic projection system whose results the public could freely access and that did 95 percent as well as projections hidden behind paywalls. The conception is similar to what Tom Tango later independently developed and dubbed Marcel.

Nothing came of that at the time. I didn’t revisit the idea of doing a projection system until after the turn of the millennium, when I was regularly writing transaction analysis for Baseball Think Factory, a startup of Jim Furtado and Sean Forman that I had been involved in since its conception in 2000. While I majored in math back in college, I was never much motivated by it unless it could be put to use making me money or analyzing sports.

I had financial flexibility at the time due to the former preferred application of math, so I had the time and ability to put together a projection system. There wasn’t any eureka moment that led to the creation of ZiPS–I didn’t fall asleep at a game until a baseball fell on my head from a Barry Bonds tree–it just seemed like a practical thing to have when analyzing transactions.

What started as a basic projection system ended up as something much more complicated. I had the idea to incorporate some of McCracken’s DIPS research into the mix, which is the reason I named it ZiPS, in honor of it. I actually intended to call it ZiPs because CHiPs was my second-favorite show as a child (behind Dukes of Hazzard), but I mistyped it as ZiPS when it finally debuted at Baseball Think Factory.

2. People Overrate the Odds of a Player Improving

Even among many who are into the sabermetric side of baseball, there’s a belief in a neat, tidy, aging curve for players. It’s nowhere near that simple. While you see this pattern in the aggregate, especially for hitters, nothing comes that easy. Many minor leaguers, even those of prominent talents, simply don’t improve past where they are 21 or 22, even at the higher minor league levels.

People also have an idea that a superstar at 22 is going to be even better at 27, but again, that’s not true, especially to the extent it may be true for a 22-year-old still putting together his skills. While a random 21-year-old is preferable to a random 25-year-old of similar abilities, the very young high achievers tend to plateau once they hit stardom.

Willie Mays never was significantly better than he was at age 23. Alex Rodriguez didn’t have a traditional 27-ish peak. Neither did Mickey Mantle or Ted Williams, and so on and so on. Mike Trout’s going to be an unreal player when he hits 27, but he’s unlikely to be in a different tier of craziness than he is/was from 2012-2014.

Repoz Posted: April 30, 2014 at 08:56 PM | 68 comment(s)
  Tags: sabermetrics, site news

Reader Comments and Retorts

Statements posted here are those of our readers and do not represent the BaseballThinkFactory. Names are provided by the poster and are not verified. We ask that posters follow our submission policy. Please report any inappropriate comments.

   1. Greg K Posted: April 30, 2014 at 10:12 PM (#4697978)
I don't have much to add other than to say I enjoyed reading this.
   2. PreservedFish Posted: April 30, 2014 at 10:15 PM (#4697980)
I always envisioned ZiPS as a maddeningly complex series of interlocking spreadsheets. Glad to know it's true.
   3. Tom Nawrocki Posted: April 30, 2014 at 11:17 PM (#4698007)
I had financial flexibility at the time due to the former preferred application of math, so I had the time and ability to put together a projection system.


I remember Chris Dial telling me around this time that you were literally living in your mother's basement. Do I have that right?
   4. Walt Davis Posted: April 30, 2014 at 11:44 PM (#4698017)
2. People Overrate the Odds of a Player Improving

Part of this is that we often forget that the guy who puts up a really good year at 21 or 22 is likely not that good yet, he's just had a plus season. ZiPS and I were about equally "pessimistic" about Jason Heyward -- probably for different reasons but the initial reaction to his rookie year 131 OPS+ probably should have been something like "he's probably a true 115-120 OPS+ who got a bit lucky" as opposed to "OMG, he can only get better from here." Not that a true 115-120 OPS+ at age 20 isn't sufficiently impressive.

Speaking of Heyward ... WTF Mr Peabody? bWAR already has him at +13 Rfield for the year. In 24 games. He's got 1.4 WAR, 1 WAA despite his 60 OPS+. He's on pace for more Rfield in one season than JD Drew had in his career. Is DRS headquarters in one of those states that legalized marijuana?

Willie Mays never was significantly better than he was at age 23.

He also wasn't any better than he was at 34-35. Crazy.
   5. Tom Nawrocki Posted: April 30, 2014 at 11:47 PM (#4698024)
Speaking of Heyward ... WTF Mr Peabody? bWAR already has him at +13 Rfield for the year. In 24 games. He's got 1.4 WAR, 1 WAA despite his 60 OPS+. He's on pace for more Rfield in one season than JD Drew had in his career. Is DRS headquarters in one of those states that legalized marijuana?


Heyward is the only reason Tulowitzki isn't leading the league in both oWAR and dWAR.
   6. puck Posted: May 01, 2014 at 12:07 AM (#4698032)
This is interesting:

8. Results are “Stickier” In-Season than Season-to-Season

...Simply put, there was significantly less regression toward the mean for in-season stats than you would expect from the sample size, relative to season-to-season stats...That .400 first-half BABIP may be doomed next year, but players retain a surprisingly large amount of that bounce within the same season.


I wonder how widespread this is. Isn't there something about how there's a point in the season (and not really late, like 150 games) that pythag is less predictive than the team's actual won-loss record? Maybe I have that wrong.
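
For readers who haven't run across it, "pythag" here is the Pythagorean expectation, which estimates a team's winning percentage from runs scored and runs allowed. A minimal sketch, assuming the classic exponent of 2 (other exponents are also used) and a made-up team; the function name and the numbers are hypothetical, not anything from the article:

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Estimate winning percentage from runs scored and runs allowed."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


# Hypothetical team: 120 runs scored, 100 allowed, 15-10 actual record.
expected = pythagorean_win_pct(120, 100)
actual = 15 / 25
print(f"pythag: {expected:.3f}  actual: {actual:.3f}")
```

The question in the comment is which of those two numbers better predicts the rest of the season at a given point; the sketch only shows how they are computed, not which one wins.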
   7. DJS and the Infinite Sadness Posted: May 01, 2014 at 12:48 AM (#4698041)
I remember Chris Dial telling me around this time that you were literally living in your mother's basement. Do I have that right?

Until 1999. (Actually, a detached converted carriage house) I was still 21!

[Edit: 2000, actually]
   8. tshipman Posted: May 01, 2014 at 01:53 AM (#4698051)
Heyward is the only reason Tulowitzki isn't leading the league in both oWAR and dWAR.


I get what you mean here, but I definitely got the wrong impression. I had to look up Heyward's numbers because I thought he was off to a slow start (he is).

Monthly dWar numbers are pretty close to completely meaningless.
   9. DJS and the Infinite Sadness Posted: May 01, 2014 at 01:58 AM (#4698052)
Monthly dWar numbers are pretty close to completely meaningless.

Yeah, it's a mistake to get hyper about WAR too early, given how volatile defensive data is in small sample sizes.
   10. DJS and the Infinite Sadness Posted: May 01, 2014 at 02:04 AM (#4698055)
Are! Dammit.
   11. shoewizard Posted: May 01, 2014 at 02:14 AM (#4698058)

Yeah, it's a mistake to get hyper about WAR too early, given how volatile defensive data is in small sample sizes.


The same holds true on September 30th.
   12. Hank G. Posted: May 01, 2014 at 02:24 AM (#4698060)
Part of this is that we often forget that the guy who puts up a really good year at 21 or 22 is likely not that good yet, he's just had a plus season.


How many consecutive years does Mike Trout have to do it before we decide that yeah, he really is that good?
   13. BrianBrianson Posted: May 01, 2014 at 05:33 AM (#4698066)
Trout really is that good. He's just not going to develop into Bugs Bunny.
   14. Jose Can Still Seabiscuit Posted: May 01, 2014 at 08:47 AM (#4698098)
How many consecutive years does Mike Trout have to do it before we decide that yeah, he really is that good?


Let's see him win an MVP before we get too excited.
   15. JE (Jason) Posted: May 01, 2014 at 08:48 AM (#4698099)
Trout really is that good. He's just not going to develop into Bugs Bunny.

Do Bugs Bunny's ringzzz contain lots of carrots?
   16. Foghorn Leghorn Posted: May 01, 2014 at 08:59 AM (#4698105)
Nothing came of that at the time.
Come on; we drafted some prototypes and talked about what factors we'd want. The best part was the mockery. OH TEH MOCKERY!

A certain so-and-so said "It would never be as good as Bill James." I still enjoy that.
   17. DJS and the Infinite Sadness Posted: May 01, 2014 at 10:31 AM (#4698149)
Come on; we drafted some prototypes and talked about what factors we'd want. The best part was the mockery. OH TEH MOCKERY!

Did we get that far? I didn't think we did, but I guess memory 15, 16 years ago can be kind of spotty. I didn't add the first bit until the night before it went live, so I didn't have time to check with you.

I think I do remember now being skeptical that we could beat James.
   18. Dan The Mediocre Posted: May 01, 2014 at 10:46 AM (#4698165)
To make this fit better on the current internet, it should have been titled "10 things about creating projection systems that'll make you scream!"
   19. GregD Posted: May 01, 2014 at 10:54 AM (#4698174)
And an 11th that will make you cry!
   20. SoSHially Unacceptable Posted: May 01, 2014 at 11:00 AM (#4698179)
To make this fit better on the current internet, it should have been titled "10 things about creating projection systems that'll make you scream!"


Or, I got Pecota. Which projection system are you?
   21. John M. Perkins Posted: May 01, 2014 at 11:06 AM (#4698183)
I thought creating projections for Diamond Mind was a key factor in development.
   22. Hal Chase School of Professionalism Posted: May 01, 2014 at 11:08 AM (#4698186)
Or, I got Pecota. Which projection system are you?


I really don't want to find out who my comps are as a human being.
   23. Publius Publicola Posted: May 01, 2014 at 11:09 AM (#4698187)
A certain so-and-so said "It would never be as good as Bill James." I still enjoy that.


Is it any better?
   24. DJS and the Infinite Sadness Posted: May 01, 2014 at 11:09 AM (#4698188)
I thought creating projections for Diamond Mind was a key factor in development.

It was part of the practical reasons, but I was running long, so I gave the abridged version.

"10 Simple Tricks About Projections Systems Discovered by a Mom"
   25. Random Transaction Generator Posted: May 01, 2014 at 11:36 AM (#4698207)
"10 Secret Simple Tricks About Projections that the Experts Don't Want You To Know!"
   26. Harveys Wallbangers Posted: May 01, 2014 at 11:53 AM (#4698218)
dan

just wanted to express appreciation on the effort as this helped further the cause of baseball analytics and how folks should think about player related forecasting.

good work and thanks
   27. Jose Can Still Seabiscuit Posted: May 01, 2014 at 11:55 AM (#4698223)
"10 Secret Simple Tricks About Projections that the Experts Don't Want You To Know!"


Can you please make sure that article is set up as a slide show?
   28. Pat Rapper's Delight Posted: May 01, 2014 at 12:07 PM (#4698237)
"10 Secret Simple Tricks About Projections that the Experts Don't Want You To Know! You won't believe #7!"
   29. Zach Posted: May 01, 2014 at 01:10 PM (#4698271)
I really enjoyed reading this article.

Like PECOTA was for the Baseball Prospectus crew, turning ZiPS into a program rather than a bunch of gigantic, interlocking spreadsheets would make things run far more smoothly. Unfortunately, I just took my required Computer Science courses in college and just did enough to get a passing grade.

This might be something worth reexamining at some point. The payoff to time investment ratio of some of the more modern languages has really changed since I was in college (I'm a few years younger than Szym). A good scripting language like Python is pretty easy to pick up and has lots of libraries for interacting with databases, etc. Doing everything in spreadsheets isn't really avoiding the need to write a program; at some point, you're just writing a program in a language (Excel) that's badly suited to it.

   30. Der-K and the statistical werewolves. Posted: May 01, 2014 at 01:33 PM (#4698286)
I'm 40 and have been doing everything with SAS, SQL, and spreadsheets myself - I've got to dip my toes in other pools eventually...
   31. Ron J2 Posted: May 01, 2014 at 01:38 PM (#4698289)
I'm 57 and I'm still more likely to write a few awk scripts, toss the results in a csv and fire up excel. Not that I can't do things in other ways. It's just easier for me to get to where I want to be that way.
   32. ellsbury my heart at wounded knee Posted: May 01, 2014 at 01:40 PM (#4698290)
I'm 40 and have been doing everything with SAS, SQL, and spreadsheets myself - I've got to dip my toes in other pools eventually...


R seems to be the wave of the future at my university, but it's just brutal to learn unless you have a strong background in object-oriented programming, plus the help files are borderline useless. The unfortunate thing with a free, open source program is that there's little financial interest in making it more user friendly or generating quality learning tools.
   33. puck Posted: May 01, 2014 at 01:47 PM (#4698295)
"10 Secret Simple Tricks About Projections that the Experts Don't Want You To Know! You won't believe #7!"


I didn't think the list could get any more annoying when broken into a slideshow. But now there are slideshows that make you "like" their facebook page to see #1.

   34. Ron J2 Posted: May 01, 2014 at 01:54 PM (#4698301)
#32 There are some nice R tutorials out there. I've used R to do multiple regressions and it was a pretty pain free learning curve.

But then I'm used to just diving into something I know nothing about.
   35. JJ1986 Posted: May 01, 2014 at 01:58 PM (#4698304)
But now there are slideshows that make you "like" their facebook page to see #1.


Hoopsworld (a now defunct basketball site) used to make you answer a poll question to read any of their articles. And if you answered too fast you had to answer a second one. I'm not surprised they went under.
   36. Pat Rapper's Delight Posted: May 01, 2014 at 02:06 PM (#4698308)
I didn't think the list could get any more annoying when broken into a slideshow.

How about when something that could even be presented as a slideshow is instead presented as a video?
   37. Harveys Wallbangers Posted: May 01, 2014 at 02:11 PM (#4698309)
There are some nice R tutorials out there. I've used R to do multiple regressions and it was a pretty pain free learning curve.

But then I'm used to just diving into something I know nothing about.


r is pretty slick
   38. Foghorn Leghorn Posted: May 01, 2014 at 02:17 PM (#4698315)
Is it any better?
Dunno. But the conversation was around respectableness. There is no doubt ZiPS is as well respected as anyone's.
   39. Ron J2 Posted: May 01, 2014 at 02:52 PM (#4698337)
Further to #32 the actual official documentation reminds me of the old Jerry Pournelle comment on Digital's documentation. "Encrypted and translated into Swahili".
   40. alilisd Posted: May 01, 2014 at 03:24 PM (#4698355)
Willie Mays never was significantly better than he was at age 23.

He also wasn't any better than he was at 34-35.


So you're saying this Willie Mays-guy was pretty good?
   41. Fernigal McGunnigle has become a merry hat Posted: May 01, 2014 at 04:23 PM (#4698396)
It might be worth mentioning here that the (free) "Sabermetrics 101" massive online course that Andy Andres will be teaching also serves as a basic introduction to R (and some variant of SQL). I'm not suggesting that Dan needs to take Sabermetrics 101, but those of us with less knowledge of the nuts and bolts might view it as a chance to dip our toes into R while also learning to deal more successfully with the data dumps that occasionally appear around here.

I'll also echo Zach's comments in #29 about the increasing ease of higher level programming languages, especially Python with its vast array of both useful libraries and high quality (and often free) tutorials. Any mathy person with solid general computer literacy should be able to start doing useful things with Python after a pretty reasonable investment of time and energy.
   42. Ok, Griffey's Dunn (Nothing Iffey About Griffey) Posted: May 01, 2014 at 04:30 PM (#4698402)
Let's see him win an MVP before we get too excited.


Who will Mike Trout finish 2nd to in the MVP vote this year?!
   43. Fred Lynn Nolan Ryan Sweeney Agonistes Posted: May 01, 2014 at 04:41 PM (#4698409)
Who will Mike Trout finish 2nd to in the MVP vote this year?!

Josh Donaldson.
   44. ellsbury my heart at wounded knee Posted: May 01, 2014 at 04:43 PM (#4698410)
It might be worth mentioning here that the (free) "Sabermetrics 101" massive online course that Andy Andres will be teaching also serves as a basic introduction to R (and some variant of SQL). I'm not suggesting that Dan needs to take Sabermetrics 101, but those of us with less knowledge of the nuts and bolts might view it as a chance to dip our toes into R while also learning to deal more successfully with the data dumps that occasionally appear around here.


That's actually a great idea. Even as someone who uses R semi-regularly, it can be good to get a refresher and pick up some new tricks.
   45. Pat Rapper's Delight Posted: May 01, 2014 at 04:43 PM (#4698411)
It might be worth mentioning here that the (free) "Sabermetrics 101" massive online course that Andy Andres will be teaching also serves as a basic introduction to R (and some variant of SQL).

I enrolled a few days ago to audit it. Need to learn SQL better for work, and I hope that framing the exercises within the context of Retrosheet or other baseball data will be useful. And if not, it's still baseball and will be worthwhile from that perspective.

Any mathy person with solid general computer literacy should be able to start doing useful things with Python after a pretty reasonable investment of time and energy.

I do most of my analysis in Excel VBA (yeah yeah, I know.... but it's what I know). My job is transitioning me out of software test and into C# development and I finished my Masters last year in Applied Mathematics. Sounds like Python is something I definitely need to learn more about. Thanks for the tip.
   46. bjhanke Posted: May 01, 2014 at 07:13 PM (#4698496)
I spent a lot of time in the computer industry as a programmer or doing systems documentation (when a computer house finds out that you have a Master's in English, you end up doing everybody's systems doc instead of writing code). When I started (1968), mainframe computers had 32K (not meg or gig, K) of core memory that you could pull out of the machine and look at (it looked like a slat of a beehive). The last language I had to learn was Smalltalk, a pretty abstract object-oriented language. Based not on any great claims as to programming quality, but on lots of years of experience, I'd like to offer this:

1. Query languages (SQL, etc.) all have limitations compared to programming languages. They are designed for people who are close to being decent programmers, but not quite there. If you're going to use one, first make sure that it can do all the things you need it to do. One thing that it will almost certainly NOT do is allow you to write anything on the database. This is by design. Query languages are supposed to provide a safeguard against users who are not exactly computer programmers messing up the database.

2. If you're going to use a programming language, find one that is compatible with your database or spreadsheet. Languages are often designed to work with one DBMS, and don't work as well with others. Visual Basic, for example, is easy to learn, but it really wants you to use ACCESS to store and organize data. Also, avoid any form of COBOL. It takes forever to write decent code in COBOL. It was originally designed for bank tellers, back in the days when people thought that you'd have to be a rocket scientist to actually program computers. So its statements don't individually do much, and you have to essentially rewrite your program in the Data Division.

3. A spreadsheet and a Database Management System (DBMS) are not quite the same thing. If you're going to use a spreadsheet, make sure that it will do everything you need it to do (just like a query language).

4. Avoid object-oriented languages if you're worried about the up-front time it will take to write your program. OOL are designed to be hard and expensive to write up front, but easy and cheap to maintain because of the class hierarchy.

Also, doing something like ZIPS, when you're just one man, is a TON of work no matter what tools you use. Dan, you're a genius to have gotten this done in any reasonable time frame. Also, living in your family's guest house is not anything at all like living in your mother's basement. It's completely respectable, and you should not take any grief over it. I mean, that's where Jane Austen lived for much of her life - her brother's guest house. - Brock Hanke (geezer)
   47. DJS and the Infinite Sadness Posted: May 02, 2014 at 01:23 AM (#4698660)
Thanks for reading guys. For this crowd, probably not much of it was new, given that most of you saw the development of ZiPS fairly close to real-time.
   48.   Posted: May 02, 2014 at 01:34 AM (#4698671)
1. Query languages (SQL, etc.) all have limitations compared to programming languages. They are designed for people who are close to being decent programmers, but not quite there. If you're going to use one, first make sure that it can do all the things you need it to do. One thing that it will almost certainly NOT do is allow you to write anything on the database. This is by design. Query languages are supposed to provide a safeguard against users who are not exactly computer programmers messing up the database.


I do not ....... agree with this.
   49. bjhanke Posted: May 02, 2014 at 04:06 AM (#4698685)
Shock - Can I ask why you don't agree? Your phrasing makes it seem like I may have struck a nerve, which was not what I was trying to do. I was trying to get people to make sure that, if they are going to take on a large statistics project, they look at the tools they are planning to use. It's certainly not a "you can't program if you use query languages" thing. I've used SQL, and for money. It paid just as well as any other language. But it may be very important to someone doing statistical analysis to know that the query language he is using won't write on his database; it will only issue queries. If I've written something that insulted you, I apologize. If I'm just out of date (I've been retired for a decade now), I'd love to learn what I've missed. In any case, making you upset was not my intention, then or now. - Brock
   50. Jeff R., P***y Mainlander Posted: May 02, 2014 at 09:14 AM (#4698728)
Brock, you're missing the fact that SQL supports the INSERT INTO statement, which is explicitly used to insert new data rows into a table. You may be used to populating tables with some kind of import tool, but they were probably using INSERT INTO under the hood.
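
The point about INSERT INTO can be seen in a few lines. This is a minimal sketch using Python's built-in sqlite3 module with an in-memory database; the table name, columns, and player rows are made up for illustration:

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE batting (player TEXT, season INTEGER, hr INTEGER)")

# INSERT INTO adds new rows; the ? placeholders keep the values out of the SQL text.
conn.execute(
    "INSERT INTO batting (player, season, hr) VALUES (?, ?, ?)",
    ("Player A", 2013, 27),
)
conn.executemany(
    "INSERT INTO batting (player, season, hr) VALUES (?, ?, ?)",
    [("Player B", 2013, 12), ("Player C", 2013, 31)],
)
conn.commit()

# Read the rows back to confirm the table was actually changed.
rows = conn.execute(
    "SELECT player, season, hr FROM batting ORDER BY player"
).fetchall()
print(rows)
```

UPDATE and DELETE work the same way, so plain SQL can modify a database as well as query it.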

I learned SQL on the job, and it's pretty easy to make SQL statements that just spiral out of control with joins and subqueries that propagate like mushrooms. It seems to be more of an art than a science to be able to figure out when you need to nest a subquery, and when you should break out a temporary table. Baseball queries are especially prone to this, when you're joining a master player table to a season stats table to a team season table to a manager table to a postseason table...
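
The subquery-versus-temporary-table choice might look like this on a toy example. It is a sketch only, again using sqlite3 with hypothetical tables and made-up numbers: both versions find players whose home run total beat the average for that season.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (player_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE seasons (player_id INTEGER, season INTEGER, hr INTEGER)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?)",
    [(1, "Player A"), (2, "Player B"), (3, "Player C")],
)
conn.executemany(
    "INSERT INTO seasons VALUES (?, ?, ?)",
    [(1, 2013, 30), (2, 2013, 10), (3, 2013, 20)],
)

# Version 1: a correlated subquery nested inside the WHERE clause.
nested = conn.execute("""
    SELECT p.name
    FROM players p
    JOIN seasons s ON s.player_id = p.player_id
    WHERE s.hr > (SELECT AVG(hr) FROM seasons WHERE season = s.season)
    ORDER BY p.name
""").fetchall()

# Version 2: compute the averages once into a temporary table, then join to it.
conn.execute("""
    CREATE TEMP TABLE league_avg AS
    SELECT season, AVG(hr) AS avg_hr FROM seasons GROUP BY season
""")
via_temp = conn.execute("""
    SELECT p.name
    FROM players p
    JOIN seasons s ON s.player_id = p.player_id
    JOIN league_avg a ON a.season = s.season
    WHERE s.hr > a.avg_hr
    ORDER BY p.name
""").fetchall()

print(nested, via_temp)
```

With three tables the nested form is still readable; the temporary table tends to pay off once the same intermediate result is needed by several later queries.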
   51. Morty Causa Posted: May 02, 2014 at 09:32 AM (#4698736)
That's a nice, well thought out piece, Dan.
   52. Der-K and the statistical werewolves. Posted: May 02, 2014 at 10:03 AM (#4698771)
It seems to be more of an art than a science to be able to figure out when you need to nest a subquery, and when you should break out a temporary table.

I'm inclined to agree with this, but I don't necessarily know what I'm talking about.

I had a two-day course on OOP once upon a time - seemed totally foreign to my way of thinking (whereas SQL seems really intuitive). I don't think I'll ever be proficient in it.

***

Nice job, Dan.
   53. Tom T Posted: May 02, 2014 at 10:24 AM (#4698782)
Smalltalk, a pretty abstract object-oriented language


Yeah, that (release 4, anyway) was my intro to OOP (expert reasoning systems, back in the day). I have been informed that I need to "unlearn" before I can really tackle the kludged versions of OOPs such as C++, etc.

Re: SQL --- Even with the INSERT INTO option, working with a query language can be a pain if your tables aren't set up quite right. Yeah, you can get pretty much anything out, but the gymnastics required can be appalling (again, assuming you didn't get to set up your data tables optimally structured). If you have a good handle on a variety of data structures, you can usually do things more quickly *IF* you are also willing to take the time to structure and populate your own database (I tinkered with writing a scripting language for storing play-by-play scoring events for our softball team --- to handle things like runners advancing on errors, or interference, etc. ... had it laid out, was putting together the processor and then we had our twins....).

Oh, and I'll give another plug for R...my grad students are starting to use it quite heavily.
   54. Pooty Lederhosen Posted: May 02, 2014 at 10:34 AM (#4698789)
Oh, and I'll give another plug for R...my grad students are starting to use it quite heavily.

How does it compare to SAS? I'm pretty much in SAS all day. The hipsters are trucking over to Stata. Those don't seem too different, and they're vast improvements over the shitshow that is SPSS.
   55. Der-K and the statistical werewolves. Posted: May 02, 2014 at 10:43 AM (#4698800)
Haven't used it in years, but I thought Stata was the best if all you care about is cross-sectional work. SAS was more jack of all trades - good at lots of things. (Similarly, I liked RATS for some time series applications, but there's no need for it if you already have SAS.)
   56. Russ Posted: May 02, 2014 at 11:22 AM (#4698828)
SAS is the best at:

* Working with large datasets;
* Repeated analyses that are run on different datasets of similar structures

SAS is not good at:

* interactive data analyses (including plotting)
* bleeding edge statistical techniques

R is basically the reverse -- less good at managing large datasets and not great at doing batch runs of repeated analyses, but it is a good choice for visualizing data and THE BEST choice for doing interactive analyses and the stuff that will be coming out in journals in a year or two is already available in R packages.

Stata is less good at handling large datasets and does not have the bleeding edge stuff available, but it is really the easiest of the 3 for mortals (i.e. anything below an MSc in statistics) to use and it makes *the best* plots/figures for publications and distribution.

SPSS is overpriced and irrelevant; basically only is used in institutions who aren't nimble enough to cut the cord and move to either Stata or R for interactive data analysis and modelling.

   57. Pooty Lederhosen Posted: May 02, 2014 at 12:22 PM (#4698868)
Thanks, Russ. That's very helpful. I like SAS for the combination of data management/manipulation and analysis tools that it provides. Adding PROC SQL increased its flexibility, and the most recent versions can finally handle complex sample designs with appropriate variance estimation methods. So it's really got almost all that my employer needs. I'd be more than happy to learn one of the others if it'll make me more marketable, however.
   58. Russ Posted: May 02, 2014 at 01:45 PM (#4698967)
R is making headways with packages like RSQLite and recent versions are much better at handling memory, but in the end you want to use a car to drive on a road and a boat to travel on water. There are things that SAS will always do better than R and vice versa. I rarely do any analysis exactly the same way (because my data is constantly changing) and I generally don't work with massive datasets so I almost never use SAS, but the context should drive the tool, not vice versa.

I will say that if you can use SAS and R, you can basically use anything else out there. Stata is much easier to work with than R or SAS, so I find it to be somewhat redundant as a skill if someone can use both R and SAS.
   59. DJS and the Infinite Sadness Posted: May 02, 2014 at 03:42 PM (#4699097)
Man, I usually joke that I'm the last person using STATISTICA, but I just may be.
   60. Ron J2 Posted: May 02, 2014 at 03:48 PM (#4699101)
#59 I'm old enough to have used SPSS on punch cards. I don't miss either SPSS or punch cards.
   61. Walt Davis Posted: May 02, 2014 at 07:56 PM (#4699281)
1. SAS/IML has been around for ages and does most of what R does as a language ... far less canned stuff in IML but then most of it is canned in SAS.

2. R is the current fave among academics for 2 reasons: (1) it's free; (2) academics ain't got no sense.

3. R Studio seems pretty much a straight clone of SAS/IML Studio which (near as I can tell) was around about 5 years ahead of R Studio.

4. You can call R from within SAS/IML.

The issue with R is that the routines are roll-your-own. They're public domain and used heavily so there's certainly some quality assurance behind at least the popular ones but you will never have the quality control that a SAS does. (Stata is somewhere in the middle) This would be less of an issue if all of the folks who were writing R packages were well-trained programmers but they aren't -- most of them are academic statisticians who at best dabble in programming and at worst are just typing in matrix formulas. SAS has also added a lot of the "fancy" models lately and even in the old days could handle most of the fancy stuff just not in obvious ways.

SAS/IML is every bit as interactive as R and the Studio seems to handle the graphics as well as R but, being in academia, I really have more experience with R Studio than IML.

If you've got access to SAS and are mainly interested in data management and statistical analysis (n > p), there's not that much reason to move away from it except everybody will think you old-fashioned. If you're just starting out, probably learn one of the newer packages. But really, if you want a successful career as a stat analyst of some variety, learn a bit about them all. At a minimum you should probably be able to at least muddle through effectively in SAS, Stata, SPSS (still popular in many research outfits) and R, maybe Python now too. Or GenStat, EPI-info, etc. if you're on the health side. Some specialty packages although I'm not sure how much future those have.

But the greater connectivity among packages should be helping -- pick whatever interface you like that gives you the data management capability in a language that fits your logic then call what you need from other packages. Like I said, from SAS you can call R and it's had SQL, Oracle, etc. interfaces forever. There are a number of other things you can call from SAS now (incl Python I think). Granted I haven't tried any of that yet -- my freaking uni is 2-3 versions of SAS behind -- but I am starting to think that I just might make it to the end of my career without having to learn 5 new packages.

Python is one I need to check out. I thought it was a programming language but I see there are a number of statisticians using it these days so obviously there's more to it than that.

By the way, I think R is great and it's what I'd use if I didn't work places that have SAS licenses (which are ridiculously expensive usually). The R team deserves kudos. And there are tons of people who work very hard at providing better interfaces, friendlier packages, better help manuals, etc. But there are now hundreds, probably thousands, of user-written procedures out there and (as with any open source) no way that quality can be assured ... and nobody to sue if it all goes wrong. :-)

This becomes more and more problematic as stats models move deeper into areas where equations can't be worked through by hand and there aren't classic test data available. When some academic develops a new model and associated R script there's often no way for anybody to check whether it produces the correct results. Is it computationally efficient? What's its level of numeric accuracy? Does it get the right answer when the data are near-singular or the function is at its asymptote or near a boundary? Are we sure that's a global max and not a local one?
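To make the global-versus-local worry concrete, here is a minimal sketch in Python. Nothing in it comes from any package mentioned in this thread: the objective `f`, its derivative, the step size and the starting points are all made up for illustration. It shows a plain hill-climber stopping on a lower peak, and a multi-start wrapper that at least reduces (but does not remove) that risk.

```python
def f(x):
    # Made-up objective with two peaks: a lower one near x = -1
    # and a higher one near x = +1.
    return -(x * x - 1.0) ** 2 + 0.3 * x


def f_prime(x):
    # Derivative of f, worked out by hand.
    return -4.0 * x * (x * x - 1.0) + 0.3


def climb(start, step=0.01, iters=2000):
    """Plain gradient ascent: walks uphill from the start, so it
    stops at whichever peak it reaches first."""
    x = start
    for _ in range(iters):
        x += step * f_prime(x)
    return x


def multi_start(starts):
    """Run the climber from several starts and keep the best peak found."""
    return max((climb(s) for s in starts), key=f)


single = climb(-2.0)  # starts on the left, ends on the lower peak
best = multi_start([-2.0, -1.0, 0.0, 1.0, 2.0])
print(f"single start: x={single:.3f}, f={f(single):.3f}")
print(f"multi start:  x={best:.3f}, f={f(best):.3f}")
```

With a one-dimensional toy like this you can plot the function and see both peaks; the point of the paragraph above is that for a new high-dimensional model you usually can't, so a single converged run proves very little.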

I've seen some code written by academics and it's often a nightmare -- n by n matrices when they aren't needed, inverting matrices instead of using the more stable linear solver, etc. 32-bit and more memory is making some of that stuff less of a concern but I'd be a lot more comfy with trained statistical programmers and a quality/testing department.
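A minimal sketch of the invert-versus-solve point, in plain Python so it runs anywhere. The function names are hypothetical and this is textbook Gaussian elimination with partial pivoting, not the routine any particular package uses. `solve` answers A x = b directly; `solve_via_inverse` is the habit complained about above, building the whole n by n inverse just to multiply it by one vector.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is untouched.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def solve_via_inverse(a, b):
    """The wasteful route: form inverse(a) explicitly, then multiply by b."""
    n = len(b)
    # Column j of the inverse solves a @ col = e_j, so this costs n solves
    # and stores a full n-by-n matrix to answer one right-hand side.
    cols = [solve(a, [1.0 if i == j else 0.0 for i in range(n)])
            for j in range(n)]
    return [sum(cols[j][i] * b[j] for j in range(n)) for i in range(n)]


a = [[4.0, 1.0], [2.0, 3.0]]
b = [1.0, 2.0]
x_direct = solve(a, b)
x_inverse = solve_via_inverse(a, b)
print(x_direct, x_inverse)
```

On a small, well-conditioned system like this the two routes agree. The extra work and the extra rounding in the inverse route only start to matter as the matrix gets large or close to singular, which is exactly where untested code tends to go wrong.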
   62. bjhanke Posted: May 03, 2014 at 12:06 AM (#4699389)
Wow! Thanks to all for all the good info and the lack of ad hominem sarcasm. Some of what I said appears to be just plain out of date. It's been about 20 years since I used SQL. At the time, there was no INSERT INTO feature that I remember. In fact, what I was told was the whole difference between a programming language and a query language is that a query language can NEVER change the database, hence the term "query." So the INSERT INTO feature must be newer than my experience, and represents a paradigm shift in what a query language can do. I was also told that the nickname for SQL was "sequel." Years before, when SQL first came out, I was told to avoid it, and that its nickname was "squeal", because that was what trying to do things like nested queries made you do. I learned C++ and Smalltalk back to back in the 1990s. Compared to Smalltalk, C++, at the time, really should not have counted as object oriented. Smalltalk was so abstract that you had to buy a third party piece of software just to get a hierarchy browser (Just for anyone who is trying to explain object-oriented to someone who is not computer savvy, I find that using Plato's theory of Ideals does a great job of explaining what a class hierarchy is). C++ was just C with a few extra features, like the ++ feature for doing for/next loops. Both languages have doubtless evolved.

On the other hand, I am so old that, when I first encountered computers in college (Vanderbilt had just built a computer center in 1965, when I was a freshman), there were only 5 computer languages in the Western world: Machine, written in hex, Assembler, which is just machine with mnemonics, Fortran (FORmula TRANslator), which was for rocket scientists and had only the most primitive of print statements, COBOL (COmmon Business-Oriented Language), which was for bank tellers and was so careful about preventing programmers from making mistakes that you had to define every variable in the Data Division, and a money variable was NOT the same as a number variable, so you had to load money numbers into number variables just to do arithmetic, so you got writer's cramp from trying to use the damfool thing, and Algol (ALGOrithmic Language), which was used in Europe. There were rumors of people working on languages called "C" and "Basic", but those languages weren't finished yet. ISAM (Indexed Sequential Access Method) was the only thing even approaching a DBMS, and grad school in computers involved being able to program a 7-tape sort, one input, one output, five working tapes, because disc technology was just coming into the field and tape storage doesn't lead to good experiences when trying to extract a single record from an unsorted tape. - Brock
   63. Russ Posted: May 03, 2014 at 12:17 AM (#4699393)
Python is one I need to check out. I thought it was a programming language but I see there are a number of statisticians using it these days so obviously there's more to it than that.


I tried to pick up Python a while back, but then the C++ integration with Rcpp made pretty much everything I might use Python for somewhat redundant. However, if you're doing a lot of data-processing, it still seems like the best route to go (and probably has the highest upside for handling large datasets this side of Matlab).
   64. ellsbury my heart at wounded knee Posted: May 03, 2014 at 04:15 AM (#4699451)
Or GenStat, EPI-info, etc. if you're on the health side. Some specialty packages although I'm not sure how much future those have.


I'm on the health side, and I can say with confidence that no epidemiologist under 45 uses EPI-Info anymore, and epidemiologists over 45 aren't getting their hands dirty doing statistical analysis themselves anyway. EPI-Info has been dead for years - it's mostly SAS/Stata/R, often depending on what part of the country you're in and whatever your data people like using.

Statistical genomics (and all the big, high-dimensional omics data) is changing so insanely quickly that a lot of the most important new methods are coming out in R. There's been an explosion of data in the past few years, but our current statistical methods haven't quite caught up yet to make sense of much of it, and the flexibility of R makes it great for distribution of new methods. Of course, you have to trust that the people who wrote the packages did it properly, but it wouldn't be bleeding edge if everything were crosschecked a dozen times by a team of computer scientists so it comes out years later.

R is what a lot of statistics departments are teaching young statisticians to use, so I assume that's what we'll be seeing a lot of for the next few years. People tend to prefer their first programming language.
   65. Swedish Chef Posted: May 03, 2014 at 06:20 AM (#4699454)
I tried to pick up Python a while back, but then the C++ integration with Rcpp made pretty much everything I might use Python for somewhat redundant. However, if you're doing a lot of data-processing, it still seems like the best route to go (and probably has the highest upside for handling large datasets this side of Matlab).

Anyone who is using Python for playing around with numbers should install IPython. It's a great interactive environment and it has a notebook feature that's a bit like Mathematica.
   66. GGC don't think it can get longer than a novella Posted: May 03, 2014 at 10:20 AM (#4699532)
This is making me wish that I knew more about how to use computers than Excel and Word. And I regret taking an internship over econometrics when I was finishing my degree back in the day.
   67. Walt Davis Posted: May 03, 2014 at 07:14 PM (#4699772)
R is what a lot of statistics departments are teaching young statisticians to use, so I assume that's what we'll be seeing a lot of for the next few years. People tend to prefer their first programming language.

But not many statistical analysts are statisticians. They are probably more likely to come out of economics, health or even the social sciences. And very few people should be doing genuine statistical programming -- if you are a research outfit, you want your analysts relying almost entirely on procedures written by others. Their job is primarily data management, exploratory/descriptive analysis, choosing the appropriate model, interpreting results. It's also true that R is increasingly being used outside of statistics and in statistics service teaching (see "free") so its future is bright. Everybody should have a version on their personal computer.

So yes, the stats grads who go on to become applied stats academics or the occasional high-end private sector statistician can go ahead and program stuff up in R or Python or whatever. The other 99% of the world will want to analyze data in whatever package offers the best combo of interface, data management, graphics, etc. If a stage is reached where you can call R, Python, C++, etc. from any of those packages, it won't necessarily matter very much what they're written in and you will only need to know rudimentary levels of R, etc.

Which isn't to say that the combo package of choice won't be or shouldn't be R or iPython or some Google analytics interface or whatever.

It's mainly the idea that "everybody" should be doing statistical programming that scares me. R is an excellent statistical programming tool ... and, what, maybe 10,000 people in the world should be using it for that. Everybody else should be using it for analysis using existing R packages.

By the way, it's becoming increasingly hard to publish a new model, even in theoretical statistics, without submitting code. That code may not be well-written but there's rarely any need for an analyst to roll their own, even on the bleeding edge.
   68. Walt Davis Posted: May 03, 2014 at 07:48 PM (#4699789)
Now, let's see if my memory is good enough to get to (in no particular order) 10 lessons I learned from Dan/ZiPS

1. Variances on individual season projections are huge (Swisher -- Betemit trade) ... with partial credit to PECOTA

2. You can get a start on a decent MLE by whacking about 30 points of BA (and associated SLG) off minor-league stats

3. After a player has been away from a position for a couple of years, their quality at the position declines substantially

4. Late bloomers fade early

5. ZiPS is not a playing time projection system

6. Comps are even more fun than I thought and maybe even a bit useful

7. 99 times out of 100, last year's AA break-out is nowhere near ready to produce in the majors (prospects are doomed)

8. Most of my seat-of-the-pants career projections are not insane

9. Components are fun and maybe even a bit useful

10. Howie Kendrick was a true 370 BABIP hitter ... oh wait ... :-)

(Kendrick is a very good BABIP hitter. But his 2009 ZiPS projection had him at a 372 BABIP.)
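
Lesson 2 is easy to put into numbers. The sketch below is only a reading of that one-line rule of thumb, not how ZiPS computes its MLEs. The function name is hypothetical, the sample stat line is invented, and it assumes "associated SLG" means the lost hits are all singles, so SLG drops by the same 30 points and isolated power is unchanged.

```python
def rough_mle(ba, slg, ba_penalty=0.030):
    """Knock a flat penalty off BA and the same amount off SLG.

    Assumption (not from ZiPS): the lost hits are all singles, so
    isolated power (SLG - BA) is left unchanged.
    """
    return round(ba - ba_penalty, 3), round(slg - ba_penalty, 3)


# Invented minor-league line, for illustration only.
minor_ba, minor_slg = 0.300, 0.480
mle_ba, mle_slg = rough_mle(minor_ba, minor_slg)
print(f"minor league: {minor_ba:.3f} BA / {minor_slg:.3f} SLG")
print(f"rough MLE:    {mle_ba:.3f} BA / {mle_slg:.3f} SLG")
```

A .300/.480 minor-league line comes out as roughly .270/.450, which is a starting point only: it ignores level, league, park and age.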
