User Comments, Suggestions, or Complaints | Privacy Policy | Terms of Service | Advertising
Notes in a Minor Key - Wednesday, February 15, 2006
Collecting Minor League Data - call for volunteers
Over here, in comment #42, Der Komminsk-sar asked this question:
Mike, did you go into Hughes’ logs one at a time, or have there been efforts to collect this data for everybody (which would be mega-awesome)?
The data in question is individual, game-by-game performance data for minor league players and pitchers. I did it for Hughes because there were some questions raised on the linked thread about his GB/FB performance in 2005.
Since MLB Advanced Media took over the operation of the minor league baseball Web site last year, they have published PBP logs for every minor league game on the Web site. The possibilities for using this data in prospect analysis are endless, but beyond my ability (or any one person’s, unless he has a substantial amount of free time) to collect and evaluate except on a piecemeal basis. As many people know, I do a significant amount of data extraction from the MLB Gameday data, and it usually takes me the better part of four months of my own time in the offseason, as well as keeping up with the flow during the season, to get everything that I use from that data. Hughes pitched in just 17 games, and it took me about two hours (spread over yesterday and today) to get the data for him from milb.com - it could have been done in less time if they’d been done in real time rather than after the fact, of course, because I wouldn’t have had to search for the games in which Hughes pitched, but even so - multiply that by 180 (or so) minor league teams, many of which play 140-game schedules and the rest of which play 60-70, and you get a sense of the magnitude of what’s out there.
It occurs to me, though, that an individual could do this for a team, a league (small one), or a particular organization’s minor league affiliates, depending on the amount of time the individual is willing to take.
And if we could get enough volunteers, we could cover all of the minors, and have the beginnings of an open-source minor-league player DB with more information than just the typical stat lines.
In addition, we’ve talked a little bit about a “fan’s scouting database”, where people who see prospects play could post scouting reports. That’s something that I definitely want to do this year; I’ll set up a sticky link in the blog when the season starts for those. If you go to minor league, college, and high school games regularly or semi-regularly, by all means plan to participate there.
Mike Emeigh
Posted: February 15, 2006 at 06:35 PM | 44 comment(s)
Related News: Minor Leagues, Prospect Reports, Sabermetrics
About Baseball Think Factory | Write for Us | Copyright © 1996-2008 Baseball Think Factory
Reader Comments and Retorts
Statements posted here are those of our readers and do not represent the BaseballThinkFactory. Names are provided by the poster and are not verified. We ask that posters follow our submission policy. Please report any inappropriate comments.
Or am I missing something?
Going back and gathering the data after the fact is definitely a pain. I did that to get Andy Sonnanstine's L/R splits for '05, and it took a lot of time to get that done.
I plan on doing something like this for Rays prospects in 2006, though I was only intending to get data for specific players. I may be willing to expand that to every player on the Rays affiliates; it probably wouldn't take much more time than getting specific players, since I'll be going through the whole log anyway.
No, I didn't - I included hits. I also counted popups, and counted bunts separately (there were only a couple of those). Here's what I have for Hughes's outings:
Charleston:
4/8: 8 FB, 3 GB, 0 LD, 6 K, 1 W
4/13: 7 FB, 7 GB, 3 LD, 3 K, 2 W
4/18: 3 FB, 9 GB, 0 LD, 1 bunt, 2 K, 1 W
4/26: 3 FB, 5 GB, 0 LD, 1 bunt, 8 K, 2 W
5/1: 3 FB, 7 GB, 0 LD, 8 K, 3 W
5/7: 3 FB, 8 GB, 1 LD, 7 K, 2 W
5/15: 5 FB, 7 GB, 3 LD, 7 K, 1 W
5/20: 5 FB, 10 GB, 1 LD, 7 K, 1 W
5/25: 6 FB, 7 GB, 3 LD, 7 K, 1 W, 1 HB
6/1: 8 FB, 7 GB, 4 LD, 4 K, 1 W, 1 HB
6/7: 9 FB, 7 GB, 4 LD, 5 K, 0 W
6/12: 6 FB, 7 GB, 3 LD, 8 K, 1 W, 1 HB
Totals: 66 FB, 84 GB, 22 LD, 2 bunts, 70 K, 16 W, 3 HB
Tampa:
7/7: 5 FB, 2 GB, 0 LD, 1 bunt, 2 K, 1 W
7/13 (relief): 3 FB, 2 GB, 0 LD, 6 K, 1 W, 1 HB
7/20: 1 FB, 4 GB, 0 LD, 1 K, 0 W
7/25: 3 FB, 3 GB, 1 LD, 7 K, 1 W
7/31: 3 FB, 4 GB, 4 LD, 1 bunt, 5 K, 1 W, 2 HB
Totals: 15 FB, 15 GB, 5 LD, 2 bunts, 21 K, 4 W, 3 HB
--- MWE
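The per-game lines in the comment above follow a regular "count label" pattern, so subtotals and a GB/FB ratio can be recomputed mechanically instead of by hand. What follows is only a minimal sketch, assuming the lines are typed exactly as they appear in that comment; `parse_line` and `tally` are hypothetical helper names, not part of any existing tool.

```python
import re
from collections import Counter

# Matches "<count> <label>" pairs such as "8 FB" or "1 bunt".
PAIR = re.compile(r"(\d+)\s+([A-Za-z]+)")

def parse_line(line):
    """Split '4/8: 8 FB, 3 GB, ...' into its label and a Counter of events."""
    label, _, rest = line.partition(":")
    counts = Counter()
    for number, kind in PAIR.findall(rest):
        key = kind.upper()
        if key == "BUNTS":  # totals lines say "bunts", game lines say "bunt"
            key = "BUNT"
        counts[key] += int(number)
    return label.strip(), counts

def tally(lines):
    """Sum the event counts over several game lines."""
    total = Counter()
    for line in lines:
        total.update(parse_line(line)[1])
    return total

# First three Charleston outings, copied from the list above.
games = [
    "4/8: 8 FB, 3 GB, 0 LD, 6 K, 1 W",
    "4/13: 7 FB, 7 GB, 3 LD, 3 K, 2 W",
    "4/18: 3 FB, 9 GB, 0 LD, 1 bunt, 2 K, 1 W",
]
subtotal = tally(games)

# GB/FB ratio taken from the posted Charleston totals line.
_, season = parse_line("Totals: 66 FB, 84 GB, 22 LD, 2 bunts, 70 K, 16 W, 3 HB")
gb_fb = season["GB"] / season["FB"]
print(dict(subtotal), round(gb_fb, 2))
```

Running the same `tally` over a full set of game lines and comparing it with the posted totals line is a cheap way to catch transcription slips.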
-- MWE
-- MWE
It further has scripts to convert this data to the Retrosheet format and load it into a MySQL database. The code is in Perl and, from a very cursory inspection, seems to be well written and well commented.
-- MWE
Meh was supposed to get a set of keys as well, but I don't think Dan's done anything about it yet.
-- MWE
Baseball America's 2006 almanac contains L/R splits for all hitters. This is, I believe, the first year they did this, so if anyone else is looking for that data for a 2005 player, I can probably look it up much faster than you can gather the data.
If you're looking for 2005 data, that is. However, if you're gathering 2006 data in real time, you'll have L/R splits for 2006 hitters long before BA publishes its 2007 almanac.
-- MWE
So is the TSN guide (with the exception of 2004, when everyone got hosed because of the problems with the minor leagues' official stat compiler).
-- MWE
You obviously have never seen me try to compile data manually.
I'm probably missing the point here, but I'm not sure why they'd do this, as they're providing the game logs free. I doubt they see much money (by MLB standards) in minor league stats, or they could be doing a lot more with the minor league web site, like providing splits themselves.
Mike,
I looked at a few Sox prospects' game logs last year and it seemed like they recorded a lot of popups. Given that popups have some skill component in MLB pitchers, it made me wonder whether good minor league pitching prospects tend to have higher than normal popup rates.
If this type of a project gets off the ground it would be great to keep popups separate in case that is true.
Anyone know if there is any source for SB/CS for minor league catchers?
good thing i'm getting my bonus soon - what with tango/mgl's book, the o'reilly book, the sickels book, the ba 2006 book, and maybe even bpro's annual, i'm going to spend a lot of money on baseball books in the next month.
I can likely make a small contribution, presuming spidering doesn't work.
And thanks for the email back. I'll let you know if I find it to be useful at all.
If some technological solution isn't forthcoming, I'd be willing to try to compile some data for the Pirates' farm system. I couldn't do it all, although I could try to find somebody to help.
Yes, it would.
-- MWE
However, since one can easily cut-and-paste the game log into a text file, I think there's a compromise to be reached between spidering and doing it all by hand. I'm working on a program that will take said text file and convert it as far as possible into a standardized pbp format. There are problems with that, too, though, because the game logs don't have all the necessary information--for instance, if a defensive player never makes a play and doesn't get replaced, his name never shows up in conjunction with his position. If a starting pitcher never makes a defensive play and isn't replaced, his name never shows up in the game log at all!
Thus, the box scores would appear to come in handy...but in the box scores (unlike the game logs) complete player names aren't given. So when Travis Smith throws a complete game, going to the box score to find "Smith, P" is less than ideal.
In other words, I don't think there's a 100% technological solution to this.
I do think, however, that I can reduce the amount of work in "logging" each game into a database down to 120 seconds or less, maybe much less. For the ~10,000 minor league games of 2005, I don't think I'm going to do all those myself unless I can get that number down to 15 seconds or so. I may be able to create some sort of web interface where volunteers can enter the game logs and then, upon prompting, enter things like starting pitcher names and player positions that aren't given by the game log. Or it might be that the time it takes to make that sufficiently user-friendly is greater than the time it takes me to do it all myself or work with a very small group of volunteers. We'll see.
Obviously I'm thinking out loud here. Perhaps less obviously, I'm a total amateur at this stuff; I wouldn't be surprised if somebody came along and said they could easily do all the tasks I'm saying aren't technologically doable.
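To put the per-game targets mentioned above in perspective, here is a back-of-the-envelope sketch. It uses only the figures quoted in the comment (about 10,000 games, 120 seconds versus 15 seconds per game); the function name `total_hours` is made up for illustration.

```python
GAMES = 10_000  # approximate 2005 minor league game count quoted above

def total_hours(seconds_per_game, games=GAMES):
    """Hours needed to log every game at a fixed per-game cost."""
    return seconds_per_game * games / 3600

for secs in (120, 15):
    print(f"{secs:>3} s/game -> {total_hours(secs):.1f} hours")
```

At 120 seconds per game the season comes to roughly 333 hours of logging; at 15 seconds it drops to about 42 hours, which is why the lower figure is the threshold for one person doing it alone.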
Are you deriving that data from the pbp logs, or from team stats already compiled on the miLB site?
I'm 90-95% done with a parser to convert every play in an miLB pbp log into something standardized and more or less like retrosheet event codes. What I've got so far deals with upwards of 99% of plays...but as anybody who's spent time with retrosheet event codes knows, the last 1%, even the last 0.1% of plays can get tricky, what with stuff like "1(B)16(2)63(1)/LTP/L1"
The bigger challenge I'm looking at now is how to make the results optimally usable. Let's assume for a second I can turn all the available pbp logs into a database that more or less mirrors what miLB has on their servers, providing us with the trickle of data they let out. What do you want from that, and how do you want it?
Should my end goal be something that has syntax virtually identical to retrosheet, so that you can use RS's parsers to come up with csv files and the like in a format you're familiar with? I'm not sure I could deliver that, as I don't know how easily one could expand the RS syntax to include all minor leagues...and I don't really want to embark on yet ANOTHER system of labelling players so that we can tell one Angel Garcia (garca006) from another (garca007).
Would you rather I aim toward something in the vein of David Pinto's day-by-day database? I'd like to come up with that sort of thing no matter what ...if I do, what functionality do you want from it?
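As a rough illustration of the kind of conversion described in this comment, the sketch below maps a few simple play descriptions to Retrosheet-style event codes. It is not the poster's program. The description wording ("grounds out, shortstop ... to first baseman ...") is an assumption made for the example, and real miLB log text may differ. The position numbers 1 through 9 are the standard scorekeeping numbers.

```python
import re

# Standard scorekeeping position numbers, as used in Retrosheet event codes.
POSITIONS = {
    "pitcher": 1, "catcher": 2, "first baseman": 3, "second baseman": 4,
    "third baseman": 5, "shortstop": 6, "left fielder": 7,
    "center fielder": 8, "right fielder": 9,
}
POS = "|".join(POSITIONS)

# HYPOTHETICAL description wording; real miLB log text may differ.
GROUNDOUT = re.compile(rf"grounds out, ({POS}) .+? to ({POS}) ")
FLYOUT = re.compile(rf"flies out to ({POS}) ")

def to_event(description):
    """Return a Retrosheet-style event code, or None if the play is not recognized."""
    if "strikes out" in description:
        return "K"
    if "hit by pitch" in description:
        return "HP"
    if "walks" in description:
        return "W"
    m = GROUNDOUT.search(description)
    if m:
        return f"{POSITIONS[m.group(1)]}{POSITIONS[m.group(2)]}/G"
    m = FLYOUT.search(description)
    if m:
        return f"{POSITIONS[m.group(1)]}/F"
    return None  # unrecognized plays are left for manual review

print(to_event("A. Batter grounds out, shortstop B. Fielder to first baseman C. Fielder."))
```

Returning `None` for anything unrecognized keeps the rare, tricky plays visible for hand coding rather than guessing at them.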
I see no reason why you could not use MiLB's ID numbers for players as the ID keys. And they are retrievable; E-mail me off-list if you don't know how to get them.
-- MWE
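The suggestion above, keying players on a source-assigned numeric ID rather than on a name, can be sketched as follows. The ID values and team labels here are invented for illustration and are not real MiLB identifiers; `Player` and `register` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Player:
    player_id: int  # source-assigned numeric ID (values below are invented)
    name: str
    team: str       # placeholder team labels, also invented

players = {}

def register(player):
    """Store a player under the numeric ID, refusing silent overwrites."""
    existing = players.get(player.player_id)
    if existing is not None and existing != player:
        raise ValueError(f"conflicting records for ID {player.player_id}")
    players[player.player_id] = player

# Two different players who share a name stay separate under distinct IDs.
register(Player(100001, "Angel Garcia", "Team A"))
register(Player(100002, "Angel Garcia", "Team B"))

same_name = [p for p in players.values() if p.name == "Angel Garcia"]
print(len(players), len(same_name))
```

Because the key comes from the data source, no new labelling scheme has to be invented or maintained to tell same-named players apart.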
Nope.
-- MWE
I'm extracting it all from PBP logs and lots of Excel equations. Like you said, 99.9% of the data can be extracted from the game logs, and that final .1% is human work for things such as complete games. However, I have gotten it down to just copying and pasting; a whole season takes about an hour, and 99.9% of the data will be spit out. As for the defensive stats, they seem out of my league right now, and I don't see how to collect them at this moment.
For the past year, I have been collecting splits for the Red Sox minors on the soxprospects message boards, but I've had to insert things manually, which can take a lot of time. Realistically, I could only do 1 team without burning out and do the others after the season. However, I'm working on equations to set up the lefty-righty splits and men on base, but they are still a work in progress. I'm hoping to get this set up by the start of the season.
Understanding that we don't have great bip-location/type data from those pbp logs, I'm pretty sure I've got my program set up to collect any defensive stats one can glean from retrosheet logs. (That is, I'm tracking who's on the field at all times, and as much of the bip-location/type as the logs give us.) Speaking of which, I'm still testing, but I'm at the point where my program handles 99.5% or so of all plays, and outputs something virtually identical to a retrosheet log. So I guess it's doable. I wouldn't necessarily have charged ahead over the last few days if I had seen your post, but BTF has been inaccessible to me (and I've been sick) for the last three days or so. I'll shoot you an email so I can clarify what you've done and maybe we can help each other out a bit.
Not of which I am aware.
-- MWE