Baseball for the Thinking Fan

Notes in a Minor Key

Wednesday, February 15, 2006

Collecting Minor League Data - call for volunteers

Over here, in comment #42, Der Komminsk-sar asked this question:

Mike, did you go into Hughes’ logs one at a time, or have there been efforts to collect this data for everybody (which would be mega-awesome)?

The data in question is individual, game-by-game performance data for minor league players and pitchers. I did it for Hughes because there were some questions raised on the linked thread about his GB/FB performance in 2005. Since MLB Advanced Media took over the operation of the minor league baseball Web site last year, they have published PBP logs for every minor league game on the Web site. The possibilities for using this data in prospect analysis are endless, but beyond my ability (or any one person’s, unless he has a substantial amount of free time) to collect and evaluate except on a piecemeal basis. As many people know, I do a significant amount of data extraction from the MLB Gameday data, and it usually takes me the better part of four months of my own time in the offseason, as well as keeping up with the flow during the season, to get everything that I use from that data. Hughes pitched in just 17 games, and it took me about two hours (spread over yesterday and today) to get the data for him from milb.com - it could have been done in less time if they’d been done in real time rather than after the fact, of course, because I wouldn’t have had to search for the games in which Hughes pitched, but even so - multiply that by 180 (or so) minor league teams, many of which play 140-game schedules and the rest of which play 60-70, and you get a sense of the magnitude of what’s out there.
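To put rough numbers on that magnitude, here is a back-of-the-envelope sketch. The split of the 180 or so teams into 120 full-season and 60 short-season clubs is an assumption made only for illustration, as is using 65 games as the midpoint of the 60-70 range.

```python
# Rough size of the collection job described above.
# The 120/60 split between full-season and short-season teams is an
# assumption for illustration; the post only says "180 (or so)" teams.
FULL_SEASON_TEAMS = 120   # assumed
SHORT_SEASON_TEAMS = 60   # assumed
FULL_SEASON_GAMES = 140   # from the post
SHORT_SEASON_GAMES = 65   # assumed midpoint of the post's 60-70 range

team_games = (FULL_SEASON_TEAMS * FULL_SEASON_GAMES
              + SHORT_SEASON_TEAMS * SHORT_SEASON_GAMES)

# Every game involves two teams, so halve the team-game count.
total_games = team_games // 2

# Hughes: 17 games took about two hours to collect after the fact.
minutes_per_game_one_pitcher = 2 * 60 / 17

print(f"team-games: {team_games}")
print(f"distinct games: {total_games}")
print(f"minutes per game for one pitcher: {minutes_per_game_one_pitcher:.1f}")
```

Under those assumptions the total comes out to a little over 10,000 distinct games a season, each of which has to be read for every player of interest.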

It occurs to me, though, that an individual could do this for a team, a league (small one), or a particular organization’s minor league affiliates, depending on the amount of time the individual is willing to take. And if we could get enough volunteers, we could cover all of the minors, and have the beginnings of an open-source minor-league player DB with more information than just the typical stat lines.

In addition, we’ve talked a little bit about a “fan’s scouting database”, where people who see prospects play could post scouting reports. That’s something that I definitely want to do this year; I’ll set up a sticky link in the blog when the season starts for those. If you go to minor league, college, and high school games regularly or semi-regularly, by all means plan to participate there.

Mike Emeigh Posted: February 15, 2006 at 06:35 PM | 44 comment(s)
  Related News: Minor Leagues, Prospect Reports, Sabermetrics

Reader Comments and Retorts

   1. NJ in DC loathes his classmates and the law Posted: February 15, 2006 at 08:15 PM (#1863612)
Mike, I've already begun collecting these types of "advanced stats" for Yankee minor leaguers and FWIW, your Hughes' numbers are wrong. I think you did GO:FO, not GB:FB because my numbers are drastically different from yours and I have every bit of confidence in them.
   2. The Hop-Clop Goes On (psa1) Posted: February 15, 2006 at 08:59 PM (#1863668)
If the file naming conventions for MLBAM are sensible, and we're just collecting text pbp data (that is, not actually watching video and making judgments) couldn't somebody write a script and theoretically extract it all in one step? My programming capabilities are just short of that, but it seems like a relatively simple thing for a real programmer to do.

Or am I missing something?
   3. Jim Wisinski Posted: February 15, 2006 at 09:28 PM (#1863691)
#2, I have no clue if that would be possible or not but the file names for the game logs are uniformly named so that shouldn't be an obstacle.

Going back and gathering the data after the fact is definitely a pain, I did that to get Andy Sonnanstine's L/R splits for '05 and it took a lot of time to get that done.

I plan on doing something like this for Rays prospects in 2006 though I was only intending on getting data for specific players. I may be willing to expand that for every player on the Rays affiliates, it probably wouldn't take too much more time than getting specific players since I'll be going through the whole log anyway.
   4. Mike Emeigh Posted: February 15, 2006 at 09:43 PM (#1863708)
I think you did GO:FO, not GB:FB because my numbers are drastically different from yours and I have every bit of confidence in them.


No, I didn't - I included hits. I also counted popups, and counted bunts separately (there were only a couple of those). Here's what I have for Hughes's outings:

Charleston:

4/8: 8 FB, 3 GB, 0 LD, 6 K, 1 W
4/13: 7 FB, 7 GB, 3 LD, 3 K, 2 W
4/18: 3 FB, 9 GB, 0 LD, 1 bunt, 2 K, 1 W
4/26: 3 FB, 5 GB, 0 LD, 1 bunt, 8 K, 2 W
5/1: 3 FB, 7 GB, 0 LD, 8 K, 3 W
5/7: 3 FB, 8 GB, 1 LD, 7 K, 2 W
5/15: 5 FB, 7 GB, 3 LD, 7 K, 1 W
5/20: 5 FB, 10 GB, 1 LD, 7 K, 1 W
5/25: 6 FB, 7 GB, 3 LD, 7 K, 1 W, 1 HB
6/1: 8 FB, 7 GB, 4 LD, 4 K, 1 W, 1 HB
6/7: 9 FB, 7 GB, 4 LD, 5 K, 0 W
6/12: 6 FB, 7 GB, 3 LD, 8 K, 1 W, 1 HB

Totals: 66 FB, 84 GB, 22 LD, 2 bunts, 70 K, 16 W, 3 HB

Tampa:

7/7: 5 FB, 2 GB, 0 LD, 1 bunt, 2 K, 1 W
7/13 (relief): 3 FB, 2 GB, 0 LD, 6 K, 1 W, 1 HB
7/20: 1 FB, 4 GB, 0 LD, 1 K, 0 W
7/25: 3 FB, 3 GB, 1 LD, 7 K, 1 W
7/31: 3 FB, 4 GB, 4 LD, 1 bunt, 5 K, 1 W, 2 HB

Totals: 15 FB, 15 GB, 5 LD, 2 bunts, 21 K, 4 W, 3 HB

--- MWE
   5. Mike Emeigh Posted: February 15, 2006 at 09:45 PM (#1863711)
I added wrong; Hughes's Ks add up to 72 at Charleston.

-- MWE
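The per-game Charleston lines from comment #4 can be re-added mechanically, which is one way to catch this kind of slip. A minimal sketch, assuming the lines are pasted as plain text in exactly the format shown there; the `tally` helper is a name made up for this example.

```python
import re
from collections import Counter

# Charleston lines copied from comment #4.
CHARLESTON = """\
4/8: 8 FB, 3 GB, 0 LD, 6 K, 1 W
4/13: 7 FB, 7 GB, 3 LD, 3 K, 2 W
4/18: 3 FB, 9 GB, 0 LD, 1 bunt, 2 K, 1 W
4/26: 3 FB, 5 GB, 0 LD, 1 bunt, 8 K, 2 W
5/1: 3 FB, 7 GB, 0 LD, 8 K, 3 W
5/7: 3 FB, 8 GB, 1 LD, 7 K, 2 W
5/15: 5 FB, 7 GB, 3 LD, 7 K, 1 W
5/20: 5 FB, 10 GB, 1 LD, 7 K, 1 W
5/25: 6 FB, 7 GB, 3 LD, 7 K, 1 W, 1 HB
6/1: 8 FB, 7 GB, 4 LD, 4 K, 1 W, 1 HB
6/7: 9 FB, 7 GB, 4 LD, 5 K, 0 W
6/12: 6 FB, 7 GB, 3 LD, 8 K, 1 W, 1 HB
"""

def tally(lines: str) -> Counter:
    """Sum the '<count> <label>' pairs across all game lines."""
    totals = Counter()
    for line in lines.splitlines():
        # Everything before the first colon is the date (plus any note).
        _, _, stats = line.partition(":")
        for count, label in re.findall(r"(\d+) ([A-Za-z]+)", stats):
            # Treat "bunt" and "bunts" as the same category.
            key = "bunt" if label.startswith("bunt") else label
            totals[key] += int(count)
    return totals

totals = tally(CHARLESTON)
print(dict(totals))
```

Run against the twelve Charleston starts, the strikeout column sums to 72 and the other columns match the posted totals.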
   6. Mike Emeigh Posted: February 15, 2006 at 09:48 PM (#1863715)
The FB totals include the popups as well.

-- MWE
   7. Gwyn Posted: February 15, 2006 at 09:48 PM (#1863716)
O'Reilly (a techie publishing company, if you're not familiar with them) has just published Baseball Hacks by Joseph Adler. It contains code to spider the MLB Gameday site and download the MLB pbp data; I would think it's easily adaptable to the minor league data.
It further has scripts to convert this data to the retrosheet format and load it into a mysql database. The code is in perl and, from a very cursory inspection, seems to be well written and well commented.
   8. NJ in DC loathes his classmates and the law Posted: February 15, 2006 at 10:01 PM (#1863729)
Oh ok, I didn't include popups in the FB total and I included bunts in GB, so that's where our difference lies.
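The two counting conventions can be put side by side in a few lines. This is only a sketch: the GB, FB, and bunt figures are the Charleston totals from comment #4, but the popup count is a made-up placeholder, since popups were not broken out in the thread.

```python
def gb_fb_ratio(gb, fb_incl_popups, bunts, popups, *,
                popups_as_fb, bunts_as_gb):
    """GB/FB ratio under a chosen convention for popups and bunts."""
    ground = gb + (bunts if bunts_as_gb else 0)
    fly = fb_incl_popups - (0 if popups_as_fb else popups)
    return ground / fly

# Charleston totals from comment #4 (the FB figure includes popups).
GB, FB_INCL_POPUPS, BUNTS = 84, 66, 2
POPUPS = 16  # HYPOTHETICAL placeholder, not a figure from the thread

# Convention in comment #4: popups count as FB, bunts kept separate.
with_popups = gb_fb_ratio(GB, FB_INCL_POPUPS, BUNTS, POPUPS,
                          popups_as_fb=True, bunts_as_gb=False)

# Convention in comment #8: popups excluded from FB, bunts counted as GB.
without_popups = gb_fb_ratio(GB, FB_INCL_POPUPS, BUNTS, POPUPS,
                             popups_as_fb=False, bunts_as_gb=True)

print(f"popups as FB, bunts separate: {with_popups:.2f}")
print(f"popups excluded, bunts as GB: {without_popups:.2f}")
```

Whatever the real popup count is, moving popups out of the denominator and bunts into the numerator can only push the ratio up, which is the direction of the disagreement described here.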
   9. Jim Wisinski Posted: February 15, 2006 at 10:05 PM (#1863732)
By the way MWE, did you get my e-mail about the RaysBB list?
   10. Mike Emeigh Posted: February 15, 2006 at 10:06 PM (#1863735)
I'm familiar with O'Reilly, and I've seen the blurbs for Baseball Hacks. I think that spidering the Web site is a dangerous tactic, because it might lead MLB to make the data inaccessible. MLB doesn't *have* to publish game logs, and it's unlikely that anyone else will do it for the minors.

-- MWE
   11. Mike Emeigh Posted: February 15, 2006 at 10:08 PM (#1863737)
Jim, I did. I have been uncommonly busy this week at work, and haven't had much time to post before tonight. I have a batch of things piled up.

Meh was supposed to get a set of keys as well, but I don't think Dan's done anything about it yet.

-- MWE
   12. Jim Wisinski Posted: February 15, 2006 at 10:32 PM (#1863748)
Ok, I just wanted to make sure BTF-mail didn't swallow it.
   13. PhillyBooster Posted: February 15, 2006 at 11:26 PM (#1863795)
Going back and gathering the data after the fact is definitely a pain, I did that to get Andy Sonnanstine's L/R splits for '05 and it took a lot of time to get that done.


Baseball America's 2006 almanac contains L/R splits for all hitters. This is, I believe, the first year they did this, so if anyone else is looking for that data for a 2005 player, I can probably look it up much faster than you can gather the data.
   14. El Hijo del Ron Santo (Alan Keiper) Posted: February 15, 2006 at 11:29 PM (#1863799)
That almanac is also a tremendous source of minor league fielding data.
   15. Mike Emeigh Posted: February 15, 2006 at 11:39 PM (#1863815)
This is, I believe, the first year they did this, so if anyone else is looking for that data for a 2005 player, I can probably look it up much faster than you can gather the data.


If you're looking for 2005 data, that is. However, if you're gathering 2006 data in real time, you'll have L/R splits for 2006 hitters long before BA publishes its 2007 almanac.

-- MWE
   16. Mike Emeigh Posted: February 15, 2006 at 11:41 PM (#1863816)
That almanac is also a tremendous source of minor league fielding data.


So is the TSN guide (with the exception of 2004, when everyone got hosed because of the problems with the minor leagues' official stat compiler).

-- MWE
   17. PhillyBooster Posted: February 15, 2006 at 11:49 PM (#1863824)


If you're looking for 2005 data, that is. However, if you're gathering 2006 data in real time, you'll have L/R splits for 2006 hitters long before BA publishes its 2007 almanac.


You obviously have never seen me try to compile data manually.
   18. WTM Posted: February 16, 2006 at 12:07 AM (#1863837)
I think that spidering the Web site is a dangerous tactic, because it might lead MLB to make the data inaccessible.

I'm probably missing the point here, but I'm not sure why they'd do this, as they're providing the game logs free. I doubt they see much money (by MLB standards) in minor league stats, or they could be doing a lot more with the minor league web site, like providing splits themselves.
   19. The Hop-Clop Goes On (psa1) Posted: February 16, 2006 at 12:13 AM (#1863838)
Hmmm. I hadn't looked at the site before I made my original comment...y'all are probably way ahead of me here, but it's designed great for spidering. Unfortunately I don't work in Perl or I'd try to adapt the BB Hacks script, but I think it's doable, even by amateur me. I'll let you know, Mike, if it looks like I could come up with a retrosheet-like database from this stuff.
   20. philly Posted: February 16, 2006 at 01:21 AM (#1863901)
The FB totals include the popups as well.

Mike,

I looked at a few Sox prospects game logs last year and it seemed like they recorded a lot of popups. Given that popups have some skill component in MLB pitchers it made me wonder that perhaps good minor league pitching prospects tended to have higher than normal popup rates.

If this type of a project gets off the ground it would be great to keep popups separate in case that is true.
   21. AROM Posted: February 16, 2006 at 01:32 AM (#1863906)
That almanac is also a tremendous source of minor league fielding data.

Anyone know if there is any source for SB/CS for minor league catchers?
   22. rdfc Posted: February 16, 2006 at 09:59 AM (#1864291)
It would obviously be useful to have minor league left/right splits too. MLB.com promised to do those last year, but like so many things they've promised that aren't part of their video packages (such as the super game coverage with pitch speed etc.), they never showed up.
   23. NJ in DC loathes his classmates and the law Posted: February 16, 2006 at 10:34 AM (#1864328)
rdfc, the annoying part about that is that they kept those numbers along with home/away splits and RISP splits, but just didn't let the public see them on a regular basis because whenever they did a minor league gamecast each player had these splits freely available to peruse.
   24. Kyle S Posted: February 16, 2006 at 11:05 AM (#1864365)
philly: chuck james is in that camp as well; his hysterically low gb/fb ratio is in large part due to popups induced.
   25. Kyle S Posted: February 16, 2006 at 11:10 AM (#1864371)
by the way, that o'reilly book is a fantastic find. i've read the other article by that author on using R for statistical analysis and i liked that too.

good thing i'm getting my bonus soon - what with tango/mgl's book, the o'reilly book, the sickels book, the ba 2006 book, and maybe even bpro's annual, i'm going to spend a lot of money on baseball books in the next month.
   26. Der Komminsk-sar Posted: February 16, 2006 at 11:13 AM (#1864375)
Mike, I'm glad you proposed this - I've wanted to suggest the same, but know I won't have the time to coordinate anything in '06.
I can likely make a small contribution, presuming spidering doesn't work.
   27. Dan Szymborski Posted: February 16, 2006 at 11:25 AM (#1864390)
Dan can't give keys for this blog - I forwarded the e-mail awhile ago to Jim but I guess he didn't see it.
   28. Master of Karate and Friendship (Kyle C) Posted: February 16, 2006 at 11:56 AM (#1864427)
Mike, I'd be more than happy to help in any way possible. And you may not like the idea, but I've ordered the O'Reilly book and would be willing to set up a database for this info, once I learn how of course.

And thanks for the email back. I'll let you know if I find it to be useful at all.
   29. WTM Posted: February 16, 2006 at 12:11 PM (#1864455)
Mike,

If some technological solution isn't forthcoming, I'd be willing to try to compile some data for the Pirates' farm system. I couldn't do it all, although I could try to find somebody to help.
   30. Kyle S Posted: February 16, 2006 at 12:34 PM (#1864490)
what about a "prospect wiki" that anyone who saw someone play could contribute to? it wouldn't be the most scientific thing in the world, but it would be somewhat useful, right?
   31. Mike Emeigh Posted: February 16, 2006 at 01:39 PM (#1864588)
what about a "prospect wiki" that anyone who saw someone play could contribute to? it wouldn't be the most scientific thing in the world, but it would be somewhat useful, right?


Yes, it would.

-- MWE
   32. The Hop-Clop Goes On (psa1) Posted: February 16, 2006 at 10:59 PM (#1865410)
Ok, massive adjustment to the little I said last night. As far as I know how to spider, spidering the site is NOT a viable solution. That's because the game logs are not part of the HTML; they're generated by javascript after the fact.

However, since one can easily cut-and-paste the game log into a text file, I think there's a compromise to be reached between spidering and doing it all by hand. I'm working on a program that will take said text file and convert it as far as possible into a standardized pbp format. There are problems with that, too, though, because the game logs don't have all the necessary information--for instance, if a defensive player never makes a play and doesn't get replaced, his name never shows up in conjunction with his position. If a starting pitcher never makes a defensive play and isn't replaced, his name never shows up in the game log at all!

Thus, the box scores would appear to come in handy...but in the box scores (unlike the game logs) complete player names aren't given. So when Travis Smith throws a complete game, going to the box score to find "Smith, P" is less than ideal.

In other words, I don't think there's a 100% technological solution to this.

I do think however, that I can reduce the amount of work in "logging" each game into a database down to 120 seconds or less, maybe much less. For the ~10,000 minor league games of 2005, I don't think I'm going to do all those myself unless I can get that number down to 15 seconds or so. I may be able to create some sort of web interface where volunteers can enter the game logs and then, upon prompting, enter things like starting pitcher names and player positions that aren't given by the game log. Or it might be the time it takes to make that sufficiently user-friendly is greater than the time it takes me to do it all myself or work with a very small group of volunteers. We'll see.

Obviously I'm thinking out loud here. Perhaps less obviously, I'm a total amateur at this stuff; I wouldn't be surprised if somebody came along and said they could easily do all the tasks I'm saying aren't technologically doable.
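For a sense of scale, the per-game times mentioned above turn into total hours like this. A quick sketch using only the comment's own figures of roughly 10,000 games and 120 or 15 seconds per game; `hours_needed` is just a helper name for the example.

```python
GAMES = 10_000  # rough count of 2005 minor league games, per the comment

def hours_needed(seconds_per_game: float, games: int = GAMES) -> float:
    """Total logging time in hours for the whole season."""
    return seconds_per_game * games / 3600

for secs in (120, 15):
    print(f"{secs:>3} s/game -> {hours_needed(secs):.1f} hours")
```

At 120 seconds a game the season is a few hundred hours of work, which is volunteer-group territory; at 15 seconds it drops to about a work week for one person.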
   33. Templeusox has reached his genetic threshold Posted: February 16, 2006 at 11:06 PM (#1865425)
This was done earlier tonight on sp.com. Check out the second-to-last post. Is this what you guys are talking about?
   34. The Hop-Clop Goes On (psa1) Posted: February 16, 2006 at 11:53 PM (#1865496)
I'm not sure we're talking about exactly the same things.

Are you deriving that data from the pbp logs, or from team stats already compiled on the miLB site?
   35. The Hop-Clop Goes On (psa1) Posted: February 17, 2006 at 11:57 PM (#1866871)
Since I know everyone is breathless in anticipation of an update on my progress :)

I'm 90-95% done with a parser to convert every play in an miLB pbp log into something standardized and more or less like retrosheet event codes. What I've got so far deals with upwards of 99% of plays...but as anybody who's spent time with retrosheet event codes knows, the last 1%, even the last 0.1% of plays can get tricky, what with stuff like "1(B)16(2)63(1)/LTP/L1"

The bigger challenge I'm looking at now is how to make the results optimally usable. Let's assume for a second I can turn all the available pbp logs into a database that more or less mirrors what miLB has on their servers, providing us with the trickle of data they let out. What do you want from that, and how do you want it?

Should my end goal be something that has syntax virtually identical to retrosheet, so that you can use RS's parsers to come up with csv files and the like in a format you're familiar with? I'm not sure I could deliver that, as I don't know how easily one could expand the RS syntax to include all minor leagues...and I don't really want to embark on yet ANOTHER system of labelling players so that we can tell one Angel Garcia (garca006) from another (garca007).

Would you rather I aim toward something in the vein of David Pinto's day-by-day database? I'd like to come up with that sort of thing no matter what ...if I do, what functionality do you want from it?
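To show why that last example is awkward, here is a minimal sketch of pulling a retrosheet-style event string apart. It assumes the usual layout of a basic play, then "/"-separated modifiers, then runner advances after a "." separated by ";". It is a simplification for illustration, not the parser described above, and `split_event` is a made-up name.

```python
import re

def split_event(event: str) -> dict:
    """Split a retrosheet-style event into play, modifiers, and advances."""
    # Advances, if any, follow the first '.', separated by ';'.
    head, _, advance_part = event.partition(".")
    # The basic play comes first; '/'-separated modifiers follow it.
    play, *modifiers = head.split("/")
    return {
        "play": play,
        "modifiers": modifiers,
        "advances": advance_part.split(";") if advance_part else [],
        # Parenthesized tokens in the play name who was put out (B = batter).
        "put_out": re.findall(r"\(([B123])\)", play),
    }

parsed = split_event("1(B)16(2)63(1)/LTP/L1")
print(parsed)
```

For the triple play quoted above this gives three put-outs (batter, runner from second, runner from first) and the two modifiers LTP and L1. Turning free-text log descriptions into strings like that is the hard part, and this sketch does not attempt it.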
   36. Kyle S Posted: February 18, 2006 at 09:22 PM (#1867537)
I would be satisfied with a database full of minor league stats ala the lahman database - splits, G/F, etc would be nice, but I don't have even the basics yet. The Baseball Cube has a database that they charge like $100 for, but free is always better :)
   37. Mike Emeigh Posted: February 18, 2006 at 09:31 PM (#1867546)
Should my end goal be something that has syntax virtually identical to retrosheet, so that you can use RS's parsers to come up with csv files and the like in a format you're familiar with? I'm not sure I could deliver that, as I don't know how easily one could expand the RS syntax to include all minor leagues...and I don't really want to embark on yet ANOTHER system of labelling players so that we can tell one Angel Garcia (garca006) from another (garca007).


I see no reason why you could not use MiLB's ID numbers for players as the ID keys. And they are retrievable; E-mail me off-list if you don't know how to get them.

-- MWE
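A small sketch of what keying on a numeric ID buys you. The ID values and team labels below are placeholders invented for the example; they are not real MiLB IDs, and the record layout is an assumption.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Player:
    player_id: int  # placeholder value, not a real MiLB ID
    name: str
    team: str       # placeholder label

# Two different players who share a name.
players = {
    p.player_id: p
    for p in (
        Player(100001, "Angel Garcia", "Team A"),
        Player(100002, "Angel Garcia", "Team B"),
    )
}

# Events refer to players by ID only, so the shared name is harmless.
events = [(100001, "K"), (100002, "GB"), (100001, "FB"), (100001, "K")]

strikeouts = Counter(pid for pid, outcome in events if outcome == "K")
for pid, count in strikeouts.items():
    print(players[pid].name, players[pid].team, count)
```

With the ID as the key there is no need to invent a garca006/garca007 style label; the name becomes display data only.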
   38. The Hop-Clop Goes On (psa1) Posted: February 18, 2006 at 09:50 PM (#1867573)
That could work. I see that miLB uses them for everything, so they appear in url's for player pages. I suppose a spider could retrieve all of the player-ID pairs...or is there a simpler way you are thinking of?
   39. Mike Emeigh Posted: February 18, 2006 at 09:57 PM (#1867579)
I suppose a spider could retrieve all of the player-ID pairs...or is there a simpler way you are thinking of?


Nope.

-- MWE
   40. Forever Red 9 Posted: February 19, 2006 at 05:33 PM (#1868374)
#34: Are you deriving that data from the pbp logs, or from team stats already compiled on the miLB site?

I'm extracting it all from PBP logs and lots of excel equations. Like you said, 99.9% of the data can be extracted from the game logs, and that final .1% is human work for things such as complete games. However, I have gotten it done to just copying and pasting, for a whole season it takes about an hour, and 99.9% of the data will be spit out. As for the defensive stats, they seem out of my league right now and I don't see how to collect them at this moment.

For the past year, I have been collecting splits for the Red Sox minors on the soxprospects message boards, but I've had to insert things manually, which can take a lot of time. Realistically, I could only do 1 team without burning out and do the others after the season. However, I'm working on equations to set up the lefty-righty splits and men on base, but they are still a work in progress. I'm hoping to get this set up by the start of the season.
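The lefty-righty split described here is the same bookkeeping whether it lives in spreadsheet formulas or a script. A minimal Python sketch, assuming each plate appearance has already been reduced to a pitcher hand and an outcome label; the sample rows and outcome names are invented for illustration.

```python
from collections import defaultdict

# (pitcher_throws, outcome) for one hypothetical batter; invented sample data.
plate_appearances = [
    ("L", "single"), ("R", "strikeout"), ("R", "walk"),
    ("L", "groundout"), ("R", "home_run"), ("R", "flyout"),
]

HITS = {"single", "double", "triple", "home_run"}
NOT_AT_BAT = {"walk", "hit_by_pitch", "sac_bunt", "sac_fly"}

def splits(pas):
    """Accumulate PA, AB, and H separately for each pitcher hand."""
    totals = defaultdict(lambda: {"PA": 0, "AB": 0, "H": 0})
    for hand, outcome in pas:
        line = totals[hand]
        line["PA"] += 1
        if outcome not in NOT_AT_BAT:
            line["AB"] += 1
        if outcome in HITS:
            line["H"] += 1
    return dict(totals)

by_hand = splits(plate_appearances)
for hand, line in sorted(by_hand.items()):
    print(hand, line)
```

Men-on-base splits would follow the same pattern with the base state as the grouping key instead of the pitcher's hand.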
   41. The Hop-Clop Goes On (psa1) Posted: February 22, 2006 at 01:07 PM (#1871340)
As for the defensive stats, they seem out of my league right now and I don't see how to collect them at this moment.

Understanding that we don't have great bip-location/type data from those pbp logs, I'm pretty sure I've got my program set up to collect any defensive stats one can glean from retrosheet logs. (That is, I'm tracking who's on the field at all times, and as much of the bip-location/type as the logs give us.) Speaking of which, I'm still testing, but I'm at the point where my program handles 99.5% or so of all plays, and outputs something virtually identical to a retrosheet log. So I guess it's doable. I wouldn't necessarily have charged ahead over the last few days if I had seen your post, but BTF has been inaccessible to me (and I've been sick) for the last three days or so. I'll shoot you an email so I can clarify what you've done and maybe we can help each other out a bit.
   42. Der Komminsk-sar Posted: March 20, 2006 at 05:55 PM (#1908947)
bump
   43. Kyle S Posted: March 20, 2006 at 06:12 PM (#1908966)
anyone have anything resembling a minor-league version of the lahman database? the cube sells something like this, but free is better than $85, so I figured I'd ask...
   44. Mike Emeigh Posted: March 29, 2006 at 06:22 PM (#1924714)
anyone have anything resembling a minor-league version of the lahman database?


Not of which I am aware.

-- MWE