Baseball for the Thinking Fan


Reader Comments and Retorts

Statements posted here are those of our readers and do not represent the BaseballThinkFactory. Names are provided by the poster and are not verified. We ask that posters follow our submission policy. Please report any inappropriate comments.

   1. NJ is feeling better Posted: February 16, 2006 at 12:15 AM (#1863612)
Mike, I've already begun collecting these types of "advanced stats" for Yankee minor leaguers and FWIW, your Hughes' numbers are wrong. I think you did GO:FO, not GB:FB because my numbers are drastically different from yours and I have every bit of confidence in them.
   2. The Hop-Clop Goes On (psa1) Posted: February 16, 2006 at 12:59 AM (#1863668)
If the file naming conventions for MLBAM are sensible, and we're just collecting text pbp data (that is, not actually watching video and making judgments) couldn't somebody write a script and theoretically extract it all in one step? My programming capabilities are just short of that, but it seems like a relatively simple thing for a real programmer to do.

Or am I missing something?
   3. Jim Wisinski Posted: February 16, 2006 at 01:28 AM (#1863691)
#2, I have no clue if that would be possible or not but the file names for the game logs are uniformly named so that shouldn't be an obstacle.

Going back and gathering the data after the fact is definitely a pain, I did that to get Andy Sonnanstine's L/R splits for '05 and it took a lot of time to get that done.

I plan on doing something like this for Rays prospects in 2006 though I was only intending on getting data for specific players. I may be willing to expand that for every player on the Rays affiliates, it probably wouldn't take too much more time than getting specific players since I'll be going through the whole log anyway.
   4. Mike Emeigh Posted: February 16, 2006 at 01:43 AM (#1863708)
I think you did GO:FO, not GB:FB because my numbers are drastically different from yours and I have every bit of confidence in them.


No, I didn't - I included hits. I also counted popups, and counted bunts separately (there were only a couple of those). Here's what I have for Hughes's outings:

Charleston:

4/8: 8 FB, 3 GB, 0 LD, 6 K, 1 W
4/13: 7 FB, 7 GB, 3 LD, 3 K, 2 W
4/18: 3 FB, 9 GB, 0 LD, 1 bunt, 2 K, 1 W
4/26: 3 FB, 5 GB, 0 LD, 1 bunt, 8 K, 2 W
5/1: 3 FB, 7 GB, 0 LD, 8 K, 3 W
5/7: 3 FB, 8 GB, 1 LD, 7 K, 2 W
5/15: 5 FB, 7 GB, 3 LD, 7 K, 1 W
5/20: 5 FB, 10 GB, 1 LD, 7 K, 1 W
5/25: 6 FB, 7 GB, 3 LD, 7 K, 1 W, 1 HB
6/1: 8 FB, 7 GB, 4 LD, 4 K, 1 W, 1 HB
6/7: 9 FB, 7 GB, 4 LD, 5 K, 0 W
6/12: 6 FB, 7 GB, 3 LD, 8 K, 1 W, 1 HB

Totals: 66 FB, 84 GB, 22 LD, 2 bunts, 70 K, 16 W, 3 HB

Tampa:

7/7: 5 FB, 2 GB, 0 LD, 1 bunt, 2 K, 1 W
7/13 (relief): 3 FB, 2 GB, 0 LD, 6 K, 1 W, 1 HB
7/20: 1 FB, 4 GB, 0 LD, 1 K, 0 W
7/25: 3 FB, 3 GB, 1 LD, 7 K, 1 W
7/31: 3 FB, 4 GB, 4 LD, 1 bunt, 5 K, 1 W, 2 HB

Totals: 15 FB, 15 GB, 5 LD, 2 bunts, 21 K, 4 W, 3 HB

--- MWE
   5. Mike Emeigh Posted: February 16, 2006 at 01:45 AM (#1863711)
I added wrong; Hughes's Ks add up to 72 at Charleston.

-- MWE
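
The correction above can be checked mechanically. This is a minimal sketch that re-adds the Charleston lines exactly as posted in comment 4; the `tally` helper and its regular expression are illustrative assumptions, not anything Mike used.

```python
import re
from collections import Counter

# Charleston outings, copied verbatim from comment 4.
CHARLESTON = """\
4/8: 8 FB, 3 GB, 0 LD, 6 K, 1 W
4/13: 7 FB, 7 GB, 3 LD, 3 K, 2 W
4/18: 3 FB, 9 GB, 0 LD, 1 bunt, 2 K, 1 W
4/26: 3 FB, 5 GB, 0 LD, 1 bunt, 8 K, 2 W
5/1: 3 FB, 7 GB, 0 LD, 8 K, 3 W
5/7: 3 FB, 8 GB, 1 LD, 7 K, 2 W
5/15: 5 FB, 7 GB, 3 LD, 7 K, 1 W
5/20: 5 FB, 10 GB, 1 LD, 7 K, 1 W
5/25: 6 FB, 7 GB, 3 LD, 7 K, 1 W, 1 HB
6/1: 8 FB, 7 GB, 4 LD, 4 K, 1 W, 1 HB
6/7: 9 FB, 7 GB, 4 LD, 5 K, 0 W
6/12: 6 FB, 7 GB, 3 LD, 8 K, 1 W, 1 HB
"""


def tally(lines: str) -> Counter:
    """Sum each category across all outings (hypothetical helper)."""
    totals = Counter()
    for line in lines.strip().splitlines():
        # Drop the date prefix so its digits are never counted.
        _, stats = line.split(": ", 1)
        for count, label in re.findall(r"(\d+) (FB|GB|LD|bunt|K|W|HB)\b", stats):
            totals[label] += int(count)
    return totals


totals = tally(CHARLESTON)
print(dict(totals))
# GB:FB under Mike's convention (popups counted in FB, bunts kept separate).
print(round(totals["GB"] / totals["FB"], 2))
```

The strikeout sum comes out to 72, matching the correction, and the other categories agree with the totals line in comment 4.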
   6. Mike Emeigh Posted: February 16, 2006 at 01:48 AM (#1863715)
The FB totals include the popups as well.

-- MWE
   7. Gwyn Posted: February 16, 2006 at 01:48 AM (#1863716)
O'Reilly (a techie publishing company, if you're not familiar with them) has just published Baseball Hacks by Joseph Adler. It contains code to spider the MLB Gameday site and download the MLB pbp data; I would think it's easily adaptable to the minor league data.
It also has scripts to convert this data to the Retrosheet format and load it into a MySQL database. The code is in Perl and, from a very cursory inspection, seems to be well written and well commented.
   8. NJ is feeling better Posted: February 16, 2006 at 02:01 AM (#1863729)
Oh ok, I didn't include popups in the FB total and I included bunts in GB, so that's where our difference lies.
   9. Jim Wisinski Posted: February 16, 2006 at 02:05 AM (#1863732)
By the way MWE, did you get my e-mail about the RaysBB list?
   10. Mike Emeigh Posted: February 16, 2006 at 02:06 AM (#1863735)
I'm familiar with O'Reilly, and I've seen the blurbs for Baseball Hacks. I think that spidering the Web site is a dangerous tactic, because it might lead MLB to make the data inaccessible. MLB doesn't *have* to publish game logs, and it's unlikely that anyone else will do it for the minors.

-- MWE
   11. Mike Emeigh Posted: February 16, 2006 at 02:08 AM (#1863737)
Jim, I did. I have been uncommonly busy this week at work, and haven't had much time to post before tonight. I have a batch of things piled up.

Meh was supposed to get a set of keys as well, but I don't think Dan's done anything about it yet.

-- MWE
   12. Jim Wisinski Posted: February 16, 2006 at 02:32 AM (#1863748)
Ok, I just wanted to make sure BTF-mail didn't swallow it.
   13. PhillyBooster Posted: February 16, 2006 at 03:26 AM (#1863795)
Going back and gathering the data after the fact is definitely a pain, I did that to get Andy Sonnanstine's L/R splits for '05 and it took a lot of time to get that done.


Baseball America's 2006 almanac contains L/R splits for all hitters. This is, I believe, the first year they did this, so if anyone else is looking for that data for a 2005 player, I can probably look it up much faster than you can gather the data.
   14. El Hijo del Ron Santo (Alan Keiper) Posted: February 16, 2006 at 03:29 AM (#1863799)
That almanac is also a tremendous source of minor league fielding data.
   15. Mike Emeigh Posted: February 16, 2006 at 03:39 AM (#1863815)
This is, I believe, the first year they did this, so if anyone else is looking for that data for a 2005 player, I can probably look it up much faster than you can gather the data.


If you're looking for 2005 data, that is. However, if you're gathering 2006 data in real time, you'll have L/R splits for 2006 hitters long before BA publishes its 2007 almanac.

-- MWE
   16. Mike Emeigh Posted: February 16, 2006 at 03:41 AM (#1863816)
That almanac is also a tremendous source of minor league fielding data.


So is the TSN guide (with the exception of 2004, when everyone got hosed because of the problems with the minor leagues' official stat compiler).

-- MWE
   17. PhillyBooster Posted: February 16, 2006 at 03:49 AM (#1863824)


If you're looking for 2005 data, that is. However, if you're gathering 2006 data in real time, you'll have L/R splits for 2006 hitters long before BA publishes its 2007 almanac.


You obviously have never seen me try to compile data manually.
   18. WTM Posted: February 16, 2006 at 04:07 AM (#1863837)
I think that spidering the Web site is a dangerous tactic, because it might lead MLB to make the data inaccessible.

I'm probably missing the point here, but I'm not sure why they'd do this, as they're providing the game logs free. I doubt they see much money (by MLB standards) in minor league stats, or they could be doing a lot more with the minor league web site, like providing splits themselves.
   19. The Hop-Clop Goes On (psa1) Posted: February 16, 2006 at 04:13 AM (#1863838)
Hmmm. I hadn't looked at the site before I made my original comment...y'all are probably way ahead of me here, but it's designed great for spidering. Unfortunately I don't work in Perl or I'd try to adapt the BB Hacks script, but I think it's doable, even by amateur me. I'll let you know, Mike, if it looks like I could come up with a retrosheet-like database from this stuff.
   20. philly Posted: February 16, 2006 at 05:21 AM (#1863901)
The FB totals include the popups as well.

Mike,

I looked at a few Sox prospects game logs last year and it seemed like they recorded a lot of popups. Given that popups have some skill component in MLB pitchers it made me wonder that perhaps good minor league pitching prospects tended to have higher than normal popup rates.

If this type of a project gets off the ground it would be great to keep popups separate in case that is true.
   21. AROM Posted: February 16, 2006 at 05:32 AM (#1863906)
That almanac is also a tremendous source of minor league fielding data.

Anyone know if there is any source for SB/CS for minor league catchers?
   22. rdfc Posted: February 16, 2006 at 01:59 PM (#1864291)
It would obviously be useful to have minor league left/right splits too. MLB.com promised to do those last year, but like so many things they've promised that aren't part of their video packages (such as the super game coverage with pitch speed, etc.), they never showed up.
   23. NJ is feeling better Posted: February 16, 2006 at 02:34 PM (#1864328)
rdfc, the annoying part is that they kept those numbers, along with home/away splits and RISP splits, but just didn't let the public see them on a regular basis. Whenever they did a minor league gamecast, each player had these splits freely available to peruse.
   24. Kyle S Posted: February 16, 2006 at 03:05 PM (#1864365)
philly: chuck james is in that camp as well; his hysterically low gb/fb ratio is in large part due to popups induced.
   25. Kyle S Posted: February 16, 2006 at 03:10 PM (#1864371)
by the way, that o'reilly book is a fantastic find. i've read the other article by that author on using R for statistical analysis and i liked that too.

good thing i'm getting my bonus soon - what with tango/mgl's book, the o'reilly book, the sickels book, the ba 2006 book, and maybe even bpro's annual, i'm going to spend a lot of money on baseball books in the next month.
   26. Der Komminsk-sar Posted: February 16, 2006 at 03:13 PM (#1864375)
Mike, I'm glad you proposed this - I've wanted to suggest the same, but know I won't have the time to coordinate anything in '06.
I can likely make a small contribution, presuming spidering doesn't work.
   27. Dan Szymborski Posted: February 16, 2006 at 03:25 PM (#1864390)
Dan can't give keys for this blog - I forwarded the e-mail awhile ago to Jim but I guess he didn't see it.
   28. Tom Cervo, backup catcher Posted: February 16, 2006 at 03:56 PM (#1864427)
Mike, I'd be more than happy to help in any way possible. And you may not like the idea, but I've ordered the O'Reilly book and would be willing to set up a database for this info, once I learn how of course.

And thanks for the email back. I'll let you know if I find it to be useful at all.
   29. WTM Posted: February 16, 2006 at 04:11 PM (#1864455)
Mike,

If some technological solution isn't forthcoming, I'd be willing to try to compile some data for the Pirates' farm system. I couldn't do it all, although I could try to find somebody to help.
   30. Kyle S Posted: February 16, 2006 at 04:34 PM (#1864490)
what about a "prospect wiki" that anyone who saw someone play could contribute to? it wouldn't be the most scientific thing in the world, but it would be somewhat useful, right?
   31. Mike Emeigh Posted: February 16, 2006 at 05:39 PM (#1864588)
what about a "prospect wiki" that anyone who saw someone play could contribute to? it wouldn't be the most scientific thing in the world, but it would be somewhat useful, right?


Yes, it would.

-- MWE
   32. The Hop-Clop Goes On (psa1) Posted: February 17, 2006 at 02:59 AM (#1865410)
Ok, massive adjustment to the little I said last night. As far as I know how to spider, spidering the site is NOT a viable solution. That's because the game logs are not part of the HTML; they're generated by javascript after the fact.

However, since one can easily cut-and-paste the game log into a text file, I think there's a compromise to be reached between spidering and doing it all by hand. I'm working on a program that will take said text file and convert it as far as possible into a standardized pbp format. There are problems with that, too, though, because the game logs don't have all the necessary information--for instance, if a defensive player never makes a play and doesn't get replaced, his name never shows up in conjunction with his position. If a starting pitcher never makes a defensive play and isn't replaced, his name never shows up in the game log at all!

Thus, the box scores would appear to come in handy...but in the box scores (unlike the game logs) complete player names aren't given. So when Travis Smith throws a complete game, going to the box score to find "Smith, P" is less than ideal.

In other words, I don't think there's a 100% technological solution to this.

I do think however, that I can reduce the amount of work in "logging" each game into a database down to 120 seconds or less, maybe much less. For the ~10,000 minor league games of 2005, I don't think I'm going to do all those myself unless I can get that number down to 15 seconds or so. I may be able to create some sort of web interface where volunteers can enter the game logs and then, upon prompting, enter things like starting pitcher names and player positions that aren't given by the game log. Or it might be the time it takes to make that sufficiently user-friendly is greater than the time it takes me to do it all myself or work with a very small group of volunteers. We'll see.

Obviously I'm thinking out loud here. Perhaps less obviously, I'm a total amateur at this stuff; I wouldn't be surprised if somebody came along and said they could easily do all the tasks I'm saying aren't technologically doable.
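
For a sense of scale, the per-game figures above translate into total hours as follows. This is just the arithmetic from the comment, using its rough count of 10,000 games; the function name is made up for the example.

```python
GAMES = 10_000  # rough 2005 minor league game count quoted in the comment


def total_hours(seconds_per_game: float, games: int = GAMES) -> float:
    """Hours needed to log every game at a given per-game pace."""
    return seconds_per_game * games / 3600


for secs in (120, 15):
    print(f"{secs:>3} s/game -> {total_hours(secs):.1f} hours")
```

At 120 seconds a game the season is roughly 333 hours of work; at 15 seconds it drops to about 42, which is why the lower number is the threshold for doing it solo.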
   33. Social media assassin (Templeusox) Posted: February 17, 2006 at 03:06 AM (#1865425)
This was done earlier tonight on sp.com. Check out the second to last post. Is this what you guys are talking about?
   34. The Hop-Clop Goes On (psa1) Posted: February 17, 2006 at 03:53 AM (#1865496)
I'm not sure we're talking about exactly the same things.

Are you deriving that data from the pbp logs, or from team stats already compiled on the miLB site?
   35. The Hop-Clop Goes On (psa1) Posted: February 18, 2006 at 03:57 AM (#1866871)
Since I know everyone is breathless in anticipation of an update on my progress :)

I'm 90-95% done with a parser to convert every play in an miLB pbp log into something standardized and more or less like retrosheet event codes. What I've got so far deals with upwards of 99% of plays...but as anybody who's spent time with retrosheet event codes knows, the last 1%, even the last 0.1% of plays can get tricky, what with stuff like "1(B)16(2)63(1)/LTP/L1"

The bigger challenge I'm looking at now is how to make the results optimally usable. Let's assume for a second I can turn all the available pbp logs into a database that more or less mirrors what miLB has on their servers, providing us with the trickle of data they let out. What do you want from that, and how do you want it?

Should my end goal be something that has syntax virtually identical to retrosheet, so that you can use RS's parsers to come up with csv files and the like in a format you're familiar with? I'm not sure I could deliver that, as I don't know how easily one could expand the RS syntax to include all minor leagues...and I don't really want to embark on yet ANOTHER system of labelling players so that we can tell one Angel Garcia (garca006) from another (garca007).

Would you rather I aim toward something in the vein of David Pinto's day-by-day database? I'd like to come up with that sort of thing no matter what ...if I do, what functionality do you want from it?
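
To show why a code like the triple play quoted above is awkward to handle, here is a rough sketch of splitting a Retrosheet-style event string into its parts. It assumes the usual layout (basic play, then "/"-separated modifiers, then advances after a "."); the `Event` type and `split_event` function are hypothetical and cover only the simple cases, not the full Retrosheet grammar.

```python
import re
from typing import List, NamedTuple


class Event(NamedTuple):
    basic: str              # fielding sequence or hit code, e.g. "S7"
    modifiers: List[str]    # e.g. ["LTP", "L1"]
    advances: List[str]     # e.g. ["1-3"]
    runners_out: List[str]  # runners named in parentheses in the basic play


def split_event(event: str) -> Event:
    # Advances follow the first "."; split them off before touching "/",
    # because advance fields can themselves contain slashes.
    play, _, adv = event.partition(".")
    basic, *modifiers = play.split("/")
    advances = adv.split(";") if adv else []
    # (B), (1), (2), (3) mark which runner each putout retired.
    runners_out = re.findall(r"\(([B123])\)", basic)
    return Event(basic, modifiers, advances, runners_out)


triple_play = split_event("1(B)16(2)63(1)/LTP/L1")
print(triple_play)
print(split_event("S7/L.1-3"))
```

Even this simplified version has to treat the parenthesized runners, the modifiers, and the advances as three separate sub-languages, which is where the last fraction of a percent of plays tends to cause trouble.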
   36. Kyle S Posted: February 19, 2006 at 01:22 AM (#1867537)
I would be satisfied with a database full of minor league stats ala the lahman database - splits, G/F, etc would be nice, but I don't have even the basics yet. The Baseball Cube has a database that they charge like $100 for, but free is always better :)
   37. Mike Emeigh Posted: February 19, 2006 at 01:31 AM (#1867546)
Should my end goal be something that has syntax virtually identical to retrosheet, so that you can use RS's parsers to come up with csv files and the like in a format you're familiar with? I'm not sure I could deliver that, as I don't know how easily one could expand the RS syntax to include all minor leagues...and I don't really want to embark on yet ANOTHER system of labelling players so that we can tell one Angel Garcia (garca006) from another (garca007).


I see no reason why you could not use MiLB's ID numbers for players as the ID keys. And they are retrievable; E-mail me off-list if you don't know how to get them.

-- MWE
   38. The Hop-Clop Goes On (psa1) Posted: February 19, 2006 at 01:50 AM (#1867573)
That could work. I see that miLB uses them for everything, so they appear in url's for player pages. I suppose a spider could retrieve all of the player-ID pairs...or is there a simpler way you are thinking of?
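
A small sketch of the idea of keying everything on a site-issued numeric ID rather than inventing new labels. The ID values and team names below are invented for illustration and are not real MiLB identifiers; only the approach comes from the discussion.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Player:
    player_id: int  # hypothetical value standing in for a site-issued ID
    name: str
    team: str       # hypothetical affiliation


# Two players sharing a name stay distinct because the key is the ID.
players: Dict[int, Player] = {
    p.player_id: p
    for p in (
        Player(100001, "Angel Garcia", "Team A"),
        Player(100002, "Angel Garcia", "Team B"),
    )
}


def by_name(name: str) -> List[Player]:
    """All players with this exact name (may be more than one)."""
    return [p for p in players.values() if p.name == name]


print(by_name("Angel Garcia"))
print(players[100002].team)
```

Play records would then carry the ID, with names looked up only for display.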
   39. Mike Emeigh Posted: February 19, 2006 at 01:57 AM (#1867579)
I suppose a spider could retrieve all of the player-ID pairs...or is there a simpler way you are thinking of?


Nope.

-- MWE
   40. Forever Red 9 Posted: February 19, 2006 at 09:33 PM (#1868374)
#34: Are you deriving that data from the pbp logs, or from team stats already compiled on the miLB site?

I'm extracting it all from PBP logs and lots of Excel equations. Like you said, 99.9% of the data can be extracted from the game logs, and that final .1% is human work for things such as complete games. However, I have gotten it down to just copying and pasting; a whole season takes about an hour, and 99.9% of the data will be spit out. As for the defensive stats, they seem out of my league right now and I don't see how to collect them at this moment.

For the past year, I have been collecting splits for the Red Sox minors on the soxprospects message boards, but I've had to insert things manually, which can take a lot of time. Realistically, I could only do 1 team without burning out, and do the others after the season. However, I'm working on equations to set up the lefty-righty splits and men-on-base splits, but they are still a work in progress. I'm hoping to get this set up by the start of the season.
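
The lefty-righty split described above boils down to grouping plate appearances by pitcher hand. A minimal sketch, assuming each plate appearance has already been reduced to a (batter, pitcher hand, result) row; the rows, result labels, and `splits` function are all made up for the example and do not reflect the Excel setup mentioned in the comment.

```python
from collections import defaultdict

# Made-up plate appearances: (batter, pitcher_hand, result).
PAS = [
    ("Batter A", "L", "single"),
    ("Batter A", "L", "strikeout"),
    ("Batter A", "R", "walk"),
    ("Batter A", "R", "groundout"),
    ("Batter A", "R", "home run"),
    ("Batter A", "R", "flyout"),
]

HITS = {"single", "double", "triple", "home run"}
NOT_AT_BAT = {"walk", "hit by pitch", "sacrifice"}


def splits(pas):
    """Accumulate PA, AB and H per (batter, pitcher hand)."""
    table = defaultdict(lambda: {"PA": 0, "AB": 0, "H": 0})
    for batter, hand, result in pas:
        row = table[(batter, hand)]
        row["PA"] += 1
        if result not in NOT_AT_BAT:
            row["AB"] += 1
        if result in HITS:
            row["H"] += 1
    return dict(table)


for (batter, hand), row in splits(PAS).items():
    avg = row["H"] / row["AB"] if row["AB"] else 0.0
    print(batter, "vs", hand, row, f"AVG {avg:.3f}")
```

The same grouping key could be swapped for base state to get the men-on-base split.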
   41. The Hop-Clop Goes On (psa1) Posted: February 22, 2006 at 05:07 PM (#1871340)
As for the defensive stats, they seem out of my league right now and I don't see how to collect them at this moment.

Understanding that we don't have great bip-location/type data from those pbp logs, I'm pretty sure I've got my program set up to collect any defensive stats one can glean from retrosheet logs. (That is, I'm tracking who's on the field at all times, and as much of the bip-location/type as the logs give us.) Speaking of which, I'm still testing, but I'm at the point where my program handles 99.5% or so of all plays, and outputs something virtually identical to a retrosheet log. So I guess it's doable. I wouldn't necessarily have charged ahead over the last few days if I had seen your post, but BTF has been inaccessible to me (and I've been sick) for the last three days or so. I'll shoot you an email so I can clarify what you've done and maybe we can help each other out a bit.
   42. Der Komminsk-sar Posted: March 20, 2006 at 09:55 PM (#1908947)
bump
   43. Kyle S Posted: March 20, 2006 at 10:12 PM (#1908966)
anyone have anything resembling a minor-league version of the lahman database? the cube sells something like this, but free is better than $85, so I figured I'd ask...
   44. Mike Emeigh Posted: March 29, 2006 at 10:22 PM (#1924714)
anyone have anything resembling a minor-league version of the lahman database?


Not of which I am aware.

-- MWE