2. Der Komminsksar
Posted: January 19, 2005 at 07:42 PM (#1088929)
bump (i'll try to contribute more than this when i get off work)
3. Chris Cobb
Posted: January 19, 2005 at 09:18 PM (#1089185)
I hope some of the recent posts from the Beckwith thread can get copied over here for long-term continuity, but let me raise a new question here for now.
It's clear that, when attempting to create season-by-season MLEs for Negro-League players, some regression to the mean for seasonal totals is appropriate. The questions are: how much and how to establish the mean.
I have been doing regressions in an unsystematic way, and I'm sure it can be done better.
So far I have been using career totals as the mean towards which to regress, but I am considering changing to a rolling 5-year mean, with necessary alterations at the beginning and end of careers. Would that lead to more probable results?
I have been simply guessing about how far to regress towards the mean: I am not a trained statistician, and would welcome plain-English guidance from those with more knowledge!
4. Rally
Posted: January 19, 2005 at 11:32 PM (#1089469)
You regress to the mean of the population the player comes from, not to his career average or anything like that.
For Negro League players, you would regress to the Negro League average. Are those stats available? If not you could regress to the AL or NL league averages, if you believe the quality of play was about equal in those leagues.
5. Chris Cobb
Posted: January 20, 2005 at 12:20 AM (#1089506)
You regress to the mean of the population the player comes from, not to his career average or anything like that.
Please pardon my ignorance, but why should you regress to the population?
Negro-League averages are seldom available. After making the conversion to major-league equivalents, regression to major-league averages could be done.
6. jimd
Posted: January 20, 2005 at 12:58 AM (#1089545)
You regress to the mean of the population the player comes from, not to his career average or anything like that.
I can see that an incomplete career line would be regressed to the NeL mean. Would the incomplete seasons then be regressed to the player's regressed career line? Does that make any sense?
7. KJOK
Posted: January 20, 2005 at 01:00 AM (#1089547)
I would say that the purpose of regression is to come up with what the player's "true" performance would be if he had played a "standard" season.
I think obviously Beckwith was not an average Negro League hitter, so regressing to THAT population wouldn't be correct.
Regressing to his career totals would probably be the way I would go absent a better method. You could use your proposed 5 year mean idea IF you have sufficient # of plate appearances during that time I guess.
8. karlmagnus
Posted: January 20, 2005 at 01:02 AM (#1089551)
As a math major myself, I'm beginning to see why the NL conversions all come up with very high ML equivalents. If the NL/ML conversion factor is .85, and both leagues have averages of .250, then a .400 NL average on 125AB should be regressed to .325 to get an equivalent for a ML season of 500AB. Convert this, and you get .276 as the MLE.
Go the other way round, converting first, gives you a .340 MLE, which when regressed 50% as above gives you .295.
In other words, regressing and conversion aren't commutative.
(i) Is this what you're doing?
(ii) Is it correct that you should regress to the NL mean first - surely right?
(iii) Am I then correct that doing it the other way round wrongly inflates MLEs?
This would explain why we're getting so many HOMable NLers; the difference between .276 and .295 is not gigantic, but it is substantial.
I'm quite prepared to be told I'm out to lunch, and I will promise to understand why if I am.
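For anyone who wants to check the arithmetic in this post, here is a minimal Python sketch. The helper names `regress` and `convert` are made up for illustration, and the 50% shrink, the .250 league mean, and the .85 factor are only the example values used above, not established figures.

```python
def regress(avg, mean, keep):
    """Move avg toward mean, keeping the fraction `keep` of the gap."""
    return mean + keep * (avg - mean)


def convert(avg, factor):
    """Scale an average by a league conversion factor."""
    return avg * factor


# Example values from the post (illustrative, not measured)
nel_avg, league_mean, factor, keep = 0.400, 0.250, 0.85, 0.5

# Order 1: regress in the original league, then convert
regress_first = convert(regress(nel_avg, league_mean, keep), factor)
# Order 2: convert, then regress toward the SAME .250 target
convert_first = regress(convert(nel_avg, factor), league_mean, keep)

print(f"regress, then convert: {regress_first:.5f}")
print(f"convert, then regress: {convert_first:.5f}")
```

The two orders land near .276 and .295, so as long as both use the same .250 target, the order changes the answer.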
9. karlmagnus
Posted: January 20, 2005 at 01:03 AM (#1089552)
That's "I promise to TRY to understand" - I may or may not succeed!
10. jimd
Posted: January 20, 2005 at 01:14 AM (#1089575)
I think I better stay out of this. I took some stat courses 30 years ago, but never had any professional use for them, so that knowledge is RUSTY. Disregard my post above. Thanks ;)
11. Paul Wendt
Posted: January 20, 2005 at 01:28 AM (#1089588)
[one section of Beckwith #135, nearly copied here]
Gadfly #123 I will try to find reference to the study I saw that did 1940s Major-Minor Translations for you and post some Negro League Triple-A comparisons when I have time. It was my understanding that the .92 conversion was overall. In other words, BA would be reduced by say .95 and slugging by .89. But I could be wrong.
Clay Davenport's minor league translation factors have some currency; indeed, I know of them only indirectly, by reference in remarks on major leagues. He uses the "overall" measure EqA. If I understand correctly, his translation factors should have Gadfly's property: magnitude between batting and slugging factors.

Also in Beckwith, Gary A #108 and jimd (page one) on ballparks used by NeL and MLB in the same year. For any year, a good share of NeL games played in MLB parks should yield a good estimate of NeL-average park factor (where MLB-average = 100).
12. Chris Cobb
Posted: January 20, 2005 at 01:47 AM (#1089613)
Karlmagnus,
What I have been doing is converting and then regressing.
It appears the operations are not commutative.
It is not intuitively obvious to me that the regression should precede the conversion. My intuition, and it is purely intuition, is that the other order is correct.
It is also not intuitively obvious to me that regression to the league average is appropriate (a concern seconded above by KJOK).
I may be signing a statistics book or two out of the library or contacting my math department's extension office . . .
13. TomH
Posted: January 20, 2005 at 02:37 AM (#1089680)
As a guy who has taken plenty of stats courses and uses some of them day-to-day, I'm afraid the answer here might be more art than science. But a few guidelines I can think of:
Definitely Convert before attempting to Regress.
Key Q: what is my goal? If I'm a career voter, I don't much care about regressing anyway. If I want to measure 'peak' or 'prime', I may as well regress to the length by which I measure these. But if I were to measure peak for Negro leaguers, I might rely more heavily on contemporary opinion than stats (I will bow more to stats, if we have them, for career value).
Regressing to the league mean won't do anything besides drag everyone's stats to the center.
Regressing to a rolling average seems to make sense to get a truer shape of a career, as long as you keep in mind the 'typical' career shape. As in, I could see using an uncorrected average for the years age 25 to 30, but not for 32 to 36 when we expect decline anyway.
Tom's personal most important law of stats: everything in life varies with the square root of N (the sample size). [You need 4 times as much data to cut the uncertainty in half.]
14. karlmagnus
Posted: January 20, 2005 at 03:37 AM (#1089774)
If you have a .500 average in the NL, which appears to happen often, you would then convert to a .425 in the ML, which happens a lot less often, so you'd then be regressing a statistic that wasn't "real." Regress then convert would seem to be more mathematically correct. Whether you can find a stats textbook that would tell us is extremely doubtful though, I would have thought. You might try it the other way for say Beckwith (current case in point) and see how much difference it makes (sorry to add yet more to your labors, but that's the only way to test it, I think.)
15. KJOK
Posted: January 20, 2005 at 03:59 AM (#1089813)
If you have a .500 average in the NL, which appears to happen often, you would then convert to a .425 in the ML, which happens a lot less often..
I THINK convert, then regress is correct. Even in the Major Leagues, you can have Mike Matheny hit .395 for the month of April, so a very good Negro League player hitting .450 for 60 games in the Negro Leagues would not be unexpected. You're also going to have more Negro League players who hit .125 for 60 games in the Negro Leagues than Major League players who hit even .150 for a full season, etc.
The conversion just gives you what that .450 would have been vs. Major League competition for 60 games. From there, it's no different from regressing any Major League player's 60 games into 154 or 162.
I don't do ANY regression on my MLEs, but I also tend to ignore season to season performance for Negro Leagues players and just look mainly at their career MLEs.
16. Brent
Posted: January 20, 2005 at 04:01 AM (#1089815)
There are several methods for regressing toward the mean, but the specifics of what you should be regressing toward depend on the specific dataset and problem at hand.
For example, in a paper in the 1975 Journal of the American Statistical Association, Efron and Morris looked at the following scenario: suppose you know the batting averages of 18 players over their first 45 at bats, and don't know anything else about the players. How would you estimate their averages over their remaining at-bats of the season? In that case, the answer is to regress all the players toward the overall mean; if a player starts the season hitting .400, chances are he is an above average hitter, but it is also highly unlikely (unless you have additional information about him) that he will continue to hit .400 the rest of the season. The amount by which you shrink the players toward the overall mean depends on the standard deviation of their batting averages. A summary of the formula and a baseball example can be found in section 5 of this paper by Efron, though I'll warn you that there is a lot of math. A more reader-friendly version appeared in 1977 in Scientific American.
Your situation is different, but I think the same formula could apply. Instead of knowing averages for p players over a few games, you know the averages for one player over the p years of his career. You know what his average was over perhaps 50 games, but you assume he would have played perhaps 140 games under major league equivalent conditions, so you are trying to predict what he would have hit over the remaining 90 games. I think you would use the same formula, regressing toward his career average based on the standard deviation of his batting average over his career.
Do you regress first, then convert, or convert first, then regress? I don't know. Statistical theory is good at coming up with formulas, but telling you how to apply them relies more on the experience and judgment of the people doing the calculations.
BTW, Carl Morris, in addition to being a prominent statistics professor at Harvard, also dabbles in sabermetrics. He has a runs generator formula that may be the most sophisticated one around - he calls it simple, but it's too complicated for me to use. For some reason, I can't get the link to work in this post, so if you're interested I suggest that you google: "simple runs per game" carl morris.
17. Rally
Posted: January 20, 2005 at 04:08 AM (#1089823)
If the NL/ML conversion factor is .85, and both leagues have averages of .250, then a .400 NL average on 125AB should be regressed to .325 to get an equivalent for a ML season of 500AB. Convert this, and you get .276 as the MLE.
If 1 league is only .85 as good as the other, then you need to reduce the league average for the lesser league before regression: use .85*.250=.212 for the league average if you convert first, then regress.
If you do it that way, you'll get the same result if you regress first or convert first.
Example A: .400*.85 = .340 regress 50% to .212 = .276
Example B: .400 regress 50% to .250 = .325 * .85 = .276
This is all for sake of example, you'd actually regress 125 AB a lot more than 50%. You can find more on this kind of stuff at tangotiger.net.
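The two orders agree once the target is converted as well because both steps are linear. A small sketch of that point, assuming the same illustrative numbers (.85 factor, .250 league mean); `shrink` is a hypothetical helper, and the shrink fractions in the loop are arbitrary test values, not recommended amounts.

```python
import math


def shrink(avg, mean, keep):
    """Regress avg toward mean, keeping the fraction `keep` of the gap."""
    return mean + keep * (avg - mean)


factor, league_mean = 0.85, 0.250

for nel_avg in (0.300, 0.400, 0.500):
    for keep in (0.3, 0.5, 0.7):
        # Example A: convert the average AND the mean, then regress
        a = shrink(nel_avg * factor, league_mean * factor, keep)
        # Example B: regress in the original league, then convert
        b = shrink(nel_avg, league_mean, keep) * factor
        assert math.isclose(a, b)

print("orders agree when the regression target is converted as well")
```

So the disagreement between the two orders comes entirely from which mean is used as the target, not from the order itself.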
18. Chris Cobb
Posted: January 20, 2005 at 04:20 AM (#1089836)
If you have a .500 average in the NL, which appears to happen often, you would then convert to a .425 in the ML, which happens a lot less often, so you'd then be regressing a statistic that wasn't "real."
This is not an accurate representation of NeL statistics. Few conversions that I have done, even without regression, show the NeL players as leading the major leagues in batting average, which is the result we would expect according to the above.
Here's a lightning-fast survey of NeL leaders, converted, compared to ML leaders. This will use only .87 as the conversion factor for batting average, not getting into league-offense levels, park adjustments, or regression to the mean, but it should suffice as a demonstration.
Seasons by Negro-League players that could, by this conversion, have won an ML batting title are marked in bold.
*This legendary batting performance by the 44-year-old Pop Lloyd is the only .500 season in these records, and data posted to our site by KJOK replaced this incredible number with one that was much more credible. I use the number from Holway simply to follow my source.
My simple MLE conversion turns up 5 batting titles out of 20 for Negro-League players. One of those is based on a batting average that has been proven to be apocryphal, and one is a tie. Only one other places above the range of major-league averages, the .433 average posted by Mule Suttles in 1926. Even slight regression would easily pull it into the range of ML values, and this average was achieved in a hitter's park probably more extreme than any in the majors at that time.
In sum, I hope this quick example shows that there is no evidence to support the contention that heavy regression is needed to convert NeL batting statistics to major league statistics that happen "often."
19. Brent
Posted: January 20, 2005 at 04:34 AM (#1089847)
On another topic, Gadfly in post #119 of the Beckwith thread pointed to a conversion factor of .92 for Triple-A leagues - apparently that factor is supposed to work both for Triple-A leagues of modern times and of the 1940s.
On the Arlett thread I've been using a Bill James method, which rather than converting batting averages or slugging averages directly, converts each element of the batting line. I wondered how his approach might compare in terms of the batting average conversion. It reduces a Triple-A player's hits to 89 percent, but it also reduces his at bats by 3 or 4 percent. Sure enough, the conversion factor for batting average turned out to be .92! (Actually, it varies depending on the player's minor league average, but for averages between .268 and .360 it rounds to .92.)
For slugging average, James applies a square root to home runs and triples, but does not use a square root for hits or doubles. The ratio depends on the characteristics of the hitter, but for a .300 hitter with 20 HR per year, the conversion factor is .90. If you increase his power to 40 HR per year, it drops to .89; if you keep the 40 HR and reduce the average to .240 (we're talking Dave Kingman) it drops to .88; if you keep the average at .300 with 5 HR per year, it rises to .91. So it's a relatively limited range, and I'd guess that for most of the players we're interested in .89 or .90 would be applicable.
My next effort on major league equivalencies is to look at the quality differential for a group of 1920s PCL players who also played in the majors. There's a Web site that has all the minor league hitting statistics for the Portland Beavers. Roughly half the regulars also played regularly in the majors, so I should be able to come up with a good sized sample. I'm still looking for more data on the PCL run environment during the 1920s though.
20. karlmagnus
Posted: January 20, 2005 at 01:34 PM (#1090259)
Looked at the other way, the converted NL leads the majors in batting averages in 5 out of 10 years; if the conversion were accurate you'd expect 1 or at most 2. I stand by and am strengthened in my conviction that the conversions are too generous.
Rallymonkey has pointed out the flaw in convert-then-regress; you should regress to the MLE of the NL average if you do it that way round, i.e. in my example regress to .212 not .250. It appears we have a systematic error here of quite some magnitude.
21. Chris Cobb
Posted: January 20, 2005 at 02:22 PM (#1090318)
Karlmagnus,
Again, your claim is simply inaccurate in its interpretation of the data at very simple levels.
1) Of these supposed five times you claim the converted NL leads the majors in batting average, one (1928) has been thrown out as bad data - I included it because it was the _only_ .500+ batting average over a ten-year period, which refuted your claim that .500 batting averages happened "rather often" in the NeL. A second year was counted as a batting title because it tied the _lower_ of the two major league batting titlists, so it can't be said to lead the major leagues. So that's 3 in 10 major-league-leading seasons.
2) That 3 in 10 leading seasons is BEFORE ANY regression, and I am not arguing and have not argued that regression is not needed. I am trying to figure out HOW MUCH. You take incompletely processed data that I present only to demonstrate the wild factual inaccuracy of your claims that .500 averages were not unusual in the Negro Leagues and that major regression is necessary to make them fit with "real" major-league averages. I present a quick data set to show that neither of your points has any factual basis, and you use those incompletely processed MLEs as evidence that the conversions I am doing contain "a systemic error of quite some magnitude."
The treatment of the .498 season that you advocate would regress that .498 season to .338. That is hardly appropriate. 1) If that regression were applied to every NeL season, the number of ML batting leaders would be 0 for the 28-year history of the leagues, not the 1 or 2 in ten years you say we would expect. 2) If .338 is the highest MLE a Negro-Leaguer averaged during the 1920s, then NONE of them are HoMers, which contradicts a) what we would expect from the most conservative demographic estimates, b) the demonstrated performance of Negro-League stars vs. major-league competition, and c) their reputations.
Mule Suttles (our .498 hitter) averaged .341 for his career in 3230 at bats. We can agree, perhaps, that this is a large enough sample size that it doesn't need to be regressed to the mean?
Working just with the .87 conversion for batting average (skipping league offensive levels, park factors, and arguments about whether .87 is correct), Suttles comes out as a .297 major-league hitter, career. That makes .338 41 points above his lifetime average. The average amount by which major-league batting leaders exceed their lifetime averages during the 1920s is 48 points, the highest being 80 points (George Sisler's .420 in 1922). It is evident that a system that regresses the _most extreme_ Negro-League outlier (157 points above the player's lifetime average - some drawing of this average towards the career mean is clearly necessary) to a lesser variance than the average variance for a major-league batting leader regresses too much to give proper credit to peak value.
I hope later today to take time to think through other posts more fully. Thanks to everyone who has responded on how regression to the mean should be used.
22. Gary A
Posted: January 20, 2005 at 02:38 PM (#1090351)
It's my research that established Lloyd's .564 batting average in 1928 to be spurious; I can provide details if anybody wants them. In any case, the highest batting average in the east that year was Jud Wilson's .423, which would translate to .368.
I might also add that the highest batting average I found in the NNL for that year was Willie Wells's .365 (which would translate to .318). Pythias Russ (the .405 hitter) hit .346 in the games I have, though I'm missing two dozen games played in Chicago. Of course, these were all played in Schorling's Park, so I don't know whether Russ could have raised his overall average by 60 points by his performance in those games.
Also, the highest batting average I have in the west for 1921 is almost 50 points lower than what Holway records. I'm missing 14 St Louis games, so I suppose it's possible that Blackwell or Charleston could have raised their averages by that much.
My simple MLE conversion turns up 5 batting titles out of 20 for Negro-League players. One of those is based on a batting average that has been proven to be apocryphal, and one is a tie. Only one other places above the range of major-league averages, the .433 average posted by Mule Suttles in 1926. Even slight regression would easily pull it into the range of ML values, and this average was achieved in a hitter's park probably more extreme than any in the majors at that time.
Four out of 20 doesn't seem excessive to me in the least, Chris.
24. karlmagnus
Posted: January 20, 2005 at 04:30 PM (#1090580)
It's not 4 out of 20 it's 4 out of 10, the number of years (1921/26/28/29) in which the NL leader is ahead of either ML leader.
If you regress Mule Suttles's .498 50% (which may not be the right percentage - how many ABs was that .498 on? - you should normalize to about 500 ABs, I would think, so 50% would be for 125 ABs, 60% for 180 etc.) towards his career average of .341 you get .419, which converted at 87% gives .364, a batting title in the NL but not the AL in that year. Converting first you get .433, but as rallymonkey pointed out you then have to regress that 50% not to .341, but to .341 x .87, or .297, which again gives you .364. Regressing it to .341 is comparing apples and oranges; it would give you .387, but that's a meaningless number.
25. karlmagnus
Posted: January 20, 2005 at 04:34 PM (#1090590)
If Suttles' .498 was on 245 ABs, then you regress only 70% towards .341, which takes .498 to .451, converted at .87 is .392, an outstanding number and a batting champion in either league. That seems to me a fairly solid data point, certainly more realistic than .433.
26. karlmagnus
Posted: January 20, 2005 at 04:36 PM (#1090601)
Convert then regress in the latter case would give you .405, which apart from being wrong theoretically looks too high in a 1926 environment where the next best hitter is at .378.
27. karlmagnus
Posted: January 20, 2005 at 04:38 PM (#1090604)
Come to think of it, this 4 out of 10 ML leads is against Babe Ruth's prime. Come on...
28. Chris Cobb
Posted: January 20, 2005 at 04:40 PM (#1090615)
karlmagnus,
How do you determine how much to regress? You seem to be using a formula that you assume is obvious, but I, alas, am ignorant of it.
An explanation of that would be most helpful.
Data on the number of at bats Suttles had in 1926 is at home, so further work on the specifics of that season will have to wait a bit.
If we are agreed that one regresses to the player's career average, converted to a major-league equivalent, rather than to the NeL average converted to its major-league equivalent, I think we are on the right track. It certainly appears to me that the range you are presenting for Suttles, .364 to .392 depending upon the amount of the regression, is reasonable.
It's not 4 out of 20 it's 4 out of 10, the number of years (1921/26/28/29) in which the NL leader is ahead of either ML leader.
That's what I meant, karlmagnus. I typed 20 by accident.
30. Gary A
Posted: January 20, 2005 at 04:52 PM (#1090650)
But Babe Ruth only won one batting title, and in that year Hornsby hit nearly 50 points higher...
31. Gary A
Posted: January 20, 2005 at 04:58 PM (#1090661)
And it's really three, not four. I know Chris is using Holway in order to have a consistent source, but that .564 really shouldn't count. Sorry to harp on that. I'll shut up now. :)
32. PhillyBooster
Posted: January 20, 2005 at 07:10 PM (#1090939)
I'd like to state for the record that I haven't understood a single damned comment in this thread.
33. karlmagnus
Posted: January 20, 2005 at 07:17 PM (#1090958)
If you assume that a normal ML season is 500AB, then the SD of a season of 125 AB will be twice that of a normal ML season. Therefore if a player is 100 points above the long run average (of what? I think of his own career, but possibly of his own league, but anyway not of some different league) for 125AB, that is equivalent to being 50 points above the average for 500AB. Similarly 100 points above the average for 180AB is equivalent to 60 points above average for 500AB  it's a square law; 0.5 squared is 0.25, 0.6 squared is .36, 0.7 squared is .49 (245AB) etc.
I'm pretty sure that's right, and am a math major (Cambridge), but it's now 34 years since I graduated and I threw away all my stats books as that was a part of the subject I hated, so could be all wet. But that is how I would regress NL stats gained in short seasons so they were equivalent to ML stats gained in 500AB seasons.
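Here is a rough sketch of that square-root rule applied to the Suttles figures quoted earlier in the thread (.498 season, .341 career, .87 conversion). The function names are hypothetical, the 500 AB "normal season" is the assumption stated in this post, and capping the kept fraction at 1 for seasons longer than 500 AB is an extra assumption of the sketch, not something claimed above.

```python
import math

FULL_SEASON_AB = 500  # "normal ML season" assumed in the post


def keep_fraction(ab, full=FULL_SEASON_AB):
    """Fraction of the gap to the mean that is kept: sqrt(AB / 500).

    Capped at 1 (assumption of this sketch) so long seasons are not inflated.
    """
    return math.sqrt(min(ab, full) / full)


def regress_to_career(season_avg, career_avg, ab):
    """Shrink a short-season average toward the player's career average."""
    return career_avg + keep_fraction(ab) * (season_avg - career_avg)


for ab in (125, 180, 245):
    regressed = regress_to_career(0.498, 0.341, ab)
    mle = regressed * 0.87  # conversion factor used elsewhere in the thread
    print(f"{ab} AB: keep {keep_fraction(ab):.2f}, "
          f"regressed {regressed:.4f}, converted {mle:.4f}")
```

With these inputs the kept fractions come out at 0.50, 0.60 and 0.70, and the converted values at 125 and 245 AB are about .365 and .392, in line with the .364 and .392 quoted earlier.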
34. jimd
Posted: January 20, 2005 at 08:01 PM (#1091071)
karlmagnus, let me run a simple example past you to see if I'm understanding you.
The 1880 Chicago Cubs went 67-17 (.798) in an 84 game schedule. Using a binomial distribution, this is 5.46 standard deviations away from the expected 42-42 (.500) that is the league mean.
How would this team do under a 162 game schedule? Simple extrapolation says 130-32 (.802), but this is 7.70 SD away from 81-81, which is too high. Regression to the mean says that it would go about 116-46 (.716), which is 5.50 SD.
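The numbers in this example can be reproduced with a short sketch, assuming a binomial model with a .500 mean as in the post. The function names are invented for illustration.

```python
import math


def z_score(wins, games, p=0.5):
    """Standard deviations above the expected win total under a binomial model."""
    sd = math.sqrt(games * p * (1 - p))
    return (wins - games * p) / sd


def wins_at_same_z(z, games, p=0.5):
    """Win total that sits z standard deviations above the mean for this schedule."""
    return games * p + z * math.sqrt(games * p * (1 - p))


z84 = z_score(67, 84)               # the actual 67-17 season
z_extrap = z_score(130, 162)        # straight-line extrapolation to 162 games
wins162 = wins_at_same_z(z84, 162)  # hold the z-score fixed instead

print(f"84-game z-score: {z84:.2f}")
print(f"extrapolated 130-32 z-score: {z_extrap:.2f}")
print(f"wins in 162 games at the same z-score: {wins162:.1f}")
```

Unrounded, the same-z record is about 115.7 wins, which rounds to the 116-46 given above.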
35. TomH
Posted: January 20, 2005 at 08:01 PM (#1091073)
Going off of Karlmagnus' post:
the formula (technically, the normal approximation to the formula) for the standard deviation of a proportion (which is what a batting average is) is
SD = square root of [AVG*(1-AVG)/AB]
So if a player hits .400 in 100 AB, we are really sure (95%, which is two standard deviations) that his 'true' average is somewhere within + or - [.4*.6/100]^.5 * 2 = .098 of .400
That looks huge, but even in 550 ABs, it's + or - .042, and it's true that MLB hitters do occasionally hit 42 pts above or below their lifetime avgs in a long season, and they do sometimes hit 100 pts lower or higher in a month.
But I am NOT suggesting we 'regress' NeL stars an extra 50 pts or so for a 100 AB season. The above is based on a pure stats test that assumes we KNOW NOTHING ELSE about the player. For a Beckwith, if his typical NeL avg is .350 and he hits .400 one year, I'd weight his average something like
.400 for 100 AB
.350 for maybe 500 AB, where I estimate that my knowledge of his 'lifetime curve' is possibly 5 times the 'weight' or certainty of his one season
and so his estimated NeL avg for that year would be (.400 * 100 + .350 * 500) / 600 = .358.
voila! (or not...)
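Both calculations in this post are easy to check in a few lines. This is only a sketch: the function names are made up, and the 500 AB weight on the career figure is the rough guess stated above, not a derived quantity.

```python
import math


def avg_sd(avg, ab):
    """Normal-approximation SD of a batting average over `ab` at-bats."""
    return math.sqrt(avg * (1 - avg) / ab)


def weighted_estimate(season_avg, season_ab, prior_avg, prior_ab):
    """Blend one season with prior knowledge, each weighted by at-bats."""
    total = season_avg * season_ab + prior_avg * prior_ab
    return total / (season_ab + prior_ab)


# Two-standard-deviation band around a .400 average
print(round(2 * avg_sd(0.400, 100), 3))  # 0.098
print(round(2 * avg_sd(0.400, 550), 3))  # 0.042

# .400 over 100 AB blended with .350 given a weight of 500 AB
print(round(weighted_estimate(0.400, 100, 0.350, 500), 3))  # 0.358
```

Raising or lowering the 500 AB weight is how you would express more or less confidence in the career figure relative to the single season.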
36. karlmagnus
Posted: January 20, 2005 at 08:57 PM (#1091171)
(34) Jimd, they were 298 points above .500 in 84 games, they would therefore be 298*sqrt(84/162) points above .500 in 162 games, or 215 points; thus in a 162 game schedule they would have gone .715. This calculation simply adjusts for schedule length/at bats, and not for other factors causing variation to differ (which shouldn't exist within an individual's career, but would if you were regressing to the league average.)
37. karlmagnus
Posted: January 20, 2005 at 09:01 PM (#1091182)
(35) If Beckwith's lifetime average is .350 and he goes .400 for 100AB, and you assume 500AB in a typical ML season, then his regressed average is .350 +.050*sqrt(100/500) = .374. You then have to normalise that to an MLE by multiplying by .87 or whatever.
38. karlmagnus
Posted: January 20, 2005 at 09:02 PM (#1091185)
sorry .372
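The last two calculations use the same rule, so one small function covers both. A sketch with an invented helper name; the .500 and .350 means and the 162 game and 500 AB targets are the values given in the posts.

```python
import math


def rescale_gap(observed, mean, n_observed, n_target):
    """Scale the gap from the mean by sqrt(n_observed / n_target)."""
    return mean + (observed - mean) * math.sqrt(n_observed / n_target)


# 67-17 in 84 games, restated for a 162 game schedule
cubs = rescale_gap(67 / 84, 0.500, 84, 162)
# .400 over 100 AB against a .350 lifetime average, restated for 500 AB
beckwith = rescale_gap(0.400, 0.350, 100, 500)

print(round(cubs, 3))      # 0.714
print(round(beckwith, 3))  # 0.372
```

The team figure comes out a point below the .715 above only because that calculation started from the rounded .798. Since the SD of a proportion shrinks as 1/sqrt(n), this rescaling should be algebraically the same as holding the number of standard deviations fixed.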
39. Paul Wendt
Posted: January 20, 2005 at 09:05 PM (#1091189)
Chris Cobb #12 It is also not intuitively obvious to me that regression to the league average is appropriate (a concern seconded above by KJOK).
inappropriate inapprop inapp inapp inapp
The player's measured performance should regress toward the expected performance given his contemporary skill.
TomH #13
rightly says why you should convert first and implies why you should publish the intermediate results. Several people are interested in convert-only; few are interested in regress-only.
Brent #16
may be right about how to approach a different problem, where you have some random sample of part-seasons from a player's career. Suppose you have data for 8 randomly selected 1/3-seasons for someone who played 16 years.
But you don't have that; the sampled 1/3-seasons are dated and you know the player's birth date and something about how careers generally develop. So Brent (regression to career average) is wrong here and TomH #13 is broadly right: Regressing to a rolling average seems to make sense to get a truer shape of a career, as long as you keep in mind the 'typical' career shape.
Of course, that quotation isn't a complete plug'n'play solution :)

TomH #13: Tom's personal most important law of stats: everything in life varies with the square root of N (the sample size).
The application is tricky here, for those (most HOMers) who are interested in the player's full-season achievements rather than the player's skill at that time in his career. Consider a 20-game sample from a 50-game season, doubled to provide a 40-game sample from that season. Clear? If not, consider 20- and 40-game samples from a 40-game season.
So, data from just a few more games is more valuable here than in the world ruled by "Tom's" law. In other words: that law is discouraging but don't let it down your search for more boxscores!
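One way to put numbers on this is the standard finite population correction for sampling without replacement. That is my reading of the argument, not a formula given in the post, and it assumes the known games behave like a random sample of the season. The function name is made up.

```python
import math


def relative_error(sampled, season_games):
    """Standard error of the full-season average, in units of the per-game SD.

    Uses the finite population correction sqrt((N - n) / (N - 1)),
    so the error falls to zero once every game of the season is known.
    """
    fpc = math.sqrt((season_games - sampled) / (season_games - 1))
    return fpc / math.sqrt(sampled)


for sampled, season in ((20, 50), (40, 50), (20, 40), (40, 40)):
    plain = 1 / math.sqrt(sampled)  # what the bare square-root law would say
    print(f"{sampled} of {season} games: "
          f"corrected {relative_error(sampled, season):.3f}, "
          f"square-root law {plain:.3f}")
```

Under these assumptions, doubling the sample from 20 to 40 games of a 50-game season cuts the error by well over half, more than the bare square-root law would suggest, and 40 of 40 games leaves no sampling error at all.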
40. karlmagnus
Posted: January 20, 2005 at 09:12 PM (#1091203)
Isn't WS/WARP nonlinear on batting average?
If so convert then regress is wrong if you want career WS.
Even if not, it is wrong if you want peak WS.
41. Paul Wendt
Posted: January 20, 2005 at 09:22 PM (#1091219)
http://www.baseballthinkfactory.org/files/primer/hom_discussion/24597/
Gary A #96, jimd #99
http://www.baseballthinkfactory.org/files/primer/hom_discussion/24597/P100
Gary A #[10]8
Redland Field, Cincinnati, 1921 park factors
111, 104 adjusted in Negro Leagues (Gary A #96)
99 in National League (Gary A #96)
95 in National League (jimd #99)

As I said in #11: For any year, a good share of NeL games played in MLB parks should yield a good estimate of NeL-average park factor (where MLB-average = 100).
But we don't have a good share of NeL games played in MLB parks, and for some NeL games we have no data.
42. jimd
Posted: January 20, 2005 at 09:32 PM (#1091255)
Regression to the mean says that it would go about 116-46 (.716),
thus in a 162 game schedule they would have gone .715.
Close enough. I think I've got the main point. Which is that the relationship to the mean, measured in standard deviations, remains constant when converting the results from a small real sample to a larger extrapolated one. (Which is what "regression to the mean" means if I think about it. D'Oh. ;)
43. Chris Cobb
Posted: January 20, 2005 at 10:42 PM (#1091413)
Thanks, everyone, for contributing to a discussion that is most educational for me, at least. I think I've followed the math well enough to be able to implement properly calculated regressions (though there will remain disagreement about what mean should be regressed to, I expect).
To be able to calculate appropriate regressions, I'll need statistics that include at bats or games more often than Holway does. That means getting hold of a MacMillan 8-10 edition, I think.
This comment is by no means meant to end discussion of regression, btw, just to express my appreciation and to note its implications for data gathering.
44. Chris Cobb
Posted: January 21, 2005 at 02:47 AM (#1091835)
Just posted this on the Beckwith thread. It sums up in some ways my sense of the state of the NeL MLEs, so I thought I'd repost it here in an effort to move more of that discussion from the Beckwith thread to this one:
I'll pick up on the theoretical question that I think gadfly meant for me and not for Gary A:
So, if the distribution of superstars is roughly equal between white and black, why do none of your translations for the Negro League Superstars from the 1920s and 1930s end up with a lifetime BA of .330 to .350 like their white counterparts?
I don't have a firm response to this, but here are my thoughts about the situation. I have never attempted to justify my translations in demographic terms. I don't think it's possible to derive from demographic/economic arguments the percentage of stars by race with any certainty. For the purposes of Hall of Merit elections, I believe that a quota approach to electing black players would create an inappropriate double standard. One of my goals in trying to develop accurate and reliable MLEs has been to make quota arguments unnecessary. So I have not attempted to measure the results of my translations against any demographic standard.
That said, I'm not at all sure my MLEs are correct. My gut, which is sensitive to players' reputations of greatness and to expert opinions, says that they are a bit low. However, my standard for calculating them is to base no step in the process on my sense of what the numbers _ought_ to show, but to construct a system that creates statistics that are (1) derived from the best available data and (2) based on conversion methods that have been discussed by the interested members of the electorate and that have been generally (if not universally) accepted as sound.
If the results seem lower than they ought to be, we have the commentary of experts to challenge the results and to help us find ways to do better. I think that you are probably right that my MLEs are a little low, and I hope the electorate here will consider the serious likelihood of that based on your comments and other evidence of expert opinion.
But I can't change the system based on opinion, or it ceases to be a system that aims at an objective numerical statement of value. If we find evidence that I have erred in a calculation or gain access to evidence that leads to different conclusions about the conversions, I can make changes to improve the system accordingly. I hope to do so. I believe improvements in my handling of regression based on recent conversations will improve the system and give a fairer representation of peak value.
I can see a number of points where the evidentiary basis for the conversion factor could be improved:
1) NeL park factors from 1938-1948. A lot of the data for the conversion comes from Doby and Irvin in Newark. We know that was a hitter's park, but how extreme was it, exactly, and what percentage of their games did they play there? If I have used too low a park factor, that would depress the conversion factor incorrectly.
2) Data on the overall level of offense in the NeL from 1938-1948 in comparison to the major leagues, especially in the Negro National League. Evidence from the 1920s provided by Gary A. indicates that, although NeL levels of offense tracked with ML levels, these diverged at times by up to 10% (I think that the difference was even greater in the late teens), with the ML levels being higher. I have taken this into account for 1920s conversions, which raises the MLEs of Negro-League players. Eyeballing the numbers for the late 1930s and early 1940s, it looks like offensive levels were high in the Negro Leagues at that time, quite possibly higher than in the majors. If this is the case, that again, if not properly accounted for, would depress the conversion factor incorrectly.
3) Use of AAA conversion studies could help to better model the process of arriving at a conversion, provide a point of comparison for the Negro Leagues' competition level, and add statistics from NeL stars playing in the high minors to the pool of data available for the calculation of a conversion factor. Studies on the level of competition in the Mexican League would be similarly useful (and will be important for the assessment of players like Cool Papa Bell, Ray Dandridge, Martin Dihigo, and Wild Bill Wright in any case).
4) Striking the right balance between conversion rates for batting and conversion rates for slugging. The discussion of the square-root relation between the two numbers has been helpful and should help to improve the accuracy of individual conversions and provide a standard by which the conversion factors can be judged. Obviously, the .87/.82 split I am using now isn't right. It appears to be a compromise between two different conversion levels. Figuring out why my calculation of conversion factors from the data produced this discrepancy could lead to a more reliably derived factor.
I am hopeful that I/we can do better on all of these fronts.
45. Paul Wendt
Posted: January 21, 2005 at 04:56 PM (#1092948)
how many black stars?
 production and recruitment of quality ballplayers before WWI;
 (im)maturity of baseball in the South;
 black residence in (rural) South, migration to North
#[2]71 jimd's racial and regional demographic "Adventure"
compare how many HOMers each year?
 production and recruitment of quality ballplayers;
 regional (im)maturity of baseball
46. Chris Cobb
Posted: January 21, 2005 at 05:52 PM (#1093049)
Back to regression to the mean:
Having read through the postings above, I think I understand the rationale and the formula for calculating regression to the mean.
To summarize: regression to the mean corrects for the greater variance created by small sample size by keeping the number of standard deviations from the mean constant when moving from the small actual sample to the larger, hypothetical sample of data.
The ratio of the standard deviations of the two sample sizes is the square root of the ratio of the two sample sizes. The variance from the mean in the small sample is multiplied by the ratio of the standard deviations to find the variance that is the same number of standard deviations from the mean in the larger sample.
Mule Suttles hit .498 in 1926 in 212 at bats. His career average was .340. Suppose we accept .340 as the mean and 500 as the normal number of actual at bats in a season.
Since some voters will want to see unregressed conversions and some will want to see regressed conversions, I'll convert and then regress, using the .87 factor for now: .340 -> .296, .498 -> .433.
His .433 MLE average is regressed as follows:
Variance from the mean: .433 - .296 = .137
Ratio of standard deviations: square root of 212/500 = .651
Multiply the variance by the ratio, and add it to the mean: .137 x .651 = .089; .296 + .089 = .385
He still has the highest average in the majors that year, but by a small rather than a huge margin.
Have I calculated this correctly, given our assumptions about what the mean is and what the conversion factor is?
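For anyone who wants to check the arithmetic, here is a minimal sketch in Python of the convert-then-regress steps described in this post. The function name and the rounding to three places are my own choices for illustration, and the .87 factor, the .340 mean, and the 500 at-bat target are simply the working assumptions stated above, not settled values.

```python
import math

def regress_to_mean(observed, mean, actual_ab, target_ab):
    """Shrink the gap between observed and mean by sqrt(actual_ab / target_ab)."""
    ratio = math.sqrt(actual_ab / target_ab)
    return mean + (observed - mean) * ratio

CONVERSION = 0.87                        # working batting-average factor from the thread

mle_mean = round(0.340 * CONVERSION, 3)  # career average converted to an MLE
mle_1926 = round(0.498 * CONVERSION, 3)  # 1926 average converted to an MLE

regressed = regress_to_mean(mle_1926, mle_mean, actual_ab=212, target_ab=500)
print(mle_mean, mle_1926, round(regressed, 3))
```

Swapping in a different mean (a rolling average, say) or a different target number of at bats only changes the arguments, so the same sketch can be reused to compare those choices.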
Now, this conversion leaves out two important factors in the overall conversion calculation:
1) difference in league offensive levels
2) park effects
The proper place for these in formula needs to be ascertained, and I think that placement will make a difference in the results. I think that park factors should be applied first, before conversion and before regression, and that difference in league offensive levels should be applied at the same time as the conversion factor.
Does that seem right?
47. Chris Cobb
Posted: January 21, 2005 at 05:59 PM (#1093062)
I should add also that I am inclined to use, as per Tom's and Paul Wendt's advice, some sort of a rolling average as the mean, not career average. The number of seasons to be used might depend on the number of documented at bats per season. If 200 at bats are typically documented, I'd use 3-year averages. If 100 at bats are typically documented, I'd be inclined to use 5-year averages.
Comments on that?
As to the number of at-bats to which to regress, I'd welcome suggestions. One possibility would be to use the number from the rolling seasonal averages, as long as those were totals that might occur in a full major-league season? Or would it be better to set a derived norm based on typical at bats per game and typical games per season, and stick with that?
Comments?
48. TomH
Posted: January 21, 2005 at 06:57 PM (#1093176)
Nice job putting post 46 into English, Chris! Couldn't have said it better if you gave me a week.
I'd recommend regressing to what we consider to be a 'normal' set of at bats for a season. Maybe 550?
49. karlmagnus
Posted: January 21, 2005 at 07:12 PM (#1093214)
Chris, I now agree 100% with your calculation, and indeed with the result as regards Suttles 1926. Of course there are endless further iterations of things we don't know and have to estimate, but I think it's a big step forward.
50. karlmagnus
Posted: January 21, 2005 at 07:16 PM (#1093225)
I also agree with TomH on "normal" set of at-bats, maybe the number put up by the 4n-th hitter in terms of ABs, where n is the number of teams in the league (I assume the software can do that by pressing a button.)
51. Chris Cobb
Posted: January 21, 2005 at 07:31 PM (#1093255)
Two additions to the list of issues for conversions from post 46 above:
1) changes in competition levels in the Negro Leagues. I don't think we're close to having enough evidence to calculate these statistically, but I think there's enough evidence to indicate that levels changed. This is an area that needs to be handled subjectively for now.
2) normal paths of improvement and decline for major-league players. Since all the conversions are based on play happening in sequence in players' careers, the progression of careers must be influencing the conversion. I've tried to exclude instances where obvious improvement and decline are influencing the data, but there may be subtler influences even in the most level pieces of evidence available. Comparative data on this subject could also help to improve the conversions by enabling us to correct for its influence on the data I have used and to enlarge the set of data available by making it possible to use players during their periods of significant improvement and decline.
Nice job putting post 46 into English, Chris! Couldn't have said it better if you gave me a week
Thank you! I'm a professional with words, so that part comes easy to me. It's the numbers that I struggle to get straight!
52. karlmagnus
Posted: January 21, 2005 at 07:47 PM (#1093301)
Incidentally, Chris, atypically lousy years should of course also be regressed up to the mean. You probably realise that!
53. Paul Wendt
Posted: January 22, 2005 at 11:56 PM (#1095528)
Chris Cobb #46: Mule Suttles hit .498 in 1926 in 212 at bats. His career average was .340. [For discussion of regression, suppose] we accept .340 as the mean and 500 as the normal number of actual at bats in a season.
The 212ab sample from his 1926 season is a small part of our sample from his career, which is more than 1000ab (equivalently, he usually batted above .300). So I am comfortable with .340 as a talking point.
#46: Now, this conversion leaves out two important factors in the overall conversion calculation:
1) difference in league offensive levels
2) park effects
The proper place for these in formula needs to be ascertained . . .
#51 Two additions to the list of issues for conversions from post 46 above:
[3] changes in competition levels in the Negro Leagues. . . .
[4] normal paths of improvement and decline for major-league players.
Rather,
[4'] Player's own path of improvement and decline, where the normal path for mlb players will be used in default of useful information about Player.
Otherwise, yes.
These issues pertain separately to the conversion of his .498 part-season average and the conversion of his .340 part-career average. For example, you want to season-park-adjust his .498 and career-park-adjust his .340.

Regarding [4] and [4'], iiuc.
Given ample data or a significant amount of data for adjoining seasons, you will use some (maybe weighted) 3-yr or 5-yr average of season records rather than his career average, and there will be no issue of improvement and decline.
54. Chris Cobb
Posted: January 23, 2005 at 03:48 AM (#1095775)
Paul,
I meant [3] and [4] as issues to address in order to improve the accuracy of the general conversion factor, though I take your point that career path improvement and decline (not to mention radical changes in offense levels such as the one that took place between the late teens and the early twenties) does make it problematic simply to use career average as a baseline.
Given ample data or a significant amount of data for adjoining seasons, you will use some (maybe weighted) 3-yr or 5-yr average of season records rather than his career average, and there will be no issue of improvement and decline.
Yes, and park adjustments to the baseline will be easier to make. You imply above that 1000 ab is a sample size that gives you confidence in a baseline. At about what point does sample size become too small for confidence as a baseline?
Following Tom H's explanation of standard deviation above, I calculate that in a 1000 at-bat sample, we can be 95% certain that a player's true average is within 30 points (2 SD) of the average generated in that sample. In a 500 at-bat sample, 2 SD is 40 points. In a 2000 at-bat sample, 20 points.
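The 30/40/20-point figures can be reproduced with the usual binomial standard deviation, sketched below in Python. This assumes each at bat is an independent trial and a true average of about .300; both are my assumptions for illustration, since the explanation being followed is not quoted here.

```python
import math

def two_sd_points(true_avg, at_bats):
    """Two binomial standard deviations of a batting average, in points (.001)."""
    sd = math.sqrt(true_avg * (1.0 - true_avg) / at_bats)
    return 2.0 * sd * 1000.0

for ab in (500, 1000, 2000):
    # Prints the at-bat total and the approximate 95% half-width in points.
    print(ab, round(two_sd_points(0.300, ab)))
```

Because the width shrinks with the square root of at bats, quadrupling the sample only halves the uncertainty, which is why 2000 at bats gives half the spread of 500.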
55. Paul Wendt
Posted: January 28, 2005 at 09:24 PM (#1109174)
Chris Cobb #54
OK, to your point.
Regarding the 1000 ab, I chose 1000 because of its size relative to 212. I am not comfortable with regression of part-season to career average where the part-season is a large part of the sample for measurement of career average. Should I be comfortable?

Ron Wargo, 1944 Ballot Discussion #94 Negro League infielders seem to have difficulty in our rankings. Are we being too harsh? While outfielders like Torriente and Hill breeze into the HOM, only Lloyd can claim that distinction for infielders so far. Johnson & Grant took some time, although Johnson made it relatively quickly. [anticipations deleted]
This is a good observation, potentially as important as the epochal bias in HOF inductions from the Negro Leagues, against those who retired before Buck O'Neil played (or followed?) the game. The infield positions {3B, SS, 2B} are underrepresented among the great Negro ballplayers by reputation, inasmuch as I know the reputations. Only John Henry Lloyd is sometimes called "maybe the best" or generally included in the top ten, I think. Lloyd's age, older than 8 or 9 of the ten, underscores his case.
Does it indicate bias? I don't know. If so, it may be a common bias. Only Honus Wagner is routinely considered one of the top ten MLB players, or sometimes called maybe the best, and only George Wright is routinely considered one of the top ten 19c players, or sometimes called maybe the best. Each is the Shortstop from the Dawn of Time, older than any of the other players commonly considered one of the top ten.
56. Gary A
Posted: February 02, 2005 at 07:10 PM (#1120631)
I have a small contribution to make to Negro League MLEs. The first Negro Leaguer (in the narrow sense, meaning someone who played in the organized NeL) to move to the major leagues wasn’t Jackie Robinson; it was Ramon (Mike) Herrera, who was a regular infielder (2b/3b) for the 1920-21 Cuban Stars in the Negro National League, then served as a utility infielder (mostly second baseman) for the Boston Red Sox in 1925-26. By 1928 he was playing second base for the eastern Cuban Stars of the ECL. (I’m assuming he must have been somewhere in organized baseball in 1922-24 and probably 1927; I can’t find any mention in Holway of Herrera for those years. He apparently didn’t play in the IL or AA, but that’s all I can find out for now. He played in the Cuban League for 16 seasons, from 1913 to 1930.)
Anyway, it so happens that I have his stats for 1921 and 1928, along with league and park data. So here’s how Herrera’s Negro League and major league careers compare. I have to say the results were a little surprising:
Ramon Herrera: Raw Averages
Year       Age  Team  G    PA    AVE   OBA   SLG
1921       23   CSW   68   307   .234  .304  .305
1928       30   CSE   29   133   .317  .341  .397
NeL total             97   440   .261  .316  .334
AL totals, at ages 27-28:
1925/26               84   308   .275  .320  .333
I don’t have stats for Herrera’s 1920 NeL season, but Holway has him hitting .259, which seems to indicate it probably wouldn’t change his career NeL stats very much.
League Context (park-adjusted)*
Year       League  AVE      OBA      SLG
1921       NNL     .268668  .329507  .357343
1928       ECL     .281519  .333020  .383140
NeL total          .272767  .330585  .365572 (prorated)
1925/26    AL      .288000  .359000  .404000**
Dividing Herrera’s percentages by these adjusted, prorated league contexts, you get these relative averages:
       AVE      OBA      SLG
NeL    .955976  .956362  .914122
AL     .954861  .891364  .824257
Divide his major league relative averages by the corresponding NeL figures, and you get these conversion factors:
AVE      OBA      SLG
.998834  .932036  .901693
In other words, Herrera’s Negro League and major league batting averages were almost the same, relative to his league and park, but he walked less and hit for less power in the American League.
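The two division steps above can be sketched in Python as follows. The AL relative averages are recomputed from the rounded raw figures in the tables; the NeL relative averages are entered as posted, since they appear to have been calculated from unrounded averages that are not shown here. The dictionary and function names are mine, for illustration only.

```python
# Herrera's AL line and the park-adjusted AL context, as given in the post.
herrera_al = {"AVE": 0.275, "OBA": 0.320, "SLG": 0.333}
league_al = {"AVE": 0.288, "OBA": 0.359, "SLG": 0.404}

# NeL relative averages as posted (not recomputed from rounded inputs).
nel_relative = {"AVE": 0.955976, "OBA": 0.956362, "SLG": 0.914122}

def relative(player, league):
    """Player rate divided by league rate, stat by stat."""
    return {stat: player[stat] / league[stat] for stat in player}

def conversion(ml_rel, nel_rel):
    """Major-league relative average divided by NeL relative average."""
    return {stat: ml_rel[stat] / nel_rel[stat] for stat in ml_rel}

al_relative = relative(herrera_al, league_al)
factors = conversion(al_relative, nel_relative)

for stat in ("AVE", "OBA", "SLG"):
    print(stat, round(al_relative[stat], 6), round(factors[stat], 6))
```

A factor near 1.0 means the player held his standing relative to league when he moved; a factor below 1.0 means he lost ground in the majors.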
There are obvious caveats:
1) This is very limited, comparing 440 NeL plate appearances to 308 AL plate appearances.
2) His NeL numbers are weighted toward the 1921 season, which is four years before his major league appearance. He hit significantly better in ’28, so he may simply have been a better hitter in the mid-to-late 20s, although the ’28 sample is only 29 games.
3) His NeL numbers are also divided between the ’21 NNL and ’28 ECL, which were very different leagues. I don’t really know how they stack up against each other, quality-wise; offhand, I’d say that the ’21 NNL might have been better, simply because it played the season through; the ’28 ECL disintegrated in late May, though most of the teams continued to play each other into October. (NOTE: All of the Cuban Stars’ 1928 games were against teams that were in the ECL in either ’27 or ’28: Brooklyn, Hilldale, Lincoln Giants, Baltimore, Bacharach Giants, along with the Homestead Grays, who were clearly of league quality.)
I would hardly suggest that these conversion rates are accurate for all players at all times. Still, I thought it would be good to get this out there as a data point, especially since almost nobody knows about Herrera.
*I found BA/OBA/SLG park factors for Redland Field in the 1921 NNL, and adjusted Herrera’s league context, prorated to the number of plate appearances (for OBA) and at bats (for AVE and SLG) he had at home and away. He played in 28 games at home, 40 on the road. The raw park factor (runs) for Redland Field that year was 110; but the averages show a much milder effect; and, as in the majors, the park cut home runs significantly. There are some technical steps I skipped (such as accounting for Redland Field not being among the Cubans’ road parks), as the effects are pretty small and I was reaching the point of diminishing returns. In 1928, the Cuban Stars (E) were a road team, so I didn’t bother with park effects (again, there could be an effect depending on which road parks they played in, but I doubt it was very large).
The Redland Field factors for 1921, if anyone's interested:
**These are baseball-reference.com’s park-adjusted league numbers; as I understand it, they’re prorated for Herrera’s plate appearances, so they give the park-adjusted league context.
57. TomH
Posted: February 02, 2005 at 09:35 PM (#1121057)
Warning; psych drivel ahead, no numbers involved:
All of these data are very useful, and muchly appreciated, but let me offer one caveat.
Some people (including Bill James) have pointed out that in some circumstances, the drive to succeed is an important factor. This is anecdotally seen in many feel-good stories in all sorts of sports (and non-sporting events in life), and might be reflected in things like black-v-white exhibition games, and possibly in initial conversions to MLB. If some players see making the majors or beating the white players or whatever as a do-or-die item, it can surely affect their performance. The 'cornered rat' or 'mother bear' theory if you will. With fewer vocational options, kids from difficult backgrounds have exceeded normal expectations of 'making it' in sports. And it's plausible to me that those who crossed over from NeL to MLB may have been driven to succeed in ways some of us can only imagine.
58. Gary A
Posted: February 03, 2005 at 12:25 AM (#1121300)
I was going to include this, but didn't have time earlier. If you want to drop Herrera's 1921 season as being too early (I honestly don't know whether this is a good idea or not), you get these conversion factors between Herrera's AL games and his 1928 ECL season:
Of course, this represents 308 AL plate appearances compared to only 133 ECL plate appearances.
59. Gary A
Posted: February 03, 2005 at 12:36 AM (#1121315)
Tom, I agree wholeheartedly that such factors undoubtedly drove many black players, both in the early days of integration and maybe later. I guess I'd differ from what you say, if I understand correctly, in that I don't really see the "drive to succeed" as a reason to discount the performances of black players in the major leagues. Ordinarily, we praise players (and people in general) for that sort of drive, and regard it as an *explanation* for success. Surely something vaguely similar (though perhaps not as intense) sometimes motivated Ty Cobb, for example (ironically enough), or Irish players in the 19th century, and we don't offer the "cornered rat" syndrome to explain away their performances, just to help explain them. Anyway, maybe I didn't quite get what you're saying.
60. Brent
Posted: February 03, 2005 at 03:56 AM (#1121522)
Tom,
I think your point is valid and important regarding some of the black-white exhibition games. Reading The Pride of Havana, it's clear that while some major league managers (such as McGraw) took their exhibition games in Cuba very seriously, in other cases the major league teams treated it as a holiday, and sometimes key players wouldn't show, players would show up drunk, and so forth.
When it came to making the majors when integration arrived, however, it seems to me that both the white and black players must have had extra motivation. The white players' jobs were on the line, plus they didn't want to be showed up. As you describe, the early black players also must have had an intense desire to succeed. I think psychological factors must have worked both directions.
61. Gary A
Posted: February 03, 2005 at 07:21 PM (#1122866)
A further note on Herrera: it appears that bbref.com takes pitchers out of its calculations of league offensive context for each player. I'm not sure why they do this (I guess to enable comparison between DH and DH-less leagues), but the effect is apparent: Fenway Park 1925-26 was more or less neutral for hitters, but the park-adjusted league averages for Herrera are several points higher than the actual league averages.
I'm not sure how to produce an equivalent adjustment for the Negro Leagues, as there were more multi-position, "double-duty" players, so it would be harder to figure out what to subtract. Negro League pitchers probably hit better relative to league than their white counterparts, but it was still the weakest-hitting position, so taking pitchers' hitting out of the league averages would cause Herrera's league context to go up, and he would look worse as a hitter in the Negro Leagues.
In other words, it appears that the Negro League MLE conversion factors I presented above should be even higher, if I can figure out how to adjust for this. Of course, all the other caveats (sample size, etc.) remain.
62. Chris Cobb
Posted: February 03, 2005 at 10:28 PM (#1123264)
A further note on Herrera: it appears that bbref.com takes pitchers out of its calculations of league offensive context for each player. I'm not sure why they do this (I guess to enable comparison between DH and DH-less leagues), but the effect is apparent: Fenway Park 1925-26 was more or less neutral for hitters, but the park-adjusted league averages for Herrera are several points higher than the actual league averages.
Is it possible that this effect is created not by removing pitchers but by adjustments for pitching quality? That is, does bbref take the Boston pitchers out of the offensive context for the Boston hitters? I'm pretty sure, actually, that bbref does this. Would it account for the effect that Gary has observed, or not?
63. Paul Wendt
Posted: February 04, 2005 at 12:46 AM (#1123531)
Yes, BPF and PPF (the Total Baseball park factors published at BBRef) account for the teammate effects. That cannot account for Gary's phenomenon because his starting point, "Fenway Park 1925-26 was more or less neutral for hitters," is with reference to BPF (Bos AL 1925 BPF = 100).

Quoting the BBRef Glossary: Adjusted OPS+ [OPS+] is calculated differently from the Total Baseball PRO+ statistic. I chose OPS+ to make this difference more clear.
. . .
My method
1. Compute the runs created for the league with pitchers removed . . .
Note, TB7 adopted 'OPS+' in place of 'PRO+' used in TB3-TB6, so 'OPS+' no longer suggests any difference between the statistics.
FWIW, I agree with Sean Forman (BBRef) about the calculation of season OPS+ but disagree about career OPS+. A few years ago, I reported the BBRef career calculation as a mistake, re: George Davis.

FWIW, the BBRef adjustment of ERA seems to be routine.
: AL1925 lgERA 4.40
: Boston lgERA* 4.52
Accounting for the ERA roundoff to #.##, that is consistent with any routine
: adjustment factor in interval [1.02497, 1.02958].
In turn, that fits the reported
: Boston PPF 103.
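The interval quoted above follows from treating each published ERA as rounded to two decimals, so the true value lies within 0.005 of the printed one. A small Python sketch of that bound, with a function name of my own choosing:

```python
def ratio_interval(numerator, denominator, half_step=0.005):
    """Range of numerator/denominator when both are known only to +/- half_step."""
    low = (numerator - half_step) / (denominator + half_step)
    high = (numerator + half_step) / (denominator - half_step)
    return low, high

# Boston-adjusted league ERA 4.52 against the plain AL 1925 league ERA 4.40.
low, high = ratio_interval(4.52, 4.40)
print(round(low, 5), round(high, 5))
```

Any adjustment factor between those two bounds would reproduce the published 4.52 after rounding.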
64. Brent
Posted: February 04, 2005 at 03:46 AM (#1123820)
FWIW, here are Cuban League statistics for Herrera:
Notes:
1913-14 – No American players in league.
1913-16 seasons played at Old Almendares Park.
1917 – Record shown is for an alternative league that was organized (and displaced the regular league that year); games played at Oriental Park. No American players in league.
1918-30 seasons – most games played at New Almendares Park.
1919-20 – Only one American player in league.
1921 – Season lasted only 5 games. Tied for league lead in triples (1).
1924* – Special season. The regular season was terminated early when Santa Clara reached an 11.5 game lead. The weakest team (Marianao) was dropped, its players were redistributed, and a 25-game special season was played.
1926-27 – A rival league was formed (“Triangular”) that raided many of the better players. Herrera led Cuban League in runs scored (24).
65. Brent
Posted: February 04, 2005 at 03:52 AM (#1123824)
Another note: bbref lists his birth date as December 19, 1897, which would have made him 15 years old at the start of his first Cuban League season. I think I detect a baseball age.
66. Gary A
Posted: February 04, 2005 at 05:20 AM (#1123891)
Brent, I've thought the same thing about Herrera. Makes sense he'd "adjust" his age as he tried to land a big league job, especially if he was pushing 30 or past it, as it seems he might have been.
I'd like to figure out what bbref is up to, just to make sure I get Herrera right. Here's what they have:
He gives his first step in figuring Adjusted OPS+ as "Compute the runs created for the league with pitchers removed..."
67. KJOK
Posted: February 04, 2005 at 05:48 AM (#1123927)
Sean removes pitcher batting before calculating adjusted stats - I'm about 99.9% sure of that...
68. Paul Wendt
Posted: February 04, 2005 at 03:21 PM (#1124327)
(100%. See the BBRef Glossary, linked to #63.)
1917 – Record shown is for an alternative league that was organized (and displaced the regular league that year); games played at Oriental Park. No American players in league.
1918-30 seasons – most games played at New Almendares Park.
1919-20 – Only one American player in league.
1921 – Season lasted only 5 games. Tied for league lead in triples (1).
Note that five MLB seasons were played and nearly five calendar years passed between the '1917' and '1921' seasons in Cuba. I have two suggestions, one for humans who want line-sortability by computer.
1917w (for winter)
1918-19
1919-20
1920-21
1921f (for fall)
The other is for inside a database and it amounts to the following in chron order for the five given Cuban seasons and five interpolated MLB seasons.
1917 w
1917
1918
1919 o
1919
1920 o
1920
1921 o
1921
1922 f
This will be useful for any human using a spreadsheet who needs to subtract "years" within the Cuban league, preserving familiar properties.
Eg, '1917 w' to '1922 f' is a six-year span, 6 = 1922-1917+1
69. Chris Cobb
Posted: February 06, 2005 at 02:56 AM (#1127452)
Here's a question I'd like to get a clear answer on before I post my season-by-season Willie Foster MLEs.
When creating MLEs for pitchers, should ERA and ERA+ be converted using the same multiple that one uses for EQA with hitters?
I've been assuming that the same multiple should be used, but I'd like those with better knowledge to confirm that for me.
That appeared to be the conclusion reached in the discussion on the Wes Ferrell thread concerning how to work out the impact of pitchers' OPS+ on ERA+, but I'm not entirely clear on that.
Is that correct, or no?
70. Brent
Posted: February 06, 2005 at 03:12 AM (#1127465)
The documentation on the BP site says that EQA is transformed to the 2.5 power to get EqR, which in turn is used for BRAA, BRAR, etc. On the other hand, the discussion of OPS+ suggested that going from OPS to runs is proportional - no squaring or power transformations needed. So that suggests that OPS+ is on a different scale than EQA, which in turn would imply that a different multiple should be used for ERA and ERA+.
71. Chris Cobb
Posted: February 06, 2005 at 03:18 AM (#1127473)
OK. I pulled up EQA because it's used in "one number" conversions between leagues.
My conversions actually track to batting average and slugging, using different rates (which should and eventually will be related by a consistent, theoretically justified ratio), not EQA.
So let me repose the question more concretely:
The evidence, so far, suggests a .87 ba/.82 sa conversion ratio. How ought the ratio for ERA+ be linked to these?
If you want to have a ba/sa pair that fits the square root formula, go with .90/.82.
I'm aware that gadfly's data may lead to a reconsideration of these ratios, but for now getting the principle of how to set up pitcher conversions in relation to hitter conversions will be enough to make some progress possible.
72. Brent
Posted: February 06, 2005 at 03:28 AM (#1127484)
The Bill James formulas that I used for Buzz Arlett used the square root formula in going from runs to hits, but that was for batters, not for pitchers. I'd say .82, but that's purely on the basis of intuition  I can't say I recall having seen the research on pitchers.
73. KJOK
Posted: February 06, 2005 at 09:42 PM (#1128634)
Copied over from Beckwith thread:
In post # 184 EricC wrote:
Because of selection bias, we arrive at the incorrect conclusion that league A is stronger.
and Brent responded:
I'm sorry to keep coming back to this argument, but now I see another flaw in it....
Understanding selection bias is such an important part of doing proper MLE's, thought I'd try to explain with an example.
Let's say we have League I and League II, and we 'know' League II is a stronger league, with League I having a TRUE strength at around 90% of the strength of League II.
Let's assume that League I has two types of players, with half of all League I players being Type A, and the other 50% being Type B, and that this split holds for all talent levels (super-duper-stars, all-stars, very good players, average starting players, etc.) Type A players that move to League II are able to retain 95% of their value, and Type B players that move to League II retain 85% of their value (averaging back to 90% overall). The problem is that more Type A players will successfully transition to League II than Type B as detailed below, which will skew comparison results.
As players are actually selected to join League II from League I, all of the super-duper-stars, both Type A and Type B, will be selected (50% will be Type A, 50% Type B)
However, when we get down to very good players, Type A players will continue to play well enough in League II to be selected, but some of the Type B players will not retain enough value to either be selected or hold a job in League II.
As you get down to average starting players, some of the Type A players will be selected into League II, but almost no Type B players will, etc.
So, the mix of players who played in both leagues will NOT be 50% Type A and 50% Type B, but might be something like 75% Type A and 25% Type B.
At this point, if we were to develop MLE's based on the performance of players moving from League I to League II, it would APPEAR that the correct conversion factor would be about .93 (.75 x 95% + .25 x 85% = .925) instead of the REAL factor of .90!
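For anyone who wants to play with the numbers, here is a minimal Python sketch of the arithmetic in this example. The function name observed_factor is made up for illustration, and the 75/25 mix is only the "might be something like" guess from above, not measured data.

```python
def observed_factor(share_a, retain_a=0.95, retain_b=0.85):
    """Average value retained by a group of movers, given the Type A share.

    retain_a and retain_b are the hypothetical retention rates from the example.
    """
    return share_a * retain_a + (1 - share_a) * retain_b


# True factor: the whole League I population is split 50/50.
true_factor = observed_factor(0.50)

# Apparent factor: the players who actually moved are (say) 75% Type A.
biased_factor = observed_factor(0.75)

print(f"true factor:     {true_factor:.3f}")
print(f"apparent factor: {biased_factor:.3f}")
```

Changing share_a shows how quickly the apparent factor drifts away from .90 as the movers become less representative of the league.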
74. Chris Cobb
Posted: February 06, 2005 at 10:21 PM (#1128682)
KJOK,
A clear explanation of the theoretical problem, but should this be of practical concern to us?
Can/should we do anything other than recognize that the conversion factor is an average, and that this average might be adjusted up or down for individual players, depending on the extent to which we believe their skill sets would have fit the major-league game, _if_ we believe that they should be evaluated according to that standard and not on their merits within the NeL context?
75. KJOK
Posted: February 07, 2005 at 07:16 AM (#1129890)
Chris:
I certainly think there's an argument to be made that we SHOULD be evaluating the Negro League players primarily on their merits within the NeL context.
However, on "Can/should we do anything other than recognize that the conversion factor is an average...", the methodology being discussed above will be using the WRONG average as the starting point, which I believe does have some practical implications in player evaluation.
76. Gadfly
Posted: February 09, 2005 at 11:40 PM (#1135220)
Well, the Beckwith thread seemed dead so I looked around and found this one. Hopefully, no one minds if I put my two cents in on some things that have been pretty thoroughly discussed:
1) Regression to the mean or the Don Padgett Problem:
(From 1937-48, Padgett was a .288-hitting catcher-outfielder in the National League. In 1939, Don whacked .399 in 233 at bats. Very obviously, he was not the second coming of Ted Williams.)
If you ask me, you have to adjust first for park and league effects. If you don't do this, you will be mixing different park and league effects from different seasons together and the results will simply be a mess.
Mule Suttles is a pretty good example of this. If the Mule had played in the Majors of his time, I seriously doubt that he would have ever led the Majors in BA (maybe if he was in the Baker Bowl, having a great year and with the stars aligned just right).
In 1926, Suttles played for St. Louis in an all-time great hitters' park that inflated statistics about like Colorado does presently. In fact, the park pretty much made Suttles unpitchable.
Suttles was not a pull hitter, he liked his pitches out over the plate. With a 250 foot left field wall in St. Louis, it was suicide to pitch Suttles in because he could just muscle it to, or over, that wall. So the pitchers had to put it over the plate just where he liked it.
There was no concurrent park in the Majors that inflated offense like this park.
However, in 1925, Suttles played in Rickwood for Birmingham, one of the greatest pitcher's parks in history. If you do not adjust Suttles for the parks and leagues first, you will get an average that mixes and matches these two very odd parks together.
Of course, I realize that, with much of this statistical info being unavailable, this is easier said than done.
However, once you have adjusted as best you can for park and league, then you have to regress to the mean of the player's current talent level. The idea of regressing a player to the mean of the League is obviously worthless and regressing to the player's career average is better but still not right.
This, of course, is the really interesting question: "How many at bats are necessary for skill and luck to even out and give a true representation of a player's skill?"
My personal opinion is that 500 at bats is good but not really enough to be totally certain (as evidenced by the number of Norm Cash or Brady Anderson type fluke seasons in the Majors), but that 1000-1500 at bats is better.
In other words, a Negro League player should be regressed to his average over the nearest 1000 or so at bats, at the least. For example, if John Beckwith hits .450 in 200 at bats and .350 over the nearest 1000 at bats, his average should regress to .386: the .450 in 200 at bats combined with .350 over the next 350 at bats.
(And, as someone pointed out, this works in reverse: if .250 over 200, then regress to .314.)
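A quick Python sketch of the weighted average behind those two figures. The function name blended_average is hypothetical, and the 350 at-bat weight is simply the one used in the example above, not a derived constant.

```python
def blended_average(avg_sample, ab_sample, avg_nearby, ab_nearby):
    """Combine a small-sample average with the player's nearby average,
    weighting each by its at bats."""
    hits = avg_sample * ab_sample + avg_nearby * ab_nearby
    return hits / (ab_sample + ab_nearby)


# .450 in 200 at bats, blended with .350 over 350 nearby at bats.
hot = blended_average(0.450, 200, 0.350, 350)

# The reverse case: .250 in 200 at bats.
cold = blended_average(0.250, 200, 0.350, 350)

print(f"hot season regresses to  {hot:.3f}")
print(f"cold season regresses to {cold:.3f}")
```

A larger ab_nearby pulls the result harder toward the nearby average, so the choice of that weight is the real question.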
I think Chris Cobb has the right solution with a rolling five-year plan, though I would amend it to being simply the closest 1000 or 1500 at bats.
For Example:
1940: 245 at bats
1941: 302 at bats
1942: 205 at bats
1943: 256 at bats
1944: 251 at bats
The 1000 at bats normalized for 1942 would be 205 + 302 + 256 + 237 of the 496 at bats from 1940 and 1944 together.
I've been doing this for years and it seems to work just fine with, as was pointed out, the caveat that adjustments need to be made at the beginning and end of the player's career.
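Here is a rough Python sketch of that "closest 1000 at bats" window, using the 1940-1944 at-bat totals from the example. The function name nearest_at_bats is hypothetical, and splitting the leftover at bats proportionally between the two equidistant seasons is an assumption on my part (the example only says the 237 come from 1940 and 1944 together).

```python
def nearest_at_bats(seasons, target_year, target_ab=1000):
    """Return {year: at bats used} for the window centered on target_year.

    seasons maps year -> at bats. Seasons are added one step outward at a
    time; the last pair is scaled down so the window totals target_ab.
    """
    used = {target_year: float(seasons[target_year])}
    remaining = target_ab - seasons[target_year]
    span = max(seasons) - min(seasons)
    for distance in range(1, span + 1):
        if remaining <= 0:
            break
        ring = [y for y in (target_year - distance, target_year + distance)
                if y in seasons]
        ring_ab = sum(seasons[y] for y in ring)
        if ring_ab == 0:
            continue
        # Take the whole ring if it is needed, otherwise only a share of it.
        share = min(1.0, remaining / ring_ab)
        for y in ring:
            used[y] = seasons[y] * share
        remaining -= ring_ab * share
    return used


at_bats = {1940: 245, 1941: 302, 1942: 205, 1943: 256, 1944: 251}
window = nearest_at_bats(at_bats, 1942)
for year in sorted(window):
    print(year, round(window[year], 1))
```

At the beginning or end of a career the window simply becomes lopsided, which is where the caveat about adjustments comes in.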
When I get some more time, I'll put up two more posts on:
2) Brent's interesting posts on Buzz Arlett; and
3) KJOK's interesting posts on Ramon Herrera.
Two other random thoughts:
1) As I stated before, I don't think that the conversion factor from the Majors to the Negro Leagues would deviate much due to differences in talent distribution.
I think that the distributions of talent between both Leagues were probably very very similar. I think that the context is pretty much the same with the real difference simply being the talent level.
In other words, there is a conversion constant (for each year) between the Majors and Negro Leagues, and it is important to know this to adequately judge how the Negro Leaguers are rated. And this rating should be by how they would have performed if they were able to play in the Majors (individually, not all at once, so as not to disrupt League levels).
2) I think it's funny that, here in the Hall of Merit, the Negro Leaguers are still being badly discriminated against. For example, Dick Lundy and Frankie Frisch are very very similar players; but Frisch is in at #2 in the 1944 ballot and Lundy finished #29.
Basically, Lundy is pretty much the same player as Frisch, possibly slower, but with more power and better defense (Lundy was a shortstop, Frisch a second baseman).
This is why the true year-to-year conversion rates are needed.
77. jonesy
Posted: February 10, 2005 at 12:39 AM (#1135284)
Mike Herrera spent the 1923, 1924 and 1927 seasons at Springfield, Massachusetts in the Eastern League.
In 1923 he hit .354 in 148 games. 204 hits in 577 at bats. He had 31 doubles, 7 triples and 15 homers. I will try and give you some other names to gauge on.
Wade LeFler led the EL with a .369 average in 1923. Elmer Bowman hit .366, George Fisher .365, Herrera .354 and Si Rosenthal .338.
20 players in the league scored 100 or more runs. Walter Simpson had 131, R. Emmerich had 123 and Herrera was next with 122.
Bowman led the league with 211 hits, Ted Hauk was next with 204. Then came Herrera (204), Emmerich (202) and then John Donahue and A. H. Schinkle with 201.
In 1924 Herrera hit .303 in 152 games. He had 191 hits in 631 at bats with 34 doubles, 5 triples and 8 homers. He scored 114 runs.
Other recognizable major league names (with more than 100 hits) in the Eastern League in 1924 were Wade Lefler (not so recognizable) at .370, Lou Gehrig at .369, Earl Webb at .343 and Clyde Milan at .316.
In 1927 Herrera hit .243 in 94 games. He had 90 hits in 371 at bats, scored 41 runs and knocked in 25. He had 12 doubles, 0 triples and 2 homers.
78. jonesy
Posted: February 10, 2005 at 02:02 AM (#1135438)
Gadfly,
I'm sure you are interested. I did find all five of the games played between the Lincoln Giants and the Philly Colored Giants in the late summer of 1928. The games were played (one in each) in Worcester and Brockton, Massachusetts, and then the final three in New Bedford. The Philly Giants won the first three and the Lincoln Giants took the final doubleheader.
I now have about 60 full boxscores on Bill Jackman in the 1925-1930 period. I have him 5-0 in five starts against major league pitching opponents in 1925 and 1926. I also have him working in relief (no decision) in three more games against major league hurlers those same two seasons. I haven't even scratched the surface.
79. Gadfly
Posted: February 10, 2005 at 02:47 PM (#1136578)
Hi Jonesy, good to hear from you.
You probably already have these, but here are two other leads for Jackman.
1) Jackman, Burlin White, and company are discussed a little (3 or 4 pages) in the book 'Even the Babe Came to Play' by Robert Ashe (1991). The section about Jackman talking to batters and driving them nuts is pretty funny even with the racist undertone that Ashe gives it.
The book also states that Jackman went a reported 48-4 in 1927 with 2 no-hitters.
2) Jackman gave an interview in the Jan. 17, 1947, Boston Traveler newspaper. I know you stated in the Rogan thread that you had a 1947 Jackman interview, but I figured that I'd post this just in case it is not this one.
I would be very interested in knowing how Jackman did against the Lincoln Giants in those 5 games. I have always wanted to find proof of Jackman's greatness and the combined totals, posted in the Rogan thread, of the 8/30 game from that 5 game series and the 9/23, 9/30, and 10/7 games in New York are the best evidence I've ever seen.
Jackman, in four games against an elite Negro League team, went 1-3, gave up 33 hits, 16 runs, 11 walks, while striking out 28 in 33 and a third innings with his team playing very poorly behind him. Taken in context, this suggests that his reputation was deserved.
How did he do in the other 4 games of the series?
80. Gadfly
Posted: February 10, 2005 at 02:57 PM (#1136597)
Also, as Jonesy has partly posted, Ramon Herrera spent every year of the 1920s playing baseball in the USA. I'm goofing around with an analysis of his career to see what shakes out; but his team history goes like this:
Ramon (Mike) Herrera 1920-1929:
1920 Linares’ Cuban Stars of Havana
1921 Linares’ Cuban Stars of Havana
1922 Springfield Ponies (Eastern League A)
1923 Springfield Ponies (Eastern League A)
1924 Springfield Ponies (Eastern League A)
1925 Springfield Ponies (Eastern League A)
1925 Boston Red Sox (American League), last month
1926 Boston Red Sox (American League)
1927 Mobile Bears (Southern Association A)
1927 Springfield Ponies (Eastern League A)
1928 Pompez’ Cuban Stars of New York
1929 Pueblo Steelworkers (Western League A)
Of course, during the 1920s, the current Triple-A Leagues were AA Leagues, one step below the Majors. Herrera spent his decade mostly in A ball (currently AA) missing only the Texas League, two steps below the Majors.
81. jonesy
Posted: February 11, 2005 at 01:04 AM (#1137779)
I have three clippings from 1947 on Jackman but none are the one you mention. Thanks.
The first is by Jerry Nason and he compares Jackman to Paige. Mention is made of Jackman playing for East Douglas in the early 30s as a teammate of Hank Greenberg (I think it was 1929) and Jackman averaging 16 strikeouts a game with a 10 dollar bonus for each strikeout, on top of his $175 base per game. Seems very reasonable from other sources I have put together.
Then there is a story about Jackman tossing a 5-inning 3-2 win in the Boston Park League in July. He fanned 7 and was listed at 54 years old, agreeing with the early version of his birth year (1894 vs 1897).
I have "Even the Babe Came to Play" though I hadn't made note of the Jackman story. I have found many similar stories. He was almost as big a draw as the third base coach as he was on the mound.
I have the 47-2 record from, I think, a 1929 paper but I do not know how much stock I put into it.
So far, and against all comers, I have:
1925: 9-1 record with 46 hits allowed in 78 innings. 19 walks and 60 K's (BB and K missing from one of those games). He allowed 20 runs and among his victories were games over ex-major league hurlers Buck O'Brien (twice), King Bader and Earl Hanson.
1926: 7-1 record with 42 hits allowed in 75 innings. 21 walks and 71 K's (hits, BB and K missing from one game). Victory over then future major leaguer Haskell Billings.
1927: 2-1 record. 20 hits in 36 innings. 8 BB and 41 K's (BB missing from one game).
1928: 8-7 record. 111 K's and 41 BB's in 123 innings (missing some IP, BB, K and hits). This record includes the Lincoln series in New York (data from Gary A.)
In the 1928 Massachusetts series with the Lincoln Giants, he:
1. Tossed a 7-inning 3-hit shutout that he won 2-0 while fanning 9.
2. Lost a 12-3 CG in the final. Allowed 16 hits (5 walks and 5 K's) in 9 innings. This is the worst of the 60+ games I have.
In addition to the fall games with the Lincolns in NY, I did find mention of an early spring 1928 game in which he locked up in a 1-1 duel with Nip Winters. Jackman reportedly allowed but one hit and Winters two. It's undocumented beyond that.
Lots more groundwork to be done on this one. I'm on a five year plan. Only a few months into it though so I am very happy with the progress.
82. Dr. Chaleeko
Posted: February 11, 2005 at 03:35 AM (#1138100)
Lots more groundwork to be done on this one. I'm on a five year plan.
Da Kommrade, even Trotsky say, we should all be on five-year plan!
Sorry, Jonesy, I get all nostalgic for my Marxist college days whenever I see the words five, year, and plan falling in sequence.
; )
83. Paul Wendt
Posted: February 11, 2005 at 08:33 PM (#1139759)
Gadfly not far above: In other words, a Negro League player should be regressed to his average over the nearest 1000 or so at bats, at the least. For example, if John Beckwith hits .450 in 200 at bats and .350 over the nearest 1000 at bats, his average should regress to .386: the .450 in 200 at bats combined with .350 over the next 350 at bats.
. . .
I think Chris Cobb has the right solution with a rolling five-year plan, though I would amend it to being simply the closest 1000 or 1500 at bats
I agree that a fixed number of ABs is attractive.
1000 is a nice round number.
How long is the 1000-AB interval for candidate-quality players at different times in NeL history? In other words, how often do we have 1000 ABs within the five-year moving interval? within seven years? in the 1920s? in the 1930s?
Whether 1000 is a reasonable number depends on the answers to such questions as well as on the purely statistical questions about regression.
84. Gary A
Posted: February 12, 2005 at 05:42 AM (#1140763)
I'm in the midst of a study of NNL park effects in 1921. At some point I'll present it in a more systematic way, but for now check out the American Giants' home park (Schorling):
In 35 American Giants' road games, the averages (for both teams) were .271/.341/.387, with 43 home runs in 2575 plate appearances.
In 37 American Giants' home games, the averages were .206/.274/.255, with 6 home runs in 2541 plate appearances.
That makes for these park factors:
BA: .760764
OBA: .802666
SLG: .660354
HR: .141402
Wow.
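For anyone checking the arithmetic, a small Python sketch of the simple home/road ratio behind these factors. The function name home_road_ratio is made up for illustration. Only the home run counts and plate appearances are exact in the post; the batting averages are rounded to three places, so the ratio recomputed from them (about .760) will not match all six digits of the figure listed above.

```python
def home_road_ratio(home_rate, road_rate):
    """Simple one-season park ratio: rate at home divided by rate on the road."""
    return home_rate / road_rate


# Home runs per plate appearance, from the counts given above.
hr_ratio = home_road_ratio(6 / 2541, 43 / 2575)

# Batting average, from the rounded .206 (home) and .271 (road).
ba_ratio = home_road_ratio(0.206, 0.271)

print(f"HR ratio: {hr_ratio:.6f}")
print(f"BA ratio: {ba_ratio:.3f}")
```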
85. Chris Cobb
Posted: February 12, 2005 at 03:17 PM (#1141168)
How long is the 1000-AB interval for candidate-quality players at different times in NeL history? In other words, how often do we have 1000 ABs within the five-year moving interval? within seven years? in the 1920s? in the 1930s?
From 1920 to about 1930, I'd say that we get 1000 at bats over 5-year intervals, typically, in the NNL and 7-year intervals in the ECL. In the 1930s recorded at bats drop way down for most teams, with 7 or more seasons needed to garner 1000 at bats. From the 1940s, numbers are somewhat higher again, with 7 years probably being about the norm to reach 1000 ab.
My plan is to make seven years the limit for establishing a mean to which to regress: I could limit that to 5 years, if that were judged preferable.
This weekend I'll be doing new regressions for Beckwith, Lundy, and Moore, I hope, so we can address this issue with reference to specific cases, if we wish.
86. karlmagnus
Posted: February 12, 2005 at 04:10 PM (#1141208)
Gary A, the samples are too small to be able to produce a meaningful park effect. Unless we have evidence that something major was done with it, we can only go with park effects over a lengthy period of time, even in the ML. And, Chris Cobb, you need to look carefully at any season that is goosed (or, indeed, suppressed) by a funny park effect; chances are the park effect is a statistical aberration.
87. Chris Cobb
Posted: February 12, 2005 at 06:12 PM (#1141330)
karlmagnus,
I use Gary A's park effect data only to determine a general direction and magnitude for park adjustments. For an extreme pitchers' park, I'll adjust upwards by 5 to 10%. For extreme hitters' parks, the reverse. In most cases, the adjustment will be 0 to 2%. When I present my data, I'll include the park factors that I've used.
88. Gary A
Posted: February 12, 2005 at 06:35 PM (#1141349)
Karl, I'm well aware of that, as you'd know if you read my posts with any care (there are several posts on park effects in the Beckwith thread). I've been working on park effects for several seasons (1920-23 plus 1928) precisely in order to get better evidence. The evidence is quite abundant at this point that Schorling's Park had an extreme effect on offense in the 1920s. I'm also working on the 1916 season, to get a better sense of offense levels and park effects in the pre-league era. One thing to remember about players like Pete Hill and J.H. Lloyd, who played for the American Giants in the 1910s, is that the stats we have were compiled mostly in Chicago and Indianapolis, in an *extremely* low-run environment.
I can't constantly include caveats and qualifiers in every single post about how "this is just one season," or, "remember, this is only 300 plate appearances," or whatever. I assume everybody here knows that, and I also assume that even small pieces of information I come up with can be useful.
89. Chris Cobb
Posted: February 12, 2005 at 07:12 PM (#1141406)
Gary A.,
I know the feeling :/ .
I at least am tracking the data you are providing and am considering each piece of data in relation to the rest.
I look forward with great interest to 1916 data.
90. Paul Wendt
Posted: February 12, 2005 at 07:35 PM (#1141443)
Gary A on Schorling Park, NNL 1921: In 35 American Giants' road games, the averages (for both teams) were .271/.341/.387, with 43 home runs in 2575 plate appearances.
In 37 American Giants' home games, the averages were .206/.274/.255, with 6 home runs in 2541 plate appearances.
PA/game/team = 34.3
That makes for these park factors:
BA: .760764
OBA: .802666
SLG: .660354
HR: .141402
Each number is the single-season home/away ratio.
E.g., for BA (batting average), .760 = .206/.271
Not quite. For illustration, suppose the league comprises 8 ballparks and the schedule is balanced. Then each ballpark's single-season park factor, simple version, is its home/away ratio r inflated by 8/(7+r). (That is the home/away ratio regressed approximately 1/8 of the distance toward 1.) This step establishes league average 1 rather than (7+r)/8.
In the particular example, the ratio .76 for one park interpreted as a park factor implies league average park factor .97 = 776/800. Inflation by 100/97 = 8/7.76 establishes Schorling at .784 and the average of the seven other parks at 1.031.
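The adjustment can be sketched in a few lines of Python. The function names are hypothetical, and the whole thing rests on the stated assumptions of 8 parks and a balanced schedule (which, as it turns out, the 1921 NNL did not have).

```python
def park_factor(ratio, n_parks=8):
    """Rescale a home/away ratio so the league-average park factor is 1.

    Assumes a balanced schedule across n_parks parks.
    """
    return ratio * n_parks / (n_parks - 1 + ratio)


def other_parks_factor(ratio, n_parks=8):
    """Implied average factor of the remaining parks after the same rescaling."""
    return n_parks / (n_parks - 1 + ratio)


schorling = park_factor(0.76)
others = other_parks_factor(0.76)
league_average = (7 * others + schorling) / 8

print(f"Schorling:      {schorling:.3f}")
print(f"other parks:    {others:.3f}")
print(f"league average: {league_average:.3f}")
```

The last line is just a sanity check that the rescaled factors average out to 1.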
91. jonesy
Posted: February 12, 2005 at 08:37 PM (#1141517)
Gary A. and Gadfly,
Found a great article on Jackman today that said from 1925 through mid-August of 1927, his record was 81-9 for the Philly Giants.
Picked up six more games for him today, all wins.
Also picked up what looks like a very reliable rundown of his teams from 1920 to his joining the Giants in 1925.
92. Gary A
Posted: February 13, 2005 at 05:33 AM (#1142454)
Paul, as you point out, I haven't done any adjustment for Schorling not being among the American Giants' road parks. This is because the schedules are very unbalanced. Hard as it is to believe, they apparently did not visit Detroit or Columbus in 1921, and only played 4 games each in Indianapolis and Cincinnati, while visiting St. Louis and KC 11 times apiece. Within the NNL itself, the American Giants played most of their games at home; it's only their road trip to the east in October that makes their overall Negro League schedule essentially balanced between home and road games.
Because accounting for all this would be very complicated and I think within a single season probably impossible, I've decided for now to stick with rudimentary park factors, just to give a sense for how statistics were affected by the park. In the long run, a multiple season analysis *might* (I hope) give us enough data to do more sophisticated (and accurate) park factors. (And I haven't even mentioned the neutral-site games...)
Nevertheless, I think there are a few big things that we know now that we didn't before, Schorling's extreme effects being probably the most important.
93. Gary A
Posted: February 13, 2005 at 05:53 AM (#1142497)
Sorry, quick correction: the American Giants played only six games in KC in 1921, not 11. They visited once, for a six-game series, September 3-8.
94. Paul Wendt
Posted: February 13, 2005 at 03:36 PM (#1143037)
I agree, what a mess.
A schedule unbalanced in that sense (radically different shares of home games for different teams) is something we see today only when the sky is falling at the Kingdome, or whatever it was. In 1994, MLB home shares differed only moderately.
95. Gadfly
Posted: February 14, 2005 at 02:21 PM (#1144509)
In Post #19, Brent discussed his conversion factors for Buzz Arlett. Two aspects of his post fascinated me:
1) For the Pacific Coast League (PCL) during the decade of the 1920s, Brent stated that his BA conversions worked out to a .92 reduction to get a Major League Equivalency.
Since this conversion number is consistent with the present day Davenport AAA translations and my own 1940s Negro League and Triple-A studies, the implication was that the conversion rate between the Major Leagues and the highest Minor Leagues (Triple-A) has been fairly constant over time.
Thinking this implication over, my first thought was that the conversion rate would have had to be fairly constant throughout the century. Since the 1903 Major-Minor League Agreement, baseball talent has always been funneled up the Minor League ladder to the Majors. It makes sense that the ratio between the top rung and the next rung down would not change much.
But, on second thought, I realized that there is one huge difference between the Major League-Minor League relationship of the 1920s and that of the present day. In the 1920s, the Minors contained Major League players on their way up the ladder AND Major League players going down the ladder. It was not uncommon at that time for Major League players, even Hall of Famers, to finish their careers with three or four or even more years in the minors.
Many Minor League cities of the 1920s are now Major League cities, especially from the Highest Minors. The 1920s Minor Leagues had many players who played out their careers almost entirely in the Minors, despite the fact that they were obviously of Major League talent.
In the present day, the Minor Leagues, with some exceptions, basically just contain players going up the ladder. Once a player is no longer a Major League prospect, his Minor League job is in jeopardy. In this sense, the pool of present day Minor League talent is cut in half.
In other words, I would expect the Minor Leagues of the 1920s to be stronger than the equivalent present day Minor Leagues for this reason.
So I decided to do a little research and see if Buzz Arlett would support the theory.
Of course, Russell (Buzz) Arlett is a pretty fascinating player. Born in 1899, Arlett spent 1918 to 1922 (ages 19-23) as a pitcher in the PCL (hitting .247 in 695 AB over those 5 seasons). In 1923, he switched to the outfield and, like some minor league Babe Ruth, became a great Minor League power hitter from 1923 to 1936 (with his career being ended by an injury at 37).
During these 14 seasons, Arlett played in all three of the highest Minor Leagues and even spent one season in the Majors. Interestingly, every city he played in from 1918 to 1936 in the Minors (not counting a brief 1934 sojourn in Birmingham) is now in the Major Leagues (Oakland, Baltimore, Minneapolis).
The first of the following two tables lists Arlett’s career (year, AB-H, BA, league, team) from 1923-1936. The second of the following two tables lists the corresponding League totals from his career (year, AB-H, BA, league). The last number in the second table is Arlett’s BA conversion factor. For example, in 1923, Arlett hit .330 in a League that hit .300. .330 divided by .300 equals 1.100.
BUZZ ARLETT, born: Jan. 1899
1923: 445-147 .330 Pacific Coast League (Oak)
1924: 698-229 .328 Pacific Coast League (Oak)
1925: 710-244 .344 Pacific Coast League (Oak)
1926: 667-255 .382 Pacific Coast League (Oak)
1927: 658-231 .351 Pacific Coast League (Oak)
1928: 561-205 .365 Pacific Coast League (Oak)
1929: 722-270 .374 Pacific Coast League (Oak)
1930: 618-223 .361 Pacific Coast League (Oak)
1931: 418-131 .313 National League (Phi)
1932: 516-175 .339 International League (Bal)
1933: 531-182 .343 International League (Bal)
1934: 430-137 .319 American Association (Min)
1935: 425-153 .360 American Association (Min)
1936: 193-061 .316 American Association (Min)
1923: 55963-16766 .300 Pacific Coast League 1.100
1924: 56076-16719 .298 Pacific Coast League 1.101
1925: 55179-15909 .288 Pacific Coast League 1.194
1926: 53995-15120 .280 Pacific Coast League 1.364
1927: 52110-15175 .291 Pacific Coast League 1.206
1928: 52255-15212 .291 Pacific Coast League 1.254
1929: 55774-16835 .301 Pacific Coast League 1.243
1930: 55108-16650 .302 Pacific Coast League 1.195
1931: 42941-11883 .277 National League 1.148
1932: 45030-12798 .284 International League 1.194
1933: 43725-12189 .279 International League 1.229
1934: 43546-12685 .291 American Association 1.096
1935: 44088-12905 .293 American Association 1.229
1936: 44671-13159 .295 American Association 1.071
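As a sketch of how the last column is built, here is the calculation in Python for a few of the seasons, using the rounded batting averages printed in the two tables. The dictionary and function names are made up for illustration.

```python
# (Arlett BA, league BA) for selected seasons, as printed in the tables above.
selected_seasons = {
    1923: (0.330, 0.300),
    1926: (0.382, 0.280),
    1930: (0.361, 0.302),
}


def relative_ba(player_ba, league_ba):
    """Player batting average expressed as a multiple of the league average."""
    return player_ba / league_ba


factors = {year: relative_ba(p, lg) for year, (p, lg) in selected_seasons.items()}
for year in sorted(factors):
    print(year, f"{factors[year]:.3f}")
```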
Interestingly, Arlett’s career follows the classic path. After two adjustment years (1923-24 at ages 24-25), Arlett enters his prime in 1925 at the age of 26. At age 27, Arlett has his career year, hitting over 36 percent better than the league BA. At 29 and 30, Arlett has two other great years, both about 25 percent above league BA. But, basically, Arlett plays from 1925 to 1935 at about 20 percent above league BA with dips at age 35 and 37 as his career winds down to a close.
A superficial analysis of the Arlett data supports a conversion rate of .96, not .92. In 1931, Buzz Arlett’s Major League BA conversion was about 1.15 and his 1930 and 1932 High Minors conversion rate was about 1.20 (thus 115 divided by 120 equals about .96). Of course, this analysis leaves out two important factors. The first is the adjustment factor and the second is the park factor.
In 1930, Arlett had been playing in the PCL for 13 seasons. His statistics should be inflated by his long experience in the League. In 1931, Arlett played in the Majors for his only season. His statistics should be decreased by his inexperience in the League. However, in 1932, Arlett has the same inexperience factor working against him as he moves to another, unfamiliar, High Minor, the International League.
Logically, Arlett should have done better in the 1930 PCL than the 1932 IL, and the best possible match for his Major League 1931 BA factor would be his High Minor 1932 BA factor because the adjustment factor cancels out (i.e., Arlett was in his first year in the League in each season). Of course, the evidence does not show this since his 1930 and 1932 BA factors in the High Minors are virtually identical.
However, this is simply a park factor illusion. Oakland was a good pitcher’s park and Baltimore, like his Major League Park in Philadelphia, was a fantastic hitter’s park. So, at ages 32 and 33, Arlett played in the Majors and Minors in great hitters parks and with the same adjustment factor. Thus, when all is said and done, Buzz Arlett’s data still supports a BA conversion rate of about .96 (1931's 1.148 / 1932's 1.194).
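The conversion-rate arithmetic is just a ratio of two BA factors; a minimal Python sketch, taking the 1.148 and 1.194 factors as given in the table (the function name is hypothetical):

```python
def conversion_rate(major_factor, minor_factor):
    """Ratio of a player's relative BA in the Majors to his relative BA
    in the High Minors."""
    return major_factor / minor_factor


# 1931 National League factor against the 1932 International League factor.
matched = conversion_rate(1.148, 1.194)

# The rougher version: about 1.15 in the Majors against about 1.20 in the Minors.
rough = conversion_rate(1.15, 1.20)

print(f"matched seasons: {matched:.3f}")
print(f"rough estimate:  {rough:.3f}")
```

Both land at about .96, though a single 418 at-bat season is a thin base for either number.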
Of course, there are several caveats to this answer. One caveat is that Buzz Arlett’s 1931 Major League season was (as is interestingly told in the Arlett thread) disrupted by a serious thumb injury. While this would actually seem to support an even greater BA conversion rate than .96 because his injury probably decreased his BA factor, there is also the possibility that Arlett was not given enough time to regress to the mean of his talent.
For the data to be completely irrefutable, it would have been necessary for Arlett to probably get 2000 or so at bats in the Majors.
Another caveat is that the exact magnitude of the park effects on Arlett in Philadelphia and Baltimore are unknown. The Baker Bowl, in Philadelphia, was the best hitter’s park in the National League. On the other hand, Terrapin Park in Baltimore was also the best hitter’s park in the International League (and Nicollet Park in Minneapolis was the best hitter’s park in the American Association).
In fact, and more to the point, Terrapin Park was a fantastic hitter’s park for a left-handed hitter (and, once again, Nicollet Park even more so). Of course, Arlett was a switch-hitter; but this simply means that most of his at bats were left-handed against right-handed pitchers.
To sum up, it seems that the High Minors of the 1920s were of better quality than today’s present High Minors (and this is logical). Also, whether the conversion factor is .92 or .96, it is also apparent that Arlett was of near Hall of Fame or Hall of Fame quality as a hitter. As is discussed in Arlett’s thread, I think that Bill James had him pegged just about right.
If he had played in the Majors from 1923 to 1936, Buzz Arlett would have hit between .320 and .330 with over 300 HR. If he had played his whole career in the Baker Bowl, I think that he would have had some seasons of around 40 HR and a .350 BA, but that needs further study.
Obviously, further study of the High Minors in the 1920s is also needed; but there are several interesting possibilities for further comparison. The best one is probably Earl Averill.
[Finally, it would be unfair to talk about Arlett and not mention Joe Hauser.
Hauser, who like Arlett was also born in 1899, was on his way to a great Major League career when it was derailed by a series of injuries. In 1930 and 1931 (before Arlett arrived for the 1932 and 1933 seasons), Hauser played at Baltimore. In 1930, Hauser hit 63 homers. After an off year in 1931, Hauser was shipped to Minneapolis and replaced by Arlett as Baltimore’s slugger. Hauser played in Minneapolis from 1932 to 1936 and was teammates with Arlett from 1934 to 1936.
In 1932 and 1933, the left-handed Hauser led the AA with 49 and 69 HR for Minneapolis, also hitting .332 in the latter year. In 1934, Hauser had his last great year. In 82 games, Hauser hit 33 HR and batted .348. His year and, for all intents and purposes, the rest of his career was ended by injury. In that same year, Arlett hit .319 with a league-leading 41 HR for Minneapolis. All in all, I think Arlett was a better hitter than Hauser, but the similarities between their careers are interesting and Arlett is not all that much better.
One last fascinating thing about Hauser is his 1933 home-road HR breakdown. In 1933, Hauser hit 19 HR on the road and a staggering 50 at home in Minneapolis. It was a hell of a hitter’s park.]
The second aspect of Brent’s Post #19 that fascinated me was this:
2) His breakdown of BA conversion rates into individual components (1B-2B-3B-HR).
But I’ll have to post analysis about that (and the related stuff on Ramon Herrera) later.
96. Paul Wendt
Posted: February 14, 2005 at 03:26 PM (#1144580)
there is one huge difference between the Major League-Minor League relationship of the 1920s and that of the present day. In the 1920s, the Minors contained Major League players on their way up the ladder AND Major League players going down the ladder. It was not uncommon at that time for Major League players, even Hall of Famers, to finish their careers with three or four or even more years in the minors.
. . .
In the present day, the Minor Leagues, with some exceptions, basically just contain players going up the ladder. Once a player is no longer a Major League prospect, his Minor League job is in jeopardy. In this sense, the pool of present day Minor League talent is cut in half.
Here are some factoids and interptoids that I don't have knowledge to develop. Neither fact nor interpretation is original.
 The Farm Clubs project, Minor League Cmte, SABR, covers 1930 in one section with research relatively advanced and 1930 in another section on the contrary.
 1930s-50s, the PCL had more control over its players than did AA and IL.
 Today, many AAA farms are very close to the mlb city (Pawtucket & Boston, < 1 hr). The AAA club is used as a taxi squad and a rehab site. AA has more good prospects (more talent less skill?).
 Today, the independent leagues employ many of the best players who are not prospects. Felix Jose, Nashua NH 2001(?).
97. jonesy
Posted: February 15, 2005 at 01:41 AM (#1145763)
And remember that circuits like the New England League (Class B circa 1910) and the Eastern League (Class A circa the 1920s) were often giving up more of their players directly to the big league clubs than were the AA, PCL and IL.
The reasons were purely economic. It was cheaper for the Boston Red Sox to purchase a pitcher from Lynn or Brockton, Massachusetts, than it was from Omaha of the Western League or Indianapolis of the AA. It was quicker for the Yanks, Giants or Dodgers to purchase players from New Haven or Hartford than it was from Oakland or Dallas.
And it wasn't uncommon for the major league club to already own a player's contract and farm him out to Lynn or Brockton, where he could be observed and called back quickly.
The minors were a whole different ball of wax then. The Red Sox had partial ownership of teams like Sacramento, Providence and Buffalo in the 1910-1920 era.
98. jimd
Posted: February 16, 2005 at 11:49 PM (#1149490)
In other words, I would expect the Minor Leagues of the 1920s to be stronger than the equivalent present day Minor Leagues for this reason.
One would expect this to be true IF the ratio of Major League clubs to High Minor league clubs were the same in the 1920s as it is today. If there were more High Minor league clubs than major league clubs in the 1920s, then these could help absorb those major league players who were on the way out.
99. Chris Cobb
Posted: February 17, 2005 at 12:24 AM (#1149531)
Another question on regression:
It has been observed that the regression formula I have used on the Beckwith MLEs is evening out Beckwith's seasons too much.
I think this is a mathematically correct observation, because if the regression formula were applied to a player with fully documented play in 154-game seasons, that player would also be regressed.
What is needed is a regression that adjusts the NeL short seasons towards the mean by an amount appropriate to projecting performance into a 154-game season.
I see two possibilities on which I'd appreciate comment:
1) Getting the "large-sample" average for the player using five-year consecutive sums, but setting the regression ratio by taking the square root of the ratio of the player's recorded at bats to his projected at bats for that season.
Hypothetical example, since I don't have real numbers handy.
Beckwith's MLE average for 1925 is .371 in 210 at bats. His five-year average is .342. He is projected as having 580 ML at bats for 1925. Regress .371 toward .342 by the square root of 210/580.
By this method, .371 regresses to .359.
2) Getting the "large-sample" average for the player using five-year consecutive sums, but regressing only the projected at bats towards the large-sample mean, giving the player full credit for his actual level of performance.
Hypothetical example:
Beckwith's MLE average for 1925 is .371 in 210 at bats. His five-year average is .342 in 970 at bats. He is projected as having 580 ML at bats for 1925. Regress .371 toward .342 for the 370 projected at bats by the square root of 370/970. Calculate the seasonal average by combining the regressed average for the projected at bats with the actual MLE average for the actual at bats.
By this method, .371 regresses to .360. .360 in 370 at bats and .371 in 210 at bats yields a full-season average of .364.
The current formula would regress Beckwith's .371 average to .355.
Which of these alternatives seems like the best model to use?
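For anyone who wants to check the arithmetic, here is a rough Python sketch of the two options using the hypothetical Beckwith figures from this post. The function and variable names are made up for illustration, and the numbers are the post's hypothetical ones, not real data.

```python
from math import sqrt

def regress(observed, mean, keep):
    """Pull `observed` toward `mean`, keeping the fraction `keep` of the gap."""
    return mean + (observed - mean) * keep

# Hypothetical figures from the post (not real Beckwith data).
mle_avg, mle_ab = 0.371, 210              # converted average in recorded at bats
five_year_avg, five_year_ab = 0.342, 970  # "large-sample" average
projected_ab = 580                        # projected ML at bats for the season

# Option 1: regress the whole season by sqrt(recorded AB / projected AB).
option1 = regress(mle_avg, five_year_avg, sqrt(mle_ab / projected_ab))

# Option 2: regress only the missing at bats, then blend with the recorded ones.
missing_ab = projected_ab - mle_ab        # 370
missing_avg = regress(mle_avg, five_year_avg, sqrt(missing_ab / five_year_ab))
option2 = (missing_avg * missing_ab + mle_avg * mle_ab) / projected_ab

print(f"option 1: {option1:.3f}")
print(f"option 2: {option2:.3f}")
```

Both results land on the figures quoted above (.359 and .364), so the two options differ by about five points in this example.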
100. karlmagnus
Posted: February 17, 2005 at 01:46 AM (#1149602)
The former. The latter has no basis whatever in mathematical reality; it's double counting the divergence of the small sample from the mean.
Regression to the mean only flattens peaks because it takes an average for the "rest" of the season. In reality, sometimes he would hit less than average for the "rest" of the season, sometimes more, giving some seasons where his good "actual" numbers disappeared altogether (not just flattened but eliminated) and some where they weren't flattened, or were only flattened a little. By regressing him only to his 5 year average, you will in any case be raising the whole "elevation" of his peak seasons.
I have raised Beckwith several notches following your analysis, which I support wholeheartedly in principle, while believing it tends still to round up in practice. But fiddling with the regression formula because the peak's not high enough, or twisting your .87 conversion to .95 to please Gadfly would lose it all credibility, as far as I'm concerned. Gadfly's cuckoo (if he can call me racist, I can call him mad!)
I can see that an incomplete career line would be regressed to the NeL mean. Would the incomplete seasons then be regressed to the player's regressed career line? Does that make any sense?
I think obviously Beckwith was not an average Negro League hitter, so regressing to THAT population wouldn't be correct.
Regressing to his career totals would probably be the way I would go absent a better method. You could use your proposed 5 year mean idea IF you have sufficient # of plate appearances during that time I guess.
Going the other way round, converting first, gives you a .340 MLE, which when regressed 50% as above gives you .295.
In other words, regressing and conversion aren't commutative.
(i) Is this what you're doing?
(ii) Is it correct that you should regress to the NL mean first - surely right?
(iii) Am I then correct that doing it the other way round wrongly inflates MLEs?
This would explain why we're getting so many HOMable NLers; the difference between .276 and .295 is not gigantic, but it is substantial.
I'm quite prepared to be told I'm out to lunch, and I will promise to understand why if I am.
[one section of Beckwith #135, nearly copied here]
Gadfly #123
I will try to find reference to the study I saw that did 1940s Major-Minor Translations for you and post some Negro League-Triple-A comparisons when I have time. It was my understanding that the .92 conversion was overall. In other words, BA would be reduced by say .95 and slugging by .89. But I could be wrong.
Clay Davenport's minor league translation factors have some currency; indeed, I know of them only indirectly, by reference in remarks on major leagues. He uses the "overall" measure EqA. If I understand correctly, his translation factors should have Gadfly's property: magnitude between batting and slugging factors.

Also in Beckwith, Gary A #108 and jimd (page one) on ballparks used by NeL and MLB in the same year. For any year, a good share of NeL games played in MLB parks should yield a good estimate of NeL-average park factor (where MLB-average = 100).
What I have been doing is converting and then regressing.
It appears the operations are not commutative.
It is not intuitively obvious to me that the regression should precede the conversion. My intuition, and it is purely intuition, is that the other order is correct.
It is also not intuitively obvious to me that regression to the league average is appropriate (a concern seconded above by KJOK).
I may be signing a statistics book or two out of the library or contacting my math department's extension office . . .
Definitely Convert before attempting to Regress.
Key Q: what is my goal? If I'm a career voter, I don't much care about regressing anyway. If I want to measure 'peak' or 'prime', I may as well regress to the length by which I measure these. But if I were to measure peak for Negro leaguers, I might rely more heavily on contemporary opinion than stats (I will bow more to stats, if we have them, for career value).
Regressing to the league mean won't do anything besides drag everyone's stats to the center.
Regressing to a rolling average seems to make sense to get a truer shape of a career, as long as you keep in mind the 'typical' career shape. As in, I could see using an uncorrected average for the years age 25 to 30, but not for 32 to 36 when we expect decline anyway.
Tom's personal most important law of stats: everything in life varies with the square root of N (the sample size). [You need 4 times as much data to cut the uncertainty in half.]
I THINK convert, then regress is correct. Even in the Major Leagues, you can have Mike Matheny hit .395 for the month of April, so a very good Negro League player hitting .450 for 60 games in the Negro Leagues would not be unexpected. You're also going to have more Negro League players who hit .125 for 60 games in the Negro Leagues than Major League players who hit even .150 for a full season, etc.
The conversion just gives you what that .450 would have been vs. Major League competition for 60 games. From there, it's no different from regressing any Major League player's 60 games into 154 or 162.
I don't do ANY regression on my MLE's, but I also tend to ignore season to season performance for Negro Leagues players and just look mainly at their career MLE's.
For example, in a paper in the 1975 Journal of the American Statistical Association, Efron and Morris looked at the following scenario: suppose you know the batting averages of 18 players over their first 45 at bats, and don't know anything else about the players. How would you estimate their averages over their remaining at-bats of the season? In that case, the answer is to regress all the players toward the overall mean; if a player starts the season hitting .400, chances are he is an above average hitter, but it is also highly unlikely (unless you have additional information about him) that he will continue to hit .400 the rest of the season. The amount by which you shrink the players toward the overall mean depends on the standard deviation of their batting averages. A summary of the formula and a baseball example can be found in section 5 of this paper by Efron, though I'll warn you that there is a lot of math. A more reader-friendly version appeared in 1977 in Scientific American.
Your situation is different, but I think the same formula could apply. Instead of knowing averages for p players over a few games, you know the averages for one player over the p years of his career. You know what his average was over perhaps 50 games, but you assume he would have played perhaps 140 games under major league equivalent conditions, so you are trying to predict what he would have hit over the remaining 90 games. I think you would use the same formula, regressing toward his career average based on the standard deviation of his batting average over his career.
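To make the shrinkage idea a little more concrete, here is a minimal Python sketch of the textbook James-Stein form (shrink each average toward the group mean by a factor that depends on the spread of the averages). This is only an approximation of the idea: the paper itself works with transformed averages, the function name is made up, and the five averages below are invented for illustration, not taken from the paper or from any player.

```python
def shrink_toward_mean(averages, at_bats):
    """Shrink each average toward the group mean (James-Stein style sketch).

    Assumes every average comes from the same number of at bats, that there
    are more than three averages, and that they are not all identical.
    """
    k = len(averages)
    grand_mean = sum(averages) / k
    # Binomial sampling variance of one average, evaluated at the group mean.
    sampling_var = grand_mean * (1 - grand_mean) / at_bats
    spread = sum((avg - grand_mean) ** 2 for avg in averages)
    # Fraction of each deviation that is kept; clipped at zero.
    keep = max(0.0, 1 - (k - 3) * sampling_var / spread)
    return [grand_mean + keep * (avg - grand_mean) for avg in averages], keep

# Invented early-season averages over 45 at bats, for illustration only.
early = [0.400, 0.310, 0.270, 0.240, 0.180]
shrunk, keep = shrink_toward_mean(early, 45)

for before, after in zip(early, shrunk):
    print(f"{before:.3f} -> {after:.3f}")
```

For the single-player case described above, the list would hold one player's seasonal averages and the group mean would be his career average, with the caveat that real seasons rarely have equal at bats.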
Do you regress first, then convert, or convert first, then regress? I don't know. Statistical theory is good at coming up with formulas, but telling you how to apply them relies more on the experience and judgment of the people doing the calculations.
BTW, Carl Morris, in addition to being a prominent statistics professor at Harvard, also dabbles in sabermetrics. He has a runs generator formula that may be the most sophisticated one around; he calls it simple, but it's too complicated for me to use. For some reason, I can't get the link to work in this post, so if you're interested I suggest that you google: "simple runs per game" carl morris.
If one league is only .85 as good as the other, then you need to reduce the league average for the lesser league before regression: use .85*.250=.212 for the league average if you convert first, then regress.
If you do it that way, you'll get the same result if you regress first or convert first.
Example A: .400*.85 = .340 regress 50% to .212 = .276
Example B: .400 regress 50% to .250 = .325 * .85 = .276
This is all for sake of example, you'd actually regress 125 AB a lot more than 50%. You can find more on this kind of stuff at tangotiger.net.
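A quick Python check of Examples A and B, plus the mismatched version that regresses a converted average toward an unconverted league mean. The names are invented for the sketch; the .85 factor, the .250 league average and the 50% regression are the example values used above, not estimates.

```python
def regress(observed, mean, fraction):
    """Move `observed` the given fraction of the way toward `mean`."""
    return observed + (mean - observed) * fraction

factor = 0.85       # example quality ratio between the two leagues
raw_avg = 0.400     # player's average in the lesser league
league_avg = 0.250  # lesser league's average
fraction = 0.5      # example regression amount

# Example A: convert the player AND the league mean, then regress.
convert_first = regress(raw_avg * factor, league_avg * factor, fraction)

# Example B: regress in the original league, then convert.
regress_first = regress(raw_avg, league_avg, fraction) * factor

# The apples-and-oranges version: converted player, unconverted mean.
mismatched = regress(raw_avg * factor, league_avg, fraction)

print(f"{convert_first:.3f} {regress_first:.3f} {mismatched:.3f}")
```

The first two agree at .276 because both steps are linear; the third comes out at .295, which is the gap discussed earlier in the thread.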
This is not an accurate representation of NeL statistics. Few conversions that I have done, even without regression, show the NeL players as leading the major leagues in batting average, which is the result we would expect according to the above.
Here's a lightning-fast survey of NeL leaders, converted, compared to ML leaders. This will use only .87 as the conversion factor for batting average, not getting into league-offense levels, park adjustments, or regression to the mean, but it should suffice as a demonstration.
Seasons by Negro-League players that could, by this conversion, have won an ML batting title are marked in bold.
Year | NeL W lead (MLE) | NeL E lead (MLE) | NL / AL leads
1920 | .399 (.347) | .409 (.356) | .370 / .407
1921 | .484 (.421) | .361 (.314) | .397 / .394
1922 | .451 (.392) | .404 (.351) | .401 / .420
1923 | .433 (.377) | .441 (.384) | .384 / .403
1924 | .409 (.356) | .382 (.332) | .424 / .378
1925 | .428 (.372) | .419 (.365) | .403 / .393
1926 | .498 (.433) | .351 (.305) | .353 / .378
1927 | .426 (.371) | .435 (.378) | .380 / .398
1928 | .405 (.352) | .563* (.490) | .387 / .379
1929 | .390 (.339) | .464 (.404) | .398 / .369
*This legendary batting performance by the 44-year-old Pop Lloyd is the only .500 season in these records, and data posted to our site by KJOK replaced this incredible number with one that was much more credible. I use the number from Holway simply to follow my source.
My simple MLE conversion turns up 5 batting titles out of 20 for Negro-League players. One of those is based on a batting average that has been proven to be apocryphal, and one is a tie. Only one other places above the range of major-league averages, the .433 average posted by Mule Suttles in 1926. Even slight regression would easily pull it into the range of ML values, and this average was achieved in a hitter's park probably more extreme than any in the majors at that time.
In sum, I hope this quick example shows that there is no evidence to support the contention that NeL batting averages "often" run so high that heavy regression is needed to convert them to major-league statistics.
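The survey can be re-run with a short Python sketch. The leader figures are copied from the table in this post (Holway's numbers, including the disputed 1928 one); the names are invented, and "could have won a title" is taken here to mean matching or beating the lower of the two ML leaders, which is an assumption about how the table was marked.

```python
CONVERSION_PCT = 87  # the .87 batting-average factor, as a percentage

# (year, NeL West lead, NeL East lead, NL lead, AL lead), in points of average
LEADERS = [
    (1920, 399, 409, 370, 407),
    (1921, 484, 361, 397, 394),
    (1922, 451, 404, 401, 420),
    (1923, 433, 441, 384, 403),
    (1924, 409, 382, 424, 378),
    (1925, 428, 419, 403, 393),
    (1926, 498, 351, 353, 378),
    (1927, 426, 435, 380, 398),
    (1928, 405, 563, 387, 379),
    (1929, 390, 464, 398, 369),
]

def to_mle(points):
    """Apply the .87 factor and round to the nearest point (integer math)."""
    return (points * CONVERSION_PCT + 50) // 100

title_seasons = []
for year, west, east, nl, al in LEADERS:
    for league, raw in (("W", west), ("E", east)):
        if to_mle(raw) >= min(nl, al):
            title_seasons.append((year, league, to_mle(raw)))

for season in title_seasons:
    print(season)
```

This turns up the same five seasons out of twenty: 1921 (West), 1923 (East, the tie), 1926 (West), 1928 (East, the disputed figure) and 1929 (East).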
On the Arlett thread I've been using a Bill James method, which rather than converting batting averages or slugging averages directly, converts each element of the batting line. I wondered how his approach might compare in terms of the batting average conversion. It reduces a Triple-A player's hits to 89 percent of his total, but it also reduces his at bats by 3 or 4 percent. Sure enough, the conversion factor for batting average turned out to be .92! (Actually, it varies depending on the player's minor league average, but for averages between .268 and .360 it rounds to .92.)
For slugging average, James applies a square root to home runs and triples, but does not use a square root for hits or doubles. The ratio depends on the characteristics of the hitter, but for a .300 hitter with 20 HR per year, the conversion factor is .90. If you increase his power to 40 HR per year, it drops to .89; if you keep the 40 HR and reduce the average to .240 (we're talking Dave Kingman) it drops to .88; if you keep the average at .300 with 5 HR per year, it rises to .91. So it's a relatively limited range, and I'd guess that for most of the players we're interested in .89 or .90 would be applicable.
My next effort on major league equivalencies is to look at the quality differential for a group of 1920s PCL players who also played in the majors. There's a Web site that has all the minor league hitting statistics for the Portland Beavers. Roughly half the regulars also played regularly in the majors, so I should be able to come up with a good sized sample. I'm still looking for more data on the PCL run environment during the 1920s though.
Rallymonkey has pointed out the flaw in convert-then-regress; you should regress to the MLE of the NL average if you do it that way round, i.e. in my example regress to .212 not .250. It appears we have a systematic error here of quite some magnitude.
Again, your claim is simply inaccurate in its interpretation of the data at very simple levels.
1) Of these supposed five times you claim the converted NL leads the majors in batting average, one (1928) has been thrown out as bad data; I included it because it was the _only_ .500+ batting average over a ten-year period, which refuted your claim that .500 batting averages happened "rather often" in the NeL. A second year was counted as a batting title because it tied the _lower_ of the two major league batting titlists, so it can't be said to lead the major leagues. So that's 3 in 10 major-league-leading seasons.
2) That 3 in 10 leading seasons is BEFORE ANY regression, and I am not and have not argued that regression is not needed. I am trying to figure out HOW MUCH. You take incompletely processed data that I present only to demonstrate the wild factual inaccuracy of your claims that .500 averages were not unusual in the Negro Leagues and that major regression is necessary to make them fit with "real" major-league averages. I present a quick data set to show that neither of your points has any factual basis, and you use those incompletely processed MLEs as evidence that the conversions I am doing contain "a systemic error of quite some magnitude."
The treatment of the .498 season that you advocate would regress that .498 season to .338. That is hardly appropriate. 1) If that regression were applied to every NeL season, the number of ML batting leaders would be 0 for the 28-year history of the leagues, not the 1 or 2 in ten years you say we would expect. 2) If .338 is the highest MLE a Negro Leaguer averaged during the 1920s, then NONE of them are HoMers, which contradicts a) what we would expect from the most conservative demographic estimates, b) the demonstrated performance of Negro-League stars vs. major-league competition, and c) their reputations.
Mule Suttles (our .498 hitter) averaged .341 for his career in 3230 at bats. We can agree, perhaps, that this is a large enough sample size that it doesn't need to be regressed to the mean?
Working just with the .87 conversion for batting average (skipping league offensive levels, park factors, and arguments about whether .87 is correct), Suttles comes out as a .297 major-league hitter, career. That makes .338 41 points above his lifetime average. The average amount by which major-league batting leaders exceed their lifetime averages during the 1920s is 48 points, the highest being 80 points (George Sisler's .420 in 1922). It is evident that a system that regresses the _most extreme_ Negro-League outlier (157 points above the player's lifetime average; some drawing of this average towards the career mean is clearly necessary) to a lesser variance than the average variance for a major-league batting leader regresses too much to give proper credit to peak value.
I hope later today to take time to think through other posts more fully. Thanks to everyone who has responded on how regression to the mean should be used.
I might also add that the highest batting average I found in the NNL for that year was Willie Wells's .365 (which would translate to .318). Pythias Russ (the .405 hitter) hit .346 in the games I have, though I'm missing two dozen games played in Chicago. Of course, these were all played in Schorling's Park, so I don't know whether Russ could have raised his overall average by 60 points by his performance in those games.
Also, the highest batting average I have in the west for 1921 is almost 50 points lower than what Holway records. I'm missing 14 St Louis games, so I suppose it's possible that Blackwell or Charleston could have raised their averages by that much.
Four out of 20 doesn't seem excessive to me in the least, Chris.
If you regress Mule Suttles's .498 50% (which may not be the right percentage; how many ABs was that .498 on? You should normalize to about 500 ABs, I would think, so 50% would be for 125 ABs, 60% for 180, etc.) towards his career average of .341, you get .419, which converted at 87% gives .364, a batting title in the NL but not the AL in that year. Converting first you get .433, but as rallymonkey pointed out you then have to regress that 50% not to .341, but to .341 x .87, or .297, which again gives you .364. Regressing it to .341 is comparing apples and oranges; it would give you .387, but that's a meaningless number.
How do you determine how much to regress? You seem to be using a formula that you assume is obvious, but I, alas, am ignorant of it.
An explanation of that would be most helpful.
Data on the number of at bats Suttles had in 1926 is at home, so further work on the specifics of that season will have to wait a bit.
If we are agreed that one regresses to the player's career average, converted to a major-league equivalent, rather than to the NeL average converted to its major-league equivalent, I think we are on the right track. It certainly appears to me that the range you are presenting for Suttles, .364-.392 depending upon the amount of the regression, is reasonable.
That's what I meant, karlmagnus. I typed 20 by accident.
I'm pretty sure that's right, and am a math major (Cambridge), but it's now 34 years since I graduated and I threw away all my stats books as that was a part of the subject I hated, so could be all wet. But that is how I would regress NL stats gained in short seasons so they were equivalent to ML stats gained in 500-AB seasons.
The 1880 Chicago Cubs went 67-17 (.798) in an 84 game schedule. Using a binomial distribution, this is 5.46 standard deviations away from the expected 42-42 (.500) that is the league mean.
How would this team do under a 162 game schedule? Simple extrapolation says 130-32 (.802), but this is 7.70 SD away from 81-81, which is too high. Regression to the mean says that it would go about 116-46 (.716), which is 5.50 SD.
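The Cubs numbers can be reproduced with a few lines of Python. This is just a sketch of the calculation described above, using the binomial standard deviation of wins for a .500 team; the function name is invented for the example.

```python
from math import sqrt

def binomial_sd(games, p=0.5):
    """Standard deviation of wins for a team with true win probability p."""
    return sqrt(games * p * (1 - p))

wins, games, full_season = 67, 84, 162

# How many SDs above .500 was the actual 67-17 record?
z_actual = (wins - games / 2) / binomial_sd(games)

# The straight extrapolation quoted above (130-32) is far more extreme.
z_extrapolated = (130 - full_season / 2) / binomial_sd(full_season)

# Hold the number of SDs constant instead and convert back to wins.
regressed_wins = full_season / 2 + z_actual * binomial_sd(full_season)

print(f"actual: {z_actual:.2f} SD")
print(f"extrapolated: {z_extrapolated:.2f} SD")
print(f"regressed: {regressed_wins:.1f} wins of {full_season}")
```

Holding the SD count fixed gives roughly 116 wins, in line with the 116-46 figure above.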
The formula (technically, the normal approximation to the formula) for the standard deviation of a proportion (which is what a batting average is) is
SD = square root of [AVG*(1-AVG)/AB]
So if a player hits .400 in 100 AB, we are really sure (95%, which is two standard deviations) that his 'true' average is somewhere within +or- [.4*.6/100]^.5 * 2 = .098 of .400.
That looks huge, but even in 550 ABs, it's +or- .042, and it's true that MLB hitters do occasionally hit 42 pts above or below their lifetime avgs in a long season, and they do sometimes hit 100 pts lower or higher in a month.
But I am NOT suggesting we 'regress' NeL stars an extra 50 pts or so for a 100 AB season. The above is based on a pre stats test that assumes we KNOW NOTHING ELSE about the player. For a Beckwith, if his typical NeL avg is .350 and he hits .400 one year, I'd weight his average something like
.400 for 100 AB
.350 for maybe 500 AB, where I estimate that my knowledge of his 'lifetime curve' is possibly 5 times the 'weight' or certainty of his one season
and so his estimated NeL avg for that year would be (.400 * 100 + .350 * 500) / 600 = .358.
voila! (or not...)
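Here is the same arithmetic as a small Python sketch: the two-SD band for a .400 average, and the weighted blend of one season with a heavier 'lifetime curve' estimate. The function names are made up, and the 500 AB weight is the rough guess from this post, not a derived quantity.

```python
from math import sqrt

def proportion_sd(avg, at_bats):
    """Normal-approximation SD of a batting average over `at_bats`."""
    return sqrt(avg * (1 - avg) / at_bats)

# Two-standard-deviation (about 95%) bands around a .400 average.
band_100 = 2 * proportion_sd(0.400, 100)
band_550 = 2 * proportion_sd(0.400, 550)

def blend(season_avg, season_ab, baseline_avg, baseline_weight_ab):
    """Weighted average of a season and a baseline, weighting by at bats."""
    total = season_avg * season_ab + baseline_avg * baseline_weight_ab
    return total / (season_ab + baseline_weight_ab)

# .400 in 100 AB blended with a .350 baseline given the weight of 500 AB.
estimate = blend(0.400, 100, 0.350, 500)

print(f"+or- {band_100:.3f} in 100 AB, +or- {band_550:.3f} in 550 AB")
print(f"blended estimate: {estimate:.3f}")
```

Changing the baseline weight is the whole judgment call here: a larger weight trusts the career curve more and flattens the season further.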
It is also not intuitively obvious to me that regression to the league average is appropriate (a concern seconded above by KJOK).
inappropriate inapprop inapp inapp inapp
The player's measured performance should regress toward the expected performance given his contemporary skill.
TomH #13
rightly says why you should convert first and implies why you should publish the intermediate results. Several people are interested in convert-only; few are interested in regress-only.
Brent #16
may be right about how to approach a different problem, where you have some random sample of part-seasons from a player's career. Suppose you have data for 8 randomly selected 1/3-seasons for someone who played 16 years.
But you don't have that; the sampled 1/3-seasons are dated and you know the player's birth date and something about how careers generally develop. So Brent (regression to career average) is wrong here and TomH #13 is broadly right:
Regressing to a rolling average seems to make sense to get a truer shape of a career, as long as you keep in mind the 'typical' career shape.
Of course, that quotation isn't a complete plug'n'play solution :)

TomH #13:
Tom's personal most important law of stats: everything in life varies with the square root of N (the sample size).
The application is tricky here, for those (most HOMers) who are interested in the player's full-season achievements rather than the player's skill at that time in his career. Consider a 20-game sample from a 50-game season, doubled to provide a 40-game sample from that season. Clear? If not, consider 20- and 40-game samples from a 40-game season.
So, data from just a few more games is more valuable here than in the world ruled by "Tom's" law. In other words: that law is discouraging but don't let it down your search for more boxscores!
If so, convert-then-regress is wrong if you want career WS.
Even if not, it is wrong if you want peak WS.
Gary A #96, jimd #99
http://www.baseballthinkfactory.org/files/primer/hom_discussion/24597/P100
Gary A #[10]8
Redland Field, Cincinnati, 1921 park factors
111, 104 adjusted in Negro Leagues (Gary A #96)
99 in National League (Gary A #96)
95 in National League (jimd #99)

As I said in #11: For any year, a good share of NeL games played in MLB parks should yield a good estimate of NeL-average park factor (where MLB-average = 100).
But we don't have a good share of NeL games played in MLB parks, and for some NeL games we have no data.
thus in a 162 game schedule they would have gone .715.
Close enough. I think I've got the main point. Which is that the relationship to the mean, measured in standard deviations, remains constant when converting the results from a small real sample to a larger extrapolated one. (Which is what "regression to the mean" means if I think about it. D'Oh. ;)
To be able to calculate appropriate regressions, I'll need statistics that include at bats or games more often than Holway does. That means getting hold of a MacMillan 8-10 edition, I think.
This comment is by no means meant to end discussion of regression, btw, just to express my appreciation and to note its implications for data gathering.
I'll pick up on the theoretical question that I think gadfly meant for me and not for Gary A:
So, if the distribution of superstars is roughly equal between white and black, why do none of your translations for the Negro League Superstars from the 1920s and 1930s end up with a lifetime BA of .330 to .350 like their white counterparts?
I don't have a firm response to this, but here are my thoughts about the situation. I have never attempted to justify my translations in demographic terms. I don't think it's possible to derive from demographic/economic arguments the percentage of stars by race with any certainty. For the purposes of Hall of Merit elections, I believe that a quota approach to electing black players would create an inappropriate double standard. One of my goals in trying to develop accurate and reliable MLEs has been to make quota arguments unnecessary. So I have not attempted to measure the results of my translations against any demographic standard.
That said, I'm not at all sure my MLEs are correct. My gut, which is sensitive to players' reputations of greatness and to expert opinions, says that they are a bit low. However, my standard for calculating them is to base no step in the process on my sense of what the numbers _ought_ to show, but to construct a system that creates statistics that are (1) derived from the best available data and (2) based on conversion methods that have been discussed by the interested members of the electorate and that have been generally (if not universally) accepted as sound.
If the results seem lower than they ought to be, we have the commentary of experts to challenge the results and to help us find ways to do better. I think that you are probably right that my MLEs are a little low, and I hope the electorate here will consider the serious likelihood of that based on your comments and other evidence of expert opinion.
But I can't change the system based on opinion, or it ceases to be a system that aims at an objective numerical statement of value. If we find evidence that I have erred in a calculation or gain access to evidence that leads to different conclusions about the conversions, I can make changes to improve the system accordingly. I hope to do so. I believe improvements in my handling of regression based on recent conversations will improve the system and give a fairer representation of peak value.
I can see a number of points where the evidentiary basis for the conversion factor could be improved:
1) NeL park factors from 1938-1948. A lot of the data for the conversion comes from Doby and Irvin in Newark. We know that was a hitter's park, but how extreme was it, exactly, and what percentage of their games did they play there? If I have used too low a park factor, that would depress the conversion factor incorrectly.
2) Data on the overall level of offense in the NeL from 1938-1948 in comparison to the major leagues, especially in the Negro National League. Evidence from the 1920s provided by Gary A. indicates that, although NeL levels of offense tracked with ML levels, these diverged at times by up to 10% (I think that the difference was even greater in the late teens), with the ML levels being higher. I have taken this into account for 1920s conversions, which raises the MLEs of Negro-League players. Eyeballing the numbers for the late 1930s and early 1940s, it looks like offensive levels were high in the Negro Leagues at that time, quite possibly higher than in the majors. If this is the case, that again, if not properly accounted for, would depress the conversion factor incorrectly.
3) Use of AAA conversion studies could help to better model the process of arriving at a conversion, provide a point of comparison for the Negro Leagues' competition level, and add statistics from NeL stars playing in the high minors to the pool of data available for the calculation of a conversion factor. Studies on the level of competition in the Mexican League would be similarly useful (and will be important for the assessment of players like Cool Papa Bell, Ray Dandridge, Martin Dihigo, and Wild Bill Wright in any case).
4) Striking the right balance between conversion rates for batting and conversion rates for slugging. The discussion of the square-root relation between the two numbers has been helpful and should help to improve the accuracy of individual conversions and provide a standard by which the conversion factors can be judged. Obviously, the .87/.82 split I am using now isn't right. It appears to be a compromise between two different conversion levels. Figuring out why my calculation of conversion factors from the data produced this discrepancy could lead to a more reliably derived factor.
I am hopeful that I/we can do better on all of these fronts.
their meaning and statistical foundation explained by the author
how many black stars?
 production and recruitment of quality ballplayers before WWI;
 (im)maturity of baseball in the South;
 black residence in (rural) South, migration to North
#[2]71 jimd's racial and regional demographic "Adventure"
compare how many HOMers each year?
 production and recruitment of quality ballplayers;
 regional (im)maturity of baseball
Having read through the postings above, I think I understand the rationale and the formula for calculating regression to the mean.
To summarize: regression to the mean corrects for the greater variance created by small sample size by keeping the number of standard deviations from the mean constant when moving from the small actual sample to the larger, hypothetical sample of data.
The ratio of the standard deviations of the two sample sizes is the square root of the ratio of the two sample sizes. The variance from the mean in the small sample is multiplied by the ratio of the standard deviations to find the variance that is the same number of standard deviations from the mean in the larger sample.
Mule Suttles hit .498 in 1926 in 212 at bats. His career average was .340. Suppose we accept .340 as the mean and 500 as the normal number of actual at bats in a season.
Since some voters will want to see unregressed conversions and some will want to see regressed conversions, I'll convert and then regress, using the .87 factor for now: .340 -> .296, .498 -> .433
His .433 MLE average is regressed as follows:
Variance from the mean: .433 - .296 = .137
Ratio of standard deviations: square root of 212/500 = .651
Multiply the variance by the ratio (.137 x .651 = .089), and add it to the mean.
Suttles' 1926 MLE average, regressed: .296 + .089 = .385
He still has the highest average in the majors that year, but by a small rather than a huge margin.
Have I calculated this correctly, given our assumptions about what the mean is and what the conversion factor is?
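The arithmetic in the worked example above can be checked with a short script. This is only a sketch of the calculation as described in this post (deviation from the mean multiplied by the square root of the at-bat ratio); the function name and the 500 at-bat default are illustrative, and the .87 factor and .340 mean are the post's working assumptions, not settled values.

```python
import math

def regress_average(avg, mean, at_bats, full_season_ab=500):
    """Shrink a small-sample average toward a mean, as described in the post.

    The deviation from the mean is multiplied by sqrt(at_bats / full_season_ab).
    """
    deviation = avg - mean
    ratio = math.sqrt(at_bats / full_season_ab)
    return mean + deviation * ratio

conversion = 0.87                          # working conversion factor from the post
mle_mean = round(0.340 * conversion, 3)    # career average converted: .296
mle_1926 = round(0.498 * conversion, 3)    # 1926 average converted: .433

regressed = regress_average(mle_1926, mle_mean, at_bats=212)
print(round(regressed, 3))                 # 0.385, matching the figure above
```

Changing `full_season_ab` to 550, or swapping in a different mean, shows how sensitive the regressed figure is to those two choices.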
Now, this conversion leaves out two important factors in the overall conversion calculation:
1) difference in league offensive levels
2) park effects
The proper place for these in the formula needs to be ascertained, and I think that placement will make a difference in the results. I think that park factors should be applied first, before conversion and before regression, and that difference in league offensive levels should be applied at the same time as the conversion factor.
Does that seem right?
Comments on that?
As to the number of at bats to which to regress, I'd welcome suggestions. One possibility would be to use the number from the rolling seasonal averages, as long as those were totals that might occur in a full major-league season? Or would it be better to set a derived norm based on typical at bats per game and typical games per season, and stick with that?
Comments?
I'd recommend regressing to what we consider to be a 'normal' set of at bats for a season. Maybe 550?
1) changes in competition levels in the Negro Leagues. I don't think we're close to having enough evidence to calculate these statistically, but I think there's enough evidence to indicate that levels changed. This is an area that needs to be handled subjectively for now.
2) normal paths of improvement and decline for majorleague players. Since all the conversions are based on play happening in sequence in players' careers, the progression of careers must be influencing the conversion. I've tried to exclude instances where obvious improvement and decline are influencing the data, but there may be subtler influences even in the most level pieces of evidence available. Comparative data on this subject could also help to improve the conversions by enabling us to correct for its influence on the data I have used and to enlarge the set of data available by making it possible to use players during their periods of significant improvement and decline.
Nice job putting post 46 into English, Chris! Couldn't have said it better if you gave me a week
Thank you! I'm a professional with words, so that part comes easy to me. It's the numbers that I struggle to get straight!
Mule Suttles hit .498 in 1926 in 212 at bats. His career average was .340. [For discussion of regression, suppose] we accept .340 as the mean and 500 as the normal number of actual at bats in a season.
The 212ab sample from his 1926 season is a small part of our sample from his career, which is more than 1000ab (equivalently, he usually batted above .300). So I am comfortable with .340 as a talking point.
#46:
Now, this conversion leaves out two important factors in the overall conversion calculation:
1) difference in league offensive levels
2) park effects
The proper place for these in formula needs to be ascertained . . .
#51
Two additions to the list of issues for conversions from post 46 above:
[3] changes in competition levels in the Negro Leagues. . . .
[4] normal paths of improvement and decline for majorleague players.
Rather,
[4'] Player's own path of improvement and decline, where the normal path for mlb players will be used in default of useful information about Player.
Otherwise, yes.
These issues pertain separately to the conversion of his .498 part-season average and the conversion of his .340 part-career average. For example, you want to season-park-adjust his .498 and career-park-adjust his .340.

Regarding [4] and [4'], iiuc.
Given ample data or a significant amount of data for adjoining seasons, you will use some (maybe weighted) 3yr or 5yr average of season records rather than his career average, and there will be no issue of improvement and decline.
I meant [3] and [4] as issues to address in order to improve the accuracy of the general conversion factor, though I take your point that career path improvement and decline (not to mention radical changes in offense levels such as the one that took place between the late teens and the early twenties) does make it problematic simply to use career average as a baseline.
Given ample data or a significant amount of data for adjoining seasons, you will use some (maybe weighted) 3yr or 5yr average of season records rather than his career average, and there will be no issue of improvement and decline.
Yes, and park adjustments to the baseline will be easier to make. You imply above that 1000 ab is a sample size that gives you confidence in a baseline. At about what point does sample size become too small for confidence as a baseline?
Following Tom H's explanation of standard deviation above, I calculate that in a 1000 at bat sample, we can be 95% certain that a player's true average is within 30 points (2 SD) of the average generated in that sample. In a 500 at bat sample, 2 SD is 40 points. In a 2000 at bat sample, 20 points.
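Those 2 SD figures can be roughly reproduced from the binomial standard deviation of a batting average, sqrt(p(1-p)/n). The sketch below assumes a true average of .300, which is a guess at the value behind the figures above (Tom H's original explanation is not quoted here); the function name is illustrative.

```python
import math

def two_sd_points(at_bats, true_avg=0.300):
    """Two binomial standard deviations of a batting average, in points (.001)."""
    sd = math.sqrt(true_avg * (1 - true_avg) / at_bats)
    return 2 * sd * 1000

for ab in (500, 1000, 2000):
    print(ab, round(two_sd_points(ab)))
```

With a .300 hitter this gives about 41, 29, and 20 points, in line with the rounded 40/30/20 above. Quadrupling the at bats halves the band, which is the square-root relation used in the regression discussion.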
OK, to your point.
Regarding the 1000 ab, I chose 1000 because of its size relative to 212. I am not comfortable with regression of partseason to career average where the partseason is a large part of the sample for measurement of career average. Should I be comfortable?

Ron Wargo, 1944 Ballot Discussion #94
Negro League infielders seem to have difficulty in our rankings. Are we being too harsh? While outfielders like Torriente and Hill breeze into the HOM, only Lloyd can claim that distinction for infielders so far. Johnson & Grant took some time, although Johnson made it relatively quickly. [anticipations deleted]
This is a good observation, potentially as important as the epochal bias in HOF inductions from the Negro Leagues, against those who retired before Buck O'Neil played (or followed?) the game. The infield positions {3B, SS, 2B} are underrepresented among the great Negro ballplayers by reputation, inasmuch as I know the reputations. Only John Henry Lloyd is sometimes called "maybe the best" or generally included in the top ten, I think. Lloyd's age, older than 8 or 9 of the ten, underscores his case.
Does it indicate bias? I don't know. If so, it may be a common bias. Only Honus Wagner is routinely considered one of the top ten MLB players, or sometimes called maybe the best, and only George Wright is routinely considered one of the top ten 19c players, or sometimes called maybe the best. Each is the Shortstop from the Dawn of Time, older than any of the other players commonly considered one of the top ten.
Anyway, it so happens that I have his stats for 1921 and 1928, along with league and park data. So here’s how Herrera’s Negro League and major league careers compare. I have to say the results were a little surprising:
Ramon Herrera, Raw Averages
Year      Age  Team   G    PA   AVE   OBA   SLG
1921      23   CSW    68   307  .234  .304  .305
1928      30   CSE    29   133  .317  .341  .397
NeL total             97   440  .261  .316  .334
AL totals, at ages 27-28:
1925/26               84   308  .275  .320  .333
I don’t have stats for Herrera’s 1920 NeL season, but Holway has him hitting .259, which seems to indicate it probably wouldn’t change his career NeL stats very much.
League Context (park-adjusted)*
Year      League  AVE      OBA      SLG
1921      NNL     .268668  .329507  .357343
1928      ECL     .281519  .333020  .383140
NeL total         .272767  .330585  .365572 (prorated)
1925/26   AL      .288000  .359000  .404000**
Dividing Herrera’s percentages by these adjusted, prorated league contexts, you get these relative averages:
      AVE      OBA      SLG
NeL   .955976  .956362  .914122
AL    .954861  .891364  .824257
Divide his major league relative averages by the corresponding NeL figures, and you get these conversion factors:
AVE      OBA      SLG
.998834  .932036  .901693
In other words, Herrera’s Negro League and major league batting averages were almost the same, relative to his league and park, but he walked less and hit for less power in the American League.
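For anyone who wants to rerun the Herrera numbers, here is a minimal sketch of the two division steps using the figures posted above. The dictionary and function names are made up for illustration. Because the raw averages are rounded to three places here, the results differ from the posted factors in the third decimal.

```python
# Herrera's raw averages and park-adjusted league contexts, as posted above.
herrera_nel = {"AVE": 0.261, "OBA": 0.316, "SLG": 0.334}
league_nel = {"AVE": 0.272767, "OBA": 0.330585, "SLG": 0.365572}
herrera_al = {"AVE": 0.275, "OBA": 0.320, "SLG": 0.333}
league_al = {"AVE": 0.288, "OBA": 0.359, "SLG": 0.404}

def relative(player, league):
    """Player's rate divided by the league context, stat by stat."""
    return {stat: player[stat] / league[stat] for stat in player}

rel_nel = relative(herrera_nel, league_nel)
rel_al = relative(herrera_al, league_al)

# Conversion factor: major league relative average over NeL relative average.
conversion = {stat: rel_al[stat] / rel_nel[stat] for stat in rel_al}

for stat, factor in conversion.items():
    print(stat, round(factor, 3))
```

Swapping in the 1928 ECL line alone for `herrera_nel` and `league_nel` gives the single-season version of the same comparison.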
There are obvious caveats:
1) This is very limited, comparing 440 NeL plate appearances to 308 AL plate appearances.
2) His NeL numbers are weighted toward the 1921 season, which is four years before his major league appearance. He hit significantly better in ’28, so he may simply have been a better hitter in the mid-to-late 20s, although the ’28 sample is only 29 games.
3) His NeL numbers are also divided between the ’21 NNL and ’28 ECL, which were very different leagues. I don’t really know how they stack up against each other, quality-wise; offhand, I’d say that the ’21 NNL might have been better, simply because it played the season through; the ’28 ECL disintegrated in late May, though most of the teams continued to play each other into October. (NOTE: All of the Cuban Stars’ 1928 games were against teams that were in the ECL in either ’27 or ’28: Brooklyn, Hilldale, Lincoln Giants, Baltimore, Bacharach Giants, along with the Homestead Grays, who were clearly of league quality.)
I would hardly suggest that these conversion rates are accurate for all players at all times. Still, I thought it would be good to get this out there as a data point, especially since almost nobody knows about Herrera.
*I found BA/OBA/SLG park factors for Redland Field in the 1921 NNL, and adjusted Herrera’s league context, prorated to the number of plate appearances (for OBA) and at bats (for AVE and SLG) he had at home and away. He played in 28 games at home, 40 on the road. The raw park factor (runs) for Redland Field that year was 110; but the averages show a much milder effect; and, as in the majors, the park cut home runs significantly. There are some technical steps I skipped (such as accounting for Redland Field not being among the Cubans’ road parks), as the effects are pretty small and I was reaching the point of diminishing returns. In 1928, the Cuban Stars (E) were a road team, so I didn’t bother with park effects (again, there could be an effect depending on which road parks they played in, but I doubt it was very large).
The Redland Field factors for 1921, if anyone's interested:
AVE: 1.018774
OBA: 1.012775
SLG: 0.948110
HR: 0.441514
**These are baseball-reference.com’s park-adjusted league numbers; as I understand it, they’re prorated for Herrera’s plate appearances, so they give the park-adjusted league context.
All of these data are very useful, and muchly appreciated, but let me offer one caveat.
Some people (including Bill James) have pointed out that in some circumstances, the drive to succeed is an important factor. This is anecdotally seen in many feel-good stories in all sorts of sports (and non-sporting events in life), and might be reflected in things like black-v-white exhibition games, and possibly in initial conversions to MLB. If some players see making the majors or beating the white players or whatever as a do-or-die item, it can surely affect their performance. The 'cornered rat' or 'mother bear' theory if you will. With fewer vocational options, kids from difficult backgrounds have exceeded normal expectations of 'making it' in sports. And it's plausible to me that those who crossed over from NeL to MLB may have been driven to succeed in ways some of us can only imagine.
AVE       OBA       SLG
0.846757  0.870737  0.795831
Of course, this represents 308 AL plate appearances compared to only 133 ECL plate appearances.
I think your point is valid and important regarding some of the blackwhite exhibition games. Reading The Pride of Havana, it's clear that while some major league managers (such as McGraw) took their exhibition games in Cuba very seriously, in other cases the major league teams treated it as a holiday, and sometimes key players wouldn't show, players would show up drunk, and so forth.
When it came to making the majors when integration arrived, however, it seems to me that both the white and black players must have had extra motivation. The white players' jobs were on the line, plus they didn't want to be showed up. As you describe, the early black players also must have had an intense desire to succeed. I think psychological factors must have worked both directions.
I'm not sure how to produce an equivalent adjustment for the Negro Leagues, as there were more multi-position, "double-duty" players, so it would be harder to figure out what to subtract. Negro League pitchers probably hit better relative to league than their white counterparts, but it was still the weakest-hitting position, so taking pitchers' hitting out of the league averages would cause Herrera's league context to go up, and he would look worse as a hitter in the Negro Leagues.
In other words, it appears that the Negro League MLE conversion factors I presented above should be even higher, if I can figure out how to adjust for this. Of course, all the other caveats (sample size, etc.) remain.
Is it possible that this effect is created not by removing pitchers but by adjustments for pitching quality? That is, does bbref take the Boston pitchers out of the offensive context for the Boston hitters? I'm pretty sure, actually that bbref does this. Would it account for the effect that Gary has observed, or not?

Quoting the BBRef Glossary: Adjusted OPS+
[OPS+] is calculated differently from the Total Baseball PRO+ statistic. I chose OPS+ to make this difference more clear.
. . .
My method
1. Compute the runs created for the league with pitchers removed . . .
Note, TB7 adopted 'OPS+' in place of 'PRO+' used in TB3-TB6, so 'OPS+' no longer suggests any difference between the statistics.
FWIW, I agree with Sean Forman (BBRef) about the calculation of season OPS+ but disagree about career OPS+. A few years ago, I reported the BBRef career calculation as a mistake, re: George Davis.

FWIW, the BBRef adjustment of ERA seems to be routine.
: AL1925 lgERA 4.40
: Boston lgERA* 4.52
Accounting for the ERA roundoff to #.##, that is consistent with any routine
: adjustment factor in interval [1.02497, 1.02958].
In turn, that fits the reported
: Boston PPF 103.
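The interval quoted above follows from the two-decimal rounding of both ERAs. A small sketch of that bound calculation, with an illustrative helper name; it assumes each published ERA could be anywhere within half a unit in the last place of its printed value.

```python
def rounding_interval(value, decimals=2):
    """Range of true values that would round to `value` at the given precision."""
    half = 0.5 * 10 ** -decimals
    return value - half, value + half

lg_lo, lg_hi = rounding_interval(4.40)     # AL 1925 lgERA
bos_lo, bos_hi = rounding_interval(4.52)   # Boston lgERA*

# Smallest and largest adjustment factors consistent with the rounded ERAs.
factor_lo = bos_lo / lg_hi
factor_hi = bos_hi / lg_lo

print(round(factor_lo, 5), round(factor_hi, 5))   # 1.02497 1.02958
```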
Notes:
1913-14 – No American players in league.
1913-16 seasons played at Old Almendares Park.
1917 – Record shown is for an alternative league that was organized (and displaced the regular league that year); games played at Oriental Park. No American players in league.
1918-30 seasons – most games played at New Almendares Park.
1919-20 – Only one American player in league.
1921 – Season lasted only 5 games. Tied for league lead in triples (1).
1924* – Special season. The regular season was terminated early when Santa Clara reached an 11.5 game lead. The weakest team (Marianao) was dropped, its players were redistributed, and a 25-game special season was played.
1926-27 – A rival league was formed (“Triangular”) that raided many of the better players. Herrera led Cuban League in runs scored (24).
I'd like to figure out what bbref is up to, just to make sure I get Herrera right. Here's what they have:
year  AL ave  adj ave*  Boston BPF
1925  .292    .300      100
1926  .281    .286      97
*Park adjusted league average for Herrera/Boston
He gives his first step in figuring Adjusted OPS+ as "Compute the runs created for the league with pitchers removed..."
1917 – Record shown is for an alternative league that was organized (and displaced the regular league that year); games played at Oriental Park. No American players in league.
1918-30 seasons – most games played at New Almendares Park.
1919-20 – Only one American player in league.
1921 – Season lasted only 5 games. Tied for league lead in triples (1).
Note that five MLB seasons were played and nearly five calendar years passed between the '1917' and '1921' seasons in Cuba. I have two suggestions, one for humans who want line-sortability by computer.
1917w (for winter)
1918-19
1919-20
1920-21
1921f (for fall)
The other is for inside a database and it amounts to the following in chron order for the five given Cuban seasons and five interpolated MLB seasons.
1917 w
1917
1918
1919 o
1919
1920 o
1920
1921 o
1921
1922 f
This will be useful for any human using a spreadsheet who needs to subtract "years" within the Cuban league, preserving familiar properties.
Eg, '1917 w' to '1922 f' is a six-year span, 6 = 1922 - 1917 + 1
When creating MLEs for pitchers, should ERA and ERA+ be converted using the same multiple that one uses for EQA with hitters?
I've been assuming that the same multiple should be used, but I'd like those with better knowledge to confirm that for me.
That appeared to be the conclusion reached in the discussion on the Wes Ferrell thread concerning how to work out the impact of pitchers' OPS+ on ERA+, but I'm not entirely clear on that.
Is that correct, or no?
My conversions actually track to batting average and slugging, using different rates (which should and eventually will be related by a consistent, theoretically justified ratio), not EQA.
So let me repose the question more concretely:
The evidence, so far, suggests a .87 ba/.82 sa conversion ratio. How ought the ratio for ERA+ be linked to these?
If you want to have a ba/sa pair that fits the square root formula, go with .90/.82.
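As a quick check on that pairing, here is a sketch under the assumption that "the square root formula" means the batting-average factor should be about the square root of the slugging factor. That reading is consistent with the .90/.82 pair but is not spelled out in this part of the thread, and the function name is illustrative.

```python
import math

def ba_factor_from_sa(sa_factor):
    """Batting-average factor implied by a slugging factor (assumed square-root relation)."""
    return math.sqrt(sa_factor)

print(round(ba_factor_from_sa(0.82), 3))   # 0.906, close to the suggested .90
print(round(0.87 ** 2, 3))                 # 0.757, the slugging factor a .87 ba factor would imply
```

On this reading, the .87/.82 split is inconsistent in either direction: .82 slugging implies roughly .90 batting, while .87 batting implies roughly .76 slugging.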
I'm aware that gadfly's data may lead to a reconsideration of these ratios, but for now getting the principle of how to set up pitcher conversions in relation to hitter conversions will be enough to make some progress possible.
In post # 184 EricC wrote:
Because of selection bias, we arrive at the incorrect conclusion that league A is stronger.
and Brent responded:
I'm sorry to keep coming back to this argument, but now I see another flaw in it....
Understanding selection bias is such an important part of doing proper MLE's, thought I'd try to explain with an example.
Let's say we have League I and League II, and we 'know' League II is a stronger league, with League I having a TRUE strength at around 90% of the strength of League II.
Let's assume that League I has two types of players, with half of all League I players being Type A, and the other 50% being Type B, and that this split holds for all talent levels (super-duper-stars, all-stars, very good players, average starting players, etc.) Type A players that move to League II are able to retain 95% of their value, and Type B players that move to League II retain 85% of their value (averaging back to 90% overall). The problem is that more Type A players will successfully transition to League II than Type B as detailed below, which will skew comparison results.
As players are actually selected to join League II from League I, all of the super-duper-stars, both Type A and Type B, will be selected (50% will be Type A, 50% Type B)
However, when we get down to very good players, Type A players will continue to play well enough in League II to be selected, but some of the Type B players will not retain enough value to either be selected or hold a job in League II.
As you get down to average starting players, some of the Type A players will be selected into League II, but almost no Type B players will, etc.
So, the mix of players who played in both leagues will NOT be 50% Type A and 50% Type B, but might be something like 75% Type A and 25% Type B.
At this point, if we were to develop MLE's based on the performance of players moving from League I to League II, it would APPEAR that the correct conversion factor would be about .93 (.75 x 95% + .25 x 85% = .925) instead of the REAL factor of .90!
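The selection-bias example reduces to a weighted average, which a few lines make explicit. This is a sketch using only the hypothetical shares and retention rates from the example; the function name is illustrative.

```python
def observed_factor(share_a, retain_a=0.95, retain_b=0.85):
    """Average retention among players who moved, given the Type A share of movers."""
    return share_a * retain_a + (1 - share_a) * retain_b

true_factor = observed_factor(0.50)     # whole League I population: 50/50 mix
biased_factor = observed_factor(0.75)   # players actually selected: 75/25 mix

print(round(true_factor, 3), round(biased_factor, 3))   # 0.9 0.925
```

The more the selected group tilts toward Type A, the further the measured factor drifts above the true .90.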
A clear explanation of the theoretical problem, but should this be of practical concern to us?
Can/should we do anything other than recognize that the conversion factor is an average, and that this average might be adjusted up or down for individual players, depending on the extent to which we believe their skills sets would have fit the majorleague game, _if_ we believe that they should be evaluated according to that standard and not on their merits within the NeL context?
I certainly think there's an argument to be made that we SHOULD be evaluating the Negro League players primarily on their merits within the NeL context.
However, on "Can/should we do anything other than recognize that the conversion factor is an average...", the methodology being discussed above will be using the WRONG average as the starting point, which I believe does have some practical implications in player evaluation.
1) Regression to the mean or the Don Padgett Problem:
(From 1937-48, Padgett was a .288 hitting catcher-outfielder in the National League. In 1939, Don whacked .399 in 233 at bats. Very obviously, he was not the second coming of Ted Williams.)
If you ask me, you have to adjust first for park and league effects. If you don't do this, you will be mixing different park and league effects from different seasons together and the results will simply be a mess.
Mule Suttles is a pretty good example of this. If the Mule had played in the Majors of his time, I seriously doubt that he would have ever led the Majors in BA (maybe if he was in the Baker Bowl, having a great year and with the stars aligned just right).
In 1926, Suttles played for St. Louis in an alltime great hitters park that inflated statistics about like Colorado does presently. In fact, the park pretty much made Suttles unpitchable.
Suttles was not a pull hitter, he liked his pitches out over the plate. With a 250 foot left field wall in St. Louis, it was suicide to pitch Suttles in because he could just muscle it to, or over, that wall. So the pitchers had to put it over the plate just where he liked it.
There was no concurrent park in the Majors that inflated offense like this park.
However, in 1925, Suttles played in Rickwood for Birmingham, one of the greatest pitcher's parks in history. If you do not adjust Suttles for the parks and leagues first, you will get an average that mixes and matches these two very odd parks together.
Of course, I realize that, with much of this statistical info being unavailable, this is easier said than done.
However, once you have adjusted as best you can for park and league, then you have to regress to the mean of the player's current talent level. The idea of regressing a player to the mean of the League is obviously worthless and regressing to the player's career average is better but still not right.
This, of course, is the really interesting question: "How many at bats are necessary for skill and luck to even out and give a true representation of a player's skill?"
My personal opinion is that 500 at bats is good but not really enough to be totally certain (as evidenced by the number of Norm Cash or Brady Anderson type fluke seasons in the Majors), but that 1000-1500 at bats is better.
In other words, a Negro League player should be regressed to his average over the nearest 1000 or so at bats, at the least. For example, if John Beckwith hits .450 in 200 at bats and .350 over the nearest 1000 at bats, his average should regress to .386 with a .450 in 200 at bats and then .350 in the next 350 at bats.
(And, as someone pointed out, this works in the reverse: if .250 over 200, then regress to .314.)
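The Beckwith example amounts to padding the short season out to a full one with at bats at the baseline rate. A minimal sketch, assuming a 550 at-bat full season (implied by 200 + 350 in the example); the function and parameter names are illustrative.

```python
def pad_to_full_season(avg, at_bats, baseline_avg, full_season_ab=550):
    """Fill out a short season with at bats hit at the baseline average."""
    hits = avg * at_bats
    extra_ab = max(full_season_ab - at_bats, 0)
    return (hits + baseline_avg * extra_ab) / (at_bats + extra_ab)

print(round(pad_to_full_season(0.450, 200, 0.350), 3))   # 0.386
print(round(pad_to_full_season(0.250, 200, 0.350), 3))   # 0.314
```

Unlike the square-root method discussed earlier in the thread, the pull toward the baseline here is linear in the number of missing at bats.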
I think Chris Cobb has the right solution with a rolling five year plan, though I would amend it to being simply the closest 1000 or 1500 at bats
For Example:
1940: 245 at bats
1941: 302 at bats
1942: 205 at bats
1943: 256 at bats
1944: 251 at bats
normalized for 1942 would be 205 + 302 + 256 + 237/496 (1940 and 1944 together).
I've been doing this for years and it seems to work just fine with, as was pointed out, the caveat that adjustments need to be made at the beginning and end of the player's career.
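One way to read the 1942 example is as a window that grows outward one season at a time on each side and takes a prorated share of the outermost pair once the target is reached. The sketch below follows that reading; the function name, the symmetric expansion, and the proportional split of the last pair are assumptions for illustration, not a statement of the exact procedure used.

```python
def window_at_bats(season_ab, center, target=1000):
    """At bats used from each season to build a `target`-AB window around `center`.

    Returns {season_index: at_bats_used}.
    """
    used = {center: min(season_ab[center], target)}
    remaining = target - used[center]
    distance = 1
    while remaining > 0:
        ring = [i for i in (center - distance, center + distance)
                if 0 <= i < len(season_ab)]
        if not ring:
            break  # career too short to fill the window
        ring_total = sum(season_ab[i] for i in ring)
        if ring_total <= remaining:
            for i in ring:
                used[i] = season_ab[i]
            remaining -= ring_total
        else:
            # Take the same fraction of each season in the outermost ring.
            share = remaining / ring_total
            for i in ring:
                used[i] = season_ab[i] * share
            remaining = 0
        distance += 1
    return used

at_bats_1940_44 = [245, 302, 205, 256, 251]
used = window_at_bats(at_bats_1940_44, center=2)   # index 2 is 1942
print({1940 + i: round(ab) for i, ab in sorted(used.items())})
```

For 1942 this uses all of 1941-1943 (763 at bats) plus 237 of the 496 at bats from 1940 and 1944 together. At the start or end of a career the ring has only one side, which is where the adjustments mentioned above come in.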
When I get some more time, I'll put up two more posts on:
2) Brent's interesting posts on Buzz Arlett; and
3) KJOK's interesting posts on Ramon Herrera.
Two other random thoughts:
1) As I stated before, I don't think that the conversion factor from the Majors to the Negro Leagues would deviate much due to differences in talent distribution.
I think that the distributions of talent between both Leagues were probably very very similar. I think that the context is pretty much the same with the real difference simply being the talent level.
In other words, there is a conversion constant (for each year) between the Majors and Negro Leagues, and it is important to know this to adequately judge how the Negro Leaguers are rated. And this rating should be by how they would have performed if they were able to play in the Majors (individually, not all at once, so as not to disrupt League levels).
2) I think it's funny that, here in the Hall of Merit, the Negro Leaguers are still being badly discriminated against. For example, Dick Lundy and Frankie Frisch are very very similar players; but Frisch is in at #2 in the 1944 ballot and Lundy finished #29.
Basically, Lundy is pretty much the same player as Frisch, possibly slower, but with more power and better defense (Lundy was a shortstop, Frisch a second baseman).
This is why the true yeartoyear conversion rates are needed.
In 1923 he hit .354 in 148 games. 204 hits in 577 at bats. He had 31 doubles, 7 triples and 15 homers. I will try and give you some other names to gauge on.
Wade LeFler led the EL with a .369 average in 1923. Elmer Bowman hit .366, George Fisher .365, Herrera .354 and Si Rosenthal .338.
20 players in the league scored 100 or more runs. Walter Simpson had 131, R. Emmerich had 123 and Herrera was next with 122.
Bowman led the league with 211 hits, Ted Hauk was next with 204. Then came Herrera (204), Emmerich (202) and then John Donahue and A. H. Schinkle with 201.
In 1924 Herrera hit .303 in 152 games. He had 191 hits in 631 at bats with 34 doubles, 5 triples and 8 homers. He scored 114 runs.
Other recognizable major league names (with more than 100 hits) in the Eastern League in 1924 were Wade Lefler (not so recognizable) at .370, Lou Gehrig at .369, Earl Webb at .343 and Clyde Milan at .316.
In 1927 Herrera hit .243 in 94 games. He had 90 hits in 371 at bats, scored 41 runs and knocked in 25. He had 12 doubles, 0 triples and 2 homers.
I'm sure you are interested. I did find all five of the games played between the Lincoln Giants and the Philly Colored Giants in the late summer of 1928. The games were played (one in each) in Worcester and Brockton, Massachusetts, and then the final three in New Bedford. The Philly Giants won the first three and the Lincoln Giants took the final doubleheader.
I now have about 60 full boxscores on Bill Jackman in the 1925-1930 period. I have him 5-0 in five starts against major league pitching opponents in 1925 and 1926. I also have him working in relief (no decision) in three more games against major league hurlers those same two seasons. I haven't even scratched the surface.
You probably already have these, but here are two other leads for Jackman.
1) Jackman, Burlin White, and company are discussed a little (3 or 4 pages) in the book 'Even the Babe Came to Play' by Robert Ashe (1991). The section about Jackman talking to batters and driving them nuts is pretty funny even with the racist undertone that Ashe gives it.
The book also states that Jackman went a reported 48-4 in 1927 with 2 no-hitters.
2) Jackman gave an interview in the Jan. 17, 1947, Boston Traveler newspaper. I know you stated in the Rogan thread that you had a 1947 Jackman interview, but I figured that I'd post this just in case it is not this one.
I would be very interested in knowing how Jackman did against the Lincoln Giants in those 5 games. I have always wanted to find proof of Jackman's greatness and the combined totals, posted in the Rogan thread, of the 8/30 game from that 5 game series and the 9/23, 9/30, and 10/7 games in New York are the best evidence I've ever seen.
Jackman, in four games against an elite Negro League team, went 1-3, gave up 33 hits, 16 runs, 11 walks, while striking out 28 in 33 and a third innings with his team playing very poorly behind him. Taken in context, this suggests that his reputation was deserved.
How did he do in the other 4 games of the series?
Ramon (Mike) Herrera 1920-1929:
1920 Linares’ Cuban Stars of Havana
1921 Linares’ Cuban Stars of Havana
1922 Springfield Ponies (Eastern League A)
1923 Springfield Ponies (Eastern League A)
1924 Springfield Ponies (Eastern League A)
1925 Springfield Ponies (Eastern League A)
1925 Boston Red Sox (American League), last month
1926 Boston Red Sox (American League)
1927 Mobile Bears (Southern Association A)
1927 Springfield Ponies (Eastern League A)
1928 Pompez’ Cuban Stars of New York
1929 Pueblo Steelworkers (Western League A)
Of course, during the 1920s, the current TripleA Leagues were AA Leagues, one step below the Majors. Herrera spent his decade mostly in A ball (currently AA) missing only the Texas League, two steps below the Majors.
The first is by Jerry Nason and he compares Jackman to Paige. Mention is made of Jackman playing for East Douglas in the early 30s as a teammate of Hank Greenberg (I think it was 1929) and Jackman averaging 16 strikeouts a game with a 10 dollar bonus for each strikeout, on top of his $175 base per game. Seems very reasonable from other sources I have put together.
Then there is a story about Jackman tossing a 5 inning 3-2 win in the Boston Park League in July. He fanned 7 and was listed at 54 years old; agreeing with the early version of his birth year (1894 vs 1897).
I have "Even the Babe Came to Play" though I hadn't made note of the Jackman story. I have found many similar stories. He was almost as big a draw as the third base coach as he was on the mound.
I have the 47-2 record from, I think, a 1929 paper but I do not know how much stock I put into it.
So far, and against all comers, I have:
1925: 9-1 record with 46 hits allowed in 78 innings. 19 walks and 60 K's (BB and K missing from one of those games). He allowed 20 runs and among his victories were games over ex-major league hurlers Buck O'Brien (twice), King Bader and Earl Hanson.
1926: 7-1 record with 42 hits allowed in 75 innings. 21 walks and 71 K's. (hits, BB and K missing from one game). Victory over then future major leaguer Haskell Billings.
1927: 2-1 record. 20 hits in 36 innings. 8 BB and 41 K's (BB missing from one game).
1928: 8-7 record. 111 K's and 41 BB's in 123 innings (missing some IP, BB, K and hits). This record includes the Lincoln series in New York (data from Gary A.)
In the 1928 Massachusetts series with the Lincoln Giants, he:
1. Tossed a 7-inning 3-hit shutout that he won 2-0 while fanning 9.
2. Lost a 12-3 CG in the final. Allowed 16 hits (5 walks and 5 K's) in 9 innings. This is the worst of the 60+ games I have.
In addition to the fall games with the Lincolns in NY, I did find mention of an early spring 1928 game in which he locked up in a 1-1 duel with Nip Winters. Jackman reportedly allowed but one hit and Winters two. It's undocumented beyond that.
Lots more groundwork to be done on this one. I'm on a five year plan. Only a few months into it though so I am very happy with the progress.
Da Kommrade, even Trotsky say, we should all be on five-year plan!
Sorry, Jonesy, I get all nostalgic for my Marxist college days whenever I see the words five, year, and plan falling in sequence.
; )
In other words, a Negro League player should be regressed to his average over the nearest 1000 or so at bats, at the least. For example, if John Beckwith hits .450 in 200 at bats and .350 over the nearest 1000 at bats, his average should regress to .386 with a .450 in 200 at bats and then .350 in the next 350 at bats.
. . .
I think Chris Cobb has the right solution with a rolling five-year plan, though I would amend it to being simply the closest 1000 or 1500 at bats.
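The arithmetic in the Beckwith example above can be checked with a short sketch. It assumes the regression is a plain at-bat-weighted blend: the recorded at bats count at the recorded average, and the remaining at bats count at the average over the nearest 1000 or so at bats. The function name is hypothetical, and the 350 "fill" at bats are simply the number used in the example, not something derived here.

```python
def blend_toward_mean(actual_avg, actual_ab, mean_avg, fill_ab):
    """At-bat-weighted blend of a recorded average and a longer-run mean.

    fill_ab is the number of unrecorded at bats assumed to be hit at mean_avg
    (an assumption taken from the example, not a derived quantity).
    """
    hits = actual_avg * actual_ab + mean_avg * fill_ab
    return hits / (actual_ab + fill_ab)


# Beckwith example from the post: .450 in 200 AB, .350 over the nearest
# 1000 AB, with 350 further at bats filled in at .350.
regressed = blend_toward_mean(0.450, 200, 0.350, 350)
print(f"{regressed:.3f}")  # 0.386
```

The more recorded at bats a season has relative to the fill, the less the season moves toward the longer-run average.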
I agree that a fixed number of ABs is attractive.
1000 is a nice round number.
How long is the 1000-AB interval for candidate-quality players at different times in NeL history? In other words, how often do we have 1000 ABs within the five-year moving interval? within seven years? in the 1920s? in the 1930s?
Whether 1000 is a reasonable number depends on the answers to such questions, as well as on the pure statistical questions about regression.
In 35 American Giants' road games, the averages (for both teams) were .271/.341/.387, with 43 home runs in 2575 plate appearances.
In 37 American Giants' home games, the averages were .206/.274/.255, with 6 home runs in 2541 plate appearances.
That makes for these park factors:
BA: .760764
OBA: .802666
SLG: .660354
HR: .141402
Wow.
From 1920 to about 1930, I'd say that we get 1000 at bats over 5-year intervals, typically, in the NNL and 7-year intervals in the ECL. In the 1930s recorded at bats drop way down for most teams, with 7 or more seasons needed to garner 1000 at bats. From the 1940s, numbers are somewhat higher again, with 7 years probably being about the norm to reach 1000 ab.
My plan is to make seven years the limit for establishing a mean to which to regress: I could limit that to 5 years, if that were judged preferable.
This weekend I'll be doing new regressions for Beckwith, Lundy, and Moore, I hope, so we can address this issue with reference to specific cases, if we wish.
I use Gary A's park effect data only to determine a general direction and magnitude for park adjustments. For an extreme pitchers' park, I'll adjust upwards by 5 to 10%. For extreme hitters' parks, the reverse. In most cases, the adjustment will be 0 to 2%. When I present my data, I'll include the park factors that I've used.
I can't constantly include caveats and qualifiers in every single post about how "this is just one season," or, "remember, this is only 300 plate appearances," or whatever. I assume everybody here knows that, and I also assume that even small pieces of information I come up with can be useful.
I know the feeling :/ .
I at least am tracking the data you are providing and am considering each piece of data in relation to the rest.
I look forward with great interest to 1916 data.
In 35 American Giants' road games, the averages (for both teams) were .271/.341/.387, with 43 home runs in 2575 plate appearances.
In 37 American Giants' home games, the averages were .206/.274/.255, with 6 home runs in 2541 plate appearances.
PA/game/team = 34.3
That makes for these park factors:
BA: .760764
OBA: .802666
SLG: .660354
HR: .141402
Each number is the single-season home/away ratio.
E.g., for batting average, .760 = .206/.271
Not quite. For illustration, suppose the league comprises 8 ballparks and the schedule is balanced. Then each ballpark's single-season park factor, simple version, is its home/away ratio r inflated by 8/(7+r). (That is the home/away ratio regressed approximately 1/8 of the distance toward 1.) This step establishes a league average of 1 rather than (7+r)/8.
In the particular example, the ratio .76 for one park interpreted as a park factor implies a league-average park factor of .97 = 7.76/8. Inflation by 100/97 = 8/7.76 establishes Schorling at .784 and the average of the seven other parks at 1.031.
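A minimal sketch of the two steps just described: the raw home/away ratio, then the rescaling by 8/(7+r). It assumes an 8-park league with a balanced schedule, as in the illustration, and the function names are hypothetical.

```python
def home_road_ratio(home_rate, road_rate):
    """Raw single-season home/away ratio for any rate stat."""
    return home_rate / road_rate


def simple_park_factor(ratio, n_parks=8):
    """Rescale a home/away ratio so that the league-average factor is 1.

    Assumes a balanced schedule among n_parks parks (the illustration's
    assumption, not a claim about the actual league schedule).
    """
    return ratio * n_parks / ((n_parks - 1) + ratio)


def other_parks_average(ratio, n_parks=8):
    """Implied average factor of the remaining parks after rescaling."""
    return n_parks / ((n_parks - 1) + ratio)


# Home runs per plate appearance, home vs. road, from the figures quoted above.
hr_ratio = home_road_ratio(6 / 2541, 43 / 2575)
print(f"{hr_ratio:.3f}")                   # 0.141

# Batting-average ratio of .76, as in the worked example.
print(f"{simple_park_factor(0.76):.3f}")   # 0.784
print(f"{other_parks_average(0.76):.3f}")  # 1.031
```

The one rescaled park and the seven others average out to exactly 1, which is the point of the adjustment.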
Found a great article on Jackman today that said from 1925 through mid-August of 1927, his record was 81-9 for the Philly Giants.
Picked up six more games for him today, all wins.
Also picked up what looks like a very reliable rundown of his teams from 1920 to his joining the Giants in 1925.
Because accounting for all this would be very complicated and I think within a single season probably impossible, I've decided for now to stick with rudimentary park factors, just to give a sense for how statistics were affected by the park. In the long run, a multiple season analysis *might* (I hope) give us enough data to do more sophisticated (and accurate) park factors. (And I haven't even mentioned the neutral-site games...)
Nevertheless, I think there are a few big things that we know now that we didn't before, Schorling's extreme effects being probably the most important.
A schedule unbalanced in that sense (radically different shares of home games for different teams) is something we see today only when the sky is falling at the Kingdome, or whatever it was. In 1994, MLB home shares differed only moderately.
1) For the Pacific Coast League (PCL) during the decade of the 1920s, Brent stated that his BA conversions worked out to a .92 reduction to get a Major League Equivalency.
Since this conversion number is consistent with the present day Davenport AAA translations and my own 1940s Negro League and Triple-A studies, the implication was that the conversion rate between the Major Leagues and the highest Minor Leagues (Triple-A) has been fairly constant over time.
Thinking this implication over, my first thought was that the conversion rate would have had to be fairly constant throughout the century. Since the 1903 Major-Minor League Agreement, baseball talent has always been funneled up the Minor League ladder to the Majors. It makes sense that the ratio between the top rung and the next rung down would not change much.
But, on second thought, I realized that there is one huge difference between the Major League-Minor League relationship of the 1920s and that of the present day. In the 1920s, the Minors contained Major League players on their way up the ladder AND Major League players going down the ladder. It was not uncommon at that time for Major League players, even Hall of Famers, to finish their careers with three or four or even more years in the minors.
Many Minor League cities of the 1920s are now Major League cities, especially from the Highest Minors. The 1920s Minor Leagues had many players who played out their careers almost entirely in the Minors, despite the fact that they were obviously of Major League talent.
In the present day, the Minor Leagues, with some exceptions, basically just contain players going up the ladder. Once a player is no longer a Major League prospect, his Minor League job is in jeopardy. In this sense, the pool of present day Minor League talent is cut in half.
In other words, I would expect the Minor Leagues of the 1920s to be stronger than the equivalent present day Minor Leagues for this reason.
So I decided to do a little research and see if Buzz Arlett would support the theory.
Of course, Russell (Buzz) Arlett is a pretty fascinating player. Born in 1899, Arlett spent 1918 to 1922 (ages 19-23) as a pitcher in the PCL (hitting .247 in 695 AB over those 5 seasons). In 1923, he switched to the outfield and, like some minor league Babe Ruth, became a great Minor League power hitter from 1923 to 1936 (with his career being ended by an injury at 37).
During these 14 seasons, Arlett played in all three of the highest Minor Leagues and even spent one season in the Majors. Interestingly, every city he played in from 1918 to 1936 in the Minors (not counting a brief 1934 sojourn in Birmingham) is now in the Major Leagues (Oakland, Baltimore, Minneapolis).
The first of the following two tables lists Arlett’s career (YR-AB-H-BA-LG-Team) from 1923-1936. The second of the following two tables lists the corresponding League totals from his career (YR-AB-H-BA-LG). The last number in the second table is Arlett’s BA conversion factor. For example, in 1923, Arlett hit .330 in a League that hit .300; .330 divided by .300 equals 1.100.
BUZZ ARLETT, born: Jan. 1899
1923: 445-147 .330 Pacific Coast League (Oak)
1924: 698-229 .328 Pacific Coast League (Oak)
1925: 710-244 .344 Pacific Coast League (Oak)
1926: 667-255 .382 Pacific Coast League (Oak)
1927: 658-231 .351 Pacific Coast League (Oak)
1928: 561-205 .365 Pacific Coast League (Oak)
1929: 722-270 .374 Pacific Coast League (Oak)
1930: 618-223 .361 Pacific Coast League (Oak)
1931: 418-131 .313 National League (Phi)
1932: 516-175 .339 International League (Bal)
1933: 531-182 .343 International League (Bal)
1934: 430-137 .319 American Association (Min)
1935: 425-153 .360 American Association (Min)
1936: 193-061 .316 American Association (Min)
1923: 55963-16766 .300 Pacific Coast League 1.100
1924: 56076-16719 .298 Pacific Coast League 1.101
1925: 55179-15909 .288 Pacific Coast League 1.194
1926: 53995-15120 .280 Pacific Coast League 1.364
1927: 52110-15175 .291 Pacific Coast League 1.206
1928: 52255-15212 .291 Pacific Coast League 1.254
1929: 55774-16835 .301 Pacific Coast League 1.243
1930: 55108-16650 .302 Pacific Coast League 1.195
1931: 42941-11883 .277 National League 1.148
1932: 45030-12798 .284 International League 1.194
1933: 43725-12189 .279 International League 1.229
1934: 43546-12685 .291 American Association 1.096
1935: 44088-12905 .293 American Association 1.229
1936: 44671-13159 .295 American Association 1.071
Interestingly, Arlett’s career follows the classic path. After two adjustment years (1923-24 at ages 24-25), Arlett enters his prime in 1925 at the age of 26. At age 27, Arlett has his career year, hitting over 36 percent better than the league BA. At 29 and 30, Arlett has two other great years, both about 25 percent above league BA. But, basically, Arlett plays from 1925 to 1935 at about 20 percent above league BA with dips at age 35 and 37 as his career winds down to a close.
A superficial analysis of the Arlett data supports a conversion rate of .96, not .92. In 1931, Buzz Arlett’s Major League BA conversion was about 1.15 and his 1930 and 1932 High Minors conversion rate was about 1.20 (thus 1.15 divided by 1.20 equals about .96). Of course, this analysis leaves out two important factors. The first is the adjustment factor and the second is the park factor.
In 1930, Arlett had been playing in the PCL for 13 seasons. His statistics should be inflated by his long experience in the League. In 1931, Arlett played in the Majors for his only season. His statistics should be decreased by his inexperience in the League. However, in 1932, Arlett has the same inexperience factor working against him as he moves to another, unfamiliar, High Minor, the International League.
Logically, Arlett should have done better in the 1930 PCL than the 1932 IL, and the best possible match for his Major League 1931 BA factor would be his High Minor 1932 BA factor because the adjustment factor cancels out (i.e. Arlett was in his first year in the League in each season). Of course, the evidence does not show this since his 1930 and 1932 BA factors in the High Minors are virtually identical.
However, this is simply a park factor illusion. Oakland was a good pitcher’s park and Baltimore, like his Major League Park in Philadelphia, was a fantastic hitter’s park. So, at ages 32 and 33, Arlett played in the Majors and Minors in great hitter’s parks and with the same adjustment factor. Thus, when all is said and done, Buzz Arlett’s data still supports a BA conversion rate of about .96 (1931: 1.148 / 1932: 1.194).
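The two calculations used in this argument can be sketched as follows. This is only a restatement of the arithmetic in the post, with hypothetical function names; the 1.148 and 1.194 inputs are the factors as listed in the table above, taken as given rather than recomputed.

```python
def relative_ba(player_ba, league_ba):
    """Player's batting average as a multiple of the league average."""
    return player_ba / league_ba


def conversion_rate(major_factor, minor_factor):
    """Ratio of a player's relative BA in the Majors to his relative BA in the Minors."""
    return major_factor / minor_factor


# 1923 example from the post: .330 in a league that hit .300.
print(f"{relative_ba(0.330, 0.300):.3f}")      # 1.100

# 1931 National League factor vs. 1932 International League factor,
# both as listed in the table.
print(f"{conversion_rate(1.148, 1.194):.2f}")  # 0.96
```

A single pair of seasons gives a single estimate, so the caveats about sample size and park effects that follow apply to this number directly.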
Of course, there are several caveats to this answer. One caveat is that Buzz Arlett’s 1931 Major League season was (as is interestingly told in the Arlett thread) disrupted by a serious thumb injury. While this would actually seem to support an even greater BA conversion rate than .96 because his injury probably decreased his BA factor, there is also the possibility that Arlett was not given enough time to regress to the mean of his talent.
For the data to be completely irrefutable, Arlett would probably have needed 2000 or so at bats in the Majors.
Another caveat is that the exact magnitude of the park effects on Arlett in Philadelphia and Baltimore is unknown. The Baker Bowl, in Philadelphia, was the best hitter’s park in the National League. On the other hand, Terrapin Park in Baltimore was also the best hitter’s park in the International League (and Nicollet Park in Minneapolis was the best hitter’s park in the American Association).
In fact, and more to the point, Terrapin Park was a fantastic hitter’s park for a left-handed hitter (and, once again, Nicollet Park even more so). Of course, Arlett was a switch-hitter; but this simply means that most of his at bats were left-handed against right-handed pitchers.
To sum up, it seems that the High Minors of the 1920s were of better quality than today’s High Minors (and this is logical). Also, whether the conversion factor is .92 or .96, it is also apparent that Arlett was of near Hall of Fame or Hall of Fame quality as a hitter. As is discussed in Arlett’s thread, I think that Bill James had him pegged just about right.
If he had played in the Majors from 1923 to 1936, Buzz Arlett would have hit between .320 and .330 with over 300 HR. If he had played his whole career in the Baker Bowl, I think that he would have had some seasons of around 40 HR and a .350 BA, but that needs further study.
Obviously, further study of the High Minors in the 1920s is also needed; but there are several interesting possibilities for further comparison. The best one is probably Earl Averill.
[Finally, it would be unfair to talk about Arlett and not mention Joe Hauser.
Hauser, who like Arlett was also born in 1899, was on his way to a great Major League career when it was derailed by a series of injuries. In 1930 and 1931 (before Arlett arrived for the 1932 and 1933 seasons), Hauser played at Baltimore. In 1930, Hauser hit 63 homers. After an off year in 1931, Hauser was shipped to Minneapolis and replaced by Arlett as Baltimore’s slugger. Hauser played in Minneapolis from 1932 to 1936 and was teammates with Arlett from 1934 to 1936.
In 1932 and 1933, the left-handed Hauser led the AA with 49 and 69 HR for Minneapolis, also hitting .332 in the latter year. In 1934, Hauser had his last great year. In 82 games, Hauser hit 33 HR and batted .348. His year and, for all intents and purposes, the rest of his career was ended by injury. In that same year, Arlett hit .319 with a league-leading 41 HR for Minneapolis. All in all, I think Arlett was a better hitter than Hauser, but the similarities between their careers are interesting and Arlett is not all that much better.
One last fascinating thing about Hauser is his 1933 home-road HR breakdown. In 1933, Hauser hit 19 HR on the road and a staggering 50 at home in Minneapolis. It was a hell of a hitter’s park.]
The second aspect of Brent’s Post #19 that fascinated me was this:
2) His breakdown of BA conversion rates into individual components (1B-2B-3B-HR).
But I’ll have to post analysis about that (and the related stuff on Ramon Herrera) later.
. . .
In the present day, the Minor Leagues, with some exceptions, basically just contain players going up the ladder. Once a player is no longer a Major League prospect, his Minor League job is in jeopardy. In this sense, the pool of present day Minor League talent is cut in half.
Here are some factoids and interptoids that I don't have knowledge to develop. Neither fact nor interpretation is original.
- The Farm Clubs project, Minor League Cmte, SABR, covers 1930 in one section with research relatively advanced and 1930 in another section on the contrary.
- 1930s-50s, the PCL had more control over its players than did AA and IL.
- Today, many AAA farms are very close to the mlb city (Pawtucket & Boston, < 1 hr). The AAA club is used as a taxi squad and a rehab site. AA has more good prospects (more talent, less skill?).
- Today, the independent leagues employ many of the best players who are not prospects. Felix Jose, Nashua NH 2001(?).
The reasons were purely economic. It was cheaper for the Boston Red Sox to purchase a pitcher from Lynn or Brockton, Massachusetts, than it was from Omaha of the Western League or Indianapolis of the AA. It was quicker for the Yanks, Giants or Dodgers to purchase players from New Haven or Hartford than it was from Oakland or Dallas.
And it wasn't uncommon for a major league club to already own a player's contract and farm him out to Lynn or Brockton, where he could be observed and called back quickly.
The minors were a whole different ball of wax then. The Red Sox had partial ownership of teams like Sacramento, Providence and Buffalo in the 1910-1920 era.
One would expect this to be true IF the ratio of Major League clubs to High Minor league clubs were the same in the 1920s as it is today. If there were more High Minor league clubs than major league clubs in the 1920s then these could help absorb those major league players who were on the way out.
It has been observed that the regression formula I have used on the Beckwith MLEs is evening out Beckwith's seasons too much.
I think this is a mathematically correct observation, because if the regression formula were applied to a player with fully documented play in 154-game seasons, that player would also be regressed.
What is needed is a regression that adjusts the NeL short seasons towards the mean an amount appropriate to projecting performance into a 154-game season.
I see two possibilities on which I'd appreciate comment:
1) Getting the "large-sample" average for the player using five-year consecutive sums, but setting the regression ratio by taking the square root of the ratio of the player's recorded at bats to his projected at bats for that season.
Hypothetical example, since I don't have real numbers handy.
Beckwith's MLE average for 1925 is .371 in 210 at bats. His fiveyear average is .342. He is projected as having 580 ML at bats for 1925. Regress .371 toward .342 by the square root of 210/580.
By this method, .371 regresses to .359
2) Getting the "large-sample" average for the player using five-year consecutive sums, but regressing only the projected at bats towards the large-sample mean, giving the player full credit for his actual level of performance.
Hypothetical example:
Beckwith's MLE average for 1925 is .371 in 210 at bats. His fiveyear average is .342 in 970 at bats. He is projected as having 580 ML at bats for 1925. Regress .371 toward .342 for the 370 projected at bats by the square root of 370/970. Calculate seasonal average by combining regressed projected average for projected at bats and actual MLE average for actual at bats.
By this method, .371 regresses to .360. .360 in 370 ab and .371 in 210 at bats yields a full-season average of .364.
The current formula would regress Beckwith's .371 average to .355.
Which of these alternatives seems like the best model to use?
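For comparison, here is a minimal sketch of the two alternatives using the hypothetical Beckwith numbers above. It reads "regress by w" as keeping the fraction w of the gap between the season average and the large-sample mean, which is an interpretation, though it does reproduce the .359, .360 and .364 figures. The function names are made up for illustration.

```python
import math


def regress(avg, mean, weight):
    """Keep `weight` of the gap between avg and mean (weight=1 means no regression)."""
    return mean + weight * (avg - mean)


def option_1(avg, mean, actual_ab, projected_ab):
    """Regress the whole season by sqrt(recorded AB / projected AB)."""
    return regress(avg, mean, math.sqrt(actual_ab / projected_ab))


def option_2(avg, mean, actual_ab, projected_ab, sample_ab):
    """Regress only the unrecorded at bats, by sqrt(fill AB / large-sample AB),
    then combine them with the recorded at bats at full credit."""
    fill_ab = projected_ab - actual_ab
    fill_avg = regress(avg, mean, math.sqrt(fill_ab / sample_ab))
    return (fill_avg * fill_ab + avg * actual_ab) / projected_ab


# .371 in 210 AB, five-year average .342 in 970 AB, 580 projected AB.
print(f"{option_1(0.371, 0.342, 210, 580):.3f}")       # 0.359
print(f"{option_2(0.371, 0.342, 210, 580, 970):.3f}")  # 0.364
```

Option 2 always lands closer to the recorded average than option 1 for the same inputs, since the recorded at bats are never regressed.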
Regression to the mean only flattens peaks because it takes an average for the "rest" of the season. In reality, sometimes he would hit less than average for the "rest" of the season, sometimes more, giving some seasons where his good "actual" numbers disappeared altogether (not just flattened but eliminated) and some where they weren't flattened, or were only flattened a little. By regressing him only to his 5-year average, you will in any case be raising the whole "elevation" of his peak seasons.
I have raised Beckwith several notches following your analysis, which I support wholeheartedly in principle, while believing it tends still to round up in practice. But fiddling with the regression formula because the peak's not high enough, or twisting your .87 conversion to .95 to please Gadfly would lose it all credibility, as far as I'm concerned. Gadfly's cuckoo (if he can call me racist, I can call him mad!)