Primate Studies - Where BTF's Members Investigate the Grand Old Game

Tuesday, September 03, 2002

Strength of Schedule

An interesting look at park effects.

The straightforward calculation of park effects must be done with enormous caution. If it is done with partial or even single-season data, the sample size may be too small; larger samples may obscure genuine changes from season to season. The calculation is even more distorted if the schedule is unbalanced; runs scored in a particular park may have had more to do with the strengths and weaknesses of the teams playing there than with the intrinsic characteristics of the park. This is particularly likely if teams have not played equal numbers of home and away games with the same opponents. The effect of an unbalanced schedule will diminish somewhat as the season progresses, but it will not disappear because the schedules themselves are highly unbalanced.

Distortion of park effects is only one consequence of the unbalanced schedule. I do not believe there has been adequate consideration of the effects of an unbalanced schedule on analysis of the accomplishments of teams or players. If a team plays relatively more home games against good offensive teams, for example, the park factor will be falsely inflated. To make this point I have developed a more sophisticated method of estimating park effects that takes into account strength and balance of schedule. The method turns out to have a number of additional benefits.

The number of runs a team scores in a game can be broken down into four elements: the team's offensive prowess, the pitching and defense of the opposition, the park factor, and whether or not a designated hitter is used.
This can be written as a simple linear equation:

Runs scored = Offensive strength + Defensive (pitching and fielding) strength of opponent + Park factor + DH factor + random error

By writing the equation this way, I am suggesting that the park factor, as well as the other three elements, can be estimated using linear regression. Each game played provides two observations, one for each team, consisting of runs scored, the identities of the park and of the offensive and defensive teams, and whether a DH was used. I estimated these factors for games played through August 23rd.

Before I present the results, I need to make a few statistical notes. Because the distribution of runs scored is skewed to the right (lognormal), calculations are made on a logarithmic transformation of runs scored. (The transformations are reversed in the reported results.) This provides better statistical estimates and two additional advantages. First, the effect of high scores is diminished. This is desirable because there is a lower limit to runs scored but no upper limit. Second, a logarithmic transformation treats park effects as multiplicative rather than as a fixed number of runs per game, making it consistent with current practice.

Here then are the results followed by explanations:
Off r/g is the number of runs per game the team is expected to score in a neutral park against average pitching and defense with no DH.

ExpOff is the number of runs per game the team would be expected to score against the actual opponents in the actual parks but without a DH. Because of the logarithmic transformation mentioned above, this is not the same as the average number of runs scored per game (arithmetic mean). The number is essentially a geometric mean using a workaround for shutouts (all numbers used in a geometric mean must be positive) and should also be close to the median.

Off Crxn is the Offensive Correction multiplier, the ratio of Off r/g to ExpOff. It is used to correct any unadjusted run-based offensive statistic (runs scored, RBI, runs created, Raw EqR, etc.) for teams or players for both park effects and the defensive strength of opposition.

Def r/g is the expected number of runs per game allowed in a neutral park against average offense with no DH.

ExpDef is the number of runs per game the team would be expected to allow against the actual opponents in the actual parks but without a DH.

Def Crxn is the Defensive Correction multiplier, the ratio of Def r/g to ExpDef. It is used to correct any unadjusted run-based defensive statistic (runs allowed, runs prevented, ERA, etc.) for both park effects and the offensive strength of opposition.

Park r/g is the expected number of runs per game scored per team in that team's home field by teams with average offense and defense and no DH.

Actual Park is the expected runs per team per game without the DH given the actual teams that have played there. Multiplying by the DH effect (currently estimated at 9.5%) gives a more realistic number for AL parks.

Teams Effect is the ratio of Actual Park to Park r/g and represents the relative strength of offense versus defense of the teams that have played in the park, i.e., the distortion to apparent park effects caused by the unbalanced schedule.
This is split into two parts:

HmTm Effect, the degree to which the home team offense and defense would raise or lower runs scored in the park against average opposition.

VsTm Effect, the degree to which the visiting team offense and defense would raise or lower runs scored in the park against average opposition.

Neutral WP% is the Pythagorean Winning Percentage using Off r/g and Def r/g. It is an adjustment for strength of schedule, estimating how the team would be doing against average opposition. It should be a good measure of the relative quality of all the major league teams.

ExPyth is the Expected Pythagorean Winning Percentage based on the expected runs scored and allowed against the actual opposition. It should be close to the Pythagorean Winning Percentage calculated using actual runs scored and allowed but underweights high-scoring games.

SoS is the relative Strength of Schedule, the ratio of Neutral WP% to ExPyth. Multiply SoS by the number of games won to adjust the team's win total for the strength of its opposition. Note the largest adjustments. Toronto, because of all its games against the Yankees and Bosox, has been penalized about five wins. The Twins, in addition to being fortunate in one-run games and extra innings, have accrued about four additional wins from playing their weak opposition.

Offrank is the team's offensive ranking (by Off r/g). It's no surprise that the Yankees are No. 1, but how can the Phillies be second? See below. The teams at the bottom are not surprising, although it's impressive how poor the Tigers look even after correcting for Comerica.

Defrank is the team's defensive ranking (by ascending Def r/g). No surprises at the top: the A's, Atlanta, Arizona, Boston, and Anaheim. I have no concern here about confounding with park effects; there are good hitter's parks (AZ and Oak (this year)), mild pitcher's parks (Bos and Ana) and a neutral park (Atl).
The bottom rankings seem plausible except perhaps the Phillies (once again, vide infra).

Parkrank is the ranking of park effects. The ranking of 1 (Colorado) is the best hitter's park; 30 (Philly) is the best pitcher's park. Coors rose from 12th to first from Memorial Day to the All-Star break. So much for the humidor.

Overank is the overall ranking. The Red Sox are on top, followed by the Yankees, D-backs and Braves. Milwaukee ranks dead last behind Tampa and Detroit. The Tigers and Brewers play in weak divisions, while the D-Rays have all sorts of games against the Yankees and Boston and have done badly in one-run games, so cut them some slack. Why are the Red Sox on top? The same reason they are at the top of the Pythagorean standings calculated with actual runs scored and allowed. They too have been unlucky in one-run games, but strength of schedule has not affected them at all, deflating their win total by only 0.6%.

So what about the Phillies? Does my method overstate the pitcher's park effect? There is little question the Vet is a pitcher's park: the Phillies are fifth in the majors in EqA using conventionally calculated park effects even though they are only seventh in the National League in runs scored. And the conventional park effect calculation would likely understate that effect; the Phillies play almost half their road games against the NL East, which has no good hitter's parks (I rank Atlanta as the best one at 15th). Their home/road differentials are made against other pitcher's parks on average. Finally, the Phillies face the Braves' pitching as much as anyone, and they are the only team in their division who don't get to hit against their own crummy pitching.

Looking at the results as a whole, there is a moderate negative correlation between offensive and defensive ratings (-.19) and stronger negative correlations between Park r/g and both Def r/g (-.27) and Off r/g (-.47).
This would suggest either difficulty in separating park from true offensive and defensive effects or an affirmation of the role of park effects in concealing team weaknesses, i.e., Coors makes the Rockies think their hitting is better and their pitching is worse than it really is, so they concentrate on improving the latter at the cost of the former. I believe the latter to be the case.

The method allows for formal statistical tests; these show that the differences among teams in offense, defense and park effects are all highly and separately statistically significant.
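The linear model described in the article (log of runs = offense + opposing defense + park + DH + error) can be sketched with ordinary least squares on dummy variables. This is only a minimal illustration under stated assumptions, not the author's actual procedure: the article does not name the statistical program or its parameterization, so the function names, the integer coding of teams and parks, and the choice to center each group of effects on zero are all hypothetical.

```python
import math
import numpy as np


def log_runs(runs, shift):
    """Log-transform a run total; `shift` is an assumed offset so shutouts (0 runs) are defined."""
    return math.log(runs + shift)


def fit_log_run_model(obs, n_teams, n_parks):
    """Least-squares fit of log runs on offense, opposing defense, park and DH dummies.

    obs: iterable of (off_team, def_team, park, dh_used, log_runs), where teams
    and parks are integer indices (hypothetical coding). Each game contributes
    two rows, one per batting team, as the article describes.
    Returns effects on the log scale; exp(effect) is a multiplier.
    """
    obs = list(obs)
    X = np.zeros((len(obs), 2 * n_teams + n_parks + 2))
    y = np.empty(len(obs))
    for i, (off, dfn, park, dh, logr) in enumerate(obs):
        X[i, 0] = 1.0                        # intercept (league baseline)
        X[i, 1 + off] = 1.0                  # batting team's offense
        X[i, 1 + n_teams + dfn] = 1.0        # opposing pitching and fielding
        X[i, 1 + 2 * n_teams + park] = 1.0   # park
        X[i, -1] = float(dh)                 # designated hitter in use
        y[i] = logr
    # The dummy columns are collinear with the intercept, so lstsq returns one
    # of many equivalent solutions; centering each group removes the ambiguity.
    # Caveat: if DH use is fully determined by the park, the DH and park effects
    # cannot be separated without an extra constraint the article does not state.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    off_eff = beta[1:1 + n_teams]
    def_eff = beta[1 + n_teams:1 + 2 * n_teams]
    park_eff = beta[1 + 2 * n_teams:1 + 2 * n_teams + n_parks]
    return {
        "offense": off_eff - off_eff.mean(),
        "defense": def_eff - def_eff.mean(),
        "park": park_eff - park_eff.mean(),
        "dh": beta[-1],
    }
```

Because the fit is on the log scale, exponentiating a centered park effect gives a multiplicative park factor relative to the average park, which matches the article's point that the transformation treats park effects as multiplicative.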

Reader Comments and Retorts
1. Mike Emeigh Posted: September 03, 2002 at 12:44 AM (#606084) MWE
Like Oakland. So far this year it is playing like a hitter's park, though who knows why. Oakland is scoring 5.33 runs at home, 4.63 on the road. Oakland's opponents are scoring 4.19 runs at Oakland, 3.93 elsewhere. I'm pretty sure any measure of park factors that is based on just 2002 data will show Oakland as a hitter's park. However, in 2001, facing a very similar schedule, Oakland and their opponents scored about .4 runs more in A's road games.
One BP (2001?) argued that park factors, usually based on 3 years of data, should be based on 5 years of data.
Walt, you beat me to the punch in pointing to the unadjusted numbers for Oakland this year. The comparable numbers for the Phillies: Offense 4.08 r/g at home, 5.02 on the road.
Defense 4.06 r/g at home, 5.24 on the road.
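For comparison, the unadjusted home/road numbers quoted above can be turned into a raw run ratio. This is the simple conventional comparison, not the regression method from the article, and it makes no correction for opponents or schedule; the function name is hypothetical.

```python
def raw_run_ratio(rs_home, ra_home, rs_road, ra_road):
    """Total runs per game in home games divided by total runs per game in road games."""
    return (rs_home + ra_home) / (rs_road + ra_road)


# Phillies figures quoted in the comment above (runs per game).
phillies = raw_run_ratio(rs_home=4.08, ra_home=4.06, rs_road=5.02, ra_road=5.24)
print(round(phillies, 3))  # 0.793
```

A ratio well below 1 is what an extreme pitcher's park would look like in the raw data, before any allowance for who the opponents were.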
Is four months of data a sufficient sample size? The differences in observed park effects are highly statistically significant, i.e., not due to chance. One reason three to five years of data may look more reliable is that most park effects may be genuine but unstable (weather-dependent, for example). Three to five years of data may average out all but the most persistent park effects. I don't believe there is any one correct time frame; it depends on what you are trying to measure and compare. What my method tries to do is increase the accuracy and efficiency of measurements for whatever time frame you use by correcting for the effects of an unbalanced schedule.
I didn't address Mike Emeigh's comment about how these numbers are calculated. Three of the numbers in the table (Off r/g, Def r/g, and Park r/g) come directly from the regression estimation, which is done by a statistical program. The remainder are derived as follows:
ExpOff is the (adjusted) geometric mean of runs scored in each game for each team. Runs scored in AL parks are reduced by approximately 9.5% (DH adjustment). ExpDef is the same for runs allowed.
I already explained the calculation of the Crxn factors.
Actual Park is the adj. geometric mean of all run totals of games played in that park (reduced 9.5% for AL parks).
Teams Effect is already explained.
HmTmEffect is the ratio of Off r/g to the average multiplied by the ratio of Def r/g to the average. If the home team is stronger offensively than defensively this will be greater than one.
VsTmEffect is the ratio of Teams Effect to HmTmEffect, because HmTmEffect * VsTmEffect = Teams Effect.
Neutral WP% and ExPyth% are the usual Pythagorean calculations with an exponent of 2.
SoS is explained in the article.
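The derived columns listed above are simple ratios and can be sketched as follows. The function and parameter names are hypothetical, and which "average" the HmTmEffect ratios use is not spelled out in the comment, so the league-average arguments here are an assumption.

```python
def pythagorean_wpct(runs_for, runs_against, exponent=2.0):
    """Pythagorean winning percentage; exponent 2 as stated in the comment."""
    rf = runs_for ** exponent
    return rf / (rf + runs_against ** exponent)


def correction_multiplier(neutral_rg, expected_rg):
    """Off Crxn or Def Crxn: neutral r/g divided by expected r/g."""
    return neutral_rg / expected_rg


def strength_of_schedule(off_rg, def_rg, exp_off, exp_def):
    """SoS: Neutral WP% divided by ExPyth."""
    return pythagorean_wpct(off_rg, def_rg) / pythagorean_wpct(exp_off, exp_def)


def home_team_effect(off_rg, def_rg, avg_off_rg, avg_def_rg):
    """HmTmEffect: (Off r/g / average) * (Def r/g / average); averages are assumed league means."""
    return (off_rg / avg_off_rg) * (def_rg / avg_def_rg)


def visiting_team_effect(teams_effect, hm_tm_effect):
    """VsTmEffect, from HmTmEffect * VsTmEffect = Teams Effect."""
    return teams_effect / hm_tm_effect
```

As an illustration with made-up numbers, a team with Off r/g of 5.5 and Def r/g of 4.2 in a league averaging 4.5 gets a HmTmEffect above one, since its offense raises scoring more than its defense lowers it.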
More importantly, I am trying to measure total defense, not just pitching; earned runs, even if accurate, are irrelevant.
Finally, the disparity you note about the A's opponents hitting better on the road while the A's do better at home is why you need to correct for the unbalanced schedule. If, at this point in the season, the A's have faced better hitting teams on the road than at home, you can see this kind of contradictory result.
As you point out if you fully take into account the unbalanced schedule, the traditional calculations are no longer correct. However, there does not seem to be any "easy" way to calculate park effects under this scenario. Your sample sizes become very small very quickly, since you no longer can lump all opponents into one bin. It might not be a case of circular reasoning, but your degrees of freedom seem to shrink precipitously. Maybe it's a massive simultaneous equations system.
Could you describe your method of calculating park effects in the face of unbalanced schedules? Thanks much.
Dave, the "lognormal" transformation I make is calculated by the statistical program to make the run distribution unskewed, essentially normal. For AL teams it adds about 4 runs, takes the geometric mean, then subtracts the four runs. For the NL the figure is about 3.5. For example, if an AL team scores 4 runs twice, the average would be sqrt((4+4)*(4+4)) - 4 = sqrt(64) - 4 = 4. If the team scores 0 and 8 runs, the average would be sqrt((0+4)*(8+4)) - 4 = sqrt(48) - 4, or 2.93. For an NL team the result for the first example is the same, and the second would be sqrt(3.5*11.5) - 3.5 = 2.84. Why the difference? In the NL a shutout is slightly more typical and scoring 8 runs is slightly more unusual.
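The shifted geometric mean in this reply is easy to check numerically. A minimal sketch, with a hypothetical function name and the shifts of 4 (AL) and 3.5 (NL) taken from the approximate values quoted above:

```python
import math


def shifted_geometric_mean(runs, shift):
    """Add `shift` to each run total, take the geometric mean, then subtract `shift`."""
    mean_log = sum(math.log(r + shift) for r in runs) / len(runs)
    return math.exp(mean_log) - shift


print(round(shifted_geometric_mean([4, 4], 4.0), 2))    # 4.0
print(round(shifted_geometric_mean([0, 8], 4.0), 2))    # 2.93
print(round(shifted_geometric_mean([0, 8], 3.5), 2))    # 2.84
```

Averaging logs and exponentiating is the same as taking the n-th root of the product, and it handles any number of games rather than just two.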
You are right about batting events (hits and bbs) being a useful way of gauging the magnitude of a shutout/defensive blowout, but using them as a general measure is problematic. First, the winner of the game is the side that scores the most runs, not the side with the most hits and bbs. Second, how do you weight the different events to make an aggregate measure of offensive or defensive strength (runs created, linear weights, OPS, etc.)? None of these measures is perfect, so you are adding another level of error and uncertainty. You could use this method to adjust for park and opposition for the different events: Team X hits a lot of HRs, Team Y allows a lot of doubles, or there are relatively few Ks at Park Z after team tendencies are accounted for.