Baseball for the Thinking Fan

Primate Studies
— Where BTF's Members Investigate the Grand Old Game

## Wednesday, December 15, 2004

#### An Oliver Twist on the Pythagorean Theorem

What the Dickens?


I made a trip to the Hall of Fame the weekend before Thanksgiving.  One of the
highlights of the trip was the gift shop.  While there, I came upon a copy of
Dean Oliver’s book.  The book is an attempt to apply sabermetric thought to
basketball.  I haven’t finished it yet (basketball is probably my third
favorite sport), but I have found some interesting stuff in the book so far.

Chapter 11, titled “Basketball’s Bell Curve,” provides an
alternative to Bill James’s Pythagorean Theorem.  While the Pythagorean Theorem
was developed for baseball analysis, it has been applied to other sports,
including football, hockey, and hoops.  The formula pretty much remains the
same [Winning Percentage = Points (Runs)^Z / (Points (Runs)^Z + Opponents’
Points (Runs)^Z)].  In baseball, the exponent (Z) is usually approximately 2
(I used [...] and 17 depending on the league and how high scoring it is.
Oliver used a method called correlated Gaussian to come up with basketball
winning percentages:

Where PPG = points per game, DPPG = defensive points per
game, and covar = covariance.  If you calculate NORM with Excel, use a mean of
0, a standard deviation of 1, and a cumulative value of TRUE instead of FALSE.
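
The formula itself did not come through here.  A reconstruction consistent with
the definitions above, and with comment #8 below (which notes that the
denominator is the standard deviation of the scoring differential, built from
the variances and the covariance), would look like the following.  Treat it as
an assumption based on that context rather than a quotation from the book:

```latex
\[
\text{Win\%} = \mathrm{NORM}\!\left[
  \frac{\mathrm{PPG} - \mathrm{DPPG}}
       {\sqrt{\operatorname{var}(\mathrm{PPG}) + \operatorname{var}(\mathrm{DPPG})
              - 2\,\operatorname{covar}(\mathrm{PPG}, \mathrm{DPPG})}}
\right]
\]
```

Here NORM is the standard normal cumulative distribution function (mean 0,
standard deviation 1).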

Oliver used this method because he found a correlation
between PPG and DPPG and felt that the correlated Gaussian method would be more
accurate than the Pythag.  He mentioned two reasons for this correlation: 1.)
The tendency for teams to play up or down to their competition, and 2.)
“Garbage time,” where the game is no longer in doubt and the team with the lead
can give up points without changing the result.  He also found that as the
standard deviations of PPG and DPPG increase, teams are drawn towards a .500
winning percentage.  Thus, consistency is a good trait for a superior team to
have, while it is a hindrance for inferior teams.  This quick summary doesn’t
do justice to Oliver’s method, and I suggest visiting his website if you would
like to learn more.
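
To make the consistency point concrete, here is a minimal Python sketch of a
correlated Gaussian calculation from per-game scores.  The function name
`oliver_win_pct` and the two sets of scores are made up for illustration (they
are not real game data), and the use of population variances is an assumption
of this sketch, not something taken from the book:

```python
from math import sqrt
from statistics import NormalDist, mean

def oliver_win_pct(scored, allowed):
    """Correlated Gaussian estimate from per-game points (or runs)."""
    n = len(scored)
    ppg, dppg = mean(scored), mean(allowed)
    # Population variances and covariance of scored vs. allowed.
    var_s = sum((s - ppg) ** 2 for s in scored) / n
    var_a = sum((a - dppg) ** 2 for a in allowed) / n
    covar = sum((s - ppg) * (a - dppg) for s, a in zip(scored, allowed)) / n
    # Mean differential relative to the standard deviation of the differential.
    z = (ppg - dppg) / sqrt(var_s + var_a - 2 * covar)
    # Standard normal CDF: mean 0, sd 1, cumulative.
    return NormalDist(0, 1).cdf(z)

# Two hypothetical teams, each outscoring opponents by one run per game.
steady = oliver_win_pct([5, 6, 5, 6], [4, 5, 5, 4])
streaky = oliver_win_pct([1, 10, 2, 9], [8, 1, 7, 2])
```

Both made-up teams have the same average margin, but the streaky one has a much
larger spread in its game-by-game differential, so its estimate lands closer to
.500 than the steady team’s does.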

Naturally, as a baseball fan, I wanted to see if this method
has applications to the National Pastime.  I went to
[baseballreference.com](http://www.baseball-reference.com/teams/NYY/2004_sched.shtml)
to get the scores for every AL game this year.
([retrosheet.org](http://www.retrosheet.org/boxesetc/VBOS02003.htm) also
has this data.)  Well, correlated Gaussian doesn’t seem to work any better than
the Pythagorean method, at least for the 2004 American League.  The Pythagorean
winning percentage has a .948 correlation with actual winning percentage,
compared with a correlation of .945 for the “Oliver” winning percentage.  And
Pythag is a helluva lot easier to calculate (at least for me).
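
For anyone who wants to repeat the comparison, here is a rough Python sketch of
the Pythagorean side and of the correlation check.  The names `pythag_win_pct`
and `pearson_r` are made up for this example, the default exponent of 2 is an
assumption, and the 800-700 totals are invented rather than real team data:

```python
from statistics import mean

def pythag_win_pct(scored, allowed, z=2.0):
    """Pythagorean winning percentage from season totals (or per-game averages)."""
    return scored ** z / (scored ** z + allowed ** z)

def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical team that outscores its opponents 800 to 700.
example = pythag_win_pct(800, 700)
```

Feeding `pearson_r` a list of actual winning percentages and a list of
estimated ones, one entry per team, gives the kind of correlation quoted above.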

It is well known in the slackademic circles of the baseball
blogosphere that the
[2004 New York Yankees](http://www.baseball-reference.com/teams/NYY/2004.shtml)
outperformed their expected Pythagorean winning percentage.  Also,
they were a little more inconsistent than the average AL team, as evidenced by
their runs scored and runs against standard deviations.  However, the Bronx
Bombers actually outperformed their Oliver winning percentage by a little bit
more (.623 actual versus .544 Oliver, compared with .623 versus .547 Pythag).

Actually, the Oliver winning percentages are extremely close
to the Pythags.  Pythag was better six times, Oliver was better four, and there
were four virtual ties.  This is only one year for one league, but it hardly
seems worth the effort to use this new method for baseball.  If someone is a
better number cruncher than me and would like to run the results on other
seasons or leagues, go for it.  I’d be interested to see if either Oliver or
Pythag outperforms the other.  It appears to me that, while it exists,
covariance of baseball scores is pretty low.  If I had to make a guess as to
why this is the case, and I’m reaching here, teams probably use their better
relievers in games that frontline starters pitch and use the back of the
bullpen in starts by fourth and fifth starters.

As consolation for this inconclusive study, I decided to
include a byproduct of my spreadsheet: charts of the runs scored/runs against
distribution for all fourteen 2004 American League teams.  I’ll also throw in
the spreadsheet that this study was based on, if you would like to play around
with it.

This is dedicated to [Studes](http://www.baseballgraphs.com/), who has done
more than anyone else that I’m aware of in reviving John Warner Davenport’s
interest in graphical representation of baseball stats.

Jon Daly Posted: December 15, 2004 at 10:33 PM | 16 comment(s)

1. Nick S Posted: December 15, 2004 at 11:05 PM (#1023260)

I would have guessed that a correlation in basketball between PPG and DPPG resulted from a correlation between PPG and opponent’s possessions/game.  That is, every time you score, your opponents get the ball, whereas every time you don’t score you get the ball back (offensive rebound) some non-insignificant percentage of the time.  Also, and I know squat about basketball, it may be that teams choose to play aggressively, resulting in better offense and worse defense.

2. dcsmyth1 Posted: December 15, 2004 at 11:57 PM (#1023388)

Well, there’s something in basketball called Game Pace, which relates to how fast a team goes thru a possession. If a team has a fast pace, then both the team and the opponents will have more possessions during the game, and therefore likely more points and more points allowed (and vice-versa, of course).

3. TVerik, the gum-snappin' hairdresser Posted: December 16, 2004 at 01:45 AM (#1023594)

I always wondered about +/- ratios in basketball.  They have ‘em in hockey, why not hoops?

Of course, I have trouble with any metric that measures your performance when you’re not on the court/ice, but that’s neither here nor there.

4. Jim A Posted: December 16, 2004 at 03:04 AM (#1023713)

BOP is an excellent book.  It should also be noted that Dean Oliver was recently hired as a consultant by the Sonics, who perhaps not coincidentally are off to a surprising 18-4 start.  There are some quotes from Oliver in this article:

These Sonics are indeed super

Also Roland Beech at 82games.com has basically the equivalent of +/- ratings in basketball.

5. Joey Numbaz (Scruff) Posted: December 16, 2004 at 06:43 AM (#1024028)

Bill James did something like this in the 80s, I’m guessing, but I want to say it was the 1986 Abstract (the article focused on the Dodgers and Mets, and they were both good in 1985) - he basically came to a similar conclusion. James said that it’s basically 2-3% more accurate and 10 times the work, so it’s not worth it.

Obviously that’s different in an 8-10 run environment, as opposed to a 180 point environment. But it does explain a small percentage of a team’s deviation from the Pythagorean method.

6. MKT Posted: December 16, 2004 at 09:16 AM (#1024415)

1.  It’s not too surprising that taking covariance into account doesn’t add much accuracy, for reasons that have already been given:  game pace (each baseball team basically always gets 27 outs, whereas an NBA team might shoot 70 or 100 field goal attempts); the fact that the scoring team has to give the ball to the other team in basketball (this is actually much the same concept as the game pace one); and the garbage time/playing to the level of the opponent argument that Dean Oliver originally made.

2. The graphs would be better as scatterplots rather than bar graphs (assuming that the points would show up when viewed via the web), because scatterplots would enable the user to see the degree of correlation between points scored and allowed.  If the two are positively correlated, the scatterplots will look like upward-sloping ellipses.  If they are not (as appears to be the case with baseball), then they will look like diffuse spheres.

With the bar graphs, one cannot detect the correlations, indeed it’s rather hard to even figure out if a team outscored its opponents or not.

3.  Post #3 said, “I always wondered about +/- ratios in basketball. They have ‘em in hockey, why not hoops?”

Harvey Pollack of the Philadelphia 76ers has been reporting +/- for decades in his Philadelphia 76ers Statistical Yearbook.  Also, recently 82games.com started tracking this with their Roland Ratings.  And Jeff Sagarin and Wayne Winston in recent years have come up with a more sophisticated version of +/- which they call WinVal.

There are, however, several problems with +/- ratings, both on theoretical grounds and on the empirical grounds that many of the players receive ridiculous ratings.

7. studes Posted: December 16, 2004 at 03:07 PM (#1024675)

Hey GGC, thanks for the dedication!  I appreciate it.

I’ve played with graphs like this (pythagorean distributions) ever since they invented Lotus.  I find it a fascinating subject, and one that lends itself to “graphical reflection.”

I was just thinking this morning about our possible Yankee pythagorean project.  Are you still up for it?

8. Walt Davis Posted: December 16, 2004 at 04:46 PM (#1024934)

One reason not to apply this to baseball is that neither run-scoring nor run-allowing are quite normally distributed so chances are neither is run differential.  The deviation from normality appears fairly small, but it might be enough to offset the small increase in explanatory power.  (As scoring increases, the distribution converges on normality, so when you’re talking basketball scores things should be pretty normal)

For those who look at that formula and don’t get what it’s doing:

the denominator part is the standard deviation of run differential—there’s no actual need to calculate the variances and covariance of RS and RA, just calculate the variance of RD.

So you’re looking at the mean run differential relative to its standard deviation.  If memory serves this is also called the coefficient of variation.  This then is treated as a z-score (think bell curve), fed through the normal CDF (don’t worry about it), which gives you the probability of seeing this run differential given this standard deviation.  At least I assume that’s what NORM is standing for in that formula.

The upshot being that the greater the variation in run differential, the less that a given run differential translates into wins.  So a .5 run differential in Dodger Stadium would be huge but in Coors is not.  This would be true across eras with different scoring levels as well.

Jon, since you’ve got the data handy, how about just calculating the R-square and seeing how that works?  Just correlate RS with (RS+RA) and square the result.  This is quite similar to the original pythag but will correct for degrees of freedom and covariance.  The interpretation is “of the total variance of run-scoring in a team’s game, how much is due to that team’s offense.”  The R-square would be the estimated win percentage.

Also if you’d be nice enough to send me the raw game score data, there’s something else I’d like to take a crack at.

9. GGC don't think it can get longer than a novella Posted: December 16, 2004 at 05:37 PM (#1025062)

> Jon, since you’ve got the data handy, how about just calculating the R-square and seeing how that works? Just correlate RS with (RS+RA) and square the result. This is quite similar to the original pythag but will correct for degrees of freedom and covariance. The interpretation is “of the total variance of run-scoring in a team’s game, how much is due to that team’s offense.” The R-square would be the estimated win percentage.

I may try that when I get some free time.

> Also if you’d be nice enough to send me the raw game score data, there’s something else I’d like to take a crack at.

Walt, go to the spreadsheet that I linked.  It contains all the game scores for each AL team.

10. Kelly Posted: December 17, 2004 at 12:56 AM (#1026250)

I think the key to constructing a less empirical approach to predicting team winning percentages based on runs scored and allowed is to use Poisson statistics.  I recently constructed a model using the Poisson distribution and the average runs scored data for all 162-game seasons and compared the results to actual winning percentage and Pythagorean winning percentage.

The primary results:

1) The Poissonian distribution model and Pythagorean model gave similar results.

2) The Poissonian model predicts higher and lower winning percentages for the best and worst run differentials than the Pythagorean model does.

3) The Pythagorean model works better.

I have wondered if there would be any interest in my writing up the method and presenting the results here.
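
For readers curious what a Poisson model of this kind might look like, here is a minimal Python sketch.  It is not necessarily the method described in this comment: it assumes runs scored and runs allowed are independent Poisson variables, it settles tied scores in proportion to the non-tied outcomes, and the names `poisson_pmf`, `poisson_win_pct`, and `max_runs` are made up for the example.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k runs when the average is lam runs per game."""
    return exp(-lam) * lam ** k / factorial(k)

def poisson_win_pct(rs_per_game, ra_per_game, max_runs=40):
    """Estimated winning percentage from average runs scored and allowed."""
    win = loss = 0.0
    for s in range(max_runs + 1):
        p_s = poisson_pmf(s, rs_per_game)
        for a in range(max_runs + 1):
            p = p_s * poisson_pmf(a, ra_per_game)
            if s > a:
                win += p
            elif s < a:
                loss += p
    # Baseball has no ties, so tied scores are dropped and the remaining
    # probability is renormalized (an assumption of this sketch).
    return win / (win + loss)

even_team = poisson_win_pct(5.0, 5.0)
good_team = poisson_win_pct(5.5, 4.5)
```

Under these assumptions a team that scores 5.5 and allows 4.5 runs per game comes out higher than the same team does under a Pythagorean formula with an exponent of 2, which is in line with result 2) above.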

11. GGC don't think it can get longer than a novella Posted: December 17, 2004 at 01:29 AM (#1026313)

Responding to some of these posts here:

> Bill James did something like this in the 80s, I’m guessing, but I want to say it was the 1986 Abstract (the article focused on the Dodgers and Mets, and they were both good in 1985)

Joe DiMino, the man who remembers Run Element Ratio!  Indeed, Bill did something with the Mets article in 1986 that was somewhat similar.

> The graphs would be better as scatterplots rather than bar graphs (assuming that the points would show up when viewed via the web). Because scatterplots would enable the user to see the degree of correlation between points scored and allowed. If the two are positively correlated, the scatterplots will look like upward-sloping ellipses. If they are not (as appears to be the case with baseball) then they will look like diffuse spheres.

As per your idea, I did a scatterplot of the Royals.  It’s more diffuse than upward sloping.

> With the bar graphs, one cannot detect the correlations, indeed it’s rather hard to even figure out if a team outscored its opponents or not.

I originally did these as line graphs instead of column graphs.  I actually think the columns look better.  But you are right, neither makes it clear whether a team outscored its opponents.  You can look at the peaks for a clue, but even that doesn’t tell the whole story.  (Witness the Texas Rangers.)  Like I said, if you want to play around with these, I gave a link to the original spreadsheet.

> how about just calculating the R-square and seeing how that works? Just correlate RS with (RS+RA) and square the result. This is quite similar to the original pythag but will correct for degrees of freedom and covariance. The interpretation is “of the total variance of run-scoring in a team’s game, how much is due to that team’s offense.”

I did this for five of the teams.  Unless I’m doing it incorrectly (a possibility), R-squared doesn’t seem to match up well with winning percentages.  They were all between .521 and .593.  That includes the Royals and Mariners, who were both around .521.  Unless the AL was renamed the Lake Wobegon League, I don’t think they’re accurate.

12. GGC don't think it can get longer than a novella Posted: December 17, 2004 at 01:44 AM (#1026347)

Kelly, I’d contact Dan Szymborski about that.  I think that he may be able to accommodate you.  If you’re looking for more research, BPro did an article that mentioned Poisson distribution back in 1999.

13. Dr. Chaleeko Posted: December 19, 2004 at 02:56 PM (#1030274)

I wonder if the Oliver method would be more appropriate for 19th Century teams (esp pre-1893).

Over on the Hall of Merit area, we’ve been discussing whether pitching to the score was a skill/strategy employed in the 19th C. because there essentially was no bullpen.  Pitching to the score sounds an awful lot like the baseball equivalent of
‘1.) The tendency for teams to play up or down to their competition, and 2.) “Garbage time”; where the game is no longer in doubt and the team with the lead can give up points without changing the result.’

14. Ivan Grushenko of Hong Kong Posted: December 20, 2004 at 01:52 PM (#1031318)

Wow! Oliver Twist!  Cockney Rhyming Slang!

I love this!

Whose twist is it now?

15. Slinger Francisco Barrios (Dr. Memory) Posted: December 20, 2004 at 04:20 PM (#1031592)

> I always wondered about +/- ratios in basketball. They have ‘em in hockey, why not hoops?

They do have them in hoops.  It’s not widely published, but it is being done.

A couple of years ago someone showed that the Bulls were best off with both Jamal Crawford and Jay Williams on the floor at the same time, rather than paired up with other guards.  (Then Williams cracked up his motorcycle.)

16. Walt Davis Posted: December 22, 2004 at 03:07 AM (#1035195)

Kelly, you may want to look into the negative binomial distribution.  It allows for more dispersion than the Poisson (i.e. should be able to handle very high scoring games a little better).
