Baseball for the Thinking Fan

Baseball Primer Newsblog
— The Best News Links from the Baseball Newsstand

Friday, January 03, 2014

What big data can teach us about our favorite sports

thought the article was worth reading

Harveys Wallbangers Posted: January 03, 2014 at 09:08 AM | 17 comment(s)
  Tags: analytics, basketballsucks, bigdata, football, statistics

Reader Comments and Retorts

   1. Mike Emeigh Posted: January 03, 2014 at 09:53 AM (#4628995)
I think the issue with some of the data scientists who look at sports is that rather than starting with a model and reviewing the data to see whether the data fits the model, they start with the data and try to fit a hypothesis to the patterns they see in the data.

Take this, for example:

Clauset and Merritt found another interesting pattern: While hockey and football teams tend to extend their leads, pro basketball squads play worse when they’re ahead. They’re not the first to notice this pattern. Jonah Berger, a professor at Wharton and the author of Contagious: Why Things Catch On, has argued that this phenomenon suggests losing teams are inspired to play harder. Berger tells me via email that we only see the pattern in basketball because “teams score frequently, and differences in motivation can easily impact whether a scoring event occurs. In hockey, and even football to some extent, scoring occurs less frequently and is more discrete. So even if players were more motivated it would be harder for that motivation to translate into additional scoring events.”

Clauset considers Berger’s “underdog inspiration” theory interesting, but as a data scientist, he wants to see the numbers. “How do you measure motivation?” he asks. “I don’t know.” Clauset thinks the NBA’s “restoring force”—the tendency for teams to lose their leads—might instead be due to player management. In basketball, he theorizes, coaches often pull their best players from the lineup when they’re in the lead, meaning they’re less likely to score. In football, by contrast, coaches rarely substitute in this manner, and there’s so much rotation in hockey that it’s more difficult to orchestrate when the best players shuffle in and out.


Did these scientists consider that, in hockey, a trailing team sends out its best offensive players (who are quite frequently lesser defenders) more often, that they pull the goalie (which often leads to empty-net goals on the other end) and that they take more risks in the offensive end, which opens up the ice on the defensive end for breakaways and two-on-ones? Did these scientists consider that trailing teams in football do things like go for it on fourth down, rather than punting or going for a field goal, and that when those strategies fail the trailing team gives up field position? Did they actually look at late-game strategies that football and hockey teams follow, and map how those strategies affect the outcome?

I think not.

-- MWE
   2. Harveys Wallbangers Posted: January 03, 2014 at 09:56 AM (#4628998)
mike

what you describe this author would likely point to as the 'reactive' approach he lists
   3. Mike Emeigh Posted: January 03, 2014 at 10:18 AM (#4629012)
Harveys:

I get that. But if you're going to draw a conclusion from the data, which these guys have done, you should at least investigate the impact of those forces, rather than publishing this:

Clauset agrees that there are incredibly complicated forces underpinning his seemingly simple findings. “These teams are working so hard to beat everyone else,” he says. “But in the end, it’s like the Red Queen in Alice in Wonderland—you have to run as hard as you can just to stand still. Only by working so hard and figuring out these strategies can you achieve this system that is like a random coin flip.”

That, to Clauset, is good news for the fans. It means these games are inherently balanced, that all teams have more or less the same advantages. What victories hinge on are the rare chance events—the tragic mistakes or lucky breaks, the stuff that gets stadiums and arenas cheering.



How can he draw that conclusion from the data? Because they have developed a simple model that is more accurate in predicting game outcomes, on average? This is the MGL approach - where you decide to take Jon Lester out of a WS game because "on average" your bullpen will hold the lead. Yes, it's important to know the percentages - but it's also important to understand specific contextual factors that lead you to make a decision that goes "against" the percentages. I think that's what these guys don't get; they see a pattern and make an assumption that understanding the dynamics of the situation doesn't matter in the long run because ultimately it comes down to a lucky bounce here and there - when it's the dynamics that actually drive the data that they see.

-- MWE
   4. Harveys Wallbangers Posted: January 03, 2014 at 10:35 AM (#4629019)
mike

not disagreeing with either of your posts.

just my guess on how this author would respond.

like a good many number pros, he's pretty sure of himself.
   5. Bitter Mouse Posted: January 03, 2014 at 11:48 AM (#4629039)
Note: I hate the term Big Data. I am in the field though so ....

Anyway the money quote is I think:
“These teams are working so hard to beat everyone else,” he says. “But in the end, it’s like the Red Queen in Alice in Wonderland—you have to run as hard as you can just to stand still. Only by working so hard and figuring out these strategies can you achieve this system that is like a random coin flip.”


And I think that is the correct view.

To address what Mike is saying, there are (grossly simplified) two ways of analyzing data. Start with a theory and test it (relationships and correlations) against the data, or set loose mathematics on large data sets and allow the data to spit out relationships and correlations.

Again this is very simplified but that is the high level view. Both are very valid and both can lead one down the rabbit hole. It sounds like he is definitely doing the second, and it often helps to take that perspective because it can free you of bias and you can gain new insights into the data set (and the system it represents) that someone "inside" the system will never see. Of course you can also end up with nonsense. I have seen both in plenty.

The two main factors are the quality of the data (accuracy and granularity - is it correct, complete, and granular enough that the main factors are measured and not buried) and quality of the analyst. Good math and enough quantity of data are also pretty important.
   6. Rickey! trades in sheep and threats Posted: January 03, 2014 at 11:55 AM (#4629046)
Note: I hate the term Big Data. I am in the field though so ....


I am counting the days until I'm no longer working on a "big data" team.
   7. Tulo's Fishy Mullet (mrams) Posted: January 03, 2014 at 12:09 PM (#4629058)
I enjoyed the running poll an industry publication ran last month in ridiculing the buzzwords of the year in our business (securities). The winning phrase: "skate to where the puck is going to be"
   8. Yeaarrgghhhh Posted: January 03, 2014 at 12:15 PM (#4629066)
But if you're going to draw a conclusion from the data, which these guys have done, you should at least investigate the impact of those forces

Why? The "scientists" are looking at long term patterns in the data. I don't see why they have any obligation to analyze the discrete events as well or why a specific set of circumstances (Lester feels really good that day, but the bullpen is spent) contradicts the overall trends.
   9. villageidiom Posted: January 03, 2014 at 01:21 PM (#4629137)
But if you're going to draw a conclusion from the data, which these guys have done, you should at least investigate the impact of those forces

Why? The "scientists" are looking at long term patterns in the data. I don't see why they have any obligation to analyze the discrete events as well or why a specific set of circumstances (Lester feels really good that day, but the bullpen is spent) contradicts the overall trends.

They don't have an obligation to analyze the discrete events, but if they don't they have an obligation not to dismiss discrete events as luck or error if they haven't actually determined it to be so. They should not pretend (nor suffer the delusion that) their analysis is more than it is.

IMO, they actually do have an obligation to analyze the discrete events, because the discrete events can reveal ways in which one has totally screwed up the data prep. I always go back to this one, so sorry for being redundant, but... When UZR had Manny Ramirez at something like -47 one year, people were eager to defend UZR on the basis of (a) it works on average and (b) we all know Manny is bad, and -47 is bad, so -47 must be right. Shortly after that someone went through the PBP video and found a drastic mismatch between balls that were considered playable (underlying UZR) and balls that were actually playable, in Fenway LF. UZR has since been adjusted to account for the problems with Fenway LF data, and Manny's UZR bottoms out around a -26, which is still bad, but a couple wins better. A couple wins is a pretty big difference, on an individual player.
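
As a rough check on the "couple wins" figure, here is a minimal sketch of the arithmetic. It assumes the common rule of thumb of about 10 runs per win, which is not stated in the thread; the function name and the constant are illustrative only.

```python
# Assumption: roughly 10 runs equal one win (a common sabermetric rule of
# thumb, not a figure taken from this thread).
RUNS_PER_WIN = 10.0


def runs_to_wins(runs: float, runs_per_win: float = RUNS_PER_WIN) -> float:
    """Convert a run differential into approximate wins."""
    return runs / runs_per_win


# UZR values quoted above for Manny Ramirez, in runs.
original_uzr = -47
adjusted_uzr = -26

# The adjustment moved the rating up by this many runs.
run_swing = adjusted_uzr - original_uzr
win_swing = runs_to_wins(run_swing)

print(f"{run_swing} runs is about {win_swing:.1f} wins")
```

Under that assumption the 21-run swing works out to roughly two wins, which matches the "couple wins" description.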

Understanding the outliers, the overall patterns, etc., is key to understanding if your data was worth using in the first place.
   10. Benji Gil Gamesh Rises Posted: January 03, 2014 at 01:39 PM (#4629153)
Why? The "scientists" are looking at long term patterns in the data. I don't see why they have any obligation to analyze the discrete events as well or why a specific set of circumstances (Lester feels really good that day, but the bullpen is spent) contradicts the overall trends.

I'm no numbers pro, but in addition to what vi said in #9, I think what Mike was originally referring to in #1 was not so much discrete events as specific, repeated strategies used within different sports that could reasonably be having a direct effect on the numbers that they are analyzing and the conclusions they are drawing.
   11. Shibal Posted: January 03, 2014 at 01:45 PM (#4629156)
Factoring in these few basic findings, Clauset and Merritt developed a mathematical model that, after observing just a few scoring events, predicted game outcomes for college and pro football, the NHL, and the NBA with surprising accuracy. Their model proved more accurate than the simple metric of looking at who was in the lead at a given time, and it outperformed SportsbookReview.com’s pregame betting odds while more or less matching the accuracy of the live-betting site Bovada. Impressive results, considering Clauset and Merritt spent just three months analyzing the data and coming up with their model.


I think their definition of "impressive" is pretty weak. It would take me an hour to build a model to match Bovada's in-game lines, once I have the game data. The data is the hard part. The model certainly isn't.



   12. Accent Shallow Posted: January 03, 2014 at 02:06 PM (#4629177)
I enjoyed the running poll an industry publication ran last month in ridiculing the buzzwords of the year in our business (securities). The winning phrase: "skate to where the puck is going to be"


You know how I know you're Canadian?
   13. bobm Posted: January 03, 2014 at 02:06 PM (#4629178)
[1]
[FTFA:] Berger tells me via email that we only see the pattern in basketball because “teams score frequently, and differences in motivation can easily impact whether a scoring event occurs. In hockey, and even football to some extent, scoring occurs less frequently and is more discrete. 


I wonder if this is true in NCAA or just in the NBA with its rigged officiating.
   14. Swedish Chef Posted: January 03, 2014 at 02:10 PM (#4629185)
If you haven't got a trillion of something it isn't "big data".
   15. Elvis Posted: January 03, 2014 at 03:54 PM (#4629340)
#13 bobm -

If you don't think the NCAA has rigged officials, try watching an ACC contest involving Duke or Carolina.
   16. zenbitz Posted: January 03, 2014 at 04:35 PM (#4629394)
I think in general, data science is mediocre *science* -- although it's incredibly USEFUL. The reason is that it almost never provides models of understanding.

Or more to the point: it can never give you causation.
   17. valuearbitrageur Posted: January 03, 2014 at 05:18 PM (#4629436)
I think in general, data science is mediocre *science* -- although it's incredibly USEFUL. The reason is that it almost never provides models of understanding.

Or more to the point: it can never give you causation.


I think a million Pot Arbs could type on a million keyboards for a million innings and not come up with a more succinct explanation of Pot Arb's opinion of data science.
