Baseball for the Thinking Fan

Baseball Primer Newsblog
— The Best News Links from the Baseball Newsstand

Wednesday, July 10, 2019

If baseball is any indication, the big data revolution is over

Following the so-called moneyball revolution, which emphasized the value of statistical analysis for player acquisition and strategy, Statcast seemed to promise that there was finally enough data to fundamentally answer the game’s remaining questions.


But will it? History says no. Over the past century and a half, data revolutions have helped fans and managers better understand the game, but each was also deemed woefully inadequate within years of debuting.

The same is likely to happen to Statcast — and sooner than we might think. That’s not because some fatal technological flaw will emerge, but rather because of the nature of data itself. Data are, in essence, the things we rely on to make arguments and answer questions. And the questions we ask inevitably change over time.

There’s no doubt Statcast and similar data-collection efforts have changed the game from the days when runs batted in and earned run average dominated our understanding of players. Right now, the latest trends in baseball are defensive shifts, launch angles, exit velocity and “true” outcomes. But soon we’ll ask new questions for which new data will be required. It’s not that we haven’t learned anything, but rather that we’ll never learn everything.

RoyalsRetro (AG#1F) Posted: July 10, 2019 at 10:23 AM | 22 comment(s)
  Tags: sabermetrics

Reader Comments and Retorts



   1. RoyalFlush Posted: July 10, 2019 at 11:15 AM (#5860689)
I remember people being excited about StatCast data, but I don't remember anyone promising it would "fundamentally answer the game’s remaining questions."

Like - all of them?

   2. PreservedFish Posted: July 10, 2019 at 11:22 AM (#5860692)
I don't get it. Seems like a strange strawman.
   3. SoSH U at work Posted: July 10, 2019 at 11:28 AM (#5860693)
That headline is like Bearnip at BTF.

   4. Zonk is Back Where He Came From Posted: July 10, 2019 at 12:14 PM (#5860705)
It's a poorly constructed piece - but I do think there is the kernel of a better point.

To wit - I could have sworn I read a piece last year at FG or the Athletic where shifts are becoming... I don't want to say passé and I wouldn't expect them to EVER go out of style, but that the teams with the best IF defensive efficiencies are actually the teams that shift the least... Correlation and causality, of course - a big variable is obviously the skill sets (or lack thereof) of your fielders, but still.

So... why aren't shifts the defensive holy grail they once might have been? Just spitballing, but I imagine it's a lot of factors. Dead pull hitters with bad launch angles get weeded out.... because they don't produce. And maybe "launch angle" focus itself plays a big role - if you're a lumbering power hitter that likes to pull the ball, what do you do about constantly facing a stacked IF on your strong side? Well.... you could learn to bunt well. Or - you could stop hitting so many balls on the ground and up your launch angle such that you're still hitting into the stacked defense, but now popping more balls over the heads of the stacked IF. The idea that defensive shifts and launch angles both arrived on the scene simultaneously simply isn't true, except in the very broadest of senses.

In other words - it's not that this stuff CAN'T answer these "questions".... it's that the right (or best) answer constantly changes because sides are constantly playing a game of "Oh yeah? Well, what if I...."

As it should be.... and in that lies the beauty of baseball... and that's why this stuff is not only fascinating, but also so valuable... because the eternal push and pull never ends or stays static.
   5. PreservedFish Posted: July 10, 2019 at 12:25 PM (#5860718)
To wit - I could have sworn I read a piece last year at FG or the Athletic where shifts are becoming... I don't want to say passé and I wouldn't expect them to EVER go out of style, but that the teams with the best IF defensive efficiencies are actually the teams that shift the least... Correlation and causality, of course - a big variable is obviously the skill sets (or lack thereof) of your fielders, but still.


This is a paragon of the Zonkian run-on.
   6. SoSH U at work Posted: July 10, 2019 at 12:33 PM (#5860724)
As it should be.... and in that lies the beauty of baseball... and that's why this stuff is not only fascinating, but also so valuable... because the eternal push and pull never ends or stays static.


Baseball has always been very good with this kind of balance. When SBs fall out of favor (usually due to run environment), teams start selecting less for catcher arms/work less on holding runners, and suddenly players can begin to exploit that and SBs regain some lost ground.

The one area where the balance is out of whack is with strikeouts. Since the defense wants them and the offense doesn't mind them (for different reasons), there's nothing obvious to pull things back to normal. This will likely need some kind of tweaking of the sport to get things back in balance (should that be desired by those with the ability to do so, and not just by a majority of Primates).

   7. Zonk is Back Where He Came From Posted: July 10, 2019 at 12:40 PM (#5860728)
The one area where the balance is out of whack is with strikeouts. Since the defense wants them and the offense doesn't mind them (for different reasons), there's nothing obvious to pull things back to normal. This will likely need some kind of tweaking of the sport to get things back in balance (should that be desired by those with the ability to do so, and not just by a majority of Primates).


My suspicion is that this is probably where the old PAP/usage/etc holy grail lies. Strikeouts cost pitches. Pitches come with risk. High BB/high K environments make strikeouts cost even more pitches.
   8. BillWallace Posted: July 10, 2019 at 01:23 PM (#5860757)
Agree that there's probably a similar and better point to be made here.

I've been thinking a lot about the shift recently.

I'm sure most of you have seen this graphic that broadcasts will put on the field where they divide the infield into a few pie slices and show the percentage of time this batter hits it into each pie slice... e.g. a lefty will be up and there are a few slices on the right side of the infield with 20-30% each and then one bigger slice covering most of the left side with like 15%, and it's all overlaid on top of the shifted defense where you can see the 2nd baseman hanging out in short right at the top of a 30% slice, and basically no one on the left side of the infield.

For me this graphic was eye-opening and not in the way it was intended. It occurs to me that it's fairly obvious that this infield defense isn't necessarily the out-maximizing defense. For one thing, not all batted balls are equal. Many of the 10-15% hit to the left side are easy chances that you completely give up by not playing with a conventional 3rd baseman, and likewise many of the 30% are very hard hit grounders/semi-liners that you still don't get EVEN with the shift.

A simple summary of a complex issue, but it's not clear at all to me that teams have actually done the kind of comprehensive analysis that combines batted ball profile with likely fielding percentages and everything else that would show that giving up entire sections of the infield is a net gain even for extreme pull hitters. And as mentioned there's some evidence that the shifts don't really lower BABIP all that much in a lot of cases.

Big data, like all data, is only useful if you carefully ask exactly the right questions, make absolutely sure that you fully understand all of the inherent assumptions you are making, and understand all of the limitations in your work. Otherwise it's quite easy to 'science' your way into exactly backwards conclusions. I wouldn't be surprised if the 'out maximizing' defense was much closer to the age-old alignment than people realize.
   9. Walt Davis Posted: July 10, 2019 at 05:24 PM (#5860817)
This is a paragon of the Zonkian run-on.

I'm standing right here.
   10. The Duke Posted: July 10, 2019 at 05:34 PM (#5860821)
I think shifting is killing off dead pull hitters, making shifting less valuable. On my team, Matt Carpenter has been collapsing due to the shift. Over time its value will diminish as dead pull hitters will be weeded out. Similar thing happened to base stealing. It’s almost a lost art now because teams have gotten so good at holding runners.

Strikeouts - I agree neither offense nor defense seems inclined to “fix” this. I think we are stuck with them.
   11. Walt Davis Posted: July 10, 2019 at 05:54 PM (#5860824)
The excerpt is not convincing in the least. But ...

1. It's pretty well established and accepted that any "policy intervention" ends up having less effect than models/simulation (much less popular expectation) suggest, basically for the reason suggested by several posts already -- human beings will adjust, change behavior in unexpected ways, find the loopholes, etc.

2. Statcast isn't really big data. And even the biggish parts are things that really nobody knows. The key to big data is summarizing it in such a way that it makes some sense. You lose detail every step of the way.

3. No matter how much data you have, the results of a baseball game come down to a small handful of plays. Even Mike Trout barrels it only about once every 10 PAs ... which is what, about 1 out of every 40 pitches, once every 2-3 games. In terms of "controlling" Trout, we'd be talking about maybe figuring out ways to drop him from 65 barrels a year to, what, 60. So no matter how much "data" you have, your sample is small because it pretty much all comes down to those 65 swings and the effect you're chasing is even smaller than that.

3a. And of course most batters aren't Trout. In its 40 PAs in a game, a team will barrel it ... probably fewer than 3 times on average.

4. The shift, oddly, is about reducing the value of batted balls that are already of low value. Every once in a while, Trout (or David Ortiz or whoever) will barrel a line drive right at somebody but mostly you're taking away a few groundball singles a year. Not necessarily a bad strategy, every little bit helps (see #3), but the whining/praising about shifts was always overblown.

5. All that data, all those analysts, all that knowledge ... and, for some unknown reason, pitchers still throw Javy Baez pitches over the plate.
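For anyone who wants to check the back-of-the-envelope numbers in point 3 (and 3a), here is a rough sketch in Python. The barrel counts and the 1-in-10 rate come from the post; the 4 pitches per PA and 4.3 PAs per game are assumptions filled in for illustration, and `barrel_summary` is just a made-up helper name.

```python
# Back-of-the-envelope check of the barrel arithmetic in points 3 and 3a.
# Values marked "assumed" are illustrative and do not come from the post.

BARREL_RATE_PER_PA = 1 / 10   # from the post: about one barrel every 10 PAs
PITCHES_PER_PA = 4.0          # assumed; implied by "1 out of every 40 pitches"
PA_PER_GAME = 4.3             # assumed; rough figure for an everyday hitter
SEASON_BARRELS = 65           # from the post
TARGET_BARRELS = 60           # from the post
TEAM_PA_PER_GAME = 40         # from the post
TEAM_BARRELS_PER_GAME = 3     # the post says "fewer than 3", so an upper bound


def barrel_summary():
    """Return the derived quantities as a dict of floats."""
    return {
        # Pitches seen per barrel: 4 pitches per PA over a 10% barrel rate.
        "pitches_per_barrel": PITCHES_PER_PA / BARREL_RATE_PER_PA,
        # Games between barrels for one hitter.
        "games_per_barrel": 1 / (BARREL_RATE_PER_PA * PA_PER_GAME),
        # Season PAs implied by 65 barrels at a 10% rate.
        "implied_season_pa": SEASON_BARRELS / BARREL_RATE_PER_PA,
        # Relative size of the effect being chased (65 barrels down to 60).
        "relative_drop": (SEASON_BARRELS - TARGET_BARRELS) / SEASON_BARRELS,
        # Ceiling on the team-wide barrel rate per PA.
        "team_rate_ceiling": TEAM_BARRELS_PER_GAME / TEAM_PA_PER_GAME,
    }


if __name__ == "__main__":
    for name, value in barrel_summary().items():
        print(f"{name}: {value:.3f}")
```

Under those assumptions the 65-to-60 drop works out to a change of under 8 percent in an event that happens roughly once every two or three games, which is the small-sample problem the post is describing.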
   12. Bote Man Posted: July 10, 2019 at 06:48 PM (#5860834)
5. All that data, all those analysts, all that knowledge ... and, for some unknown reason, pitchers still throw [struck through: Javy Baez] Austin Riley pitches over the plate.

FTFY
   13. kthejoker Posted: July 10, 2019 at 07:34 PM (#5860847)
TL;DR: baseball is a zero-sum war game, so no single strategy can ever totally dominate, as it will either be met with a counter-strategy or co-opted and neutralized.


You're welcome.
   14. Zonk is Back Where He Came From Posted: July 10, 2019 at 07:35 PM (#5860848)
oh Walt.... I’m flattered.

But stylistically solely - I’m just a pale imitation playing with blocks.... I aspire to tl;dr (to be clear, but should read :-)) - but I just construct conversational prose with a lot of clauses, generous use of ellipses, and preemptive strikes against contradictory points. I make single course word meals that require a bib and will get you a picture on the wall if you finish them. You prepare true seven course meals that you should use the proper fork to eat and even if the meal didn’t agree with you, you still brag to your friends you got a table.

But...since some fish like pithy... bite me, PF!
   15. Zach Posted: July 10, 2019 at 07:56 PM (#5860849)
So... why aren't shifts the defensive holy grail they once might have been?

There's a basic contradiction between big data and holy grails.

Holy grails are things with big impact. Hits, outs, walks, runs scored.

Big data looks for small effects, or at least effects that are smaller than you could see with "small data." You have to do things over and over again for small effects to add up to big impacts. But the game is only 27 outs long.

If you want to use big data to find a holy grail, you have to look at something more fine grained than outs. Things like catcher framing, spin rate, or pitch sequencing.
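To put a rough number on "you have to do things over and over again," here is a minimal sketch of the usual normal-approximation sample-size formula for comparing two rates. The .300 baseline and the 10-point effect are made-up illustrative inputs, not figures from this thread, and `pa_needed` is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def pa_needed(p_base, delta, alpha=0.05, power=0.80):
    """Approximate observations needed in EACH group to detect a drop of
    `delta` in a rate, using a two-sided two-sample z-test with the
    normal approximation. All inputs are illustrative assumptions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for the test
    z_power = NormalDist().inv_cdf(power)           # quantile for the desired power
    p_new = p_base - delta
    # Sum of the two binomial variances (one per group).
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_power) ** 2 * variance / delta ** 2)


if __name__ == "__main__":
    # Hypothetical: a .300 rate on balls in play, and a tactic that is
    # supposed to shave 10 points off it.
    print(pa_needed(0.300, 0.010))
    # A bigger effect needs far fewer observations.
    print(pa_needed(0.300, 0.050))
```

With those inputs the answer lands in the tens of thousands of balls in play per group, which is many seasons' worth for any one hitter. Halving the effect size roughly quadruples the requirement.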
   16. Walt Davis Posted: July 10, 2019 at 10:11 PM (#5860879)
#15 ... good points. But also shifts weren't the defensive holy grail because they only impact the small events (GBs). With a few exceptions like Ichiro, no batter has made their money off of GBs since at least 1993. Plus the easy and obvious counter-measure of bigger swings and positive launch angles to hit even fewer GBs.

There's also the fact that 3 guys over half the IF and some OF still ain't gonna be where they're hit all that much more often.

Still I want to emphasize your point. I've been in few analytical situations where the real problem was too few observations and lots of situations where the problem was too few variables, not the right variables, poorly measured variables, forcing additive linearity, etc. The main advantage of most big data is that you now have enough observations in very small groups that you might actually be able to detect ways in which they are different from other groups. But, lo and behold, while group membership (or social similarity or whatever) often has an impact, it's not usually a very big one.

The first famous "big data" thing was Google's flu outbreak predictor. But that was more about having access to a new variable (are people trying to find out about the flu) rather than the fact it was data on millions upon millions of people. Of course people worried about the flu will be looking for information and, to the extent some people are more worried about the flu than others, it's probably because people around them are getting the flu. If we'd had data on "asked pharmacists about flu remedies" or "bought cold medicine" in a timely manner 30 years ago, we probably could have generated the same finding. (Then of course Google's flu model hasn't worked so well since, I don't believe.)

Most big data is administrative data, whether collected by the government or Google or whoever. The main advantages there are timeliness and measurement accuracy (for the "question" they've asked). Sometimes also population coverage but that can be difficult to establish. It would be pointless for many reasons to, say, run a survey 6 months after a flu outbreak asking respondents if they sought out information early in the flu season -- but sample size is low on that list of reasons.

So yeah, the benefit of huge samples is being able to detect tiny effects ... or to detect reasonably-sized effects in really small groups (assuming reasonable population coverage) ... or to have a big giant play-around with half the data while still having lots of data left to "test" what you found. Few of those really apply to baseball -- the samples aren't all that big, the effects are extra puny and, no matter what you find, you're using it to (at most) project somebody's performance over the small sample of their next 650 PAs where randomness plays havoc and those "agents" are working to counter your measures anyway.
   17. Jeff Frances the Mute Posted: July 11, 2019 at 08:13 AM (#5860908)
Percentage of plate appearances where teams are shifted:

2016 13.8%
2017 12.1%
2018 17.4%
2019 24.9%
   18. Shooty would run in but these bone spurs hurt! Posted: July 11, 2019 at 08:37 AM (#5860912)
I recently read a book called The Book of Why that talks about the limitations of big data and why the new statistical revolution is based in asking causal questions and in exploring counterfactuals, something that statisticians had been loath to do since, basically, the inception of the profession. Has anyone with a stats background read that? Honestly, the actual math went over my head but the arguments against big data mining as a stand-in for the scientific method seemed compelling.
   19. DavidFoss Posted: July 11, 2019 at 10:05 AM (#5860938)
Honestly, the actual math went over my head but the arguments against big data mining as a stand-in for the scientific method seemed compelling.

I have not read that particular book, but it is a new technology and like lots of new technology it is easily oversold. When the hype reaches the people with the purse strings, a new technology will get overfunded for a while. Listening to business types talk about big data can make me cringe. Like if we collect enough data and feed it into the machine, a model will develop which will start beating you at chess, know you are pregnant before you do and then start launching nuclear weapons. It doesn't matter that the data contains only launch angles and exit velocities -- with enough data, the computer will learn.

I'm having deja vu typing this, so I googled and found a Wikipedia article I had seen before on the hype cycle which discusses something called Amara's Law: "We tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run."

Big data is most certainly *not* a substitute for ab initio scientific analysis. The two fields do completely different things. But that doesn't make big data useless. There are problems where people are just looking to load the dice. You may want to 'enrich the data' -- design a fast filter which isolates almost all the points you are looking for in a small fraction of the original set. On another note, often it is interesting to find correlations even if you don't understand them. Lots of those will get immediately rejected for being ridiculous, but some of them might inspire an ab initio study to find out if there is a cause involved.

As far as the baseball Statcast stuff goes, I imagine it helps coaches but it can't replace them. Launch angle analysis might tell Alex Bregman to "hit the ball higher, but not too high" and he'll likely go "Duh! I've played Angry Birds. I know how that works. But it's not like I'm hitting the ball off a tee." But if Alex Bregman goes into a slump the data might be collected into a report for the hitting coach to see if anything has changed recently or if he's just getting unlucky.

   20. Shooty would run in but these bone spurs hurt! Posted: July 11, 2019 at 10:20 AM (#5860941)
Is causal modeling a technology? My impression of the book is that it was more of a method. I'm kind of a dim bulb on this kind of stuff, though. Also, the argument against Big Data wasn't that it was useless but that it won't answer questions on its own. Judea Pearl (the author) uses the example of the debate linking smoking with cancer in the 50's and 60's to show how hard it is to draw conclusions just from data. You can keep adding more data to the pile but without an interpretive framework you're just spinning your wheels. Pearl's conclusion is that Big Data is massively useful now that we have a causal framework to interpret it. I just wish I was more schooled in stats to have an informed opinion about it. (BTW, Pearl is definitely a baseball fan as he likes to use baseball stats examples in the book.)
   21. DavidFoss Posted: July 11, 2019 at 11:39 AM (#5860959)
Is causal modeling a technology? My impression of the book is that it was more of a method.

I'm sorry. Without having read that specific book, I'm not exactly sure what is being discussed. Often things get renamed and rebranded every few years.

I took it to mean that a first-principles mechanism for the cause of an effect was proposed. People then use the scientific method to run an experiment comparing the experimental group with some control group. Then statistical analysis is done to determine if the difference between the experimental result dataset and the control result dataset is meaningful. If the theory catches on, then other researchers try to reproduce it. Then follow-up studies are done drilling down on individual aspects of the theory (If X is the cause of Y, we should also see Z). Eventually the scientific community reaches a consensus.

Big Data modeling does not necessarily need to ask 'why', but the best models tend to pass the smell test when you drill down and see what weights it found for each independent input variable. It is just handy that the models can be automatically generated and then used to generate 'prediction scores' for every record in a large dataset extremely quickly. You are right in that the scores usually need interpretation. High risk and low risk are not guarantees of anything. Plus, the model only does well for cases similar to those that are used to train the model.
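As one concrete way to do the "is the difference between the experimental and control datasets meaningful" step described above, here is a small permutation test sketch. It is only one of several reasonable tests, the numbers in the usage example are invented, and `permutation_p_value` is a hypothetical helper name.

```python
import random
from statistics import mean


def permutation_p_value(treated, control, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Repeatedly shuffles the pooled observations and counts how often a
    random split gives a difference at least as large as the observed one.
    """
    rng = random.Random(seed)               # seeded so results are repeatable
    observed = abs(mean(treated) - mean(control))
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_treated]) - mean(pooled[n_treated:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p of 0.
    return (extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Invented measurements for an experimental group and a control group.
    experimental = [5.0, 5.1, 5.2, 4.9, 5.3, 5.0, 5.1, 5.2]
    control_group = [4.8, 5.0, 4.9, 5.1, 4.7, 5.0, 4.9, 4.8]
    print(permutation_p_value(experimental, control_group))
```

A small p-value only says the split is unlikely to be chance. It says nothing about why, which is where the first-principles mechanism and the follow-up studies come in.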
   22. Shooty would run in but these bone spurs hurt! Posted: July 11, 2019 at 11:45 AM (#5860961)
Thanks David. That's a helpful summary of it. I think that's pretty close to what the book is about.
