Baseball for the Thinking Fan

Baseball Primer Newsblog
— The Best News Links from the Baseball Newsstand

Wednesday, August 16, 2017

On the risks of categorizing a continuous variable (with an application to baseball data)

Don’t be so quick to discard the bunt into the dustbin of history.

From a statistical perspective, hopefully you can recognize both the dangers of categorizing continuous data, as well as the attractive features offered by a GAM (and if you want to try a GAM yourself – the code is up here).

From a baseball perspective, the hardest hit balls do increase error rates. Additionally, with my intuition being that errors are often discarded from the perspective of analyzing hitter talent, this type of error creation could be worth thinking more about. In addition to a side benefit of someone who hits the ball as hard as Aaron Judge or Giancarlo Stanton, having good, capable bunters could actually be undervalued. Putting down a sacrifice is generally not considered worth it (trading an out for moving up a base), but with error rates as high as 14%, there may be more to the story.

Jim Furtado Posted: August 16, 2017 at 08:50 AM | 17 comment(s)
  Tags: sabermetrics

Reader Comments and Retorts

Statements posted here are those of our readers and do not represent the BaseballThinkFactory. Names are provided by the poster and are not verified. We ask that posters follow our submission policy. Please report any inappropriate comments.

   1. Jose is an Absurd Doubles Machine Posted: August 16, 2017 at 09:04 AM (#5514309)
That title sounds like an Orphan Black episode.
   2. villageidiom Posted: August 16, 2017 at 10:30 AM (#5514367)
The title sounds like my day job, except for the baseball part.

It's funny. Citing the 14% error rate on bunts as a reason for more hitters to lay one down is potentially another example of poor analysis. I'd assume most of the errors on bunts are associated with the fastest hitters. It might make more sense for them to lay one down - but as has been covered frequently, including in The Book, the more often you bunt the more often the defense is prepared for it, which might decrease the error rate. And it might make no sense to have slower runners expand their bunting, as the high error rate might not happen with their bunts.

(I know I'm reacting to a throwaway line at the end of a github post on a different point, so I shouldn't expect the zenith of statistical analysis out of it.)
   3. PreservedFish Posted: August 16, 2017 at 10:44 AM (#5514380)
Man, that is poorly written.

But is 14% a real number? Seems crazy high to me.
   4. villageidiom Posted: August 16, 2017 at 01:11 PM (#5514517)
14% is the peak, basically if you get launch speed and (especially) launch angle just right. It's not the collection of all bunts.

And to be even more clear, it's the peak of his prediction. It's unclear from TFA if that lines up with actual results.

   5. madvillain Posted: August 16, 2017 at 02:03 PM (#5514551)
Who cares about the error rate when we have much more holistic data showing the cost of bunting versus swinging away? There's a real opportunity cost in taking the bat out of a guy's hands.
   6. snapper (history's 42nd greatest monster) Posted: August 16, 2017 at 02:18 PM (#5514553)
Who cares about the error rate when we have much more holistic data showing the cost of bunting versus swinging away?

How can you have holistic data w/o considering the error rate?
   7. madvillain Posted: August 16, 2017 at 02:25 PM (#5514555)
How can you have holistic data w/o considering the error rate?


WPA, no?
   8. snapper (history's 42nd greatest monster) Posted: August 16, 2017 at 02:30 PM (#5514557)

WPA, no?


But that depends on whether the batter is out or not, so it must implicitly include the error rate.
   9. PreservedFish Posted: August 16, 2017 at 02:38 PM (#5514563)
An error rate of 14% would be extraordinary. That turns Tom Glavine into Tony Gwynn.
   10. snapper (history's 42nd greatest monster) Posted: August 16, 2017 at 03:06 PM (#5514579)
An error rate of 14% would be extraordinary. That turns Tom Glavine into Tony Gwynn.

Well, if you go back to the 1890's, you've got league wide fielding % around .900. So, it's not inconceivable.
   11. Mike Fast Posted: August 16, 2017 at 05:34 PM (#5514695)
The MLB error rate on non-bunted batted balls is about 2%. The MLB error rate on bunted batted balls is about 3%.
   12. Pasta-diving Jeter (jmac66) Posted: August 16, 2017 at 06:04 PM (#5514725)
Well, if you go back to the 1890's, you've got league wide fielding % around .900. So, it's not inconceivable.

the author is obviously old school
   13. PASTE, Now with Extra Pitch and Extra Stamina Posted: August 16, 2017 at 06:37 PM (#5514750)
I just have one question about that title: What?
   14. Walt Davis Posted: August 16, 2017 at 07:11 PM (#5514789)
#13:

Continuous variable -- numeric with (theoretically) no breaks
Categorical variable -- discrete (e.g. yes/no ... or DP, force out, pop out, foul, missed, sac, RoE, hit)

Continuous to categorical: This happens all the time. Easiest for laymen to think about is probably in medical diagnosis. They take your blood pressure -- actually two continuous variables -- but then designate you as low/normal/high/really high. This happens with cholesterol levels, etc.

That categorization of a continuous variable generally involves a set of cut-offs. If your total cholesterol is higher than X, you get diagnosed as having high cholesterol. But there's not necessarily anything magical about that cut-off such that you're healthy at X-1 but unhealthy at X. One danger in categorization is that you over/under-diagnose or take those cutoffs as gospel.

You can also get different ideas about causation. I put together some examples for some stat training we've been doing recently and stumbled across an example. Using some high school testing data and comparing students in private to public schools (for a particular region) after controlling for some other variables, I analyzed it two different ways. In one, I looked at the raw test score for math; in the other, I was looking for "high achievers" and so coded students 1 if they scored in the upper quartile. Private school students did significantly better than public students on the continuous measure but were not significantly more likely to be high achievers. (It was a simple example for teaching not thorough research, don't go quoting that finding.)

Anyway, a key problem prior to The Book was that bunts were mostly analysed as "successful sac" vs "out" via base/out run expectancy tables: you're better off with a man on 1st and nobody out than a man on 2nd and one out, that type of thing. But bunts have many more outcomes than that.

That said, continuous vs. categorical wouldn't seem to be the issue here. This sounds more like process vs. outcome. Taking it out of the bunt sphere, a given EV with a given launch angle with a given directional angle is the process but you still need the separate measurement of the outcome. Continuous to categorical is "balls with an EV over X are hard-hit balls." This seems more "bunts with an EV around X and an angle around Y will result in a hit p1% of the time and a RoE p2% of the time and ..." That's a relationship among variables ... technically the probability distribution of the outcome conditional on the inputs.
   15. Walt Davis Posted: August 16, 2017 at 07:25 PM (#5514808)
Sorry, meant to toss in a clearer baseball analogy. One guy hits 298, the other hits 300, the second is called a "300 hitter" while the first is not although we all know there's no important difference between the two. We also know the difference between the 300 BA and a 320 BA is greater than the difference between 290 and 300 but the first pair are both "300 hitters." In this case, we're throwing away relevant information and over-stating differences in one case and under-stating them in the other if we rely on the categorization rather than the continuous.

Statistically speaking, you are almost always throwing away some information about the distribution of the variable when you categorize a continuous. It is usually done for diagnostic purposes -- the patient just wants to know if they have a problem and the doctor will find it hard to assess multiple continuous variables. Also, from a modelling standpoint, categorizing a continuous variable can be helpful for detecting and adjusting for non-linearities without having to know the precise functional form.
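
The 298/300/320 comparison above can be sketched in a few lines of Python. This is only an illustration: the helper names `is_300_hitter` and `gaps` are made up for the example, and averages are written in points (298 means .298) as in the comment.

```python
# Hypothetical helper: apply the conventional "300 hitter" cut-off.
# Averages are in points per thousand, so 298 means a .298 average.
def is_300_hitter(avg_points, cutoff=300):
    return avg_points >= cutoff

def gaps(a, b):
    """Return (continuous gap in points, gap in the 0/1 category) for b vs. a."""
    continuous_gap = b - a
    category_gap = int(is_300_hitter(b)) - int(is_300_hitter(a))
    return continuous_gap, category_gap

for a, b in [(298, 300), (300, 320)]:
    cont, cat = gaps(a, b)
    print(f"{a} vs {b}: continuous gap = {cont} points, category gap = {cat}")
```

The 2-point gap shows up as a full category change, while the 20-point gap shows up as no change at all, which is the over-stating and under-stating described above.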
   16. DavidFoss Posted: August 16, 2017 at 11:57 PM (#5515063)
That categorization of a continuous variable generally involves a set of cut-offs.

Any time you are making a histogram, you have to deal with this. You want the bins to be large enough that you get enough data in each bin but not so large that there is nothing to see. In practice you re-run the analysis with different bin boundary locations to see if you get different answers.
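
A minimal sketch of that re-binning check, using made-up numbers rather than any real baseball data. The function `bin_counts` is a hypothetical helper written for this example, not part of any library.

```python
def bin_counts(values, start, width, n_bins):
    """Count values falling in n_bins half-open bins [lo, lo + width), beginning at start."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v - start) // width)
        if 0 <= idx < n_bins:  # values outside the binned range are ignored
            counts[idx] += 1
    return counts

# Invented data, clustered near multiples of 10 (illustrative only)
values = [9, 10, 11, 19, 20, 21, 29, 30, 31]

# Same data, same bin width, boundaries shifted by 5
print(bin_counts(values, start=0, width=10, n_bins=4))  # boundaries at 0, 10, 20, 30
print(bin_counts(values, start=5, width=10, n_bins=4))  # boundaries at 5, 15, 25, 35
```

With boundaries at multiples of 10 the counts come out uneven; shifting the boundaries by 5 makes the same data look flat. If a conclusion changes this much when the bins move, it probably depends on the binning rather than on the data.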

Player age is one of the most common binning issues. June 30/July 1 is the cutoff but there is nothing biologically significant about that date -- you just have to draw the line somewhere. So players born in July & August are always seen as a full year younger than players born in May & June. Not entirely fair, but what can you do.
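
As a rough sketch of how the cutoff plays out, assuming "season age" means a player's age as of June 30 of that season (the function name and the birthdates below are hypothetical):

```python
from datetime import date

def season_age(birth, season, cutoff_month=6, cutoff_day=30):
    """Age as of the cutoff date in the given season (assumed convention: June 30)."""
    age = season - birth.year
    # Anyone whose birthday falls after the cutoff has not yet turned that age
    if (birth.month, birth.day) > (cutoff_month, cutoff_day):
        age -= 1
    return age

# Two made-up players born one month apart
june_player = date(1990, 6, 15)
july_player = date(1990, 7, 15)

print(season_age(june_player, 2017))
print(season_age(july_player, 2017))
```

The two players differ by one month in actual age but by a full year in season age, which is the binning effect described above.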

   17. Mike Emeigh Posted: August 17, 2017 at 05:30 PM (#5515719)
So players born in July & August are always seen as a full year younger than players born in May & June. Not entirely fair, but what can you do.


Until recently the age cutoff for Little League was April 30. A player born on May 1 - who would be 13 for most of the Little League season - was considered to be a 12-year-old for the purposes of Little League. It was no accident that "most" of the good teams had a preponderance of players who were born in May and June. Little League has since changed the cutoff date to August 31, effective next year, which will remove 13-year-olds from the mix entirely, although I'd expect that the good teams will now have a preponderance of players born in September and October. In 2015, nearly 1/3 of the players in the LLWS were 13-year-olds.

When I went to school the cutoff date for kindergarten eligibility was December 31. My birthday is December 20, and I was always one of the youngest kids in my class, which caused me some issues later on when everyone else was getting their growth spurts and I wasn't.

-- MWE
