MASTER NOTES: Bucket lists

As I’ve said before, in this space, at industry events, and to the occasional stranger in the grocery store, I’m fascinated by all the new data pouring into my computer from baseball’s many new measurement systems.

The other day, I was telling someone—either Todd Zola or a guy in the dairy aisle—about batters’ Exit Velocity (EV) and Launch Angle (LA) data. Come to think of it, I talked about this in Master Notes, so I might just have been talking to myself.

In any case, from what I’ve seen, these data are often applied to projection and analysis by “bucketing” the numbers into, well, buckets, so that each Batted Ball Event (BBE) is categorized as, say, 90-95 MPH EV, 20-25 degree LA. My issue is that the buckets are arbitrarily defined. Why 90-95 MPH, other than the comfortable 5 and 10 endings? Why not 88-93? Or 92-97? Why not 23-27 degrees of LA?

The logical conclusion to this ruminating is that maybe we should look at each BBE as a discrete category: a specific EV/LA combination. So I built a little (actually, not so little) Excel workbook to look at every EV/LA combination in 2017, and to see how often that combination generated a hit. The idea was that I could then apply those hit values to 2018 YTD, to see if maybe there are some over- and under-performers who might be headed for a correction.

I downloaded Statcast data from the wonderful Baseball Savant website, getting every BBE in 2017, excluding pitchers' plate appearances. If you’re keeping score at home, there were more than 123,000 BBEs, which meant I was able to heat a ham-on-swiss panini on my computer while the quad-core i7 processor was grinding through the data.

The Statcast data capture EVs and LAs to several places of decimal, so I rounded all the EVs and LAs to the nearest whole MPH or degree. Even at that, there were 9,038 different EV/LA combinations. EVs ranged from 5 MPH (Lucas Duda) to 122 MPH (Giancarlo Stanton), and LAs ranged from -85 degrees (almost straight down, one of many feeble groundouts I remember from having Ezequiel Carrera on my fantasy team) to 90 degrees (straight up, a Jason Heyward popup).
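
For anyone following along without Excel, the rounding-and-counting step might look something like the Python sketch below. It is only an illustration of the idea, not the workbook itself: the sample values and the names `sample_bbes` and `to_combo` are invented for the example. One caveat is that Python's `round` sends exact halves to the nearest even number, which can differ from a spreadsheet's rounding, so the sample avoids values ending in .5.

```python
from collections import Counter

# Hypothetical batted-ball events: (exit velocity in MPH, launch angle in degrees).
# These values are invented for illustration, not real Statcast rows.
sample_bbes = [(101.3, 24.6), (101.4, 25.2), (88.7, -12.1), (89.2, -11.8), (64.0, 37.4)]

def to_combo(ev, la):
    """Round an EV/LA pair to the nearest whole MPH and whole degree."""
    return (round(ev), round(la))

# Count how many batted balls fall into each rounded EV/LA combination.
combo_counts = Counter(to_combo(ev, la) for ev, la in sample_bbes)

print(len(combo_counts), "distinct EV/LA combinations")
for combo, n in combo_counts.most_common():
    print(combo, n)
```

Run over a full season instead of five made-up rows, the length of `combo_counts` would be the number of distinct combinations.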

The next step was a bigtime number-crunch, so I made another panini and set to work. Somewhat ironically, given how this project started, I started out by looking for buckets in which various EVs and LAs showed unusually high or low Hit Rates.

As has been discussed in other places, there’s a pretty obvious correlation between higher EVs and higher Hit Rates, which we’d expect. At or under 35 MPH, the Hit Rate was under 20%, but at or over 90 MPH, it was closer to 50%. No surprise there.

When it came to LA, the only surprise was the extreme nature of the connection. For instance, the Hit Rate of all BBEs with an LA of 67 degrees or higher (cans o’ corn with too much loft) was 0.4%. In just over 4,800 BBEs, there were only 21 hits! The highest Hit Rates, as high as 75%, were in the range of roughly 10 degrees through 26 degrees, or as we used to call them, “line drives.”
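
A bucket-level Hit Rate like the ones above is just hits divided by batted balls within a range. Here is a minimal Python sketch of that calculation, assuming each event has been reduced to a launch angle and a hit-or-not flag; the sample data and the `hit_rate` helper are hypothetical.

```python
# Hypothetical (launch angle in degrees, was_hit) pairs, invented for illustration.
sample = [(70, False), (82, False), (68, True), (15, True), (22, True), (12, False), (-30, False)]

def hit_rate(events, lo, hi):
    """Share of events with lo <= launch angle <= hi that were hits (None if the range is empty)."""
    outcomes = [was_hit for la, was_hit in events if lo <= la <= hi]
    return sum(outcomes) / len(outcomes) if outcomes else None

popup_rate = hit_rate(sample, 67, 90)       # steep pop-ups
line_drive_rate = hit_rate(sample, 10, 26)  # the "line drive" range

print(popup_rate, line_drive_rate)
```

The same function works for EV ranges if the first element of each pair is an exit velocity instead of a launch angle.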

Thus informed, it was time to go back to the original premise, and get the Hit Rates for each of the 9,038 individual EV/LA combinations. The idea was not to chart them or make a table or anything—with 9,038 combinations, the potential usefulness of such a chart or table seemed dubious at best.

But what seemed like it might have promise would be to use the EV/LA Hit Rate table as a database, looking up each 2018 BBE in the table to see what the corresponding Hit Rate was. From there, it seemed that totaling all the Hit Rates for each hitter’s separate BBEs would create an Expected Hits (xH) for that player, which would lead to an Expected Hit Rate (xH%), and ultimately to an xBA.
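
The lookup idea can be sketched in Python as follows. This is an illustration under stated assumptions rather than the actual workbook: the events are invented, the names `build_hit_rate_table` and `expected_hits` are hypothetical, and combinations missing from the table are simply skipped here. Turning xH into an xBA would also need at-bats, which the sketch leaves out.

```python
from collections import defaultdict

# Hypothetical 2017 events, already rounded: (EV, LA, was_hit). Invented values.
events_2017 = [
    (101, 25, True), (101, 25, True), (101, 25, False), (101, 25, True),
    (89, -12, False), (89, -12, True),
    (64, 37, False),
]

def build_hit_rate_table(events):
    """Map each (EV, LA) combination to hits divided by batted balls."""
    tallies = defaultdict(lambda: [0, 0])  # combo -> [hits, batted balls]
    for ev, la, was_hit in events:
        tallies[(ev, la)][0] += int(was_hit)
        tallies[(ev, la)][1] += 1
    return {combo: hits / bbe for combo, (hits, bbe) in tallies.items()}

def expected_hits(bbes, table):
    """Sum the looked-up Hit Rate for each batted ball; combos not in the table are skipped."""
    return sum(table[combo] for combo in bbes if combo in table)

table_2017 = build_hit_rate_table(events_2017)

# Hypothetical 2018 hitter: four batted balls, three actual hits.
hitter_bbes = [(101, 25), (101, 25), (89, -12), (64, 37)]
actual_hits = 3

xh = expected_hits(hitter_bbes, table_2017)
xh_pct = xh / len(hitter_bbes)   # Expected Hit Rate on batted balls
diff = actual_hits - xh          # positive means more hits than expected

print(f"xH = {xh:.2f}, xH% = {xh_pct:.3f}, Diff = {diff:+.2f}")
```

A positive Diff lines up with the "lucky" hitters and a negative one with the "unlucky" hitters.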

Doing this part of the project was easy for me, although the trusty quad-core i7 processor was now glowing bright red and the popcorn ceiling was melting. Of the 249 hitters in the pool (minimum 30 BBE), two-thirds had hit counts within two of their xH. At the extreme of unlucky outcomes were:

Player          BBE   xH    H  Diff
===================================
Santana,Carlos   59   22   11   -11
Votto,Joey       61   24   17    -7
Calhoun,Kole     58   22   15    -7
Abreu,Jose       56   26   19    -7
Springer,George  68   27   21    -6
Seager,Corey     64   24   18    -6
Bruce,Jay        51   19   13    -6
Zimmerman,Ryan   51   18   12    -6
Grichuk,Randal   36   11    5    -6

Based on their EV/LA BBEs, we might expect these hitters to bounce back and start getting more hits.

Meanwhile, at the lucky extreme:

Player          BBE   xH    H  Diff
===================================
Lowrie,Jed       73   28   33    +5
Swanson,Dansby   59   22   27    +5
Wendle,Joey      40   10   15    +5
Almora,Albert    36   10   15    +5

Again based on their BBEs, we would expect these hitters to have had fewer hits, which might imply a pending correction.

In both cases, of course, any movement towards more or fewer hits would depend on the hitters continuing to have BBEs in the same proportions as they’ve had so far.

That’s not the only issue with the project’s findings. The very nature of using discrete data like these means there will be an issue with small sample sizes at the margins. Almost 2,000 of the 9,000+ combinations happened only once, making their predictive power exceedingly questionable. For example, the 2017 BBE table showed a Hit Rate of 100% for the 20 MPH/-22 LA combination, but a 0% Hit Rate for the 20/-21. In both cases, there was one BBE.

As well, a lot of the 2018 BBE EV/LA combinations didn’t occur in 2017, and might take years to total a big enough sample. There are just over 20,000 possible EV/LA combinations, based on minimum and maximum EVs and LAs from 2017 and 2018. Since fewer than half of those possible combinations ever actually occurred, there were bound to be some misses, forcing some interpolation from nearby similar EV/LA combinations. If a BBE of 64 MPH and -37 degrees LA happened in 2018 but not in 2017, I’d find a near-match, like a 63/-37 or a 64/-36. Sometimes there were no such close matches, forcing an estimate based on a collection of nearby results. In other words, buckets. Again.
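
One way such a near-match fallback could be sketched in Python is shown below. The matching rule is an assumption made for illustration, not necessarily how the misses were handled in the workbook: it treats one MPH and one degree as equally "far," widens the search one step at a time, and takes a plain unweighted average of whatever it finds. The function name, the `max_radius` parameter and the table entries are all hypothetical.

```python
def lookup_with_neighbors(combo, table, max_radius=3):
    """Return the Hit Rate for combo, or the average rate of the nearest populated
    combos within max_radius (in MPH and degrees); None if nothing is close enough."""
    if combo in table:
        return table[combo]
    ev, la = combo
    for radius in range(1, max_radius + 1):
        # Collect rates for every table entry within `radius` on both EV and LA.
        nearby = [
            rate for (t_ev, t_la), rate in table.items()
            if max(abs(t_ev - ev), abs(t_la - la)) <= radius
        ]
        if nearby:
            return sum(nearby) / len(nearby)
    return None

# Hypothetical table entries, invented for illustration.
table = {(63, -37): 0.10, (64, -36): 0.20, (70, -30): 0.50}

estimate = lookup_with_neighbors((64, -37), table)  # averages the two adjacent combos
no_match = lookup_with_neighbors((20, 80), table)   # nothing within 3, so None

print(estimate, no_match)
```

Widening the radius is, of course, bucketing by another name. As a check on the size of the grid, the 2017 extremes quoted earlier give (122 - 5 + 1) x (90 - (-85) + 1) = 118 x 176 = 20,768 possible cells, in line with the "just over 20,000" figure.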

After all of this cipherin’, it’s hard to say how useful it will be to look at the Statcast data to this level of EV/LA specificity, at least until there are several more years of data. We already have a pretty decent approximation of xH using the soft-medium-hard and GB-LD-FB data. Yes, again it’s back to the buckets.


