RESEARCH: The Predictive Value of Deserved HR

Introduction

In December 2016, we introduced the concept of Deserved and Undeserved Homeruns (part 1, part 2). We used Statcast data for Launch Angle (LA) and Exit Velocity (EV) of batted balls to determine the likelihood that any batted ball would result in a homerun, and compared that to the actual number of homeruns allowed by a pitcher or hit by a batter.

In this article, we’ll improve the methodology by using a more elegant mathematical fit, improve our handling of park effects on homerun odds, and also investigate the predictive value of Deserved Homeruns.

Improving the Model

Note: we'll spend a little more time on the mathematical model. The analysis of predictive value begins about halfway down.

The original model used a cubic polynomial fit between two thresholds, the left and right edge of the homerun transition zone. 

We found that this did a pretty good job of matching the data, but in aggregate over all batted balls, it tended to underestimate homeruns at the low end of Exit Velocity.  We had to modify the fit further to eliminate systematic errors in our estimate of HR chances.

We revisited the fit function and experimented with a hyperbolic tangent (tanh) function which, once shifted and rescaled, has the advantage of going to 0 as the x-axis approaches -infinity and to 1 as the x-axis approaches +infinity.  The usefulness of the tanh function was also noted by several readers in the forum.  The tanh function fit much better and, without any tweaking afterwards, matched the data with no visible systematic error.
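As a minimal sketch of the idea, a tanh curve can be shifted and rescaled so that it runs from 0 to 1. The parameter names `midpoint` (the EV at which HR odds are 50%) and `width` (the scale of the transition zone), along with the example values, are illustrative assumptions and not the fitted values from our model.

```python
import math

def hr_probability(exit_velocity, midpoint, width):
    """Rescaled tanh: near 0 well below the midpoint, 0.5 at it, near 1 well above."""
    return 0.5 * (1.0 + math.tanh((exit_velocity - midpoint) / width))

# Hypothetical parameters (mph) for a single launch-angle zone.
for ev in (90, 98, 101, 104, 112):
    print(ev, round(hr_probability(ev, midpoint=101.0, width=3.0), 3))
```

Unlike a cubic fit between two thresholds, this form needs no clamping at the edges of the transition zone.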

We also identified a second shortcoming with the model, in the handling of park effects.  The original model used the computed park factors and multiplied the calculated dHR value for each batted ball by the park factor to get a PA dHR.  There were two potential problems with this approach, which we presumed would cancel out:

  1. Batted balls with dHR at or close to 1.0 would end up with a PA dHR of > 1.0, which doesn’t make logical sense.
  2. Batted balls with a dHR of 0 but right near the fringe would end up with a PA dHR of 0, even though park effects could give them a nonzero chance of being a HR in reality.
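Both problems are easy to see with a couple of numbers. The snippet below is only an illustration of the multiplicative approach described above; the function name is hypothetical, and the park factor is the BAL RHB value from the table later in this article.

```python
def multiplicative_pa_dhr(dhr, park_factor):
    """Original approach: scale a batted ball's dHR by the park factor."""
    return dhr * park_factor

bal_rhb = 1.43  # BAL RHB park factor from the table in this article

sure_thing = multiplicative_pa_dhr(1.0, bal_rhb)  # exceeds 1.0, not a valid probability
near_miss = multiplicative_pa_dhr(0.0, bal_rhb)   # stays 0.0 however friendly the park

print(sure_thing, near_miss)
```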

As noted in the original work, the sample size of batted balls by LA and EV is not large enough to compute a transition zone for each ballpark. However, by pooling together all batted balls in an LA zone spanning 21 to 36 degrees, our sample is just large enough to begin calculating zones for individual ballparks.  We can calculate the league average for this EV range, then calculate a fit for each of the 30 MLB parks by both LHB and RHB.

Here are four examples, with the baseline model in the upper-left, followed by Baltimore, Detroit, and Pittsburgh. We chose these three because they give a cross-section of the variety we see in the results. 

The L and R in the legends refer to LHB & RHB, respectively.

Note: the charts do look a little noisy, but in many cases the outliers contain as few as 10-20 batted balls.  When fitting the curves, we weight by the number of batted balls at each exit velocity, so we can be confident the outliers aren’t pulling our fit off target.
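To illustrate why count-weighting protects the fit, here is a small self-contained sketch. It uses a coarse grid search in place of whatever fitting routine one would normally use, and the data are synthetic; the function names, grid ranges, and bin counts are all assumptions made for the example.

```python
import math

def tanh_curve(ev, midpoint, width):
    return 0.5 * (1.0 + math.tanh((ev - midpoint) / width))

def weighted_sse(points, midpoint, width):
    """points: (exit_velocity, observed_hr_rate, n_batted_balls) tuples."""
    return sum(n * (rate - tanh_curve(ev, midpoint, width)) ** 2
               for ev, rate, n in points)

def fit_tanh(points):
    """Coarse grid search over midpoint (90-110 mph) and width (1-8 mph)."""
    grid = ((90 + 0.1 * i, 1 + 0.1 * j) for i in range(201) for j in range(71))
    return min(grid, key=lambda p: weighted_sse(points, *p))

# Synthetic bins: a clean curve plus one noisy, low-count outlier.
points = [(ev, tanh_curve(ev, 101.0, 3.0), 400) for ev in range(90, 113)]
points.append((95, 0.60, 12))  # only 12 batted balls, so it barely moves the fit

midpoint, width = fit_tanh(points)
print(round(midpoint, 1), round(width, 1))
```

Because each squared error is multiplied by the number of batted balls in the bin, a wild reading from a dozen balls carries far less weight than a well-populated bin.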

Examining the differences in the curve fits above, we see they seem to make sense with what we know. 

  • In Baltimore, the 50% point is at about 99 mph, compared with a value of 101 mph for all parks, consistent with the previous finding that Camden Yards adds HR for both RHB and LHB. 
  • In Detroit, right field is 5-10 feet shorter than left field, and this is reflected in the fact that the 50% point for LHB is about 1 mph less than for RHB.
  • In Pittsburgh, right field in PNC Park has a 21-foot-high wall, so HR odds are more dependent on LA than with a normal-height fence; since we sum over many LA, we get a wider transition zone.

Also, the width of the HR transition zone changes from park to park, likely influenced by factors such as temperature and wind variability and by variation in ballpark dimensions (e.g. LF is 345’ in Comerica, while center is 420’).

We calculate the 50% point for each park-handedness combination and note the EV Delta, the amount by which the dHR curve is shifted relative to average. The EV Delta is in good agreement with the park factors we found previously:

The X-axis here shows the effective exit velocity delta between the baseline and that park. Higher park factors can be conceptualized as “adding” exit velocity relative to an average ballpark.

So now, for each batted ball, we can add the park EV delta (positive or negative) to its exit velocity to get an effective exit velocity. We also adjust the transition zone width as determined park by park*.

(* Note: we had to decide whether to adjust the width or not.  On one hand, the width factor could be artificial, a result of summing over many launch angles, as we theorized for the Pittsburgh LHB case.  On the other hand, it could be real, for example in a ballpark with highly variable wind.  In the end we decided to include it, since long-term we will improve and refine this estimate and will want it in the model, and the differences between including it and leaving it out were very slight in the aggregate.)
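The EV-shift adjustment might be sketched as follows. The EV deltas are copied from the table in this article and the 101 mph baseline midpoint is the all-parks 50% point quoted above; the 3 mph width is a placeholder assumption, the function and dictionary names are hypothetical, and the park-by-park width adjustment is left out for brevity.

```python
import math

# EV deltas (mph) from the table in this article: (park, batter hand) -> delta.
EV_DELTA = {("BAL", "R"): 2.0, ("BAL", "L"): 1.3, ("SF", "L"): -3.1, ("SF", "R"): -1.4}

def park_adjusted_dhr(exit_velocity, park, hand, midpoint=101.0, width=3.0):
    """Shift the ball's EV by the park delta, then evaluate the baseline curve."""
    effective_ev = exit_velocity + EV_DELTA[(park, hand)]
    return 0.5 * (1.0 + math.tanh((effective_ev - midpoint) / width))

# The same 99 mph ball hit by a RHB in Baltimore and by a LHB in San Francisco.
print(round(park_adjusted_dhr(99.0, "BAL", "R"), 3))
print(round(park_adjusted_dhr(99.0, "SF", "L"), 3))
```

Because the park effect moves the ball along the curve instead of scaling the result, the output stays within 0 to 1, and a ball just short of the fringe can pick up a nonzero chance in a friendly park.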

The leaderboards are nearly the same as before, with some batters moving by a few places in the standings, so we won’t repeat them here. The park factors can now be expressed either as before (a multiplier times average home runs) or as an EV adjustment.  The Park Factors are useful when considering larger sample sizes, while the EV delta is useful when comparing individual batted balls.

         Park Factors    EV Delta (MPH)
Park      LHB    RHB       LHB    RHB
====     ====   ====      ====   ====
ARI      1.08   1.12       0.1    0.3
ATL      0.84   0.80      -1.4   -1.8
BAL      1.29   1.43       1.3    2.0
BOS      0.88   1.14      -1.2   -0.1
CHC      1.02   1.08       0.5   -0.1
CIN      1.13   1.15       0.5    0.8
CLE      1.12   1.01       0.0    0.5
COL      1.11   1.15       0.3    0.9
CWS      1.19   1.23       0.7    0.5
DET      1.02   0.94      -1.0   -1.5
HOU      1.09   1.02      -1.0   -0.8
KC       0.84   0.86      -1.1   -0.3
LAA      0.93   0.97       0.2   -0.0
LAD      1.17   1.13       1.3    1.3
MIA      0.92   0.82      -0.9   -1.3
MIL      1.26   1.18       1.5    1.4
MIN      0.95   1.04      -1.1    0.5
NYM      0.98   1.03       0.1   -0.1
NYY      1.38   1.16       1.5    1.0
OAK      0.86   0.87      -0.5   -1.0
PHI      1.13   1.19       0.9    0.5
PIT      1.08   0.91       0.1   -0.4
SD       0.97   1.06      -0.1    0.4
SEA      1.11   1.09       0.5    0.5
SF       0.67   0.86      -3.1   -1.4
STL      0.82   0.78      -0.6   -1.4
TB       1.06   0.97       0.2   -0.3
TEX      1.11   1.01       0.5   -0.1
TOR      1.02   0.99       0.1   -0.2
WSH      0.89   0.96      -0.3   -0.2

 

Predictive Value of Deserved Homeruns—Pitchers

It is well-established that HR/FB for pitchers is not sticky from year to year.  The best predictor for HR/FB for pitchers is league average HR/FB.  We only have two years of Statcast data to utilize, but let’s check whether dHR does any better than simply using league average HR/FB to predict next year’s HR/FB.

To evaluate “next year’s” HR total, we will use dHR, without park adjustment.  This number is closest to the true talent level of the pitcher or batter.  If we know dHR, we can park-adjust it to forecast actual HR.

We will look at four possible predictors for 2016 dHR/FB and see which had the best correlation.  All will be divided by the sum of flyballs.

Predictor             Note
=================    =================================================
2015 HR/FB            Actual HR
2015 dHR/FB           Sum of deserved HR (HR odds in an average park)
2015 PA dHR/FB        Sum of dHR, adjusted by ballpark where hit
                      (HR odds in the ballpark where hit)
2015 PN HR/FB         Park-neutral HR (park effect removed)

We’ll then compare the best of those to our standard:

2016 FB x 11.4% *

* Note: This is the league-wide HR/FB rate for 2015
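In code, the baseline is a one-liner. The function name and the 200-fly-ball pitcher are hypothetical; only the 11.4% rate comes from the text.

```python
LEAGUE_HR_PER_FB_2015 = 0.114  # league-wide 2015 HR/FB rate quoted above

def baseline_predicted_hr(fly_balls_2016):
    """Predicted HR allowed if the pitcher simply matches league-average HR/FB."""
    return fly_balls_2016 * LEAGUE_HR_PER_FB_2015

# A hypothetical pitcher who allows 200 fly balls in 2016.
print(round(baseline_predicted_hr(200), 1))
```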

The four predictors give scatter charts that look very similar, all with very low R2 values.  Below is an example for dHR/FB for 2016 plotted against dHR/FB for 2015, where the fit is weighted by the geometric mean of fly balls in 2015 and 2016:

Predictor              R2
==============       =====
2015 HR/FB           0.003
2015 dHR/FB          0.012
2015 PA dHR/FB       0.013
2015 PN HR/FB        0.003
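For readers who want to reproduce this kind of comparison, a weighted R2 can be computed roughly as below. This sketch assumes the R2 comes from a weighted least-squares straight line, in which case it equals the squared weighted correlation; the function name and the four sample pitchers are made up for illustration.

```python
import math

def weighted_r2(x, y, w):
    """Squared weighted Pearson correlation between x and y."""
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return cov * cov / (vx * vy)

# Hypothetical pitchers: (2015 dHR/FB, 2016 dHR/FB, 2015 FB, 2016 FB).
rows = [(0.10, 0.12, 150, 180), (0.13, 0.11, 90, 60),
        (0.09, 0.10, 200, 210), (0.12, 0.13, 40, 120)]
x = [r[0] for r in rows]
y = [r[1] for r in rows]
w = [math.sqrt(r[2] * r[3]) for r in rows]  # geometric mean of fly balls in the two years

print(round(weighted_r2(x, y, w), 3))
```

The geometric mean keeps a pitcher with many fly balls in one season but few in the other from dominating the fit.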

Not surprisingly, prior year raw HR/FB is very noisy (park-adjusted or not), with almost no correlation.  dHR does a little bit better at predicting next year’s result.  How does it compare to a simple 11.4% HR/FB? To stay consistent, we used the same weighting as before and plotted the actual dHR allowed in 2016 against the HR predicted by each method (predicted HR/FB times FB allowed in 2016):

 

Left chart: R2 = 0.84          Right chart: R2 = 0.85

These look very similar, and it initially appears that dHR does just as well as league average HR/FB.  However, it is vital to note that the graph on the left uses the league-wide 2015 HR/FB rate to predict 2016 Deserved Homeruns, with no knowledge of 2016.  The graph on the right uses knowledge of the relationship between 2015 PA dHR/FB and 2016 dHR, and even so, is only insignificantly better.  So essentially, last year’s dHR/FB rate might, at best, make your prediction marginally better.  Our conclusion therefore is that regression to league average HR/FB is still the best predictor.

 

Predictive Value of Deserved Homeruns—Batters

Homeruns per Flyball for batters do not, generally speaking, regress to league averages.  According to work done by Joshua Randall:

“Each batter establishes an individual home run to fly ball rate that stabilizes over rolling three-year periods; those levels strongly predict the hr/f in the subsequent year.”

We’ll look at several predictors to see which does the best at predicting 2016’s park-neutral HR/FB. As before, we’ll weight our scatter plots by the geometric mean of total flyballs on the two axes; for example, when comparing 2015 and 2016, each batter’s weight is sqrt(2015 FB x 2016 FB).

First, we’ll examine traditional HR/FB, park neutralized, in one-year, two-year, and three-year averages.

We created the weighted scatter plots for each scenario and found the following R2 values for each relationship:

Predictor (target: 2016 PN HR/FB)    R2
=================================   ====
2015 PN HR/FB                       0.44
2014-15 PN HR/FB                    0.49
2013-15 PN HR/FB                    0.53

So each year of additional batter history improved the predictive value slightly.

Now let’s look at dHR/FB:

This gives an R2 of 0.49, not as good as three years of HR/FB, but on par with two years of HR/FB.

We only have two years of Statcast batted ball data from which to model deserved homeruns, so we can only examine this for one season so far.  However, it’s promising that one year of dHR/FB gives us as good an understanding of a batter’s true HR rate as two years of outcome-based data. In addition, dHR/FB is fairly sticky for hitters.  The 2015 and 2016 values, when weighted by number of fly balls, have a correlation of 62%.  By comparison, PN HR/FB values for the same period had a correlation of 36%.

It is possible that there are as yet unknown batted ball tendencies that would cause a batter to underperform his dHR.  For example, if his best-struck balls are hit to dead center, that would yield fewer HR than the model would suggest. Or, if a batted ball is struck with topspin rather than backspin, it will not travel nearly as far. These are items to be investigated as better data becomes available.

In the meantime, what does this mean for 2017? Since two years of PN HR/FB data does better than one, let’s take the leap that two years of dHR/FB data is also better than one. We can then identify the batters with the greatest discrepancies between their 2-year dHR/FB and 3-year PN HR/FB.  These are candidates to outperform their projected HR totals in 2017.

                       '15-'16      '14-'16
Player                  dHR/FB     PN HR/FB    Diff
===================   ==========   ========   ======
Miguel Cabrera           25.5%       17.2%     8.2%
Howie Kendrick           18.2%       10.0%     8.2%
Christian Yelich         25.2%       17.3%     7.9%
Nick Castellanos         16.4%        9.9%     6.6%
Ryan Howard              26.6%       20.1%     6.4%
Joey Votto               25.9%       19.8%     6.1%
Matt Carpenter           17.3%       11.3%     6.0%
Joe Mauer                15.1%        9.5%     5.6%
Billy Butler             14.2%        8.7%     5.5%
Justin Smoak             21.1%       15.8%     5.3%
Shin-Soo Choo            21.3%       16.1%     5.2%
Chris Davis              30.3%       25.1%     5.2%
DJ LeMahieu              12.2%        7.1%     5.1%
Freddie Freeman          21.3%       16.3%     4.9%
Mitch Moreland           20.2%       15.4%     4.8%
Seth Smith               17.7%       13.0%     4.7%
Khris Davis              26.3%       21.6%     4.7%
Melvin Upton             17.3%       12.6%     4.7%
J.D. Martinez            24.1%       19.4%     4.6%
John Jaso                14.8%       10.2%     4.6%
Hunter Pence             18.6%       14.1%     4.5%
Eric Hosmer              18.4%       14.1%     4.3%
Chris Johnson            13.0%        8.8%     4.3%
Mike Trout               24.9%       21.0%     3.9%
Francisco Cervelli        9.2%        5.3%     3.9%
Yoenis Cespedes          19.1%       15.3%     3.8%

Here are the bottom 26, who are candidates to have their HR totals regress further than projected:

                      '15-'16      '14-'16
Player                  dHR/FB     PN HR/FB    Diff
===================   ==========   ========   ======
Wilmer Flores             6.1%       10.7%    -4.6%
Alexei Ramirez            2.6%        6.8%    -4.2%
Melky Cabrera             4.4%        8.5%    -4.1%
Michael Brantley          6.9%       10.9%    -4.0%
Jose Abreu               15.8%       19.8%    -3.9%
Jose Bautista            14.0%       17.7%    -3.6%
Derek Norris              6.5%       10.1%    -3.5%
Justin Bour              16.7%       20.0%    -3.3%
Jimmy Rollins             5.0%        8.1%    -3.1%
Bryce Harper             16.6%       19.6%    -3.1%
Danny Espinosa           12.3%       15.3%    -3.0%
Salvador Perez            7.8%       10.6%    -2.8%
Mike Aviles               2.1%        4.8%    -2.8%
Didi Gregorius            5.2%        7.9%    -2.7%
Steve Pearce             13.5%       16.2%    -2.7%
Marcus Semien             9.1%       11.7%    -2.7%
Kolten Wong               5.6%        8.0%    -2.5%
Todd Frazier             14.6%       17.0%    -2.4%
Josh Reddick              6.8%        9.2%    -2.4%
Freddy Galvis             6.4%        8.6%    -2.2%
Travis d'Arnaud           9.1%       11.2%    -2.1%
Carlos Beltran           11.2%       13.3%    -2.1%
Albert Pujols            13.8%       15.9%    -2.1%
Brian Dozier             12.3%       14.3%    -2.1%
Mike Moustakas            9.1%       11.1%    -2.0%
Dustin Pedroia            6.5%        8.4%    -2.0%
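The ranking behind these lists is a simple difference and sort. The sketch below uses a handful of rows copied from the tables; the function name is hypothetical. Note that differences computed from the rounded percentages shown here can be off by 0.1 from the Diff column, which was presumably computed before rounding.

```python
# (player, 2-year dHR/FB, 3-year PN HR/FB) in percent, copied from the tables above.
batters = [
    ("Miguel Cabrera", 25.5, 17.2),
    ("Mike Trout", 24.9, 21.0),
    ("Wilmer Flores", 6.1, 10.7),
    ("Bryce Harper", 16.6, 19.6),
]

def rank_by_discrepancy(rows):
    """Sort by dHR/FB minus PN HR/FB, largest positive gap first."""
    return sorted(((name, round(dhr - pn, 1)) for name, dhr, pn in rows),
                  key=lambda item: item[1], reverse=True)

for name, diff in rank_by_discrepancy(batters):
    print(f"{name:<16}{diff:+.1f}")
```

Batters at the top of the sorted list are the candidates to outperform their projections; those at the bottom are the regression candidates.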

Whatever projection system you use, check where their projected HR/FB falls with respect to dHR and 3-year HR/FB. Remember that the values above are park-neutral, so the hitter's home park will still have an effect. You may want to shift your expectations toward the middle of the range.

 

Conclusions

By using a hyperbolic tangent fit rather than a third order polynomial, we were able to get a cleaner fit to the homerun transition zone.  We were also able to pool together batted balls from a range of launch angles to estimate park effects in terms of an Exit Velocity offset for each ballpark, by batter handedness.  These agreed well with the previously calculated park factors.

We find again for pitchers that there is no better predictor for next year’s HR/FB than league average HR/FB.  However, for batters we found that the previous year’s dHR/FB predicted current-year park-neutral HR/FB (R2 of 0.49) about as well as a batter’s two-year PN HR/FB rate.

The Errant Gust of Wind is beginning to reveal some of its secrets.


