RESEARCH: Updating xBA

Over the last few years, we have spent a lot of time in this space discussing the new hard-hit ball data that is available on BaseballHQ.com. One of the main advantages of using hard-hit ball data is that it better measures a fundamental skill, rather than a result that may be influenced by events outside of the batter's control.

Our previous research here at BaseballHQ.com around hard-hit balls for batters has shown that players who hit the ball harder have higher hit rates and hit for more power. We've also used hard-hit ball data—hit rates on soft-hit and medium-hit GB, specifically—to help us more accurately measure a player's speed (Spd). Most recently, we found that the current expected batting average (xBA) formula may understate the effect of Spd on batting average.

Today we introduce a new xBA formula that makes use of hard-hit data and Spd. The current xBA formula uses a batter's GB, LD, and FB rates as well as his linear-weighted power and speed (PX and SX, respectively) to measure expected BA. This formula has worked very well and is intuitive: players who hit for more power and have greater speed can be expected to hit for a higher average.

Approach

The updated approach to xBA estimates batting average as a player's contact rate multiplied by his batting average on all balls in play, inclusive of home runs. In contrast to the current formula, the new formula calculates batting average on all balls in play using hard-hit ball data rather than PX and it uses Spd rather than SX.

There are two theoretical advantages of this revised approach. First, in his research presenting the new Spd metric, Ed DeCaria showed that Spd is closely correlated with home-to-first times, which one would reasonably believe is positively correlated with batting average. Second, by parsing between different categories of GB, LD and FB based upon how hard they are hit, we should be able to more accurately predict the outcome of those batted balls. As an easy example, the difference in likely outcome between a soft-hit fly ball and a hard-hit fly ball is rather dramatic; the former is almost always an out whereas the latter results in a hit more than half of the time.

The equation is:

xBA = CT%*[0.002 + 0.0036*GBS%*ln(Spd) + 0.012*GBM%*ln(Spd) + 0.011*GBH%*ln(Spd) + 0.0063*FBS% + 0.079*FBH% + 0.013*LDS% + 0.058*LDM% + 0.056*LDH%],

where GBS%, for example, represents the batter's soft-hit ground balls as a percentage of all his balls put into play, divided by the league-average percentage for that year. For example, if 10% of a batter's balls in play were classified as soft-hit ground balls and the league average was 8%, then GBS% would equal 1.25.
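
As a rough illustration of how the pieces of the equation fit together, here is a minimal Python sketch that evaluates it for one batter. The function names, dictionary keys, and the sample batter (league average in every category, Spd of 100, 80% contact rate) are hypothetical, and treating CT% as a decimal is an assumption; only the coefficients come from the equation above.

```python
import math

# Coefficients copied from the equation above. Keys are hypothetical
# shorthand: batted-ball type (GB/FB/LD) plus S, M, or H for soft, medium, hard.
GB_COEF = {"GBS": 0.0036, "GBM": 0.012, "GBH": 0.011}   # each scaled by ln(Spd)
AIR_COEF = {"FBS": 0.0063, "FBH": 0.079,
            "LDS": 0.013, "LDM": 0.058, "LDH": 0.056}   # no speed term
INTERCEPT = 0.002


def relative_rate(batter_pct, league_pct):
    """Batter's share of balls in play in a category, divided by the league share."""
    return batter_pct / league_pct


def xba(ct, spd, rel):
    """Evaluate the equation.

    ct  -- contact rate as a decimal (assumption: 0.80 means 80%)
    spd -- the batter's Spd rating
    rel -- dict mapping category name to its relative rate
    """
    log_spd = math.log(spd)  # natural logarithm
    total = INTERCEPT
    total += sum(c * rel[k] * log_spd for k, c in GB_COEF.items())
    total += sum(c * rel[k] for k, c in AIR_COEF.items())
    return ct * total


# The worked example from the text: 10% for the batter vs. 8% for the league.
gbs_example = relative_rate(10, 8)  # 1.25

# Hypothetical batter who is exactly league average in every category.
league_average = {k: 1.0 for k in list(GB_COEF) + list(AIR_COEF)}
example = xba(ct=0.80, spd=100, rel=league_average)
print(round(example, 3))
```

Under those assumed inputs the sketch returns an xBA of roughly .269, which suggests the coefficients are scaled so that an average profile yields a roughly average batting average.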

Consistent with our previous research, hard-hit GB and FB and all forms of LD have the greatest effect on xBA, and speed also positively affects a player's xBA.

Technical comments

We examined using linear and polynomial measures of speed, but the logarithmic function produced the most accurate results, consistent with the current xBA formula. The nature of the logarithmic function is that effectively there are diminishing returns to speed: an increase in Spd from 90 to 100 has a larger effect on xBA than does an increase from 100 to 110.
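
The diminishing-returns point can be checked with a few lines of Python. This is only a sketch of the arithmetic behind the logarithm; the helper name is hypothetical, and the size of the effect on xBA for a given batter would also depend on his ground-ball rates and contact rate.

```python
import math


def spd_step_effect(spd_from, spd_to):
    """Change in ln(Spd) when Spd moves from spd_from to spd_to."""
    return math.log(spd_to) - math.log(spd_from)


low_step = spd_step_effect(90, 100)    # about 0.105
high_step = spd_step_effect(100, 110)  # about 0.095
print(low_step > high_step)
```

Because every ground-ball term multiplies ln(Spd), the same 10-point gain in Spd moves xBA a little less the faster the batter already is.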

We also considered using a component of speed tied to various FB and LD rates, but that reduced the best fit of the formula. No doubt there are one-off examples to the contrary, but across the overall population, speed doesn't appear to help a batter convert line drives or fly balls into hits after accounting for differences in the hard-hit ball rates.

A batter's relative rates of soft-hit, medium-hit, and hard-hit balls are calculated in comparison to the league average for the year so as to compensate for any differences in coding from year to year. As shown in a recent column, hard-hit rates do deviate from year to year. Comparing these rates to the yearly average is consistent with the existing formula, as both PX and SX are also calculated in reference to the yearly league average.

It is worth noting that one of the components of Spd is a batter's hit rates on soft-hit GB and medium-hit GB. Thus, using Spd as a component in estimating xBA creates a circular reference. However, this relationship does not overwhelm the calculation. First, hit rates on soft-hit and medium-hit GB represent just one of four components used to calculate Spd, and thus, Spd is not overly sensitive to this data. Second, as Spd is only used in conjunction with GB, it only affects a portion of the overall equation.

We also note the potential confirmatory bias inherent in using hard-hit ball data. It is plausible that the person responsible for coding a ball as a soft-hit, medium-hit or hard-hit ball is more likely to classify it as hard-hit if it fell for a hit rather than if it resulted in an out. However, our previous research regarding hard-hit ball data and expected power (xPX) demonstrated that outliers typically reverted to levels supported by the hard-hit ball data. This suggests that while some bias may exist, the underlying data doesn't appear to be primarily results-driven.

Finally, FBM% is purposefully omitted from the equation so as to avoid multicollinearity, because GB+LD+FB necessarily sum to 1. Amongst all balls put into play, medium-hit FB are the least likely to fall for hits.

Evaluation and Comparison

So, how does the new formula do?

To measure whether the revised formula is an improvement over the current formula, we looked at the average difference (in absolute terms) between xBA and actual BA by year (100 AB min).

       Sample    Current      New
Year     Size        xBA      xBA
====   ======    =======     ====
2006      415       18.0     16.7
2007      427       19.9     18.2
2008      436       16.7     17.0
2009      417       17.9     17.2
2010      433       18.4     19.0
Total    2128       18.2     17.6
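
The Total row appears consistent with weighting each year's figure by its sample size. That reading is an inference from the numbers rather than something stated in the table, and the short Python check below uses only the values shown above (the variable names are illustrative).

```python
# Yearly rows from the table above: (sample size, current xBA, new xBA).
rows = [
    (415, 18.0, 16.7),  # 2006
    (427, 19.9, 18.2),  # 2007
    (436, 16.7, 17.0),  # 2008
    (417, 17.9, 17.2),  # 2009
    (433, 18.4, 19.0),  # 2010
]

n = sum(size for size, _, _ in rows)
# Weight each year's average difference by the number of players in that year.
current_total = sum(size * cur for size, cur, _ in rows) / n
new_total = sum(size * new for size, _, new in rows) / n

# Years in which the new formula had the smaller average difference.
new_better_years = sum(1 for _, cur, new in rows if new < cur)

print(n, round(current_total, 1), round(new_total, 1), new_better_years)
```

The weighted averages round to the 18.2 and 17.6 shown in the Total row.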

In three of the five years, the new xBA formula predicted current-year BA with higher accuracy than the current formula (2008 and 2010 were the exceptions), although the 5-year aggregate difference was not overwhelming. However, we found significant differences between the accuracy of the xBA formulas when we looked at players with different Spd ratings:

        Current    New
Spd         xBA    xBA
===     =======   ====
<75        16.5   18.0
75-100     17.4   17.3
101-125    18.7   18.3
126+       21.8   16.9

The current xBA does a better job of predicting current-year BA for the slowest cohort of players, but the new version seems to be an improvement for most of the population, particularly for the fastest players.

We should point out that simply being more accurate isn't necessarily a goal unto itself. After all, the goal of any xBA equation is simply to measure what batting average should be based on to-date events after stripping out various elements of luck. And so, if we devised an xBA that perfectly matched actual BA, it really wouldn't be of much fanalytic use.

However, if xBA is truly successful in stripping out luck, then we would expect it to be unbiased in its direction. That is, we would expect half of the players to have an xBA greater than their actual BA and the other half to have an xBA less than their actual BA.

              Current        |          New
              -------        |        -------
        xBA > BA   xBA < BA  |  xBA > BA   xBA < BA
        ========   ========  |  ========   ========
2006         34%        64%  |       48%        52% 
2007         30%        68%  |       54%        46%
2008         46%        52%  |       55%        45%
2009         40%        59%  |       59%        46%
2010         45%        54%  |       60%        40%
Total        39%        59%  |       55%        45%

In general, the current xBA formula is more likely to predict an xBA less than actual BA whereas the new xBA formula is more likely to predict an xBA greater than actual BA. Overall, the revised formula appears to be more balanced.
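
A split like the one in the table above could be computed along the following lines. This is a minimal Python sketch: the function name and the four-player sample are hypothetical, and counting exact ties separately is an assumption about why a row's two percentages need not sum to 100%.

```python
def direction_split(xba_values, ba_values):
    """Return the shares of players with xBA above, below, and equal to BA."""
    pairs = list(zip(xba_values, ba_values))
    n = len(pairs)
    over = sum(1 for x, b in pairs if x > b)
    under = sum(1 for x, b in pairs if x < b)
    ties = n - over - under
    return over / n, under / n, ties / n


# Hypothetical four-player sample, values rounded to three decimals.
sample_xba = [0.280, 0.265, 0.301, 0.250]
sample_ba = [0.270, 0.275, 0.290, 0.250]

over, under, ties = direction_split(sample_xba, sample_ba)
print(over, under, ties)
```

An unbiased formula would put the first two shares close to each other, which is the standard applied to both formulas here.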

Perhaps the most useful application of xBA is as a predictor of future performance. The following summarizes the average absolute difference between xBA in year 1 and actual BA in year 2 for both the current and new formulas.

          Current     New
Spd           xBA     xBA
===       =======    ====
<75          26.5    26.1
75-100       25.4    26.1
101-125      26.0    26.2
126+         24.6    24.8
All          25.7    25.9

In general, both measures fare about equally, and each does less well predicting future BA than it does current-year BA. However, the new xBA formula appears to introduce additional bias when it comes to projecting future performance. The current xBA over-predicted future BA 48% of the time, whereas the new formula's rate was 58%.

Conclusion

Ultimately, it does appear that the newly revised xBA formula presents some advantages over the current formula, most notably for faster players. We conclude by illustrating this with two players with whom xBA has historically had trouble: Derek Jeter and Ichiro Suzuki. The following tables compare their actual BA with their xBA using both the current and new xBA formulas.

Derek Jeter

       Current    New   Actual
Year       xBA    xBA       BA
====   =======    ===   ======
2006      .299   .319     .343
2007      .287   .312     .322
2008      .276   .308     .300
2009      .289   .312     .334
2010      .283   .284     .270

Ichiro Suzuki

       Current    New   Actual
Year       xBA    xBA       BA
====   =======    ===   ====== 
2006      .275   .313     .322
2007      .277   .319     .351        
2008      .284   .308     .310
2009      .284   .298     .352
2010      .269   .281     .315
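
To put a number on the two tables above, the Python sketch below computes each player's average absolute difference between xBA and actual BA over the five seasons. The function and variable names are hypothetical, and expressing the gap in points (thousandths of batting average) is an assumption; all values are copied from the tables.

```python
def mean_abs_diff_points(xba_values, ba_values):
    """Average absolute gap between xBA and BA, in points (.001 of BA)."""
    gaps = [abs(x - b) * 1000 for x, b in zip(xba_values, ba_values)]
    return sum(gaps) / len(gaps)


# 2006-2010 rows copied from the two tables above.
players = {
    "Jeter": {
        "current": [.299, .287, .276, .289, .283],
        "new":     [.319, .312, .308, .312, .284],
        "actual":  [.343, .322, .300, .334, .270],
    },
    "Ichiro": {
        "current": [.275, .277, .284, .284, .269],
        "new":     [.313, .319, .308, .298, .281],
        "actual":  [.322, .351, .310, .352, .315],
    },
}

results = {}
for name, rows in players.items():
    cur = mean_abs_diff_points(rows["current"], rows["actual"])
    new = mean_abs_diff_points(rows["new"], rows["actual"])
    results[name] = (round(cur, 1), round(new, 1))
    print(name, results[name])
```

On these ten player-seasons the new formula's average miss is about half that of the current formula for both players (roughly 16 vs. 32 points for Jeter and 26 vs. 52 for Ichiro), though it still sits below actual BA in most of those seasons.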

Acknowledgements

A number of other BHQ staff provided helpful commentary on this research. I'd like to express my sincere appreciation to Dave Adler, Patrick Davitt, Doug Dennis, Ed DeCaria and Michael Weddell for their insightful comments and suggestions.


