RESEARCH: Single Game Outcomes and the New Matchup Tool

(NOTE: To see the new Matchup Scores produced by this method for the final week of the 2017 season, click here.)

 

Introducing the Updated Starting Pitcher Matchup Tool

BaseballHQ.com subscribers will be familiar with the Starting Pitcher Matchup Tool, which has graced the site for years. It lists all projected starting pitchers for the current day, as well as an 8-day scan, along with their opponent and a matchup score. The Matchup Tool was based on the PQS averages of both the starter and the opposing team, and it did a nice job of identifying which matchups were solid, which were iffy, and which were plain bad. It helped fanalytic owners plan their weekly lineups, and certainly helped win a few titles along the way.

When the updated PQS was introduced at the beginning of 2016, it changed the PQS inputs to the SP Matchup Tool. It wasn’t exactly broken, but it had changed enough that more research was needed to re-evaluate the matchup index. Well, we love doing research, so what started as a tweak to tame an unruly algorithm turned into a complete overhaul. We have stripped down the SP Matchup Tool and built it back up with brand new parts. It still serves the same function—to identify pitchers that should be the best fantasy producers on a given day—and the layout will be similar, but under the hood? All new.

[NOTE: It was our intention to have this up and running earlier in 2017, but we ran into some technical challenges in getting it published in a usable format. Treat these final two weeks of the season as our beta run; the PQS-based DAILY MATCHUPS column will continue to use the current system, but we'll link out to the 8-day scans of the new system in that column. Come 2018, this new Matchup system will be fully integrated into BaseballHQ.com. —Ed.]

Motivation

It started with a simple objective: to predict single game outcomes as closely as possible. More specifically, we want to know how each start is likely to influence the statistics in the most common starting pitcher categories: Strikeouts, ERA, WHIP, and Wins. While attempting to predict the outcome of a single game is a fool’s errand, we believed we could use existing metrics to predict tendencies that would add up to an advantage over an entire season. Many small decisions adding up to victory is, after all, what makes a great fanalytic manager.

Methodology

To project the results of an individual game, we sought first to project what would happen in all the pitcher-batter matchups within the game, then build from there. We used dozens of parameters describing the pitcher, the opposition’s offense, and the ballpark to create estimates for four rate parameters—K%, BB%, GB%, and FB%—along with expected number of batters the starter would face.

We then created a second tier of formulas that predicted H%, HR/FB, and the efficiency of the pitcher’s defense in turning batted balls into outs.  Finally, we used those numbers to create a third layer of formulas to predict the numbers we ultimately care about: Strikeouts, ERA, WHIP, and Wins.

For this effort, we used logs of every pitching appearance from 2010-2016.  At each step, we created several versions of formulas to predict the matchup outcomes, and then we validated the results. Formulas couldn’t just give good results overall. They needed to give good results in each year, and for pitchers with a lot of history, or just a little. They had to work in April as well as September.

The First Tier of Matchup BPIs

We began with strikeouts, as one only needs to know K% and total batters faced (TBF) to calculate strikeouts. According to the invaluable work of Russell Carleton, we know that the stabilization point for strikeout rate is 60 PA for batters and 70 TBF for pitchers, so initially we used rolling 30-day averages for the pitcher and the opposing team to calculate an expected K%.

Using just these parameters, we created our first formula that calculated the matchup K% (mK%) for a game:

               SwK%    Pitcher’s 30-day Swinging Strike Rate
               K%      Pitcher’s 30-day K/TBF
               B_K%    Opposition’s 30-day K/PA (against SP of the same handedness as our SP)
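
To make the shape of that first formula concrete, here is a minimal Python sketch. Only the three inputs come from the study; the intercept and weights are hypothetical placeholders, not the fitted coefficients.

    def m_k_pct_v1(swk_30, k_30, b_k_30):
        """First-pass matchup K% from the three 30-day inputs above.

        Coefficients are hypothetical placeholders for illustration.
        """
        # SwK% runs on a smaller scale (~8-14%) than K% (~15-25%),
        # so its placeholder weight is larger.
        return -0.02 + 0.60 * swk_30 + 0.45 * k_30 + 0.45 * b_k_30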

When put into bins sized to a single percentage point, we got an overall R² value of 0.92, and nothing less than 0.89 in any year.

However, several problems emerged. First, our slope of 0.7 told us that the actual K% varied less than our formula predicted. Second, the formula only worked for pitchers with 30-day history; if a pitcher had fewer than 5 games started in the last 30 days, it fell apart.

We went back to the bench and folded in the pitcher’s projected K%, and mK% got better. The R² values went up, and the slopes in any given season ranged from 0.97 to 1.04. However, we still had a problem: since the bulk of the pitchers had 30-day history, the model leaned heavily on those results. But for pitchers with little recent history (zero, one, or two starts), we were giving too much weight to the history and not enough to the projections.

Eventually we created a tiered projection, using a different formula depending on the pitcher’s TBF over the last 30 days (TBF30):

               Tier 0: TBF30 < 10
               Tier 1: 10 ≤ TBF30 < 50
               Tier 2: 50 ≤ TBF30 < 100
               Tier 3: TBF30  ≥ 100

Tier 0 used only the projected K% and park factor as input.  The other tiers used all of our input parameters, weighted differently by tier, to produce an output formula.
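
In code, the tier selection is a simple threshold ladder. The dispatch below is a sketch: the per-tier blend weights are hypothetical placeholders, and applying the park factor multiplicatively is our assumption; only the tier boundaries and the Tier 0 inputs come from the text.

    # Hypothetical per-tier weights on (projection, recent history).
    TIER_WEIGHTS = {1: (0.8, 0.2), 2: (0.5, 0.5), 3: (0.3, 0.7)}

    def tbf_tier(tbf_30):
        """Map TBF over the last 30 days to a formula tier."""
        if tbf_30 < 10:
            return 0
        if tbf_30 < 50:
            return 1
        if tbf_30 < 100:
            return 2
        return 3

    def m_k_pct(tbf_30, proj_k, recent_k, park_factor):
        """Tier 0 uses only the projected K% and park factor; higher
        tiers trust the 30-day history more as TBF30 grows."""
        tier = tbf_tier(tbf_30)
        if tier == 0:
            return proj_k * park_factor
        w_proj, w_recent = TIER_WEIGHTS[tier]
        return (w_proj * proj_k + w_recent * recent_k) * park_factor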

As a final validation, we again separated the data into seasons, summed over single percentage-point bins, and found no R² value lower than 0.96 in any season. Furthermore, we split the data a different way, by the number of games started by the pitcher in the prior 30 days, and got no R² value lower than 0.94.
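
That binned validation is easy to reproduce; here is a sketch in NumPy, assuming unweighted bins (a simplification of the study's approach).

    import numpy as np

    def binned_r2(predicted, actual, bin_width=0.01):
        """Bin games by predicted rate in single-percentage-point bins,
        average the actual outcomes in each bin, and return the R^2
        between bin centers and bin averages."""
        predicted = np.asarray(predicted, dtype=float)
        actual = np.asarray(actual, dtype=float)
        bins = np.round(predicted / bin_width) * bin_width
        centers = np.unique(bins)
        means = np.array([actual[bins == c].mean() for c in centers])
        r = np.corrcoef(centers, means)[0, 1]
        return r ** 2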

We used a similar method to compute mBB%, mGB%, and mFB%, with the same TBF tiers. Note that the stabilization points identified by Carleton are 170 TBF for BB%, and 70 batted balls for GB% and FB%. A starting pitcher rarely sees 170 batters in 30 days, but smaller samples were still found to be useful. Using the TBF tiers described above allowed us to weight the recent results appropriately. For each of these three matchup metrics, we were able to produce results of similar quality across seasons, and across different amounts of recent history.

As the last part of the first tier, we calculated the expected number of batters faced in a game, mTBF. We found that on average, the results depended on a constant plus a factor times the mean TBF in the last 30 days. That is: mTBF = a + b * Mean(TBF_30). However, when we separated the results by the number of games started in the last 30 days (GS_30), we got a different constant and a different factor for each. It turned out, rather nicely, that both the constant and the factor depended on GS_30:

Those second-order fits told us that mTBF depends on both TBF_30 and GS_30. Using those inputs, we validated mTBF across seasons and GS bins.
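
Structurally, that is a two-level fit: a line in Mean(TBF_30) whose intercept and slope are themselves functions of GS_30. The sketch below keeps that structure, but the coefficients (and the linear forms of a and b) are hypothetical stand-ins for the study's second-order fits.

    def m_tbf(mean_tbf_30, gs_30):
        """mTBF = a + b * Mean(TBF_30), with a and b depending on games
        started in the last 30 days. Placeholder coefficients only."""
        a = 16.0 - 1.5 * gs_30   # placeholder: less history, more default workload
        b = 0.15 + 0.10 * gs_30  # placeholder: more starts, more weight on recent TBF
        return a + b * mean_tbf_30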

Second Tier of mBPIs

To build the next tier, we needed to understand how often batted balls go for hits and home runs. We know that H% depends on batted ball type, the ballpark, how hard the ball is hit, and the quality of the defense behind the pitcher. HR/FB likewise depends on batted ball type and how hard the ball is hit; despite intuition to the contrary, we have not yet identified any statistically significant way in which the pitcher can influence HR/FB. We didn’t know, however, which of these would have a measurable influence on the outcomes of a single game.

We began with H%, but quickly realized we didn’t have a handy measure of team defense by batted ball type. So we calculated how efficient a defense was at converting each type of batted ball into outs:

               GB_DE = GB_Outs / GB
               FB_DE = FB_Outs / FB
               LD_DE = LD_Outs / LD

And the league averages for 2010-2016 were:

               GB_DE: 0.740
               FB_DE: 0.773
               LD_DE: 0.290

[Note: we deliberated whether to subtract HR from FB or not, but chose not to for several reasons: first, HR can come from either FB or LD, so it wasn’t clear where to subtract them; second, there are plenty of batted balls that aren’t home runs that the defense also has zero chance of converting to outs, and we aren’t taking those out either. In a future iteration we may change our approach, but for now, HR are not subtracted when calculating DE.]
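
Computing these DE rates from team logs is straightforward. A minimal sketch, assuming a hypothetical log schema (per-game dicts of batted-ball counts); note that HR stay in the denominators, per the note above.

    def team_de_30(games):
        """Aggregate a team's last-30-days batted-ball logs into DE by
        type. Each log is a dict with counts gb, gb_outs, fb, fb_outs,
        ld, ld_outs (hypothetical field names)."""
        totals = {}
        for g in games:
            for key, count in g.items():
                totals[key] = totals.get(key, 0) + count
        return {
            "GB_DE": totals["gb_outs"] / totals["gb"],
            "FB_DE": totals["fb_outs"] / totals["fb"],
            "LD_DE": totals["ld_outs"] / totals["ld"],
        }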

Then, using all game logs for the study period, we calculated as above how much the DE in a given game depended on the team’s history over the last 30 days.

There was a small but consistent correlation between the last 30 days’ DE and the next game. For example, GB_DE over the last 30 days ranged from .70 to .90. We excluded games where the defense saw fewer than 400 batted balls in the previous 30 days (i.e., the first two weeks of April), then binned that last-30-day GB_DE in increments of .01. We summed the actual results in the games that followed, then plotted those results in aggregate against the last-30-day number, weighted by the total number of GB. There is a nice relationship:

The line is described by GB_DE = 0.544 + 0.265 * GB_DE_30 (binned).

Similar relationships held for FB and LD:

FB_DE = 0.467 + 0.405 * FB_DE_30 (binned)
LD_DE = 0.205 + 0.298 * LD_DE_30 (binned)

Using these formulas, we proceeded to create an expected DE for the matchup. As with mTBF, we found that the relationship became stronger the more history there was. We used a least-squares regression to fit against the recent history and the number of balls in play in that time period, and found relationships for the three DE metrics for the matchup. Here are the formulas, simplified slightly to make them more digestible:

mGB_DE% = 0.56 + 0.25 * GB_DE_30 + ((GB_DE_30 - 0.74) * (BIP_D30 - 667) / 3000)
mFB_DE% = 0.48 + 0.39 * FB_DE_30 + ((FB_DE_30 - 0.78) * (BIP_D30 - 668) / 1800)
mLD_DE% = 0.21 + 0.28 * LD_DE_30 + ((LD_DE_30 - 0.29) * (BIP_D30 - 669) / 280 )

Where BIP_D30 is the number of balls in play the defense saw over the last 30 days.
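
Since these three formulas are given explicitly, they translate directly to code. A useful sanity check: a league-average defense (.740/.773/.290) with a BIP_D30 near the centering constants returns roughly the league-average rates.

    def m_defense(gb_de_30, fb_de_30, ld_de_30, bip_d30):
        """Matchup defensive efficiencies, per the simplified formulas above."""
        m_gb = 0.56 + 0.25 * gb_de_30 + (gb_de_30 - 0.74) * (bip_d30 - 667) / 3000
        m_fb = 0.48 + 0.39 * fb_de_30 + (fb_de_30 - 0.78) * (bip_d30 - 668) / 1800
        m_ld = 0.21 + 0.28 * ld_de_30 + (ld_de_30 - 0.29) * (bip_d30 - 669) / 280
        return m_gb, m_fb, m_ld

    # m_defense(0.740, 0.773, 0.290, 667) -> (~0.745, ~0.781, ~0.291)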

To be clear, these formulas are very weakly correlated to the results in a single game.  However, over an entire season, they do very well.

Armed with these results, we estimated mH% for each game. We created no fewer than seven models, and tested them all across each season and by GS in the last 30 days.

Some versions had park factors, some didn’t. Some were reverse-engineered from past season totals, and some were built by running a least-squares regression on game-by-game data. Some used the recent hard-hit rate (HH%) of the opposition, and some didn’t. In the end, the most consistent model relied only on the matchup batted ball rates (mGB%, mFB%, and mLD% ≡ 1 − mGB% − mFB%) and the matchup DE rates (mGB_DE, mFB_DE, mLD_DE).

Notably, park factors were left out. They did not significantly affect the outcome, likely because they are partially baked into the home team’s DE numbers. Also, hard-hit ball rates by the opposition did not make the cut; when averaged over a team’s offense, they weren’t predictive enough to be statistically significant.

Finally, we get to HR/FB. Research here at BHQ has repeatedly shown that the pitcher has no control over HR/FB, so we didn’t even attempt to use the pitcher’s history here. We used only data about the opposition’s offense. Of the eight models we tested thoroughly, the statistically significant factors were found to be:

HH%_30, by pitcher handedness
HR/FB_30, by pitcher handedness
HR park factors

As in previous calculations, we scaled the impact of the last 30 days’ HR/FB by the number of batted balls in that period. More batted balls gave greater weight to the value, while fewer reverted more toward league average.
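
A sketch of that reversion-to-mean weighting. The league constant, the full-weight threshold, and the multiplicative park adjustment are all hypothetical placeholders, and the HH% term is omitted for brevity.

    LEAGUE_HR_FB = 0.105  # placeholder league-average HR/FB

    def m_hr_fb(opp_hr_fb_30, opp_bip_30, hr_park_factor, full_weight_bip=600.0):
        """More batted balls behind the opposition's 30-day HR/FB means
        more weight on it; fewer means more reversion to league average."""
        w = min(opp_bip_30 / full_weight_bip, 1.0)
        blended = w * opp_hr_fb_30 + (1.0 - w) * LEAGUE_HR_FB
        return blended * hr_park_factor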

The Third Tier: Matchup Ratings

We are now ready to estimate the “matchup” values for the standard rotisserie categories.

We return to strikeouts. We calculate the matchup strikeouts (mK) as simply mK% * mTBF. We find this works very well across seasons, as you can see in the chart below.

You are looking at the average number of strikeouts in the game versus mK, binned to 0.1 strikeouts. Larger circles mean more data in that bin. The slopes of the lines are nearly identical from year to year, and they are nearly perfectly overlaid.

We next move to mERA. The significant matchup inputs to ERA were not surprising: mK%, mBB%, mH%, mHR/FB, and mFB% (the last two both contributing to home runs).

When ER and IP are summed in bins of size 0.10 ERA, we again see good agreement overall:

mWHIP is also found to depend on mK%, mBB%, mH%, mHR/FB, and mFB%. Here is the result when binned to 0.02 units of WHIP:

And finally, for mWins, we used the mERA of both starters and BHQ’s equation for expected wins:

mWins = 0.72 * (mERA_OPP)^1.8 / ((mERA_OPP)^1.8 + (mERA)^1.8)
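
In code, with a quick example: a starter with a 3.50 mERA facing a starter with a 4.50 mERA projects to roughly 0.44 wins. (Presumably the 0.72 factor reflects how often the starter figures in the decision at all; the text does not say.)

    def m_wins(m_era, m_era_opp, exponent=1.8, cap=0.72):
        """BHQ's expected-wins equation above: a Pythagorean-style
        comparison of the two starters' mERA, scaled by 0.72."""
        opp = m_era_opp ** exponent
        own = m_era ** exponent
        return cap * opp / (opp + own)

    # m_wins(3.50, 4.50) -> ~0.44 expected wins for the 3.50 mERA starter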

SP Matchup Ratings

The new SP Matchup Tool will give a rating for each of these matchup statistics, based on the mean and standard deviation of the matchup values over the study period. So, in each category, a zero would be average, i.e., the 50th percentile of all SP. A +1.0 puts a pitcher in the 84th percentile, a +2.0 in the 98th percentile, and a +3.0 in the 99.85th percentile: that last should only happen a handful of times a year.
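
Because the rating is a z-score, converting it to a percentile is one line, assuming the ratings are roughly normal (the next paragraph notes the actual distribution is skewed high).

    from statistics import NormalDist

    def rating_percentile(rating):
        """Percentile of a matchup rating, treated as a z-score."""
        return 100.0 * NormalDist().cdf(rating)

    # rating_percentile(0.0) -> 50.0    rating_percentile(1.0) -> ~84.1
    # rating_percentile(2.0) -> ~97.7   rating_percentile(3.0) -> ~99.9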

The Overall Rating will be the mean of these four ratings. Below is the distribution of these overall matchup ratings from 2010-2016.

Ordinarily, we’d expect a normal distribution, but this is skewed to the high side. You can thank Clayton Kershaw. There were 63 overall rating scores above 3.0, and about half belonged to that elite lefty:

For a final validation of the rating system, we look to the PQS metric. We plot the rate of DOM, DEC, and DIS starts by overall rating, excluding any bin with fewer than 30 starts:

How to Use These Numbers

Aside from the obvious (you start the pitcher with the higher rating), you’ll need to determine which rating threshold matters for you, based on your league’s player pool restrictions. Consider the mean results per start, by Overall Rating range:

Overall
Rating          GS     W     IP     HR    BB     K    PQS   ERA    WHIP   BPV
============   =====  ====  =====  ====  ====  ====  ====  =====  =====  ====
-3.0 — -2.5      12   0.33   5.5    0.4   1.9   3.8   2.2   3.84   1.40    55
-2.5 — -2.0     127   0.31   5.2    0.9   2.1   3.2   1.7   5.32   1.51    22
-2.0 — -1.5     815   0.25   5.4    0.8   2.0   3.4   1.7   5.17   1.50    32
-1.5 — -1.0    2816   0.28   5.6    0.8   1.9   3.6   1.9   4.82   1.43    43
-1.0 — -0.5    5398   0.30   5.6    0.7   2.0   4.0   2.1   4.61   1.41    53
-0.5 — 0.0     7092   0.35   5.9    0.7   1.9   4.5   2.3   4.21   1.33    66
0.0 — 0.5      6433   0.37   6.0    0.7   1.8   5.0   2.5   3.85   1.27    82
0.5 — 1.0      4472   0.40   6.2    0.6   1.8   5.5   2.8   3.65   1.23    96
1.0 — 1.5      2438   0.43   6.4    0.6   1.8   6.1   3.0   3.38   1.18   109
1.5 — 2.0      1042   0.45   6.5    0.6   1.7   6.6   3.2   3.08   1.12   123
2.0 — 2.5       335   0.51   6.6    0.6   1.5   7.1   3.4   2.78   1.02   142
2.5 — 3.0       123   0.52   6.8    0.6   1.4   7.5   3.7   2.55   0.99   152
> 3.0            63   0.46   7.0    0.5   1.2   8.6   3.8   2.41   0.91   179

Clearly, any pitcher with a rating of 1.0 or higher is a must-start in all but the deepest of leagues. Much below zero, and those average stats start to look ugly pretty quickly.

There are roughly 150 starting pitchers in MLB rotations at any given time. For different league sizes, we can estimate the number of SP in the usable pool, and then determine the overall ratings that correspond to the worst, 25th-percentile, median, and 75th-percentile SP.

                               Overall SP Rating
                   SP       by SP league percentile
League size       Pool*     .00    .25    .50    .75
==============    =====    ====   ====   ====   =====
12-team “only”     120     -.73   -.22   +.22   +.74
10-team “only”     110     -.54   -.11   +.30   +.79
20-team mixed      110     -.54   -.11   +.30   +.79
15-team mixed      105     -.46   -.05   +.34   +.82
12-team mixed       80     -.07   +.22   +.55   +.97
10-team mixed       60     +.22   +.46   +.74   +1.12

So if you’re in a 12-team “only” league, the median start should have a rating of about +.22, and the last usable start is around -.73. In a shallower 15-team mixed league, the median start is +.34 and a 75th-percentile start is +.82.
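
For quick lookups without the chart, you can interpolate the table above. The linear interpolation between anchor points (and clamping outside them) is our own simplification.

    import numpy as np

    # Ratings at the .00/.25/.50/.75 percentiles of each usable SP pool,
    # from the table above.
    PCTS = [0.00, 0.25, 0.50, 0.75]
    LEAGUE_ANCHORS = {
        "12-team only":  [-0.73, -0.22, 0.22, 0.74],
        "10-team only":  [-0.54, -0.11, 0.30, 0.79],
        "20-team mixed": [-0.54, -0.11, 0.30, 0.79],
        "15-team mixed": [-0.46, -0.05, 0.34, 0.82],
        "12-team mixed": [-0.07,  0.22, 0.55, 0.97],
        "10-team mixed": [ 0.22,  0.46, 0.74, 1.12],
    }

    def league_percentile(rating, league):
        """Approximate percentile of a matchup rating within a league's
        SP pool, clamped at the .00 and .75 anchors."""
        return float(np.interp(rating, LEAGUE_ANCHORS[league], PCTS))

    # league_percentile(0.5, "15-team mixed") -> ~0.58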

Here’s the same information in a graphical presentation. Find the matchup rating of an SP, then trace to the right until you meet the curve that corresponds to the number of rostered SP in your league. The x-value of that intersection will tell you the percentile of that matchup within your league size.

Much of the value of an SP will come from their inherent ability. But there is a fairly large range in the matchup scores. If we calculate a mean matchup score for each SP in a season, then the range of deviation from that mean score looks like this:

One standard deviation is about 0.5 points, which means the matchup score will differ from the SP’s mean rating by more than 0.5 points about 1/3 of the time. Looking at the table above, that’s huge: it’s the difference between unrosterable and average in a 10-team mixed league, or between the 25th and 75th percentiles in a 12-team “only” league.

Conclusion

We completely rebuilt the SP Matchup Tool from the ground up. We can now assess the strength of a starter’s matchup from game to game, taking into account the pitcher’s inherent ability, recent performance, strength of defense, ballpark, and the opposition’s recent offensive history. We have also built a flexible framework that can evolve as we improve the model in the years that follow.

Once these are rolled out for good at the start of 2018, the matchup scores early in the season will parallel what you’d expect from the projections, but they will diverge as history accumulates. And because projections are built into the matchup scores, if a pitcher’s base skills change enough to merit a change in projections, that will be reflected in the model, too.

 

NOTE: Current matchup scores for our revised system come only in an 8-day scan, and can be found here. Note that they are current as of 9/18/17, but the chart is static and will not be updated through the week (though we will have a new version for the final week of the season on Monday 9/25/17). Next year's iteration will be flexible enough to handle late scratches and changes. Enjoy!



  For more information about the terms used in this article, see our Glossary Primer.