RESEARCH: Using decision trees for breakouts

Every fantasy owner is always looking for breakout and bust hitters, and I’m no exception. While I was taking a course in the R programming language, one section covered decision trees, and I felt they could provide some insight into finding hitters to target and avoid. The following is an attempt to find such hitters, with some obvious success.

A few huge caveats to start out with. I’m not even close to being an expert in R or the inner workings of decision trees. While I found them useful for this subject, others are going to know more, a lot more. I feel I’m at that point of knowing "just enough" to be dangerous.

A decision tree takes a set of inputs and uses them to split the data into groups that differ on a chosen target variable. For example, here is a tree representing the likelihood that a passenger survived the sinking of the Titanic ("sibsp" is the number of spouses or siblings aboard):

If you were a man, you were likely going to die while most women survived.

Branches to the left are followed when the answer to the condition is yes, and branches to the right when the answer is no. Once the data has been divided the first time, each new subset is analyzed to see whether any further meaningful split exists. If there is none, the branch ends. For more information on decision trees, here are the nerd, somewhat detailed, and simple explanations.
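For readers who want to see the splitting step in action, here is a minimal Python sketch of how a tree might pick one cut point. It assumes a regression-style tree that chooses the threshold with the lowest total squared error, which is one common criterion; I can't say it is exactly what the R routine does. The function names and the toy numbers are made up for illustration.

```python
from statistics import mean

def sse(values):
    """Sum of squared deviations from the group mean (0 for an empty group)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(rows, feature, target):
    """Return (threshold, total_sse) for the best single 'feature < threshold' split."""
    xs = sorted({r[feature] for r in rows})
    best = None
    # Candidate thresholds are midpoints between neighboring distinct values.
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [r[target] for r in rows if r[feature] < threshold]    # "yes" branch
        right = [r[target] for r in rows if r[feature] >= threshold]  # "no" branch
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best

# Hypothetical toy rows, not data from the article.
toy = [
    {"x": 150, "y": 6.0},
    {"x": 250, "y": 4.0},
    {"x": 450, "y": -3.0},
    {"x": 600, "y": -5.0},
]
split = best_split(toy, "x", "y")
print(split)
```

A full tree would repeat this search inside each resulting subset and stop when no split improves things enough.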

For the following analysis, I focused on hitters who exceeded or underperformed their expectations. The dependent variable was the difference between the player's actual and projected production in dollar values. To allow the program to find which player types beat their projections, I used the following inputs (2013 to 2019):

  • Age
  • Plate appearances (walks plus at-bats)
  • Isolated Power (ISO)
  • Strikeout rate (K%)
  • Walk rate (BB%)
  • Stolen Base Attempts ((SB+CS)/PA)
  • Batting Average on Balls in Play (BABIP)
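The setup above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the sign convention (positive means the hitter beat his projection) is my assumption.

```python
def surplus(actual_dollars, projected_dollars):
    """Dependent variable: actual minus projected value (assumed sign convention)."""
    return actual_dollars - projected_dollars

def plate_appearances(at_bats, walks):
    """Plate appearances as defined in the list above: walks plus at-bats."""
    return at_bats + walks

def sb_attempt_rate(sb, cs, pa):
    """Stolen base attempts per plate appearance: (SB + CS) / PA."""
    return (sb + cs) / pa if pa else 0.0

# Hypothetical hitter, not a real player line.
pa = plate_appearances(at_bats=540, walks=60)
rate = sb_attempt_rate(sb=10, cs=5, pa=pa)
value_gap = surplus(actual_dollars=12.0, projected_dollars=8.5)
print(pa, rate, value_gap)
```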

And remember, the key is to find a few groups of hitters to target, or avoid, to gain any little edge. It’s time to get started. Using all available data, here is the overall decision tree.

All hitters (> 100 projected PA)

Only two decision nodes are created, producing three groups, which isn't surprising. Hitters projected for fewer than 289 PA exceeded their projected return by $1.9. The other group divides by age, with young hitters not falling short as much as older hitters. No one saw that coming.

Breakouts come from hitters not expected to do much (rocket science). And old hitters break down more than young hitters (more rocket science).

One key to remember is that anyone expected to have a full season of plate appearances provided negative value, so for a group to be worth noting, its result needs to fall outside the range from -$2.6 to -$5.5.

I’m not here to focus on these three groups. The deal is that most projections work, so I’m going to limit the inputs into the decision tree program. The number of nodes increases, but many of the divisions are still based on age and plate appearances. I’m looking for the groups beyond those.

For the next tree, I just input the hitters who missed their projection on the low or high end by $10 or more.
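As a rough sketch, that filter might look like the following in Python. The `surplus` field name and the hitters are hypothetical, and I'm treating a miss of exactly $10 as included.

```python
def extreme_misses(rows, cutoff=10.0, target="surplus"):
    """Keep only rows that missed the projection by `cutoff` dollars or more, high or low."""
    return [r for r in rows if abs(r[target]) >= cutoff]

# Hypothetical hitters, not data from the article.
toy = [
    {"name": "Hitter A", "surplus": 14.0},
    {"name": "Hitter B", "surplus": -3.5},
    {"name": "Hitter C", "surplus": -11.0},
    {"name": "Hitter D", "surplus": 10.0},
]
kept = [r["name"] for r in extreme_misses(toy)]
print(kept)
```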

Extremes (> $10 in value or <-$10 in value)

The same PA limit shows up, with most of the breakouts coming from the low-PA group and the hitters with even fewer PA (<170.5) being the biggest breakouts.

On the breakdown side, it’s interesting that hitters with ‘age < 27.5’ and ‘K% < .18135’ have a minimal drop. It’s something to keep in mind for later.

The other trend, seen at three nodes (ISO less than 157, 184.5, and 147.5), is that power hitters perform worse than expected:

  • +12 vs -1
  • -1 vs -9
  • 16 or -7 or -8 vs -12

It seems like those projected for quite a bit of power underperform projections. So I took all hitters projected for over 289 PA and split them by a projected ISO over or under 150. Those with a projected ISO over 150 underperformed projections by $4.3. Those with a low ISO only underperformed by $2.6. Both groups' performances fall within the range from the first tree, so the split isn't helpful.
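This kind of check is just a comparison of two group averages. Here is a hedged Python sketch; the field names are assumptions, ISO is written in points (150 rather than .150) to match the text, and the toy rows are invented, so the output will not match the figures above.

```python
from statistics import mean

def compare_by_cutoff(rows, feature, cutoff, target="surplus"):
    """Return (mean target above cutoff, mean target at or below cutoff)."""
    above = [r[target] for r in rows if r[feature] > cutoff]
    below = [r[target] for r in rows if r[feature] <= cutoff]
    return (mean(above) if above else None, mean(below) if below else None)

# Hypothetical rows, not data from the article.
toy = [
    {"proj_pa": 550, "proj_iso": 210, "surplus": -6.0},
    {"proj_pa": 600, "proj_iso": 180, "surplus": -2.0},
    {"proj_pa": 500, "proj_iso": 120, "surplus": -3.0},
    {"proj_pa": 450, "proj_iso": 100, "surplus": -1.0},
    {"proj_pa": 200, "proj_iso": 190, "surplus": 8.0},  # excluded by the PA floor below
]
regulars = [r for r in toy if r["proj_pa"] > 289]
high_iso, low_iso = compare_by_cutoff(regulars, "proj_iso", 150)
print(high_iso, low_iso)
```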

Meet (> -$2 and < $2) and outperform (>$10) expectations

This tree breaks the trends. The older players (>28.5) seem to break the trend immediately, but keep in mind that these hitters are either near $0 in value or over $10, so the $4.7 surplus simply meets in the middle.

Going down the left side, the first node is based on power, with the right-side surpluses being quite a bit higher ($7, $8, $10, $25, and $27) than those on the left ($2, $6, and $9). These high-power breakouts go against what I found in the previous tree. One other note is that the 18.14% K% value shows up again, with those under that rate exceeding expectations. I ran a couple of tests on the full dataset.

I tried all hitters under age 29 with an ISO over 136. Compared to everyone else, there was no difference, with both groups at -$1.

Going for a smaller sample, I isolated the hitters with the same two requirements along with a low number of plate appearances (<391) and a decent strikeout rate (<18.14%). These hitters outperformed the rest of the hitters by almost $4 ($2.6 vs -$1.1). The main reason they outperformed is that they exceeded their projected plate appearances by 68. It’s almost like hitters with skills (power and plate discipline) will find playing time. I may have read about this phenomenon before.
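Written out as a screen, the four requirements look something like this Python sketch. I'm assuming the thresholds apply to projected values; the field names and hitters are hypothetical, and ISO is in points to match the text.

```python
def meets_screen(h):
    """True if a hitter passes all four thresholds described above."""
    return (
        h["age"] < 29
        and h["proj_iso"] > 136        # ISO in points, i.e. .136
        and h["proj_pa"] < 391
        and h["proj_k_rate"] < 0.1814  # strikeout rate as a fraction of PA
    )

# Hypothetical projections, not real players or BaseballHQ data.
projections = [
    {"name": "Hitter A", "age": 25, "proj_iso": 190, "proj_pa": 320, "proj_k_rate": 0.170},
    {"name": "Hitter B", "age": 31, "proj_iso": 200, "proj_pa": 300, "proj_k_rate": 0.150},
    {"name": "Hitter C", "age": 24, "proj_iso": 150, "proj_pa": 580, "proj_k_rate": 0.160},
    {"name": "Hitter D", "age": 26, "proj_iso": 160, "proj_pa": 350, "proj_k_rate": 0.250},
]
targets = [h["name"] for h in projections if meets_screen(h)]
print(targets)
```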

I went through the BaseballHQ projections and found the following hitters who met the criteria. 

2020 examples

Miguel Andujar sticks out as a player to target who has little cost right now but huge upside. 

Meet (> -$2 and < $2) and underperform (<-$10) expectations

It’s time to find who to avoid. The whole left side (>= 323 PA) of the tree is the obvious answer, but that was known from the first decision tree. The hitters with high BABIPs really struggle at -$12 or -$19 depending on their ISO. When all the hitters are added back in, the drop is only -$4.9, which is in line with the overall plate appearance split.

Going down the left side of the tree (BABIP < .312) and choosing those hitters with a high walk rate (BB% >= 6.0), I got nothing, with an average value of -$3.1. I couldn't find a subset that consistently didn't reach their expectations.


In summary, I used a fancy computer program to determine that breakouts come from young, talented hitters with a low number of projected plate appearances. I was hoping to find something more original, like that magical little subset of hitters who could help us win our leagues. No such luck this time. Instead, just bet on good hitters with suspect playing time.

  For more information about the terms used in this article, see our Glossary Primer.