Predicting xCSW using a Random Forest Method
By Devon Wright
During this past year, I have spent much of my time building my knowledge of pitching mechanics, hitting mechanics, and some of the data points associated with those realms. However, what has always kept me interested in baseball is the ability to research the game from an analytical point of view. Forming models to predict certain statistics has always been something I have been interested in, and it wasn’t until recently that I decided to act upon that interest. With the rise of MLBAM and Baseball Savant, there is more data on each event in a game than ever before. From release velocity to how fast a pitch is accelerating in the x, y, and z dimensions, we have more information to help us understand the game much better. For this reason, I decided to dive into the data and see what I could find.
One of the more interesting pitching statistics is CSW, short for called strike percentage plus whiff (swing-and-miss) percentage. As found in Alex Fast’s article here, there is a strong correlation between CSW and SIERA, or ‘Skill-Interactive Earned Run Average’. With this in mind, I decided to build a model that would accurately predict a pitcher’s CSW based on the pitch parameters of each one of their pitches. I will refer to the predicted values of CSW as xCSW, or ‘Expected Called Strike + Whiff Percentage’.
Data Acquisition
Last summer, I signed up for Driveline Plus to learn more about swinging and pitching mechanics. While searching their website, I came across a video that explained how to create a Statcast database. From there, I built my first database through a collection of Excel spreadsheets. I felt accomplished, but it wasn’t nearly good enough. Then I found an R package with functions that let me gather the play-by-play files from baseballsavant.mlb.com for every game played during a given timeframe. Once I had gathered all of the data and combined it into one large data frame, I placed that data frame into an RSQLite database. However, when you have upwards of 3 million rows, querying can get very slow. This problem forced my hand into downloading MySQL and building a database with a user interface attached. Once this was set up, I could write more specific queries and run them much faster, and I could centralize all of my data in one place. With my data all in one place, I began to collect the variables I would use to build my model.
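The store-then-query pattern above can be sketched in miniature. My pipeline used R with RSQLite and later MySQL; below is a hedged Python/SQLite approximation with made-up rows, where the column names are assumptions modeled on Statcast fields:

```python
import sqlite3
import pandas as pd

# Hypothetical miniature Statcast-style pitch table; the real data came
# from baseballsavant.mlb.com scrapes with millions of rows.
pitches = pd.DataFrame({
    "pitcher": [543037, 543037, 605483],
    "pitch_type": ["FF", "SL", "FF"],
    "release_speed": [97.1, 88.4, 93.2],
})

conn = sqlite3.connect(":memory:")  # a file path would persist the database
pitches.to_sql("statcast", conn, index=False, if_exists="replace")

# Indexing the columns you filter on is what keeps multi-million-row queries fast.
conn.execute("CREATE INDEX idx_pitcher ON statcast (pitcher)")
fastballs = pd.read_sql("SELECT * FROM statcast WHERE pitch_type = 'FF'", conn)
print(len(fastballs))  # 2
```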
Variables
When beginning to look at this problem and identify what variables I would need, I first looked at how I wanted the format of my results to look. After deciding that I wanted to have each pitcher have an xCSW predicted value for each one of their pitch classes, I realized that I would need to create ‘pitch profiles’ for each pitcher. Now that I had a definition of what I wanted my data set to look like, I began to query results from my database. I took all rows from my database from the year 2019, excluding any pitches thrown by position players. From there, I logically thought about what would go into a pitch profile. Because I strictly wanted this metric to be based on the physical attributes of the pitch rather than pitch usage or other stats, I decided I would take the following columns:
- Release speed (velocity at pitch release)
- Pfx_x (horizontal movement in feet)
- Pfx_z (vertical movement in feet)
- Release_position_x (horizontal placement of release point)
- Release_position_z (vertical placement of release point)
- Release_extension (distance from the release point to the pitching rubber)
- Release_spin_rate (the spin rate at release)
- Plate_x (location at home plate horizontally)
- Plate_z (location at home plate vertically)
From there, because I need to create pitch profiles for each pitcher and pitch class, I decided to group all pitch types into 6 groups called pitch classes. The 6 pitch classes were 4-Seam Fastballs, Sinkers, Changeups, Curveballs, Sliders, and Cutters. To reduce noise, I decided to ignore all other pitch types such as knuckleballs, forkballs, etc.
Once all of my pitch classes were decided, I grouped my dataset by each pitcher’s ID and name, then summarized the columns I was including for each profile. For this, I counted the number of pitches thrown in each pitch class and took the average of the aforementioned variables. One reason I used the mean rather than the median or any other summary statistic, and did not apply any standardization, is that tree-based algorithms such as random forests are insensitive to feature scales.
Once I had those summary statistics for each pitch class, I took each pitcher’s 4-Seam Fastball or Sinker and compared its velocity, horizontal movement, and vertical movement to each of their other pitch classes, creating the variables ‘velo_difference’, ’hmov_difference’, and ‘vmov_difference’ for each of those classes. One caveat here was pitcher Bryan Shaw: since he didn’t throw a 4-Seam Fastball or Sinker and considered his Cutter to be his fastball, I used his Cutter’s velocity and movement profile in place of a fastball.
With those variables created, I then built my dependent variable, CSW. For each pitch class, I counted the called strikes and the swinging strikes, added them together, and divided by the total pitches thrown in that class. Once I had all of my features ready for my model, I needed to do one last round of outlier filtering. For this, I set the floor for minimum pitches thrown at the 25th percentile. The minimum threshold was as shown here:
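The profile-building steps above can be sketched as follows. This is a hedged Python/pandas approximation of the R workflow, run on toy data; the CSW definition (called strikes plus swinging strikes, divided by total pitches) follows the article, but the exact field names are assumptions:

```python
import pandas as pd

# Toy pitch-level rows; column names mirror Statcast fields but are assumptions.
df = pd.DataFrame({
    "pitcher": [1] * 4 + [2] * 4,
    "pitch_class": ["FF", "FF", "SL", "SL", "FF", "FF", "CH", "CH"],
    "release_speed": [95, 96, 85, 86, 92, 93, 84, 85],
    "pfx_x": [-0.5, -0.6, 0.3, 0.4, -0.7, -0.6, -1.2, -1.1],
    "pfx_z": [1.4, 1.5, 0.1, 0.2, 1.3, 1.2, 0.5, 0.6],
    "description": ["called_strike", "ball", "swinging_strike", "ball",
                    "called_strike", "swinging_strike", "ball", "called_strike"],
})

# One profile per pitcher / pitch class: pitch count, mean traits, and CSW.
is_csw = df["description"].isin(["called_strike", "swinging_strike"])
profiles = (df.assign(csw_pitch=is_csw)
              .groupby(["pitcher", "pitch_class"])
              .agg(pitches=("release_speed", "size"),
                   release_speed=("release_speed", "mean"),
                   pfx_x=("pfx_x", "mean"),
                   pfx_z=("pfx_z", "mean"),
                   csw=("csw_pitch", "mean"))
              .reset_index())

# Difference features relative to each pitcher's fastball (FF) profile.
fb = profiles[profiles["pitch_class"] == "FF"].set_index("pitcher")
profiles["velo_difference"] = profiles["release_speed"] - profiles["pitcher"].map(fb["release_speed"])
profiles["hmov_difference"] = profiles["pfx_x"] - profiles["pitcher"].map(fb["pfx_x"])
profiles["vmov_difference"] = profiles["pfx_z"] - profiles["pitcher"].map(fb["pfx_z"])

# Drop low-sample profiles: floor at the 25th percentile of pitch counts.
threshold = profiles["pitches"].quantile(0.25)
profiles = profiles[profiles["pitches"] >= threshold]
```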
Now that my data was prepared, I began to build the models.
Model Building
At the beginning of this project, I decided to build baseline models to judge the effectiveness of my later models. To accomplish this, I split each of my profiles into train and test subsets and ran an ordinary linear regression. To judge the root mean squared error (RMSE), I compared it to the standard deviation (SD) of CSW in each pitch class; a model whose RMSE beats the SD is doing better than simply predicting the mean. After running each model and calculating the RMSE for each one, here are the baseline results:
As we can see, all of the models performed relatively well, as the RMSE values were smaller than the SD values. However, I believe that these predictions could significantly improve.
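The baseline comparison works like this in miniature. The original models were linear regressions in R; this is a hedged Python sketch on synthetic stand-in data, showing only the split-fit-RMSE-versus-SD pattern:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 300
# Synthetic stand-in for one pitch class's profiles (features -> CSW).
X = rng.normal(size=(n, 3))
y = 0.28 + 0.02 * X[:, 0] - 0.01 * X[:, 1] + rng.normal(scale=0.03, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5

# The model only adds value if its RMSE beats the naive "predict the mean"
# error, whose scale is the SD of CSW in the test set.
print(rmse < y_te.std())
```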
One of the first things I wanted to do to improve the predictions was to examine the importance of each feature using a feature selection method. Through my university schooling and an article by Ethan Moore (found here), I knew that a popular way to conduct feature selection is the Boruta algorithm. Boruta builds a random forest that includes a randomized ‘shadow’ copy of each feature. It then takes the shadow features’ importance values and builds a threshold, generally set at the highest importance value among the shadow variables. Each real variable’s importance is compared to that threshold; if it is greater, the variable gets a “hit”. Over repeated iterations, features that accumulate many hits are confirmed, features that get a hit roughly 50% of the time are tentatively kept, and features with few hits are considered noise and rejected. I put each of my pitch profile datasets into the algorithm to determine which features should be kept. Here are the results:
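A single Boruta-style iteration can be sketched as follows. This is not the full algorithm (which repeats the shadow comparison many times and counts hits), and it runs on synthetic data rather than the pitch profiles; it only illustrates the shadow-feature threshold idea:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 4))  # columns 0-1 informative, 2-3 pure noise
y = 2.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Shadow features: each column shuffled independently, destroying any
# relationship with y while keeping the marginal distribution.
shadows = rng.permuted(X, axis=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(np.hstack([X, shadows]), y)

# Threshold = best shadow importance; real features above it get a "hit".
real_imp = rf.feature_importances_[:4]
shadow_max = rf.feature_importances_[4:].max()
hits = real_imp > shadow_max
```

In the full algorithm this loop is repeated, and the hit counts decide which features are confirmed, tentative, or rejected.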
4 — Seam Fastballs:
After looking at the results of this Boruta function, I was not surprised to see that the only features rejected were the average plate locations of the fastball profile. It makes sense that average location was confirmed as unimportant, simply because of how much location varies from pitch to pitch.
Sinkers:
Seeing spin rate deemed unimportant was a bit surprising, but I am curious whether spin axis would have been deemed important, as it helps determine how much the pitch moves.
Changeups:
Reading the results from this algorithm was very surprising to me, as I was almost certain that horizontal movement would be considered important, since more pitchers are moving toward a ‘frisbee’ changeup. Seeing that horizontal movement difference was also rejected makes me think that if a pitcher’s primary fastball is a sinker, the difference in movement and velocity profiles may be insignificant.
Curveballs:
Reviewing these results, I was not surprised to see that location, release extension, vertical release position, and velocity difference were rejected. Logically, I wouldn’t say that any of these features contribute heavily to having an above average curveball.
Sliders:
Cutters:
After seeing that release characteristics were determined to be unimportant, I believe it is safe to say that what matters most for a cutter is its ball flight characteristics.
Now that my features were selected, I was ready to build the final models. For my model building, I decided on a 70/30 training split. From there, I refined the formula in each random forest call to include only the features the Boruta algorithm had determined to be important. As an example, here is the formula for my 4-Seam Fastball model:
From there, I used the model to predict xCSW values, compared those to the actual CSW values to calculate an RMSE, and compared that RMSE to the standard deviation of CSW for the pitch class. I repeated this process for each pitch class. As above, here is each pitch class, its RMSE, and the standard deviation of its CSW:
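Putting these steps together, the per-class modeling loop looks roughly like this. The original models were built in R’s random forest function; this hedged Python sketch uses synthetic data, and the feature names are assumptions mirroring the article’s columns:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 500
# Synthetic stand-in for 4-seam profiles, restricted to Boruta-kept features.
features = ["release_speed", "pfx_x", "pfx_z", "release_spin_rate"]
X = rng.normal(size=(n, len(features)))
y = 0.27 + 0.03 * X[:, 0] + 0.02 * X[:, 2] + rng.normal(scale=0.02, size=n)

# 70/30 split, as in the article.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
rf = RandomForestRegressor(n_estimators=300, random_state=1).fit(X_tr, y_tr)

xcsw = rf.predict(X_te)                   # predicted values = xCSW
rmse = mean_squared_error(y_te, xcsw) ** 0.5
print(rmse < y_te.std())                  # beats the SD-of-CSW yardstick
```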
Results
To accompany this article and show the results, I have created a Shiny app here. For each pitch class, I will showcase the leaders in xCSW.
xCSW — Top 10 4-Seam Fastballs
xCSW — Top 10 Sinkers
xCSW — Top 10 Changeups
xCSW — Top 10 Curveballs
xCSW — Top 10 Sliders
xCSW — Top 10 Cutters
Limitations
1. One of my first worries about a model built on only a single year of data is overfitting. In the near future, I will use data from 2016, 2017, and 2018 to supplement the 2019 data in order to build a better, more robust model.
2. Another thing that could limit the effectiveness of this research is the impact of minimum pitch thresholds on the model itself. I did my best by setting the threshold at the 25th percentile, but I definitely feel that there is a better, more objective way to determine this threshold.
3. Finally, I feel that my limited knowledge of the statistical side of random forest methods limits my ability to fine-tune the model. While I have a basic understanding of random forest regression models, I am eager to learn the more intricate details surrounding them.
Final Thoughts
This project was the first major machine learning research project I have done since I graduated last May. While it feels good to get some actionable results, I know that I still have much to learn in machine learning, statistics, and coding. However, I know that my coding skills and the ability to build a Shiny app make me a better analyst, and I look forward to improving this model in the near future. If you have any comments, constructive criticism, or questions, please don’t hesitate to reach me at dwright98@protonmail.com. The following links are to the code and to the Shiny app interface. Enjoy!
Code: https://github.com/dwright98/xCSW
Shiny App: https://dwright1398.shinyapps.io/xCSW_App/