Decoding the Next Pitch: Using Data to Anticipate a Pitcher’s Move in Baseball
*This is the first post in a series about baseball and how data, analytics, and machine learning can influence (and improve) the game.*
Baseball, often referred to as a game of numbers, has a long history of using data to analyze player performance and shape game tactics. In the early days, statisticians focused on traditional metrics like batting average, earned run average (ERA), and on-base percentage (OBP). As the sport evolved, so did data collection and the depth of analysis, giving rise to sabermetrics, an empirical approach to understanding the game that today relies heavily on data science.
During a baseball game, a vast amount of data is generated, capturing everything from player movements to the physics of each pitch. Modern technology, such as MLB’s Statcast system, employs radar and high-definition cameras to collect detailed information on every pitch, hit, and play. This granular data provides an unprecedented level of insight into the game, enabling teams to make data-driven decisions and develop cutting-edge strategies.
One key area where I believe this data revolution can help is predicting the next pitch a pitcher is going to throw. Knowing a pitcher’s tendencies and the likely type of the next pitch can give the opposing team a strategic advantage, allowing hitters to make more informed decisions at the plate. In this article, I’ll explore how to leverage pitch data to do exactly that, with code snippets that walk through the process.
Step 1: Collect and understand the data
To predict a pitcher’s next move, you’ll need to gather data on their past performances. This information can be sourced from various public databases like MLB’s Statcast or websites like FanGraphs and Baseball-Reference. You can download the data directly from these websites or access it via their APIs. The key data points I’ll be using are:
- Pitch type (fastball, curveball, etc.)
- Pitch location
- Pitch velocity
- Batter handedness
- Pitch count
- Game situation (inning, outs, runners on base)
Many teams collect and store proprietary data, but for those of us without that arsenal of technology, the public sources below work well.
Downloading data from a website: For example, you can download pitch-by-pitch data from Baseball Savant (https://baseballsavant.mlb.com/statcast_search). This website allows you to filter the data by various parameters such as date, pitcher, and pitch type. After setting the desired filters, click the “Export Data” button to download the data as a CSV file.
Using an API to get data: Alternatively, you can use an API like MLB’s Stats API (https://appac.github.io/mlb-data-api-docs/) to access the data programmatically. Here’s an example using Python and the requests library to fetch data from the API:
import json

import pandas as pd
import requests

# Define the API endpoint and parameters
# Note: the play-by-play path and field names below reflect the public MLB Stats API
# as I understand it; verify them against a live response before relying on them.
api_endpoint = "https://statsapi.mlb.com/api/v1/game/"
game_pk = "634671"  # Unique game identifier
api_url = f"{api_endpoint}{game_pk}/playByPlay"

# Fetch the full game's play-by-play data from the API
response = requests.get(api_url)
response.raise_for_status()
data = json.loads(response.text)

# Extract pitch-level events from each plate appearance
pitch_data = []
for play in data["allPlays"]:
    for event in play["playEvents"]:
        if event.get("isPitch"):
            pitch_data.append(event)

# Convert the (nested) pitch data to a flat pandas DataFrame
pitch_df = pd.json_normalize(pitch_data)
After gathering the data, make sure to familiarize yourself with the variables and their meanings to better understand how they can be used for analysis.
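As a quick sanity check, here is a minimal sketch of that familiarization step, assuming you saved the Baseball Savant export as pitch_data.csv (the exact column names will depend on your export):
import pandas as pd

# Load the downloaded CSV (or reuse pitch_df from the API example above)
pitch_df = pd.read_csv("pitch_data.csv")

# List the available columns and preview a few rows
print(pitch_df.columns.tolist())
print(pitch_df.head())

# See how pitch types are distributed and where values are missing
print(pitch_df["pitch_type"].value_counts())
print(pitch_df.isna().sum())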
Step 2: Preprocess the data
Once you have collected the data, you’ll need to preprocess it to ensure it is ready for analysis. This may involve cleaning up missing values, standardizing pitch types, and converting categorical variables into numerical ones. You can use Python’s pandas library to manipulate the dataset.
import pandas as pd
# Load the dataset
data = pd.read_csv("pitch_data.csv") # If you downloaded data as a CSV
# data = pitch_df # If you fetched data using the API
# Clean up missing values
# If a row contains missing values, we can either drop it or fill it with a suitable value (e.g., mean, median, or mode)
data = data.dropna()
# Standardize pitch types
# Ensure that pitch types have a consistent naming convention (e.g., all upper case)
data['pitch_type'] = data['pitch_type'].str.upper()
# Convert categorical variables into numerical ones
# Machine learning models typically require numerical input. To achieve this, we can use one-hot encoding or ordinal encoding
# For example, if we have a categorical variable with three categories (A, B, C), one-hot encoding creates three binary variables (is_A, is_B, is_C)
data = pd.get_dummies(data, columns=['batter_handedness', 'game_situation'])
# Scale features
# Scaling features ensures that all variables have the same range of values, preventing one variable from dominating others during the model training process
# There are several scaling methods, such as Min-Max scaling, Standard scaling (Z-score normalization), and Robust scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data.drop("pitch_type", axis=1))
scaled_data = pd.DataFrame(scaled_data, columns=data.columns.drop("pitch_type"), index=data.index)  # keep the original index so the concat below stays aligned
# Update the dataset
data = pd.concat([scaled_data, data["pitch_type"]], axis=1)
By carefully preprocessing the data, you can ensure that it’s in the right format and properly structured for the next step: training the machine learning model. This step will help you achieve better results when predicting the next pitch based on the given data points.
Step 3: Train a machine learning model
To predict the next pitch, I’ll use a machine learning model. For this example, I’ll use a Random Forest Classifier from Python’s scikit-learn library. This ensemble method works well with complex datasets, handles noisy data, and is less prone to overfitting than a single decision tree. It operates by constructing many decision trees during training and outputting the mode (majority vote) of the individual trees’ predictions.
Before training the model, you will need to split the data into a training set and a testing set. The training set will be used to train the model, while the testing set will be used to evaluate the model’s performance. A typical split ratio is 80% for training and 20% for testing.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Split the data into training and testing sets
X = data.drop("pitch_type", axis=1)
y = data["pitch_type"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the Random Forest Classifier
# n_estimators is the number of trees in the forest, and random_state is a seed for reproducibility
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
Once the model is trained, you can use it to make predictions on the testing set. This will help you evaluate the model’s performance by comparing the predicted pitch types with the true pitch types. One common metric for classification tasks is accuracy, which measures the proportion of correct predictions.
# Predict on the test set
y_pred = clf.predict(X_test)
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
To further improve the model, you can experiment with different machine learning algorithms (e.g., logistic regression, support vector machines, or neural networks) or perform hyperparameter tuning (e.g., adjusting the number of trees, maximum depth, or minimum samples per leaf in the Random Forest Classifier).
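As a sketch of the hyperparameter tuning idea, scikit-learn’s GridSearchCV can try a handful of Random Forest settings with cross-validation (the grid values below are arbitrary starting points, not recommendations):
from sklearn.model_selection import GridSearchCV

# Search over a small, illustrative grid of Random Forest settings
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)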
Keep in mind that no model will be perfect, and the goal is to achieve a balance between underfitting (high bias) and overfitting (high variance). This will allow your model to generalize well to new, unseen data while still capturing the underlying patterns in the training data.
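A quick way to gauge where a model sits on that spectrum is to compare training accuracy with test accuracy; a large gap points toward overfitting, while low scores on both point toward underfitting:
# Compare training and test accuracy as a rough bias/variance check
train_accuracy = clf.score(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print("Train accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)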
Conclusion
In this article, I demonstrated how to leverage pitch data to predict the next pitch a pitcher will throw using Python and machine learning. The process began with collecting pitch data from reliable sources such as websites or APIs, followed by preprocessing the data to ensure its quality and compatibility with machine learning models. Next, I discussed how to train a Random Forest Classifier to predict pitch types based on the available features, evaluated its performance using accuracy, and discussed potential avenues for further improvement.
When applied to real-world situations, this approach can provide valuable insights into a pitcher’s strategy, allowing opposing teams to make more informed decisions at the plate. For instance, a completed model may predict the next pitch type with an accuracy of 70% or higher, giving the batter an advantage in anticipating the incoming pitch. A sample output of the model’s predictions might look like this:
True Pitch Type    Predicted Pitch Type
----------------------------------------
Fastball           Fastball
Slider             Slider
Curveball          Curveball
Fastball           Changeup
Changeup           Changeup
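For reference, a comparison table like the one above can be built directly from the test predictions (the column names here are just display labels):
# Put actual and predicted pitch types side by side
comparison = pd.DataFrame({
    "True Pitch Type": y_test.values,
    "Predicted Pitch Type": y_pred,
})
print(comparison.head())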
While the model may not always predict the pitch type correctly, it can still provide valuable information that can improve a team’s overall performance. Moreover, the insights gained from analyzing feature importance can help teams tailor their strategies to exploit a pitcher’s weaknesses.
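As a short sketch of that idea, the trained Random Forest exposes feature importances that can be ranked to see which inputs drive its predictions:
# Rank features by their contribution to the forest's splits
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))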
It’s important to note that the methods discussed in this article can be further refined and expanded upon. For example, you can experiment with different machine learning algorithms, incorporate more features or game situations into the model, or even apply advanced techniques like deep learning.
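As one example of swapping in a different algorithm, a gradient boosting classifier from scikit-learn drops into the same fit/predict workflow used above:
from sklearn.ensemble import HistGradientBoostingClassifier

# A drop-in alternative to the Random Forest with the same interface
gb_clf = HistGradientBoostingClassifier(random_state=42)
gb_clf.fit(X_train, y_train)
print("Gradient boosting accuracy:", gb_clf.score(X_test, y_test))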
This article is just the beginning of this series on using data in baseball. Stay tuned for more in-depth articles that delve into other aspects of the game, such as player performance evaluation, injury prediction, and game strategy optimization. By harnessing the power of data and machine learning, we can revolutionize our understanding of baseball and elevate the sport to new heights.