Drafting The Perfect Prospect: Leveraging Data Science in MLB Draft Scenarios
This is the fourth article in a series focused on data science + baseball. You can find parts I, II, and III linked below.
Part I: Decoding the Next Pitch: Using Data to Anticipate a Pitcher’s Move in Baseball
Part II: Winning Through Data: Game Strategy Optimization in Baseball
Part III: Data-driven injury prevention in baseball: Maximizing player performance and longevity
— — — — —
Hello fellow baseball enthusiasts and data scientists. We’ve arrived at the final post in the series exploring the integration of data science and machine learning into the world of baseball. In this concluding entry, we are stepping into the high-stakes world of player drafting and scouting. Let’s unpack how data science and machine learning can guide us through this critical process and help us draft the perfect prospect.
The Major League Baseball (MLB) (Hey MLB, still waiting for that call 😉) draft is a complex, intricate ‘dance’ where the future of franchises can be shaped. It is a talent acquisition strategy that, while exciting and full of potential, also comes fraught with high risks. Each selection is a calculated bet on a player’s future performance, often made years in advance of them reaching their prime.
This complexity stems from multiple sources. First, the players themselves are often at the very beginning of their professional careers, with many draftees coming fresh from college, and some directly from high school. The relative immaturity of these players means that scouts and executives have to base their decisions on projected development and potential, rather than an extensive record of professional performance.
Second, baseball is a sport rich in variables. A player’s success isn’t determined solely by their physical prowess, but also by their technical skills, their mental strength, their adaptability, and even their luck. This multidimensionality makes player evaluation an intricate task.
Third, drafting involves a long-term vision. The drafted player might not reach the Major Leagues for several years, and they could take even longer to hit their stride. Consequently, drafting decisions have a substantial impact on a team’s future roster and their success in the years to come.
Given these complexities, drafting the right player is both an art and a science. Traditionally, this process has relied heavily on the wisdom of scouts who travel far and wide to watch, analyze, and rate players using their expertise. However, the dawn of the data science era has provided us with a new set of tools to aid in this process. By leveraging machine learning and advanced analytics, we can augment these traditional methods to make more informed decisions. Traditional scouts, and their tacit knowledge, will never go away (I hope), but blending data science with their direct domain expertise will turn them into super scouts.
In this post we’ll examine how to utilize these cutting-edge techniques to transform raw data into insights, predictions, and actionable strategies.
Understanding the Importance of Data in Baseball Drafting
Success in Major League Baseball hinges on the ability to scout and draft effectively. The stakes are high; the right decision could potentially mean the acquisition of the next Shohei Ohtani or Julio Rodríguez. This process has long been dependent on the wisdom of experienced scouts who spend countless hours observing players, taking meticulous notes on their technique, evaluating their physical attributes, and making subjective judgments about their potential.
However, the sport is now amidst a paradigm shift with the introduction of data science. We can augment traditional scouting methods with insights gleaned from advanced analytics, enabling a more objective and informed decision-making process. This fusion of old and new school methodologies is helping to redefine how baseball franchises build their teams.
Let’s take a practical example. Suppose you’re considering drafting a promising high school pitcher. A scout might evaluate the player’s current skill level and consider factors such as fastball velocity, control, and secondary pitches. However, the scout’s report, while valuable, may not be sufficient to predict the player’s future performance in professional leagues.
Here’s where data science enters the game. Using data from similar players who’ve transitioned from high school to professional baseball, a machine learning model can predict the player’s performance trajectory. The model could consider variables like the player’s age, body composition, past injuries, pitching mechanics, the rate of improvement, and even factors such as the player’s socio-economic background, which could influence their access to training resources.
In Python, this might look something like:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
# Define the model
model = RandomForestRegressor(n_estimators=100, random_state=42)
# Train the model
model.fit(X_train, y_train)
# Make a prediction on a new prospect
prospect_data = np.array([17, 6.2, 190, 95, 0]) # age, height, weight, fastball_velocity, past_injuries
predicted_future_performance = model.predict(prospect_data.reshape(1, -1))
In this hypothetical scenario, X_train and y_train are our training data (with columns such as age, height, weight, fastball velocity, and past injuries), and prospect_data is the data from our new prospect. The model generates a prediction of the future performance for this prospect, aiding in the drafting decision.
In this way, the integration of data science and traditional scouting methods leads to a more holistic approach to drafting. We can leverage the nuances and insights from experienced scouts, while also utilizing data-driven methodologies for a more objective and precise evaluation. Such a dual approach is likely to lead to more accurate predictions and ultimately, a higher success rate in drafting the stars of tomorrow.
Harvesting the Field: The Art of Collecting Relevant Data
Collecting the right data is foundational to landing the next franchise star. The challenge lies in identifying which pieces of data will most accurately inform your models and lead to the most reliable predictions.
At a surface level, every fan is familiar with the standard set of baseball statistics — batting averages, RBIs, ERA for pitchers, and so on. These fundamental metrics form the backbone of any data collection effort and serve as the starting point for our analysis.
For instance, a pitcher’s Earned Run Average (ERA) can provide insight into their performance level. However, it’s essential to remember that such aggregated statistics can sometimes hide more than they reveal. A low ERA might be the result of excellent individual performance, or it could be due to a strong defense behind the pitcher, or perhaps a combination of both.
This is where the power of more granular data comes in. The ‘Moneyball’ revolution in the early 2000s taught us that often-overlooked statistics, like on-base percentage, can sometimes be more valuable indicators of a player’s contribution to the team. In today’s era of Statcast, we can delve even deeper. For example, we can consider a hitter’s launch angle and exit velocity, two metrics that, when combined, can provide a strong indicator of their ability to generate powerful, productive hits. Similarly, a pitcher’s spin rate can give insights into their ability to deceive batters and induce swings and misses.
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
# Load data
data = pd.read_csv('baseball_data.csv')
# Let's say we are interested in these specific metrics
interested_metrics = ['launch_angle', 'exit_velocity', 'spin_rate', 'height', 'weight', 'age']
# Analyzing the data
print(data[interested_metrics].describe())
# Visualizing the data
# For instance, we might be interested in the relationship between 'exit_velocity' and 'launch_angle'
plt.scatter(data['exit_velocity'], data['launch_angle'])
plt.xlabel('Exit Velocity')
plt.ylabel('Launch Angle')
plt.title('Relationship between Exit Velocity and Launch Angle')
plt.show()
The code above first loads the dataset using pandas. It then focuses on the metrics we’re interested in — in this case, ‘launch_angle’, ‘exit_velocity’, ‘spin_rate’, ‘height’, ‘weight’, and ‘age’. The describe() function provides a statistical summary of these metrics, giving an immediate understanding of their distribution.
We then create a scatter plot using matplotlib to visualize the relationship between ‘exit_velocity’ and ‘launch_angle’. Such visualizations can help us better understand the patterns and relationships within our data, driving our analysis and feature selection for machine learning modeling.
Remember, this is just a simple example. Real-world data exploration would involve much more detailed analysis and potentially more complex visualizations to capture the multifaceted nature of the data.
However, for younger prospects, especially those coming straight out of high school or college, the nature of the data we collect might need to shift. For these players, data related to physical development, such as height, weight, strength, and speed, can be beneficial. Tracking their development over time might reveal a growth pattern that indicates potential for future improvement. Furthermore, data on more intangible attributes, like work ethic and leadership qualities, could also be considered.
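To make that concrete, here is a minimal sketch of how we might quantify a prospect’s development over time. It assumes a hypothetical longitudinal file, ‘prospect_measurements.csv’, with one row per player per year; the column names (player_id, year, fastball_velocity) are illustrative assumptions, not a real dataset.
import pandas as pd
# Hypothetical longitudinal data: one row per player per year
# Assumed columns: player_id, year, height, weight, fastball_velocity
measurements = pd.read_csv('prospect_measurements.csv')
# Sort so year-over-year differences are computed in chronological order
measurements = measurements.sort_values(['player_id', 'year'])
# Year-over-year change in fastball velocity for each prospect
measurements['velocity_gain'] = measurements.groupby('player_id')['fastball_velocity'].diff()
# Average annual velocity gain per player: a simple proxy for development trajectory
growth = measurements.groupby('player_id')['velocity_gain'].mean().sort_values(ascending=False)
print(growth.head())
A steadily positive velocity gain, for example, could flag a prospect who is still improving, which is exactly the kind of signal that a single-season snapshot would miss.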
This expansion of data collection efforts ultimately aims to form a comprehensive, 360-degree view of each player. By considering a wide array of factors, from traditional stats to granular data and physical development metrics, we can equip our models with the best chance of predicting a player’s future success.
Setting the Stage: Preprocessing the Data for Machine Learning Models
Having collected a rich set of baseball data, the next crucial step in our pipeline is data preprocessing. This step transforms raw data into a format that can be ingested by machine learning algorithms. It includes data cleaning, handling missing values, normalizing numerical data, encoding categorical data, and feature engineering.
Data Cleaning and Handling Missing Values
It’s rare to find a dataset without missing or inconsistent values, especially in a diverse and dynamic domain like baseball. We need to fill in these gaps, drop irrelevant rows or columns, or impute missing values based on statistical measures.
# Import pandas library
import pandas as pd
# Load data
data = pd.read_csv('baseball_data.csv')
# Identify missing values
print(data.isnull().sum())
# Fill missing values in numeric columns with the column median
# (restricting to numeric columns avoids errors on text fields)
for column in data.select_dtypes(include='number').columns:
    if data[column].isnull().any():
        data[column].fillna(data[column].median(), inplace=True)
Normalizing Numerical Data
Machine learning models can be sensitive to the scale of features. To handle this, we often normalize numerical data, bringing different variables to a comparable range.
# Import the preprocessing module from sklearn
from sklearn import preprocessing
# Define the scaler
scaler = preprocessing.MinMaxScaler()
# Apply the scaler to the numerical columns
data[['height', 'weight', 'age']] = scaler.fit_transform(data[['height', 'weight', 'age']])
Encoding Categorical Data
Many datasets include categorical data that needs to be converted into a numerical format for machine learning algorithms.
# Encode categorical data
data = pd.get_dummies(data, columns=['batting_side', 'pitching_arm'])
Feature Engineering
Sometimes, the original features might not be enough to capture underlying patterns or trends effectively. Feature engineering is the process of creating new features from existing data to enhance the predictive power of our models.
# Create a new feature 'BMI' from 'height' and 'weight'
# (assumes the raw, unscaled values in metric units: kilograms and meters)
data['BMI'] = data['weight'] / (data['height'] ** 2)
# Create a new feature 'experience' from 'age' and 'professional_years'
data['experience'] = data['professional_years'] / data['age']
Through these preprocessing steps, we’ve transformed the raw data into a format suitable for machine learning models. It’s important to remember that preprocessing and feature engineering often require a good understanding of the data and the domain. Each dataset is unique, and the steps you need to take will vary based on the specifics of your dataset.
Gearing Up: Selecting and Training Machine Learning Models
With our data preprocessed and ready to go, it’s time to dive into the heart of machine learning — model selection and training. The choice of model depends heavily on the problem at hand and the nature of the data. Whether it’s regression models for predicting a player’s future batting average, decision trees for classifying prospects into different tiers, or even neural networks for complex non-linear patterns, each has its place in the toolkit of a data scientist.
But before we can train our model, we need to split our data into a training set and a test set. This step is crucial for validating our model’s performance and ensuring it doesn’t just memorize the training data (a problem known as overfitting), but generalizes well to unseen data.
Let’s walk through a basic example of this process using Python’s scikit-learn library.
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Define features and target
features = data.drop('future_batting_average', axis=1)
target = data['future_batting_average']
# Split the data into training and test sets (80% for training, 20% for testing)
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
# Create a Linear Regression model
model = LinearRegression()
# Train the model
model.fit(X_train, y_train)
# Use the trained model to make predictions on the test set
predictions = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, predictions)
print(f'The Mean Squared Error of the model is: {mse}')
In this example, we first define our features and the target variable, ‘future_batting_average’. We then split the data into a training set (80% of the data) and a test set (the remaining 20%).
Next, we create a Linear Regression model and train it using our training data with the fit method. Once the model is trained, we use it to make predictions on the test data, and finally, we evaluate the model's performance using the Mean Squared Error (MSE), a common metric for regression tasks.
This is a simplified illustration, and actual model training might involve more complexities. We could consider using different models, tuning their parameters, or even combining them into an ensemble. Plus, we might use cross-validation instead of a simple train-test split, and we could explore more sophisticated evaluation metrics depending on the nature of the problem. Nevertheless, this example provides a basic overview of the process.
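As one illustration of that point, here is a minimal sketch of k-fold cross-validation with scikit-learn, reusing the features and target defined above. The specific model and fold count are assumptions for demonstration, not a recommendation.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
# 5-fold cross-validation: every observation is used for both training and
# validation across the folds, giving a more robust error estimate than a single split
cv_model = LinearRegression()
cv_scores = cross_val_score(cv_model, features, target, cv=5, scoring='neg_mean_squared_error')
print(f'Cross-validated MSE: {-cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})')
Averaging the error across folds reduces the chance that a single lucky (or unlucky) train-test split misleads us about how well the model will generalize.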
Decoding the Game: Feature Importance and Prospect Evaluation
A fascinating aspect of our machine learning journey is unveiling which features significantly contribute to our model’s predictions. This analysis, often called feature importance, reveals the hierarchy of relevance among our features, effectively helping us understand which attributes are paramount for predicting a player’s success.
Interpreting feature importance not only gives us insights into our model but can also illuminate broader trends and patterns in the game of baseball itself. For instance, it could reveal whether a pitcher’s spin rate is more critical for their success than their velocity, or if a hitter’s on-base percentage is more predictive than their slugging percentage.
Let’s see how we might extract feature importance from a tree-based model, such as a Random Forest Regressor:
# Import necessary libraries
from sklearn.ensemble import RandomForestRegressor
# Create a Random Forest Regressor model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
# Train the model
rf_model.fit(X_train, y_train)
# Get feature importances
importances = rf_model.feature_importances_
# Sort features by their importance
features_importance = sorted(zip(importances, X_train.columns), reverse=True)
# Print feature importance
for rank, (importance, feature) in enumerate(features_importance, start=1):
    print(f"{rank}. {feature}: {importance}")
In this code, we train a RandomForestRegressor model and then extract the feature importances using the feature_importances_ attribute. We sort the features by their importance and print them out in descending order.
With our model trained and our feature importance understood, we’re in a position to evaluate potential prospects. Given a prospect’s data, we can feed it into our model to predict their future performance. This prediction, synthesized with traditional scouting techniques, can guide decision-making processes in drafting.
# Let's assume 'prospect_data' is a DataFrame containing the data for a potential prospect
prospect_data = pd.read_csv('prospect_data.csv')
# Preprocess prospect data as done with the training data
# ...
# Predict the prospect's future performance
prospect_performance = rf_model.predict(prospect_data)
print(f"The predicted future performance of the prospect is: {prospect_performance}")
In this final code snippet, we load data for a potential prospect and preprocess it in the same way as our training data. We then use our trained Random Forest model to predict the prospect’s future performance.
With such tools at our disposal, the fusion of data science and traditional scouting can usher in a new era of talent identification and player development in baseball, turning once hunch-based decisions into data-driven ones.
In the Dugout: The Art of Continuous Model Improvement
Building a machine learning model is not a one-time event but rather a cyclical process of constant improvement. After the initial model is deployed, it is vital to keep monitoring its performance and updating it as needed. As the sport of baseball continues to evolve, our model needs to evolve with it. By validating the model with new data and updating or retraining it as required, we can ensure that our model remains accurate and reliable.
With the accumulation of more data over time — new players, more games, changing strategies — our model has the opportunity to learn and improve. This continuous learning process can lead to enhancements in predictive power and, consequently, better drafting decisions.
In Python, the process of updating a model with new data can be straightforward. Let’s illustrate this with our RandomForestRegressor model:
# Import new data
new_data = pd.read_csv('new_baseball_data.csv')
# Preprocess new data as done with the initial training data
# ...
# Split new data into features and target
new_features = new_data.drop('future_batting_average', axis=1)
new_target = new_data['future_batting_average']
# Retrain the model with the new data
# (in practice, you would often combine the original and new data before refitting,
# since fit() replaces the previously learned model rather than updating it)
rf_model.fit(new_features, new_target)
# Validate the updated model with the latest test set
new_predictions = rf_model.predict(X_test)
new_mse = mean_squared_error(y_test, new_predictions)
print(f'The Mean Squared Error of the updated model is: {new_mse}')
In this example, we load new data and preprocess it in the same manner as our initial data. We then retrain our Random Forest model on this new data using the fit method. Finally, we evaluate the updated model's performance on our test set.
Remember, continuous model improvement doesn’t mean just retraining on new data. It also involves constant reevaluation of the model’s architecture, feature selection and engineering, and hyperparameters. The data science field is always evolving, with new techniques and methods being developed. Staying updated with these advancements can help us continually improve our models and make even more accurate predictions, bringing data-driven clarity to the complex world of baseball drafting.
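As a sketch of what that reevaluation might look like in practice, the snippet below runs a small hyperparameter search over our Random Forest with scikit-learn's GridSearchCV. The grid values are illustrative assumptions; a real search would be guided by the data and available compute.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
# Hypothetical grid of hyperparameters to search over
param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [None, 5, 10],
    'min_samples_leaf': [1, 3, 5],
}
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
)
grid_search.fit(X_train, y_train)
print(f'Best parameters: {grid_search.best_params_}')
print(f'Best cross-validated MSE: {-grid_search.best_score_:.4f}')
# Adopt the best estimator as the updated model going forward
rf_model = grid_search.best_estimator_
Rerunning a search like this periodically, as new data arrives, is one simple way to keep the model's configuration from going stale.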
The Final Score: A Grand Slam in Data-Driven Decision Making
As we draw the curtains on our journey into the depths of data science and machine learning in baseball, we find ourselves in a unique position where the traditional wisdom of scouting intersects with the insightful world of machine learning. The process we’ve gone through — collecting meaningful data, preprocessing it rigorously, building a predictive model, and continuously improving it — has fundamentally changed the game. It has transformed player drafting from a matter of instinct to a realm of insightful, data-backed decisions.
Let’s envision a situation where this approach truly shines. Imagine you’re the General Manager of a Major League Baseball team. You’re faced with a challenging decision: choosing between two promising prospects, Jake and Alex. Both have impressive college careers, but who has the potential to succeed in the big leagues?
Jake is a hard-throwing pitcher, with high velocity and an impressive strikeout record. Alex, on the other hand, is a disciplined batter with a knack for getting on base. Their styles are different, and so are their potential impacts on the team. How can you decide?
This is where our data-driven approach can help. We’ve gathered exhaustive data on both players, from their standard stats to granular metrics like Jake’s spin rate or Alex’s swing speed. We’ve processed and fed this data into our machine learning model.
# Import Jake's and Alex's data
prospects_data = pd.read_csv('prospects_data.csv')
# Preprocess their data as we did with our training data
# ...
# Use our model to predict their future performance
prospects_performance = rf_model.predict(prospects_data)
print(f"Predicted future performance: {prospects_performance}")
Our model projects the future performance of Jake and Alex. These predictions, combined with the scouts’ assessment of their skills and mental fortitude, form a comprehensive evaluation of the two players.
As the GM, you now have an additional, powerful tool in your decision-making arsenal. If Jake’s projected performance aligns better with the team’s needs, the choice is clear. If Alex’s on-base percentage prediction and hitting prowess fit the team’s strategy better, then he is the obvious choice.
Yet, this is not the end of the story. As Jake or Alex’s career unfolds, their performance will add new data to our model, which we will continuously update and refine to ensure its relevancy and accuracy.
# Import new performance data
new_data = pd.read_csv('new_data.csv')
# Preprocess new data in the same way we did with our initial dataset
# ...
# Split the new data into features and target, then retrain the model
new_features = new_data.drop('future_batting_average', axis=1)
new_target = new_data['future_batting_average']
rf_model.fit(new_features, new_target)
# Validate the updated model with the latest test set
new_predictions = rf_model.predict(X_test)
new_mse = mean_squared_error(y_test, new_predictions)
print(f'The Mean Squared Error of the updated model is: {new_mse}')
This process underlines the iterative nature of machine learning models. The performance data of Jake, Alex, and other players continually enrich our model, leading to a consistent evolution in its predictive capabilities.
The landscape of baseball drafting, in the age of data science and machine learning, has been transformed from a subjective art to an insightful, evidence-driven science. The drafting board now becomes a dynamic, data-driven battleground. By merging traditional scouting wisdom with machine learning techniques, we’re not just changing the game; we’re revolutionizing it, one data point at a time. It’s a brave new world of baseball, and we’re right at home plate.
This series was so much fun to write. If you have any comments or suggestions, please send them my way at Fish@itsmefish.com.
Up next, Data Science + Soccer (futbol).