College Football Conference Realignment - Regression

Author:Murphy  |  View: 21094  |  Time: 2025-03-23 13:10:49

Welcome to part 2 of my series on conference realignment! Last summer when conference realignment was in full swing, Tony Altimore published a study on Twitter that inspired me to do my own conference realignment analysis. This series is organized into four parts (and the full motivation for it is found in part 1):

  1. College Football Conference Realignment – Exploratory Data Analysis in Python
  2. College Football Conference Realignment – Regression
  3. College Football Conference Realignment – Clustering
  4. College Football Conference Realignment – node2vec

 

Hopefully, each part of the series provides you with a fresh perspective on the future of the beloved game of College Football. For those of you who did not read part 1, here is a quick synopsis: I created my own data set compiled from sources across the web. These data include basic information about each FBS program, a non-canonical approximation of all college football rivalries, stadium size, historical performance, frequency of appearances in AP top 25 polls, whether the school is an AAU or R1 institution (historically important for membership in the Big Ten and Pac 12), the number of NFL draft picks, data on program revenue from 2017–2019, and a recent estimate of the size of college football fan bases. As it turns out, stadium capacity, 2019 revenue, and historical AP poll success correlate strongly with the estimated fan base size in Tony Altimore's analysis:

 

Supervised Learning

So, this got me thinking: can we create a simple regression model to estimate fan base size?

Broadly, we can divide Machine Learning into supervised and unsupervised learning. In supervised learning, the goal is to predict a pre-defined discrete class or continuous variable. In unsupervised learning, the goal is to discover trends in the data that are non-obvious. Regression is a type of supervised learning where the prediction target is a continuous variable. A great reference guide and resource was put together by Shervine and Afshine Amidi. (It has been translated into 11 other languages!)

Our choice of regression model is limited by the small number of observations in the data, as there are only 133 FBS teams in college football. Regardless of our model choice, the scikit-learn package will have us covered. It is easy to use and well documented.

Feature Engineering

Now that we have our approach, we can re-structure our data for optimal model performance. This is commonly referred to as feature engineering. First, we import dependencies and load the data.

# Import dependencies
import numpy as np
import pandas as pd
# Read csv of data
cfb_info_df = pd.read_csv(r'.FBS_Football_Team_Info.csv', encoding = 'unicode_escape')

We are only going to keep the features that are relevant to this analysis:

# Drop unused columns
cfb_info_df_regression = cfb_info_df[['Latitude', 'Longitude', 'Enrollment', 'Current_conference_2025',
                                      'years_playing', 'years_playing_FBS', 'Stadium_capacity',
                                      'is_aau_member', 'is_R1', 'total_draft_picks_2000_to_2020',
                                      'first_rd_draft_picks_2000_to_2020', 'number_1_draft_picks_2000_to_2020',
                                      'wsj_college_football_revenue_2019', 'wsj_college_football_value_2018',
                                      'wsj_college_football_value_2017', 'bowl_games_played',
                                      'bowl_game_win_pct', 'historical_win_pct', 'total_games_played',
                                      'p_AP_Top_25_2001_to_2021', 'tj_altimore_fan_base_size_millions']]

Now, we can split this data into features, X, and labels, y. In this case, the features are everything except estimated fan base size. That estimate serves as the label.

X = cfb_info_df_regression.drop(['tj_altimore_fan_base_size_millions'], axis = 1)
y = cfb_info_df_regression['tj_altimore_fan_base_size_millions']

Now, we can transform our categorical features into one-hot encoding vectors using pandas. This transforms our column of conference names into several columns of Boolean values.

X = pd.get_dummies(X, columns = ['Current_conference_2025'])
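To make the effect of one-hot encoding concrete, here is a minimal sketch on a toy data frame. The rows and values are made up for illustration and are not taken from the actual data set; only the column name Current_conference_2025 matches the article.

```python
import pandas as pd

# Toy data: the enrollment figures and conference labels are illustrative only
toy_df = pd.DataFrame({
    "Enrollment": [30000, 45000, 28000],
    "Current_conference_2025": ["SEC", "Big Ten", "SEC"],
})

# Each distinct conference name becomes its own indicator column
toy_encoded = pd.get_dummies(toy_df, columns=["Current_conference_2025"])
print(toy_encoded.columns.tolist())
```

The single conference column is replaced by one indicator column per conference, each prefixed with the original column name.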

We can easily perform a 70–30 training-test set split of the data using the train_test_split function in scikit-learn. For our purposes, that gives us 93 training observations and 40 test observations.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
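As a quick sanity check on the split sizes, the sketch below runs the same call on placeholder arrays with 133 rows. The arrays X_demo and y_demo are hypothetical stand-ins, not the real features and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with one row per team (133 rows); the values are arbitrary
X_demo = np.arange(133 * 2).reshape(133, 2)
y_demo = np.arange(133)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# The test size is rounded up, so 30% of 133 rows gives 40 test rows
print(len(X_tr), len(X_te))
```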

We will transform the numeric features using min-max scaling. Min-max scaling rescales each numeric feature to the range 0 to 1 while maintaining the shape of its distribution, which puts features measured in very different units on a comparable scale. It is important to implement this after splitting the data into a training and test set to avoid data leakage. There are pre-defined ways to do this in scikit-learn, including a built-in pipeline where we can define the type of preprocessing, but for our purposes, I defined my own min_max_column function and converted all the columns using this function:

def min_max_column(column):
    column = column.astype('float')
    column_scaled = (column - min(column)) / (max(column) - min(column))
    return column_scaled

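Now, we scale all our features in the train and test sets separately to avoid data leakage (note that each set is therefore scaled by its own minimum and maximum, rather than reusing the training set's values, which is the more common convention):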

for col in X_train.columns:
    X_train[col] = min_max_column(X_train[col])
for col in X_test.columns:
    X_test[col] = min_max_column(X_test[col])
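For comparison, here is a hedged sketch of the scikit-learn route using MinMaxScaler. This is an alternative to the custom function above, not the method used in this article: the scaler learns the minimum and maximum from the training data only and then applies those same values to the test data. The two small data frames are made-up examples.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up stadium capacities standing in for a training and a test set
train_demo = pd.DataFrame({"Stadium_capacity": [30000.0, 60000.0, 100000.0]})
test_demo = pd.DataFrame({"Stadium_capacity": [45000.0, 80000.0]})

scaler = MinMaxScaler()
# Learn min and max from the training set only
train_scaled = scaler.fit_transform(train_demo)
# Reuse the training min and max for the test set (no information flows back from test to train)
test_scaled = scaler.transform(test_demo)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

With this approach a test value can fall slightly outside the 0 to 1 range if it is smaller or larger than anything seen in training, but both sets share one consistent scale.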

Linear Regression

With that, we are ready to run our model. Let's start simple with linear regression.

from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X_train, y_train)

We can measure how well the regression performed using the R-squared metric computed by the score() function.

reg.score(X_test, y_test)
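For readers who want to see what R-squared measures, the sketch below computes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compares the result with scikit-learn's r2_score. The numbers in y_true and y_pred are invented for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Invented actual and predicted values
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

# Residual sum of squares: how far predictions are from the actual values
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: how far the actual values are from their mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_pred))
```

A value of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and negative values are possible for models that do worse than that.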

Unfortunately, the R-squared is only about 0.5, so our prediction isn't great. We can plot the actual fan base size compared to the projected fan base size using plotly. Below, the size of each point in the scatter plot is the size of the absolute percent error. The color indicates whether it was an under prediction. You can visually see that the model performs worst for small fan bases:

import plotly.express as px
# Create a data frame for plot
plot_df = pd.DataFrame(cfb_info_df['Team'].iloc[list(y_test.index)], columns=['Team'])
plot_df['Actual Fan Base Size'] = y_test
plot_df['Predicted Fan Base Size'] = reg.predict(X_test)
plot_df['Absolute Percent Error'] = abs(plot_df['Actual Fan Base Size'] - plot_df['Predicted Fan Base Size'])/plot_df['Actual Fan Base Size']
plot_df['Under Predict'] = plot_df['Actual Fan Base Size'] > plot_df['Predicted Fan Base Size']
fig = px.scatter(plot_df, x='Actual Fan Base Size', y='Predicted Fan Base Size', size = 'Absolute Percent Error',
                 color = 'Under Predict', hover_data = ['Team'])
fig.show()

 

Random Forest

Now that we have seen the performance of our linear model, let's try a more advanced machine learning model called random forest. Random forest relies on a concept called bagging to improve prediction. It essentially produces many different decision trees which each vary slightly due to introduced randomization. It combines what it learns in each of these trees to improve its overall prediction.
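The idea of combining many trees can be illustrated with a small sketch on synthetic data. It assumes scikit-learn's RandomForestRegressor, whose prediction is the average of the predictions of its individual trees; the data set here is generated with make_regression and has nothing to do with the football data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, used only to illustrate the mechanics
X_syn, y_syn = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_syn, y_syn)

# Each fitted tree was trained on a different bootstrap sample of the rows
tree_preds = np.stack([tree.predict(X_syn) for tree in forest.estimators_])

# Averaging the individual trees reproduces the forest's own prediction
avg_of_trees = tree_preds.mean(axis=0)
print(np.allclose(avg_of_trees, forest.predict(X_syn)))
```

Individual trees disagree with each other because of the randomization, and averaging them smooths out much of that variance.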

 

Conveniently, we do not need to scale our data for a random forest model because it does not make predictions based on distance measures. So, we can re-create the unscaled split by calling train_test_split() again with the same random_state:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

Now, we can easily train a random forest model with 100 trees in the forest and unlimited depth in each tree.

from sklearn.ensemble import RandomForestRegressor
reg = RandomForestRegressor(n_estimators=100, max_depth=None, random_state=0)
reg.fit(X_train, y_train)

Let's see if we have any improvement in our R-squared value:

reg.score(X_test, y_test)

The R-squared is now 0.78, which is a great improvement! If we create the exact same plot as above with our new random forest model, we can see that we have much better performance with small fan bases. Importantly, we are also no longer predicting negative sizes of fan bases.

 

So, what is driving the predictions? Random forest is great for explainability because it includes an attribute called feature importance. Similar to a coefficient in linear regression, these are measures of how much the random forest model relied on a feature to make its prediction. Feature importance is a relative metric, so it can tell us how useful a feature is within this model only.

import plotly.express as px
# Create a data frame for plot
plot_df = pd.DataFrame(X_train.columns, columns=['Feature Name'])
plot_df['Importance'] = reg.feature_importances_
fig = px.bar(plot_df, x='Feature Name', y='Importance')
fig.show()
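Because feature importance is relative, it can help to look at the values as a ranked list. The sketch below does this on synthetic data with hypothetical feature names (f0 to f3); in scikit-learn's random forest the importances are normalized so that they sum to 1.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with two informative features out of four; names are placeholders
X_syn, y_syn = make_regression(n_samples=120, n_features=4, n_informative=2, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_syn, y_syn)

# Rank features from most to least important
importance = pd.Series(forest.feature_importances_, index=feature_names).sort_values(ascending=False)
print(importance)
print(importance.sum())
```

Sorting before plotting makes it easier to read off which groups of features the model leaned on most.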

 

Based on the bar chart above comparing our different features, on-field performance seems to be driving predictions. The most important feature was the percentage of weeks a team has been in the AP top 25 in the last 20 years. The next most important group of features is the Wall Street Journal value/revenue data. It follows that a team with more fans makes more money. Then we see that NFL draft picks, stadium capacity (more fans = bigger stadium), bowl game appearances, and historical win percentage are important. Geographic location, enrollment, years playing football, and academic success don't seem to make for good predictors.

I save the best takeaway for last, as it is relevant to the conference realignment discussion. Did you notice the right side of the graph? Conference membership is not an important predictor of fan base size. As we discussed in part 1 of this blog series, correlation is not causation, and the same holds for feature importance. However, it does seem to indicate that you can just as quickly gain or lose fans in any conference. It's all about the product on the field.

Model Improvements

I won't include it in this blog, but we could spend some time improving these models with better feature engineering or hyperparameter tuning. Because our dataset is small, a single train-test split can give a noisy estimate of performance, so it would also be better to report a score from cross-validation. I will save both for another blog.
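As a rough sketch of what cross-validation could look like, the example below scores a random forest with 5-fold cross-validation using scikit-learn's cross_val_score. It runs on synthetic data with 133 rows as a stand-in for the real features and labels, and the choice of 5 folds is an assumption for illustration rather than a recommendation from this analysis.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in with the same number of rows as the team data
X_syn, y_syn = make_regression(n_samples=133, n_features=8, noise=15.0, random_state=0)

model = RandomForestRegressor(n_estimators=100, max_depth=None, random_state=0)

# Shuffle before splitting so each fold is a random subset of rows
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# One R-squared value per fold; each row is used for testing exactly once
scores = cross_val_score(model, X_syn, y_syn, cv=cv, scoring="r2")
print(scores.mean(), scores.std())
```

Reporting the mean and spread across folds gives a more stable picture than a single 70-30 split, at the cost of fitting the model several times.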

Be sure to keep reading on to part 3 of this blog series as we finally dive into some data-driven suggestions for college football conferences.


Interested in my content? Please consider following me on Medium.

Follow me on Twitter: @malloy_giovanni

Any fun use cases for regression in college football that you've found? How would you improve on this model?
