2021 March Madness Kaggle Solution

Our approach is ensemble the shit out of everything. We will hold out the 2015-2019 games for validation purposes. We will prepare and optimize two sets of models - one, an ensemble of general team strength features trained on 1985-2014 games, and two - an ensemble of general team strength features + adjusted ratings based on box-score data trained on 2002-2014 games. Let’s prepare the first approach.

import sys
!{sys.executable} -m pip install pandas sklearn numpy rpy2 trueskill catboost hyperopt ray

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
import math
import trueskill as ts
import warnings
warnings.filterwarnings("ignore") # because pandas indexing gets mad at me

import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter
import rpy2.robjects.packages as rpackages
from rpy2.robjects.vectors import StrVector
Requirement already satisfied: pandas in c:\users\edwar\anaconda3\lib\site-packages (1.1.3)
Requirement already satisfied: sklearn in c:\users\edwar\anaconda3\lib\site-packages (0.0)
Requirement already satisfied: numpy in c:\users\edwar\anaconda3\lib\site-packages (1.19.2)
Requirement already satisfied: rpy2 in c:\users\edwar\anaconda3\lib\site-packages (3.4.2)
Requirement already satisfied: trueskill in c:\users\edwar\anaconda3\lib\site-packages (0.4.5)
Requirement already satisfied: catboost in c:\users\edwar\anaconda3\lib\site-packages (0.24.4)
Requirement already satisfied: hyperopt in c:\users\edwar\anaconda3\lib\site-packages (0.2.5)
Requirement already satisfied: ray in c:\users\edwar\anaconda3\lib\site-packages (1.2.0)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\users\edwar\anaconda3\lib\site-packages (from pandas) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in c:\users\edwar\anaconda3\lib\site-packages (from pandas) (2020.1)
Requirement already satisfied: scikit-learn in c:\users\edwar\anaconda3\lib\site-packages (from sklearn) (0.23.2)
Requirement already satisfied: jinja2 in c:\users\edwar\anaconda3\lib\site-packages (from rpy2) (2.11.2)
Requirement already satisfied: cffi>=1.10.0 in c:\users\edwar\anaconda3\lib\site-packages (from rpy2) (1.14.3)
Requirement already satisfied: tzlocal in c:\users\edwar\anaconda3\lib\site-packages (from rpy2) (2.1)
Requirement already satisfied: six in c:\users\edwar\anaconda3\lib\site-packages (from trueskill) (1.15.0)
Requirement already satisfied: matplotlib in c:\users\edwar\anaconda3\lib\site-packages (from catboost) (3.3.2)
Requirement already satisfied: plotly in c:\users\edwar\anaconda3\lib\site-packages (from catboost) (4.14.3)
Requirement already satisfied: scipy in c:\users\edwar\anaconda3\lib\site-packages (from catboost) (1.5.2)
Requirement already satisfied: graphviz in c:\users\edwar\anaconda3\lib\site-packages (from catboost) (0.16)
Requirement already satisfied: cloudpickle in c:\users\edwar\anaconda3\lib\site-packages (from hyperopt) (1.6.0)
Requirement already satisfied: future in c:\users\edwar\anaconda3\lib\site-packages (from hyperopt) (0.18.2)
Requirement already satisfied: tqdm in c:\users\edwar\anaconda3\lib\site-packages (from hyperopt) (4.50.2)
Requirement already satisfied: networkx>=2.2 in c:\users\edwar\anaconda3\lib\site-packages (from hyperopt) (2.5)
Requirement already satisfied: jsonschema in c:\users\edwar\anaconda3\lib\site-packages (from ray) (3.2.0)
Requirement already satisfied: aiohttp-cors in c:\users\edwar\anaconda3\lib\site-packages (from ray) (0.7.0)
Requirement already satisfied: grpcio>=1.28.1 in c:\users\edwar\anaconda3\lib\site-packages (from ray) (1.36.1)
Requirement already satisfied: protobuf>=3.8.0 in c:\users\edwar\anaconda3\lib\site-packages (from ray) (3.14.0)
Requirement already satisfied: prometheus-client>=0.7.1 in c:\users\edwar\anaconda3\lib\site-packages (from ray) (0.8.0)
Requirement already satisfied: colorama in c:\users\edwar\anaconda3\lib\site-packages (from ray) (0.4.4)
Requirement already satisfied: colorful in c:\users\edwar\anaconda3\lib\site-packages (from ray) (0.5.4)
Requirement already satisfied: opencensus in c:\users\edwar\anaconda3\lib\site-packages (from ray) (0.7.12)
Requirement already satisfied: py-spy>=0.2.0 in c:\users\edwar\anaconda3\lib\site-packages (from ray) (0.3.4)
Requirement already satisfied: gpustat in c:\users\edwar\anaconda3\lib\site-packages (from ray) (0.6.0)
Requirement already satisfied: msgpack<2.0.0,>=1.0.0 in c:\users\edwar\anaconda3\lib\site-packages (from ray) (1.0.0)
Requirement already satisfied: filelock in c:\users\edwar\anaconda3\lib\site-packages (from ray) (3.0.12)
Requirement already satisfied: click>=7.0 in c:\users\edwar\anaconda3\lib\site-packages (from ray) (7.1.2)
Requirement already satisfied: redis>=3.5.0 in c:\users\edwar\anaconda3\lib\site-packages (from ray) (3.5.3)
Requirement already satisfied: aioredis in c:\users\edwar\anaconda3\lib\site-packages (from ray) (1.3.1)
Requirement already satisfied: aiohttp in c:\users\edwar\anaconda3\lib\site-packages (from ray) (3.7.4.post0)
Requirement already satisfied: requests in c:\users\edwar\anaconda3\lib\site-packages (from ray) (2.24.0)
Requirement already satisfied: pyyaml in c:\users\edwar\anaconda3\lib\site-packages (from ray) (5.3.1)
Requirement already satisfied: joblib>=0.11 in c:\users\edwar\anaconda3\lib\site-packages (from scikit-learn->sklearn) (0.17.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\edwar\anaconda3\lib\site-packages (from scikit-learn->sklearn) (2.1.0)
Requirement already satisfied: MarkupSafe>=0.23 in c:\users\edwar\anaconda3\lib\site-packages (from jinja2->rpy2) (1.1.1)
Requirement already satisfied: pycparser in c:\users\edwar\anaconda3\lib\site-packages (from cffi>=1.10.0->rpy2) (2.20)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\edwar\anaconda3\lib\site-packages (from matplotlib->catboost) (1.3.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in c:\users\edwar\anaconda3\lib\site-packages (from matplotlib->catboost) (2.4.7)
Requirement already satisfied: certifi>=2020.06.20 in c:\users\edwar\anaconda3\lib\site-packages (from matplotlib->catboost) (2020.6.20)
Requirement already satisfied: cycler>=0.10 in c:\users\edwar\anaconda3\lib\site-packages (from matplotlib->catboost) (0.10.0)
Requirement already satisfied: pillow>=6.2.0 in c:\users\edwar\anaconda3\lib\site-packages (from matplotlib->catboost) (8.0.1)
Requirement already satisfied: retrying>=1.3.3 in c:\users\edwar\anaconda3\lib\site-packages (from plotly->catboost) (1.3.3)
Requirement already satisfied: decorator>=4.3.0 in c:\users\edwar\anaconda3\lib\site-packages (from networkx>=2.2->hyperopt) (4.4.2)
Requirement already satisfied: setuptools in c:\users\edwar\anaconda3\lib\site-packages (from jsonschema->ray) (50.3.1.post20201107)
Requirement already satisfied: attrs>=17.4.0 in c:\users\edwar\anaconda3\lib\site-packages (from jsonschema->ray) (20.3.0)
Requirement already satisfied: pyrsistent>=0.14.0 in c:\users\edwar\anaconda3\lib\site-packages (from jsonschema->ray) (0.17.3)
Requirement already satisfied: opencensus-context==0.1.2 in c:\users\edwar\anaconda3\lib\site-packages (from opencensus->ray) (0.1.2)
Requirement already satisfied: google-api-core<2.0.0,>=1.0.0 in c:\users\edwar\anaconda3\lib\site-packages (from opencensus->ray) (1.26.1)
Requirement already satisfied: psutil in c:\users\edwar\anaconda3\lib\site-packages (from gpustat->ray) (5.7.2)
Requirement already satisfied: nvidia-ml-py3>=7.352.0 in c:\users\edwar\anaconda3\lib\site-packages (from gpustat->ray) (7.352.0)
Requirement already satisfied: blessings>=1.6 in c:\users\edwar\anaconda3\lib\site-packages (from gpustat->ray) (1.7)
Requirement already satisfied: hiredis in c:\users\edwar\anaconda3\lib\site-packages (from aioredis->ray) (1.1.0)
Requirement already satisfied: async-timeout in c:\users\edwar\anaconda3\lib\site-packages (from aioredis->ray) (3.0.1)
Requirement already satisfied: multidict<7.0,>=4.5 in c:\users\edwar\anaconda3\lib\site-packages (from aiohttp->ray) (5.1.0)
Requirement already satisfied: typing-extensions>=3.6.5 in c:\users\edwar\anaconda3\lib\site-packages (from aiohttp->ray) (3.7.4.3)
Requirement already satisfied: chardet<5.0,>=2.0 in c:\users\edwar\anaconda3\lib\site-packages (from aiohttp->ray) (3.0.4)
Requirement already satisfied: yarl<2.0,>=1.0 in c:\users\edwar\anaconda3\lib\site-packages (from aiohttp->ray) (1.6.3)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in c:\users\edwar\anaconda3\lib\site-packages (from requests->ray) (1.25.11)
Requirement already satisfied: idna<3,>=2.5 in c:\users\edwar\anaconda3\lib\site-packages (from requests->ray) (2.10)
Requirement already satisfied: google-auth<2.0dev,>=1.21.1 in c:\users\edwar\anaconda3\lib\site-packages (from google-api-core<2.0.0,>=1.0.0->opencensus->ray) (1.28.0)
Requirement already satisfied: packaging>=14.3 in c:\users\edwar\anaconda3\lib\site-packages (from google-api-core<2.0.0,>=1.0.0->opencensus->ray) (20.4)
Requirement already satisfied: googleapis-common-protos<2.0dev,>=1.6.0 in c:\users\edwar\anaconda3\lib\site-packages (from google-api-core<2.0.0,>=1.0.0->opencensus->ray) (1.53.0)
Requirement already satisfied: rsa<5,>=3.1.4; python_version >= "3.6" in c:\users\edwar\anaconda3\lib\site-packages (from google-auth<2.0dev,>=1.21.1->google-api-core<2.0.0,>=1.0.0->opencensus->ray) (4.7.2)
Requirement already satisfied: pyasn1-modules>=0.2.1 in c:\users\edwar\anaconda3\lib\site-packages (from google-auth<2.0dev,>=1.21.1->google-api-core<2.0.0,>=1.0.0->opencensus->ray) (0.2.8)
Requirement already satisfied: cachetools<5.0,>=2.0.0 in c:\users\edwar\anaconda3\lib\site-packages (from google-auth<2.0dev,>=1.21.1->google-api-core<2.0.0,>=1.0.0->opencensus->ray) (4.2.1)
Requirement already satisfied: pyasn1>=0.1.3 in c:\users\edwar\anaconda3\lib\site-packages (from rsa<5,>=3.1.4; python_version >= "3.6"->google-auth<2.0dev,>=1.21.1->google-api-core<2.0.0,>=1.0.0->opencensus->ray) (0.4.8)
compact_results = pd.read_csv('data/MRegularSeasonCompactResults.csv')
compact_results['WTeamID'] = compact_results['WTeamID'].astype('str')
compact_results['LTeamID'] = compact_results['LTeamID'].astype('str')
compact_results.head()

Season DayNum WTeamID WScore LTeamID LScore WLoc NumOT
0 1985 20 1228 81 1328 64 N 0
1 1985 25 1106 77 1354 70 H 0
2 1985 25 1112 63 1223 56 H 0
3 1985 25 1165 70 1432 54 H 0
4 1985 25 1192 86 1447 74 H 0

I am not a big fan of this format, so I will quickly write a function to re-oriented our dataframe around home/away teams rather than winning/losing teams.

def refactor_games_compact(dataframe):
    df = dataframe
    neutral_winners = df.loc[df.WLoc == 'N']
    home_winners = df.loc[df.WLoc == 'H']
    away_winners = df.loc[df.WLoc == 'A']
    
    neutral_winners['NeutralFlg'] = True
    neutral_winners.columns = ['Season','DayNum','HTeamID','HScore','ATeamID','AScore','WLoc','NumOT','NeutralFlg']
    
    home_winners['NeutralFlg'] = False
    home_winners.columns = ['Season','DayNum','HTeamID','HScore','ATeamID','AScore','WLoc','NumOT','NeutralFlg']
    
    away_winners['NeutralFlg'] = False
    away_winners.columns = ['Season','DayNum','ATeamID','AScore','HTeamID','HScore','WLoc','NumOT','NeutralFlg']
    
    df = neutral_winners.append(home_winners.append(away_winners))
    df = df.drop(columns=['WLoc'])
    df = df.sort_values(by=['Season','DayNum'])
    return df
    
compact_results = refactor_games_compact(compact_results)
compact_results.head()

Season DayNum HTeamID HScore ATeamID AScore NumOT NeutralFlg
0 1985 20 1228 81 1328 64 0 True
6 1985 25 1228 64 1226 44 0 True
7 1985 25 1242 58 1268 56 0 True
11 1985 25 1344 75 1438 71 0 True
13 1985 25 1412 70 1397 65 0 True

Our first step will be to write a function to rescore games based on home court advantage. To estimate home court advantage, we will use the LRMC methodology .

def calc_hfa(dataframe):
    df = dataframe
    home_games = df.loc[~df['NeutralFlg']]
    
    X_df = home_games
    Y_df = home_games
    
    X_df['HDiff'] = X_df['HScore'] - X_df['AScore']
    X_df.loc[X_df['NumOT'] > 0,'HDiff'] = 0 # any overtime games are treated as ties
    X_df = X_df[['HTeamID','ATeamID','HDiff']]
    
    Y_df['HWin'] = (Y_df['AScore'] > Y_df['HScore']) * 1
    Y_df = Y_df[['HTeamID','ATeamID','HWin']]
    
    log_df = pd.merge(X_df, Y_df, left_on=['HTeamID','ATeamID'],right_on=['ATeamID','HTeamID'])
    X = log_df['HDiff']
    Y = log_df['HWin']
    
    X = np.array(X).reshape(-1,1)
    
    logreg = LogisticRegression(random_state=0).fit(X, Y)
    
    a = float(logreg.coef_[0])
    b = float(logreg.intercept_)
    h = (-b / a) / 2
    
    return {'a':a,'b':b,'h':h}

def rescore_hfa(dataframe):
    df = dataframe
    hfa_mod = calc_hfa(df)
    
    h_2 = hfa_mod['h'] / 2
    
    df['HScore'] = df['HScore'] - h_2 * df['NeutralFlg']
    df['AScore'] = df['AScore'] + h_2 * df['NeutralFlg']
    
    return df
    
testing_df = compact_results.loc[compact_results['Season'] == 2014]

rescore_hfa(testing_df).head()

Season DayNum HTeamID HScore ATeamID AScore NumOT NeutralFlg
129204 2014 4 1102 75.734604 1119 71.265396 0 True
129211 2014 4 1124 68.734604 1160 63.265396 0 True
129220 2014 4 1163 74.734604 1268 80.265396 0 True
129223 2014 4 1184 79.734604 1198 64.265396 0 True
129240 2014 4 1258 74.734604 1213 78.265396 0 True

We can call on this function when working with team strength methodologies that are not adjusted to account for home field advantage. Additionally, this should aid us in scoring 2021 teams, when home field advantage is not as concrete as it has been in previous years.

We will also write another function to adjust the scoring of overtime games. To be points per 40 minutes. This will come in handy when estimating team strength from points-based metrics.

def rescore_ot(dataframe):
    df = dataframe
    df.HScore = df.HScore * (40 + df.NumOT * 5) / 40
    df.AScore = df.AScore * (40 + df.NumOT * 5) / 40
    
    return df

rescore_ot(testing_df.loc[testing_df['NumOT'] > 0]).head()

Season DayNum HTeamID HScore ATeamID AScore NumOT NeutralFlg
129265 2014 4 1344 92.25000 1130 87.75000 1 False
129227 2014 4 1414 109.12500 1201 110.25000 1 False
129251 2014 4 1330 75.37500 1283 88.87500 1 False
129276 2014 4 1274 69.75000 1383 74.25000 1 False
129344 2014 5 1464 79.57643 1198 84.67357 1 True

We will compute the following metrics to estimate team strength:

First is Elo. Normally we would tune Elo paramters, but that process can be somewhat intensive, especially with large datasets such as this – hence, I will be using a premade series of parameters to estimate team strength. Note that with Elo, you can normally adjust for home field advantage with a constant factor, and that availability still exists with this function, but passing in arguments for rescoring overrides this factor.

def calc_elo(dataframe, avg_elo = 1500, elo_width = 400, mov_factor = 8, hfa = 135, hfa_rescore = False, ot_rescore = False):
    df = dataframe
    
    if ot_rescore:
        df = rescore_ot(df)
    
    if hfa_rescore:
        df = rescore_hfa(df)
        
    df = df.sort_values(by=['Season','DayNum'])
    
    teams = list(set(df['HTeamID'].append(df['ATeamID'])))
    elos = {team:avg_elo for team in teams}
    
    for index, row in df.iterrows():
        home_elo = elos[row['HTeamID']]
        away_elo = elos[row['ATeamID']]
        
        home_elo_adj = home_elo
        if not row['NeutralFlg'] and not hfa_rescore and not ot_rescore:
            home_elo_adj += hfa
            
        # calculate expected score
        home_prob = 1.0/(1+10**((away_elo - home_elo_adj)/elo_width))
        away_prob = 1 - home_prob
        
        # calculate k factor as a function of margin of the log of margin of victory
        home_diff = row['HScore'] - row['AScore']
        k_factor = math.log(abs(home_diff)+1) * mov_factor
        
        home_win = (row['HScore'] > row['AScore']) * 1
        away_win = (row['AScore'] > row['HScore']) * 1
        
        # update elos
        elos[row['HTeamID']] = home_elo + k_factor * (home_win - home_prob)
        elos[row['ATeamID']] = away_elo + k_factor * (away_win - away_prob)
    
    elo_df = pd.DataFrame(elos,index=[0]).transpose()
    elo_df.columns = ['Elo']
    
    return elo_df

calc_elo(testing_df,hfa_rescore=True,ot_rescore=True).head()

Elo
1106 1500.823216
1219 1505.567477
1285 1517.728113
1350 1526.039480
1249 1336.231722

Next up is LRMC. We’re already halfway done with our code adjusting for HFA, we will simply take our logistic model and apply it to all of the games in our dataframe (because adjusting for HFA is integral to LRMC, it is not an optional argument).

def calc_lrmc(dataframe, ot_rescore=True):
    df = dataframe
    
    if ot_rescore:
        df = rescore_ot(df)
        
    hfa_mod = calc_hfa(df)
    h = hfa_mod['h']
    a = hfa_mod['a']
    b = hfa_mod['b']
    
    teams = list(set(df['HTeamID'].append(df['ATeamID'])))
    
    n_teams = len(teams)
    p = np.zeros((n_teams,n_teams))
    n_games = np.zeros(n_teams)
    
    for index, row in df.iterrows():
        home_team_ndx = teams.index(row['HTeamID'])
        away_team_ndx = teams.index(row['ATeamID'])
        
        # calculate r_x
        spread = row['HScore'] - row['AScore'] + h * (row['NeutralFlg'])
        r_x = math.exp(a * spread + b) / (1 + math.exp(a * spread + b))
        
        # update respective matrices
        n_games[home_team_ndx] += 1
        n_games[away_team_ndx] += 1
        
        p[home_team_ndx, away_team_ndx] += 1 - r_x
        p[away_team_ndx, home_team_ndx] += r_x
        p[home_team_ndx, home_team_ndx] += r_x
        p[away_team_ndx, away_team_ndx] += 1 - r_x
        
    # solve matrix
    p = p / n_games[:,None]
    prior = n_teams - np.array(list(range(n_teams)))
    steady_state = np.linalg.matrix_power(p, 1000)
    rating = prior.dot(steady_state)
    
    return pd.DataFrame({'LRMC':rating}, index = teams)

calc_lrmc(testing_df).head()

LRMC
1106 132.928218
1219 137.471394
1285 157.845504
1350 203.865228
1249 99.113128

SRS on deck.

def calc_srs(dataframe, hfa_rescore=True, ot_rescore=True):
    df = dataframe
    
    if hfa_rescore:
        df = rescore_hfa(df)
    if ot_rescore:
        df = rescore_ot(df)
        
    df['HSpread'] = df['HScore'] - df['AScore']
    df['ASpread'] = -1 * df['HSpread']
    
    teams = pd.concat([
        df[['HTeamID','HSpread','ATeamID']].rename(columns={'HTeamID':'TeamID','HSpread':'Spread','ATeamID':'OppID'}),
        df[['ATeamID','ASpread','HTeamID']].rename(columns={'ATeamID':'TeamID','ASpread':'Spread','HTeamID':'OppID'})
    ])
    
    spreads = teams.groupby('TeamID').Spread.mean()
    
    terms = []
    solutions = []

    for team in spreads.keys():
        row = []
        opps = list(teams[teams['TeamID'] == team]['OppID'])

        for opp in spreads.keys():
            if opp == team:
                row.append(1)
            elif opp in opps:
                row.append(-1.0 / len(opps))
            else:
                row.append(0)

        terms.append(row)
        solutions.append(spreads[team])
        
    solutions = np.linalg.solve(np.array(terms), np.array(solutions))
    return pd.DataFrame({'SRS':solutions},index = list(spreads.keys()))
    
calc_srs(testing_df).head()

SRS
1101 -18.642546
1102 -6.101457
1103 2.412745
1104 6.819273
1105 -5.714953

RPI’s usefulness is highly debatable, there’s a very good reason why the NCAA does not use it in tournament rankings and instead relies on NET ratings now. Still, we’re not too picky, and we can always throw it out when we reach the feature selection stage of the process. We’ll adjust for OT but not for HFA when it comes to calculating RPI (RPI likes to think it already incorporates HFA, even though it really doesn’t, but we’ll humor it for the time being).

def calc_rpi(dataframe, ot_rescore = True):
    df = dataframe
    
    if ot_rescore:
        df = rescore_ot(df)
        
    df['HWin'] = df['HScore'].ge(df['AScore']) * 1
    df['AWin'] = 1 - df['HWin']
    df['HLoss'] = 1 - df['HWin']
    df['ALoss'] = df['HWin']
    
    teams = pd.concat([
        df[['HTeamID','ATeamID','HWin','HLoss']].rename(columns={'HTeamID':'TeamID','ATeamID':'OppID','HWin':'HWin',
                                                                 'HLoss':'HLoss'}),
        df[['ATeamID','HTeamID','AWin','ALoss']].rename(columns={'ATeamID':'TeamID','HTeamID':'OppID','AWin':'AWin',
                                                                 'ALoss':'ALoss'})
    ])
    
    teams = teams.fillna(0)
    
    rpi = teams.groupby('TeamID')[['HWin','HLoss','AWin','ALoss']].sum()
    rpi['W'] = rpi['HWin'] + rpi['AWin']
    rpi['L'] = rpi['HLoss'] + rpi['ALoss']
    rpi['WP'] = rpi['W'] / (rpi['W'] + rpi['L'])
    rpi['AdjWP'] = (rpi['HWin'] * 0.6 + rpi['AWin'] * 1.4) / (rpi['HWin'] * 0.6 + rpi['AWin'] * 1.4 + rpi['HLoss'] * 1.4 + rpi['ALoss'] * 0.6)
    
    for index, row in rpi.iterrows():
        temp_teams = teams.loc[teams['TeamID'] == index]
        rpi.loc[index,'OWP'] = float(pd.merge(temp_teams, rpi, left_on = 'OppID', right_on = 'TeamID')['WP'].mean())
        
    for index, row in rpi.iterrows():
        temp_teams = teams.loc[teams['TeamID'] == index]
        rpi.loc[index,'OOWP'] = float(pd.merge(temp_teams, rpi, left_on = 'OppID', right_on = 'TeamID')['OWP'].mean())
        
    rpi['RPI'] = 0.25 * rpi['AdjWP'] + 0.5 * rpi['OWP'] + 0.25 * rpi['OOWP']
    
    return pd.DataFrame({'RPI':rpi['RPI']},index = list(rpi.index))

calc_rpi(testing_df).head()

RPI
1101 0.380277
1102 0.465483
1103 0.538925
1104 0.561932
1105 0.443962

Next is TrueSkill, which might actually be more valuable than Elo given the different numbers of games teams played. We’ll just record the Mu values for each Teams' TS rating.

def calc_ts(dataframe, hfa_rescore = True, ot_rescore = True):
    df = dataframe
    
    if hfa_rescore:
        df = rescore_hfa(df)
    if ot_rescore:
        df = rescore_ot(df)
        
    teams = list(set(df['HTeamID'].append(df['ATeamID'])))
    ts_dict = {}
    
    for team in teams:
        ts_dict[team] = {'TS':ts.Rating()}
        
    for index, row in df.iterrows():
        home_rating = ts_dict[row['HTeamID']]['TS']
        away_rating = ts_dict[row['ATeamID']]['TS']
        if row['HScore'] > row['AScore']:
            home_rating, away_rating = ts.rate_1vs1(home_rating, away_rating)
        elif row['AScore'] > row['HScore']:
            away_rating, home_rating = ts.rate_1vs1(away_rating, home_rating)
        else:
            home_rating, away_rating = ts.rate_1vs1(home_rating, away_rating, drawn = True)

        ts_dict[row['HTeamID']]['TS'] = home_rating
        ts_dict[row['ATeamID']]['TS'] = away_rating

    ts_list = [ts_dict[team]['TS'].mu for team in teams]
    
    return pd.DataFrame({'TS':ts_list},index = teams)

calc_ts(testing_df).head()

TS
1106 21.073280
1219 21.556950
1285 24.166877
1350 28.791182
1249 12.829917

Next up, we’re generating some mixed model ratings for each team. To do this, we’ll quickly, temporarily borrow from R’s lme4 package (because Python really doesn’t have any good mixed model packages).

utils = rpackages.importr('utils')
base = rpackages.importr('base')

utils.chooseCRANmirror(ind=1) # select the first mirror in the list

packnames = ('lme4', 'Matrix')

from rpy2.robjects.vectors import StrVector

names_to_install = [x for x in packnames if not rpackages.isinstalled(x)]
    
if len(names_to_install) > 0:
    utils.install_packages(StrVector(names_to_install))

ro.r('''
    gen_ratings = function(df){
        model <- lme4::lmer(Score ~ 0 + Loc + (1|Offense) + (1|Defense),
                            data = df)

        offense_ratings = lme4::ranef(model)$Offense
        offense_ratings$Team = row.names(lme4::ranef(model)$Offense)
        colnames(offense_ratings) = c('Off','TeamID')
        defense_ratings = lme4::ranef(model)$Defense
        defense_ratings$Team = row.names(lme4::ranef(model)$Defense)
        colnames(defense_ratings) = c('Def','TeamID')
        
        rtg_df = merge(offense_ratings, defense_ratings,on='TeamID')
        return(rtg_df)
    }
''')

gen_ratings = ro.globalenv['gen_ratings']

def calc_lme(dataframe, hfa_rescore = True, ot_rescore = True):
    df = dataframe
    
    if hfa_rescore:
        df = rescore_hfa(df)
    if ot_rescore:
        df = rescore_ot(df)
        
    home_scoring = df[['HTeamID','ATeamID','HScore','NeutralFlg']].rename(columns={'HTeamID':'Offense','ATeamID':'Defense',
                                                                                   'HScore':'Score','NeutralFlg':'NeutralFlg'})
    home_scoring['Loc'] = ''
    home_scoring.loc[home_scoring['NeutralFlg'],'Loc'] = 'N'
    home_scoring.loc[~home_scoring['NeutralFlg'],'Loc'] = 'H'
    
    away_scoring = df[['ATeamID','HTeamID','AScore','NeutralFlg']].rename(columns={'ATeamID':'Offense','HTeamID':'Defense',
                                                                                   'AScore':'Score','NeutralFlg':'NeutralFlg'})
    away_scoring['Loc'] = ''
    away_scoring.loc[away_scoring['NeutralFlg'],'Loc'] = 'N'
    away_scoring.loc[~away_scoring['NeutralFlg'],'Loc'] = 'A'    
    
    scoring = pd.concat([home_scoring,away_scoring])
    
    with localconverter(ro.default_converter + pandas2ri.converter):
        df_r = ro.conversion.py2rpy(scoring)
    
    df_pd = gen_ratings(df_r)
    
    with localconverter(ro.default_converter + pandas2ri.converter):
          rtg_df = ro.conversion.rpy2py(df_pd)
            
    return pd.DataFrame({'Off':list(rtg_df['Off']),'Def':list(rtg_df['Def'])},index=list(rtg_df['TeamID']))
    

calc_lme(testing_df).head()

Off Def
1101 0.037220 3.001805
1102 -2.803390 -0.991113
1103 -0.359363 -1.211838
1104 1.990535 0.245586
1105 -2.146115 -1.168408

Home stretch: Colley Matrix Method!

def calc_colley(dataframe, hfa_rescore = True, ot_rescore = True):
    df = dataframe
    
    if hfa_rescore:
        df = rescore_hfa(df)
    if ot_rescore:
        df = rescore_ot(df)
        
    teams = list(set(df['HTeamID'].append(df['ATeamID'])))
    n_teams = len(teams)
    c_mat = np.identity(n_teams) * 2
    b_mat = np.zeros(n_teams)
    
    for index, row in df.iterrows():
        home_team_ndx = teams.index(row['HTeamID'])
        away_team_ndx = teams.index(row['ATeamID'])
        spread = row['HScore'] - row['AScore']

        c_mat[home_team_ndx, home_team_ndx] += 1
        c_mat[away_team_ndx, away_team_ndx] += 1
        c_mat[home_team_ndx, away_team_ndx] -= 1
        c_mat[away_team_ndx, home_team_ndx] -= 1

        b_mat[home_team_ndx] += (row['HScore'] > row['AScore']) * 1 + (row['AScore'] > row['HScore']) * -1
        b_mat[away_team_ndx] += (row['AScore'] > row['HScore']) * 1 + (row['HScore'] > row['AScore']) * -1

    b_mat = b_mat / 2 + 1
    ranks = np.linalg.solve(c_mat,b_mat).tolist()
    return pd.DataFrame({'Colley':ranks},index = teams)

calc_colley(testing_df).head()

Colley
1106 0.429206
1219 0.427629
1285 0.508115
1350 0.670517
1249 0.091091

Finally, we’ll lift a couple prior pieces of code to calculate win percent and MOV.

def calc_mov(dataframe, hfa_rescore=True, ot_rescore=True):
    df = dataframe
    
    if hfa_rescore:
        df = rescore_hfa(df)
    if ot_rescore:
        df = rescore_ot(df)
        
    df['HSpread'] = df['HScore'] - df['AScore']
    df['ASpread'] = -1 * df['HSpread']
    
    teams = pd.concat([
        df[['HTeamID','HSpread','ATeamID']].rename(columns={'HTeamID':'TeamID','HSpread':'Spread','ATeamID':'OppID'}),
        df[['ATeamID','ASpread','HTeamID']].rename(columns={'ATeamID':'TeamID','ASpread':'Spread','HTeamID':'OppID'})
    ])
    
    spreads = teams.groupby('TeamID').Spread.mean()
    return pd.DataFrame({'MOV':spreads},index = list(spreads.index))

calc_mov(testing_df).head()

MOV
1101 -17.664136
1102 -6.344923
1103 4.572178
1104 11.177569
1105 -3.956765
def calc_wp(dataframe, hfa_rescore=True, ot_rescore=True):
    df = dataframe
    
    if hfa_rescore:
        df = rescore_hfa(df)
    if ot_rescore:
        df = rescore_ot(df)
        
    df['HWin'] = df['HScore'].ge(df['AScore']) * 1
    df['AWin'] = 1 - df['HWin']
    df['HLoss'] = 1 - df['HWin']
    df['ALoss'] = df['HWin']
    
    teams = pd.concat([
        df[['HTeamID','ATeamID','HWin','HLoss']].rename(columns={'HTeamID':'TeamID','ATeamID':'OppID','HWin':'HWin',
                                                                 'HLoss':'HLoss'}),
        df[['ATeamID','HTeamID','AWin','ALoss']].rename(columns={'ATeamID':'TeamID','HTeamID':'OppID','AWin':'AWin',
                                                                 'ALoss':'ALoss'})
    ])
    
    teams = teams.fillna(0)
    
    wp = teams.groupby('TeamID')[['HWin','HLoss','AWin','ALoss']].sum()
    wp['W'] = wp['HWin'] + wp['AWin']
    wp['L'] = wp['HLoss'] + wp['ALoss']
    wp['WP'] = wp['W'] / (wp['W'] + wp['L'])
    
    return pd.DataFrame({'WP':wp['WP']},index = list(wp.index))

calc_wp(testing_df).head()

WP
1101 0.095238
1102 0.357143
1103 0.666667
1104 0.516129
1105 0.428571

Alright, that’s the bulk of the heavy lifting done for generating team ratings from only schedule data. Next, we’re going to estimate team factors from advanced box score data, which is only available from 2002-on. We’ll use lme4 for this as well.

First, we’ll read in our data.

detailed_results = pd.read_csv('data/MRegularSeasonDetailedResults.csv')
detailed_results['WTeamID'] = detailed_results['WTeamID'].astype('str')
detailed_results['LTeamID'] = detailed_results['LTeamID'].astype('str')
detailed_results.head()

Season DayNum WTeamID WScore LTeamID LScore WLoc NumOT WFGM WFGA ... LFGA3 LFTM LFTA LOR LDR LAst LTO LStl LBlk LPF
0 2003 10 1104 68 1328 62 N 0 27 58 ... 10 16 22 10 22 8 18 9 2 20
1 2003 10 1272 70 1393 63 N 0 26 62 ... 24 9 20 20 25 7 12 8 6 16
2 2003 11 1266 73 1437 61 N 0 24 58 ... 26 14 23 31 22 9 12 2 5 23
3 2003 11 1296 56 1457 50 N 0 18 38 ... 22 8 15 17 20 9 19 4 3 23
4 2003 11 1400 77 1208 71 N 0 30 61 ... 16 17 27 21 15 12 10 7 1 14

5 rows × 34 columns

We’ll first write a quick function converting these results to our preferred format.

def refactor_games_detailed(dataframe):
    df = dataframe
    neutral_winners = df.loc[df.WLoc == 'N']
    home_winners = df.loc[df.WLoc == 'H']
    away_winners = df.loc[df.WLoc == 'A']
    
    neutral_winners['NeutralFlg'] = True
    neutral_winners.columns = ['Season','DayNum','HTeamID','HScore','ATeamID','AScore','WLoc','NumOT',
                               'HFGM','HFGA','HFGM3','HFGA3','HFTM','HFTA','HOR','HDR','HAST','HTO','HSTL','HBLK','HPF',
                               'AFGM','AFGA','AFGM3','AFGA3','AFTM','AFTA','AOR','ADR','AAST','ATO','ASTL','ABLK','APF',
                               'NeutralFlg']
    
    home_winners['NeutralFlg'] = False
    home_winners.columns = ['Season','DayNum','HTeamID','HScore','ATeamID','AScore','WLoc','NumOT',
                            'HFGM','HFGA','HFGM3','HFGA3','HFTM','HFTA','HOR','HDR','HAST','HTO','HSTL','HBLK','HPF',
                            'AFGM','AFGA','AFGM3','AFGA3','AFTM','AFTA','AOR','ADR','AAST','ATO','ASTL','ABLK','APF',
                            'NeutralFlg']
    
    away_winners['NeutralFlg'] = False
    away_winners.columns = ['Season','DayNum','HTeamID','HScore','ATeamID','AScore','WLoc','NumOT',
                            'AFGM','AFGA','AFGM3','AFGA3','AFTM','AFTA','AOR','ADR','AAST','ATO','ASTL','ABLK','APF',
                            'HFGM','HFGA','HFGM3','HFGA3','HFTM','HFTA','HOR','HDR','HAST','HTO','HSTL','HBLK','HPF',
                            'NeutralFlg']
    
    df = neutral_winners.append(home_winners.append(away_winners))
    df = df.drop(columns=['WLoc'])
    df = df.sort_values(by=['Season','DayNum'])
    return df
    
detailed_results = refactor_games_detailed(detailed_results)
detailed_results.head()

Season DayNum HTeamID HScore ATeamID AScore NumOT HFGM HFGA HFGM3 ... AFTM AFTA AOR ADR AAST ATO ASTL ABLK APF NeutralFlg
0 2003 10 1104 68 1328 62 0 27 58 3 ... 16 22 10 22 8 18 9 2 20 True
1 2003 10 1272 70 1393 63 0 26 62 8 ... 9 20 20 25 7 12 8 6 16 True
2 2003 11 1266 73 1437 61 0 24 58 8 ... 14 23 31 22 9 12 2 5 23 True
3 2003 11 1296 56 1457 50 0 18 38 3 ... 8 15 17 20 9 19 4 3 23 True
4 2003 11 1400 77 1208 71 0 30 61 6 ... 17 27 21 15 12 10 7 1 14 True

5 rows × 34 columns

Our next function will generate a dataframe with several rating columns of interest, adjusted for opponent and home field advantage, which should give us an estimate of each team’s overall strength in each box-score category.

ro.r('''
    gen_custom_rating = function(df, term){
        formula = as.formula(paste0(term,' ~ 0 + Loc + (1|TeamID) + (1|OppID)'))
        model <- lme4::lmer(formula,
                            data = df)

        made_ratings = lme4::ranef(model)$TeamID
        made_ratings$Team = row.names(lme4::ranef(model)$TeamID)
        colnames(made_ratings) = c(paste0(term,'Made'),'TeamID')
        allowed_ratings = lme4::ranef(model)$OppID
        allowed_ratings$Team = row.names(lme4::ranef(model)$OppID)
        colnames(allowed_ratings) = c(paste0(term,'Allowed'),'TeamID')
        
        rtg_df = merge(made_ratings, allowed_ratings ,on='TeamID')
        return(rtg_df)
    }
''')

gen_custom_rating = ro.globalenv['gen_custom_rating']

def calc_box_score_rtgs(dataframe, raw = False):
    df = dataframe
    
    # calculate possessions as our denominator
    df['HPOS'] = (df['HFGA'] - df['HOR']) + df['HTO'] + (0.44 * df['HFTA'])
    df['APOS'] = (df['AFGA'] - df['AOR']) + df['ATO'] + (0.44 * df['AFTA'])
    
    # calculate stats per possession
    df['HFGM_POS'] = df['HFGM'] / df['HPOS']
    df['HFGA_POS'] = df['HFGA'] / df['HPOS']
    df['HFGM3_POS'] = df['HFGM3'] / df['HPOS']
    df['HFGA3_POS'] = df['HFGA3'] / df['HPOS']
    df['HFTM_POS'] = df['HFTM'] / df['HPOS']
    df['HFTA_POS'] = df['HFTA'] / df['HPOS']
    df['HOR_POS'] = df['HOR'] / df['HPOS']
    df['HDR_POS'] = df['HDR'] / df['APOS'] # denominator is opponent possesions because these are a defensive stat
    df['HAST_POS'] = df['HAST'] / df['HPOS']
    df['HTO_POS'] = df['HTO'] / df['HPOS']
    df['HSTL_POS'] = df['HSTL'] / df['APOS'] # likewise
    df['HBLK_POS'] = df['HBLK'] / df['APOS'] 
    df['HPF_POS'] = df['HPF'] / (df['HPOS'] + df['APOS']) # denom is all possessions
    df['HPTS_POS'] = df['HScore'] / df['HPOS']
    
    df['AFGM_POS'] = df['AFGM'] / df['APOS']
    df['AFGA_POS'] = df['AFGA'] / df['APOS']
    df['AFGM3_POS'] = df['AFGM3'] / df['APOS']
    df['AFGA3_POS'] = df['AFGA3'] / df['APOS']
    df['AFTM_POS'] = df['AFTM'] / df['APOS']
    df['AFTA_POS'] = df['AFTA'] / df['APOS']
    df['AOR_POS'] = df['AOR'] / df['APOS']
    df['ADR_POS'] = df['ADR'] / df['HPOS']
    df['AAST_POS'] = df['AAST'] / df['APOS']
    df['ATO_POS'] = df['ATO'] / df['APOS']
    df['ASTL_POS'] = df['ASTL'] / df['HPOS']
    df['ABLK_POS'] = df['ABLK'] / df['HPOS']
    df['APF_POS'] = df['APF'] / (df['HPOS'] + df['APOS'])
    df['APTS_POS'] = df['AScore'] / df['APOS']
    
    # calculate "tempo", an estimate of possession per 40 minutes
    
    df['HTEMPO'] = 40 * df['HPOS'] / (40 + df['NumOT'] * 5)
    df['ATEMPO'] = 40 * df['APOS'] / (40 + df['NumOT'] * 5)
    
    home_df = df[['HTeamID','ATeamID','HFGM_POS','HFGA_POS','HFGM3_POS','HFGA3_POS','HFTM_POS','HFTA_POS','HOR_POS',
                  'HDR_POS','HAST_POS','HTO_POS','HSTL_POS','HBLK_POS','HPF_POS','HPTS_POS','HTEMPO','NeutralFlg']]
    home_df.columns = ['TeamID','OppID','FGM_POS','FGA_POS','FGM3_POS','FGA3_POS','FTM_POS','FTA_POS','OR_POS',
                       'DR_POS','AST_POS','TO_POS','STL_POS','BLK_POS','PF_POS','PTS_POS','TEMPO','NeutralFlg']
    home_df['Loc'] = ''
    home_df.loc[home_df['NeutralFlg'],'Loc'] = 'N'
    home_df.loc[~home_df['NeutralFlg'],'Loc'] = 'H'
    
    away_df = df[['ATeamID','HTeamID','AFGM_POS','AFGA_POS','AFGM3_POS','AFGA3_POS','AFTM_POS','AFTA_POS','AOR_POS',
                  'ADR_POS','AAST_POS','ATO_POS','ASTL_POS','ABLK_POS','APF_POS','APTS_POS','ATEMPO','NeutralFlg']]
    away_df.columns = ['TeamID','OppID','FGM_POS','FGA_POS','FGM3_POS','FGA3_POS','FTM_POS','FTA_POS','OR_POS',
                       'DR_POS','AST_POS','TO_POS','STL_POS','BLK_POS','PF_POS','PTS_POS','TEMPO','NeutralFlg']
    away_df['Loc'] = ''
    away_df.loc[home_df['NeutralFlg'],'Loc'] = 'N'
    away_df.loc[~home_df['NeutralFlg'],'Loc'] = 'A'
    
    main_df = pd.concat([home_df,away_df])
    
    if raw:
        main_df = main_df.sort_values(by=['TeamID'])
        made = main_df.groupby('TeamID')[['FGM_POS','FGA_POS','FGM3_POS','FGA3_POS','FTM_POS','FTA_POS','OR_POS','DR_POS',
                                          'AST_POS','TO_POS','STL_POS','BLK_POS','PF_POS','PTS_POS','TEMPO']].mean()
        main_df = main_df.sort_values(by=['OppID'])
        allowed = main_df.groupby('OppID')[['FGM_POS','FGA_POS','FGM3_POS','FGA3_POS','FTM_POS','FTA_POS','OR_POS',
                                            'DR_POS','AST_POS','TO_POS','STL_POS','BLK_POS','PF_POS','PTS_POS','TEMPO']].mean()
        
        made.columns = ['FGM_POSMadeRAW','FGA_POSMadeRAW','FGM3_POSMadeRAW','FGA3_POSMadeRAW','FTM_POSMadeRAW','FTA_POSMadeRAW',
                        'OR_POSMadeRAW','DR_POSMadeRAW','AST_POSMadeRAW','TO_POSMadeRAW','STL_POSMadeRAW','BLK_POSMadeRAW',
                        'PF_POSMadeRAW','PTS_POSMadeRAW','TEMPOMadeRAW']
        
        allowed.columns = ['FGM_POSAllowedRAW','FGA_POSAllowedRAW','FGM3_POSAllowedRAW','FGA3_POSAllowedRAW',
                           'FTM_POSAllowedRAW','FTA_POSAllowedRAW','OR_POSAllowedRAW','DR_POSAllowedRAW','AST_POSAllowedRAW',
                           'TO_POSAllowedRAW','STL_POSAllowedRAW','BLK_POSAllowedRAW','PF_POSAllowedRAW','PTS_POSAllowedRAW',
                           'TEMPOAllowedRAW']
        
        
        
        made = pd.DataFrame(made)
        allowed = pd.DataFrame(allowed)
        
        main_df = pd.concat([made, allowed], axis = 1)
        return main_df
        
        
    with localconverter(ro.default_converter + pandas2ri.converter):
        df_r = ro.conversion.py2rpy(main_df)
    
    # next, we run a lmer on each column to adjust for HFA and opponent to estimate how frequently a team does something 
    # per possession
    
    terms = ['FGM_POS','FGA_POS','FGM3_POS','FGA3_POS','FTM_POS','FTA_POS','OR_POS','DR_POS','AST_POS','TO_POS','STL_POS',
             'BLK_POS','PF_POS','PTS_POS','TEMPO']
    df_list = []
    for term in terms:
        # print('Calculating ratings for ' + term + '...')
        df_pd = gen_custom_rating(df_r, term)
        
        with localconverter(ro.default_converter + pandas2ri.converter):
            rtg_df = ro.conversion.rpy2py(df_pd)
        
        df_list.append(rtg_df)
        
    indexes = df_list[0].TeamID
    df_list = [df.drop(columns = ['TeamID']) for df in df_list]
    main_df = pd.concat(df_list,axis = 1)
    main_df.index = list(indexes)
    return(main_df)

Alright, almost there. We’ll make two quick functions for feature engineering. One of them will simply return the Massey ratings for each team immediately prior to the tournament for a given dataframe, the other will return the tournament seed for each team (if applicable).

def grab_massey(dataframe, massey):
    ms = massey
    df = dataframe
    
    year = str(df.iloc[0].Season)
    
    ms = ms.loc[ms['Season'].astype('str') == year]
    ms = ms.loc[ms['RankingDayNum'] == 133]
    ms = ms[['SystemName','TeamID','OrdinalRank']]
    
    ms = ms.pivot(index='TeamID',columns='SystemName')
    ms.columns=[y for y in ms.columns.get_level_values(1)]
    ms['AVG'] = ms.mean(axis=1)
    
    for col in ms.columns: # impute missing values as the mean of massey rankings
        ms[col].fillna(ms.AVG, inplace=True)
        
    del ms['AVG']
    ms.index.name = None
    
    return (ms)

mass = pd.read_csv('data/MMasseyOrdinals.csv')  
mass['TeamID'] = mass['TeamID'].astype('str')
grab_massey(testing_df,mass).head()

7OT ADE AP BBT BIH BLS BOB BUR CJB CNG ... TPR TRP TW UPS USA WIL WLK WMR WOB WOL
1101 343.0 297.0 338.163636 347.0 346.0 346.0 340.0 342.0 332.0 346.0 ... 348.0 339.0 329.0 338.163636 338.163636 342.0 330.0 346.0 323.0 345.0
1102 290.0 241.0 229.196721 266.0 227.0 231.0 220.0 207.0 232.0 225.0 ... 234.0 240.0 129.0 236.000000 229.196721 193.0 228.0 209.0 206.0 233.0
1103 103.0 89.0 113.885246 114.0 104.0 142.0 123.0 128.0 113.0 138.0 ... 136.0 129.0 124.0 93.000000 113.885246 99.0 126.0 132.0 94.0 93.0
1104 82.0 129.0 106.967213 111.0 123.0 75.0 92.0 82.0 110.0 75.0 ... 79.0 72.0 178.0 124.000000 106.967213 104.0 84.0 74.0 116.0 129.0
1105 288.0 262.0 292.344262 286.0 288.0 314.0 305.0 306.0 300.0 309.0 ... 314.0 325.0 285.0 259.000000 292.344262 285.0 308.0 319.0 279.0 275.0

5 rows × 65 columns

def grab_seeding(dataframe, seeding):
    sd = seeding
    df = dataframe
    
    year = str(df.iloc[0].Season)
    sd = sd.loc[sd['Season'].astype('str') == year]
    
    teams = list(set(df['HTeamID'].append(df['ATeamID'])))
    teams = pd.DataFrame({'TeamID':teams})
    
    main_df = pd.merge(teams,sd,how='outer')
    main_df['SeedNum'] = main_df['Seed'].str[1:3]
    main_df['Seed'].fillna('NS',inplace=True)
    main_df['SeedNum'].fillna('NS',inplace=True)
    main_df.index = list(main_df['TeamID'])
    del main_df['TeamID']
    del main_df['Season']
    return main_df
    
seed = pd.read_csv('data/MNCAATourneySeeds.csv')  
seed['TeamID'] = seed['TeamID'].astype('str')
grab_seeding(testing_df,seed).head()

Seed SeedNum
1106 NS NS
1219 NS NS
1285 NS NS
1350 NS NS
1249 NS NS

Alright, we’re done with all of our feature engineering. Kinda exhausting? Next step is to wrap all this into an enourmous function to generate a big-ass dataframe for generating ratings for every team, first for teams from 1985-2014, second from 2003-2014.

# score data

def gen_score_data(dataframe, seeding):
    df = dataframe
    
    elo = calc_elo(df,hfa_rescore=True,ot_rescore=True)
    lrmc = calc_lrmc(df)
    srs = calc_srs(df)
    rpi = calc_rpi(df)
    ts = calc_ts(df)
    lme = calc_lme(df)
    col = calc_colley(df)
    wp = calc_wp(df)
    mov = calc_mov(df)
    sd = grab_seeding(df, seeding)
    
    return pd.concat([elo, lrmc, srs, rpi, ts, lme, col, wp, mov, sd],axis=1)

gen_score_data(compact_results.loc[compact_results['Season'] == 1990], seeding = seed).head() # why 1990? no reason...

Elo LRMC SRS RPI TS Off Def Colley WP MOV Seed SeedNum
1106 1457.690655 114.959698 -6.468152 0.477352 23.586893 13.981185 16.435633 0.429118 0.520000 -3.809030 NS NS
1108 1389.483739 93.172849 -11.313684 0.457892 20.018400 1.030306 7.373383 0.339644 0.360000 -8.870126 NS NS
1425 1491.137635 131.578084 -0.651833 0.471407 26.394144 1.260576 1.200736 0.498420 0.428571 -0.500000 NS NS
1238 1413.525492 114.288471 -6.379905 0.440363 19.715176 -1.448090 1.866536 0.246491 0.280000 -3.879578 NS NS
1162 1353.674199 72.754592 -14.088478 0.386421 14.703133 -7.259959 4.629205 0.060706 0.076923 -15.385427 NS NS
def gen_adv_data(dataframe, det_dataframe, seeding, massey):
    df = dataframe
    det_df = det_dataframe
    
    elo = calc_elo(df,hfa_rescore=True,ot_rescore=True)
    lrmc = calc_lrmc(df)
    srs = calc_srs(df)
    rpi = calc_rpi(df)
    ts = calc_ts(df)
    lme = calc_lme(df)
    col = calc_colley(df)
    wp = calc_wp(df)
    mov = calc_mov(df)
    sd = grab_seeding(df, seeding)
    
    ms = grab_massey(df, massey)
    ppos = calc_box_score_rtgs(det_df)
    ppos_raw = calc_box_score_rtgs(det_df,raw = True)
    
    return pd.concat([elo, lrmc, srs, rpi, ts, lme, col, wp, mov, sd, ms, ppos, ppos_raw],axis=1)

Next step is to generate these ratings for every season and store them in an accessible manner.

basic_ratings_dict = {}
for year in compact_results['Season'].unique():
    print('Generating basic ratings for ' + str(year) + '...')
    temp_df = compact_results.loc[compact_results['Season'] == int(year)]
    basic_ratings_dict[str(year)] = gen_score_data(temp_df, seed)

# basic_ratings_dict['2014'].head()
basic_list = []
for key in basic_ratings_dict.keys():
    tmp_df = basic_ratings_dict[key]
    tmp_df['Season'] = int(key)
    basic_list.append(tmp_df)
    
basic_ratings_df = pd.concat(basic_list)
basic_ratings_df.head()
Generating basic ratings for 1985...
Generating basic ratings for 1986...
Generating basic ratings for 1987...
Generating basic ratings for 1988...
Generating basic ratings for 1989...
Generating basic ratings for 1990...
Generating basic ratings for 1991...
Generating basic ratings for 1992...
Generating basic ratings for 1993...
Generating basic ratings for 1994...
Generating basic ratings for 1995...
Generating basic ratings for 1996...
Generating basic ratings for 1997...
Generating basic ratings for 1998...
Generating basic ratings for 1999...
Generating basic ratings for 2000...
Generating basic ratings for 2001...
Generating basic ratings for 2002...
Generating basic ratings for 2003...
Generating basic ratings for 2004...
Generating basic ratings for 2005...
Generating basic ratings for 2006...
Generating basic ratings for 2007...
Generating basic ratings for 2008...
Generating basic ratings for 2009...
Generating basic ratings for 2010...
Generating basic ratings for 2011...
Generating basic ratings for 2012...
Generating basic ratings for 2013...
Generating basic ratings for 2014...
Generating basic ratings for 2015...
Generating basic ratings for 2016...
Generating basic ratings for 2017...
Generating basic ratings for 2018...
Generating basic ratings for 2019...
Generating basic ratings for 2020...
Generating basic ratings for 2021...

Elo LRMC SRS RPI TS Off Def Colley WP MOV Seed SeedNum Season
1106 1461.053111 107.258502 -2.698899 0.459318 19.541427 0.952114 2.104203 0.348392 0.500000 0.264010 NS NS 1985
1108 1573.024506 199.131764 6.038746 0.533009 25.704259 7.593771 3.356364 0.564220 0.680000 4.066550 NS NS 1985
1425 1566.036711 172.630828 4.867537 0.541256 30.437145 -0.582718 -3.477101 0.728202 0.642857 2.047567 Y08 08 1985
1238 1429.471832 103.695027 -6.388395 0.437554 18.680670 -2.368737 1.600165 0.256282 0.333333 -6.041667 NS NS 1985
1162 1494.682611 104.933737 -2.299174 0.455248 23.805288 -6.828934 -5.287778 0.448240 0.520000 0.146725 NS NS 1985
adv_ratings_list = []
for year in detailed_results['Season'].unique():
    print('Generating advanced ratings for ' + str(year) + '...')
    temp_df = compact_results.loc[compact_results['Season'] == int(year)]
    temp_det_df = detailed_results.loc[detailed_results['Season'] == int(year)]
    temp_adv_data = gen_adv_data(temp_df, temp_det_df, seed, mass)
    temp_adv_data['Season'] = year
    adv_ratings_list.append(temp_adv_data)
Generating advanced ratings for 2003...
Generating advanced ratings for 2004...
Generating advanced ratings for 2005...
Generating advanced ratings for 2006...
Generating advanced ratings for 2007...
Generating advanced ratings for 2008...
Generating advanced ratings for 2009...
Generating advanced ratings for 2010...
Generating advanced ratings for 2011...
Generating advanced ratings for 2012...
Generating advanced ratings for 2013...
Generating advanced ratings for 2014...
Generating advanced ratings for 2015...
Generating advanced ratings for 2016...
Generating advanced ratings for 2017...
Generating advanced ratings for 2018...
Generating advanced ratings for 2019...
Generating advanced ratings for 2020...
Generating advanced ratings for 2021...
tmp_list = []
for d in adv_ratings_list:
    d = d.loc[:,~d.columns.duplicated()]
    tmp_list.append(d)
adv_ratings_df = pd.concat(tmp_list)

We’re all the way wrapped on feature engineering, we can down assemble our training and testing dataframes.

tournament_games = pd.read_csv('data/MNCAATourneyCompactResults.csv')  
tournament_games['WTeamID'] = tournament_games['WTeamID'].astype('str')
tournament_games['LTeamID'] = tournament_games['LTeamID'].astype('str')
tournament_games['WLoc'] = 'N'
tournament_games = refactor_games_compact(tournament_games)

tournament_games.head()

Season DayNum HTeamID HScore ATeamID AScore NumOT NeutralFlg
0 1985 136 1116 63 1234 54 0 True
1 1985 136 1120 59 1345 58 0 True
2 1985 136 1207 68 1250 43 0 True
3 1985 136 1229 58 1425 55 0 True
4 1985 136 1242 49 1325 38 0 True

One trick I like to do is flip the data and double it, so the model doesn’t get biased towards predicting winners or losers.

tournament_games_flipped = tournament_games.copy()

tournament_games_flipped['HTeamID'] = tournament_games['ATeamID']
tournament_games_flipped['ATeamID'] = tournament_games['HTeamID']
tournament_games_flipped['HScore'] = tournament_games['AScore']
tournament_games_flipped['AScore'] = tournament_games['HScore']

tournament_games = pd.concat([tournament_games, tournament_games_flipped])

tournament_games['HDiff'] = tournament_games['HScore'] - tournament_games['AScore']
tournament_games.tail()

Season DayNum HTeamID HScore ATeamID AScore NumOT NeutralFlg HDiff
2246 2019 146 1246 71 1120 77 1 True -6
2247 2019 146 1181 67 1277 68 0 True -1
2248 2019 152 1277 51 1403 61 0 True -10
2249 2019 152 1120 62 1438 63 0 True -1
2250 2019 154 1403 77 1438 85 1 True -8
h_adv_ratings = adv_ratings_df.copy()
a_adv_ratings = adv_ratings_df.copy()

h_adv_ratings['HTeamID'] = h_adv_ratings.index
a_adv_ratings['ATeamID'] = a_adv_ratings.index

tournament_merged = pd.merge(tournament_games, h_adv_ratings)
tournament_merged = pd.merge(tournament_merged, a_adv_ratings, on = ['ATeamID','Season'], suffixes = ['_H','_A'])
tournament_merged.head()

Season DayNum HTeamID HScore ATeamID AScore NumOT NeutralFlg HDiff Elo_H ... MGS_A NET_A PIR_A BNZ_A CBR_A CWL_A EMK_A FDM_A KCX_A RWP_A
0 2003 134 1421 92 1411 84 1 True 8 1443.840515 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2003 137 1421 61 1400 82 0 True -21 1443.840515 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2003 144 1163 78 1400 82 0 True -4 1615.681074 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2003 146 1277 76 1400 85 0 True -9 1621.855151 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2003 139 1345 67 1400 77 0 True -10 1575.549719 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 493 columns

All of our data prep work is done, I’m gonna swap over to R for some tidymodels work.

tournament_merged.to_csv('data/adv_data.csv')

h_basic_ratings = basic_ratings_df.copy()
a_basic_ratings = basic_ratings_df.copy()

h_basic_ratings['HTeamID'] = h_basic_ratings.index
a_basic_ratings['ATeamID'] = a_basic_ratings.index

tournament_merged = pd.merge(tournament_games, h_basic_ratings)
tournament_merged = pd.merge(tournament_merged, a_basic_ratings, on = ['ATeamID','Season'], suffixes = ['_H','_A'])
tournament_merged.head()

tournament_merged.to_csv('data/basic_data.csv')

Actually, before we go, I’m going to prepare our stage 1 and stage 2 datasets.

stage1 = pd.read_csv('MSampleSubmissionStage1.csv')
stage2 = pd.read_csv('MSampleSubmissionStage2.csv')

stage1['Season'] = stage1['ID'].str.slice(start=0,stop=4)
stage1['HTeamID'] = stage1['ID'].str.slice(start=5,stop=9)
stage1['ATeamID'] = stage1['ID'].str.slice(start=10,stop=14)
stage1['Season'] = stage1['Season'].astype('int')

stage1_tmp = stage1.copy()
stage1_tmp['HTeamID'] = stage1['ATeamID']
stage1_tmp['ATeamID'] = stage1['HTeamID']
stage1_merged = pd.merge(stage1, h_adv_ratings)
stage1_merged = pd.merge(stage1_merged, a_adv_ratings, on = ['ATeamID','Season'], suffixes = ['_H','_A'])
stage1_merged.to_csv('data/stage1_2021.csv')

stage2['Season'] = stage2['ID'].str.slice(start=0,stop=4)
stage2['HTeamID'] = stage2['ID'].str.slice(start=5,stop=9)
stage2['ATeamID'] = stage2['ID'].str.slice(start=10,stop=14)
stage2['Season'] = stage2['Season'].astype('int')

stage2_tmp = stage1.copy()
stage2_tmp['HTeamID'] = stage2['ATeamID']
stage2_tmp['ATeamID'] = stage2['HTeamID']

stage2_merged = pd.merge(stage2, h_adv_ratings)
stage2_merged = pd.merge(stage2_merged, a_adv_ratings, on = ['ATeamID','Season'], suffixes = ['_H','_A'])
stage2_merged.to_csv('data/stage2_2021.csv')
%load_ext rpy2.ipython
C:\Users\edwar\anaconda3\lib\site-packages\rpy2\robjects\packages.py:366: UserWarning: The symbol 'quartz' is not in this R namespace/package.
  warnings.warn(
# initializations
library(tidyverse)
library(tidymodels)
library(recipeselectors)
library(MLmetrics)
adv_data <- read.csv("C:/Users/edwar/Documents/GitHub/kaggle-mens-march-madness-2021/data/adv_data.csv", header=TRUE) %>%
  mutate(HWin = HDiff > 0,
         SeedDiff = as.numeric(SeedNum_H) - as.numeric(SeedNum_A)) %>%
  select(-c(X, Seed_H, Seed_A, SeedNum_A, SeedNum_H))
R[write to console]: 
Attaching package: 'MLmetrics'


R[write to console]: The following object is masked from 'package:base':

    Recall

What follows is summarizing the result of some experiemnting in RStudio. I’m just keeping things together in the same notebook, this wasn’t my exact thought process but is a more refined means of working through things.

First step is to grab train/test data. We’ll hold out the 2019 tournament for evaluation.

pre_2015 = adv_data %>%
  filter(Season <= 2018) %>%
  select(-c(Season, DayNum, HTeamID, HScore, ATeamID, AScore, NumOT, NeutralFlg))

post_2015 = adv_data %>%
  filter(Season == 2019) %>%
  select(-c(Season, DayNum, HTeamID, HScore, ATeamID, AScore, NumOT, NeutralFlg))

Next step, we’ll see which columns have a large number of NAs, and avoid imputing these in our preprocessing.

colSums(is.na(pre_2015))
               HDiff                Elo_H               LRMC_H 
                   0                    0                    0 
               SRS_H                RPI_H                 TS_H 
                   0                    0                    0 
               Off_H                Def_H             Colley_H 
                   0                    0                    0 
                WP_H                MOV_H                 AP_H 
                   0                    0                    0 
               ARG_H                BIH_H                BOB_H 
                1310                  128                  268 
               BRZ_H                COL_H                DOL_H 
                1968                    0                    0 
               DUN_H                DWH_H                ECK_H 
                 524                 1456                 1456 
               ENT_H                ERD_H                GRN_H 
                1712                 1968                 1072 
               GRS_H                HER_H                HOL_H 
                1706                 1712                 1968 
               IMS_H                MAS_H                MKV_H 
                1840                  512                 1712 
               MOR_H                POM_H                RTH_H 
                   0                    0                    0 
               SAG_H                SAU_H                 SE_H 
                   0                 1712                  658 
               SEL_H                STR_H                TSR_H 
                 256                 1712                 1456 
               USA_H                WLK_H                WOB_H 
                   0                    0                  128 
               WOL_H                WTE_H        FGM_POSMade_H 
                   0                 1968                    0 
    FGM_POSAllowed_H        FGA_POSMade_H     FGA_POSAllowed_H 
                   0                    0                    0 
      FGM3_POSMade_H    FGM3_POSAllowed_H       FGA3_POSMade_H 
                   0                    0                    0 
   FGA3_POSAllowed_H        FTM_POSMade_H     FTM_POSAllowed_H 
                   0                    0                    0 
       FTA_POSMade_H     FTA_POSAllowed_H         OR_POSMade_H 
                   0                    0                    0 
     OR_POSAllowed_H         DR_POSMade_H      DR_POSAllowed_H 
                   0                    0                    0 
       AST_POSMade_H     AST_POSAllowed_H         TO_POSMade_H 
                   0                    0                    0 
     TO_POSAllowed_H        STL_POSMade_H     STL_POSAllowed_H 
                   0                    0                    0 
       BLK_POSMade_H     BLK_POSAllowed_H         PF_POSMade_H 
                   0                    0                    0 
     PF_POSAllowed_H        PTS_POSMade_H     PTS_POSAllowed_H 
                   0                    0                    0 
         TEMPOMade_H       TEMPOAllowed_H     FGM_POSMadeRAW_H 
                   0                    0                    0 
    FGA_POSMadeRAW_H    FGM3_POSMadeRAW_H    FGA3_POSMadeRAW_H 
                   0                    0                    0 
    FTM_POSMadeRAW_H     FTA_POSMadeRAW_H      OR_POSMadeRAW_H 
                   0                    0                    0 
     DR_POSMadeRAW_H     AST_POSMadeRAW_H      TO_POSMadeRAW_H 
                   0                    0                    0 
    STL_POSMadeRAW_H     BLK_POSMadeRAW_H      PF_POSMadeRAW_H 
                   0                    0                    0 
    PTS_POSMadeRAW_H       TEMPOMadeRAW_H  FGM_POSAllowedRAW_H 
                   0                    0                    0 
 FGA_POSAllowedRAW_H FGM3_POSAllowedRAW_H FGA3_POSAllowedRAW_H 
                   0                    0                    0 
 FTM_POSAllowedRAW_H  FTA_POSAllowedRAW_H   OR_POSAllowedRAW_H 
                   0                    0                    0 
  DR_POSAllowedRAW_H  AST_POSAllowedRAW_H   TO_POSAllowedRAW_H 
                   0                    0                    0 
 STL_POSAllowedRAW_H  BLK_POSAllowedRAW_H   PF_POSAllowedRAW_H 
                   0                    0                    0 
 PTS_POSAllowedRAW_H    TEMPOAllowedRAW_H                 BD_H 
                   0                    0                 1840 
               CNG_H                DES_H                JON_H 
                 128                  384                 1968 
               LYN_H                MGY_H                NOR_H 
                 932                 1968                 1968 
               REI_H                 RM_H                SIM_H 
                1968                 1968                 1328 
               ACU_H                BCM_H                CMV_H 
                1444                 1968                 1968 
                DC_H                KLK_H                REN_H 
                 390                 1456                 1712 
               RIS_H                ROH_H                SAP_H 
                1968                 1712                 1712 
               SCR_H                WIL_H                DOK_H 
                1584                  384                  384 
               JCI_H                KPK_H                 MB_H 
                1840                  768                 1054 
                PH_H                PIG_H                PKL_H 
                1968                  384                 1840 
               TRX_H                CPR_H                ISR_H 
                1712                  780                 1584 
               KRA_H                LYD_H                RTR_H 
                 640                 1840                  780 
               UCS_H                BKM_H                CPA_H 
                1968                 1968                  908 
               JEN_H                PGH_H                REW_H 
                1444                  640                  640 
               RSE_H                SPW_H                STH_H 
                1968                  640                  640 
               BPI_H                DC2_H                DCI_H 
                1438                 1438                  768 
               HKB_H                LMC_H                NOL_H 
                1304                  768                  902 
               OMY_H                RTB_H                KEL_H 
                1566                 1304                 1968 
               KMV_H                 RT_H                 TW_H 
                1834                 1030                 1432 
               AUS_H                KOS_H                PEQ_H 
                1962                 1694                 1962 
               PTS_H                ROG_H                RTP_H 
                1694                 1828                 1024 
               TMR_H               X7OT_H                ADE_H 
                1426                 1158                 1426 
               BBT_H                BNM_H                BUR_H 
                1158                 1962                 1292 
               CJB_H                CRO_H                EBP_H 
                1694                 1292                 1158 
               HAT_H                MSX_H                SFX_H 
                1962                 1426                 1158 
               TBD_H                BLS_H                D1A_H 
                1962                 1426                 1560 
               DII_H                KBM_H                TPR_H 
                1292                 1694                 1292 
               MvG_H                PPR_H                 SP_H 
                1828                 1962                 1426 
               SPR_H                STF_H                STS_H 
                1426                 1828                 1828 
               TRP_H                UPS_H                WMR_H 
                1426                 1962                 1962 
               BWE_H                LOG_H                TRK_H 
                1560                 1694                 1560 
               DAV_H                FAS_H                FSH_H 
                1694                 1694                 1694 
               HAS_H                HRN_H                KPI_H 
                1694                 1962                 1694 
               MCL_H                CRW_H                DDB_H 
                1828                 1962                 1828 
               ESR_H                FMG_H                HKS_H 
                1828                 1962                 1962 
               INP_H                JNG_H                JRT_H 
                1962                 1828                 1962 
               MUZ_H                OCT_H                PMC_H 
                1962                 1962                 1962 
               PRR_H                RSL_H                SGR_H 
                1828                 1962                 1828 
               SMN_H                SMS_H                YAG_H 
                1828                 1828                 1828 
               ZAM_H                BNT_H                COX_H 
                1828                 1962                 1962 
               JJK_H                LAB_H                MMG_H 
                1962                 1962                 1962 
               STM_H                WMV_H                AWS_H 
                1962                 1962                 2096 
               INC_H                LAW_H                LEF_H 
                2096                 2096                 2096 
               MGS_H                NET_H                PIR_H 
                2096                 2096                 2096 
               BNZ_H                CBR_H                CWL_H 
                2096                 2096                 2096 
               EMK_H                FDM_H                KCX_H 
                2096                 2096                 2096 
               RWP_H                Elo_A               LRMC_A 
                2096                    0                    0 
               SRS_A                RPI_A                 TS_A 
                   0                    0                    0 
               Off_A                Def_A             Colley_A 
                   0                    0                    0 
                WP_A                MOV_A                 AP_A 
                   0                    0                    0 
               ARG_A                BIH_A                BOB_A 
                1310                  128                  268 
               BRZ_A                COL_A                DOL_A 
                1968                    0                    0 
               DUN_A                DWH_A                ECK_A 
                 524                 1456                 1456 
               ENT_A                ERD_A                GRN_A 
                1712                 1968                 1072 
               GRS_A                HER_A                HOL_A 
                1706                 1712                 1968 
               IMS_A                MAS_A                MKV_A 
                1840                  512                 1712 
               MOR_A                POM_A                RTH_A 
                   0                    0                    0 
               SAG_A                SAU_A                 SE_A 
                   0                 1712                  658 
               SEL_A                STR_A                TSR_A 
                 256                 1712                 1456 
               USA_A                WLK_A                WOB_A 
                   0                    0                  128 
               WOL_A                WTE_A        FGM_POSMade_A 
                   0                 1968                    0 
    FGM_POSAllowed_A        FGA_POSMade_A     FGA_POSAllowed_A 
                   0                    0                    0 
      FGM3_POSMade_A    FGM3_POSAllowed_A       FGA3_POSMade_A 
                   0                    0                    0 
   FGA3_POSAllowed_A        FTM_POSMade_A     FTM_POSAllowed_A 
                   0                    0                    0 
       FTA_POSMade_A     FTA_POSAllowed_A         OR_POSMade_A 
                   0                    0                    0 
     OR_POSAllowed_A         DR_POSMade_A      DR_POSAllowed_A 
                   0                    0                    0 
       AST_POSMade_A     AST_POSAllowed_A         TO_POSMade_A 
                   0                    0                    0 
     TO_POSAllowed_A        STL_POSMade_A     STL_POSAllowed_A 
                   0                    0                    0 
       BLK_POSMade_A     BLK_POSAllowed_A         PF_POSMade_A 
                   0                    0                    0 
     PF_POSAllowed_A        PTS_POSMade_A     PTS_POSAllowed_A 
                   0                    0                    0 
         TEMPOMade_A       TEMPOAllowed_A     FGM_POSMadeRAW_A 
                   0                    0                    0 
    FGA_POSMadeRAW_A    FGM3_POSMadeRAW_A    FGA3_POSMadeRAW_A 
                   0                    0                    0 
    FTM_POSMadeRAW_A     FTA_POSMadeRAW_A      OR_POSMadeRAW_A 
                   0                    0                    0 
     DR_POSMadeRAW_A     AST_POSMadeRAW_A      TO_POSMadeRAW_A 
                   0                    0                    0 
    STL_POSMadeRAW_A     BLK_POSMadeRAW_A      PF_POSMadeRAW_A 
                   0                    0                    0 
    PTS_POSMadeRAW_A       TEMPOMadeRAW_A  FGM_POSAllowedRAW_A 
                   0                    0                    0 
 FGA_POSAllowedRAW_A FGM3_POSAllowedRAW_A FGA3_POSAllowedRAW_A 
                   0                    0                    0 
 FTM_POSAllowedRAW_A  FTA_POSAllowedRAW_A   OR_POSAllowedRAW_A 
                   0                    0                    0 
  DR_POSAllowedRAW_A  AST_POSAllowedRAW_A   TO_POSAllowedRAW_A 
                   0                    0                    0 
 STL_POSAllowedRAW_A  BLK_POSAllowedRAW_A   PF_POSAllowedRAW_A 
                   0                    0                    0 
 PTS_POSAllowedRAW_A    TEMPOAllowedRAW_A                 BD_A 
                   0                    0                 1840 
               CNG_A                DES_A                JON_A 
                 128                  384                 1968 
               LYN_A                MGY_A                NOR_A 
                 932                 1968                 1968 
               REI_A                 RM_A                SIM_A 
                1968                 1968                 1328 
               ACU_A                BCM_A                CMV_A 
                1444                 1968                 1968 
                DC_A                KLK_A                REN_A 
                 390                 1456                 1712 
               RIS_A                ROH_A                SAP_A 
                1968                 1712                 1712 
               SCR_A                WIL_A                DOK_A 
                1584                  384                  384 
               JCI_A                KPK_A                 MB_A 
                1840                  768                 1054 
                PH_A                PIG_A                PKL_A 
                1968                  384                 1840 
               TRX_A                CPR_A                ISR_A 
                1712                  780                 1584 
               KRA_A                LYD_A                RTR_A 
                 640                 1840                  780 
               UCS_A                BKM_A                CPA_A 
                1968                 1968                  908 
               JEN_A                PGH_A                REW_A 
                1444                  640                  640 
               RSE_A                SPW_A                STH_A 
                1968                  640                  640 
               BPI_A                DC2_A                DCI_A 
                1438                 1438                  768 
               HKB_A                LMC_A                NOL_A 
                1304                  768                  902 
               OMY_A                RTB_A                KEL_A 
                1566                 1304                 1968 
               KMV_A                 RT_A                 TW_A 
                1834                 1030                 1432 
               AUS_A                KOS_A                PEQ_A 
                1962                 1694                 1962 
               PTS_A                ROG_A                RTP_A 
                1694                 1828                 1024 
               TMR_A               X7OT_A                ADE_A 
                1426                 1158                 1426 
               BBT_A                BNM_A                BUR_A 
                1158                 1962                 1292 
               CJB_A                CRO_A                EBP_A 
                1694                 1292                 1158 
               HAT_A                MSX_A                SFX_A 
                1962                 1426                 1158 
               TBD_A                BLS_A                D1A_A 
                1962                 1426                 1560 
               DII_A                KBM_A                TPR_A 
                1292                 1694                 1292 
               MvG_A                PPR_A                 SP_A 
                1828                 1962                 1426 
               SPR_A                STF_A                STS_A 
                1426                 1828                 1828 
               TRP_A                UPS_A                WMR_A 
                1426                 1962                 1962 
               BWE_A                LOG_A                TRK_A 
                1560                 1694                 1560 
               DAV_A                FAS_A                FSH_A 
                1694                 1694                 1694 
               HAS_A                HRN_A                KPI_A 
                1694                 1962                 1694 
               MCL_A                CRW_A                DDB_A 
                1828                 1962                 1828 
               ESR_A                FMG_A                HKS_A 
                1828                 1962                 1962 
               INP_A                JNG_A                JRT_A 
                1962                 1828                 1962 
               MUZ_A                OCT_A                PMC_A 
                1962                 1962                 1962 
               PRR_A                RSL_A                SGR_A 
                1828                 1962                 1828 
               SMN_A                SMS_A                YAG_A 
                1828                 1828                 1828 
               ZAM_A                BNT_A                COX_A 
                1828                 1962                 1962 
               JJK_A                LAB_A                MMG_A 
                1962                 1962                 1962 
               STM_A                WMV_A                AWS_A 
                1962                 1962                 2096 
               INC_A                LAW_A                LEF_A 
                2096                 2096                 2096 
               MGS_A                NET_A                PIR_A 
                2096                 2096                 2096 
               BNZ_A                CBR_A                CWL_A 
                2096                 2096                 2096 
               EMK_A                FDM_A                KCX_A 
                2096                 2096                 2096 
               RWP_A                 HWin             SeedDiff 
                2096                    0                    0 

After filting columns with large numbers of missing values, We’ll quickly implement a preprocessing recipe, imputing missing values using knn and then running a boruta feature selection algorithm.

res = c('HDiff','HWin','SeedDiff')

metrics = c('Elo','LRMC','SRS','RPI','TS','Off','Def','Colley','WP','MOV',
            'AP','BOB','BIH','COL','DOL','DUN','MOR','POM','RTH','SAG','SE','SEL',
            'USA','WLK','WOB','WOL','FGM_POSMade','FGM_POSAllowed','FGA_POSMade',
            'FGA_POSAllowed','FGM3_POSMade','FGM3_POSAllowed','FGA3_POSMade',
            'FGA3_POSAllowed','FTM_POSMade','FTM_POSAllowed','FTA_POSMade',
            'FTA_POSAllowed','OR_POSMade','OR_POSAllowed','DR_POSMade',
            'DR_POSAllowed','AST_POSMade','AST_POSAllowed','TO_POSMade','TO_POSAllowed',
            'STL_POSMade','STL_POSAllowed','BLK_POSMade','BLK_POSAllowed',
            'PF_POSMade','PF_POSAllowed','PTS_POSMade','PTS_POSAllowed',
            'TEMPOMade','TEMPOAllowed','FGM_POSMadeRAW','FGA_POSMadeRAW',
            'FGM3_POSMadeRAW','FGA3_POSMadeRAW','FTM_POSMadeRAW','FTA_POSMadeRAW',
            'OR_POSMadeRAW','DR_POSMadeRAW','AST_POSMadeRAW','TO_POSMadeRAW',
            'STL_POSMadeRAW','BLK_POSMadeRAW','PF_POSMadeRAW','PTS_POSMadeRAW',
            'TEMPOMadeRAW','FGM_POSAllowedRAW','FGA_POSAllowedRAW',
            'FGM3_POSAllowedRAW','DR_POSAllowedRAW','AST_POSAllowedRAW',
            'TO_POSAllowedRAW','STL_POSAllowedRAW','BLK_POSAllowedRAW',
            'PF_POSAllowedRAW','PTS_POSAllowedRAW','TEMPOAllowedRAW',
            'CNG','DC')

cols = c(res, paste0(metrics,'_A'), paste0(metrics,'_H'))

boruta_df = pre_2015[colnames(pre_2015) %in% cols]

outright_boruta = recipe(HWin ~ ., data = boruta_df %>% select(-c(HDiff))) %>%
  step_knnimpute(all_predictors(), neighbors = 3) %>%
  step_select_boruta(all_predictors(), outcome = "HWin")

prepped = prep(outright_boruta)

prepped[['steps']][[2]][['res']][1]$finalDecision[which(prepped[['steps']][[2]][['res']][1]$finalDecision == 'Confirmed')]
              Elo_H              LRMC_H               SRS_H               RPI_H 
          Confirmed           Confirmed           Confirmed           Confirmed 
               TS_H            Colley_H                AP_H               BIH_H 
          Confirmed           Confirmed           Confirmed           Confirmed 
              BOB_H               COL_H               DOL_H               DUN_H 
          Confirmed           Confirmed           Confirmed           Confirmed 
              MOR_H               POM_H               RTH_H               SAG_H 
          Confirmed           Confirmed           Confirmed           Confirmed 
               SE_H               SEL_H               USA_H               WLK_H 
          Confirmed           Confirmed           Confirmed           Confirmed 
              WOB_H               WOL_H       FGM_POSMade_H       BLK_POSMade_H 
          Confirmed           Confirmed           Confirmed           Confirmed 
      PTS_POSMade_H    PTS_POSAllowed_H    BLK_POSMadeRAW_H    PTS_POSMadeRAW_H 
          Confirmed           Confirmed           Confirmed           Confirmed 
              CNG_H                DC_H               Elo_A              LRMC_A 
          Confirmed           Confirmed           Confirmed           Confirmed 
              SRS_A               RPI_A                TS_A            Colley_A 
          Confirmed           Confirmed           Confirmed           Confirmed 
               AP_A               BIH_A               BOB_A               COL_A 
          Confirmed           Confirmed           Confirmed           Confirmed 
              DOL_A               DUN_A               MOR_A               POM_A 
          Confirmed           Confirmed           Confirmed           Confirmed 
              RTH_A               SAG_A                SE_A               SEL_A 
          Confirmed           Confirmed           Confirmed           Confirmed 
              USA_A               WLK_A               WOB_A               WOL_A 
          Confirmed           Confirmed           Confirmed           Confirmed 
      FGM_POSMade_A       BLK_POSMade_A       PTS_POSMade_A    PTS_POSAllowed_A 
          Confirmed           Confirmed           Confirmed           Confirmed 
   BLK_POSMadeRAW_A    PTS_POSMadeRAW_A PTS_POSAllowedRAW_A               CNG_A 
          Confirmed           Confirmed           Confirmed           Confirmed 
               DC_A            SeedDiff 
          Confirmed           Confirmed 
Levels: Tentative Confirmed Rejected

With our features selected, we can filter our unused columns and grab a cross-validated dataset to train on.

res = c('HDiff','HWin','SeedDiff')

metrics = c('Elo','LRMC','SRS','RPI','TS','Colley',
            'AP','BIH','COL','DOL','MOR','POM','SAG','SEL',
            'USA','WLK','WOB','PTS_POSMade','PTS_POSAllowed','PTS_POSMadeRAW')

cols = c(res, paste0(metrics,'_A'), paste0(metrics,'_H'))

data = pre_2015[colnames(pre_2015) %in% cols]

clf_impute = recipe(HWin ~ ., data = boruta_df %>% select(-c(HDiff))) %>%
  step_knnimpute(all_predictors(), neighbors = 3)

prepped = prep(clf_impute)

data = juice(prepped)
data = data[colnames(data) %in% cols]
data$HDiff = pre_2015$HDiff

set.seed(1986)

df_train  = data
df_test  = post_2015

cv_folds = vfold_cv(df_train %>% mutate(HWin = as.factor(HWin)) %>% select(-c(HDiff)), strata = HWin)

We’ll tune our xgboost algorithm here. In the interest of not repeating myself, I’ve copied the code over and it should be able to be run, but it will take a long time to run, and thus I’ve commented it out.

xgb = boost_tree(
  tree_depth = tune(),
  min_n = tune(),
  trees = tune(),
  loss_reduction = tune(),
  sample_size = tune(),
  mtry = tune(),
  learn_rate = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

xgb_params = parameters(
  tree_depth(),
  min_n(),
  trees(range = c(10, 2000)),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), df_train),
  learn_rate()
)

xgb_wrkflw <- workflow() %>%
  add_formula(HWin ~ .) %>%
  add_model(xgb)

doParallel::registerDoParallel()

xgb_tune <- tune_bayes(
  xgb_wrkflw ,
  resamples = cv_folds,
  param_info = xgb_params,
  iter = 1000,
  metrics = metric_set(
    mn_log_loss
  ),
  initial = 15,
  control = control_bayes(
    no_improve = 100,
    uncertain = 10,
    save_pred = F,
    save_workflow = F,
    time_limit = 600,
    verbose = T
  )
)
 
xgb_best <- select_best(xgb_tune, "mn_log_loss")
print(xgb_best)
NULL

We have our tuned xgboost model. We’ll train it on our train dataset, test it on our test dataset, and then apply predictions on the 2019 tournament.

xgb = boost_tree(mtry=2,
                  trees=401,
                  min_n=9,
                  tree_depth=2,
                  learn_rate=0.012,
                  loss_reduction=0.42,
                  sample_size = 0.443) %>% 
  set_engine("xgboost") %>% 
  set_mode("classification") %>% 
  fit(HWin ~ ., data = df_train %>% mutate(HWin = as.factor(HWin)) %>% select(-c(HDiff)))

saveRDS(xgb,'xgb_outright.RDS')
[20:10:51] WARNING: amalgamation/../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
xgb = readRDS('xgb_outright.RDS')
xprb = predict(xgb, df_test, type = "prob")
print(LogLoss(xprb$.pred_TRUE,df_test$HWin))
[1] 0.5008113

We get a pretty solid log loss for the 2019 tournament – this would have us at about middle of the pack. There are a few tricks we can do to boost our performance, which I have not implemented here but are arbitrary to do so:

  • Cap predictions at 0.95/0.05 – in the event of (inevitable) upsets where a team with a high probability of winning loses, we are better insulated from those upsets than teams that do not cap their predictions, but the gain from correctly predicting a .99 win versus a .95 win is not that substantial.
  • Generate two sets of predictions and randomly flip close games around while artificially boosting confidence – Kaggle allows you to submit two sets of predictions for stage 2, not just one. We can exploit this by taking games we consider to be coinflips (games where win probabilites fall between 45%-55%) and randomly pick a winner, and assign an artifically high confidence (such as 70%/30% or, if you want to be really, extreme, 100%/0%). If we want to do well enough to win this competition, we can boost our score with this random confidence. If we fail however, yikes! (Spoiler alert, it worked)
  • Adjust predictions for injuries – One thing many Kaggle teams will not do but something that is extremely valuable to do is to monitor the injury status of key players on each team. Missing a star player can result in as much as a 3 point spread, something that affects a team’s win probability significantly. Using vegas spreads, I converted my win probability estimates to point spreads, used approximate value to estimate the value of missing players on the spread, adjusted those spreads, and then converted my spreads back to win probability estimates.
Reply to this post on Bluesky to leave a comment