2021 March Madness Kaggle Solution
Our approach is to ensemble the shit out of everything. We will hold out the 2015-2019 games for validation purposes. We will prepare and optimize two sets of models: one, an ensemble of general team strength features trained on 1985-2014 games; and two, an ensemble of general team strength features plus adjusted ratings based on box-score data, trained on 2003-2014 games. Let's prepare the first approach.
import sys
!{sys.executable} -m pip install pandas sklearn numpy rpy2 trueskill catboost hyperopt ray
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
import math
import trueskill as ts
import warnings
warnings.filterwarnings("ignore") # because pandas indexing gets mad at me
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter
import rpy2.robjects.packages as rpackages
from rpy2.robjects.vectors import StrVector
compact_results = pd.read_csv('data/MRegularSeasonCompactResults.csv')
compact_results['WTeamID'] = compact_results['WTeamID'].astype('str')
compact_results['LTeamID'] = compact_results['LTeamID'].astype('str')
compact_results.head()
Season | DayNum | WTeamID | WScore | LTeamID | LScore | WLoc | NumOT | |
---|---|---|---|---|---|---|---|---|
0 | 1985 | 20 | 1228 | 81 | 1328 | 64 | N | 0 |
1 | 1985 | 25 | 1106 | 77 | 1354 | 70 | H | 0 |
2 | 1985 | 25 | 1112 | 63 | 1223 | 56 | H | 0 |
3 | 1985 | 25 | 1165 | 70 | 1432 | 54 | H | 0 |
4 | 1985 | 25 | 1192 | 86 | 1447 | 74 | H | 0 |
I am not a big fan of this format, so I will quickly write a function to re-orient our dataframe around home/away teams rather than winning/losing teams.
def refactor_games_compact(dataframe):
df = dataframe
neutral_winners = df.loc[df.WLoc == 'N']
home_winners = df.loc[df.WLoc == 'H']
away_winners = df.loc[df.WLoc == 'A']
neutral_winners['NeutralFlg'] = True
neutral_winners.columns = ['Season','DayNum','HTeamID','HScore','ATeamID','AScore','WLoc','NumOT','NeutralFlg']
home_winners['NeutralFlg'] = False
home_winners.columns = ['Season','DayNum','HTeamID','HScore','ATeamID','AScore','WLoc','NumOT','NeutralFlg']
away_winners['NeutralFlg'] = False
away_winners.columns = ['Season','DayNum','ATeamID','AScore','HTeamID','HScore','WLoc','NumOT','NeutralFlg']
df = neutral_winners.append(home_winners.append(away_winners))
df = df.drop(columns=['WLoc'])
df = df.sort_values(by=['Season','DayNum'])
return df
compact_results = refactor_games_compact(compact_results)
compact_results.head()
Season | DayNum | HTeamID | HScore | ATeamID | AScore | NumOT | NeutralFlg | |
---|---|---|---|---|---|---|---|---|
0 | 1985 | 20 | 1228 | 81 | 1328 | 64 | 0 | True |
6 | 1985 | 25 | 1228 | 64 | 1226 | 44 | 0 | True |
7 | 1985 | 25 | 1242 | 58 | 1268 | 56 | 0 | True |
11 | 1985 | 25 | 1344 | 75 | 1438 | 71 | 0 | True |
13 | 1985 | 25 | 1412 | 70 | 1397 | 65 | 0 | True |
Our first step will be to write a function to rescore games based on home court advantage. To estimate home court advantage, we will use the LRMC methodology.
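As a rough sketch of the idea (my paraphrase of what the functions below do, not the LRMC paper's exact notation): for every home-and-home pairing, we regress an indicator for whether the home team of the first game also won the return game on the road against its home margin $x$, giving

$$r_x = \frac{e^{ax+b}}{1+e^{ax+b}}.$$

The home margin at which the road rematch becomes a coin flip is $x_0 = -b/a$; since an evenly matched team should win by roughly the home edge at home and lose by it on the road, $x_0$ is about twice the home advantage, which is why calc_hfa returns $h = \tfrac{1}{2}(-b/a)$.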
def calc_hfa(dataframe):
df = dataframe
home_games = df.loc[~df['NeutralFlg']]
X_df = home_games
Y_df = home_games
X_df['HDiff'] = X_df['HScore'] - X_df['AScore']
X_df.loc[X_df['NumOT'] > 0,'HDiff'] = 0 # any overtime games are treated as ties
X_df = X_df[['HTeamID','ATeamID','HDiff']]
Y_df['HWin'] = (Y_df['AScore'] > Y_df['HScore']) * 1
Y_df = Y_df[['HTeamID','ATeamID','HWin']]
log_df = pd.merge(X_df, Y_df, left_on=['HTeamID','ATeamID'],right_on=['ATeamID','HTeamID'])
X = log_df['HDiff']
Y = log_df['HWin']
X = np.array(X).reshape(-1,1)
logreg = LogisticRegression(random_state=0).fit(X, Y)
a = float(logreg.coef_[0])
b = float(logreg.intercept_)
h = (-b / a) / 2
return {'a':a,'b':b,'h':h}
def rescore_hfa(dataframe):
df = dataframe
hfa_mod = calc_hfa(df)
h_2 = hfa_mod['h'] / 2
df['HScore'] = df['HScore'] - h_2 * df['NeutralFlg']
df['AScore'] = df['AScore'] + h_2 * df['NeutralFlg']
return df
testing_df = compact_results.loc[compact_results['Season'] == 2014]
rescore_hfa(testing_df).head()
Season | DayNum | HTeamID | HScore | ATeamID | AScore | NumOT | NeutralFlg | |
---|---|---|---|---|---|---|---|---|
129204 | 2014 | 4 | 1102 | 75.734604 | 1119 | 71.265396 | 0 | True |
129211 | 2014 | 4 | 1124 | 68.734604 | 1160 | 63.265396 | 0 | True |
129220 | 2014 | 4 | 1163 | 74.734604 | 1268 | 80.265396 | 0 | True |
129223 | 2014 | 4 | 1184 | 79.734604 | 1198 | 64.265396 | 0 | True |
129240 | 2014 | 4 | 1258 | 74.734604 | 1213 | 78.265396 | 0 | True |
We can call on this function when working with team strength methodologies that are not adjusted to account for home field advantage. Additionally, this should aid us in scoring 2021 teams, when home field advantage is not as concrete as it has been in previous years.
We will also write another function to adjust the scoring of overtime games to a per-40-minute basis; this will come in handy when estimating team strength from points-based metrics.
def rescore_ot(dataframe):
df = dataframe
df.HScore = df.HScore * (40 + df.NumOT * 5) / 40
df.AScore = df.AScore * (40 + df.NumOT * 5) / 40
return df
rescore_ot(testing_df.loc[testing_df['NumOT'] > 0]).head()
Season | DayNum | HTeamID | HScore | ATeamID | AScore | NumOT | NeutralFlg | |
---|---|---|---|---|---|---|---|---|
129265 | 2014 | 4 | 1344 | 92.25000 | 1130 | 87.75000 | 1 | False |
129227 | 2014 | 4 | 1414 | 109.12500 | 1201 | 110.25000 | 1 | False |
129251 | 2014 | 4 | 1330 | 75.37500 | 1283 | 88.87500 | 1 | False |
129276 | 2014 | 4 | 1274 | 69.75000 | 1383 | 74.25000 | 1 | False |
129344 | 2014 | 5 | 1464 | 79.57643 | 1198 | 84.67357 | 1 | True |
We will compute the following metrics to estimate team strength:
- Elo
- LRMC
- SRS
- RPI
- TrueSkill
- Mixed Model Ratings with LME4
- Colley Matrix method
- Win Percentage
- Average Margin of Victory
First is Elo. Normally we would tune the Elo parameters, but that process can be somewhat intensive, especially with a dataset as large as this one – hence, I will be using a preset collection of parameters to estimate team strength. Note that Elo can normally adjust for home court advantage with a constant rating bump; that option still exists in this function, but passing in either rescoring argument overrides it.
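For reference, the arithmetic the function below implements (with the default elo_width = 400 and mov_factor = 8) is

$$E_{home} = \frac{1}{1 + 10^{(R_{away} - R'_{home})/400}}, \qquad R \leftarrow R + K\,(S - E), \qquad K = 8\,\ln(|\text{margin}| + 1),$$

where $R'_{home} = R_{home} + 135$ for true home games when neither rescoring option is used, and $S$ is 1 for a win and 0 for a loss.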
def calc_elo(dataframe, avg_elo = 1500, elo_width = 400, mov_factor = 8, hfa = 135, hfa_rescore = False, ot_rescore = False):
df = dataframe
if ot_rescore:
df = rescore_ot(df)
if hfa_rescore:
df = rescore_hfa(df)
df = df.sort_values(by=['Season','DayNum'])
teams = list(set(df['HTeamID'].append(df['ATeamID'])))
elos = {team:avg_elo for team in teams}
for index, row in df.iterrows():
home_elo = elos[row['HTeamID']]
away_elo = elos[row['ATeamID']]
home_elo_adj = home_elo
if not row['NeutralFlg'] and not hfa_rescore and not ot_rescore:
home_elo_adj += hfa
# calculate expected score
home_prob = 1.0/(1+10**((away_elo - home_elo_adj)/elo_width))
away_prob = 1 - home_prob
# calculate k factor as a function of the log of the margin of victory
home_diff = row['HScore'] - row['AScore']
k_factor = math.log(abs(home_diff)+1) * mov_factor
home_win = (row['HScore'] > row['AScore']) * 1
away_win = (row['AScore'] > row['HScore']) * 1
# update elos
elos[row['HTeamID']] = home_elo + k_factor * (home_win - home_prob)
elos[row['ATeamID']] = away_elo + k_factor * (away_win - away_prob)
elo_df = pd.DataFrame(elos,index=[0]).transpose()
elo_df.columns = ['Elo']
return elo_df
calc_elo(testing_df,hfa_rescore=True,ot_rescore=True).head()
Elo | |
---|---|
1106 | 1500.823216 |
1219 | 1505.567477 |
1285 | 1517.728113 |
1350 | 1526.039480 |
1249 | 1336.231722 |
Next up is LRMC. We're already halfway there thanks to our HFA-adjustment code: we will simply take our logistic model and apply it to all of the games in our dataframe (because adjusting for HFA is integral to LRMC, it is not an optional argument).
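A quick summary of the Markov chain the function below builds (my shorthand, not the paper's): for each game with location-adjusted home margin $x$, the logistic model gives $r_x$, roughly the probability that the home team is the better team. The home team's row of the transition matrix keeps weight $r_x$ and sends $1 - r_x$ to its opponent; the away team's row keeps $1 - r_x$ and sends $r_x$ back. Rows are averaged over each team's games,

$$p_{ij} = \frac{1}{N_i}\Big(\sum_{i\ \text{hosts}\ j} (1 - r_x) + \sum_{i\ \text{visits}\ j} r_x\Big),$$

and the chain is iterated to (approximately) its steady state via matrix_power; the stationary weight each team ends up with serves as its LRMC rating.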
def calc_lrmc(dataframe, ot_rescore=True):
df = dataframe
if ot_rescore:
df = rescore_ot(df)
hfa_mod = calc_hfa(df)
h = hfa_mod['h']
a = hfa_mod['a']
b = hfa_mod['b']
teams = list(set(df['HTeamID'].append(df['ATeamID'])))
n_teams = len(teams)
p = np.zeros((n_teams,n_teams))
n_games = np.zeros(n_teams)
for index, row in df.iterrows():
home_team_ndx = teams.index(row['HTeamID'])
away_team_ndx = teams.index(row['ATeamID'])
# calculate r_x
spread = row['HScore'] - row['AScore'] + h * (row['NeutralFlg'])
r_x = math.exp(a * spread + b) / (1 + math.exp(a * spread + b))
# update respective matrices
n_games[home_team_ndx] += 1
n_games[away_team_ndx] += 1
p[home_team_ndx, away_team_ndx] += 1 - r_x
p[away_team_ndx, home_team_ndx] += r_x
p[home_team_ndx, home_team_ndx] += r_x
p[away_team_ndx, away_team_ndx] += 1 - r_x
# solve matrix
p = p / n_games[:,None]
prior = n_teams - np.array(list(range(n_teams)))
steady_state = np.linalg.matrix_power(p, 1000)
rating = prior.dot(steady_state)
return pd.DataFrame({'LRMC':rating}, index = teams)
calc_lrmc(testing_df).head()
LRMC | |
---|---|
1106 | 132.928218 |
1219 | 137.471394 |
1285 | 157.845504 |
1350 | 203.865228 |
1249 | 99.113128 |
SRS on deck.
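For anyone unfamiliar with the acronym, SRS (the Simple Rating System) defines each team's rating as its average scoring margin plus the average rating of its opponents,

$$R_i = M_i + \frac{1}{n_i}\sum_{j \in \text{opponents}(i)} R_j,$$

which rearranges into a linear system (ones on the diagonal, $-1/n_i$ for each opponent, average margins on the right-hand side) – essentially what the function below builds and hands to np.linalg.solve.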
def calc_srs(dataframe, hfa_rescore=True, ot_rescore=True):
df = dataframe
if hfa_rescore:
df = rescore_hfa(df)
if ot_rescore:
df = rescore_ot(df)
df['HSpread'] = df['HScore'] - df['AScore']
df['ASpread'] = -1 * df['HSpread']
teams = pd.concat([
df[['HTeamID','HSpread','ATeamID']].rename(columns={'HTeamID':'TeamID','HSpread':'Spread','ATeamID':'OppID'}),
df[['ATeamID','ASpread','HTeamID']].rename(columns={'ATeamID':'TeamID','ASpread':'Spread','HTeamID':'OppID'})
])
spreads = teams.groupby('TeamID').Spread.mean()
terms = []
solutions = []
for team in spreads.keys():
row = []
opps = list(teams[teams['TeamID'] == team]['OppID'])
for opp in spreads.keys():
if opp == team:
row.append(1)
elif opp in opps:
row.append(-1.0 / len(opps))
else:
row.append(0)
terms.append(row)
solutions.append(spreads[team])
solutions = np.linalg.solve(np.array(terms), np.array(solutions))
return pd.DataFrame({'SRS':solutions},index = list(spreads.keys()))
calc_srs(testing_df).head()
SRS | |
---|---|
1101 | -18.642546 |
1102 | -6.101457 |
1103 | 2.412745 |
1104 | 6.819273 |
1105 | -5.714953 |
RPI's usefulness is highly debatable; there's a very good reason why the NCAA no longer uses it for tournament selection and instead relies on NET ratings. Still, we're not too picky, and we can always throw it out when we reach the feature selection stage of the process. We'll adjust for OT but not for HFA when calculating RPI (RPI likes to think it already incorporates HFA, even though it really doesn't, but we'll humor it for the time being).
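The formula itself, as implemented below, is the familiar NCAA weighting

$$\text{RPI} = 0.25\,\text{WP}^{*} + 0.5\,\text{OWP} + 0.25\,\text{OOWP},$$

where $\text{WP}^{*}$ is winning percentage with home wins weighted 0.6, road wins 1.4, home losses 1.4, and road losses 0.6 (that location weighting is the full extent of RPI's "HFA adjustment"), OWP is the opponents' average winning percentage, and OOWP is the opponents' opponents' average winning percentage.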
def calc_rpi(dataframe, ot_rescore = True):
df = dataframe
if ot_rescore:
df = rescore_ot(df)
df['HWin'] = df['HScore'].ge(df['AScore']) * 1
df['AWin'] = 1 - df['HWin']
df['HLoss'] = 1 - df['HWin']
df['ALoss'] = df['HWin']
teams = pd.concat([
df[['HTeamID','ATeamID','HWin','HLoss']].rename(columns={'HTeamID':'TeamID','ATeamID':'OppID','HWin':'HWin',
'HLoss':'HLoss'}),
df[['ATeamID','HTeamID','AWin','ALoss']].rename(columns={'ATeamID':'TeamID','HTeamID':'OppID','AWin':'AWin',
'ALoss':'ALoss'})
])
teams = teams.fillna(0)
rpi = teams.groupby('TeamID')[['HWin','HLoss','AWin','ALoss']].sum()
rpi['W'] = rpi['HWin'] + rpi['AWin']
rpi['L'] = rpi['HLoss'] + rpi['ALoss']
rpi['WP'] = rpi['W'] / (rpi['W'] + rpi['L'])
rpi['AdjWP'] = (rpi['HWin'] * 0.6 + rpi['AWin'] * 1.4) / (rpi['HWin'] * 0.6 + rpi['AWin'] * 1.4 + rpi['HLoss'] * 1.4 + rpi['ALoss'] * 0.6)
for index, row in rpi.iterrows():
temp_teams = teams.loc[teams['TeamID'] == index]
rpi.loc[index,'OWP'] = float(pd.merge(temp_teams, rpi, left_on = 'OppID', right_on = 'TeamID')['WP'].mean())
for index, row in rpi.iterrows():
temp_teams = teams.loc[teams['TeamID'] == index]
rpi.loc[index,'OOWP'] = float(pd.merge(temp_teams, rpi, left_on = 'OppID', right_on = 'TeamID')['OWP'].mean())
rpi['RPI'] = 0.25 * rpi['AdjWP'] + 0.5 * rpi['OWP'] + 0.25 * rpi['OOWP']
return pd.DataFrame({'RPI':rpi['RPI']},index = list(rpi.index))
calc_rpi(testing_df).head()
RPI | |
---|---|
1101 | 0.380277 |
1102 | 0.465483 |
1103 | 0.538925 |
1104 | 0.561932 |
1105 | 0.443962 |
Next is TrueSkill, which might actually be more valuable than Elo given the differing numbers of games teams play. We'll just record the mu value of each team's TrueSkill rating.
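As a quick illustration of the trueskill API we're leaning on (a minimal standalone sketch, not part of the pipeline; the variable names and the approximate output are mine):

import trueskill as ts
# every team starts at the library default: mu = 25.0, sigma = 25/3
home_rating, away_rating = ts.Rating(), ts.Rating()
# rate_1vs1 returns updated ratings, with the winner passed first
home_rating, away_rating = ts.rate_1vs1(home_rating, away_rating)
print(home_rating.mu, away_rating.mu)  # roughly 29.4 and 20.6

That default starting mu of 25 is why the ratings in the table below cluster in the 20s.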
def calc_ts(dataframe, hfa_rescore = True, ot_rescore = True):
df = dataframe
if hfa_rescore:
df = rescore_hfa(df)
if ot_rescore:
df = rescore_ot(df)
teams = list(set(df['HTeamID'].append(df['ATeamID'])))
ts_dict = {}
for team in teams:
ts_dict[team] = {'TS':ts.Rating()}
for index, row in df.iterrows():
home_rating = ts_dict[row['HTeamID']]['TS']
away_rating = ts_dict[row['ATeamID']]['TS']
if row['HScore'] > row['AScore']:
home_rating, away_rating = ts.rate_1vs1(home_rating, away_rating)
elif row['AScore'] > row['HScore']:
away_rating, home_rating = ts.rate_1vs1(away_rating, home_rating)
else:
home_rating, away_rating = ts.rate_1vs1(home_rating, away_rating, drawn = True)
ts_dict[row['HTeamID']]['TS'] = home_rating
ts_dict[row['ATeamID']]['TS'] = away_rating
ts_list = [ts_dict[team]['TS'].mu for team in teams]
return pd.DataFrame({'TS':ts_list},index = teams)
calc_ts(testing_df).head()
TS | |
---|---|
1106 | 21.073280 |
1219 | 21.556950 |
1285 | 24.166877 |
1350 | 28.791182 |
1249 | 12.829917 |
Next up, we're generating mixed-model ratings for each team. To do this, we'll temporarily borrow R's lme4 package (because Python really doesn't have a comparably good mixed-model package).
utils = rpackages.importr('utils')
base = rpackages.importr('base')
utils.chooseCRANmirror(ind=1) # select the first mirror in the list
packnames = ('lme4', 'Matrix')
from rpy2.robjects.vectors import StrVector
names_to_install = [x for x in packnames if not rpackages.isinstalled(x)]
if len(names_to_install) > 0:
utils.install_packages(StrVector(names_to_install))
ro.r('''
gen_ratings = function(df){
model <- lme4::lmer(Score ~ 0 + Loc + (1|Offense) + (1|Defense),
data = df)
offense_ratings = lme4::ranef(model)$Offense
offense_ratings$Team = row.names(lme4::ranef(model)$Offense)
colnames(offense_ratings) = c('Off','TeamID')
defense_ratings = lme4::ranef(model)$Defense
defense_ratings$Team = row.names(lme4::ranef(model)$Defense)
colnames(defense_ratings) = c('Def','TeamID')
rtg_df = merge(offense_ratings, defense_ratings,on='TeamID')
return(rtg_df)
}
''')
gen_ratings = ro.globalenv['gen_ratings']
def calc_lme(dataframe, hfa_rescore = True, ot_rescore = True):
df = dataframe
if hfa_rescore:
df = rescore_hfa(df)
if ot_rescore:
df = rescore_ot(df)
home_scoring = df[['HTeamID','ATeamID','HScore','NeutralFlg']].rename(columns={'HTeamID':'Offense','ATeamID':'Defense',
'HScore':'Score','NeutralFlg':'NeutralFlg'})
home_scoring['Loc'] = ''
home_scoring.loc[home_scoring['NeutralFlg'],'Loc'] = 'N'
home_scoring.loc[~home_scoring['NeutralFlg'],'Loc'] = 'H'
away_scoring = df[['ATeamID','HTeamID','AScore','NeutralFlg']].rename(columns={'ATeamID':'Offense','HTeamID':'Defense',
'AScore':'Score','NeutralFlg':'NeutralFlg'})
away_scoring['Loc'] = ''
away_scoring.loc[away_scoring['NeutralFlg'],'Loc'] = 'N'
away_scoring.loc[~away_scoring['NeutralFlg'],'Loc'] = 'A'
scoring = pd.concat([home_scoring,away_scoring])
with localconverter(ro.default_converter + pandas2ri.converter):
df_r = ro.conversion.py2rpy(scoring)
df_pd = gen_ratings(df_r)
with localconverter(ro.default_converter + pandas2ri.converter):
rtg_df = ro.conversion.rpy2py(df_pd)
return pd.DataFrame({'Off':list(rtg_df['Off']),'Def':list(rtg_df['Def'])},index=list(rtg_df['TeamID']))
calc_lme(testing_df).head()
Off | Def | |
---|---|---|
1101 | 0.037220 | 3.001805 |
1102 | -2.803390 | -0.991113 |
1103 | -0.359363 | -1.211838 |
1104 | 1.990535 | 0.245586 |
1105 | -2.146115 | -1.168408 |
Home stretch: Colley Matrix Method!
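Since the name alone doesn't explain much: the Colley method solves the linear system

$$C\,r = b, \qquad C_{ii} = 2 + n_i, \quad C_{ij} = -\,n_{ij}, \quad b_i = 1 + \tfrac{1}{2}(w_i - l_i),$$

where $n_i$ is team $i$'s number of games, $n_{ij}$ is the number of meetings between teams $i$ and $j$, and $w_i$ and $l_i$ are wins and losses. That is the matrix and right-hand side the function below accumulates game by game before calling np.linalg.solve.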
def calc_colley(dataframe, hfa_rescore = True, ot_rescore = True):
df = dataframe
if hfa_rescore:
df = rescore_hfa(df)
if ot_rescore:
df = rescore_ot(df)
teams = list(set(df['HTeamID'].append(df['ATeamID'])))
n_teams = len(teams)
c_mat = np.identity(n_teams) * 2
b_mat = np.zeros(n_teams)
for index, row in df.iterrows():
home_team_ndx = teams.index(row['HTeamID'])
away_team_ndx = teams.index(row['ATeamID'])
spread = row['HScore'] - row['AScore']
c_mat[home_team_ndx, home_team_ndx] += 1
c_mat[away_team_ndx, away_team_ndx] += 1
c_mat[home_team_ndx, away_team_ndx] -= 1
c_mat[away_team_ndx, home_team_ndx] -= 1
b_mat[home_team_ndx] += (row['HScore'] > row['AScore']) * 1 + (row['AScore'] > row['HScore']) * -1
b_mat[away_team_ndx] += (row['AScore'] > row['HScore']) * 1 + (row['HScore'] > row['AScore']) * -1
b_mat = b_mat / 2 + 1
ranks = np.linalg.solve(c_mat,b_mat).tolist()
return pd.DataFrame({'Colley':ranks},index = teams)
calc_colley(testing_df).head()
Colley | |
---|---|
1106 | 0.429206 |
1219 | 0.427629 |
1285 | 0.508115 |
1350 | 0.670517 |
1249 | 0.091091 |
Finally, we'll reuse a couple of earlier pieces of code to calculate win percentage and average margin of victory.
def calc_mov(dataframe, hfa_rescore=True, ot_rescore=True):
df = dataframe
if hfa_rescore:
df = rescore_hfa(df)
if ot_rescore:
df = rescore_ot(df)
df['HSpread'] = df['HScore'] - df['AScore']
df['ASpread'] = -1 * df['HSpread']
teams = pd.concat([
df[['HTeamID','HSpread','ATeamID']].rename(columns={'HTeamID':'TeamID','HSpread':'Spread','ATeamID':'OppID'}),
df[['ATeamID','ASpread','HTeamID']].rename(columns={'ATeamID':'TeamID','ASpread':'Spread','HTeamID':'OppID'})
])
spreads = teams.groupby('TeamID').Spread.mean()
return pd.DataFrame({'MOV':spreads},index = list(spreads.index))
calc_mov(testing_df).head()
MOV | |
---|---|
1101 | -17.664136 |
1102 | -6.344923 |
1103 | 4.572178 |
1104 | 11.177569 |
1105 | -3.956765 |
def calc_wp(dataframe, hfa_rescore=True, ot_rescore=True):
df = dataframe
if hfa_rescore:
df = rescore_hfa(df)
if ot_rescore:
df = rescore_ot(df)
df['HWin'] = df['HScore'].ge(df['AScore']) * 1
df['AWin'] = 1 - df['HWin']
df['HLoss'] = 1 - df['HWin']
df['ALoss'] = df['HWin']
teams = pd.concat([
df[['HTeamID','ATeamID','HWin','HLoss']].rename(columns={'HTeamID':'TeamID','ATeamID':'OppID','HWin':'HWin',
'HLoss':'HLoss'}),
df[['ATeamID','HTeamID','AWin','ALoss']].rename(columns={'ATeamID':'TeamID','HTeamID':'OppID','AWin':'AWin',
'ALoss':'ALoss'})
])
teams = teams.fillna(0)
wp = teams.groupby('TeamID')[['HWin','HLoss','AWin','ALoss']].sum()
wp['W'] = wp['HWin'] + wp['AWin']
wp['L'] = wp['HLoss'] + wp['ALoss']
wp['WP'] = wp['W'] / (wp['W'] + wp['L'])
return pd.DataFrame({'WP':wp['WP']},index = list(wp.index))
calc_wp(testing_df).head()
WP | |
---|---|
1101 | 0.095238 |
1102 | 0.357143 |
1103 | 0.666667 |
1104 | 0.516129 |
1105 | 0.428571 |
Alright, that's the bulk of the heavy lifting done for generating team ratings from schedule data alone. Next, we're going to estimate team factors from advanced box-score data, which is only available from 2003 on. We'll use lme4 for this as well.
First, we'll read in our data.
detailed_results = pd.read_csv('data/MRegularSeasonDetailedResults.csv')
detailed_results['WTeamID'] = detailed_results['WTeamID'].astype('str')
detailed_results['LTeamID'] = detailed_results['LTeamID'].astype('str')
detailed_results.head()
Season | DayNum | WTeamID | WScore | LTeamID | LScore | WLoc | NumOT | WFGM | WFGA | ... | LFGA3 | LFTM | LFTA | LOR | LDR | LAst | LTO | LStl | LBlk | LPF | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2003 | 10 | 1104 | 68 | 1328 | 62 | N | 0 | 27 | 58 | ... | 10 | 16 | 22 | 10 | 22 | 8 | 18 | 9 | 2 | 20 |
1 | 2003 | 10 | 1272 | 70 | 1393 | 63 | N | 0 | 26 | 62 | ... | 24 | 9 | 20 | 20 | 25 | 7 | 12 | 8 | 6 | 16 |
2 | 2003 | 11 | 1266 | 73 | 1437 | 61 | N | 0 | 24 | 58 | ... | 26 | 14 | 23 | 31 | 22 | 9 | 12 | 2 | 5 | 23 |
3 | 2003 | 11 | 1296 | 56 | 1457 | 50 | N | 0 | 18 | 38 | ... | 22 | 8 | 15 | 17 | 20 | 9 | 19 | 4 | 3 | 23 |
4 | 2003 | 11 | 1400 | 77 | 1208 | 71 | N | 0 | 30 | 61 | ... | 16 | 17 | 27 | 21 | 15 | 12 | 10 | 7 | 1 | 14 |
5 rows × 34 columns
We’ll first write a quick function converting these results to our preferred format.
def refactor_games_detailed(dataframe):
df = dataframe
neutral_winners = df.loc[df.WLoc == 'N']
home_winners = df.loc[df.WLoc == 'H']
away_winners = df.loc[df.WLoc == 'A']
neutral_winners['NeutralFlg'] = True
neutral_winners.columns = ['Season','DayNum','HTeamID','HScore','ATeamID','AScore','WLoc','NumOT',
'HFGM','HFGA','HFGM3','HFGA3','HFTM','HFTA','HOR','HDR','HAST','HTO','HSTL','HBLK','HPF',
'AFGM','AFGA','AFGM3','AFGA3','AFTM','AFTA','AOR','ADR','AAST','ATO','ASTL','ABLK','APF',
'NeutralFlg']
home_winners['NeutralFlg'] = False
home_winners.columns = ['Season','DayNum','HTeamID','HScore','ATeamID','AScore','WLoc','NumOT',
'HFGM','HFGA','HFGM3','HFGA3','HFTM','HFTA','HOR','HDR','HAST','HTO','HSTL','HBLK','HPF',
'AFGM','AFGA','AFGM3','AFGA3','AFTM','AFTA','AOR','ADR','AAST','ATO','ASTL','ABLK','APF',
'NeutralFlg']
away_winners['NeutralFlg'] = False
away_winners.columns = ['Season','DayNum','HTeamID','HScore','ATeamID','AScore','WLoc','NumOT',
'AFGM','AFGA','AFGM3','AFGA3','AFTM','AFTA','AOR','ADR','AAST','ATO','ASTL','ABLK','APF',
'HFGM','HFGA','HFGM3','HFGA3','HFTM','HFTA','HOR','HDR','HAST','HTO','HSTL','HBLK','HPF',
'NeutralFlg']
df = neutral_winners.append(home_winners.append(away_winners))
df = df.drop(columns=['WLoc'])
df = df.sort_values(by=['Season','DayNum'])
return df
detailed_results = refactor_games_detailed(detailed_results)
detailed_results.head()
Season | DayNum | HTeamID | HScore | ATeamID | AScore | NumOT | HFGM | HFGA | HFGM3 | ... | AFTM | AFTA | AOR | ADR | AAST | ATO | ASTL | ABLK | APF | NeutralFlg | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2003 | 10 | 1104 | 68 | 1328 | 62 | 0 | 27 | 58 | 3 | ... | 16 | 22 | 10 | 22 | 8 | 18 | 9 | 2 | 20 | True |
1 | 2003 | 10 | 1272 | 70 | 1393 | 63 | 0 | 26 | 62 | 8 | ... | 9 | 20 | 20 | 25 | 7 | 12 | 8 | 6 | 16 | True |
2 | 2003 | 11 | 1266 | 73 | 1437 | 61 | 0 | 24 | 58 | 8 | ... | 14 | 23 | 31 | 22 | 9 | 12 | 2 | 5 | 23 | True |
3 | 2003 | 11 | 1296 | 56 | 1457 | 50 | 0 | 18 | 38 | 3 | ... | 8 | 15 | 17 | 20 | 9 | 19 | 4 | 3 | 23 | True |
4 | 2003 | 11 | 1400 | 77 | 1208 | 71 | 0 | 30 | 61 | 6 | ... | 17 | 27 | 21 | 15 | 12 | 10 | 7 | 1 | 14 | True |
5 rows × 34 columns
Our next function will generate a dataframe with several rating columns of interest, adjusted for opponent and home field advantage, which should give us an estimate of each team’s overall strength in each box-score category.
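One note on the denominators before the wall of code: possessions are estimated with the standard box-score approximation

$$\text{POS} \approx \text{FGA} - \text{OR} + \text{TO} + 0.44\cdot\text{FTA},$$

and tempo is possessions rescaled to 40 minutes, $40\cdot\text{POS}/(40 + 5\cdot\text{NumOT})$. Defensive counting stats (defensive rebounds, steals, blocks) are divided by the opponent's possessions, and personal fouls by the combined possessions in the game.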
ro.r('''
gen_custom_rating = function(df, term){
formula = as.formula(paste0(term,' ~ 0 + Loc + (1|TeamID) + (1|OppID)'))
model <- lme4::lmer(formula,
data = df)
made_ratings = lme4::ranef(model)$TeamID
made_ratings$Team = row.names(lme4::ranef(model)$TeamID)
colnames(made_ratings) = c(paste0(term,'Made'),'TeamID')
allowed_ratings = lme4::ranef(model)$OppID
allowed_ratings$Team = row.names(lme4::ranef(model)$OppID)
colnames(allowed_ratings) = c(paste0(term,'Allowed'),'TeamID')
rtg_df = merge(made_ratings, allowed_ratings ,on='TeamID')
return(rtg_df)
}
''')
gen_custom_rating = ro.globalenv['gen_custom_rating']
def calc_box_score_rtgs(dataframe, raw = False):
df = dataframe
# calculate possessions as our denominator
df['HPOS'] = (df['HFGA'] - df['HOR']) + df['HTO'] + (0.44 * df['HFTA'])
df['APOS'] = (df['AFGA'] - df['AOR']) + df['ATO'] + (0.44 * df['AFTA'])
# calculate stats per possession
df['HFGM_POS'] = df['HFGM'] / df['HPOS']
df['HFGA_POS'] = df['HFGA'] / df['HPOS']
df['HFGM3_POS'] = df['HFGM3'] / df['HPOS']
df['HFGA3_POS'] = df['HFGA3'] / df['HPOS']
df['HFTM_POS'] = df['HFTM'] / df['HPOS']
df['HFTA_POS'] = df['HFTA'] / df['HPOS']
df['HOR_POS'] = df['HOR'] / df['HPOS']
df['HDR_POS'] = df['HDR'] / df['APOS'] # denominator is opponent possessions because this is a defensive stat
df['HAST_POS'] = df['HAST'] / df['HPOS']
df['HTO_POS'] = df['HTO'] / df['HPOS']
df['HSTL_POS'] = df['HSTL'] / df['APOS'] # likewise
df['HBLK_POS'] = df['HBLK'] / df['APOS']
df['HPF_POS'] = df['HPF'] / (df['HPOS'] + df['APOS']) # denom is all possessions
df['HPTS_POS'] = df['HScore'] / df['HPOS']
df['AFGM_POS'] = df['AFGM'] / df['APOS']
df['AFGA_POS'] = df['AFGA'] / df['APOS']
df['AFGM3_POS'] = df['AFGM3'] / df['APOS']
df['AFGA3_POS'] = df['AFGA3'] / df['APOS']
df['AFTM_POS'] = df['AFTM'] / df['APOS']
df['AFTA_POS'] = df['AFTA'] / df['APOS']
df['AOR_POS'] = df['AOR'] / df['APOS']
df['ADR_POS'] = df['ADR'] / df['HPOS']
df['AAST_POS'] = df['AAST'] / df['APOS']
df['ATO_POS'] = df['ATO'] / df['APOS']
df['ASTL_POS'] = df['ASTL'] / df['HPOS']
df['ABLK_POS'] = df['ABLK'] / df['HPOS']
df['APF_POS'] = df['APF'] / (df['HPOS'] + df['APOS'])
df['APTS_POS'] = df['AScore'] / df['APOS']
# calculate "tempo", an estimate of possessions per 40 minutes
df['HTEMPO'] = 40 * df['HPOS'] / (40 + df['NumOT'] * 5)
df['ATEMPO'] = 40 * df['APOS'] / (40 + df['NumOT'] * 5)
home_df = df[['HTeamID','ATeamID','HFGM_POS','HFGA_POS','HFGM3_POS','HFGA3_POS','HFTM_POS','HFTA_POS','HOR_POS',
'HDR_POS','HAST_POS','HTO_POS','HSTL_POS','HBLK_POS','HPF_POS','HPTS_POS','HTEMPO','NeutralFlg']]
home_df.columns = ['TeamID','OppID','FGM_POS','FGA_POS','FGM3_POS','FGA3_POS','FTM_POS','FTA_POS','OR_POS',
'DR_POS','AST_POS','TO_POS','STL_POS','BLK_POS','PF_POS','PTS_POS','TEMPO','NeutralFlg']
home_df['Loc'] = ''
home_df.loc[home_df['NeutralFlg'],'Loc'] = 'N'
home_df.loc[~home_df['NeutralFlg'],'Loc'] = 'H'
away_df = df[['ATeamID','HTeamID','AFGM_POS','AFGA_POS','AFGM3_POS','AFGA3_POS','AFTM_POS','AFTA_POS','AOR_POS',
'ADR_POS','AAST_POS','ATO_POS','ASTL_POS','ABLK_POS','APF_POS','APTS_POS','ATEMPO','NeutralFlg']]
away_df.columns = ['TeamID','OppID','FGM_POS','FGA_POS','FGM3_POS','FGA3_POS','FTM_POS','FTA_POS','OR_POS',
'DR_POS','AST_POS','TO_POS','STL_POS','BLK_POS','PF_POS','PTS_POS','TEMPO','NeutralFlg']
away_df['Loc'] = ''
away_df.loc[away_df['NeutralFlg'],'Loc'] = 'N'
away_df.loc[~away_df['NeutralFlg'],'Loc'] = 'A'
main_df = pd.concat([home_df,away_df])
if raw:
main_df = main_df.sort_values(by=['TeamID'])
made = main_df.groupby('TeamID')[['FGM_POS','FGA_POS','FGM3_POS','FGA3_POS','FTM_POS','FTA_POS','OR_POS','DR_POS',
'AST_POS','TO_POS','STL_POS','BLK_POS','PF_POS','PTS_POS','TEMPO']].mean()
main_df = main_df.sort_values(by=['OppID'])
allowed = main_df.groupby('OppID')[['FGM_POS','FGA_POS','FGM3_POS','FGA3_POS','FTM_POS','FTA_POS','OR_POS',
'DR_POS','AST_POS','TO_POS','STL_POS','BLK_POS','PF_POS','PTS_POS','TEMPO']].mean()
made.columns = ['FGM_POSMadeRAW','FGA_POSMadeRAW','FGM3_POSMadeRAW','FGA3_POSMadeRAW','FTM_POSMadeRAW','FTA_POSMadeRAW',
'OR_POSMadeRAW','DR_POSMadeRAW','AST_POSMadeRAW','TO_POSMadeRAW','STL_POSMadeRAW','BLK_POSMadeRAW',
'PF_POSMadeRAW','PTS_POSMadeRAW','TEMPOMadeRAW']
allowed.columns = ['FGM_POSAllowedRAW','FGA_POSAllowedRAW','FGM3_POSAllowedRAW','FGA3_POSAllowedRAW',
'FTM_POSAllowedRAW','FTA_POSAllowedRAW','OR_POSAllowedRAW','DR_POSAllowedRAW','AST_POSAllowedRAW',
'TO_POSAllowedRAW','STL_POSAllowedRAW','BLK_POSAllowedRAW','PF_POSAllowedRAW','PTS_POSAllowedRAW',
'TEMPOAllowedRAW']
made = pd.DataFrame(made)
allowed = pd.DataFrame(allowed)
main_df = pd.concat([made, allowed], axis = 1)
return main_df
with localconverter(ro.default_converter + pandas2ri.converter):
df_r = ro.conversion.py2rpy(main_df)
# next, we run a lmer on each column to adjust for HFA and opponent to estimate how frequently a team does something
# per possession
terms = ['FGM_POS','FGA_POS','FGM3_POS','FGA3_POS','FTM_POS','FTA_POS','OR_POS','DR_POS','AST_POS','TO_POS','STL_POS',
'BLK_POS','PF_POS','PTS_POS','TEMPO']
df_list = []
for term in terms:
# print('Calculating ratings for ' + term + '...')
df_pd = gen_custom_rating(df_r, term)
with localconverter(ro.default_converter + pandas2ri.converter):
rtg_df = ro.conversion.rpy2py(df_pd)
df_list.append(rtg_df)
indexes = df_list[0].TeamID
df_list = [df.drop(columns = ['TeamID']) for df in df_list]
main_df = pd.concat(df_list,axis = 1)
main_df.index = list(indexes)
return(main_df)
Alright, almost there. We'll make two quick functions for feature engineering. One of them will return the Massey ratings for each team immediately prior to the tournament for a given dataframe; the other will return the tournament seed for each team (if applicable).
def grab_massey(dataframe, massey):
ms = massey
df = dataframe
year = str(df.iloc[0].Season)
ms = ms.loc[ms['Season'].astype('str') == year]
ms = ms.loc[ms['RankingDayNum'] == 133]
ms = ms[['SystemName','TeamID','OrdinalRank']]
ms = ms.pivot(index='TeamID',columns='SystemName')
ms.columns=[y for y in ms.columns.get_level_values(1)]
ms['AVG'] = ms.mean(axis=1)
for col in ms.columns: # impute missing values as the mean of massey rankings
ms[col].fillna(ms.AVG, inplace=True)
del ms['AVG']
ms.index.name = None
return (ms)
mass = pd.read_csv('data/MMasseyOrdinals.csv')
mass['TeamID'] = mass['TeamID'].astype('str')
grab_massey(testing_df,mass).head()
7OT | ADE | AP | BBT | BIH | BLS | BOB | BUR | CJB | CNG | ... | TPR | TRP | TW | UPS | USA | WIL | WLK | WMR | WOB | WOL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1101 | 343.0 | 297.0 | 338.163636 | 347.0 | 346.0 | 346.0 | 340.0 | 342.0 | 332.0 | 346.0 | ... | 348.0 | 339.0 | 329.0 | 338.163636 | 338.163636 | 342.0 | 330.0 | 346.0 | 323.0 | 345.0 |
1102 | 290.0 | 241.0 | 229.196721 | 266.0 | 227.0 | 231.0 | 220.0 | 207.0 | 232.0 | 225.0 | ... | 234.0 | 240.0 | 129.0 | 236.000000 | 229.196721 | 193.0 | 228.0 | 209.0 | 206.0 | 233.0 |
1103 | 103.0 | 89.0 | 113.885246 | 114.0 | 104.0 | 142.0 | 123.0 | 128.0 | 113.0 | 138.0 | ... | 136.0 | 129.0 | 124.0 | 93.000000 | 113.885246 | 99.0 | 126.0 | 132.0 | 94.0 | 93.0 |
1104 | 82.0 | 129.0 | 106.967213 | 111.0 | 123.0 | 75.0 | 92.0 | 82.0 | 110.0 | 75.0 | ... | 79.0 | 72.0 | 178.0 | 124.000000 | 106.967213 | 104.0 | 84.0 | 74.0 | 116.0 | 129.0 |
1105 | 288.0 | 262.0 | 292.344262 | 286.0 | 288.0 | 314.0 | 305.0 | 306.0 | 300.0 | 309.0 | ... | 314.0 | 325.0 | 285.0 | 259.000000 | 292.344262 | 285.0 | 308.0 | 319.0 | 279.0 | 275.0 |
5 rows × 65 columns
def grab_seeding(dataframe, seeding):
sd = seeding
df = dataframe
year = str(df.iloc[0].Season)
sd = sd.loc[sd['Season'].astype('str') == year]
teams = list(set(df['HTeamID'].append(df['ATeamID'])))
teams = pd.DataFrame({'TeamID':teams})
main_df = pd.merge(teams,sd,how='outer')
main_df['SeedNum'] = main_df['Seed'].str[1:3]
main_df['Seed'].fillna('NS',inplace=True)
main_df['SeedNum'].fillna('NS',inplace=True)
main_df.index = list(main_df['TeamID'])
del main_df['TeamID']
del main_df['Season']
return main_df
seed = pd.read_csv('data/MNCAATourneySeeds.csv')
seed['TeamID'] = seed['TeamID'].astype('str')
grab_seeding(testing_df,seed).head()
Seed | SeedNum | |
---|---|---|
1106 | NS | NS |
1219 | NS | NS |
1285 | NS | NS |
1350 | NS | NS |
1249 | NS | NS |
Alright, we're done with all of our feature engineering. Kinda exhausting? The next step is to wrap all of this into an enormous function that generates a big-ass dataframe of ratings for every team – first from the compact results (1985 on), then from the detailed box-score results (2003 on).
# score data
def gen_score_data(dataframe, seeding):
df = dataframe
elo = calc_elo(df,hfa_rescore=True,ot_rescore=True)
lrmc = calc_lrmc(df)
srs = calc_srs(df)
rpi = calc_rpi(df)
ts = calc_ts(df)
lme = calc_lme(df)
col = calc_colley(df)
wp = calc_wp(df)
mov = calc_mov(df)
sd = grab_seeding(df, seeding)
return pd.concat([elo, lrmc, srs, rpi, ts, lme, col, wp, mov, sd],axis=1)
gen_score_data(compact_results.loc[compact_results['Season'] == 1990], seeding = seed).head() # why 1990? no reason...
Elo | LRMC | SRS | RPI | TS | Off | Def | Colley | WP | MOV | Seed | SeedNum | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1106 | 1457.690655 | 114.959698 | -6.468152 | 0.477352 | 23.586893 | 13.981185 | 16.435633 | 0.429118 | 0.520000 | -3.809030 | NS | NS |
1108 | 1389.483739 | 93.172849 | -11.313684 | 0.457892 | 20.018400 | 1.030306 | 7.373383 | 0.339644 | 0.360000 | -8.870126 | NS | NS |
1425 | 1491.137635 | 131.578084 | -0.651833 | 0.471407 | 26.394144 | 1.260576 | 1.200736 | 0.498420 | 0.428571 | -0.500000 | NS | NS |
1238 | 1413.525492 | 114.288471 | -6.379905 | 0.440363 | 19.715176 | -1.448090 | 1.866536 | 0.246491 | 0.280000 | -3.879578 | NS | NS |
1162 | 1353.674199 | 72.754592 | -14.088478 | 0.386421 | 14.703133 | -7.259959 | 4.629205 | 0.060706 | 0.076923 | -15.385427 | NS | NS |
def gen_adv_data(dataframe, det_dataframe, seeding, massey):
df = dataframe
det_df = det_dataframe
elo = calc_elo(df,hfa_rescore=True,ot_rescore=True)
lrmc = calc_lrmc(df)
srs = calc_srs(df)
rpi = calc_rpi(df)
ts = calc_ts(df)
lme = calc_lme(df)
col = calc_colley(df)
wp = calc_wp(df)
mov = calc_mov(df)
sd = grab_seeding(df, seeding)
ms = grab_massey(df, massey)
ppos = calc_box_score_rtgs(det_df)
ppos_raw = calc_box_score_rtgs(det_df,raw = True)
return pd.concat([elo, lrmc, srs, rpi, ts, lme, col, wp, mov, sd, ms, ppos, ppos_raw],axis=1)
Next step is to generate these ratings for every season and store them in an accessible manner.
basic_ratings_dict = {}
for year in compact_results['Season'].unique():
print('Generating basic ratings for ' + str(year) + '...')
temp_df = compact_results.loc[compact_results['Season'] == int(year)]
basic_ratings_dict[str(year)] = gen_score_data(temp_df, seed)
# basic_ratings_dict['2014'].head()
basic_list = []
for key in basic_ratings_dict.keys():
tmp_df = basic_ratings_dict[key]
tmp_df['Season'] = int(key)
basic_list.append(tmp_df)
basic_ratings_df = pd.concat(basic_list)
basic_ratings_df.head()
Generating basic ratings for 1985...
Generating basic ratings for 1986...
Generating basic ratings for 1987...
Generating basic ratings for 1988...
Generating basic ratings for 1989...
Generating basic ratings for 1990...
Generating basic ratings for 1991...
Generating basic ratings for 1992...
Generating basic ratings for 1993...
Generating basic ratings for 1994...
Generating basic ratings for 1995...
Generating basic ratings for 1996...
Generating basic ratings for 1997...
Generating basic ratings for 1998...
Generating basic ratings for 1999...
Generating basic ratings for 2000...
Generating basic ratings for 2001...
Generating basic ratings for 2002...
Generating basic ratings for 2003...
Generating basic ratings for 2004...
Generating basic ratings for 2005...
Generating basic ratings for 2006...
Generating basic ratings for 2007...
Generating basic ratings for 2008...
Generating basic ratings for 2009...
Generating basic ratings for 2010...
Generating basic ratings for 2011...
Generating basic ratings for 2012...
Generating basic ratings for 2013...
Generating basic ratings for 2014...
Generating basic ratings for 2015...
Generating basic ratings for 2016...
Generating basic ratings for 2017...
Generating basic ratings for 2018...
Generating basic ratings for 2019...
Generating basic ratings for 2020...
Generating basic ratings for 2021...
Elo | LRMC | SRS | RPI | TS | Off | Def | Colley | WP | MOV | Seed | SeedNum | Season | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1106 | 1461.053111 | 107.258502 | -2.698899 | 0.459318 | 19.541427 | 0.952114 | 2.104203 | 0.348392 | 0.500000 | 0.264010 | NS | NS | 1985 |
1108 | 1573.024506 | 199.131764 | 6.038746 | 0.533009 | 25.704259 | 7.593771 | 3.356364 | 0.564220 | 0.680000 | 4.066550 | NS | NS | 1985 |
1425 | 1566.036711 | 172.630828 | 4.867537 | 0.541256 | 30.437145 | -0.582718 | -3.477101 | 0.728202 | 0.642857 | 2.047567 | Y08 | 08 | 1985 |
1238 | 1429.471832 | 103.695027 | -6.388395 | 0.437554 | 18.680670 | -2.368737 | 1.600165 | 0.256282 | 0.333333 | -6.041667 | NS | NS | 1985 |
1162 | 1494.682611 | 104.933737 | -2.299174 | 0.455248 | 23.805288 | -6.828934 | -5.287778 | 0.448240 | 0.520000 | 0.146725 | NS | NS | 1985 |
adv_ratings_list = []
for year in detailed_results['Season'].unique():
print('Generating advanced ratings for ' + str(year) + '...')
temp_df = compact_results.loc[compact_results['Season'] == int(year)]
temp_det_df = detailed_results.loc[detailed_results['Season'] == int(year)]
temp_adv_data = gen_adv_data(temp_df, temp_det_df, seed, mass)
temp_adv_data['Season'] = year
adv_ratings_list.append(temp_adv_data)
Generating advanced ratings for 2003...
Generating advanced ratings for 2004...
Generating advanced ratings for 2005...
Generating advanced ratings for 2006...
Generating advanced ratings for 2007...
Generating advanced ratings for 2008...
Generating advanced ratings for 2009...
Generating advanced ratings for 2010...
Generating advanced ratings for 2011...
Generating advanced ratings for 2012...
Generating advanced ratings for 2013...
Generating advanced ratings for 2014...
Generating advanced ratings for 2015...
Generating advanced ratings for 2016...
Generating advanced ratings for 2017...
Generating advanced ratings for 2018...
Generating advanced ratings for 2019...
Generating advanced ratings for 2020...
Generating advanced ratings for 2021...
tmp_list = []
for d in adv_ratings_list:
d = d.loc[:,~d.columns.duplicated()]
tmp_list.append(d)
adv_ratings_df = pd.concat(tmp_list)
We're all the way wrapped on feature engineering, so we can now assemble our training and testing dataframes.
tournament_games = pd.read_csv('data/MNCAATourneyCompactResults.csv')
tournament_games['WTeamID'] = tournament_games['WTeamID'].astype('str')
tournament_games['LTeamID'] = tournament_games['LTeamID'].astype('str')
tournament_games['WLoc'] = 'N'
tournament_games = refactor_games_compact(tournament_games)
tournament_games.head()
Season | DayNum | HTeamID | HScore | ATeamID | AScore | NumOT | NeutralFlg | |
---|---|---|---|---|---|---|---|---|
0 | 1985 | 136 | 1116 | 63 | 1234 | 54 | 0 | True |
1 | 1985 | 136 | 1120 | 59 | 1345 | 58 | 0 | True |
2 | 1985 | 136 | 1207 | 68 | 1250 | 43 | 0 | True |
3 | 1985 | 136 | 1229 | 58 | 1425 | 55 | 0 | True |
4 | 1985 | 136 | 1242 | 49 | 1325 | 38 | 0 | True |
One trick I like to do is flip the data and double it, so the model doesn’t get biased towards predicting winners or losers.
tournament_games_flipped = tournament_games.copy()
tournament_games_flipped['HTeamID'] = tournament_games['ATeamID']
tournament_games_flipped['ATeamID'] = tournament_games['HTeamID']
tournament_games_flipped['HScore'] = tournament_games['AScore']
tournament_games_flipped['AScore'] = tournament_games['HScore']
tournament_games = pd.concat([tournament_games, tournament_games_flipped])
tournament_games['HDiff'] = tournament_games['HScore'] - tournament_games['AScore']
tournament_games.tail()
Season | DayNum | HTeamID | HScore | ATeamID | AScore | NumOT | NeutralFlg | HDiff | |
---|---|---|---|---|---|---|---|---|---|
2246 | 2019 | 146 | 1246 | 71 | 1120 | 77 | 1 | True | -6 |
2247 | 2019 | 146 | 1181 | 67 | 1277 | 68 | 0 | True | -1 |
2248 | 2019 | 152 | 1277 | 51 | 1403 | 61 | 0 | True | -10 |
2249 | 2019 | 152 | 1120 | 62 | 1438 | 63 | 0 | True | -1 |
2250 | 2019 | 154 | 1403 | 77 | 1438 | 85 | 1 | True | -8 |
h_adv_ratings = adv_ratings_df.copy()
a_adv_ratings = adv_ratings_df.copy()
h_adv_ratings['HTeamID'] = h_adv_ratings.index
a_adv_ratings['ATeamID'] = a_adv_ratings.index
tournament_merged = pd.merge(tournament_games, h_adv_ratings)
tournament_merged = pd.merge(tournament_merged, a_adv_ratings, on = ['ATeamID','Season'], suffixes = ['_H','_A'])
tournament_merged.head()
Season | DayNum | HTeamID | HScore | ATeamID | AScore | NumOT | NeutralFlg | HDiff | Elo_H | ... | MGS_A | NET_A | PIR_A | BNZ_A | CBR_A | CWL_A | EMK_A | FDM_A | KCX_A | RWP_A | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2003 | 134 | 1421 | 92 | 1411 | 84 | 1 | True | 8 | 1443.840515 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 2003 | 137 | 1421 | 61 | 1400 | 82 | 0 | True | -21 | 1443.840515 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 2003 | 144 | 1163 | 78 | 1400 | 82 | 0 | True | -4 | 1615.681074 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 2003 | 146 | 1277 | 76 | 1400 | 85 | 0 | True | -9 | 1621.855151 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 2003 | 139 | 1345 | 67 | 1400 | 77 | 0 | True | -10 | 1575.549719 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 493 columns
All of our data prep work is done; I'm gonna swap over to R for some tidymodels work.
tournament_merged.to_csv('data/adv_data.csv')
h_basic_ratings = basic_ratings_df.copy()
a_basic_ratings = basic_ratings_df.copy()
h_basic_ratings['HTeamID'] = h_basic_ratings.index
a_basic_ratings['ATeamID'] = a_basic_ratings.index
tournament_merged = pd.merge(tournament_games, h_basic_ratings)
tournament_merged = pd.merge(tournament_merged, a_basic_ratings, on = ['ATeamID','Season'], suffixes = ['_H','_A'])
tournament_merged.head()
tournament_merged.to_csv('data/basic_data.csv')
Actually, before we go, I’m going to prepare our stage 1 and stage 2 datasets.
stage1 = pd.read_csv('MSampleSubmissionStage1.csv')
stage2 = pd.read_csv('MSampleSubmissionStage2.csv')
stage1['Season'] = stage1['ID'].str.slice(start=0,stop=4)
stage1['HTeamID'] = stage1['ID'].str.slice(start=5,stop=9)
stage1['ATeamID'] = stage1['ID'].str.slice(start=10,stop=14)
stage1['Season'] = stage1['Season'].astype('int')
stage1_tmp = stage1.copy()
stage1_tmp['HTeamID'] = stage1['ATeamID']
stage1_tmp['ATeamID'] = stage1['HTeamID']
stage1_merged = pd.merge(stage1, h_adv_ratings)
stage1_merged = pd.merge(stage1_merged, a_adv_ratings, on = ['ATeamID','Season'], suffixes = ['_H','_A'])
stage1_merged.to_csv('data/stage1_2021.csv')
stage2['Season'] = stage2['ID'].str.slice(start=0,stop=4)
stage2['HTeamID'] = stage2['ID'].str.slice(start=5,stop=9)
stage2['ATeamID'] = stage2['ID'].str.slice(start=10,stop=14)
stage2['Season'] = stage2['Season'].astype('int')
stage2_tmp = stage2.copy()
stage2_tmp['HTeamID'] = stage2['ATeamID']
stage2_tmp['ATeamID'] = stage2['HTeamID']
stage2_merged = pd.merge(stage2, h_adv_ratings)
stage2_merged = pd.merge(stage2_merged, a_adv_ratings, on = ['ATeamID','Season'], suffixes = ['_H','_A'])
stage2_merged.to_csv('data/stage2_2021.csv')
%load_ext rpy2.ipython
# initializations
library(tidyverse)
library(tidymodels)
library(recipeselectors)
library(MLmetrics)
adv_data <- read.csv("C:/Users/edwar/Documents/GitHub/kaggle-mens-march-madness-2021/data/adv_data.csv", header=TRUE) %>%
mutate(HWin = HDiff > 0,
SeedDiff = as.numeric(SeedNum_H) - as.numeric(SeedNum_A)) %>%
select(-c(X, Seed_H, Seed_A, SeedNum_A, SeedNum_H))
What follows summarizes the results of some experimenting in RStudio. I'm just keeping things together in the same notebook – this wasn't my exact thought process, but it's a more refined way of working through things.
First step is to grab train/test data. We’ll hold out the 2019 tournament for evaluation.
pre_2015 = adv_data %>%
filter(Season <= 2018) %>%
select(-c(Season, DayNum, HTeamID, HScore, ATeamID, AScore, NumOT, NeutralFlg))
post_2015 = adv_data %>%
filter(Season == 2019) %>%
select(-c(Season, DayNum, HTeamID, HScore, ATeamID, AScore, NumOT, NeutralFlg))
Next, we'll check which columns have a large number of NAs so that we can drop them rather than try to impute them in our preprocessing.
colSums(is.na(pre_2015))
HDiff Elo_H LRMC_H
0 0 0
SRS_H RPI_H TS_H
0 0 0
Off_H Def_H Colley_H
0 0 0
WP_H MOV_H AP_H
0 0 0
ARG_H BIH_H BOB_H
1310 128 268
BRZ_H COL_H DOL_H
1968 0 0
DUN_H DWH_H ECK_H
524 1456 1456
ENT_H ERD_H GRN_H
1712 1968 1072
GRS_H HER_H HOL_H
1706 1712 1968
IMS_H MAS_H MKV_H
1840 512 1712
MOR_H POM_H RTH_H
0 0 0
SAG_H SAU_H SE_H
0 1712 658
SEL_H STR_H TSR_H
256 1712 1456
USA_H WLK_H WOB_H
0 0 128
WOL_H WTE_H FGM_POSMade_H
0 1968 0
FGM_POSAllowed_H FGA_POSMade_H FGA_POSAllowed_H
0 0 0
FGM3_POSMade_H FGM3_POSAllowed_H FGA3_POSMade_H
0 0 0
FGA3_POSAllowed_H FTM_POSMade_H FTM_POSAllowed_H
0 0 0
FTA_POSMade_H FTA_POSAllowed_H OR_POSMade_H
0 0 0
OR_POSAllowed_H DR_POSMade_H DR_POSAllowed_H
0 0 0
AST_POSMade_H AST_POSAllowed_H TO_POSMade_H
0 0 0
[Remaining missing-value counts, abridged: the box-score *_POS* columns, the core rating features (Elo, LRMC, SRS, RPI, TS, Off, Def, Colley, WP, MOV, AP), the long-running ranking systems (COL, DOL, MOR, POM, RTH, SAG, USA, WLK, WOL), HWin, and SeedDiff have no missing values, while most of the remaining computer-ranking systems (BD, CNG, DES, JON, ..., AWS, INC, LAW, NET, RWP) are missing for anywhere from 128 to 2,096 games.]
After filtering out the columns with large numbers of missing values, we’ll quickly implement a preprocessing recipe: impute the remaining missing values with k-nearest neighbors and then run the Boruta feature-selection algorithm.
res = c('HDiff','HWin','SeedDiff')
metrics = c('Elo','LRMC','SRS','RPI','TS','Off','Def','Colley','WP','MOV',
'AP','BOB','BIH','COL','DOL','DUN','MOR','POM','RTH','SAG','SE','SEL',
'USA','WLK','WOB','WOL','FGM_POSMade','FGM_POSAllowed','FGA_POSMade',
'FGA_POSAllowed','FGM3_POSMade','FGM3_POSAllowed','FGA3_POSMade',
'FGA3_POSAllowed','FTM_POSMade','FTM_POSAllowed','FTA_POSMade',
'FTA_POSAllowed','OR_POSMade','OR_POSAllowed','DR_POSMade',
'DR_POSAllowed','AST_POSMade','AST_POSAllowed','TO_POSMade','TO_POSAllowed',
'STL_POSMade','STL_POSAllowed','BLK_POSMade','BLK_POSAllowed',
'PF_POSMade','PF_POSAllowed','PTS_POSMade','PTS_POSAllowed',
'TEMPOMade','TEMPOAllowed','FGM_POSMadeRAW','FGA_POSMadeRAW',
'FGM3_POSMadeRAW','FGA3_POSMadeRAW','FTM_POSMadeRAW','FTA_POSMadeRAW',
'OR_POSMadeRAW','DR_POSMadeRAW','AST_POSMadeRAW','TO_POSMadeRAW',
'STL_POSMadeRAW','BLK_POSMadeRAW','PF_POSMadeRAW','PTS_POSMadeRAW',
'TEMPOMadeRAW','FGM_POSAllowedRAW','FGA_POSAllowedRAW',
'FGM3_POSAllowedRAW','DR_POSAllowedRAW','AST_POSAllowedRAW',
'TO_POSAllowedRAW','STL_POSAllowedRAW','BLK_POSAllowedRAW',
'PF_POSAllowedRAW','PTS_POSAllowedRAW','TEMPOAllowedRAW',
'CNG','DC')
cols = c(res, paste0(metrics,'_A'), paste0(metrics,'_H'))
# restrict the pre-2015 training data to the candidate columns
boruta_df = pre_2015[colnames(pre_2015) %in% cols]
# impute missing predictors with kNN, then run Boruta feature selection against HWin
outright_boruta = recipe(HWin ~ ., data = boruta_df %>% select(-c(HDiff))) %>%
  step_knnimpute(all_predictors(), neighbors = 3) %>%
  step_select_boruta(all_predictors(), outcome = "HWin")
prepped = prep(outright_boruta)
# pull Boruta's final decisions out of the prepped recipe and keep only the confirmed features
decisions = prepped[['steps']][[2]][['res']][1]$finalDecision
decisions[which(decisions == 'Confirmed')]
Elo_H LRMC_H SRS_H RPI_H
Confirmed Confirmed Confirmed Confirmed
TS_H Colley_H AP_H BIH_H
Confirmed Confirmed Confirmed Confirmed
BOB_H COL_H DOL_H DUN_H
Confirmed Confirmed Confirmed Confirmed
MOR_H POM_H RTH_H SAG_H
Confirmed Confirmed Confirmed Confirmed
SE_H SEL_H USA_H WLK_H
Confirmed Confirmed Confirmed Confirmed
WOB_H WOL_H FGM_POSMade_H BLK_POSMade_H
Confirmed Confirmed Confirmed Confirmed
PTS_POSMade_H PTS_POSAllowed_H BLK_POSMadeRAW_H PTS_POSMadeRAW_H
Confirmed Confirmed Confirmed Confirmed
CNG_H DC_H Elo_A LRMC_A
Confirmed Confirmed Confirmed Confirmed
SRS_A RPI_A TS_A Colley_A
Confirmed Confirmed Confirmed Confirmed
AP_A BIH_A BOB_A COL_A
Confirmed Confirmed Confirmed Confirmed
DOL_A DUN_A MOR_A POM_A
Confirmed Confirmed Confirmed Confirmed
RTH_A SAG_A SE_A SEL_A
Confirmed Confirmed Confirmed Confirmed
USA_A WLK_A WOB_A WOL_A
Confirmed Confirmed Confirmed Confirmed
FGM_POSMade_A BLK_POSMade_A PTS_POSMade_A PTS_POSAllowed_A
Confirmed Confirmed Confirmed Confirmed
BLK_POSMadeRAW_A PTS_POSMadeRAW_A PTS_POSAllowedRAW_A CNG_A
Confirmed Confirmed Confirmed Confirmed
DC_A SeedDiff
Confirmed Confirmed
Levels: Tentative Confirmed Rejected
With our features selected, we can filter out the unused columns and build a cross-validated dataset to train on.
# Boruta-confirmed metrics plus the response columns
res = c('HDiff','HWin','SeedDiff')
metrics = c('Elo','LRMC','SRS','RPI','TS','Colley',
  'AP','BIH','COL','DOL','MOR','POM','SAG','SEL',
  'USA','WLK','WOB','PTS_POSMade','PTS_POSAllowed','PTS_POSMadeRAW')
cols = c(res, paste0(metrics,'_A'), paste0(metrics,'_H'))
# re-run the kNN imputation on the full Boruta column set, then keep only the confirmed columns
clf_impute = recipe(HWin ~ ., data = boruta_df %>% select(-c(HDiff))) %>%
  step_knnimpute(all_predictors(), neighbors = 3)
prepped = prep(clf_impute)
data = juice(prepped)
data = data[colnames(data) %in% cols]
data$HDiff = pre_2015$HDiff
set.seed(1986)
# train on 1985-2014, hold out 2015-2019, and build stratified CV folds for tuning
df_train = data
df_test = post_2015
cv_folds = vfold_cv(df_train %>% mutate(HWin = as.factor(HWin)) %>% select(-c(HDiff)), strata = HWin)
We’ll tune our XGBoost model here with Bayesian optimization over the CV folds. The tuning code is copied below so it can be run end to end, but it takes a long time to finish, so it is left commented out in the notebook (which is why print(xgb_best) returns NULL below); the selected hyperparameters are hard-coded in the training cell that follows.
# model spec: boosted trees with every key hyperparameter marked for tuning
xgb = boost_tree(
  tree_depth = tune(),
  min_n = tune(),
  trees = tune(),
  loss_reduction = tune(),
  sample_size = tune(),
  mtry = tune(),
  learn_rate = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
# parameter ranges; mtry is finalized against the training data
xgb_params = parameters(
  tree_depth(),
  min_n(),
  trees(range = c(10, 2000)),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), df_train),
  learn_rate()
)
xgb_wrkflw <- workflow() %>%
  add_formula(HWin ~ .) %>%
  add_model(xgb)
doParallel::registerDoParallel()
# Bayesian hyperparameter search, minimizing mean log loss over the CV folds
xgb_tune <- tune_bayes(
  xgb_wrkflw,
  resamples = cv_folds,
  param_info = xgb_params,
  iter = 1000,
  metrics = metric_set(
    mn_log_loss
  ),
  initial = 15,
  control = control_bayes(
    no_improve = 100,
    uncertain = 10,
    save_pred = F,
    save_workflow = F,
    time_limit = 600,
    verbose = T
  )
)
xgb_best <- select_best(xgb_tune, "mn_log_loss")
print(xgb_best)
NULL
We now have our tuned XGBoost model. We’ll fit it on the training set, evaluate it on the held-out test set, and then apply its predictions to the 2019 tournament.
# fit the final model with the tuned hyperparameters and cache it to disk
xgb = boost_tree(mtry = 2,
                 trees = 401,
                 min_n = 9,
                 tree_depth = 2,
                 learn_rate = 0.012,
                 loss_reduction = 0.42,
                 sample_size = 0.443) %>%
  set_engine("xgboost") %>%
  set_mode("classification") %>%
  fit(HWin ~ ., data = df_train %>% mutate(HWin = as.factor(HWin)) %>% select(-c(HDiff)))
saveRDS(xgb, 'xgb_outright.RDS')
[20:10:51] WARNING: amalgamation/../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
# load the cached model, score the held-out games, and compute the log loss of the home-win probabilities
xgb = readRDS('xgb_outright.RDS')
xprb = predict(xgb, df_test, type = "prob")
print(LogLoss(xprb$.pred_TRUE, df_test$HWin))
[1] 0.5008113
We get a pretty solid log loss for the 2019 tournament; this would have put us at about the middle of the pack. There are a few tricks we can use to boost our performance further, which I have not implemented here but which are straightforward to add:
- Cap predictions at 0.95/0.05. When a team with a very high win probability gets upset (and upsets are inevitable), capping limits the log-loss damage relative to entries that submit uncapped probabilities, while the gain from correctly calling a game at 0.99 instead of 0.95 is not that substantial.
- Generate two sets of predictions and randomly flip close games while artificially boosting confidence. Kaggle allows two submissions for stage 2, not just one. We can exploit this by taking the games we consider coinflips (win probabilities between 45% and 55%), randomly picking a winner in each, and assigning an artificially high confidence (such as 70%/30% or, if you want to be really extreme, 100%/0%). Scoring well enough to actually win the competition requires this kind of variance: if the coinflips land our way, the inflated confidence boosts our score; if they don’t, yikes! (Spoiler alert: it worked.) A minimal sketch of this cap-and-flip post-processing follows the list.
- Adjust predictions for injuries. Something many Kaggle teams won’t do, but that is extremely valuable, is monitoring the injury status of each team’s key players. Losing a star can move the point spread by as much as 3 points, which shifts a team’s win probability significantly. Using Vegas spreads, I converted my win-probability estimates to point spreads, used approximate value to estimate how much each missing player is worth on the spread, adjusted the spreads accordingly, and then converted them back to win probabilities. A rough sketch of that probability-to-spread round trip also follows the list.
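To make the first two tricks concrete, here is a minimal sketch in R of the cap-and-flip post-processing applied to a hypothetical preds data frame with a Pred column of win probabilities. The 0.95/0.05 cap, the 45%–55% coinflip band, and the 70%/30% override come straight from the bullets above; everything else (the ID format, column names, the two-submission split) is illustrative, not the exact code used for the competition entries.
library(dplyr)
# hypothetical submission frame: one row per matchup, Pred = P(lower-ID team wins)
set.seed(2021)
preds <- data.frame(ID = sprintf("2021_%04d_%04d", 1101:1164, 1201:1264),
                    Pred = runif(64))
# Submission 1: cap every probability at 0.95/0.05 to limit log-loss damage from upsets
sub1 <- preds %>% mutate(Pred = pmin(pmax(Pred, 0.05), 0.95))
# Submission 2: treat 45%-55% games as coinflips, pick a winner at random,
# and overwrite the probability with an artificially confident 70%/30%
sub2 <- sub1 %>%
  mutate(coinflip = Pred > 0.45 & Pred < 0.55,
         flip_up  = runif(n()) > 0.5,
         Pred     = case_when(coinflip & flip_up  ~ 0.70,
                              coinflip & !flip_up ~ 0.30,
                              TRUE                ~ Pred)) %>%
  select(ID, Pred)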
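The injury trick hinges on converting between win probabilities and point spreads. The exact mapping used for the competition (calibrated against Vegas lines) isn’t shown here, so this is only a rough sketch under a common assumption that the final margin is roughly normal around the spread with a standard deviation of about 11 points; the 3-point injury penalty is a hypothetical stand-in for the approximate-value estimate.
# assumption: final margin ~ Normal(mu = spread, sd of about 11 points) for college games
SD_MARGIN <- 11
prob_to_spread <- function(p, sd = SD_MARGIN) qnorm(p) * sd   # win prob -> expected margin
spread_to_prob <- function(s, sd = SD_MARGIN) pnorm(s / sd)   # expected margin -> win prob
# example: the model gives a team a 65% chance, but its star is ruled out and
# approximate value suggests he is worth ~3 points of spread (hypothetical number)
p_model        <- 0.65
injury_penalty <- 3
spread         <- prob_to_spread(p_model)                  # roughly +4.2 points
p_adjusted     <- spread_to_prob(spread - injury_penalty)  # roughly 0.54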