Goal Scoring Prediction in Hockey.

Roman Nagy
15 min read · Dec 22, 2021


This blog post provides a technical overview of my Udacity Data Science Nanodegree project. I’ve chosen a customized project using NHL datasets to predict the chance of scoring a goal after a shot.

Project overview

Predicting the chance of scoring a goal is increasingly offered as a service by TV stations broadcasting sports games. I’ve seen this service several times while watching soccer, but never for hockey. This led me, as a hockey fan, to the idea of predicting scoring chances for hockey shots based on event records captured during the game. I wanted to figure out how challenging and how precise these predictions might be, considering all available data that could possibly impact the chance of scoring a goal.

In order to accomplish this project, I used 3 datasets:

  • NHL Game Data (an extensive database connecting games, game events, teams, and players) was the main source of information for goal predictions.
  • NHL Player Salaries (player statistics, incl. salaries) was used to extract additional data about players.
  • World Cities (GPS coordinates for world cities) was used for data exploration only.

Schema of the NHL Game Data dataset (from Kaggle)

Problem Statement

The main goal of the project is to predict if a shot with given parameters will land in the goal. This is a binary classification problem. A binary classifier should be developed to make these predictions.

The classifier will predict the probability of a goal. This probability is a value from the interval [0, 1] and is converted to a binary prediction of 0 or 1 based on a given threshold (0.5 by default). If the predicted probability is higher than the threshold, the predicted value is interpreted as 1; if it’s lower than the threshold, it’s interpreted as 0.
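As a simple illustration (not code from the project), the conversion from predicted probability to binary prediction looks like this:

# illustration only: converting a predicted probability into a goal/no-goal prediction
threshold = 0.5
predicted_probability = 0.63   # example value returned by a classifier
prediction = 1 if predicted_probability > threshold else 0   # -> 1 (goal)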

In order to achieve this goal, data from the different datasets should be used as predictive features. Where meaningful, new predictive features should be modelled out of the existing data to increase model performance. Using the existing data and the new features, a dataset should be designed for training and testing purposes. Finally, the most important features for goal prediction should be identified.

To evaluate the performance of the predictive model, one or more metrics need to be defined properly. Using these metrics, all steps in the model development and feature engineering need to be evaluated.

Metrics

There are several common metrics applicable to evaluating a binary classifier. In my project, I used the following metrics: Accuracy, F1 Score, Precision, Recall, and AUC-ROC.

Accuracy is the basic metric for classification problems: it tells how many of all predictions were correct. However, accuracy alone is not sufficient, especially for imbalanced datasets. If the model degenerates and becomes biased towards one of the target values in a binary classification, accuracy would still be high as long as this target value occurs in most of the samples in the test dataset. For this reason, I also used the F1-Score, which combines Precision and Recall. The goal here is to quantify how many data points in the test set have been predicted correctly as 0 or 1, how many of them are False Positives (Type I error), and how many are False Negatives (Type II error). False Positives are values predicted as Positive (1) even though they should have been predicted as Negative (0). False Negatives are predictions of Negative (0) even though they should have been Positive (1). Based on these counts (together with the True Positives, TP), Precision, Recall, and finally the F1 Score can be calculated as:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 · (Precision · Recall) / (Precision + Recall)

AUC-ROC is the only one of these metrics that works with the predicted scores (probabilities) rather than with the binarized predictions (0/1).

More explanations regarding the model evaluation and all metrics used in the project can be found in the notebooks 4_nhl_logistic_regression and 5_nhl_multiple_classifiers. For the final evaluation of model performance, I considered the combination of Accuracy and F1-Score, trying to keep both high and in balance. In this combination, the prediction is accurate and the model has a high Recall. The reason I chose these metrics is to avoid False Negatives, i.e. situations where goal situations are predicted as no-goal; a high Recall should prevent that.

The selection of proper metrics could (and should) be done differently in other use cases where binary classification is applied. Each use case requires reconsidering the metrics used.

Data Exploration and Visualization

The project started with the exploration of the dataset on NHL players. It provides a lot of interesting facts and is documented in the notebook 1_nhl_explore_players.

Places of birth of NHL players

Even though this kind of data is interesting, only one piece of information was considered potentially relevant for the future model: the salary. There might be an interesting correlation between salary and a player’s ability to score a goal. This will be answered later in the project.

The NHL Game Data dataset offered much more relevant data. For the purpose of the goal/no-goal classification, I started with data from the table game_plays. This is documented in the notebooks 2_nhl_explore_games and 3_nhl_explore_one_game. Each event of the game is recorded in the data, providing detailed information about every goal, shot, hit, penalty, etc. For shots and goals, which are the most interesting events for prediction, the shot type and its location are provided.

The first model has been trained with a collection of features regarding shot type and shot location only. The table game_plays provides 5,050,529 events, out of which 929,419 are goals or shots used for the training. The categorical feature secondaryType, representing the type of the shot, has been converted to several numerical columns using the pandas function get_dummies. The following features have been used for the first model:

features = ['st_x', 'st_y', 'secondaryType_Deflected', 'secondaryType_SlapShot', 'secondaryType_SnapShot',
            'secondaryType_TipIn', 'secondaryType_Wraparound', 'secondaryType_WristShot']

The distribution of the input space shows several interesting facts. On the x-axis of the shot/goal location, the data distribution is skewed towards the goal (on the right side of the plot): more goals/shots are taken from the offensive zone than from the other side of the rink. The distribution on the y-axis is roughly normal, which means many shots come from the middle of the rink and fewer from the sides. The ratio of goals vs. shots (with no goal) is clearly skewed towards no-goal. I kept it this way at the beginning to see the reaction of the models; more on this topic later in the evaluation.

Besides these features, I created new features calculating the distance to the goal and the angle. The distance represents the distance from the shot location to the middle of the goal. The angle represents the angle between the shot trajectory and the longitudinal axis of the rink. Distance and angle both have outliers, which will be handled in the next sections. Detailed information about the calculation of these two features can be found in the notebook 4_nhl_logistic_regression.
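A minimal sketch of how these two features can be derived, assuming the attacking goal center sits at (89, 0) in the st_x/st_y coordinate system (the exact constants and sign handling used in the notebook may differ):

import numpy as np

# assumed goal center position in the st_x/st_y coordinate system
GOAL_X, GOAL_Y = 89, 0

# Euclidean distance from the shot location to the middle of the goal
df['distance'] = np.sqrt((GOAL_X - df['st_x'])**2 + (GOAL_Y - df['st_y'])**2)

# angle (in degrees) between the shot trajectory and the longitudinal axis of the rink
df['angle'] = np.degrees(np.arctan2(np.abs(df['st_y'] - GOAL_Y),
                                    np.abs(GOAL_X - df['st_x'])))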

The next source of data was player statistics and player salaries. The tables player_info, game_goalie_stats, and game_skater_stats have been used to calculate and aggregate savePercentage for goalies and goal statistics for skaters, and to enrich the game events with this information. The final piece of information added was the salary of the skater, extracted from the dataset NHL Player Salaries. The extraction of the data from the different data sources, the aggregation, and the calculation of new features can be found in the notebook 9_nhl_data_pipeline. The final dataset used in the last version of the models contained 20 features with various levels of correlation to the target value goal.

There are several features more or less correlated with goal. The correlation is an important indication for using a particular feature in training, but some features might turn out to be very important for the model even if they are not strongly correlated with the target. More about this in later sections.

Data preprocessing

Several data preprocessing steps needed to be done before the data could be used for training. Here are all the types of changes (all preprocessing steps can be found in the notebook 9_nhl_data_pipeline):

Renaming columns, joining data: This was needed to extract data from multiple datasets and to merge them into one final pandas DataFrame.

# adding players to the game
df_player = pd.read_csv('data/nhl/nhl_stats/game_plays_players.csv')
df_player.drop_duplicates(inplace=True)

# skaters
df_skater = df_player[df_player.playerType.isin(['Shooter', 'Scorer'])].copy()
df_skater.rename(columns={'player_id':'skater_id'}, inplace=True)
df_skater.drop(columns=['game_id', 'playerType'], inplace=True)

# goalies
df_goalie = df_player[df_player.playerType.isin(['Goalie'])].copy()
df_goalie.rename(columns={'player_id':'goalie_id'}, inplace=True)
df_goalie.drop(columns=['game_id', 'playerType'], inplace=True)

# merge skaters and goalies to the game events
df = df.merge(df_skater, how='left', on='play_id')
df = df.merge(df_goalie, how='left', on='play_id')

Dropping and/or filling NaN values: Depending on the column and the amount of missing data, the NaN values needed to be dropped or filled. If the amount of missing data was very low compared to the size of the dataset, the rows with NaN values were dropped completely. If the number was significant, the missing values were replaced.

# Apparently not all shots were on goal, so there is no goalie. Keeping these rows anyway but replacing the goalie id with zero
df.goalie_id.fillna(0, inplace=True)
# replace missing salary
df.salary.fillna(df.salary.median(), inplace=True)
df.savePercentage.fillna(df.savePercentage.median(), inplace=True)

Categorical columns to numerical: In order to be usable for training, categorical data has been replaced by numerical dummy columns (utils.py):

df = pd.concat([df.drop(columns=col, axis=1),
                pd.get_dummies(df[col], prefix=col, prefix_sep='_',
                               drop_first=True, dummy_na=dummy_na)], axis=1)

Target set to 0/1: All goal/shot events have been converted to a binary value 0/1 indicating goal/no-goal, which is used as the target.

# Prepare target. Convert categorical event values 'Goal' to a numerical 0/1 value indicating goal
df['goal'] = np.where(df.event=='Goal', 1, 0)
df.drop(columns='event', inplace=True)

Balancing target value: After the first training experiments, the data was balanced with respect to the target value:

size = min(df[df[target]==0].shape[0], df[df[target]==1].shape[0])
goals = df[df[target]==1].sample(size, replace=False, random_state=10)
no_goals = df[df[target]==0].sample(size, replace=False, random_state=10)
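The two equally sized samples are then presumably combined and shuffled into one balanced DataFrame, roughly like this (the exact code in the notebook may differ):

# combine the balanced samples and shuffle the rows
df = pd.concat([goals, no_goals]).sample(frac=1, random_state=10).reset_index(drop=True)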

Extracting new features: New features have been extracted for distance and angle. Player statistics were grouped to obtain savePercentage for goalies and goal statistics for skaters.

# calculating the overall savePercentage for each goalie
df_goalie = pd.read_csv('data/nhl/nhl_stats/game_goalie_stats.csv')
df_goalie = (df_goalie.groupby('player_id')
             .agg({'savePercentage': 'mean'})
             .reset_index())
df_goalie.rename(columns={'player_id': 'goalie_id'}, inplace=True)

# calculating the overall statistics for each skater
df_skater = pd.read_csv('data/nhl/nhl_stats/game_skater_stats.csv')
df_skater = (df_skater.groupby('player_id')
             .agg({'goals': 'sum', 'shots': 'sum',
                   'assists': 'sum', 'timeOnIce': 'sum'})
             .reset_index())
df_skater.rename(columns={'player_id': 'skater_id'}, inplace=True)

Implementation

Three classifiers have been used to predict the goal. The same basic process was used for all of them: split the dataset into a training and a test set, create the predictive model (classifier), fit the model on the training set, predict and evaluate on the test set. To reduce code duplication, this process was implemented in model.py for all classifiers. The following three classifiers have been used:

  • Logistic Regression: The LogisticRegression classifier from the scikit-learn library has been used as implemented below:
# balance weights
if balance_weights:
    weights = {0: df[df[target] == 1].shape[0] / df.shape[0],
               1: df[df[target] == 0].shape[0] / df.shape[0]}
else:
    weights = {0: 1.0, 1: 1.0}
# create model
model = LogisticRegression(max_iter=max_iter, class_weight=weights)
# fit model
model.fit(X_train, y_train)

Logistic regression fits the data using the sigmoid function to model a dependent variable with two possible values, 0 and 1. I provided an explanation of the sigmoid function in the notebook 4_nhl_logistic_regression and demonstrated visually how this function fits the data for one predictive feature, the distance.
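As a quick illustration (not the notebook code itself), the sigmoid squashes any real-valued score into the interval (0, 1), which is then interpreted as the goal probability:

import numpy as np

def sigmoid(z):
    # maps any real-valued score into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# for a single feature such as distance, logistic regression learns w and b so that
# P(goal = 1 | distance) = sigmoid(w * distance + b)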

  • LightGBM Classifier: Uses a gradient boosting model with tree-based algorithms. Internally, it builds several weak learners (decision trees) sequentially. The training of the LightGBM classifier is implemented in model.py as well:
# create model
model = LGBMClassifier(random_state=42, is_unbalance=unbalanced)
# fit model
model.fit(X_train, y_train)
  • K-Neighbors Classifier: The implementation of the K-nearest neighbors method. It doesn’t construct an internal model but simply stores the training data and classifies new samples based on their nearest neighbours. With a growing number of predictive features this algorithm got very slow, which is why it was not used for all experiments.
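For completeness, a minimal sketch of how this classifier could be created with scikit-learn (the exact parameters used in model.py may differ):

from sklearn.neighbors import KNeighborsClassifier

# create model (n_neighbors is an assumed value, not taken from the project)
model = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)
# fit model
model.fit(X_train, y_train)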

After each training, the model was evaluated and all metrics defined in the Metrics section have been calculated using the method evaluate_model. This is implemented in the file metrics.py.

preds = np.where(model.predict_proba(X_test)[:, 1] > threshold, 1, 0)
metrics['accuracy'] = accuracy_score(y_test, preds)
metrics['f1'] = f1_score(y_test, preds, zero_division=0)
metrics['auc'] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
metrics['precision'] = precision_score(y_test, preds, zero_division=0)
metrics['recall'] = recall_score(y_test, preds, zero_division=0)

In order to enable an offline evaluation of the metrics, all metrics have been stored in an external file, results.csv. This is implemented in metrics.py as well.
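A hypothetical sketch of how such a row could be appended to results.csv (metrics.py may store the results differently; dataset_name and model_name are made-up identifiers for the current experiment):

import os
import pandas as pd

# one row per evaluated model/dataset combination
row = pd.DataFrame([{'dataset': dataset_name, 'model': model_name, **metrics}])
row.to_csv('results.csv', mode='a', index=False,
           header=not os.path.exists('results.csv'))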

The training process and the model evaluation have been executed after every adaptation of the dataset (new feature added, target balanced, etc.) to measure the impact of the changes in the dataset on the models used.

The whole implementation has been split into 9 Jupyter notebooks and 3 Python files:

1_nhl_explore_players.ipynb: Provides interesting insights about NHL players, their salary and place of birth, and answers the question of where the majority of players were born.

2_nhl_explore_games.ipynb: Explores diverse statistics of events that happened during multiple games, e.g. shots on goal and goals.

3_nhl_explore_one_game.ipynb: Explores statistics for one particular game and starts to evaluate possible predictive features.

4_nhl_logistic_regression.ipynb: Goal prediction using the first binary classifier, explanation of the used metrics, evaluation, and visualization of the sigmoid function and data.

5_nhl_multiple_classifiers.ipynb: Goal prediction using logistic regression, the LightGBM classifier, and the K-Neighbors classifier. Balancing the dataset. Using the new predictive features distance and angle in all classifiers.

6_nhl_player_features.ipynb: Introduces new predictive features extracted from player data.

7_nhl_finetuning_experiments.ipynb: Provides experiments to improve model performance using feature selection, threshold fine-tuning, and cross-validation.

8_nhl_evaluation.ipynb: Evaluates and plots performance metrics for all models.

9_nhl_data_pipeline.ipynb: The pipeline extracting the data and creating all features in one function.

model.py: model creation (Logistic Regression, LightGBM, KNeighborsClassifier) and training.

metrics.py: utilities to evaluate models at scale and to store the results for later evaluation.

utils.py: useful re-usable functions for data manipulation, visualization and training.

Refinement of the process was done in 7_nhl_finetuning_experiments. For Logistic Regression, I used feature scaling and searched for the best prediction threshold:

# create model
model = make_pipeline(StandardScaler(),
                      LogisticRegression(max_iter=10000))
# fit model
model.fit(X_train, y_train)
# evaluate
evaluate_model(model, X_test, y_test, c_matrix=True, r_curve=False)

max_value = -1
best_threshold = 0
for thresh in np.arange(0.4, 0.6, 0.001):
    _, metrics = evaluate_model(model, X_test, y_test,
                                threshold=thresh, c_matrix=False,
                                r_curve=False, suppress=True)
    if metrics['accuracy'] > max_value:
        max_value = metrics['accuracy']
        best_threshold = thresh

print(f'Best accuracy={max_value} achieved with the threshold {best_threshold}')
_, lg = evaluate_model(model, X_test, y_test, threshold=best_threshold,
                       c_matrix=True, r_curve=True)

For the LightGBM classifier, a GridSearchCV was used to find the best performing parameters.

grid_params = {'boosting_type': ['gbdt'],
               'colsample_bytree': [0.75, 0.8, 0.85],
               'learning_rate': [0.05, 0.1, 0.15],
               'max_depth': [4, 5, 6],
               'n_estimators': [80, 100, 120],
               'num_leaves': [12, 18, 24],
               'objective': ['binary'],
               'reg_alpha': [4, 5, 6],
               'reg_lambda': [4, 5, 6],
               'seed': [500, 600, 700],
               'subsample': [0.75]}
model = LGBMClassifier()
grid = GridSearchCV(model, param_grid=grid_params,
                    verbose=1, cv=5, n_jobs=-1)
# start training
start = time.time()
print('Starting…')
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
model.fit(X_train, y_train)
duration = time.time() - start
print(f'Training ran in {duration:.5f} seconds')
mod, metrs = evaluate_model(grid, X_test, y_test)

In order to speed up the data extraction for all experiments and fine-tuning, a data pipeline has been implemented in the notebook 9_nhl_data_pipeline. It extracts the data and creates the newly defined features step by step. After each step, a dataset was saved as a Parquet file. In total, 10 datasets have been created and stored. The evaluation has been done for all of these datasets.
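A hypothetical example of how an intermediate dataset can be persisted after a pipeline step and later reloaded for training (the actual file names in 9_nhl_data_pipeline may differ):

# save the dataset after a pipeline step (hypothetical file name)
df.to_parquet('data/processed/with_distance_angle.parquet', index=False)

# ... later, load it again for training
df = pd.read_parquet('data/processed/with_distance_angle.parquet')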

Model evaluation, validation and justification

All used models have been evaluated using the metrics described in the section Metrics, and the results have been stored in results.csv. Besides the values of the metrics, I also evaluated the most important features, especially for the LightGBM classifier.

The first dataset provided a very high Accuracy value of .9069. A closer look in the notebook 4_nhl_logistic_regression revealed that the model predicts everything as 0, so the values of F1-Score, Recall, and Precision are 0.

After balancing the dataset towards uniformly distributed target values, the Accuracy increased from .629 (first balanced dataset) to .694 (full dataset with 20 features and the tuned logistic regression model). The F1-Score increased from .645 to .711, and the Recall from .677 to .752.

The logistic regression classifier was in general not very stable. It reacted very sensitively to adding some of the predictive features. An example is the dataset ‘with_players_id’, where high Precision but very low Recall and F1-Score were achieved.

The most stable classifier with respect to new features and imbalanced data was the LightGBM classifier. It achieved good scores across all used datasets.

The K-Neighbors classifier was more stable than the logistic regression but did not achieve the best score on any of the used datasets.

Regarding the best score, the Logistic Regression got very close to the LightGBM after fine-tuning. They both achieved an Accuracy close to .70, an F1-Score of .71, and a Recall of .74 and .75, respectively. To achieve these values, the standard scaler was used for Logistic Regression and the prediction threshold was adapted to .478. For the LightGBM classifier, no fine-tuning was necessary; even attempts to cross-validate parameters using GridSearchCV didn’t improve the performance of the model.

Even better metric values have been achieved for short-distance shots only. On the dataset ‘short_dist’, LightGBM achieved an F1-Score of .798, a Recall of .93, and an Accuracy of .689. Logistic regression achieved similar values; it even reached a very high Recall (1.0), preventing False Negatives, which was the desired behaviour of the model defined at the beginning. F1-Score (.796) and Precision (.661) were still plausible.

The achieved accuracy of ~.69 is acceptable but not great. The value always depends on the model and the data. If the data does not provide good predictive power, even the best model can’t achieve perfect results. The lesson here is that short-distance shots are much easier to predict than long-distance ones. And there are still factors that were not included in the data but can impact the prediction and the probability of scoring a goal after a shot. What about the behaviour of the defending players? What about the ice temperature? All of these can play an important role in the goal probability.

Besides the quantitative evaluation of the model, a qualitative evaluation using feature importance has been done. One of the goals of the project was to figure out what impacts the goal probability the most. I evaluated which features have been used by the LightGBM classifier and ranked them by importance for the full dataset of 20 predictive features.
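A short sketch of how such a ranking can be retrieved from a fitted LGBMClassifier (variable names like model and X_train follow the earlier snippets):

import pandas as pd

# feature_importances_ holds one importance value per training column
importance = pd.Series(model.feature_importances_, index=X_train.columns)
print(importance.sort_values(ascending=False))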

This ranking probably makes sense even for someone who is not very familiar with hockey. Distance and angle are always important factors. goalie_id represents the capability of the goalie to stop a shot. And how many goals a skater has scored in the past is a clear indicator of how good he is at scoring: the better the scorer, the higher the probability of a goal. Surprisingly, the salary of the skater does not really play an important role in the goal prediction.

All plots comparing the different classifiers across the used metrics can be found in 8_nhl_evaluation.

Conclusion

The main goal of the project was to predict the goal/no-goal outcome of a shot in hockey. I’ve chosen this topic because I’m a hockey fan and because I wanted to use the opportunity to work with a different dataset and to define my own problem instead of doing the standard project tasks.

I worked with different data sources, extracted and preprocessed the data, and created new features. I used the data to train three different binary classifiers: LogisticRegression, the LightGBM classifier, and the K-Neighbors classifier. I defined the metrics to be used for model evaluation and explained why. I implemented the training and evaluation process. I prepared several datasets with different sets of features, ran experiments, and evaluated the results. For the evaluation, I provided a table and plots comparing all classifiers on all metrics, and I explained the results.

I found out that the used data does have predictive power for the goal. The achieved results were acceptable. I haven’t achieved an accuracy close to 100%, but this is not always possible; data is very often the limiting factor in ML projects. I tried several tuning methods on the models, which increased the performance slightly. At the end of the day, goals in hockey can’t be predicted with 100% certainty. There are too many factors, not covered by the used data, which have an impact on the goal/no-goal outcome. It demonstrates the importance of data in our daily business.

I think some fine-tuning would still be possible, but after a few steps and experiments it really didn’t bring much. It is quite time- and resource-consuming, and in daily business we also need to consider the trade-off between endless, resource-consuming optimisation and the added value achieved by doing it. What could still be done is to add new features and to model new ones from the existing data sources.

An interesting aspect for me was to see the difference between short-distance and long-distance shot/goal prediction. This absolutely makes sense, but it was still kind of surprising to see it so clearly in the model behaviour.

At the end of the project, my conclusion is that goals in a hockey game are predictable only to some degree, which is what makes it such an amazing and exciting game.
