03 - Role Classification and Model Evaluation¶

Executive Summary¶

This analysis develops and evaluates supervised classification models to predict player roles (Lurker, Spacetaker, Anchor, Rotator, AWPer) from behavioural and positional statistics.

Key Findings:

  • Model Performance:
    • T-Side: Roles are linearly separable and highly distinct. Logistic Regression achieves an F1-Macro of ~0.92, significantly outperforming complex ensembles.
    • CT-Side: Roles require non-linear decision boundaries. Random Forest (F1 ~0.75) outperforms linear baselines, but performance is capped largely by the porous boundary between Rotators and AWPers.
  • Feature Selection: The Orthogonal feature set (which replaces ADAT with its residual, adat_residual) maintains predictive power and removes multicollinearity between positional features.
  • Role Ambiguity: Excluding ambiguous hybrid roles (Half-Lurker, Mixed) improves F1 scores by +0.25 (T) and +0.18 (CT), confirming that "Core" roles are distinct statistical archetypes.
  • Interpretability:
    • "IGL Confound": In-Game Leaders are frequently misclassified as AWPers or Spacetakers due to their tendency to play "central" positions for information gathering.
    • Diagnostic Value: High-confidence misclassifications successfully identified ground-truth labeling errors (e.g., Spinx as Anchor $\to$ Rotator).

Outcome: Identified champion models for both sides and a diagnostic framework for validating player roles.

Note: All analyses are performed separately for T-side and CT-side to preserve tactical context.

Data Note: The input dataset has been corrected, but for this classification analysis we deliberately inject the original mislabels (e.g., HooXi, Spinx) to demonstrate the model's diagnostic capabilities. We use these "controlled errors" to verify if the model can flag inconsistent labels.


Objectives¶

1. Experimental Design (Cross-Validation)
Establish a rigorous 4-split $\times$ 20-repeat Nested Cross-Validation strategy (80 total folds) to ensure stable performance estimates given the small sample size ($N=84$).

2. Baseline & Feature Ablation
Evaluate Logistic Regression baselines across four feature sets (Raw, Orthogonal, Residuals, Full) to manage multicollinearity and establish a performance floor.

3. Sensitivity Analysis
Quantify the impact of ambiguous roles (Half-Lurker, Mixed) on classification performance to determine if the model struggles with features or with fuzzy definitions.

4. Advanced Modelling
Benchmark non-linear models (SVM, Random Forest, XGBoost) against the linear baseline using nested cross-validation to identify the optimal classifier for each side.

5. Interpretation & Diagnosis
Use SHAP values, Waterfall plots, and Prediction Confidence analysis to explain model decisions and diagnose specific misclassification patterns (e.g., the IGL-Centrality phenomenon).

1. Setup and Experimental Design¶

TL;DR: Configure the analysis environment, load the processed dataset, define the cross-validation strategy, specify feature sets for ablation, and initialise the scaling pipeline.

Methodology Notes

Cross-Validation Strategy:

  • 4 splits × 20 repeats = 80 total folds provides robust performance estimates while maintaining reasonable computational cost
  • Stratified K-Fold ensures class balance is preserved in each fold, critical for small sample sizes
  • The same CV splits will be reused across all models to enable fair comparison

Feature Set Definitions:

  • Raw Features: All original behavioural (TAPD, OAP, PODT, POKT) and positional (ADNT, ADAT) metrics. Baseline with multicollinearity (r ≈ 0.92 between ADNT and ADAT on T-side, r ≈ 0.78 on CT-side).
  • Orthogonal Features: Uses ADNT + adat_residual (orthogonal pair, ρ ≈ 0.00). Replaces ADAT with its residual while keeping ADNT, eliminating multicollinearity between the positional features yet retaining both positioning dimensions (residualisation sketched after this list).
  • Residual Features: Uses only adat_residual (drops ADNT entirely). Tests whether the residual alone is sufficient for role discrimination.
  • Full Features: All behavioural features + all three positioning features (ADNT, ADAT, adat_residual). Tests whether tree-based models can leverage multicollinearity better than linear models.
  • Feature sets are side-specific: T-side feature sets use *_t suffix (e.g., tapd_t, adnt_t, adat_residual_t), CT-side feature sets use *_ct suffix (e.g., tapd_ct, adnt_ct, adat_residual_ct).
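For reference, a minimal sketch of how a residual such as adat_residual_t could be derived by regressing ADAT on ADNT (the actual feature engineering happens upstream of this notebook; column names match the dataset loaded below):

# Sketch: residualise ADAT against ADNT so the resulting pair is orthogonal.
# Assumes df holds the rank-scaled positional columns used in this notebook.
from sklearn.linear_model import LinearRegression

X_pos = df[['adnt_rank_t']].values   # distance to nearest teammate (predictor)
y_pos = df['adat_rank_t'].values     # distance to all teammates (response)

reg = LinearRegression().fit(X_pos, y_pos)
# OLS residuals are uncorrelated with the predictor by construction (rho ≈ 0.00)
df['adat_residual_t'] = y_pos - reg.predict(X_pos)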

Scaling Pipeline:

  • StandardScaler will be applied within cross-validation loops to prevent data leakage
  • Scaling is fit only on training folds and applied to validation folds
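In code, leakage-free scaling is the standard Pipeline pattern shown below (a sketch; the project helpers wrap the same idea, and X/y stand for the side-specific feature matrix and role labels built later):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

pipe = Pipeline([
    ('scaler', StandardScaler()),                 # refit on each training fold only
    ('clf', LogisticRegression(max_iter=1000)),
])
cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=20, random_state=42)

# cross_val_score refits the whole pipeline per fold, so the scaler
# never sees the validation samples (no leakage)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='f1_macro')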

Side-Specific Modelling:

  • All models will be run for both T-side and CT-side separately
  • T-side models use T-side features (*_t) and predict role_t
  • CT-side models use CT-side features (*_ct) and predict role_ct
  • This preserves tactical context and enables side-specific feature set selection and model comparison
In [1]:
# === Setup: paths, imports, theme ===

from pathlib import Path
import sys
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import joblib

# sklearn imports
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline


# Dev convenience
%load_ext autoreload
%autoreload 2

# Display options
pd.set_option("display.max_columns", 120)
pd.set_option("display.width", 120)

# Resolve repository root (if in notebooks/, step up one level)
REPO_ROOT = Path.cwd().resolve().parent if Path.cwd().name.lower() == "notebooks" else Path.cwd().resolve()

# Canonical paths
DATA_DIR    = REPO_ROOT / "data"
RESULTS_DIR = REPO_ROOT / "results" / "classification"
FIG_DIR     = RESULTS_DIR / "figures"
TAB_DIR     = RESULTS_DIR / "tables"
MODEL_DIR   = RESULTS_DIR / "models"

# Ensure results dirs exist
FIG_DIR.mkdir(parents=True, exist_ok=True)
MODEL_DIR.mkdir(parents=True, exist_ok=True)
TAB_DIR.mkdir(parents=True, exist_ok=True)

# Dataset path
DATA_PATH = DATA_DIR / "processed" / "cs2_playstyles_2024_with_residuals.parquet"
assert DATA_PATH.exists(), f"Dataset not found at {DATA_PATH}"

# Local helpers
SRC_DIR = REPO_ROOT / "src"
if str(SRC_DIR) not in sys.path:
    sys.path.insert(0, str(SRC_DIR))

from style import (
    set_mpl_theme,
    set_seaborn_theme,
    ROLE_COLOURS,
    get_role_colour,
)

# Import classification utilities
from classification_utils import (
    get_feature_sets,
    evaluate_baseline_models,
    summarise_feature_set_results,
    fit_and_visualise_logreg,
    evaluate_per_class_metrics,
    evaluate_model_cv,
    plot_confusion_matrices_comparison,
    plot_model_stability_boxplots,
    prepare_classification_data,
    run_sensitivity_analysis,
    run_model_tuning,
    compile_model_leaderboard,
    save_champion_model,
    select_and_save_champion_models,
    compare_rf_feature_importance,
    plot_prediction_confidence,
    plot_shap_beeswarm_grid,
    prepare_champion_data,
    plot_single_player_waterfall,
    plot_comparison_waterfall,
    get_player_percentiles,
    get_repeated_cv_predictions,
    prepare_interpretation_model,
    plot_igl_feature_distribution,
)

# Themes
set_mpl_theme(mode="dark", preferred_font="Georgia")
set_seaborn_theme(mode="dark", preferred_font="Georgia")

# Echo key paths
REPO_ROOT, DATA_PATH, FIG_DIR, TAB_DIR
Out[1]:
(WindowsPath('P:/cs2-playstyle-analysis-2024'),
 WindowsPath('P:/cs2-playstyle-analysis-2024/data/processed/cs2_playstyles_2024_with_residuals.parquet'),
 WindowsPath('P:/cs2-playstyle-analysis-2024/results/classification/figures'),
 WindowsPath('P:/cs2-playstyle-analysis-2024/results/classification/tables'))

Load Dataset¶

Load the processed dataset with engineered residual features and verify role distributions.

In [2]:
# Load processed dataset
df = pd.read_parquet(DATA_PATH)
# Filter to players with sufficient map volume for stable statistics
MIN_MAPS = 40
df = df[df['map_count'] >= MIN_MAPS].copy()
print(f"After filtering (MIN_MAPS={MIN_MAPS}): {len(df)} players")

# --- DELIBERATE ERROR INJECTION FOR DIAGNOSTIC DEMONSTRATION ---
# We deliberately inject known mislabels for HooXi and Spinx to demonstrate the model's
# diagnostic capability (identifying "confident errors" in the analysis later).
# In the EDA notebook, these are correct (Mixed and Rotator respectively), but we revert
# them here to the original "noisy" labels found during the project.
mask_hooxi = df['player_name'] == 'HooXi'
mask_spinx = df['player_name'] == 'Spinx'
df.loc[mask_hooxi, 'role_ct'] = 'Anchor'  # True role: Mixed
df.loc[mask_spinx, 'role_ct'] = 'Anchor'  # True role: Rotator
print("\n> [DIAGNOSTIC SETUP] Injected deliberate CT-role errors for HooXi (Mixed->Anchor) and Spinx (Rotator->Anchor).")
# ---------------------------------------------------------------

# Structural checks
print("Shape:", df.shape)
display(df.head(3))

# Verify expected columns exist (especially residual features)
expected_residuals = ['adat_residual_t', 'adat_residual_ct']
assert all(col in df.columns for col in expected_residuals), "Missing residual features"

# Quick overview of role distribution
print("\nRole distribution (T-side):")
display(df['role_t'].value_counts())
print("\nRole distribution (CT-side):")
display(df['role_ct'].value_counts())
After filtering (MIN_MAPS=40): 84 players

> [DIAGNOSTIC SETUP] Injected deliberate CT-role errors for HooXi (Mixed->Anchor) and Spinx (Rotator->Anchor).
Shape: (84, 27)
steamid player_name team_clan_name map_count tapd_ct tapd_t tapd_overall oap_ct oap_t oap_overall podt_ct podt_t podt_overall pokt_ct pokt_t pokt_overall adnt_rank_ct adnt_rank_t adnt_rank_overall adat_rank_ct adat_rank_t adat_rank_overall role_overall role_t role_ct adat_residual_t adat_residual_ct
0 76561198041683378 NiKo G2 Esports 158 60.952893 59.136540 60.136000 24.745965 24.093423 24.424242 21.020276 24.857741 22.507740 17.051295 21.586555 19.197995 0.562493 0.621199 0.593089 0.547525 0.695336 0.616046 Lurker Spacetaker Rotator 0.074668 -0.014643
1 76561198012872053 huNter G2 Esports 158 62.048685 62.871661 62.589800 16.852540 14.807692 15.847511 21.585198 27.994772 24.696747 17.195516 27.180894 22.538284 0.480859 0.643004 0.571875 0.406082 0.615021 0.455108 Flex Lurker Rotator -0.026997 -0.073699
2 76561198074762801 m0NESY G2 Esports 155 62.786553 66.632594 64.362519 23.914373 17.754078 20.873335 19.122381 23.094640 21.473108 17.397469 26.423178 21.056274 0.577785 0.423733 0.453028 0.515617 0.409889 0.427645 AWPer AWPer AWPer -0.017438 -0.061984
Role distribution (T-side):
role_t
Spacetaker     31
Lurker         24
AWPer          17
Half-Lurker    12
Name: count, dtype: int64
Role distribution (CT-side):
role_ct
Rotator    27
Anchor     23
AWPer      17
Mixed      17
Name: count, dtype: int64

Cross-Validation Strategy¶

TL;DR: Define a 4-split, 20-repeat Stratified K-Fold strategy (80 total folds) that will be reused across all models for fair comparison.

Why This Strategy?

The small sample size (N=84) requires a rigorous CV strategy to obtain reliable performance estimates. Using 4 splits provides reasonable training set sizes (~63 samples in each training fold, ~21 samples in each test fold) while 20 repeats ensures statistical stability through multiple independent evaluations. Stratified sampling preserves class balance across all folds.

In [3]:
# === Cross-Validation Strategy ===

# Parameters
N_SPLITS = 4
N_REPEATS = 20
RANDOM_STATE = 42

# Create CV strategy (will be used for all models)
# Note: Will need to specify which side (T or CT) and target column when fitting
cv_strategy = RepeatedStratifiedKFold(
    n_splits=N_SPLITS,
    n_repeats=N_REPEATS,
    random_state=RANDOM_STATE
)
print(f"Cross-validation strategy: {N_SPLITS} splits × {N_REPEATS} repeats = {N_SPLITS * N_REPEATS} total folds per side")
Cross-validation strategy: 4 splits × 20 repeats = 80 total folds per side

Feature Sets¶

TL;DR: Define four feature sets for ablation testing: Raw (baseline), Orthogonal (ADNT + adat_residual, no positional multicollinearity), Residuals (adat_residual only), and Full (all positioning features).

Feature Set Definitions

Raw Features: All original behavioural and positional metrics. Includes both ADNT and ADAT despite their high correlation (r ≈ 0.92 on T-side, r ≈ 0.78 on CT-side). Baseline with multicollinearity among positional features (ADNT/ADAT).

Orthogonal Features: Uses ADNT + adat_residual (orthogonal pair, ρ ≈ 0.00). Eliminates multicollinearity while retaining both positioning dimensions. Tests whether removing positional multicollinearity improves model stability.

Residual Features: Uses only adat_residual (drops ADNT entirely). Tests whether the residual alone is sufficient, or if ADNT provides additional signal.

Full Features: All behavioural + all three positioning features (ADNT, ADAT, adat_residual). Tests whether including both ADAT and adat_residual produces a meaningful performance improvement. Tree-based models may handle multicollinearity better than linear models.

Side-Specific Implementation: Feature sets are generated separately for T-side (using *_t suffix) and CT-side (using *_ct suffix). All four sets will be evaluated for both sides.

In [4]:
# === Feature Sets for Ablation Study ===

# Generate feature sets for both sides using modelling utility
FEATURE_SETS_T = get_feature_sets('t')
FEATURE_SETS_CT = get_feature_sets('ct')

# Store in a nested dictionary for easy access
FEATURE_SETS = {
    't': FEATURE_SETS_T,
    'ct': FEATURE_SETS_CT
}

# Define ambiguous roles 
EXCLUDED_ROLES = {'t': 'Half-Lurker', 'ct': 'Mixed'}

# Display feature sets for both sides
print("=" * 60)
print("T-SIDE FEATURE SETS")
print("=" * 60)
for name, features in FEATURE_SETS_T.items():
    print(f"\n{name} Features ({len(features)} features):")
    print(features)

print("\n" + "=" * 60)
print("CT-SIDE FEATURE SETS")
print("=" * 60)
for name, features in FEATURE_SETS_CT.items():
    print(f"\n{name} Features ({len(features)} features):")
    print(features)
============================================================
T-SIDE FEATURE SETS
============================================================

Raw Features (6 features):
['tapd_t', 'oap_t', 'podt_t', 'pokt_t', 'adnt_rank_t', 'adat_rank_t']

Orthogonal Features (6 features):
['tapd_t', 'oap_t', 'podt_t', 'pokt_t', 'adnt_rank_t', 'adat_residual_t']

Residuals Features (5 features):
['tapd_t', 'oap_t', 'podt_t', 'pokt_t', 'adat_residual_t']

Full Features (7 features):
['tapd_t', 'oap_t', 'podt_t', 'pokt_t', 'adnt_rank_t', 'adat_rank_t', 'adat_residual_t']

============================================================
CT-SIDE FEATURE SETS
============================================================

Raw Features (6 features):
['tapd_ct', 'oap_ct', 'podt_ct', 'pokt_ct', 'adnt_rank_ct', 'adat_rank_ct']

Orthogonal Features (6 features):
['tapd_ct', 'oap_ct', 'podt_ct', 'pokt_ct', 'adnt_rank_ct', 'adat_residual_ct']

Residuals Features (5 features):
['tapd_ct', 'oap_ct', 'podt_ct', 'pokt_ct', 'adat_residual_ct']

Full Features (7 features):
['tapd_ct', 'oap_ct', 'podt_ct', 'pokt_ct', 'adnt_rank_ct', 'adat_rank_ct', 'adat_residual_ct']

Setup Complete

  • Dataset loaded: 84 players with all expected residual features
  • CV strategy defined: 80 total folds (4 splits × 20 repeats)
  • Four feature sets prepared for ablation (Raw, Orthogonal, Residuals, Full) for both sides
  • Scaling pipeline ready (applied within CV loops to prevent data leakage)

Ready for baseline modelling.

2. Baseline Models & Feature Selection¶

TL;DR: Establish baselines using Dummy and Logistic Regression across all feature sets. Select Orthogonal feature set for parsimony. Diagnose per-class performance, then exclude ambiguous roles to test core role separability. Visualise feature–role associations from the filtered dataset.

Methodology Details

Baseline Evaluation:

  • Models: Dummy Classifier (stratified baseline) vs. Logistic Regression (linear baseline)
  • Feature Sets: Four sets evaluated to handle multicollinearity between ADNT and ADAT:
    • Raw: All features (high multicollinearity)
    • Orthogonal: ADNT + adat_residual (uncorrelated positioning dimensions)
    • Residuals: adat_residual only (tests if ADNT is redundant)
    • Full: All features (tests if models handle multicollinearity well)
  • Metric: F1-Macro (mean of per-class F1 scores) to handle class imbalance

Sensitivity Analysis:

  • Per-class metrics reveal that ambiguous roles (Half-Lurker on T-side, Mixed on CT-side) have poor performance, suggesting they blur boundaries between core playstyles
  • We re-evaluate after excluding these roles to test whether the model struggles due to weak features or simply because these roles are inherently fuzzy
  • Coefficient Visualisation: We visualise feature weights from the filtered dataset to see the clearest signal of what defines each core role

Baseline Models & Feature Selection¶

Run baseline models across all feature sets for both sides, then select the optimal feature set.

In [5]:
# === Baseline Models: Run & Display Results ===

baseline_results = {}

for side in ['t', 'ct']:
    print(f"Running baselines for {side.upper()}-side...")
    baseline_res = evaluate_baseline_models(
        df=df,
        side=side,
        feature_sets_dict=FEATURE_SETS[side],
        cv_strategy=cv_strategy
    )
    baseline_results[side] = baseline_res
    
    # Save raw results
    baseline_res[['model', 'feature_set', 'mean_f1', 'std_f1']].to_csv(
        TAB_DIR / f"baseline_results_{side}.csv", index=False
    )

# Display full comparison tables for both sides
print("\n" + "=" * 60)
print("FEATURE SET COMPARISON (Logistic Regression F1-Macro)")
print("=" * 60)

for side in ['t', 'ct']:
    comparison = summarise_feature_set_results(baseline_results[side], side=side)
    print(f"\n{side.upper()}-Side:")
    display(comparison)

# Combined comparison for saving
comparison_t = summarise_feature_set_results(baseline_results['t'], side='t')
comparison_ct = summarise_feature_set_results(baseline_results['ct'], side='ct')
comparison_df = pd.concat([comparison_t, comparison_ct], ignore_index=True)
comparison_df.to_csv(TAB_DIR / "feature_set_comparison.csv", index=False)
print(f"\nComparison table saved to: {TAB_DIR / 'feature_set_comparison.csv'}")
Running baselines for T-side...
Running baselines for CT-side...

============================================================
FEATURE SET COMPARISON (Logistic Regression F1-Macro)
============================================================

T-Side:
Side Model Feature_Set Mean_F1_Macro Std_F1_Macro Mean_Accuracy Std_Accuracy
0 T-Side LogisticRegression Full 0.693830 0.088836 0.770833 0.068633
1 T-Side LogisticRegression Raw 0.685843 0.085449 0.767857 0.069160
2 T-Side LogisticRegression Orthogonal 0.674325 0.074047 0.763690 0.063640
3 T-Side LogisticRegression Residuals 0.466390 0.085475 0.579762 0.083427
4 T-Side Dummy Orthogonal 0.253056 0.110655 0.305357 0.110322
5 T-Side Dummy Raw 0.253056 0.110655 0.305357 0.110322
6 T-Side Dummy Residuals 0.253056 0.110655 0.305357 0.110322
7 T-Side Dummy Full 0.253056 0.110655 0.305357 0.110322
CT-Side:
Side Model Feature_Set Mean_F1_Macro Std_F1_Macro Mean_Accuracy Std_Accuracy
0 CT-Side LogisticRegression Raw 0.508231 0.082315 0.564881 0.075196
1 CT-Side LogisticRegression Full 0.501163 0.085292 0.556548 0.084714
2 CT-Side LogisticRegression Orthogonal 0.495041 0.078242 0.554167 0.079769
3 CT-Side LogisticRegression Residuals 0.400822 0.077293 0.454167 0.082115
4 CT-Side Dummy Orthogonal 0.255153 0.090947 0.283333 0.092490
5 CT-Side Dummy Raw 0.255153 0.090947 0.283333 0.092490
6 CT-Side Dummy Residuals 0.255153 0.090947 0.283333 0.092490
7 CT-Side Dummy Full 0.255153 0.090947 0.283333 0.092490
Comparison table saved to: P:\cs2-playstyle-analysis-2024\results\classification\tables\feature_set_comparison.csv

Feature Set Selection: Baseline Analysis¶

Decision: Use Orthogonal for interpretation (coefficient analysis, sensitivity) but carry Raw, Orthogonal, and Full forward to the model comparison phase.

Detailed Analysis

1. Baseline Comparison
Logistic Regression substantially outperforms the stratified dummy classifier across all feature sets:

  • T-Side: LogReg (0.674 F1) vs. Dummy (0.253 F1) — 166% improvement
  • CT-Side: LogReg (0.495 F1) vs. Dummy (0.255 F1) — 94% improvement

These effect sizes (Cohen's d ≈ 5.7 for T-side, ≈ 3.1 for CT-side) confirm the features contain strong discriminative signal.
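For transparency, a minimal sketch of one common (pooled-SD) formulation of Cohen's d over fold scores; the exact values reported above depend on the variant and the score arrays used:

import numpy as np

def cohens_d(scores_a, scores_b):
    """Effect size between two arrays of CV fold scores (pooled-SD variant)."""
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2.0)
    return (a.mean() - b.mean()) / pooled_sd

# e.g. cohens_d(logreg_fold_scores, dummy_fold_scores) over the 80 folds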

2. Feature Set Performance
Performance differences between Orthogonal, Raw, and Full are almost negligible:

  • T-Side: Full (0.694) > Raw (0.686) > Orthogonal (0.674) — differences within 0.02 F1
  • CT-Side: Raw (0.508) > Full (0.501) > Orthogonal (0.495) — differences within 0.013 F1

All three configurations show overlapping standard deviations, indicating no statistically meaningful advantage for Full or Raw.

3. Why Multiple Feature Sets for Model Comparison?
Although Orthogonal is preferred for interpretation, we retain all three sets for the advanced model phase because:

  • Multicollinearity affects LogReg more than trees/SVMs: The coefficient instability caused by correlated features (ADNT ↔ ADAT, r ≈ 0.92) is a LogReg-specific issue. Tree-based models (XGBoost, RF) and SVMs are robust to multicollinearity and may extract additional signal from Raw/Full.
  • Differences are within noise: With N=84, a 0.02 F1 difference corresponds to ~1–2 players. The sets are effectively tied, so committing to one risks leaving performance on the table.
  • Minimal cost: Testing three feature sets adds negligible compute overhead and ensures we identify the optimal configuration for each model class.

4. Interpretation vs. Prediction

  • Orthogonal will be used for the following sensitivity analysis and coefficient visualisation (Section 2) to ensure stable, interpretable feature–role associations.
  • All three sets will be benchmarked in the model comparison (Section 3) to maximise understanding of predictive performance.

5. CT-Side Signal Limitations
CT-side performance remains modest (F1 ≈ 0.50) across all feature sets, suggesting the CT role taxonomy is harder to discriminate with our playstyle metrics alone. This likely reflects the situational nature of CT roles (e.g., map-specific assignments, dynamic rotations). Predictions should be treated as exploratory pending richer CT-specific features.

In [6]:
# === Feature Set Selection ===
feature_set_selections = {'t': 'Orthogonal', 'ct': 'Orthogonal'}
BEST_FEATURE_SET_T = feature_set_selections['t']
BEST_FEATURE_SET_CT = feature_set_selections['ct']
print(f"Selected: {feature_set_selections}")
Selected: {'t': 'Orthogonal', 'ct': 'Orthogonal'}

Per-Class Performance Diagnosis¶

Evaluate precision, recall, and F1 per role to identify which roles are harder/easier to predict.

Why Per-Class Metrics?

F1-Macro provides an overall performance summary but masks per-class differences. Per-class metrics reveal whether the model struggles with specific roles (e.g., minority classes or ambiguous definitions) and help identify precision–recall trade-offs.
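evaluate_per_class_metrics is a project helper that averages these metrics over the repeated folds; the underlying idea, sketched with plain scikit-learn over a single CV pass:

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import precision_recall_fscore_support
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

def per_class_metrics_single_pass(X, y, n_splits=4, seed=42):
    """Out-of-fold predictions scored per role (one CV pass, no repeats)."""
    pipe = Pipeline([('scaler', StandardScaler()),
                     ('clf', LogisticRegression(max_iter=1000, random_state=seed))])
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    y_pred = cross_val_predict(pipe, X, y, cv=cv)
    labels = np.unique(y)
    p, r, f1, _ = precision_recall_fscore_support(y, y_pred, labels=labels, zero_division=0)
    return pd.DataFrame({'role': labels, 'precision': p, 'recall': r, 'f1': f1})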

In [7]:
# === Per-Class Metrics: Full Dataset (T & CT) ===

per_class_dfs = {}
full_f1_scores = {}

for side in ['t', 'ct']:
    target_col = f'role_{side}'
    feats = FEATURE_SETS[side][feature_set_selections[side]]
    
    # Prepare Data
    df_clean = df.dropna(subset=[target_col]).copy()
    X = df_clean[feats].values
    y = df_clean[target_col].values
    
    # Evaluate
    clf = LogisticRegression(max_iter=1000, random_state=42)
    metrics = evaluate_per_class_metrics(clf, X, y, cv_strategy)
    
    # Store F1-Macro for later comparison
    cv_res = evaluate_model_cv(clf, X, y, cv_strategy)
    full_f1_scores[side] = cv_res['mean_score']
    
    # Store & Save
    per_class_dfs[side] = metrics
    metrics.to_csv(TAB_DIR / f"per_class_metrics_{side}.csv", index=False)
    
    print(f"\n{side.upper()}-Side Per-Class Metrics (Full Dataset):")
    # Show worst performing roles first to highlight ambiguous roles
    display(metrics.sort_values('f1_mean', ascending=True))
    print(f"Full table saved to: {TAB_DIR / f'per_class_metrics_{side}.csv'}")
T-Side Per-Class Metrics (Full Dataset):
role precision_mean precision_std recall_mean recall_std f1_mean f1_std
1 Half-Lurker 0.253542 0.335552 0.183333 0.240947 0.196806 0.240360
3 Spacetaker 0.763843 0.107101 0.843527 0.137287 0.793073 0.094453
2 Lurker 0.795965 0.141438 0.883333 0.127475 0.824978 0.094273
0 AWPer 0.930179 0.114774 0.861250 0.173561 0.882445 0.124662
Full table saved to: P:\cs2-playstyle-analysis-2024\results\classification\tables\per_class_metrics_t.csv

CT-Side Per-Class Metrics (Full Dataset):
role precision_mean precision_std recall_mean recall_std f1_mean f1_std
2 Mixed 0.186280 0.265918 0.106875 0.135264 0.126678 0.155462
0 AWPer 0.637991 0.214509 0.597500 0.245701 0.588312 0.190074
3 Rotator 0.604961 0.136919 0.689286 0.153239 0.631406 0.112602
1 Anchor 0.604894 0.142322 0.697083 0.182361 0.633769 0.122807
Full table saved to: P:\cs2-playstyle-analysis-2024\results\classification\tables\per_class_metrics_ct.csv

Key Finding

Ambiguous roles (Half-Lurker on T-side, Mixed on CT-side) show particularly poor F1 scores, suggesting they blur the boundaries between core playstyles rather than forming distinct archetypes. This justifies filtering them out to test whether the model can better separate clearly defined core roles.

Sensitivity Analysis: Core Role Separability¶

Exclude ambiguous roles to test whether the model struggles due to weak features or simply because these roles are inherently fuzzy. If performance improves substantially, it confirms these roles introduce systematic confusion rather than revealing feature limitations.
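run_sensitivity_analysis is a project helper; its core comparison can be sketched as follows (an assumed shape, not the exact implementation):

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

def f1_with_and_without(df, side, excluded_role, features, cv):
    """Mean F1-Macro on the full cohort vs the cohort minus one ambiguous role."""
    pipe = Pipeline([('scaler', StandardScaler()),
                     ('clf', LogisticRegression(max_iter=1000, random_state=42))])
    target = f'role_{side}'
    full = df.dropna(subset=[target])
    filt = full[full[target] != excluded_role]
    f1_full = cross_val_score(pipe, full[features], full[target], cv=cv, scoring='f1_macro').mean()
    f1_filt = cross_val_score(pipe, filt[features], filt[target], cv=cv, scoring='f1_macro').mean()
    return f1_full, f1_filt, f1_filt - f1_full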

In [8]:
# === Sensitivity Analysis: Run & Display F1 Comparison ===

# Run sensitivity analysis using the helper function
sensitivity_results = {}
for side in ['t', 'ct']:
    sensitivity_results[side] = run_sensitivity_analysis(
        df=df,
        side=side,
        excluded_role=EXCLUDED_ROLES[side],
        feature_names=FEATURE_SETS[side][feature_set_selections[side]],
        cv_strategy=cv_strategy,
        tab_dir=TAB_DIR
    )

# Build and display F1 comparison table
comparison_rows = []
for side in ['t', 'ct']:
    res = sensitivity_results[side]
    comparison_rows.append({
        'Side': f"{side.upper()}-Side",
        'Excluded_Role': EXCLUDED_ROLES[side],
        'N_Full': res['n_full'],
        'N_Filtered': res['n_filtered'],
        'F1_Full': res['full_f1'],
        'F1_Filtered': res['filtered_f1'],
        'Delta': res['delta']
    })

print("=== Sensitivity Analysis: Impact of Excluding Ambiguous Roles ===\n")
impact_df = pd.DataFrame(comparison_rows)
display(impact_df)

# === Per-Class Metrics: Filtered Dataset ===

print("=== Per-Class Metrics (Core Roles Only) ===")
for side in ['t', 'ct']:
    print(f"\n{side.upper()}-Side (excluding {EXCLUDED_ROLES[side]}):")
    display(sensitivity_results[side]['per_class_filtered'].sort_values('f1_mean', ascending=False))
=== Sensitivity Analysis: Impact of Excluding Ambiguous Roles ===

Side Excluded_Role N_Full N_Filtered F1_Full F1_Filtered Delta
0 T-Side Half-Lurker 84 72 0.674325 0.921120 0.246794
1 CT-Side Mixed 84 67 0.495041 0.670192 0.175150
=== Per-Class Metrics (Core Roles Only) ===

T-Side (excluding Half-Lurker):
role precision_mean precision_std recall_mean recall_std f1_mean f1_std
1 Lurker 0.908631 0.097040 0.950000 0.080795 0.923538 0.061559
0 AWPer 0.962292 0.082653 0.899375 0.139305 0.922455 0.094451
2 Spacetaker 0.932222 0.075045 0.912723 0.099624 0.917367 0.064466
CT-Side (excluding Mixed):
role precision_mean precision_std recall_mean recall_std f1_mean f1_std
1 Anchor 0.807009 0.124657 0.760833 0.144316 0.769876 0.092408
2 Rotator 0.651553 0.158925 0.687202 0.186215 0.654556 0.137992
0 AWPer 0.655898 0.230986 0.575000 0.227211 0.586142 0.190224

Result

Excluding ambiguous roles improves F1-Macro substantially (Δ = +0.25 for T-side, +0.18 for CT-side), confirming these roles introduce systematic confusion. The selected features can effectively distinguish clearly defined core roles. Per-class performance generally improves, reaching up to ~0.92 F1 for T-side AWPers and Lurkers.

In [9]:
# === T-Side Confusion Matrices: Full vs Filtered ===

side = 't'
feats = FEATURE_SETS[side][feature_set_selections[side]]
clf = LogisticRegression(max_iter=1000, random_state=42)

plot_confusion_matrices_comparison(
    df=df,
    side=side,
    feature_names=feats,
    excluded_role=EXCLUDED_ROLES[side],
    model=clf,
    cv_strategy=cv_strategy,
    fig_dir=FIG_DIR
)
plt.show()
[Figure: T-side confusion matrices, full dataset vs core roles (Half-Lurker excluded)]
In [10]:
# === CT-Side Confusion Matrices: Full vs Filtered ===

side = 'ct'
feats = FEATURE_SETS[side][feature_set_selections[side]]
clf = LogisticRegression(max_iter=1000, random_state=42)

plot_confusion_matrices_comparison(
    df=df,
    side=side,
    feature_names=feats,
    excluded_role=EXCLUDED_ROLES[side],
    model=clf,
    cv_strategy=cv_strategy,
    fig_dir=FIG_DIR
)
plt.show()
[Figure: CT-side confusion matrices, full dataset vs core roles (Mixed excluded)]

Observations

  1. CT-side AWPers are the least distinct core role, with ~39% misclassified as Rotators even after excluding Mixed. This confusion is intrinsic—removing the ambiguous class barely changes AWPer recall (59%→57%). CT AWPers likely share rotational positioning behaviour that our features cannot separate.

  2. Ambiguous roles skew toward specific archetypes: Half-Lurkers are predicted as Spacetakers (49%) more often than Lurkers (30%), indicating more aggressive tendencies than their name implies. Mixed players lean toward Anchor (48%) over Rotator (29%).

  3. T-side benefits more from filtering: Lurker recall jumps from 88% to 95% when Half-Lurker is removed, indicating Half-Lurkers were "absorbing" correct Lurker predictions. CT-side gains are more modest.

  4. Confusion asymmetry reveals role distinctiveness: Rotators are rarely misclassified as AWPers (13%), but AWPers are often misclassified as Rotators (32%). This suggests Rotators have a more unique behavioural signature while some AWPers exhibit rotator-like flexibility.

Model Interpretation: Feature–Role Associations¶

Visualise Logistic Regression coefficients from the filtered dataset (core roles only) to reveal the clearest signal of what defines each archetype.

In [11]:
# === Coefficient Visualisation: Core Roles (Filtered Dataset) ===

# Generate coefficient plots for filtered dataset (core roles only)
for side in ['t', 'ct']:
    excluded_role = EXCLUDED_ROLES[side]
    df_filtered = df[df[f'role_{side}'] != excluded_role].copy()
    
    print(f"\nGenerating {side.upper()}-side coefficients (excluding {excluded_role})...")
    
    _, _, fig = fit_and_visualise_logreg(
        df=df_filtered,
        side=side,
        feature_set_name=feature_set_selections[side],
        feature_names=FEATURE_SETS[side][feature_set_selections[side]],
        fig_dir=FIG_DIR,
        tab_dir=TAB_DIR,
        suffix='_filtered'
    )
    plt.show()
    
    print(f"  Coefficients saved to: {TAB_DIR / f'logreg_coefficients_{side}_filtered.csv'}")
Generating T-side coefficients (excluding Half-Lurker)...
[Figure: T-side logistic regression coefficients per role (Orthogonal features, core roles)]
  Coefficients saved to: P:\cs2-playstyle-analysis-2024\results\classification\tables\logreg_coefficients_t_filtered.csv

Generating CT-side coefficients (excluding Mixed)...
[Figure: CT-side logistic regression coefficients per role (Orthogonal features, core roles)]
  Coefficients saved to: P:\cs2-playstyle-analysis-2024\results\classification\tables\logreg_coefficients_ct_filtered.csv

Key Findings

Coefficients reveal clear behavioural signatures:

  • Spacetakers (T): High OAP (opening attempts) and low POKT (trade kills)
  • Lurkers (T): High ADNT (isolation) and high POKT (trade kills)
  • AWPers (T): High TAPD (time alive) and low ADNT (positioned close to teammates)
  • Anchors (CT): High ADNT and high ADAT residual (positioned far from teammates)
  • Rotators (CT): Low ADNT (close to teammates) and low TAPD (time alive)
  • AWPers (CT): High TAPD (time alive) and relatively high OAP (driven by "pick" kills)

These associations generally align with domain knowledge and findings from the EDA, validating the selected features.

Summary & Interpretations¶

Key Takeaways:

  • Feature Selection: No clear winner among Raw, Orthogonal, and Full feature sets (differences < 0.02 F1). Orthogonal used for interpretation due to coefficient stability; all three carried forward for model comparison.
  • Sensitivity Analysis: Excluding ambiguous roles improves F1 by +0.25 (T) and +0.18 (CT), confirming core roles are relatively well separated. We continue excluding ambiguous roles.
  • Feature Importance: Coefficients reveal clear behavioural signatures (e.g. Lurkers = isolation, Spacetakers = opening attempts)
  • Data Constraints: With only 84 players (~4–5 per role per CV fold), results are exploratory and should be validated with larger cohorts

Detailed Interpretations

Baseline Performance & Feature Selection:

  • Logistic Regression substantially outperforms Dummy Classifier, indicating features contain meaningful signal
  • Performance differences between Raw, Orthogonal, and Full are negligible (within 0.02 F1, ~1 player difference at N=84)
  • Orthogonal used for interpretation (sensitivity analysis, coefficients) because multicollinearity destabilises LogReg coefficients
  • All three feature sets will be tested with advanced models, since tree-based methods (RF, XGBoost) and SVMs are robust to multicollinearity
  • CT-side results remain modest (F1 ≈ 0.50). Treat predictions as exploratory pending richer features

Core vs. Ambiguous Roles:

  • Per-class diagnosis revealed that ambiguous roles (Half-Lurker on T-side, Mixed on CT-side) are very hard to classify and blur boundaries between core playstyles
  • Sensitivity analysis confirms these roles introduce systematic confusion: excluding them improves F1-Macro, validating the model can effectively distinguish clearly-defined core roles
  • The performance improvement (Δ) quantifies the impact and confirms these are mixed states rather than distinct archetypes

Feature–Role Associations:

  • Coefficient visualisation reflects clear behavioural signatures, analogous to findings in EDA (01_eda.ipynb): e.g.
    • Spacetakers (T): High oap (entry attempts)
    • Lurkers (T): High adnt (isolation)
    • Anchors vs. Rotators (CT): Distinguished by positioning metrics (packing density and mobility)
  • These associations align with domain knowledge, validating the features and confirming the model captured genuine playstyle differences.

Next Steps: The logistic regression baseline is strong (~0.69 F1 on the full T-side dataset, rising to ~0.92 on core roles), but we proceed to Section 3 to test whether non-linear models (SVM, Random Forest, XGBoost) can improve performance. All three feature sets (Raw, Orthogonal, Full) will be benchmarked, since multicollinearity, which affects LogReg coefficient interpretation, is not a concern for tree-based models.

3. Advanced Model Comparison¶

TL;DR: Evaluate SVM, Random Forest, and XGBoost using Nested Cross-Validation. Compare performance across feature sets (Raw, Orthogonal, Full) to identify the champion model for further analysis.

Methodology: Nested Cross-Validation

Why Nested CV?
With only 84 players, we cannot afford a separate holdout set for hyperparameter tuning. If we tuned hyperparameters on the same data used for evaluation, we would overfit to noise. Nested CV solves this:

  • Outer Loop (Evaluation): Our existing 4-split × 20-repeat strategy (80 folds). This measures generalisation performance.
  • Inner Loop (Tuning): Inside each outer training fold (~63 players), we run a 3-fold GridSearchCV to select the best hyperparameters.

The Flow:

  1. Outer loop splits data into Train (63) and Test (21).
  2. Inner loop splits Train into Inner-Train (42) and Inner-Val (21), tests hyperparameter combinations.
  3. Best hyperparameters are selected, model is refit on full Train (63).
  4. Performance is evaluated on Test (21).
  5. Repeat 80 times; the averaged scores give an unbiased performance estimate (see the sketch below).
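In scikit-learn this flow is compact: a GridSearchCV estimator forms the inner loop and cross_val_score forms the outer loop. The project helper run_model_tuning wraps this pattern; a sketch with part of the SVM grid (parameters prefixed with clf__ because of the Pipeline wrapper; X, y are the core-role features and labels for one side):

from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

outer_cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=20, random_state=42)
pipe = Pipeline([('scaler', StandardScaler()), ('clf', SVC(class_weight='balanced'))])
param_grid = {'clf__C': [0.1, 1, 10, 100], 'clf__kernel': ['linear', 'rbf']}

# Inner loop: a 3-fold grid search runs inside every outer training fold
inner = GridSearchCV(pipe, param_grid, cv=3, scoring='f1_macro')

# Outer loop: 80 test-fold scores, untouched by the tuning
nested_scores = cross_val_score(inner, X, y, cv=outer_cv, scoring='f1_macro')
print(nested_scores.mean(), nested_scores.std())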

Hyperparameter Strategy:
We search broad, sensible ranges (orders of magnitude) rather than fine-grained values. With small data, finding a "stable region" matters more than pinpointing an exact optimum.

Feature Sets:
We test Raw, Orthogonal, and Full for each model. Tree-based models (RF, XGBoost) and SVMs are robust to multicollinearity, so they may extract additional signal from correlated features.

Note on Reported Hyperparameters:
Performance metrics (F1-Macro, standard deviation) are derived from the 80-fold Nested CV to ensure unbiased evaluation. The "Best Parameters" reported for each configuration (model + feature set + side) are derived from a final grid search run on the full filtered dataset for that side. This is standard practice: Nested CV validates the search strategy and provides unbiased performance estimates, while the final fit on all available data identifies the optimal hyperparameters for each configuration.
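A sketch of that final selection step, reusing pipe and param_grid from the sketch above (cv=3 mirroring the inner loop is an assumption; X_core/y_core are the filtered core-role data for one side):

from sklearn.model_selection import GridSearchCV

final_search = GridSearchCV(pipe, param_grid, cv=3, scoring='f1_macro')
final_search.fit(X_core, y_core)   # full filtered dataset for this side
print(final_search.best_params_)   # the "Best Parameters" reported per configuration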

In [12]:
# === Section 3 Setup ===

# Feature sets to evaluate (excluding 'Residuals' as per plan)
FEATURE_SETS_TO_TEST = ['Raw', 'Orthogonal', 'Full']

print("Section 3 setup complete. Ready for model evaluation.")
Section 3 setup complete. Ready for model evaluation.

3.1 Support Vector Machine (SVM)¶

Why SVM? SVMs excel with small datasets and high-dimensional feature spaces. The kernel trick allows them to find non-linear decision boundaries without explicit feature engineering. With balanced class weights, they handle imbalanced classes gracefully.

Hyperparameter Grid:

  • C (Regularisation): [0.1, 1, 10, 100] — Controls the trade-off between margin width and misclassification. Lower C = smoother boundary.
  • kernel: ['linear', 'rbf'] — Linear for simple boundaries, RBF for complex non-linear patterns.
  • gamma: ['scale', 'auto'] — RBF kernel spread. 'scale' is generally preferred.
  • class_weight: ['balanced'] — Adjusts weights inversely proportional to class frequencies.
  • probability: [True] — Enables probability estimates (needed for later calibration/SHAP).
In [13]:
# === SVM: Define Grid & Run ===

svm_param_grid = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto'],
    'class_weight': ['balanced'],
    'probability': [True]
}

print("Running SVM with Nested CV...")
print(f"Grid size: {np.prod([len(v) for v in svm_param_grid.values()])} combinations per inner CV")

svm_results = run_model_tuning(
    model_class=SVC,
    param_grid=svm_param_grid,
    model_name='SVM',
    df=df,
    feature_sets=FEATURE_SETS,
    feature_set_names=FEATURE_SETS_TO_TEST,
    cv_strategy=cv_strategy,
    excluded_roles=EXCLUDED_ROLES
)

print("\nSVM Results:")
display(svm_results.round(3))

# Save results
svm_results.to_csv(TAB_DIR / "model_results_svm.csv", index=False)
Running SVM with Nested CV...
Grid size: 16 combinations per inner CV

SVM Results:
Side Model Feature_Set Mean_F1 Std_F1 Mean_Accuracy Std_Accuracy Mean_Train_F1 Std_Train_F1 Mean_Fit_Time Best_Params All_Scores
0 T-Side SVM Raw 0.874 0.066 0.872 0.065 0.961 0.025 0.144 {'C': 1, 'class_weight': 'balanced', 'gamma': ... [1.0, 0.8766884531590414, 0.8384208384208384, ...
1 T-Side SVM Orthogonal 0.913 0.053 0.910 0.054 0.962 0.017 0.060 {'C': 0.1, 'class_weight': 'balanced', 'gamma'... [0.9407407407407407, 0.9027777777777778, 0.889...
2 T-Side SVM Full 0.907 0.056 0.903 0.057 0.965 0.020 0.063 {'C': 0.1, 'class_weight': 'balanced', 'gamma'... [1.0, 0.8303872053872053, 0.8384208384208384, ...
3 CT-Side SVM Raw 0.704 0.098 0.711 0.095 0.867 0.060 0.068 {'C': 1, 'class_weight': 'balanced', 'gamma': ... [0.5285714285714286, 0.810966810966811, 0.7146...
4 CT-Side SVM Orthogonal 0.693 0.093 0.699 0.092 0.868 0.062 0.070 {'C': 1, 'class_weight': 'balanced', 'gamma': ... [0.6813186813186812, 0.7523809523809524, 0.714...
5 CT-Side SVM Full 0.708 0.091 0.715 0.089 0.875 0.058 0.069 {'C': 0.1, 'class_weight': 'balanced', 'gamma'... [0.6794871794871794, 0.810966810966811, 0.7146...

3.2 Random Forest¶

Why Random Forest? Ensemble of decision trees that reduces overfitting through bagging and feature randomisation. Robust to outliers and handles non-linear relationships well. Provides built-in feature importance estimates.

Hyperparameter Grid (Regularised for Small Data):

  • n_estimators: [100] — Fixed at 100 trees (sufficient for N=84).
  • max_depth: [2, 3, 5] — Shallow trees to prevent memorising the ~63 training samples per fold.
  • min_samples_leaf: [5, 8, 10] — Higher values force generalisation; with ~16 samples per class, leaves must represent broader patterns.
  • max_features: ['sqrt'] — Standard choice for classification.
  • class_weight: ['balanced'] — Adjusts for class imbalance.
In [14]:
# === Random Forest: Define Grid & Run (Regularised) ===

rf_param_grid = {
    'n_estimators': [100],
    'max_depth': [2, 3, 5],              # Shallower to prevent overfitting
    'min_samples_leaf': [5, 8, 10],   # Higher values force generalisation
    'max_features': ['sqrt'],
    'class_weight': ['balanced']
}

print("Running Random Forest with Nested CV (Regularised Grid)...")
print(f"Grid size: {np.prod([len(v) for v in rf_param_grid.values()])} combinations per inner CV")

rf_results = run_model_tuning(
    model_class=RandomForestClassifier,
    param_grid=rf_param_grid,
    model_name='RandomForest',
    df=df,
    feature_sets=FEATURE_SETS,
    feature_set_names=FEATURE_SETS_TO_TEST,
    cv_strategy=cv_strategy,
    excluded_roles=EXCLUDED_ROLES
)

print("\nRandom Forest Results:")
display(rf_results.round(3))

# Save results
rf_results.to_csv(TAB_DIR / "model_results_rf.csv", index=False)
Running Random Forest with Nested CV (Regularised Grid)...
Grid size: 9 combinations per inner CV

Random Forest Results:
Side Model Feature_Set Mean_F1 Std_F1 Mean_Accuracy Std_Accuracy Mean_Train_F1 Std_Train_F1 Mean_Fit_Time Best_Params All_Scores
0 T-Side RandomForest Raw 0.822 0.091 0.822 0.088 0.954 0.027 0.573 {'class_weight': 'balanced', 'max_depth': 3, '... [0.8857142857142857, 0.9500891265597149, 0.773...
1 T-Side RandomForest Orthogonal 0.813 0.103 0.813 0.101 0.957 0.035 0.547 {'class_weight': 'balanced', 'max_depth': 2, '... [0.8363636363636363, 0.8962962962962964, 0.824...
2 T-Side RandomForest Full 0.828 0.084 0.825 0.083 0.955 0.029 0.508 {'class_weight': 'balanced', 'max_depth': 2, '... [0.9407407407407407, 0.9500891265597149, 0.773...
3 CT-Side RandomForest Raw 0.756 0.084 0.758 0.080 0.870 0.029 0.523 {'class_weight': 'balanced', 'max_depth': 2, '... [0.7579365079365079, 0.6892736892736893, 0.776...
4 CT-Side RandomForest Orthogonal 0.751 0.090 0.755 0.087 0.861 0.034 0.523 {'class_weight': 'balanced', 'max_depth': 2, '... [0.6895104895104894, 0.7594405594405594, 0.714...
5 CT-Side RandomForest Full 0.764 0.084 0.767 0.081 0.863 0.033 0.517 {'class_weight': 'balanced', 'max_depth': 2, '... [0.6897546897546897, 0.8773892773892774, 0.776...

3.3 XGBoost¶

Why XGBoost? Gradient boosting builds trees sequentially, each correcting the errors of the previous. Often achieves state-of-the-art performance on tabular data. However, it is prone to overfitting on small datasets, so we use conservative hyperparameters with explicit regularisation.

Hyperparameter Grid (Regularised):

  • n_estimators: [100] — Number of boosting rounds. Kept modest to prevent overfitting.
  • learning_rate: [0.01, 0.1] — Step size shrinkage. Lower values require more trees but generalise better.
  • max_depth: [2, 3, 5] — Shallow trees to limit model complexity with small data.
  • min_child_weight: [5, 10] — Minimum sum of instance weight in a child node. Higher values prevent overly specific splits that fit individual samples. Critical regulariser for small datasets.
  • subsample: [0.8] — Fraction of samples used per tree. Introduces stochasticity for regularisation.
In [15]:
# === XGBoost: Define Grid & Run ===

xgb_param_grid = {
    'n_estimators': [100],
    'learning_rate': [0.01, 0.1],
    'max_depth': [2, 3, 5],              # Shallower to prevent overfitting
    'min_child_weight': [5, 10],      # Key regulariser for small data
    'subsample': [0.8],
    'eval_metric': ['mlogloss']
}

print("Running XGBoost with Nested CV...")
print(f"Grid size: {np.prod([len(v) for v in xgb_param_grid.values()])} combinations per inner CV")

xgb_results = run_model_tuning(
    model_class=XGBClassifier,
    param_grid=xgb_param_grid,
    model_name='XGBoost',
    df=df,
    feature_sets=FEATURE_SETS,
    feature_set_names=FEATURE_SETS_TO_TEST,
    cv_strategy=cv_strategy,
    excluded_roles=EXCLUDED_ROLES
)

print("\nXGBoost Results:")
display(xgb_results.round(3))

# Save results
xgb_results.to_csv(TAB_DIR / "model_results_xgb.csv", index=False)
Running XGBoost with Nested CV...
Grid size: 12 combinations per inner CV

XGBoost Results:
Side Model Feature_Set Mean_F1 Std_F1 Mean_Accuracy Std_Accuracy Mean_Train_F1 Std_Train_F1 Mean_Fit_Time Best_Params All_Scores
0 T-Side XGBoost Raw 0.794 0.084 0.798 0.079 0.942 0.023 0.181 {'eval_metric': 'mlogloss', 'learning_rate': 0... [0.8857142857142857, 0.8850408850408851, 0.824...
1 T-Side XGBoost Orthogonal 0.812 0.099 0.816 0.097 0.973 0.015 0.180 {'eval_metric': 'mlogloss', 'learning_rate': 0... [0.8169191919191919, 0.8962962962962964, 0.773...
2 T-Side XGBoost Full 0.819 0.080 0.820 0.078 0.965 0.019 0.165 {'eval_metric': 'mlogloss', 'learning_rate': 0... [0.8857142857142857, 0.8850408850408851, 0.824...
3 CT-Side XGBoost Raw 0.709 0.105 0.720 0.092 0.889 0.029 0.183 {'eval_metric': 'mlogloss', 'learning_rate': 0... [0.6813186813186812, 0.7633477633477633, 0.776...
4 CT-Side XGBoost Orthogonal 0.719 0.100 0.726 0.091 0.854 0.032 0.170 {'eval_metric': 'mlogloss', 'learning_rate': 0... [0.6190476190476191, 0.5712250712250713, 0.771...
5 CT-Side XGBoost Full 0.730 0.097 0.737 0.089 0.874 0.029 0.182 {'eval_metric': 'mlogloss', 'learning_rate': 0... [0.7579365079365079, 0.7594405594405594, 0.776...

3.4 Model Comparison & Champion Selection¶

Aggregate results from all models (including Logistic Regression baseline from Section 2) and identify the best-performing configuration for each side.

In [16]:
# === Aggregate Results: Build Leaderboard ===

# Compile leaderboard (includes LogReg baseline on core roles for fair comparison)
all_results_sorted = compile_model_leaderboard(
    df=df,
    feature_sets=FEATURE_SETS,
    excluded_roles=EXCLUDED_ROLES,
    cv_strategy=cv_strategy,
    model_results={
        'SVM': svm_results,
        'RandomForest': rf_results,
        'XGBoost': xgb_results
    },
    feature_sets_to_test=FEATURE_SETS_TO_TEST,
    tab_dir=TAB_DIR
)

print("=" * 70)
print("MODEL LEADERBOARD (Core Roles Only)")
print("=" * 70)
print("\nRanked by F1-Macro.")

# Display columns (excluding All_Scores for readability)
display_cols = ['Side', 'Model', 'Feature_Set', 'Mean_F1', 'Std_F1', 'Mean_Accuracy', 'Mean_Train_F1', 'Overfitting_Gap']
display(all_results_sorted[display_cols].round(3))
======================================================================
MODEL LEADERBOARD (Core Roles Only)
======================================================================

Ranked by F1-Macro.
Side Model Feature_Set Mean_F1 Std_F1 Mean_Accuracy Mean_Train_F1 Overfitting_Gap
0 CT-Side RandomForest Full 0.764 0.084 0.767 0.863 0.098
1 CT-Side RandomForest Raw 0.756 0.084 0.758 0.870 0.115
2 CT-Side RandomForest Orthogonal 0.751 0.090 0.755 0.861 0.109
3 CT-Side XGBoost Full 0.730 0.097 0.737 0.874 0.143
4 CT-Side XGBoost Orthogonal 0.719 0.100 0.726 0.854 0.135
5 CT-Side XGBoost Raw 0.709 0.105 0.720 0.889 0.180
6 CT-Side SVM Full 0.708 0.091 0.715 0.875 0.166
7 CT-Side SVM Raw 0.704 0.098 0.711 0.867 0.163
8 CT-Side SVM Orthogonal 0.693 0.093 0.699 0.868 0.175
9 CT-Side LogisticRegression Raw 0.688 0.110 0.705 0.811 0.123
10 CT-Side LogisticRegression Full 0.676 0.110 0.689 0.812 0.136
11 CT-Side LogisticRegression Orthogonal 0.670 0.110 0.683 0.806 0.136
12 T-Side LogisticRegression Orthogonal 0.921 0.057 0.922 0.974 0.053
13 T-Side LogisticRegression Full 0.919 0.059 0.919 0.977 0.058
14 T-Side SVM Orthogonal 0.913 0.053 0.910 0.962 0.049
15 T-Side SVM Full 0.907 0.056 0.903 0.965 0.058
16 T-Side LogisticRegression Raw 0.902 0.073 0.903 0.968 0.067
17 T-Side SVM Raw 0.874 0.066 0.872 0.961 0.088
18 T-Side RandomForest Full 0.828 0.084 0.825 0.955 0.127
19 T-Side RandomForest Raw 0.822 0.091 0.822 0.954 0.132
20 T-Side XGBoost Full 0.819 0.080 0.820 0.965 0.147
21 T-Side RandomForest Orthogonal 0.813 0.103 0.813 0.957 0.144
22 T-Side XGBoost Orthogonal 0.812 0.099 0.816 0.973 0.162
23 T-Side XGBoost Raw 0.794 0.084 0.798 0.942 0.148
In [17]:
# === Model Stability Visualisation ===
# Boxplots showing CV score distribution for top configurations per side
# Fixed 0.4-1 axis for honest comparison across sides

# T-Side stability
fig = plot_model_stability_boxplots(
    all_results=all_results_sorted,
    side='t',
    top_n=6,
    fig_dir=FIG_DIR,
    xlim=(0.4, 1.0)
)
plt.show()

# CT-Side stability
fig = plot_model_stability_boxplots(
    all_results=all_results_sorted,
    side='ct',
    top_n=6,
    fig_dir=FIG_DIR,
    xlim=(0.4, 1.0)
)
plt.show()
[Figure: T-side model stability boxplots, top 6 configurations]
[Figure: CT-side model stability boxplots, top 6 configurations]
In [18]:
# === Champion Selection: Best Model per Side ===

# Explicitly select champions (not just highest F1) for interpretability & pipeline consistency
CHAMPION_SELECTION = {
    'T-Side': {'Model': 'LogisticRegression', 'Feature_Set': 'Orthogonal'},
    'CT-Side': {'Model': 'RandomForest', 'Feature_Set': 'Orthogonal'}
}

champions = select_and_save_champion_models(
    leaderboard_df=all_results_sorted,
    df=df,
    feature_sets=FEATURE_SETS,
    excluded_roles=EXCLUDED_ROLES,
    model_dir=MODEL_DIR,
    champion_criteria=CHAMPION_SELECTION
)
======================================================================
CHAMPION MODELS (Best F1-Macro per Side)
======================================================================

T-Side:
  Model: LogisticRegression
  Feature Set: Orthogonal
  F1-Macro: 0.921 ± 0.057
  Accuracy: 0.922 ± 0.054
  Hyperparameters: Default (baseline, no tuning)
  Saved to: P:\cs2-playstyle-analysis-2024\results\classification\models\champion_t_LogisticRegression.joblib

CT-Side:
  Model: RandomForest
  Feature Set: Orthogonal
  F1-Macro: 0.751 ± 0.090
  Accuracy: 0.755 ± 0.087
  Hyperparameters: {'class_weight': 'balanced', 'max_depth': 2, 'max_features': 'sqrt', 'min_samples_leaf': 8, 'n_estimators': 100}
  Saved to: P:\cs2-playstyle-analysis-2024\results\classification\models\champion_ct_RandomForest.joblib

======================================================================

Section 3 Summary & Champion Selection¶

We evaluated four model families (Logistic Regression, SVM, Random Forest, XGBoost) using nested cross-validation to identify the optimal classifier for each side.

1. T-Side Champion: Logistic Regression (Orthogonal)¶

  • Performance: F1-Macro 0.921 ± 0.06 (Top of Leaderboard)
  • Justification:
    • Simplicity Wins: Linear models matched or outperformed complex ensembles (XGBoost F1 ~0.82), proving that T-side roles are linearly separable in our feature space.
    • Interpretability: Logistic Regression offers direct coefficient interpretability, allowing us to explain exactly why a player is classified as a "Lurker" (e.g., +1.75 coefficient on ADNT).
    • Stability: Lowest standard deviation (±0.06) indicates robust generalisation across different player splits.

2. CT-Side Champion: Random Forest (Orthogonal)¶

  • Performance: F1-Macro 0.751 ± 0.09
  • Justification:
    • Performance vs Interpretability Trade-off: Random Forest (Full) achieved marginally higher mean F1 (0.764) with slightly lower variance (±0.08). However, the 1.3% difference corresponds to ~1 player classification difference across our cohort and falls within confidence overlap.
    • Why Orthogonal? We accept the modest variance trade-off (±0.09 vs ±0.08) in exchange for:
      1. A unified feature set with T-side, simplifying cross-interpretation
      2. Avoiding correlated features (ADNT ↔ ADAT) that can cause "vote-splitting" in tree-based importance rankings
      3. Parsimony (6 features vs 7)
    • Complexity Required: Tree-based models substantially outperformed the linear baseline (0.75 vs 0.67), confirming CT roles require non-linear decision boundaries.

3. Methodological Note: Constraints for Small Data¶

Given the small sample size (N=84), we deliberately constrained our tree-based models to shallow depths (max_depth of 2–5 in the grid, with 2–3 selected in practice) and high leaf requirements.

  • Why: Preliminary tests showed that standard depths (unconstrained or deep trees) allowed models to memorise individual players, leading to train-test gaps of >20%.
  • Result: By strictly regularising the hyperparameters in our grid search, we maintained healthy train-test gaps (~10%), ensuring the selected champions are learning generalisable role archetypes rather than overfitting to noise.

Next Steps: We proceed to Section 4 to train these two champion models on the full (filtered) dataset. We will then perform a "post-mortem" analysis using Confusion Matrices to identify specific misclassification patterns and SHAP/Coefficient analysis to validate the behavioural drivers of each role.

4. Model Interpretation¶

TL;DR: We analyse Random Forest models for both T and CT sides to enable consistent non-linear feature interpretation. We use Gini importance for a global view and SHAP values for directional insights. Finally, we analyse prediction confidence to identify misclassification patterns.

Why Random Forest for Both Sides?

Although Logistic Regression was the statistical champion for T-side (F1=0.92), we also analyse the T-side Random Forest (Orthogonal) model here.

Reasons:

  1. Consistency: Using the same model architecture allows for direct comparison of feature importance dynamics between T and CT sides.
  2. Non-Linearity: Trees can capture complex interactions (e.g., "high aggression is good ONLY IF trading is also high") that linear models miss.
  3. Validation: If the RF finds similar patterns to the linear model, it reinforces our findings.

Interpretability Approaches:

  • Gini Importance: "Which features are used most often?" (Magnitude)
  • SHAP Values: "How does this feature value affect the prediction?" (Direction & Magnitude)
  • Confidence Analysis: "Is the model confused or confidently wrong?" (Error Diagnosis)

4.1 Setup & Model Retrieval¶

We load the existing CT-side champion. For T-side, we retrieve the best hyperparameters from the leaderboard and retrain a fresh Random Forest on the filtered dataset (Core Roles) to ensure fair analysis.

In [19]:
# === Section 4 Setup: Prepare Models ===

# Load/retrain models for interpretation
t_model = prepare_interpretation_model(
    side='t',
    model_name='RandomForest',
    feature_set_name='Orthogonal',
    df=df,
    feature_sets=FEATURE_SETS,
    excluded_roles=EXCLUDED_ROLES,
    model_dir=MODEL_DIR,
    tab_dir=TAB_DIR,
    load_if_saved=False  # Retrain T-side for interpretation
)

ct_model = prepare_interpretation_model(
    side='ct',
    model_name='RandomForest',
    feature_set_name='Orthogonal',
    df=df,
    feature_sets=FEATURE_SETS,
    excluded_roles=EXCLUDED_ROLES,
    model_dir=MODEL_DIR,
    tab_dir=TAB_DIR,
    load_if_saved=True  # Load saved CT champion
)

# Unpack for downstream compatibility
champion_t_rf = t_model['pipeline']
champion_ct = ct_model['pipeline']
X_t, y_t = t_model['X'], t_model['y']
X_ct, y_ct = ct_model['X'], ct_model['y']
feature_names_t = t_model['feature_names']
feature_names_ct = ct_model['feature_names']
champion_data = {'t': t_model, 'ct': ct_model}

print("\nModels ready for interpretation:")
print(f"  T-Side:  RandomForest (F1 ~{t_model['f1_score']:.3f})")
print(f"  CT-Side: {type(ct_model['pipeline'].named_steps['clf']).__name__} (F1 ~{ct_model['f1_score']:.3f})")
Retraining T-Side RandomForest with params: {'class_weight': 'balanced', 'max_depth': 2, 'max_features': 'sqrt', 'min_samples_leaf': 5, 'n_estimators': 100}
Loading saved model from P:\cs2-playstyle-analysis-2024\results\classification\models\champion_ct_RandomForest.joblib

Models ready for interpretation:
  T-Side:  RandomForest (F1 ~0.813)
  CT-Side: RandomForestClassifier (F1 ~0.751)

4.2 Global Feature Importance (Gini)¶

We compare which features drive the decision trees for each side. Gini importance measures how often a feature is used to split nodes and how much it reduces impurity.
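compare_rf_feature_importance is a project helper; pulling the raw Gini importances out of a fitted pipeline is a short operation (the 'clf' step name matches the pipeline usage later in this section):

import pandas as pd

rf = champion_ct.named_steps['clf']   # the RandomForestClassifier inside the pipeline
gini = pd.Series(rf.feature_importances_, index=feature_names_ct).sort_values(ascending=False)
print(gini)                           # impurity-based importances (they sum to 1.0)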

In [20]:
# === Global Feature Importance (Gini) ===

compare_rf_feature_importance(
    pipeline_t=champion_t_rf,
    feature_names_t=feature_names_t,
    pipeline_ct=champion_ct,
    feature_names_ct=feature_names_ct,
    fig_dir=FIG_DIR
)
plt.show()
[Figure: Gini feature importance comparison, T-side vs CT-side]

4.3 Directional Feature Analysis (SHAP)¶

Gini importance tells us what matters, but not how. We use SHAP beeswarm plots to reveal directionality (e.g., does high ADNT predict Lurker or Spacetaker?).

How to read these plots:

  • Each Dot: Represents a single player from the dataset.
  • Rows: Features (variables).
  • X-axis (SHAP value): Impact on model output.
    • Right (Positive): Pushes prediction towards this role.
    • Left (Negative): Pushes prediction away from this role.
  • Colour: Actual Feature Value.
    • Red: High value for that feature.
    • Blue: Low value for that feature.

Example: If the "Lurker" plot shows red dots (High ADNT) on the right (positive SHAP), it means "High isolation increases the probability of being classified as a Lurker".
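plot_shap_beeswarm_grid is project code; the sketch below shows the kind of shap calls it could plausibly wrap. It assumes the pre-model steps can be applied via pipeline slicing and handles both list- and array-shaped SHAP outputs (the return shape changed between shap versions):

import shap

def beeswarm_per_class(pipeline, X, feature_names):
    rf = pipeline.named_steps['clf']
    X_scaled = pipeline[:-1].transform(X)  # apply pre-model (scaling) steps only
    shap_vals = shap.TreeExplainer(rf).shap_values(X_scaled)
    for k, role in enumerate(rf.classes_):
        # Older shap returns a list (one array per class); newer versions
        # return an (n_samples, n_features, n_classes) array.
        vals = shap_vals[k] if isinstance(shap_vals, list) else shap_vals[..., k]
        print(f"Beeswarm for class: {role}")
        shap.summary_plot(vals, X_scaled, feature_names=feature_names)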

In [21]:
# === Separate SHAP Analysis (T & CT) ===

# T-Side SHAP
print("Generating T-Side SHAP Plot...")
fig_t = plot_shap_beeswarm_grid(
    pipeline=champion_t_rf,
    X=X_t,
    feature_names=feature_names_t,
    side='t',
    fig_dir=FIG_DIR
)
plt.show()

# CT-Side SHAP
print("Generating CT-Side SHAP Plot...")
fig_ct = plot_shap_beeswarm_grid(
    pipeline=champion_ct,
    X=X_ct,
    feature_names=feature_names_ct,
    side='ct',
    fig_dir=FIG_DIR
)
plt.show()
Generating T-Side SHAP Plot...
[Figure: T-side SHAP beeswarm grid]
Generating CT-Side SHAP Plot...
[Figure: CT-side SHAP beeswarm grid]

Observations from SHAP Analysis:

T-Side Roles:

  • Lurker: Strongly defined by High ADNT (Isolation). The red dots on the far right of the ADNT row confirm that isolation is the primary driver. Lurkers also show high ADAT Residuals, meaning they position further from the team centre than expected given their ADNT.
  • Spacetaker: Driven by High OAP (Opening Attempts) and Low TAPD (Time Alive). This confirms the "entry fragger" profile: aggressive first-contact seeking that often leads to earlier deaths.
  • AWPer: Characterised by Low ADNT (Pack play), Low OAP (Opening Attempts) and High TAPD (Survival). The blue dots on the ADNT row indicate that playing close to teammates is a key defining feature.

CT-Side Roles:

  • Anchor: The mirror of the Lurker, defined by High ADNT (Isolation) and High ADAT Residuals (Static/Peripheral holding).
  • Rotator: Defined by Low ADNT (Pack play) and Low TAPD (Active rotation/support leading to higher engagement risk).
  • AWPer (CT): Distinct from T-side AWPers, they show Higher OAP (getting more opening picks) but still maintain High TAPD (survival), reflecting the "posted up" nature of defensive sniping.

4.4 Probability Analysis (The "Discrimination Plot")¶

We visualise the model's confidence for every player to distinguish between "Clear Wins", "Near Misses", and "Confident Errors".

Methodology: Repeated Cross-Validation

To ensure robust and unbiased probability estimates, we use Repeated Stratified K-Fold (4 splits × 20 repeats).

  1. For each player, we generate 20 independent out-of-sample predictions (trained on the other 75% of data).
  2. We average these probabilities to get a stable estimate of the model's confidence.
  3. Error bars represent the standard deviation across repeats, quantifying the stability of the classification.

High standard deviation indicates that a player's classification is sensitive to the specific training data split (an "Unstable" classification).
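A minimal sketch of this procedure (the project's get_repeated_cv_predictions follows the same pattern; the helper name and defaults here are illustrative):

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

def repeated_cv_probas(model, X, y, n_splits=4, n_repeats=20, seed=42):
    X, y = np.asarray(X), np.asarray(y)
    n_classes = len(np.unique(y))
    probas = np.zeros((n_repeats, len(y), n_classes))
    for r in range(n_repeats):
        # A fresh shuffle per repeat; every player is out-of-sample exactly once.
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed + r)
        for train_idx, test_idx in skf.split(X, y):
            est = clone(model).fit(X[train_idx], y[train_idx])
            probas[r, test_idx] = est.predict_proba(X[test_idx])
    # Mean = stable confidence estimate; std = sensitivity to the split.
    return probas.mean(axis=0), probas.std(axis=0)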

In [22]:
# === T-Side Probability Analysis ===


# 1. Generate Repeated CV Predictions
print("Generating T-Side Predictions...")
mean_probas_t, std_probas_t = get_repeated_cv_predictions(
    model=champion_t_rf,
    X=X_t,
    y=y_t
)

# Get player names
player_names_t = champion_data['t']['df']['player_name'].tolist()

# 2. Plot with Stability
fig = plot_prediction_confidence(
    y_true=y_t,
    mean_probas=mean_probas_t,
    std_probas=std_probas_t,
    class_names=champion_t_rf.classes_,
    side='t',
    player_names=player_names_t,
    fig_dir=FIG_DIR
)
plt.show()

# === CT-Side Probability Analysis ===
print("Generating CT-Side Predictions...")
mean_probas_ct, std_probas_ct = get_repeated_cv_predictions(
    model=champion_ct,
    X=X_ct,
    y=y_ct
)

# Get player names
player_names_ct = champion_data['ct']['df']['player_name'].tolist()

fig = plot_prediction_confidence(
    y_true=y_ct,
    mean_probas=mean_probas_ct,
    std_probas=std_probas_ct,
    class_names=champion_ct.classes_,
    side='ct',
    player_names=player_names_ct,
    fig_dir=FIG_DIR
)
plt.show()
Generating T-Side Predictions...
[Figure: T-side prediction confidence plot]
Generating CT-Side Predictions...
[Figure: CT-side prediction confidence plot]

Observations: Probability & Misclassification Analysis¶

Overall Trends:

As anticipated from the F1 scores and previous analysis, the CT side exhibits a higher rate of misclassification compared to the T side. Specifically, the boundary between Rotators and AWPers appears the most porous, with the model frequently confusing these roles.

Interestingly, IGLs appear to be misclassified quite frequently (HooXi, apEX, chopper, bLitz, Snax, MAJ3R, biguzera). Although effort was made to use features that do not directly measure personal performance, perhaps there is a confounding effect between IGLs generally performing worse in-game and the distinctiveness of their roles?

T-Side Observations and theories:

  • Brollan (Lurker $\to$ Spacetaker): The model confidently misclassifies Brollan as a Spacetaker. Personally, I had the impression that he was a relatively aggressive player; this high-confidence error suggests a potential mismatch between his assigned label and his actual behavioural metrics this year.
  • Jabbi (Lurker $\to$ Spacetaker): Similarly misclassified but with higher prediction variance, suggesting a unique or hybrid playstyle that defies rigid categorisation.
  • Ultimate (AWPer $\to$ Spacetaker): His misclassification aligns with his reputation as a famously aggressive AWPer in 2024, exhibiting movement patterns closer to a rifler than a sniper.
  • ZywOo (AWPer $\to$ Rotator): Likely misclassified due to his hybrid profile; he is well-known for his proficiency with rifles and willingness to pick them up more frequently than "pure" AWPers.
  • cadiaN (Rotator $\to$ AWPer): Misclassified as an AWPer, which is factually grounded: he served as the primary AWPer for Liquid before transitioning to a rifling role with Astralis mid-year.

CT-Side Observations and theories:

  • Spinx (Anchor $\to$ Rotator): Spinx is the most confidently misclassified player on the CT side. This is a notable case where the model's prediction (Rotator) contradicts the label (Anchor). As discussed in Section 4.5, this likely points to a labelling error in our ground truth rather than a model failure.
  • Ultimate (AWPer $\to$ Rotator): Consistent with his T-side results, his unique, aggressive AWPing style registers as rifler-like behavior to the model.
  • biguzera (Rotator $\to$ AWPer): Confidently misclassified as an AWPer, potentially because, as an IGL, he takes the kind of "Star" positions an AWPer would typically occupy.

4.5 Individual Misclassification Analysis: Waterfall Plots (LOOCV)¶

Waterfall plots are generated with a Leave-One-Out (LOO) fit: for the player being explained, the model is retrained on all other core-role players (excluding ambiguous roles such as Half-Lurker/Mixed). This mirrors the "unseen player" condition of the CV predictions in Section 4.4 and avoids explaining a model that has already memorised the player. Because the LOO model is slightly better-specified (N-1 > 0.75N), SHAP values may be marginally more confident than any single CV fold, but they faithfully explain why the model would misclassify the player when treated as new data.
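Schematically, the LOO fit behind each waterfall looks like this (an illustrative helper, not the project's exact code; plot_comparison_waterfall adds the plotting on top):

import numpy as np
import shap
from sklearn.base import clone

def loo_shap_for_player(pipeline_template, X, y, player_names, player):
    # Refit on everyone except the target player, then explain the model's
    # view of that player as genuinely unseen data.
    train = np.asarray(player_names) != player
    model = clone(pipeline_template).fit(X[train], y[train])
    rf = model.named_steps['clf']
    x_held_out = model[:-1].transform(X[~train])
    shap_vals = shap.TreeExplainer(rf).shap_values(x_held_out)
    return model, shap_vals  # one SHAP block per class for the held-out row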

Also note that many of the remarks made here are speculative and would require individual analysis of the players' matches to be fully justified.

T side waterfall plots¶

We'll start with Brollan and jabbi, Lurkers who were misclassified as Spacetakers.

In [23]:
# Side-by-side comparison of predicted vs. true role for Brollan
plot_comparison_waterfall(
    pipeline_template=champion_t_rf,
    df=df,
    player_name="Brollan",
    feature_names=feature_names_t,
    side='t',
    excluded_role=EXCLUDED_ROLES['t'],
    fig_dir=FIG_DIR,
)
plt.show()
plot_comparison_waterfall(
    pipeline_template=champion_t_rf,
    df=df,
    player_name="jabbi",
    feature_names=feature_names_t,
    side='t',
    excluded_role=EXCLUDED_ROLES['t'],
    fig_dir=FIG_DIR,
)
plt.show()
[Figures: waterfall comparisons for Brollan and jabbi]

Brollan and jabbi have relatively similar playstyles according to the feature set. The main drivers towards the Spacetaker prediction were their high opening attempts (OAP) and low time alive per death (TAPD), both very typical of Spacetakers. Where they deviate is their high isolation from teammates (ADNT), the biggest (and virtually only) factor pulling them towards the Lurker role. This tells us that these players had an isolated but aggressive T-side playstyle. With such a high distance from their team and such high opening attempts, this could point to a problem with their teams keeping up early in rounds.

Next we'll look at Ultimate, an AWPer who was misclassified as a Spacetaker.

In [24]:
plot_comparison_waterfall(
    pipeline_template=champion_t_rf,
    df=df,
    player_name="ultimate",
    feature_names=feature_names_t,
    side='t',
    excluded_role=EXCLUDED_ROLES['t'],
    fig_dir=FIG_DIR,
)
plt.show()
[Figure: waterfall comparison for ultimate (T side)]

Ultimate was known for being an exceptionally aggressive AWPer in 2024, and his feature values corroborate this. He was misclassified as a Spacetaker largely due to dying quickly (low TAPD) and being traded relatively frequently (high PODT).
The table below compares his feature values to those of other AWPers; he consistently places in the tails of the distribution for every feature, indicating he was a significant outlier in his playstyle.

In [25]:
ultimate_t_percentiles = get_player_percentiles(
    df=df,
    player_name="ultimate",
    side='t',
    features=feature_names_t,
    excluded_role=EXCLUDED_ROLES['t'],
)
print("Ultimate vs. labelled role (AWPer):")
display(ultimate_t_percentiles)
Ultimate vs. labelled role (AWPer):
   player_name side role_group          feature      value  percentile   rank
0     ultimate    t      AWPer           tapd_t  59.090479    5.882353   1/17
1     ultimate    t      AWPer            oap_t  19.145299   94.117647  16/17
2     ultimate    t      AWPer           podt_t  25.097174   94.117647  16/17
3     ultimate    t      AWPer           pokt_t  22.911286   11.764706   2/17
4     ultimate    t      AWPer      adnt_rank_t   0.298120    5.882353   1/17
5     ultimate    t      AWPer  adat_residual_t   0.063184   88.235294  15/17
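The percentile column follows the "weak" definition (share of role peers at or below the player's value). A minimal sketch with scipy; the column names here are assumptions based on the table above:

from scipy.stats import percentileofscore

def role_percentile(df, player, role, feature, role_col='role_group'):
    # Percentile of the player's value among all players sharing the role label.
    peers = df.loc[df[role_col] == role, feature]
    value = df.loc[df['player_name'] == player, feature].iloc[0]
    return value, percentileofscore(peers, value, kind='weak')

# e.g. role_percentile(df, 'ultimate', 'AWPer', 'oap_t')  # ~94.1, as above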

CT side waterfall plots¶

Next we'll look at Spinx, an "Anchor" who was "misclassified" as a Rotator.

In [26]:
plot_comparison_waterfall(
    pipeline_template=champion_ct,
    df=df,
    player_name="Spinx",
    feature_names=feature_names_ct,
    side='ct',
    excluded_role=EXCLUDED_ROLES['ct'],
    fig_dir=FIG_DIR,
)
plt.show() 
[Figure: waterfall comparison for Spinx]

The waterfall plot shows that all of Spinx's features match what one would expect of a Rotator, and his positioning (ADNT) combined with his opening attempts (OAP) make him very unlikely to be an Anchor. There is a very good reason for this: he is actually mislabelled. Upon analysing many of his 2024 matches with Vitality, Spinx did appear to play a rotating role on the CT side. This is slightly embarrassing, as the error is mine: at the end of 2024 he switched teams, Harry Richards' positions data (which was used to collect most labels) did not include him, and I must have (wrongly) assumed his CT-side role. Whoops! On the plus side, this demonstrates a potential use case for a model like this: even if it isn't yet accurate enough to classify all players, it can still highlight potential miscategorisations such as this one.

Next we will look at HooXi, an Anchor misclassified as a Rotator.

In [27]:
plot_comparison_waterfall(
    pipeline_template=champion_ct,
    df=df,
    player_name="HooXi",
    feature_names=feature_names_ct,
    side='ct',
    excluded_role=EXCLUDED_ROLES['ct'],
    fig_dir=FIG_DIR,
)
plt.show() 
[Figure: waterfall comparison for HooXi]

HooXi is an interesting case: his closer proximity to teammates (ADNT) and lower time alive per death (TAPD) were the main causes of his classification as a Rotator. His low TAPD could be explained by his notoriously sacrificial playstyle* (also indicated by a very high PODT), but his low isolation (ADNT) is very atypical of an Anchor. Upon analysing a few of his games on G2 this year, it appeared that, depending on the map, HooXi played a mixture of "Anchor" positions (such as B-site on Inferno and Cave on Ancient) and more rotating positions (on maps such as Overpass and Anubis). This leads me to conclude that I (again due to a missing label in my version of the positions data) mislabelled HooXi as an Anchor, when he should have received the more ambiguous "Mixed" role. A second mislabelling indicates that I should have been far more careful when imputing missing labels into the dataset, not just relying on memory and assumptions (probably rushing to meet a university deadline).

*https://www.hltv.org/news/39326/stat-check-g2-throws-roles-out-the-window

Next we will look at Ultimate (just because he's fun), an AWPer who was misclassified as a Rotator.

In [28]:
plot_comparison_waterfall(
    pipeline_template=champion_ct,
    df=df,
    player_name="ultimate",
    feature_names=feature_names_ct,
    side='ct',
    excluded_role=EXCLUDED_ROLES['ct'],
    fig_dir=FIG_DIR,
)
plt.show() 
[Figure: waterfall comparison for ultimate (CT side)]

This is much the same story as on the T side for Ultimate: impressively aggressive, with high opening attempts (OAP > 30%!) and low time alive (TAPD) pushing him towards the Rotator role. His early deaths pull him away from an AWPer classification, along with positioning closer to the average teammate than his nearest-teammate distance would suggest (ADAT Residual), which was the most influential feature when classifying AWPers (see the beeswarm plot in Section 4.3).

Next we will look at malbsMd, a Rotator who was misclassified as an Anchor.

In [29]:
plot_comparison_waterfall(
    pipeline_template=champion_ct,
    df=df,
    player_name="malbsMd",
    feature_names=feature_names_ct,
    side='ct',
    excluded_role=EXCLUDED_ROLES['ct'],
    fig_dir=FIG_DIR,
)
plt.show() 
[Figure: waterfall comparison for malbsMd]

malbsMd presents a curious case. The drive towards Anchor and away from Rotator is his high early-round isolation from his team: he positions far from his nearest teammate (ADNT), and even further from his average teammate than one would expect (ADAT Residual). Even looking at the specific positions he played on each map, he did seem to occupy generally rotate-heavy spots* (e.g. Connector on Mirage, Middle on Ancient, Connector on Anubis). One would have to analyse his specific maps to get a better understanding of this anomaly. My guess would be that in early rounds, more players were stacked closer to the anchoring IGL (HooXi or Snax) to compensate for lower firepower, leaving malbsMd more isolated.

*https://public.tableau.com/app/profile/harry.richards4213/viz/OLDPositionsDatabaseArchived/PositionsDatabaseNER0cs

*https://www.hltv.org/news/39318/official-malbsmd-joins-g2

Next we'll take a look at biguzera, chopper, bLitz and apEX, all IGL-Rotators who were misclassified as AWPers.

In [30]:
plot_comparison_waterfall(
    pipeline_template=champion_ct,
    df=df,
    player_name="biguzera",
    feature_names=feature_names_ct,
    side='ct',
    excluded_role=EXCLUDED_ROLES['ct'],
    fig_dir=FIG_DIR,
)
plt.show() 
plot_comparison_waterfall(
    pipeline_template=champion_ct,
    df=df,
    player_name="chopper",
    feature_names=feature_names_ct,
    side='ct',
    excluded_role=EXCLUDED_ROLES['ct'],
    fig_dir=FIG_DIR,
)
plt.show() 
plot_comparison_waterfall(
    pipeline_template=champion_ct,
    df=df,
    player_name="bLitz",
    feature_names=feature_names_ct,
    side='ct',
    excluded_role=EXCLUDED_ROLES['ct'],
    fig_dir=FIG_DIR,
)
plt.show() 
plot_comparison_waterfall(
    pipeline_template=champion_ct,
    df=df,
    player_name="apEX",
    feature_names=feature_names_ct,
    side='ct',
    excluded_role=EXCLUDED_ROLES['ct'],
    fig_dir=FIG_DIR,
)
plt.show()
[Figures: waterfall comparisons for biguzera, chopper, bLitz and apEX]

These are the four players most confidently misclassified into the AWPer role (see Section 4.4). What makes this interesting is that they are all Rotators, and all in-game leaders (IGLs). This seems to be the smoking gun for what went wrong here. Observing their feature values, they all have a low ADAT Residual, pulling them away from the Rotator classification and towards the AWPer role.

Let's see if this low ADAT Residual value has something to do with their meta-role as IGL.

Visual diagnostic: IGL ADAT (Residual) vs CT role distribution

In [31]:
plot_igl_feature_distribution(
    df=df,
    feature_name="adat_residual_ct",
    side="ct",
    fig_dir=FIG_DIR,
)
plt.show()
[Figure: IGL ADAT (Residual) distribution vs CT role distribution]

This plot is rather telling: it shows two main things about IGLs. Firstly, IGLs are more likely to be Spacetakers; secondly, they tend to position closer to the average teammate than their nearest-teammate proximity would predict, regardless of role.

In other words, IGLs tend to play more "central" positions on the map, at least early in rounds.

This idea is not new; in fact, Harry "NER0cs" Richards points out this centralised-IGL phenomenon in his 2024 article "Why more and more IGLs are taking up the 'supportive rotator' role"*.

The reasoning boils down to how much information the IGL can process from these positions. Central positions allow IGLs to understand what is happening around the map without relying on communicated information or a secondary caller (another player designated to lead in specific circumstances). The trade-off, however, is that many of these positions are high-engagement areas where "Star" players typically excel, meaning an IGL there can become a liability, often not pulling their weight in firepower.

*https://www.hltv.org/news/38747/why-more-and-more-igls-are-taking-up-the-supportive-rotator-role

Section 4 Summary & Interpretation¶

Key Takeaways:

  • Feature Utility: CT-side classification relies on the four non-trade metrics, whereas T-side more widely leverages the feature set.
  • The Power of ADNT: Isolation (ADNT) proved to be the most critical feature for almost all roles (except CT AWPers and Spacetakers). While simple and slightly arbitrary, its interpretability and high signal strength make it a standout "bespoke" metric.
  • CT-Side Ambiguity: The boundary between Rotators and AWPers is the most porous in our model. This confusion is likely driven by the IGL-Centrality phenomenon, where In-Game Leaders play central positions that mimic the spacing profile of snipers.
  • Misclassification as Diagnostic: High-confidence errors often pointed to ground truth labelling issues (e.g., Spinx, HooXi) or unique outlier playstyles (e.g., Ultimate's aggressive AWPing). Despite the mislabels being "human error" (rushed university coursework!), the model's ability to flag them validates its potential as a viable diagnostic tool for role classification.

Detailed Interpretations

1. Signal Diversity & Feature Starvation

  • T-Side (Broad Utility): The model finds useful signal across the entire feature set. Even the lowest-ranked feature (POKT) contributes a meaningful 0.07 Gini importance, indicating that trade behavior is a valid discriminator for attacking roles.
  • CT-Side (More Concentrated Signal): Signal is concentrated in the four non-trade metrics (ADNT, ADAT (R), OAP, TAPD). The drop-off for trade metrics (POKT, PODT) reflects the nature of defensive play: Terrorists typically dictate the terms of engagement, making proactive trading behavior less useful for differentiating between specific CT roles compared to T roles.

2. Behavioral Drivers (SHAP Analysis)

  • T-Side:
    • Lurker: Strongly defined by High Isolation (ADNT).
    • Spacetaker: Defined principally by High Opening Attempts (OAP) and Low Survival (TAPD). Aggressive, as expected!
    • AWPer: Defined by Low Isolation (ADNT) and Low OAP. Plays passively and close to teammates (Unless their name is Ultimate).
  • CT-Side:
    • Anchor: The mirror of the Lurker, defined by High Isolation (ADNT).
    • Rotator: Defined by Low ADNT (Pack play) and Low Survival (TAPD), reflecting the high-engagement nature of rotation play.
    • AWPer: Less distinct than T-side. Defined by Low ADAT Residual ("Centrality") and High Survival (TAPD). Interestingly, the ADNT signal is mixed, suggesting most AWPers effectively play like Rotators while a minority isolate; AWPers may themselves be subcategorisable into rotating and anchoring variants.

3. The "Aggressive/Hybrid" Profiles

  • Ultimate (Aggressive): A statistical outlier across the board. His High OAP and Low TAPD make him an exceptionally aggressive sniper, confusing the model into seeing a Rifler.
  • ZywOo (Hybrid): Misclassified as a Rotator not due to aggression, but because he is a hybrid who picks up rifles more frequently than "pure" AWPers.

4. The IGL-Centrality Phenomenon

  • A cluster of IGLs (biguzera, chopper, bLitz, apEX) were misclassified as AWPers.
  • Root Cause: These players consistently show low ADAT Residuals (positioning closer to the team's centre of mass than expected).
  • Tactical Insight: I believe this reflects IGLs taking "middle-of-the-pack" positions to gather information. The model confuses this "central" behavior with the positioning of an AWPer.

5. Future Potential

  • This "first attempt" demonstrates that even with coarse metrics like OAP and ADNT, we can extract meaningful role signatures.
  • The model's ability to identify outliers (like Ultimate) and labelling errors (like Spinx) shows its potential. With further refinement, it could be used to classify a larger pool of players (potentially commercially, for casual players) or as a supplementary tool for scouting players.

5. Synthesis & Conclusion¶

This notebook has demonstrated that the playstyle features in the dataset contain sufficient signal to classify CS2 professional player roles with high accuracy, particularly on the T-side.

5.1 Methodology Recap & Feature Engineering¶

To ensure statistically robust feature values, we filtered for players with a minimum of 40 maps used to compute their statistics (as determined in the EDA notebook).

To enable this analysis on a small cohort (N=84), we employed a rigorous 4-split × 20-repeat Nested Cross-Validation strategy (80 total folds). This ensured that our performance estimates were stable and that our champion models learned generalisable role archetypes rather than memorising individual player data.

A key innovation was the ADAT (Residual) feature. By training a global linear regression ($ADNT \to ADAT$) on the stable cohort, we established a "Pro Standard" for how much a player should be isolated from the team centre given their distance from their nearest teammate. The residual measures deviation from this professional norm.
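A minimal sketch of this construction (the raw column names are assumptions; the stored feature in the notebook is adat_residual_*):

from sklearn.linear_model import LinearRegression

def adat_residual(df, side='t'):
    # Fit the global "Pro Standard" line ADNT -> ADAT on the stable cohort,
    # then keep each player's deviation from it.
    X = df[[f'adnt_{side}']].to_numpy()  # distance to nearest teammate
    y = df[f'adat_{side}'].to_numpy()    # distance to average teammate
    fit = LinearRegression().fit(X, y)
    return y - fit.predict(X)            # positive = more peripheral than expected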

Note on Feature Engineering & Data Leakage

Technically, input distribution leakage exists because the ADAT (Residual) feature is calculated using a global Linear Regression model fit on the entire (stable cohort) dataset (including samples that eventually become "Test" data in Cross-Validation). This means the specific slope and intercept used to transform the features were very slightly influenced by the test subjects, violating the strict separation of Train and Test environments.

Despite this theoretical impurity, the approach is retained and justified for three key reasons:

  1. The "Pro Standard" Baseline (Domain Justification): We treat the relationship between Isolation (ADNT) and Centrality (ADAT) as a fixed geometric constraint of high-level CS2, defined by the global population of elite professionals. By fitting on the full stable cohort, we establish a canonical "Ground Truth" for standard positioning. The residual, therefore, measures a player's deviation from the professional norm, not just a statistical deviation from a local training fold.

  2. Unsupervised Nature: The leakage is strictly limited to the relationship between independent variables ($X \to X$). No information regarding the target variable (Player Roles) is leaked. The model is not learning "the answer" from the test set, only the "scale" of the input features.

  3. Feature Stability: Given the small sample size ($N=84$), calculating regression coefficients within each small CV fold (~63 players) would introduce high variance, causing the definition of the feature to shift wildly between folds. A global fit ensures the feature definition remains consistent and interpretable across the analysis.
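Point 3 can be sanity-checked empirically by refitting the $ADNT \to ADAT$ regression inside each training fold and measuring how much the slope wanders; a sketch, taking 1-D feature arrays as in the residual sketch above:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold

def slope_spread(adnt, adat, n_splits=4, n_repeats=20, seed=42):
    # Distribution of the ADNT -> ADAT slope when refit per training fold.
    slopes = []
    rkf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
    for train_idx, _ in rkf.split(adnt):
        fit = LinearRegression().fit(adnt[train_idx].reshape(-1, 1), adat[train_idx])
        slopes.append(fit.coef_[0])
    return float(np.mean(slopes)), float(np.std(slopes))  # high std = unstable feature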

5.2 Results Synthesis¶

Baseline & Feature Sets:

  • Our baseline Logistic Regression significantly outperformed the stratified dummy classifier (Cohen's $d \approx 5.7$ for T-side, $\approx 3.1$ for CT-side; computed as sketched after this list), proving that these features capture real playstyle differences, not random noise.
  • No clear "winner" emerged among the feature sets (Raw, Orthogonal, Full) at the baseline stage; differences between these sets were within 0.02 F1 (~1-2 players). This justified retaining all three sets for advanced model testing.
  • The exclusion of "Ambiguous" roles (Half-Lurker on T-side, Mixed on CT-side) improved F1-Macro by +0.28 (T) and +0.21 (CT). This confirms that while "Core" roles (Lurker, Spacetaker, Anchor, Rotator, AWPer) are statistically distinct, the ambiguous roles represent hybrid states that blur the boundaries of our taxonomy.
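For reference, the effect size quoted in the first bullet is the pooled-standard-deviation Cohen's d over fold scores; a minimal sketch (the fold-score array names are illustrative):

import numpy as np

def cohens_d(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                        / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled_sd

# e.g. cohens_d(f1_folds_logreg_t, f1_folds_dummy_t)  # ~5.7 reported above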

Model Performance & Nature:

  • T-Side: Best modeled by Logistic Regression (F1 ≈ 0.92). The roles are linearly separable, with all features (including trade metrics like POKT) contributing useful signal. Simplicity wins here; complex ensembles offered no improvement.
  • CT-Side: Best modeled by Random Forest (F1 ≈ 0.75). Defensive roles require non-linear decision boundaries, relying heavily on just four metrics (positioning and aggression: ADNT, ADAT_Residual, OAP, TAPD) while trade metrics provide little discriminatory value.

5.3 Key Insights & The "IGL Confound"¶

The misclassifications themselves provided some of the richest insights:

  1. The "IGL Centrality" Phenomenon: A cluster of Rotator-IGLs (biguzera, chopper, bLitz, apEX) were confidently misclassified as AWPers. Our analysis revealed this is due to Low ADAT Residuals: IGLs tend to play "central" positions to maximise information processing. The model confuses this centralised support-rotator positioning with the nature of defensive AWPing.

  2. Outliers vs Archetypes: Ultimate stands out as an exceptionally aggressive AWPer. His classification as a Spacetaker/Rotator (despite being an AWPer) highlights his statistical uniqueness: an aggressive sniper (OAP > 30%) who breaks the mould of the passive hold.

  3. Diagnostic Value: High-confidence errors often pointed to ground truth labelling issues (e.g., Spinx labelled as Anchor but exhibiting Rotator statistics). This validates the model's potential as a "sanity check" tool for manual labelling efforts.

5.4 Conclusion & Future Directions¶

Data Integrity Note: While the base dataset (data/raw/cs2_playstyle_roles_2024.csv) has been corrected, this notebook deliberately injected the original errors for HooXi and Spinx. This "controlled fault injection" allows us to demonstrate the model's diagnostic value: by flagging these players as confident misclassifications (e.g., Spinx confidently predicted as Rotator despite the injected "Anchor" label), the model successfully identified the data quality issues.

This notebook serves as a successful proof of concept: player roles in CS2 are not just theoretical labels but measurable statistical clusters. We have provided validation for:

  • A. The role labels (they correspond to distinct behavioral profiles).
  • B. The features (they contain discriminatory signal).
  • C. The modelling approach (ML classification is feasible even with limited data).

It's not perfect—we obviously got some misclassifications even with the best models—but as a first pass at ML role classification, it has served its purpose.

Limitations & Future Work:

  • Data Availability: We are constrained by the organisation of the professional scene. Lower-tier teams often lack the consistent role structures of elite teams (roles may be less distinct or more fluid), and parsing enough maps (>40) for stable metrics is a challenge for teams that don't attend many events. Furthermore, accurate role labels require manual curation (e.g., Harry Richards' lovely Positions Database), which cannot scale to all teams.
  • Feature Expansion: Future iterations could include weapon-specific features (e.g., "% Kills with AWP" would obviously significantly boost AWPer classification and resolve the Ultimate/IGL confusion) or more granular/creative data that could reveal subtler differences between rifler playstyles.

Commercial Application: Finally, this modelling approach has potential for broader application. Integrating role classification into commercial statistic services for casual players (e.g., Leetify, Scope.gg, CSStats) could satisfy the "personal identity" aspect of Blumler and Katz's Uses and Gratifications Theory. A feature like "What's my Role" bridges the gap between abstract statistics and personal narrative, giving players a richer understanding of their own gameplay.