03 - Role Classification and Model Evaluation¶
Executive Summary¶
This analysis develops and evaluates supervised classification models to predict player roles (Lurker, Spacetaker, Anchor, Rotator, AWPer) from behavioural and positional statistics.
Key Findings:
- Model Performance:
- T-Side: Roles are linearly separable and highly distinct. Logistic Regression achieves an F1-Macro of ~0.92, significantly outperforming complex ensembles.
- CT-Side: Roles require non-linear decision boundaries. Random Forest (F1 ~0.75) outperforms linear baselines, but performance is capped largely by the porous boundary between Rotators and AWPers.
- Feature Selection: The Orthogonal Feature Set (which replaces ADAT with ADAT (Residual)) maintains predictive power and removes multicollinearity between positional features.
- Role Ambiguity: Excluding ambiguous hybrid roles (Half-Lurker, Mixed) improves F1-Macro by ≈+0.25 (T) and ≈+0.18 (CT), confirming that "Core" roles are distinct statistical archetypes.
- Interpretability:
- "IGL Confound": In-Game Leaders are frequently misclassified as AWPers or Spacetakers due to their tendency to play "central" positions for information gathering.
- Diagnostic Value: High-confidence misclassifications successfully identified ground-truth labeling errors (e.g., Spinx as Anchor $\to$ Rotator).
Outcome: Identified champion models for both sides and a diagnostic framework for validating player roles.
Note: All analyses are performed separately for T-side and CT-side to preserve tactical context.
Data Note: The input dataset has been corrected, but for this classification analysis we deliberately inject the original mislabels (e.g., HooXi, Spinx) to demonstrate the model's diagnostic capabilities. We use these "controlled errors" to verify whether the model can flag inconsistent labels.
Objectives¶
1. Experimental Design (Cross-Validation)
Establish a rigorous 4-split $\times$ 20-repeat Nested Cross-Validation strategy (80 total folds) to ensure stable performance estimates given the small sample size ($N=84$).
2. Baseline & Feature Ablation
Evaluate Logistic Regression baselines across four feature sets (Raw, Orthogonal, Residuals, Full) to manage multicollinearity and establish a performance floor.
3. Sensitivity Analysis
Quantify the impact of ambiguous roles (Half-Lurker, Mixed) on classification performance to determine if the model struggles with features or with fuzzy definitions.
4. Advanced Modelling
Benchmark non-linear models (SVM, Random Forest, XGBoost) against the linear baseline using nested cross-validation to identify the optimal classifier for each side.
5. Interpretation & Diagnosis
Use SHAP values, Waterfall plots, and Prediction Confidence analysis to explain model decisions and diagnose specific misclassification patterns (e.g., the IGL-Centrality phenomenon).
1. Setup and Experimental Design¶
TL;DR: Configure the analysis environment, load the processed dataset, define the cross-validation strategy, specify feature sets for ablation, and initialise the scaling pipeline.
Methodology Notes
Cross-Validation Strategy:
- 4 splits × 20 repeats = 80 total folds provides robust performance estimates while maintaining reasonable computational cost
- Stratified K-Fold ensures class balance is preserved in each fold, critical for small sample sizes
- The same CV splits will be reused across all models to enable fair comparison
Feature Set Definitions:
- Raw Features: All original behavioural (TAPD, OAP, PODT, POKT) and positional (ADNT, ADAT) metrics. Baseline with multicollinearity (r ≈ 0.92 between ADNT and ADAT on T-side, r ≈ 0.78 on CT-side).
- Orthogonal Features: Uses ADNT + adat_residual (orthogonal pair, ρ ≈ 0.00). Replaces ADAT with its residual whilst keeping ADNT, removing multicollinearity between positional features yet retaining both positioning dimensions.
- Residual Features: Uses only adat_residual (drops ADNT entirely). Tests whether the residual alone is sufficient for role discrimination.
- Full Features: All behavioural features + all three positioning features (ADNT, ADAT, adat_residual). Tests whether tree-based models can leverage multicollinearity better than linear models.
- Feature sets are side-specific: T-side feature sets use the `*_t` suffix (e.g., `tapd_t`, `adnt_t`, `adat_residual_t`), CT-side feature sets use the `*_ct` suffix (e.g., `tapd_ct`, `adnt_ct`, `adat_residual_ct`).
Scaling Pipeline:
- StandardScaler will be applied within cross-validation loops to prevent data leakage
- Scaling is fit only on training folds and applied to validation folds
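A minimal sketch of this leakage-safe pattern, using synthetic stand-ins (`X_demo`, `y_demo`) for the feature matrix and role labels loaded below:

```python
# Leakage-safe scaling: StandardScaler sits inside the Pipeline, so each CV fold
# fits the scaler on its own training split and only transforms the test split.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(84, 6))                        # stand-in for 6 features
y_demo = np.array(["Lurker", "Spacetaker", "AWPer"] * 28)  # stand-in role labels

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=20, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="f1_macro")
print(f"F1-Macro: {scores.mean():.3f} ± {scores.std():.3f}")
```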
Side-Specific Modelling:
- All models will be run for both T-side and CT-side separately
- T-side models use T-side features (`*_t`) and predict `role_t`
- CT-side models use CT-side features (`*_ct`) and predict `role_ct`
- This preserves tactical context and enables side-specific feature set selection and model comparison
# === Setup: paths, imports, theme ===
from pathlib import Path
import sys
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import joblib
# sklearn imports
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
# Dev convenience
%load_ext autoreload
%autoreload 2
# Display options
pd.set_option("display.max_columns", 120)
pd.set_option("display.width", 120)
# Resolve repository root (if in notebooks/, step up one level)
REPO_ROOT = Path.cwd().resolve().parent if Path.cwd().name.lower() == "notebooks" else Path.cwd().resolve()
# Canonical paths
DATA_DIR = REPO_ROOT / "data"
RESULTS_DIR = REPO_ROOT / "results" / "classification"
FIG_DIR = RESULTS_DIR / "figures"
TAB_DIR = RESULTS_DIR / "tables"
MODEL_DIR = RESULTS_DIR / "models"
# Ensure results dirs exist
FIG_DIR.mkdir(parents=True, exist_ok=True)
MODEL_DIR.mkdir(parents=True, exist_ok=True)
TAB_DIR.mkdir(parents=True, exist_ok=True)
# Dataset path
DATA_PATH = DATA_DIR / "processed" / "cs2_playstyles_2024_with_residuals.parquet"
assert DATA_PATH.exists(), f"Dataset not found at {DATA_PATH}"
# Local helpers
SRC_DIR = REPO_ROOT / "src"
if str(SRC_DIR) not in sys.path:
sys.path.insert(0, str(SRC_DIR))
from style import (
set_mpl_theme,
set_seaborn_theme,
ROLE_COLOURS,
get_role_colour,
)
# Import classification utilities
from classification_utils import (
get_feature_sets,
evaluate_baseline_models,
summarise_feature_set_results,
fit_and_visualise_logreg,
evaluate_per_class_metrics,
evaluate_model_cv,
plot_confusion_matrices_comparison,
plot_model_stability_boxplots,
prepare_classification_data,
run_sensitivity_analysis,
run_model_tuning,
compile_model_leaderboard,
save_champion_model,
select_and_save_champion_models,
compare_rf_feature_importance,
plot_prediction_confidence,
plot_shap_beeswarm_grid,
prepare_champion_data,
plot_single_player_waterfall,
plot_comparison_waterfall,
get_player_percentiles,
get_repeated_cv_predictions,
prepare_interpretation_model,
plot_igl_feature_distribution,
)
# Themes
set_mpl_theme(mode="dark", preferred_font="Georgia")
set_seaborn_theme(mode="dark", preferred_font="Georgia")
# Echo key paths
REPO_ROOT, DATA_PATH, FIG_DIR, TAB_DIR
(WindowsPath('P:/cs2-playstyle-analysis-2024'),
WindowsPath('P:/cs2-playstyle-analysis-2024/data/processed/cs2_playstyles_2024_with_residuals.parquet'),
WindowsPath('P:/cs2-playstyle-analysis-2024/results/classification/figures'),
WindowsPath('P:/cs2-playstyle-analysis-2024/results/classification/tables'))
Load Dataset¶
Load the processed dataset with engineered residual features and verify role distributions.
# Load processed dataset
df = pd.read_parquet(DATA_PATH)
# After loading the dataset
MIN_MAPS = 40
df = df[df['map_count'] >= MIN_MAPS].copy()
print(f"After filtering (MIN_MAPS={MIN_MAPS}): {len(df)} players")
# --- DELIBERATE ERROR INJECTION FOR DIAGNOSTIC DEMONSTRATION ---
# We deliberately inject known mislabels for HooXi and Spinx to demonstrate the model's
# diagnostic capability (identifying "confident errors" in the analysis later).
# In the EDA notebook, these are correct (Mixed and Rotator respectively), but we revert
# them here to the original "noisy" labels found during the project.
mask_hooxi = df['player_name'] == 'HooXi'
mask_spinx = df['player_name'] == 'Spinx'
df.loc[mask_hooxi, 'role_ct'] = 'Anchor' # True role: Mixed
df.loc[mask_spinx, 'role_ct'] = 'Anchor' # True role: Rotator
print("\n> [DIAGNOSTIC SETUP] Injected deliberate CT-role errors for HooXi (Mixed->Anchor) and Spinx (Rotator->Anchor).")
# ---------------------------------------------------------------
# Structural checks
print("Shape:", df.shape)
display(df.head(3))
# Verify expected columns exist (especially residual features)
expected_residuals = ['adat_residual_t', 'adat_residual_ct']
assert all(col in df.columns for col in expected_residuals), "Missing residual features"
# Quick overview of role distribution
print("\nRole distribution (T-side):")
display(df['role_t'].value_counts())
print("\nRole distribution (CT-side):")
display(df['role_ct'].value_counts())
After filtering (MIN_MAPS=40): 84 players

> [DIAGNOSTIC SETUP] Injected deliberate CT-role errors for HooXi (Mixed->Anchor) and Spinx (Rotator->Anchor).

Shape: (84, 27)
| steamid | player_name | team_clan_name | map_count | tapd_ct | tapd_t | tapd_overall | oap_ct | oap_t | oap_overall | podt_ct | podt_t | podt_overall | pokt_ct | pokt_t | pokt_overall | adnt_rank_ct | adnt_rank_t | adnt_rank_overall | adat_rank_ct | adat_rank_t | adat_rank_overall | role_overall | role_t | role_ct | adat_residual_t | adat_residual_ct | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 76561198041683378 | NiKo | G2 Esports | 158 | 60.952893 | 59.136540 | 60.136000 | 24.745965 | 24.093423 | 24.424242 | 21.020276 | 24.857741 | 22.507740 | 17.051295 | 21.586555 | 19.197995 | 0.562493 | 0.621199 | 0.593089 | 0.547525 | 0.695336 | 0.616046 | Lurker | Spacetaker | Rotator | 0.074668 | -0.014643 |
| 1 | 76561198012872053 | huNter | G2 Esports | 158 | 62.048685 | 62.871661 | 62.589800 | 16.852540 | 14.807692 | 15.847511 | 21.585198 | 27.994772 | 24.696747 | 17.195516 | 27.180894 | 22.538284 | 0.480859 | 0.643004 | 0.571875 | 0.406082 | 0.615021 | 0.455108 | Flex | Lurker | Rotator | -0.026997 | -0.073699 |
| 2 | 76561198074762801 | m0NESY | G2 Esports | 155 | 62.786553 | 66.632594 | 64.362519 | 23.914373 | 17.754078 | 20.873335 | 19.122381 | 23.094640 | 21.473108 | 17.397469 | 26.423178 | 21.056274 | 0.577785 | 0.423733 | 0.453028 | 0.515617 | 0.409889 | 0.427645 | AWPer | AWPer | AWPer | -0.017438 | -0.061984 |
Role distribution (T-side):
role_t
Spacetaker     31
Lurker         24
AWPer          17
Half-Lurker    12
Name: count, dtype: int64
Role distribution (CT-side):
role_ct
Rotator    27
Anchor     23
AWPer      17
Mixed      17
Name: count, dtype: int64
Cross-Validation Strategy¶
TL;DR: Define a 4-split, 20-repeat Stratified K-Fold strategy (80 total folds) that will be reused across all models for fair comparison.
Why This Strategy?
The small sample size (N=84) requires a rigorous CV strategy to obtain reliable performance estimates. Using 4 splits provides reasonable training set sizes (~63 samples in each training fold, ~21 samples in each test fold) while 20 repeats ensures statistical stability through multiple independent evaluations. Stratified sampling preserves class balance across all folds.
# === Cross-Validation Strategy ===
# Parameters
N_SPLITS = 4
N_REPEATS = 20
RANDOM_STATE = 42
# Create CV strategy (will be used for all models)
# Note: Will need to specify which side (T or CT) and target column when fitting
cv_strategy = RepeatedStratifiedKFold(
n_splits=N_SPLITS,
n_repeats=N_REPEATS,
random_state=RANDOM_STATE
)
print(f"Cross-validation strategy: {N_SPLITS} splits × {N_REPEATS} repeats = {N_SPLITS * N_REPEATS} total folds per side")
Cross-validation strategy: 4 splits × 20 repeats = 80 total folds per side
Feature Sets¶
TL;DR: Define four feature sets for ablation testing: Raw (baseline), Orthogonal (ADNT + adat_residual, no positional multicollinearity), Residuals (adat_residual only), and Full (all positioning features).
Feature Set Definitions
Raw Features: All original behavioural and positional metrics. Includes both ADNT and ADAT despite their high correlation (r ≈ 0.92 on T-side, r ≈ 0.78 on CT-side). Baseline with multicollinearity among positional features (ADNT/ADAT).
Orthogonal Features: Uses ADNT + adat_residual (orthogonal pair, ρ ≈ 0.00). Eliminates multicollinearity while retaining both positioning dimensions. Tests whether removing positional multicollinearity improves model stability.
Residual Features: Uses only adat_residual (drops ADNT entirely). Tests whether the residual alone is sufficient, or if ADNT provides additional signal.
Full Features: All behavioural + all three positioning features (ADNT, ADAT, adat_residual). Tests whether including both ADAT and ADAT_residual produces meaningful performance improvement. Tree-based models may handle multicollinearity better than linear models.
Side-Specific Implementation: Feature sets are generated separately for T-side (using *_t suffix) and CT-side (using *_ct suffix). All four sets will be evaluated for both sides.
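The real helper lives in `src/classification_utils.py`; based on the sets printed below, a plausible reconstruction looks like this (hypothetical, for illustration only):

```python
# Hypothetical sketch of get_feature_sets (the actual implementation is in
# src/classification_utils.py); the returned lists match the printed output below.
def get_feature_sets_sketch(side: str) -> dict:
    behavioural = [f"{m}_{side}" for m in ("tapd", "oap", "podt", "pokt")]
    return {
        "Raw": behavioural + [f"adnt_rank_{side}", f"adat_rank_{side}"],
        "Orthogonal": behavioural + [f"adnt_rank_{side}", f"adat_residual_{side}"],
        "Residuals": behavioural + [f"adat_residual_{side}"],
        "Full": behavioural + [f"adnt_rank_{side}", f"adat_rank_{side}", f"adat_residual_{side}"],
    }
```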
# === Feature Sets for Ablation Study ===
# Generate feature sets for both sides using modelling utility
FEATURE_SETS_T = get_feature_sets('t')
FEATURE_SETS_CT = get_feature_sets('ct')
# Store in a nested dictionary for easy access
FEATURE_SETS = {
't': FEATURE_SETS_T,
'ct': FEATURE_SETS_CT
}
# Define ambiguous roles
EXCLUDED_ROLES = {'t': 'Half-Lurker', 'ct': 'Mixed'}
# Display feature sets for both sides
print("=" * 60)
print("T-SIDE FEATURE SETS")
print("=" * 60)
for name, features in FEATURE_SETS_T.items():
print(f"\n{name} Features ({len(features)} features):")
print(features)
print("\n" + "=" * 60)
print("CT-SIDE FEATURE SETS")
print("=" * 60)
for name, features in FEATURE_SETS_CT.items():
print(f"\n{name} Features ({len(features)} features):")
print(features)
============================================================
T-SIDE FEATURE SETS
============================================================

Raw Features (6 features):
['tapd_t', 'oap_t', 'podt_t', 'pokt_t', 'adnt_rank_t', 'adat_rank_t']

Orthogonal Features (6 features):
['tapd_t', 'oap_t', 'podt_t', 'pokt_t', 'adnt_rank_t', 'adat_residual_t']

Residuals Features (5 features):
['tapd_t', 'oap_t', 'podt_t', 'pokt_t', 'adat_residual_t']

Full Features (7 features):
['tapd_t', 'oap_t', 'podt_t', 'pokt_t', 'adnt_rank_t', 'adat_rank_t', 'adat_residual_t']

============================================================
CT-SIDE FEATURE SETS
============================================================

Raw Features (6 features):
['tapd_ct', 'oap_ct', 'podt_ct', 'pokt_ct', 'adnt_rank_ct', 'adat_rank_ct']

Orthogonal Features (6 features):
['tapd_ct', 'oap_ct', 'podt_ct', 'pokt_ct', 'adnt_rank_ct', 'adat_residual_ct']

Residuals Features (5 features):
['tapd_ct', 'oap_ct', 'podt_ct', 'pokt_ct', 'adat_residual_ct']

Full Features (7 features):
['tapd_ct', 'oap_ct', 'podt_ct', 'pokt_ct', 'adnt_rank_ct', 'adat_rank_ct', 'adat_residual_ct']
Setup Complete
- Dataset loaded: 84 players with all expected residual features
- CV strategy defined: 80 total folds (4 splits × 20 repeats)
- Four feature sets prepared for ablation (Raw, Orthogonal, Residuals, Full) for both sides
- Scaling pipeline ready (applied within CV loops to prevent data leakage)
Ready for baseline modelling.
2. Baseline Models & Feature Selection¶
TL;DR: Establish baselines using Dummy and Logistic Regression across all feature sets. Select Orthogonal feature set for parsimony. Diagnose per-class performance, then exclude ambiguous roles to test core role separability. Visualise feature–role associations from the filtered dataset.
Methodology Details
Baseline Evaluation:
- Models: Dummy Classifier (stratified baseline) vs. Logistic Regression (linear baseline)
- Feature Sets: Four sets evaluated to handle multicollinearity between ADNT and ADAT:
- Raw: All features (high multicollinearity)
- Orthogonal: ADNT + adat_residual (uncorrelated positioning dimensions)
- Residuals: adat_residual only (tests if ADNT is redundant)
- Full: All features (tests if models handle multicollinearity well)
- Metric: F1-Macro (mean of per-class F1 scores) to handle class imbalance (see the worked micro-example after this list)
Sensitivity Analysis:
- Per-class metrics reveal that ambiguous roles (Half-Lurker on T-side, Mixed on CT-side) have poor performance, suggesting they blur boundaries between core playstyles
- We re-evaluate after excluding these roles to test whether the model struggles due to weak features or simply because these roles are inherently fuzzy
- Coefficient Visualisation: We visualise feature weights from the filtered dataset to see the clearest signal of what defines each core role
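For concreteness, a worked micro-example of the F1-Macro metric with hypothetical labels:

```python
# F1-Macro = unweighted mean of per-class F1, so minority roles weigh
# as much as majority ones.
from sklearn.metrics import f1_score

y_true = ["Lurker", "Lurker", "Lurker", "AWPer", "Spacetaker", "Spacetaker"]
y_pred = ["Lurker", "Lurker", "Spacetaker", "AWPer", "Spacetaker", "Lurker"]
# Per-class F1: AWPer = 1.00, Lurker = 0.67, Spacetaker = 0.50 -> macro ≈ 0.72
print(f1_score(y_true, y_pred, average="macro"))  # 0.722...
```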
Baseline Models & Feature Selection¶
Run baseline models across all feature sets for both sides, then select the optimal feature set.
# === Baseline Models: Run & Display Results ===
baseline_results = {}
for side in ['t', 'ct']:
print(f"Running baselines for {side.upper()}-side...")
baseline_res = evaluate_baseline_models(
df=df,
side=side,
feature_sets_dict=FEATURE_SETS[side],
cv_strategy=cv_strategy
)
baseline_results[side] = baseline_res
# Save raw results
baseline_res[['model', 'feature_set', 'mean_f1', 'std_f1']].to_csv(
TAB_DIR / f"baseline_results_{side}.csv", index=False
)
# Display full comparison tables for both sides
print("\n" + "=" * 60)
print("FEATURE SET COMPARISON (Logistic Regression F1-Macro)")
print("=" * 60)
for side in ['t', 'ct']:
comparison = summarise_feature_set_results(baseline_results[side], side=side)
print(f"\n{side.upper()}-Side:")
display(comparison)
# Combined comparison for saving
comparison_t = summarise_feature_set_results(baseline_results['t'], side='t')
comparison_ct = summarise_feature_set_results(baseline_results['ct'], side='ct')
comparison_df = pd.concat([comparison_t, comparison_ct], ignore_index=True)
comparison_df.to_csv(TAB_DIR / "feature_set_comparison.csv", index=False)
print(f"\nComparison table saved to: {TAB_DIR / 'feature_set_comparison.csv'}")
Running baselines for T-side...
Running baselines for CT-side...

============================================================
FEATURE SET COMPARISON (Logistic Regression F1-Macro)
============================================================

T-Side:
| Side | Model | Feature_Set | Mean_F1_Macro | Std_F1_Macro | Mean_Accuracy | Std_Accuracy | |
|---|---|---|---|---|---|---|---|
| 0 | T-Side | LogisticRegression | Full | 0.693830 | 0.088836 | 0.770833 | 0.068633 |
| 1 | T-Side | LogisticRegression | Raw | 0.685843 | 0.085449 | 0.767857 | 0.069160 |
| 2 | T-Side | LogisticRegression | Orthogonal | 0.674325 | 0.074047 | 0.763690 | 0.063640 |
| 3 | T-Side | LogisticRegression | Residuals | 0.466390 | 0.085475 | 0.579762 | 0.083427 |
| 4 | T-Side | Dummy | Orthogonal | 0.253056 | 0.110655 | 0.305357 | 0.110322 |
| 5 | T-Side | Dummy | Raw | 0.253056 | 0.110655 | 0.305357 | 0.110322 |
| 6 | T-Side | Dummy | Residuals | 0.253056 | 0.110655 | 0.305357 | 0.110322 |
| 7 | T-Side | Dummy | Full | 0.253056 | 0.110655 | 0.305357 | 0.110322 |
CT-Side:
| Side | Model | Feature_Set | Mean_F1_Macro | Std_F1_Macro | Mean_Accuracy | Std_Accuracy | |
|---|---|---|---|---|---|---|---|
| 0 | CT-Side | LogisticRegression | Raw | 0.508231 | 0.082315 | 0.564881 | 0.075196 |
| 1 | CT-Side | LogisticRegression | Full | 0.501163 | 0.085292 | 0.556548 | 0.084714 |
| 2 | CT-Side | LogisticRegression | Orthogonal | 0.495041 | 0.078242 | 0.554167 | 0.079769 |
| 3 | CT-Side | LogisticRegression | Residuals | 0.400822 | 0.077293 | 0.454167 | 0.082115 |
| 4 | CT-Side | Dummy | Orthogonal | 0.255153 | 0.090947 | 0.283333 | 0.092490 |
| 5 | CT-Side | Dummy | Raw | 0.255153 | 0.090947 | 0.283333 | 0.092490 |
| 6 | CT-Side | Dummy | Residuals | 0.255153 | 0.090947 | 0.283333 | 0.092490 |
| 7 | CT-Side | Dummy | Full | 0.255153 | 0.090947 | 0.283333 | 0.092490 |
Comparison table saved to: P:\cs2-playstyle-analysis-2024\results\classification\tables\feature_set_comparison.csv
Feature Set Selection: Baseline Analysis¶
Decision: Use Orthogonal for interpretation (coefficient analysis, sensitivity) but carry Raw, Orthogonal, and Full forward to the model comparison phase.
Detailed Analysis
1. Baseline Comparison
Logistic Regression substantially outperforms the stratified dummy classifier across all feature sets:
- T-Side: LogReg (0.674 F1) vs. Dummy (0.253 F1) — 166% improvement
- CT-Side: LogReg (0.495 F1) vs. Dummy (0.255 F1) — 94% improvement
These effect sizes (Cohen's d ≈ 5.7 for T-side, ≈ 3.1 for CT-side) confirm the features contain strong discriminative signal.
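For reference, these effect sizes are presumably the pooled-standard-deviation Cohen's d over the 80 fold-level F1 scores (or the paired variant using per-fold score differences, since both models share the same CV splits):

$$d = \frac{\bar{x}_{\text{LogReg}} - \bar{x}_{\text{Dummy}}}{s_p}, \qquad s_p = \sqrt{\tfrac{1}{2}\left(s_{\text{LogReg}}^2 + s_{\text{Dummy}}^2\right)}$$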
2. Feature Set Performance
Performance differences between Orthogonal, Raw, and Full are almost negligible:
- T-Side: Full (0.694) > Raw (0.686) > Orthogonal (0.674) — differences within 0.02 F1
- CT-Side: Raw (0.508) > Full (0.501) > Orthogonal (0.495) — differences within 0.013 F1
All three configurations show overlapping standard deviations, indicating no statistically meaningful advantage for Full or Raw.
3. Why Multiple Feature Sets for Model Comparison?
Although Orthogonal is preferred for interpretation, we retain all three sets for the advanced model phase because:
- Multicollinearity affects LogReg more than trees/SVMs: The coefficient instability caused by correlated features (ADNT ↔ ADAT, r ≈ 0.92) is a LogReg-specific issue. Tree-based models (XGBoost, RF) and SVMs are robust to multicollinearity and may extract additional signal from Raw/Full.
- Differences are within noise: With N=84, a 0.02 F1 difference corresponds to ~1–2 players. The sets are effectively tied, so committing to one risks leaving performance on the table.
- Minimal cost: Testing three feature sets adds negligible compute overhead and ensures we identify the optimal configuration for each model class.
4. Interpretation vs. Prediction
- Orthogonal will be used for the following sensitivity analysis and coefficient visualisation (Section 2) to ensure stable, interpretable feature–role associations.
- All three sets will be benchmarked in the model comparison (Section 3) to maximise understanding of predictive performance.
5. CT-Side Signal Limitations
CT-side performance remains modest (F1 ≈ 0.50) across all feature sets, suggesting the CT role taxonomy is harder to discriminate with our playstyle metrics alone. This likely reflects the situational nature of CT roles (e.g., map-specific assignments, dynamic rotations). Predictions should be treated as exploratory pending richer CT-specific features.
# === Feature Set Selection ===
feature_set_selections = {'t': 'Orthogonal', 'ct': 'Orthogonal'}
BEST_FEATURE_SET_T = feature_set_selections['t']
BEST_FEATURE_SET_CT = feature_set_selections['ct']
print(f"Selected: {feature_set_selections}")
Selected: {'t': 'Orthogonal', 'ct': 'Orthogonal'}
Per-Class Performance Diagnosis¶
Evaluate precision, recall, and F1 per role to identify which roles are harder/easier to predict.
Why Per-Class Metrics?
F1-Macro provides an overall performance summary but masks per-class differences. Per-class metrics reveal whether the model struggles with specific roles (e.g., minority classes or ambiguous definitions) and help identify precision–recall trade-offs.
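A minimal sketch of what `evaluate_per_class_metrics` plausibly computes (per-fold, per-class scores, then mean/std across the 80 folds; column names match the tables below):

```python
# Per-class precision/recall/F1 collected fold-by-fold, then averaged.
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.metrics import precision_recall_fscore_support
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def per_class_metrics_cv(clf, X, y, cv):
    labels = np.unique(y)
    folds = []
    for train_idx, test_idx in cv.split(X, y):
        pipe = make_pipeline(StandardScaler(), clone(clf))
        pipe.fit(X[train_idx], y[train_idx])
        p, r, f, _ = precision_recall_fscore_support(
            y[test_idx], pipe.predict(X[test_idx]), labels=labels, zero_division=0
        )
        folds.append(np.column_stack([p, r, f]))
    arr = np.stack(folds)  # shape: (n_folds, n_classes, 3)
    return pd.DataFrame({
        "role": labels,
        "precision_mean": arr[:, :, 0].mean(axis=0), "precision_std": arr[:, :, 0].std(axis=0),
        "recall_mean": arr[:, :, 1].mean(axis=0),    "recall_std": arr[:, :, 1].std(axis=0),
        "f1_mean": arr[:, :, 2].mean(axis=0),        "f1_std": arr[:, :, 2].std(axis=0),
    })
```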
# === Per-Class Metrics: Full Dataset (T & CT) ===
per_class_dfs = {}
full_f1_scores = {}
for side in ['t', 'ct']:
target_col = f'role_{side}'
feats = FEATURE_SETS[side][feature_set_selections[side]]
# Prepare Data
df_clean = df.dropna(subset=[target_col]).copy()
X = df_clean[feats].values
y = df_clean[target_col].values
# Evaluate
clf = LogisticRegression(max_iter=1000, random_state=42)
metrics = evaluate_per_class_metrics(clf, X, y, cv_strategy)
# Store F1-Macro for later comparison
cv_res = evaluate_model_cv(clf, X, y, cv_strategy)
full_f1_scores[side] = cv_res['mean_score']
# Store & Save
per_class_dfs[side] = metrics
metrics.to_csv(TAB_DIR / f"per_class_metrics_{side}.csv", index=False)
print(f"\n{side.upper()}-Side Per-Class Metrics (Full Dataset):")
# Show worst performing roles first to highlight ambiguous roles
display(metrics.sort_values('f1_mean', ascending=True))
print(f"Full table saved to: {TAB_DIR / f'per_class_metrics_{side}.csv'}")
T-Side Per-Class Metrics (Full Dataset):
| role | precision_mean | precision_std | recall_mean | recall_std | f1_mean | f1_std | |
|---|---|---|---|---|---|---|---|
| 1 | Half-Lurker | 0.253542 | 0.335552 | 0.183333 | 0.240947 | 0.196806 | 0.240360 |
| 3 | Spacetaker | 0.763843 | 0.107101 | 0.843527 | 0.137287 | 0.793073 | 0.094453 |
| 2 | Lurker | 0.795965 | 0.141438 | 0.883333 | 0.127475 | 0.824978 | 0.094273 |
| 0 | AWPer | 0.930179 | 0.114774 | 0.861250 | 0.173561 | 0.882445 | 0.124662 |
Full table saved to: P:\cs2-playstyle-analysis-2024\results\classification\tables\per_class_metrics_t.csv

CT-Side Per-Class Metrics (Full Dataset):
| role | precision_mean | precision_std | recall_mean | recall_std | f1_mean | f1_std | |
|---|---|---|---|---|---|---|---|
| 2 | Mixed | 0.186280 | 0.265918 | 0.106875 | 0.135264 | 0.126678 | 0.155462 |
| 0 | AWPer | 0.637991 | 0.214509 | 0.597500 | 0.245701 | 0.588312 | 0.190074 |
| 3 | Rotator | 0.604961 | 0.136919 | 0.689286 | 0.153239 | 0.631406 | 0.112602 |
| 1 | Anchor | 0.604894 | 0.142322 | 0.697083 | 0.182361 | 0.633769 | 0.122807 |
Full table saved to: P:\cs2-playstyle-analysis-2024\results\classification\tables\per_class_metrics_ct.csv
Key Finding
Ambiguous roles (Half-Lurker on T-side, Mixed on CT-side) show particularly poor F1 scores, suggesting they blur the boundaries between core playstyles rather than forming distinct archetypes. This justifies filtering them to test whether the model can better separate clearly-defined core roles.
Sensitivity Analysis: Core Role Separability¶
Exclude ambiguous roles to test whether the model struggles due to weak features or simply because these roles are inherently fuzzy. If performance improves substantially, it confirms these roles introduce systematic confusion rather than revealing feature limitations.
# === Sensitivity Analysis: Run & Display F1 Comparison ===
# Run sensitivity analysis using the helper function
sensitivity_results = {}
for side in ['t', 'ct']:
sensitivity_results[side] = run_sensitivity_analysis(
df=df,
side=side,
excluded_role=EXCLUDED_ROLES[side],
feature_names=FEATURE_SETS[side][feature_set_selections[side]],
cv_strategy=cv_strategy,
tab_dir=TAB_DIR
)
# Build and display F1 comparison table
comparison_rows = []
for side in ['t', 'ct']:
res = sensitivity_results[side]
comparison_rows.append({
'Side': f"{side.upper()}-Side",
'Excluded_Role': EXCLUDED_ROLES[side],
'N_Full': res['n_full'],
'N_Filtered': res['n_filtered'],
'F1_Full': res['full_f1'],
'F1_Filtered': res['filtered_f1'],
'Delta': res['delta']
})
print("=== Sensitivity Analysis: Impact of Excluding Ambiguous Roles ===\n")
impact_df = pd.DataFrame(comparison_rows)
display(impact_df)
# === Per-Class Metrics: Filtered Dataset ===
print("=== Per-Class Metrics (Core Roles Only) ===")
for side in ['t', 'ct']:
print(f"\n{side.upper()}-Side (excluding {EXCLUDED_ROLES[side]}):")
display(sensitivity_results[side]['per_class_filtered'].sort_values('f1_mean', ascending=False))
=== Sensitivity Analysis: Impact of Excluding Ambiguous Roles ===
| Side | Excluded_Role | N_Full | N_Filtered | F1_Full | F1_Filtered | Delta | |
|---|---|---|---|---|---|---|---|
| 0 | T-Side | Half-Lurker | 84 | 72 | 0.674325 | 0.921120 | 0.246794 |
| 1 | CT-Side | Mixed | 84 | 67 | 0.495041 | 0.670192 | 0.175150 |
=== Per-Class Metrics (Core Roles Only) ===

T-Side (excluding Half-Lurker):
| role | precision_mean | precision_std | recall_mean | recall_std | f1_mean | f1_std | |
|---|---|---|---|---|---|---|---|
| 1 | Lurker | 0.908631 | 0.097040 | 0.950000 | 0.080795 | 0.923538 | 0.061559 |
| 0 | AWPer | 0.962292 | 0.082653 | 0.899375 | 0.139305 | 0.922455 | 0.094451 |
| 2 | Spacetaker | 0.932222 | 0.075045 | 0.912723 | 0.099624 | 0.917367 | 0.064466 |
CT-Side (excluding Mixed):
| role | precision_mean | precision_std | recall_mean | recall_std | f1_mean | f1_std | |
|---|---|---|---|---|---|---|---|
| 1 | Anchor | 0.807009 | 0.124657 | 0.760833 | 0.144316 | 0.769876 | 0.092408 |
| 2 | Rotator | 0.651553 | 0.158925 | 0.687202 | 0.186215 | 0.654556 | 0.137992 |
| 0 | AWPer | 0.655898 | 0.230986 | 0.575000 | 0.227211 | 0.586142 | 0.190224 |
Result
Excluding ambiguous roles improves F1-Macro substantially (Δ ≈ +0.25 for T-side, +0.18 for CT-side), confirming these roles introduce systematic confusion. The selected features can effectively distinguish clearly-defined core roles. Per-class performance generally improves, reaching ~0.92 F1 for T-side AWPers and Lurkers.
# === T-Side Confusion Matrices: Full vs Filtered ===
side = 't'
feats = FEATURE_SETS[side][feature_set_selections[side]]
clf = LogisticRegression(max_iter=1000, random_state=42)
plot_confusion_matrices_comparison(
df=df,
side=side,
feature_names=feats,
excluded_role=EXCLUDED_ROLES[side],
model=clf,
cv_strategy=cv_strategy,
fig_dir=FIG_DIR
)
plt.show()
# === CT-Side Confusion Matrices: Full vs Filtered ===
side = 'ct'
feats = FEATURE_SETS[side][feature_set_selections[side]]
clf = LogisticRegression(max_iter=1000, random_state=42)
plot_confusion_matrices_comparison(
df=df,
side=side,
feature_names=feats,
excluded_role=EXCLUDED_ROLES[side],
model=clf,
cv_strategy=cv_strategy,
fig_dir=FIG_DIR
)
plt.show()
Observations
CT-side AWPers are the least distinct core role, with ~39% misclassified as Rotators even after excluding Mixed. This confusion is intrinsic—removing the ambiguous class barely changes AWPer recall (59%→57%). CT AWPers likely share rotational positioning behaviour that our features cannot separate.
Ambiguous roles skew toward specific archetypes: Half-Lurkers are predicted as Spacetakers (49%) more than Lurkers (30%), suggesting more aggressive tendencies than their name may suggest. Mixed players lean toward Anchor (48%) over Rotator (29%).
T-side benefits more from filtering: Lurker recall jumps from 88% to 95% when Half-Lurker is removed, indicating Half-Lurkers were "absorbing" correct Lurker predictions. CT-side gains are more modest.
Confusion asymmetry reveals role distinctiveness: Rotators are rarely misclassified as AWPers (13%), but AWPers are often misclassified as Rotators (32%). This suggests Rotators have a more unique behavioural signature while some AWPers exhibit rotator-like flexibility.
Model Interpretation: Feature–Role Associations¶
Visualise Logistic Regression coefficients from the filtered dataset (core roles only) to reveal the clearest signal of what defines each archetype.
# === Coefficient Visualisation: Core Roles (Filtered Dataset) ===
# Generate coefficient plots for filtered dataset (core roles only)
for side in ['t', 'ct']:
excluded_role = EXCLUDED_ROLES[side]
df_filtered = df[df[f'role_{side}'] != excluded_role].copy()
print(f"\nGenerating {side.upper()}-side coefficients (excluding {excluded_role})...")
_, _, fig = fit_and_visualise_logreg(
df=df_filtered,
side=side,
feature_set_name=feature_set_selections[side],
feature_names=FEATURE_SETS[side][feature_set_selections[side]],
fig_dir=FIG_DIR,
tab_dir=TAB_DIR,
suffix='_filtered'
)
plt.show()
print(f" Coefficients saved to: {TAB_DIR / f'logreg_coefficients_{side}_filtered.csv'}")
Generating T-side coefficients (excluding Half-Lurker)...
  Coefficients saved to: P:\cs2-playstyle-analysis-2024\results\classification\tables\logreg_coefficients_t_filtered.csv

Generating CT-side coefficients (excluding Mixed)...
  Coefficients saved to: P:\cs2-playstyle-analysis-2024\results\classification\tables\logreg_coefficients_ct_filtered.csv
Key Findings
Coefficients reveal clear behavioural signatures:
- Spacetakers (T): High `OAP` (opening attempts) and low `POKT` (trade kills)
- Lurkers (T): High `ADNT` (isolation) and high `POKT` (trade kills)
- AWPers (T): High `TAPD` (time alive), low `ADNT` (positioning very close to teammates)
- Anchors (CT): High `ADNT` and `ADAT` residual, positioning very far from teammates
- Rotators (CT): Low `ADNT` (close to teammates) and low `TAPD` (time alive)
- AWPers (CT): High `TAPD` (time alive), relatively high `OAP` (due to "pick" kills)
These associations generally align with domain knowledge and findings from the EDA, validating the selected features.
Summary & Interpretations¶
Key Takeaways:
- Feature Selection: No clear winner among Raw, Orthogonal, and Full feature sets (differences < 0.02 F1). Orthogonal used for interpretation due to coefficient stability; all three carried forward for model comparison.
- Sensitivity Analysis: Excluding ambiguous roles improves F1-Macro by ≈+0.25 (T) and ≈+0.18 (CT), confirming core roles are relatively well-separated. We continue excluding ambiguous roles.
- Feature Importance: Coefficients reveal clear behavioural signatures (e.g. Lurkers = isolation, Spacetakers = opening attempts)
- Data Constraints: With only 84 players (~4–5 per role per CV fold), results are exploratory and should be validated with larger cohorts
Detailed Interpretations
Baseline Performance & Feature Selection:
- Logistic Regression substantially outperforms Dummy Classifier, indicating features contain meaningful signal
- Performance differences between Raw, Orthogonal, and Full are negligible (within 0.02 F1, ~1 player difference at N=84)
- Orthogonal used for interpretation (sensitivity analysis, coefficients) because multicollinearity destabilises LogReg coefficients
- All three feature sets will be tested with advanced models, since tree-based methods (RF, XGBoost) and SVMs are robust to multicollinearity
- CT-side results remain modest (F1 ≈ 0.50). Treat predictions as exploratory pending richer features
Core vs. Ambiguous Roles:
- Per-class diagnosis revealed that ambiguous roles (Half-Lurker on T-side, Mixed on CT-side) are very hard to classify and blur boundaries between core playstyles
- Sensitivity analysis confirms these roles introduce systematic confusion: excluding them improves F1-Macro, validating the model can effectively distinguish clearly-defined core roles
- The performance improvement (Δ) quantifies the impact and confirms these are mixed states rather than distinct archetypes
Feature–Role Associations:
- Coefficient visualisation reflects clear behavioural signatures, analogous to findings in EDA (01_eda.ipynb): e.g.
  - Spacetakers (T): High `OAP` (entry attempts)
  - Lurkers (T): High `ADNT` (isolation)
  - Anchors vs. Rotators (CT): Distinguished by positioning metrics (packing density and mobility)
- These associations align with domain knowledge, validating the features and confirming that the model has picked up on genuine playstyle differences.
Next Steps: The logistic regression baseline is strong (~0.69 F1 for T-side), but we proceed to Section 3 to test whether non-linear models (SVM, Random Forest, XGBoost) can improve performance. All three feature sets (Raw, Orthogonal, Full) will be benchmarked since multicollinearity—which affects LogReg coefficient interpretation—is not a concern for tree-based models.
3. Advanced Model Comparison¶
TL;DR: Evaluate SVM, Random Forest, and XGBoost using Nested Cross-Validation. Compare performance across feature sets (Raw, Orthogonal, Full) to identify the champion model for further analysis.
Methodology: Nested Cross-Validation
Why Nested CV?
With only 84 players, we cannot afford a separate holdout set for hyperparameter tuning. If we tuned hyperparameters on the same data used for evaluation, we would overfit to noise. Nested CV solves this:
- Outer Loop (Evaluation): Our existing 4-split × 20-repeat strategy (80 folds). This measures generalisation performance.
- Inner Loop (Tuning): Inside each outer training fold (~63 players), we run a 3-fold GridSearchCV to select the best hyperparameters.
The Flow:
- Outer loop splits data into Train (63) and Test (21).
- Inner loop splits Train into Inner-Train (42) and Inner-Val (21), tests hyperparameter combinations.
- Best hyperparameters are selected, model is refit on full Train (63).
- Performance is evaluated on Test (21).
- Repeat 80 times; average scores give unbiased performance estimate.
Hyperparameter Strategy:
We search broad, sensible ranges (orders of magnitude) rather than fine-grained values. With small data, finding a "stable region" matters more than pinpointing an exact optimum.
Feature Sets:
We test Raw, Orthogonal, and Full for each model. Tree-based models (RF, XGBoost) and SVMs are robust to multicollinearity, so they may extract additional signal from correlated features.
Note on Reported Hyperparameters:
Performance metrics (F1-Macro, standard deviation) are derived from the 80-fold Nested CV to ensure unbiased evaluation. The "Best Parameters" reported for each configuration (model + feature set + side) are derived from a final grid search run on the full filtered dataset for that side. This is standard practice: Nested CV validates the search strategy and provides unbiased performance estimates, while the final fit on all available data identifies the optimal hyperparameters for each configuration.
# === Section 3 Setup ===
# Feature sets to evaluate (excluding 'Residuals' as per plan)
FEATURE_SETS_TO_TEST = ['Raw', 'Orthogonal', 'Full']
print("Section 3 setup complete. Ready for model evaluation.")
Section 3 setup complete. Ready for model evaluation.
3.1 Support Vector Machine (SVM)¶
Why SVM? SVMs excel with small datasets and high-dimensional feature spaces. The kernel trick allows them to find non-linear decision boundaries without explicit feature engineering. With balanced class weights, they handle imbalanced classes gracefully.
Hyperparameter Grid:
- `C` (Regularisation): [0.1, 1, 10, 100] — Controls the trade-off between margin width and misclassification. Lower C = smoother boundary.
- `kernel`: ['linear', 'rbf'] — Linear for simple boundaries, RBF for complex non-linear patterns.
- `gamma`: ['scale', 'auto'] — RBF kernel spread. 'scale' is generally preferred.
- `class_weight`: ['balanced'] — Adjusts weights inversely proportional to class frequencies.
- `probability`: [True] — Enables probability estimates (needed for later calibration/SHAP).
# === SVM: Define Grid & Run ===
svm_param_grid = {
'C': [0.1, 1, 10, 100],
'kernel': ['linear', 'rbf'],
'gamma': ['scale', 'auto'],
'class_weight': ['balanced'],
'probability': [True]
}
print("Running SVM with Nested CV...")
print(f"Grid size: {np.prod([len(v) for v in svm_param_grid.values()])} combinations per inner CV")
svm_results = run_model_tuning(
model_class=SVC,
param_grid=svm_param_grid,
model_name='SVM',
df=df,
feature_sets=FEATURE_SETS,
feature_set_names=FEATURE_SETS_TO_TEST,
cv_strategy=cv_strategy,
excluded_roles=EXCLUDED_ROLES
)
print("\nSVM Results:")
display(svm_results.round(3))
# Save results
svm_results.to_csv(TAB_DIR / "model_results_svm.csv", index=False)
Running SVM with Nested CV...
Grid size: 16 combinations per inner CV

SVM Results:
| Side | Model | Feature_Set | Mean_F1 | Std_F1 | Mean_Accuracy | Std_Accuracy | Mean_Train_F1 | Std_Train_F1 | Mean_Fit_Time | Best_Params | All_Scores | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T-Side | SVM | Raw | 0.874 | 0.066 | 0.872 | 0.065 | 0.961 | 0.025 | 0.144 | {'C': 1, 'class_weight': 'balanced', 'gamma': ... | [1.0, 0.8766884531590414, 0.8384208384208384, ... |
| 1 | T-Side | SVM | Orthogonal | 0.913 | 0.053 | 0.910 | 0.054 | 0.962 | 0.017 | 0.060 | {'C': 0.1, 'class_weight': 'balanced', 'gamma'... | [0.9407407407407407, 0.9027777777777778, 0.889... |
| 2 | T-Side | SVM | Full | 0.907 | 0.056 | 0.903 | 0.057 | 0.965 | 0.020 | 0.063 | {'C': 0.1, 'class_weight': 'balanced', 'gamma'... | [1.0, 0.8303872053872053, 0.8384208384208384, ... |
| 3 | CT-Side | SVM | Raw | 0.704 | 0.098 | 0.711 | 0.095 | 0.867 | 0.060 | 0.068 | {'C': 1, 'class_weight': 'balanced', 'gamma': ... | [0.5285714285714286, 0.810966810966811, 0.7146... |
| 4 | CT-Side | SVM | Orthogonal | 0.693 | 0.093 | 0.699 | 0.092 | 0.868 | 0.062 | 0.070 | {'C': 1, 'class_weight': 'balanced', 'gamma': ... | [0.6813186813186812, 0.7523809523809524, 0.714... |
| 5 | CT-Side | SVM | Full | 0.708 | 0.091 | 0.715 | 0.089 | 0.875 | 0.058 | 0.069 | {'C': 0.1, 'class_weight': 'balanced', 'gamma'... | [0.6794871794871794, 0.810966810966811, 0.7146... |
3.2 Random Forest¶
Why Random Forest? Ensemble of decision trees that reduces overfitting through bagging and feature randomisation. Robust to outliers and handles non-linear relationships well. Provides built-in feature importance estimates.
Hyperparameter Grid (Regularised for Small Data):
- `n_estimators`: [100] — Fixed at 100 trees (sufficient for N=84).
- `max_depth`: [2, 3, 5] — Shallow trees to prevent memorising the ~63 training samples per fold.
- `min_samples_leaf`: [5, 8, 10] — Higher values force generalisation; with ~16 samples per class, leaves must represent broader patterns.
- `max_features`: ['sqrt'] — Standard choice for classification.
- `class_weight`: ['balanced'] — Adjusts for class imbalance.
# === Random Forest: Define Grid & Run (Regularised) ===
rf_param_grid = {
'n_estimators': [100],
'max_depth': [2, 3, 5], # Shallower to prevent overfitting
'min_samples_leaf': [5, 8, 10], # Higher values force generalisation
'max_features': ['sqrt'],
'class_weight': ['balanced']
}
print("Running Random Forest with Nested CV (Regularised Grid)...")
print(f"Grid size: {np.prod([len(v) for v in rf_param_grid.values()])} combinations per inner CV")
rf_results = run_model_tuning(
model_class=RandomForestClassifier,
param_grid=rf_param_grid,
model_name='RandomForest',
df=df,
feature_sets=FEATURE_SETS,
feature_set_names=FEATURE_SETS_TO_TEST,
cv_strategy=cv_strategy,
excluded_roles=EXCLUDED_ROLES
)
print("\nRandom Forest Results:")
display(rf_results.round(3))
# Save results
rf_results.to_csv(TAB_DIR / "model_results_rf.csv", index=False)
Running Random Forest with Nested CV (Regularised Grid)...
Grid size: 9 combinations per inner CV

Random Forest Results:
| Side | Model | Feature_Set | Mean_F1 | Std_F1 | Mean_Accuracy | Std_Accuracy | Mean_Train_F1 | Std_Train_F1 | Mean_Fit_Time | Best_Params | All_Scores | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T-Side | RandomForest | Raw | 0.822 | 0.091 | 0.822 | 0.088 | 0.954 | 0.027 | 0.573 | {'class_weight': 'balanced', 'max_depth': 3, '... | [0.8857142857142857, 0.9500891265597149, 0.773... |
| 1 | T-Side | RandomForest | Orthogonal | 0.813 | 0.103 | 0.813 | 0.101 | 0.957 | 0.035 | 0.547 | {'class_weight': 'balanced', 'max_depth': 2, '... | [0.8363636363636363, 0.8962962962962964, 0.824... |
| 2 | T-Side | RandomForest | Full | 0.828 | 0.084 | 0.825 | 0.083 | 0.955 | 0.029 | 0.508 | {'class_weight': 'balanced', 'max_depth': 2, '... | [0.9407407407407407, 0.9500891265597149, 0.773... |
| 3 | CT-Side | RandomForest | Raw | 0.756 | 0.084 | 0.758 | 0.080 | 0.870 | 0.029 | 0.523 | {'class_weight': 'balanced', 'max_depth': 2, '... | [0.7579365079365079, 0.6892736892736893, 0.776... |
| 4 | CT-Side | RandomForest | Orthogonal | 0.751 | 0.090 | 0.755 | 0.087 | 0.861 | 0.034 | 0.523 | {'class_weight': 'balanced', 'max_depth': 2, '... | [0.6895104895104894, 0.7594405594405594, 0.714... |
| 5 | CT-Side | RandomForest | Full | 0.764 | 0.084 | 0.767 | 0.081 | 0.863 | 0.033 | 0.517 | {'class_weight': 'balanced', 'max_depth': 2, '... | [0.6897546897546897, 0.8773892773892774, 0.776... |
3.3 XGBoost¶
Why XGBoost? Gradient boosting builds trees sequentially, each correcting the errors of the previous. Often achieves state-of-the-art performance on tabular data. However, it is prone to overfitting on small datasets, so we use conservative hyperparameters with explicit regularisation.
Hyperparameter Grid (Regularised):
- `n_estimators`: [100] — Number of boosting rounds. Kept modest to prevent overfitting.
- `learning_rate`: [0.01, 0.1] — Step size shrinkage. Lower values require more trees but generalise better.
- `max_depth`: [2, 3, 5] — Shallow trees to limit model complexity with small data.
- `min_child_weight`: [5, 10] — Minimum sum of instance weight in a child node. Higher values prevent overly specific splits that fit individual samples. Critical regulariser for small datasets.
- `subsample`: [0.8] — Fraction of samples used per tree. Introduces stochasticity for regularisation.
# === XGBoost: Define Grid & Run ===
xgb_param_grid = {
'n_estimators': [100],
'learning_rate': [0.01, 0.1],
'max_depth': [2, 3, 5], # Shallower to prevent overfitting
'min_child_weight': [5, 10], # Key regulariser for small data
'subsample': [0.8],
'eval_metric': ['mlogloss']
}
print("Running XGBoost with Nested CV...")
print(f"Grid size: {np.prod([len(v) for v in xgb_param_grid.values()])} combinations per inner CV")
xgb_results = run_model_tuning(
model_class=XGBClassifier,
param_grid=xgb_param_grid,
model_name='XGBoost',
df=df,
feature_sets=FEATURE_SETS,
feature_set_names=FEATURE_SETS_TO_TEST,
cv_strategy=cv_strategy,
excluded_roles=EXCLUDED_ROLES
)
print("\nXGBoost Results:")
display(xgb_results.round(3))
# Save results
xgb_results.to_csv(TAB_DIR / "model_results_xgb.csv", index=False)
Running XGBoost with Nested CV...
Grid size: 12 combinations per inner CV

XGBoost Results:
| Side | Model | Feature_Set | Mean_F1 | Std_F1 | Mean_Accuracy | Std_Accuracy | Mean_Train_F1 | Std_Train_F1 | Mean_Fit_Time | Best_Params | All_Scores | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T-Side | XGBoost | Raw | 0.794 | 0.084 | 0.798 | 0.079 | 0.942 | 0.023 | 0.181 | {'eval_metric': 'mlogloss', 'learning_rate': 0... | [0.8857142857142857, 0.8850408850408851, 0.824... |
| 1 | T-Side | XGBoost | Orthogonal | 0.812 | 0.099 | 0.816 | 0.097 | 0.973 | 0.015 | 0.180 | {'eval_metric': 'mlogloss', 'learning_rate': 0... | [0.8169191919191919, 0.8962962962962964, 0.773... |
| 2 | T-Side | XGBoost | Full | 0.819 | 0.080 | 0.820 | 0.078 | 0.965 | 0.019 | 0.165 | {'eval_metric': 'mlogloss', 'learning_rate': 0... | [0.8857142857142857, 0.8850408850408851, 0.824... |
| 3 | CT-Side | XGBoost | Raw | 0.709 | 0.105 | 0.720 | 0.092 | 0.889 | 0.029 | 0.183 | {'eval_metric': 'mlogloss', 'learning_rate': 0... | [0.6813186813186812, 0.7633477633477633, 0.776... |
| 4 | CT-Side | XGBoost | Orthogonal | 0.719 | 0.100 | 0.726 | 0.091 | 0.854 | 0.032 | 0.170 | {'eval_metric': 'mlogloss', 'learning_rate': 0... | [0.6190476190476191, 0.5712250712250713, 0.771... |
| 5 | CT-Side | XGBoost | Full | 0.730 | 0.097 | 0.737 | 0.089 | 0.874 | 0.029 | 0.182 | {'eval_metric': 'mlogloss', 'learning_rate': 0... | [0.7579365079365079, 0.7594405594405594, 0.776... |
3.4 Model Comparison & Champion Selection¶
Aggregate results from all models (including Logistic Regression baseline from Section 2) and identify the best-performing configuration for each side.
# === Aggregate Results: Build Leaderboard ===
# Compile leaderboard (includes LogReg baseline on core roles for fair comparison)
all_results_sorted = compile_model_leaderboard(
df=df,
feature_sets=FEATURE_SETS,
excluded_roles=EXCLUDED_ROLES,
cv_strategy=cv_strategy,
model_results={
'SVM': svm_results,
'RandomForest': rf_results,
'XGBoost': xgb_results
},
feature_sets_to_test=FEATURE_SETS_TO_TEST,
tab_dir=TAB_DIR
)
print("=" * 70)
print("MODEL LEADERBOARD (Core Roles Only)")
print("=" * 70)
print("\nRanked by F1-Macro.")
# Display columns (excluding All_Scores for readability)
display_cols = ['Side', 'Model', 'Feature_Set', 'Mean_F1', 'Std_F1', 'Mean_Accuracy', 'Mean_Train_F1', 'Overfitting_Gap']
display(all_results_sorted[display_cols].round(3))
======================================================================
MODEL LEADERBOARD (Core Roles Only)
======================================================================

Ranked by F1-Macro.
| Side | Model | Feature_Set | Mean_F1 | Std_F1 | Mean_Accuracy | Mean_Train_F1 | Overfitting_Gap | |
|---|---|---|---|---|---|---|---|---|
| 0 | CT-Side | RandomForest | Full | 0.764 | 0.084 | 0.767 | 0.863 | 0.098 |
| 1 | CT-Side | RandomForest | Raw | 0.756 | 0.084 | 0.758 | 0.870 | 0.115 |
| 2 | CT-Side | RandomForest | Orthogonal | 0.751 | 0.090 | 0.755 | 0.861 | 0.109 |
| 3 | CT-Side | XGBoost | Full | 0.730 | 0.097 | 0.737 | 0.874 | 0.143 |
| 4 | CT-Side | XGBoost | Orthogonal | 0.719 | 0.100 | 0.726 | 0.854 | 0.135 |
| 5 | CT-Side | XGBoost | Raw | 0.709 | 0.105 | 0.720 | 0.889 | 0.180 |
| 6 | CT-Side | SVM | Full | 0.708 | 0.091 | 0.715 | 0.875 | 0.166 |
| 7 | CT-Side | SVM | Raw | 0.704 | 0.098 | 0.711 | 0.867 | 0.163 |
| 8 | CT-Side | SVM | Orthogonal | 0.693 | 0.093 | 0.699 | 0.868 | 0.175 |
| 9 | CT-Side | LogisticRegression | Raw | 0.688 | 0.110 | 0.705 | 0.811 | 0.123 |
| 10 | CT-Side | LogisticRegression | Full | 0.676 | 0.110 | 0.689 | 0.812 | 0.136 |
| 11 | CT-Side | LogisticRegression | Orthogonal | 0.670 | 0.110 | 0.683 | 0.806 | 0.136 |
| 12 | T-Side | LogisticRegression | Orthogonal | 0.921 | 0.057 | 0.922 | 0.974 | 0.053 |
| 13 | T-Side | LogisticRegression | Full | 0.919 | 0.059 | 0.919 | 0.977 | 0.058 |
| 14 | T-Side | SVM | Orthogonal | 0.913 | 0.053 | 0.910 | 0.962 | 0.049 |
| 15 | T-Side | SVM | Full | 0.907 | 0.056 | 0.903 | 0.965 | 0.058 |
| 16 | T-Side | LogisticRegression | Raw | 0.902 | 0.073 | 0.903 | 0.968 | 0.067 |
| 17 | T-Side | SVM | Raw | 0.874 | 0.066 | 0.872 | 0.961 | 0.088 |
| 18 | T-Side | RandomForest | Full | 0.828 | 0.084 | 0.825 | 0.955 | 0.127 |
| 19 | T-Side | RandomForest | Raw | 0.822 | 0.091 | 0.822 | 0.954 | 0.132 |
| 20 | T-Side | XGBoost | Full | 0.819 | 0.080 | 0.820 | 0.965 | 0.147 |
| 21 | T-Side | RandomForest | Orthogonal | 0.813 | 0.103 | 0.813 | 0.957 | 0.144 |
| 22 | T-Side | XGBoost | Orthogonal | 0.812 | 0.099 | 0.816 | 0.973 | 0.162 |
| 23 | T-Side | XGBoost | Raw | 0.794 | 0.084 | 0.798 | 0.942 | 0.148 |
# === Model Stability Visualisation ===
# Boxplots showing CV score distribution for top configurations per side
# Fixed 0.4-1 axis for honest comparison across sides
# T-Side stability
fig = plot_model_stability_boxplots(
all_results=all_results_sorted,
side='t',
top_n=6,
fig_dir=FIG_DIR,
xlim=(0.4, 1.0)
)
plt.show()
# CT-Side stability
fig = plot_model_stability_boxplots(
all_results=all_results_sorted,
side='ct',
top_n=6,
fig_dir=FIG_DIR,
xlim=(0.4, 1.0)
)
plt.show()
# === Champion Selection: Best Model per Side ===
# Explicitly select champions (not just highest F1) for interpretability & pipeline consistency
CHAMPION_SELECTION = {
'T-Side': {'Model': 'LogisticRegression', 'Feature_Set': 'Orthogonal'},
'CT-Side': {'Model': 'RandomForest', 'Feature_Set': 'Orthogonal'}
}
champions = select_and_save_champion_models(
leaderboard_df=all_results_sorted,
df=df,
feature_sets=FEATURE_SETS,
excluded_roles=EXCLUDED_ROLES,
model_dir=MODEL_DIR,
champion_criteria=CHAMPION_SELECTION
)
======================================================================
CHAMPION MODELS (Best F1-Macro per Side)
======================================================================
T-Side:
Model: LogisticRegression
Feature Set: Orthogonal
F1-Macro: 0.921 ± 0.057
Accuracy: 0.922 ± 0.054
Hyperparameters: Default (baseline, no tuning)
Saved to: P:\cs2-playstyle-analysis-2024\results\classification\models\champion_t_LogisticRegression.joblib
CT-Side:
Model: RandomForest
Feature Set: Orthogonal
F1-Macro: 0.751 ± 0.090
Accuracy: 0.755 ± 0.087
Hyperparameters: {'class_weight': 'balanced', 'max_depth': 2, 'max_features': 'sqrt', 'min_samples_leaf': 8, 'n_estimators': 100}
Saved to: P:\cs2-playstyle-analysis-2024\results\classification\models\champion_ct_RandomForest.joblib
======================================================================
Section 3 Summary & Champion Selection¶
We evaluated four model families (Logistic Regression, SVM, Random Forest, XGBoost) using nested cross-validation to identify the optimal classifier for each side.
1. T-Side Champion: Logistic Regression (Orthogonal)¶
- Performance: F1-Macro 0.921 ± 0.06 (Top of Leaderboard)
- Justification:
- Simplicity Wins: Linear models matched or outperformed complex ensembles (XGBoost F1 ~0.82), proving that T-side roles are linearly separable in our feature space.
- Interpretability: Logistic Regression offers direct coefficient interpretability, allowing us to explain exactly why a player is classified as a "Lurker" (e.g., a +1.75 coefficient on `ADNT`).
- Stability: Lowest standard deviation (±0.06) indicates robust generalisation across different player splits.
2. CT-Side Champion: Random Forest (Orthogonal)¶
- Performance: F1-Macro 0.751 ± 0.09
- Justification:
- Performance vs Interpretability Trade-off: `RandomForest (Full)` achieved a marginally higher mean F1 (0.764) with slightly lower variance (±0.08). However, the 0.013 F1 difference corresponds to ~1 player classification across our cohort and falls within confidence overlap.
- Why Orthogonal? We accept the modest variance trade-off (±0.09 vs ±0.08) in exchange for:
  - A unified feature set with T-side, simplifying cross-interpretation
  - Avoiding correlated features (ADNT ↔ ADAT) that can cause "vote-splitting" in tree-based importance rankings
  - Parsimony (6 features vs 7)
- Complexity Required: Tree-based models substantially outperformed the linear baseline (0.75 vs 0.67), confirming CT roles require non-linear decision boundaries.
3. Methodological Note: Constraints for Small Data¶
Given the small sample size (N=84), we deliberately constrained our tree-based models to shallow depths (max_depth of 2-5 in the grid, with 2-3 typically selected) and high leaf requirements.
- Why: Preliminary tests showed that standard depths (unconstrained or deep trees) allowed models to memorise individual players, leading to train-test gaps of >20%.
- Result: By strictly regularising the hyperparameters in our grid search, we maintained healthy train-test gaps (~10%), ensuring the selected champions are learning generalisable role archetypes rather than overfitting to noise.
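As a small illustration (using the leaderboard columns shown above; the 0.20 threshold reflects the >20% gaps mentioned), the gap screening amounts to:

```python
# Train-test gap screening: large gaps (> ~0.20 F1) indicate memorisation of players.
# Recomputed here for illustration; the leaderboard already carries this column.
gaps = all_results_sorted.assign(
    Overfitting_Gap=lambda d: d["Mean_Train_F1"] - d["Mean_F1"]
)
flagged = gaps[gaps["Overfitting_Gap"] > 0.20]
print(f"{len(flagged)} of {len(gaps)} configurations exceed a 0.20 train-test gap")
```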
Next Steps: We proceed to Section 4 to train these two champion models on the full (filtered) dataset. We will then perform a "post-mortem" analysis using Confusion Matrices to identify specific misclassification patterns and SHAP/Coefficient analysis to validate the behavioural drivers of each role.
4. Model Interpretation¶
TL;DR: We analyse Random Forest models for both T and CT sides to enable consistent non-linear feature interpretation. We use Gini importance for a global view and SHAP values for directional insights. Finally, we analyse prediction confidence to identify misclassification patterns.
Why Random Forest for Both Sides?
Although Logistic Regression was the statistical champion for T-side (F1=0.92), we also analyse the T-side Random Forest (Orthogonal) model here.
Reasons:
- Consistency: Using the same model architecture allows for direct comparison of feature importance dynamics between T and CT sides.
- Non-Linearity: Trees can capture complex interactions (e.g., "high aggression is good ONLY IF trading is also high") that linear models miss.
- Validation: If the RF finds similar patterns to the linear model, it reinforces our findings.
Interpretability Approaches:
- Gini Importance: "Which features are used most often?" (Magnitude)
- SHAP Values: "How does this feature value affect the prediction?" (Direction & Magnitude)
- Confidence Analysis: "Is the model confused or confidently wrong?" (Error Diagnosis)
4.1 Setup & Model Retrieval¶
We load the existing CT-side champion. For T-side, we retrieve the best hyperparameters from the leaderboard and retrain a fresh Random Forest on the filtered dataset (Core Roles) to ensure fair analysis.
# === Section 4 Setup: Prepare Models ===
# Load/retrain models for interpretation
t_model = prepare_interpretation_model(
side='t',
model_name='RandomForest',
feature_set_name='Orthogonal',
df=df,
feature_sets=FEATURE_SETS,
excluded_roles=EXCLUDED_ROLES,
model_dir=MODEL_DIR,
tab_dir=TAB_DIR,
load_if_saved=False # Retrain T-side for interpretation
)
ct_model = prepare_interpretation_model(
side='ct',
model_name='RandomForest',
feature_set_name='Orthogonal',
df=df,
feature_sets=FEATURE_SETS,
excluded_roles=EXCLUDED_ROLES,
model_dir=MODEL_DIR,
tab_dir=TAB_DIR,
load_if_saved=True # Load saved CT champion
)
# Unpack for downstream compatibility
champion_t_rf = t_model['pipeline']
champion_ct = ct_model['pipeline']
X_t, y_t = t_model['X'], t_model['y']
X_ct, y_ct = ct_model['X'], ct_model['y']
feature_names_t = t_model['feature_names']
feature_names_ct = ct_model['feature_names']
champion_data = {'t': t_model, 'ct': ct_model}
print("\nModels ready for interpretation:")
print(f" T-Side: RandomForest (F1 ~{t_model['f1_score']:.3f})")
print(f" CT-Side: {type(ct_model['pipeline'].named_steps['clf']).__name__} (F1 ~{ct_model['f1_score']:.3f})")
Retraining T-Side RandomForest with params: {'class_weight': 'balanced', 'max_depth': 2, 'max_features': 'sqrt', 'min_samples_leaf': 5, 'n_estimators': 100}
Loading saved model from P:\cs2-playstyle-analysis-2024\results\classification\models\champion_ct_RandomForest.joblib
Models ready for interpretation:
T-Side: RandomForest (F1 ~0.813)
CT-Side: RandomForestClassifier (F1 ~0.751)
4.2 Global Feature Importance (Gini)¶
We compare which features drive the decision trees for each side. Gini importance measures how often a feature is used to split nodes and how much it reduces impurity.
# === Global Feature Importance (Gini) ===
compare_rf_feature_importance(
pipeline_t=champion_t_rf,
feature_names_t=feature_names_t,
pipeline_ct=champion_ct,
feature_names_ct=feature_names_ct,
fig_dir=FIG_DIR
)
plt.show()
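As a quick sanity check, the raw Gini importances can also be read directly off the fitted forest; the 'clf' step name below matches the pipeline used in Section 4.1, and this is presumably similar to what the plotting helper wraps.
# Raw Gini importances from the fitted T-side forest, as a sorted table
import pandas as pd

gini_t = pd.Series(
    champion_t_rf.named_steps['clf'].feature_importances_,
    index=feature_names_t,
).sort_values(ascending=False)
print(gini_t.round(3))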
4.3 Directional Feature Analysis (SHAP)¶
Gini importance tells us what matters, but not how. We use SHAP beeswarm plots to reveal directionality (e.g., does high ADNT predict Lurker or Entry?).
How to read these plots:
- Each Dot: Represents a single player from the dataset.
- Rows: Features (variables).
- X-axis (SHAP value): Impact on model output.
- Right (Positive): Pushes prediction towards this role.
- Left (Negative): Pushes prediction away from this role.
- Colour: Actual Feature Value.
- Red: High value for that feature.
- Blue: Low value for that feature.
Example: If the "Lurker" plot shows red dots (High ADNT) on the right (positive SHAP), it means "High isolation increases the probability of being classified as a Lurker".
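Under the hood, a helper like plot_shap_beeswarm_grid presumably computes per-class SHAP values along these lines (a sketch, not the actual implementation; the 'scaler' step name is an assumption about the pipeline layout).
# Sketch: per-class SHAP values for a scaled RandomForest pipeline
import shap

def rf_shap_values(pipeline, X):
    # Trees were fit on scaled inputs, so explain in the scaled space
    X_scaled = pipeline.named_steps['scaler'].transform(X)  # assumed step name
    explainer = shap.TreeExplainer(pipeline.named_steps['clf'])
    # Multiclass forests yield one (n_samples, n_features) array per class
    # (recent shap versions return a single 3D array instead)
    return explainer.shap_values(X_scaled), X_scaled

# e.g. one beeswarm panel per role:
# values, X_scaled = rf_shap_values(champion_t_rf, X_t)
# shap.summary_plot(values[0], X_scaled, feature_names=feature_names_t)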
# === Separate SHAP Analysis (T & CT) ===
# T-Side SHAP
print("Generating T-Side SHAP Plot...")
fig_t = plot_shap_beeswarm_grid(
pipeline=champion_t_rf,
X=X_t,
feature_names=feature_names_t,
side='t',
fig_dir=FIG_DIR
)
plt.show()
# CT-Side SHAP
print("Generating CT-Side SHAP Plot...")
fig_ct = plot_shap_beeswarm_grid(
pipeline=champion_ct,
X=X_ct,
feature_names=feature_names_ct,
side='ct',
fig_dir=FIG_DIR
)
plt.show()
Generating T-Side SHAP Plot...
Generating CT-Side SHAP Plot...
Observations from SHAP Analysis:
T-Side Roles:
- Lurker: Strongly defined by High ADNT (Isolation). The red dots on the far right of the ADNT row confirm that isolation is the primary driver. They also show high ADAT Residuals, meaning they position further from the team centre than expected given their ADNT.
- Spacetaker: Driven by High OAP (Opening Attempts) and Low TAPD (Time Alive). This confirms the "entry fragger" profile: aggressive first-contact seeking that often leads to earlier deaths.
- AWPer: Characterised by Low ADNT (Pack play), Low OAP (Opening Attempts) and High TAPD (Survival). The blue dots on the ADNT row indicate that playing close to teammates is a key defining feature.
CT-Side Roles:
- Anchor: The mirror of the Lurker, defined by High ADNT (Isolation) and High ADAT Residuals (Static/Peripheral holding).
- Rotator: Defined by Low ADNT (Pack play) and Low TAPD (Active rotation/support leading to higher engagement risk).
- AWPer (CT): Distinct from T-side AWPers, they show Higher OAP (getting more opening picks) but still maintain High TAPD (survival), reflecting the "posted up" nature of defensive sniping.
4.4 Probability Analysis (The "Discrimination Plot")¶
We visualise the model's confidence for every player to distinguish between "Clear Wins", "Near Misses", and "Confident Errors".
Methodology: Repeated Cross-Validation
To ensure robust and unbiased probability estimates, we use Repeated Stratified K-Fold (4 splits × 20 repeats).
- For each player, we generate 20 independent out-of-sample predictions (trained on the other 75% of data).
- We average these probabilities to get a stable estimate of the model's confidence.
- Error bars represent the standard deviation across repeats, quantifying the stability of the classification.
High standard deviation indicates that a player's classification is sensitive to the specific training data split (an "Unstable" classification).
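A minimal sketch of how such estimates can be produced under the 4 × 20 scheme (an illustrative reimplementation, not the actual source of get_repeated_cv_predictions):
# Sketch: mean/std of out-of-fold probabilities over repeated stratified CV
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

def repeated_cv_probas(model, X, y, n_splits=4, n_repeats=20, seed=0):
    X, y = np.asarray(X), np.asarray(y)
    n_classes = len(np.unique(y))
    probas = np.zeros((n_repeats, len(y), n_classes))
    for rep in range(n_repeats):
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed + rep)
        for train_idx, test_idx in cv.split(X, y):
            fitted = clone(model).fit(X[train_idx], y[train_idx])
            # exactly one out-of-sample prediction per player per repeat
            probas[rep, test_idx] = fitted.predict_proba(X[test_idx])
    # mean = stable confidence estimate; std = sensitivity to the training split
    return probas.mean(axis=0), probas.std(axis=0)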
# === T-Side Probability Analysis ===
# 1. Generate Repeated CV Predictions
print("Generating T-Side Predictions...")
mean_probas_t, std_probas_t = get_repeated_cv_predictions(
model=champion_t_rf,
X=X_t,
y=y_t
)
# Get player names
player_names_t = champion_data['t']['df']['player_name'].tolist()
# 2. Plot with Stability
fig = plot_prediction_confidence(
y_true=y_t,
mean_probas=mean_probas_t,
std_probas=std_probas_t,
class_names=champion_t_rf.classes_,
side='t',
player_names=player_names_t,
fig_dir=FIG_DIR
)
plt.show()
# === CT-Side Probability Analysis ===
print("Generating CT-Side Predictions...")
mean_probas_ct, std_probas_ct = get_repeated_cv_predictions(
model=champion_ct,
X=X_ct,
y=y_ct
)
# Get player names
player_names_ct = champion_data['ct']['df']['player_name'].tolist()
fig = plot_prediction_confidence(
y_true=y_ct,
mean_probas=mean_probas_ct,
std_probas=std_probas_ct,
class_names=champion_ct.classes_,
side='ct',
player_names=player_names_ct,
fig_dir=FIG_DIR
)
plt.show()
Generating T-Side Predictions...
Generating CT-Side Predictions...
Observations: Probability & Misclassification Analysis¶
Overall Trends:
As anticipated from the F1 scores and previous analysis, the CT side exhibits a higher rate of misclassification compared to the T side. Specifically, the boundary between Rotators and AWPers appears the most porous, with the model frequently confusing these roles.
Interestingly, IGLs appear to be misclassified quite frequently (HooXi, apEX, chopper, bLitz, Snax, MAJ3R, biguzera). Whilst effort was made to use features that do not directly measure personal performance, perhaps there is a confounding effect between IGLs generally performing worse in-game and the distinctiveness of their roles?
T-Side Observations and theories:
- Brollan (Lurker $\to$ Spacetaker): The model confidently misclassifies Brollan as a Spacetaker. Personally, I had the impression that he was a relatively aggressive player; this high-confidence error suggests a potential mismatch between his assigned label and his actual behavioural metrics this year.
- Jabbi (Lurker $\to$ Spacetaker): Similarly misclassified but with higher prediction variance, suggesting a unique or hybrid playstyle that defies rigid categorisation.
- Ultimate (AWPer $\to$ Spacetaker): His misclassification as a Spacetaker aligns with his reputation as a famously aggressive AWPer in 2024, exhibiting movement patterns closer to a rifler than a sniper.
- ZywOo (AWPer $\to$ Rotator): Likely misclassified due to his hybrid profile; he is well-known for his proficiency with rifles and willingness to pick them up more frequently than "pure" AWPers.
- cadiaN (Rotator $\to$ AWPer): Misclassified as an AWPer, which is factually grounded: he served as the primary AWPer for Liquid before transitioning to a rifling role with Astralis mid-year.
CT-Side Observations and theories:
- Spinx (Anchor $\to$ Rotator): Spinx is the most confidently misclassified player on the CT side. This is a notable case where the model's prediction (Rotator) contradicts the label (Anchor). As discussed in Section 4.5, this likely points to a labelling error in our ground truth rather than a model failure.
- Ultimate (AWPer $\to$ Rotator): Consistent with his T-side results, his unique, aggressive AWPing style registers as rifler-like behavior to the model.
- biguzera (Rotator $\to$ AWPer): Confidently misclassified as an AWPer, potentially because, as an IGL, he is known for taking "Star" positions of the kind an AWPer would typically occupy.
4.5 Individual Misclassification Analysis: Waterfall Plots (LOOCV)¶
Waterfall plots are generated with a Leave-One-Out (LOO) fit: for the player being explained, the model is retrained on all other core-role players (excluding ambiguous roles such as Half-Lurker/Mixed). This mirrors the "unseen player" condition of the CV predictions in Section 4.4 and avoids explaining a model that has already memorised the player. Because the LOO model is slightly better-specified (N-1 > 0.75N), SHAP values may be marginally more confident than any single CV fold, but they faithfully explain why the model would misclassify the player when treated as new data.
Also note that many of the remarks made here are merely speculative, and would require individual analysis of the player's matches to be fully justified.
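The core of the LOO refit is simple; a sketch of the idea follows (the actual plot_comparison_waterfall helper also builds the paired SHAP waterfall panels).
# Sketch: leave-one-out fit so the explained player is genuinely unseen
import numpy as np
from sklearn.base import clone

def loo_prediction(pipeline_template, X, y, player_names, target):
    names = np.asarray(player_names)
    train = names != target                    # drop the target player
    model = clone(pipeline_template).fit(X[train], y[train])
    proba = model.predict_proba(X[~train])[0]  # predict the held-out player
    return model, dict(zip(model.classes_, np.round(proba, 3)))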
T side waterfall plots¶
We'll start with Brollan and jabbi, Lurkers who were misclassified as Spacetakers.
# Side-by-side comparison of predicted vs. true role for Brollan
plot_comparison_waterfall(
pipeline_template=champion_t_rf,
df=df,
player_name="Brollan",
feature_names=feature_names_t,
side='t',
excluded_role=EXCLUDED_ROLES['t'],
fig_dir=FIG_DIR,
)
plt.show()
plot_comparison_waterfall(
pipeline_template=champion_t_rf,
df=df,
player_name="jabbi",
feature_names=feature_names_t,
side='t',
excluded_role=EXCLUDED_ROLES['t'],
fig_dir=FIG_DIR,
)
plt.show()
Brollan and jabbi have relatively similar playstyles according to the feature set. The main drivers of their Spacetaker predictions were their high opening attempts (OAP) and low time alive per death (TAPD), both very typical of Spacetakers. Where they deviate is their high isolation from teammates (ADNT), the biggest (and virtually only) factor pulling them towards the Lurker role. This tells us that these players had an isolated but aggressive T-side playstyle. With such a high distance from their team combined with high opening attempts, this could point to a problem with their teams keeping up with them early in rounds.
Next we'll look at Ultimate, an AWPer who was misclassified as a Spacetaker.
plot_comparison_waterfall(
pipeline_template=champion_t_rf,
df=df,
player_name="ultimate",
feature_names=feature_names_t,
side='t',
excluded_role=EXCLUDED_ROLES['t'],
fig_dir=FIG_DIR,
)
plt.show()
Ultimate was known for being an exceptionally aggressive AWPer in 2024, and his feature values corroborate this. He was misclassified as a Spacetaker largely due to dying quickly (low TAPD) and being traded relatively frequently (PODT).
The table below compares his feature values with those of other AWPers: he consistently places in the tails of the distribution for every feature, indicating that his playstyle was a significant outlier.
ultimate_t_percentiles = get_player_percentiles(
df=df,
player_name="ultimate",
side='t',
features=feature_names_t,
excluded_role=EXCLUDED_ROLES['t'],
)
print("Ultimate vs. labelled role (AWPer):")
display(ultimate_t_percentiles)
Ultimate vs. labelled role (AWPer):
| | player_name | side | role_group | feature | value | percentile | rank |
|---|---|---|---|---|---|---|---|
| 0 | ultimate | t | AWPer | tapd_t | 59.090479 | 5.882353 | 1/17 |
| 1 | ultimate | t | AWPer | oap_t | 19.145299 | 94.117647 | 16/17 |
| 2 | ultimate | t | AWPer | podt_t | 25.097174 | 94.117647 | 16/17 |
| 3 | ultimate | t | AWPer | pokt_t | 22.911286 | 11.764706 | 2/17 |
| 4 | ultimate | t | AWPer | adnt_rank_t | 0.298120 | 5.882353 | 1/17 |
| 5 | ultimate | t | AWPer | adat_residual_t | 0.063184 | 88.235294 | 15/17 |
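For reference, these percentiles can be reproduced with a weak-rank percentile against same-role peers. This is a sketch of what get_player_percentiles presumably computes: kind='weak' matches the rank/N values in the table (e.g. 1/17 → 5.88), and the role_group/player_name column names are taken from the table above and may differ in the raw dataframe.
# Sketch: percentile of one player's value among players with the same label
from scipy.stats import percentileofscore

def percentile_vs_role(df, player, feature, role_col='role_group'):
    role = df.loc[df['player_name'] == player, role_col].iloc[0]
    peers = df.loc[df[role_col] == role, feature]
    value = df.loc[df['player_name'] == player, feature].iloc[0]
    return percentileofscore(peers, value, kind='weak')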
CT side waterfall plots¶
Next we'll look at Spinx, an "Anchor" who was "misclassified" as a Rotator.
plot_comparison_waterfall(
pipeline_template=champion_ct,
df=df,
player_name="Spinx",
feature_names=feature_names_ct,
side='ct',
excluded_role=EXCLUDED_ROLES['ct'],
fig_dir=FIG_DIR,
)
plt.show()
The waterfall plots show that all of Spinx's features match what one would expect of a Rotator, and his positioning (ADNT) combined with his opening attempts (OAP) make him very unlikely to be an Anchor. There is a very good reason for this: he is actually mislabelled. Upon analysing many of his 2024 matches with Vitality, Spinx did appear to play a rotating role on the CT side. This is slightly embarrassing, as the error is mine: at the end of 2024 he switched teams, Harry Richards' positions data (which was used to collect most labels) didn't include him, and I must have (wrongly) assumed his CT-side role. Whoops! On the plus side, this demonstrates a potential use case for a model like this: even if it isn't yet accurate enough to classify all players, it can still highlight potential miscategorisations such as this one.
Next we will look at HooXi, an Anchor misclassified as a Rotator.
plot_comparison_waterfall(
pipeline_template=champion_ct,
df=df,
player_name="HooXi",
feature_names=feature_names_ct,
side='ct',
excluded_role=EXCLUDED_ROLES['ct'],
fig_dir=FIG_DIR,
)
plt.show()
HooXi is an interesting case: his closer proximity to teammates (ADNT) and lower time alive per death (TAPD) were the main drivers of his classification as a Rotator. His low TAPD could be explained by his notoriously sacrificial playstyle* (also indicated by a very high PODT), but his low isolation (ADNT) is very atypical of an Anchor. Upon analysing a few of his games on G2 this year, it appeared that, depending on the map, HooXi played a mixture of "Anchor" positions (such as B-site Inferno and Cave on Ancient) and more rotating positions (on maps such as Overpass and Anubis). This leads me to conclude that I (again due to a missing label in my version of the positions data) mislabelled HooXi as an Anchor, when he should have received the more ambiguous "Mixed" role. A second mislabelling indicates that I should have been far more careful when imputing missing labels into the dataset, rather than relying on memory and assumptions (probably to rush it in for a University deadline).
*https://www.hltv.org/news/39326/stat-check-g2-throws-roles-out-the-window
Next we will look at Ultimate (just because he's fun), an AWPer who was misclassified as a Rotator.
plot_comparison_waterfall(
pipeline_template=champion_ct,
df=df,
player_name="ultimate",
feature_names=feature_names_ct,
side='ct',
excluded_role=EXCLUDED_ROLES['ct'],
fig_dir=FIG_DIR,
)
plt.show()
This is pretty much the same story as on the T side for Ultimate: impressively aggressive, with high opening attempts (OAP > 30%!) and low time alive (TAPD) pushing him towards the Rotator role. His early deaths pull him away from an AWPer classification, along with his positioning closer to the average teammate than his nearest-teammate distance would suggest (ADAT Residual), which was the most influential feature when classifying AWPers (see the beeswarm plot in Section 4.3).
Next we will look at malbsMd, a Rotator who was misclassified as an Anchor.
plot_comparison_waterfall(
pipeline_template=champion_ct,
df=df,
player_name="malbsMd",
feature_names=feature_names_ct,
side='ct',
excluded_role=EXCLUDED_ROLES['ct'],
fig_dir=FIG_DIR,
)
plt.show()
malbsMd presents a curious case. The drive towards Anchor and away from Rotator is his high early-round isolation from his team: he positions far from his nearest teammate (ADNT), and even further from his average teammate than one would expect (ADAT Residual). Even looking at the specific positions he played on each map, he seemed to play generally rotate-heavy spots* (e.g. Connector on Mirage, Middle on Ancient, Connector on Anubis). One would have to analyse his individual maps to better understand this anomaly. My guess is that in early rounds, more players were stacked closer to the Anchor IGL (HooXi or Snax) to compensate for lower firepower, leaving malbsMd more isolated.
*https://www.hltv.org/news/39318/official-malbsmd-joins-g2
Next we'll take a look at biguzera, chopper, bLitz and apEX, all IGL-Rotators who were misclassified as AWPers.
plot_comparison_waterfall(
pipeline_template=champion_ct,
df=df,
player_name="biguzera",
feature_names=feature_names_ct,
side='ct',
excluded_role=EXCLUDED_ROLES['ct'],
fig_dir=FIG_DIR,
)
plt.show()
plot_comparison_waterfall(
pipeline_template=champion_ct,
df=df,
player_name="chopper",
feature_names=feature_names_ct,
side='ct',
excluded_role=EXCLUDED_ROLES['ct'],
fig_dir=FIG_DIR,
)
plt.show()
plot_comparison_waterfall(
pipeline_template=champion_ct,
df=df,
player_name="bLitz",
feature_names=feature_names_ct,
side='ct',
excluded_role=EXCLUDED_ROLES['ct'],
fig_dir=FIG_DIR,
)
plt.show()
plot_comparison_waterfall(
pipeline_template=champion_ct,
df=df,
player_name="apEX",
feature_names=feature_names_ct,
side='ct',
excluded_role=EXCLUDED_ROLES['ct'],
fig_dir=FIG_DIR,
)
plt.show()
These are the four players most confidently misclassified into the AWPer role (see Section 4.4). What makes this interesting is that they are all Rotators, and all in-game leaders (IGLs). This seems to be a smoking gun for what went wrong here. Observing their feature values, they all have a low ADAT Residual, pulling them away from the Rotator classification and towards the AWPer role.
Let's see whether this low ADAT Residual value has something to do with their meta-role as IGL.
Visual diagnostic: IGL ADAT (Residual) vs CT role distribution
plot_igl_feature_distribution(
df=df,
feature_name="adat_residual_ct",
side="ct",
fig_dir=FIG_DIR,
)
plt.show()
This plot is rather telling; it shows two main things about IGLs: firstly, that IGLs are more likely to be Spacetakers, and secondly, that they tend to position closer to the average teammate than their nearest-teammate proximity would predict, regardless of role.
In other words, IGLs tend to play more "central" positions on the map, at least early in rounds.
This idea is not new; in fact, Harry "NER0cs" Richards points out this centralised-IGL phenomenon in his 2024 article "Why more and more IGLs are taking up the 'supportive rotator' role"*.
The reasoning boils down to how much information the IGL can process in these positions. Central positions allow IGLs to understand what is happening around the map without having to rely on communicated information or a secondary caller (another player designated to lead in specific circumstances). The trade-off, however, is that many of these positions are high-engagement areas where "Star" players typically excel, meaning an IGL can become a liability, often not pulling their weight in firepower.
*https://www.hltv.org/news/38747/why-more-and-more-igls-are-taking-up-the-supportive-rotator-role
Section 4 Summary & Interpretation¶
Key Takeaways:
- Feature Utility: CT-side classification relies on the four non-trade metrics, whereas T-side classification draws useful signal from the full feature set.
- The Power of ADNT: Isolation (ADNT) proved to be the most critical feature for almost all roles (except CT AWPers and Spacetakers). While simple and slightly arbitrary, its interpretability and high signal strength make it a standout "bespoke" metric.
- CT-Side Ambiguity: The boundary between Rotators and AWPers is the most porous in our model. This confusion is likely driven by the IGL-Centrality phenomenon, where In-Game Leaders play central positions that mimic the spacing profile of snipers.
- Misclassification as Diagnostic: High-confidence errors often pointed to ground truth labelling issues (e.g., Spinx, HooXi) or unique outlier playstyles (e.g., Ultimate's aggressive AWPing). Despite the mislabels being "human error" (rushed university coursework!), the model's ability to flag them validates its potential as a viable diagnostic tool for role classification.
Detailed Interpretations
1. Signal Diversity & Feature Starvation
- T-Side (Broad Utility): The model finds useful signal across the entire feature set. Even the lowest-ranked feature (POKT) contributes a meaningful 0.07 Gini importance, indicating that trade behavior is a valid discriminator for attacking roles.
- CT-Side (More Concentrated Signal): Signal is concentrated in the four non-trade metrics (ADNT, ADAT (R), OAP, TAPD). The drop-off for trade metrics (POKT, PODT) reflects the nature of defensive play: Terrorists typically dictate the terms of engagement, making proactive trading behavior less useful for differentiating between specific CT roles than between T roles.
2. Behavioral Drivers (SHAP Analysis)
- T-Side:
  - Lurker: Strongly defined by High Isolation (ADNT).
  - Spacetaker: Defined principally by High Opening Attempts (OAP) and Low Survival (TAPD). Aggressive, as expected!
  - AWPer: Defined by Low Isolation (ADNT) and Low OAP. Plays passively and close to teammates (unless their name is Ultimate).
- CT-Side:
  - Anchor: The mirror of the Lurker, defined by High Isolation (ADNT).
  - Rotator: Defined by Low ADNT (Pack play) and Low Survival (TAPD), reflecting the high-engagement nature of rotation play.
  - AWPer: Less distinct than T-side. Defined by Low ADAT Residual ("Centrality") and High Survival (TAPD). Interestingly, the ADNT signal is mixed, suggesting most CT AWPers effectively play like Rotators while a minority isolate; AWPers may themselves be subcategorisable into Rotator- and Anchor-style players.
2. The "Aggressive/Hybrid" Profiles
- Ultimate (Aggressive): A statistical outlier across the board. His High
OAPand LowTAPDmake him an exceptionally aggressive sniper, confusing the model into seeing a Rifler. - ZywOo (Hybrid): Misclassified as a Rotator not exactly due to aggression, but because he was a hybrid that picks up rifles more frequently than more "pure" AWPers.
4. The IGL-Centrality Phenomenon
- A cluster of IGLs (biguzera, chopper, bLitz, apEX) were misclassified as AWPers.
- Root Cause: These players consistently show low ADAT Residuals (positioning closer to the team's centre of mass than expected).
- Tactical Insight: I believe this reflects IGLs taking "middle-of-the-pack" positions to gather information. The model confuses this "central" behavior with the positioning of an AWPer.
5. Future Potential
- This "first attempt" demonstrates that even with coarse metrics like OAP and ADNT, we can extract meaningful role signatures.
- The model's ability to identify outliers (like Ultimate) and labelling errors (like Spinx) shows its potential. With further refinement, it could be used to classify a larger pool of players (potentially commercially, for casual players) or as a supplementary tool for scouting.
5. Synthesis & Conclusion¶
This notebook has demonstrated that the playstyle features in the dataset contain sufficient signal to classify CS2 professional player roles with high accuracy, particularly on the T-side.
5.1 Methodology Recap & Feature Engineering¶
To ensure statistically robust feature values, we filtered for players with a minimum of 40 maps used to compute their statistics (as determined in the EDA notebook).
To enable this analysis on a small cohort (N=84), we employed a rigorous 4-split × 20-repeat Nested Cross-Validation strategy (80 total folds). This ensured that our performance estimates were stable and that our champion models learned generalisable role archetypes rather than memorising individual player data.
A key innovation was the ADAT (Residual) feature. By training a global linear regression ($ADNT \to ADAT$) on the stable cohort, we established a "Pro Standard" for how much a player should be isolated from the team centre given their distance from their nearest teammate. The residual measures deviation from this professional norm.
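Concretely, the construction reduces to one global fit and a subtraction, sketched below (the adnt_t/adat_t column names are assumptions for illustration, though adat_residual_t does appear in the tables above).
# Sketch: ADAT (Residual) = observed ADAT minus ADAT expected from ADNT
import numpy as np
from sklearn.linear_model import LinearRegression

def adat_residual(adnt, adat):
    X = np.asarray(adnt).reshape(-1, 1)
    expected = LinearRegression().fit(X, adat).predict(X)  # the "Pro Standard"
    # positive residual: further from the team centre than the norm predicts
    return np.asarray(adat) - expected

# e.g. df['adat_residual_t'] = adat_residual(df['adnt_t'], df['adat_t'])  # assumed columns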
Note on Feature Engineering & Data Leakage
Technically, input distribution leakage exists because the ADAT (Residual) feature is calculated using a global Linear Regression model fit on the entire (stable cohort) dataset (including samples that eventually become "Test" data in Cross-Validation). This means the specific slope and intercept used to transform the features were very slightly influenced by the test subjects, violating the strict separation of Train and Test environments.
Despite this theoretical impurity, the approach is retained and justified for three key reasons:
The "Pro Standard" Baseline (Domain Justification): We treat the relationship between Isolation (ADNT) and Centrality (ADAT) as a fixed geometric constraint of high-level CS2, defined by the global population of elite professionals. By fitting on the full stable cohort, we establish a canonical "Ground Truth" for standard positioning. The residual, therefore, measures a player's deviation from the professional norm, not just a statistical deviation from a local training fold.
Unsupervised Nature: The leakage is strictly limited to the relationship between independent variables ($X \to X$). No information regarding the target variable (Player Roles) is leaked. The model is not learning "the answer" from the test set, only the "scale" of the input features.
Feature Stability: Given the small sample size ($N=84$), calculating regression coefficients within each small CV fold (~63 players) would introduce high variance, causing the definition of the feature to shift wildly between folds. A global fit ensures the feature definition remains consistent and interpretable across the analysis.
5.2 Results Synthesis¶
Baseline & Feature Sets:
- Our baseline Logistic Regression significantly outperformed the stratified dummy classifier (Cohen's $d \approx 5.7$ for T-side, $\approx 3.1$ for CT-side), proving that these features capture real playstyle differences, not random noise.
- No clear "winner" emerged among the feature sets (Raw, Orthogonal, Full) at the baseline stage; differences between these sets were within 0.02 F1 (~1-2 players). This justified retaining all three sets for advanced model testing.
- The exclusion of "Ambiguous" roles (Half-Lurker on T-side, Mixed on CT-side) improved F1-Macro by +0.28 (T) and +0.21 (CT). This confirms that while "Core" roles (Lurker, Spacetaker, Anchor, Rotator, AWPer) are statistically distinct, the ambiguous roles represent hybrid states that blur the boundaries of our taxonomy.
Model Performance & Nature:
- T-Side: Best modelled by Logistic Regression (F1 ≈ 0.92). The roles are linearly separable, with all features (including trade metrics like POKT) contributing useful signal. Simplicity wins here: complex ensembles offered no improvement.
- CT-Side: Best modelled by Random Forest (F1 ≈ 0.75). Defensive roles require non-linear decision boundaries, relying heavily on just four metrics (positioning and aggression: ADNT, ADAT_Residual, OAP, TAPD) while trade metrics provide little discriminatory value.
5.3 Key Insights & The "IGL Confound"¶
The misclassifications themselves provided some of the richest insights:
The "IGL Centrality" Phenomena: A cluster of Rotator-IGLs (biguzera, chopper, bLitz, apEX) were confidently misclassified as AWPers. Our analysis revealed this is due to Low ADAT Residuals: IGLs tend to play "central" positions to maximise information processing. The model confuses this centralised support-rotator positioning with the nature of defensive AWPing.
Outliers vs Archetypes: Ultimate stands out as an exceptionally aggressive AWPer. His classification as a Spacetaker/Rotator (despite being an AWPer) highlights his statistical uniqueness, an aggressive sniper (
OAP > 30%) who breaks the mould of the passive hold.Diagnostic Value: High-confidence errors often pointed to ground truth labelling issues (e.g., Spinx labelled as Anchor but exhibiting Rotator statistics). This validates the model's potential as a "sanity check" tool for manual labelling efforts.
5.4 Conclusion & Future Directions¶
Data Integrity Note: While the base dataset (data/raw/cs2_playstyle_roles_2024.csv) has been corrected, this notebook deliberately injected the original errors for HooXi and Spinx. This "controlled fault injection" allows us to demonstrate the model's diagnostic value: by flagging these players as confident misclassifications (e.g., Spinx confidently predicted as Rotator despite the injected "Anchor" label), the model successfully identified the data quality issues.
This notebook serves as a successful proof of concept: player roles in CS2 are not just theoretical labels but measurable statistical clusters. We have provided validation for:
- A. The role labels (they correspond to distinct behavioral profiles).
- B. The features (they contain discriminatory signal).
- C. The modelling approach (ML classification is feasible even with limited data).
It's not perfect—we obviously got some misclassifications even with the best models—but as a first pass at ML role classification, it has served its purpose.
Limitations & Future Work:
- Data Availability: We are constrained by the organisation of the professional scene. Lower-tier teams often lack the consistent role structures of elite teams (roles may be less distinct or more fluid), and parsing enough maps (>40) for stable metrics is a challenge for teams that don't attend many events. Furthermore, accurate role labels require manual curation (e.g., Harry Richards' lovely Positions Database), which cannot scale to all teams.
- Feature Expansion: Future iterations could include weapon-specific features (e.g., "% Kills with AWP" would obviously significantly boost AWPer classification and resolve the Ultimate/IGL confusion) or more granular/creative data that could reveal subtler differences between rifler playstyles.
- Commercial Application: Finally, this modelling approach has potential for broader application. Integrating role classification into commercial statistics services for casual players (e.g., Leetify, Scope.gg, CSStats) could satisfy the "personal identity" aspect of Blumler and Katz's Uses and Gratifications Theory. A feature like "What's my Role?" bridges the gap between abstract statistics and personal narrative, giving players a richer understanding of their own gameplay.