University Projects

An overview of the most relevant university projects/coursework from my Data Science BSc at the University of Exeter.

Data Science · AI · Machine Learning · Deep Learning · Statistics · Database Theory · Data Visualisation · Python · R · SQL · Java · Google Cloud Platform · + more!

Introduction

During my degree I completed a range of coursework and projects that gave me practical experience with data science, machine learning, and database theory/design. While many of my modules were statistics-focused and built a strong theoretical foundation, this page highlights the more applied projects where I used those skills to solve problems and build working solutions.

Alongside these, I’ve also included a short section on other coursework that was valuable for learning and broadening my skillset. Together, they demonstrate both my core strengths and the variety of methods I explored during my studies.


Modules Taken

Here is a list of all the modules I took at university. If you want to know more about what these modules involved, you can find out more here: Mathematics/Statistics | Data/Computer Science.

Third Year
  • COM3022 — Data Science Individual Project
  • COM3021 — Data Science at Scale
  • COM3023 — Machine Learning and AI
  • ECM3420 — Learning from Data
  • MTH3024 — Stochastic Processes
  • MTH3028 — Statistical Inference
  • MTH3041 — Bayesian Statistics, Philosophy and Practice
Second Year
  • ECM2419 — Database Theory and Design
  • COM2014 — Computational Intelligence
  • COM2011 — Machine Learning and Data Science
  • COM2013 — Data Science Group Project 2
  • ECM2414 — Software Development
  • COM2012 — Data Science in Society
  • MTH2006 — Statistical Modelling and Inference
First Year
  • COM1012 — Data Science Group Project 1
  • ECM1410 — Object-Oriented Programming
  • MTH1002 — Mathematical Methods
  • MTH1004 — Probability, Statistics and Data
  • COM1011 — Fundamentals of Machine Learning
  • ECM1400 — Programming

Featured Projects


Data Science Individual Project — Analysing Playstyle in Counter-Strike

Python · Machine Learning · Feature Engineering · Clustering · Data Visualisation · Exploratory Data Analysis · Logistic Regression · Random Forest · Linear Regression · Pandas · Matplotlib · Scikit-Learn · Seaborn · Plotly · Jupyter

For my final-year project (equivalent of a dissertation) I tackled a problem in a space I’m genuinely passionate about: using data science to understand professional Counter-Strike (the videogame). The project set out to measure the link between playstyles (how a player plays the game) and roles (the responsibilities they are expected to fulfil in a team).

TL;DR

This was a machine learning-based data analysis. I showed that data-driven playstyle profiles meaningfully align with expert role labels while also revealing fluidity and overlap in how roles manifest. To achieve this, I designed a bespoke workflow to process ~400GB of raw match data, engineered custom features, and applied a mix of EDA, linear regression analysis, basic classification, and multiple clustering methods, which I assessed with internal and external quality metrics (Silhouette, Calinski–Harabasz, NMI, ARI).


Data Processing

To do this, I built a bespoke data processing workflow. Starting with over 400GB of raw "demofiles" (gameplay data) from tournaments in 2024, I used the awpy Python library to parse the game events and then engineered a custom set of behavioural features. These weren’t performance metrics like K/D; they were designed to quantify how a player approached the game, not how well:

  • Aggression (time alive per death, opening duel attempts)
  • Trading (proportion of kills and deaths traded)
  • Positioning (two novel distance-based metrics I developed to capture how close players stayed to their teammates)

Figure: Diagram showing the data processing workflow.

All processing was done locally in Python, with careful validation (manual demo checks and cross-referencing HLTV stats) to ensure the features were robust. The result was a clean player-level dataset representing playstyle alone.
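To make the idea of a distance-based positioning feature concrete, here is a minimal sketch. It is not the project's actual metric (the two novel metrics are not specified on this page): the function names, the array layout, and the choice of "distance to the nearest teammate" are all assumptions made for illustration.

```python
import numpy as np


def mean_nearest_teammate_distance(positions: np.ndarray, player: int) -> float:
    """Average distance from one player to their closest teammate.

    positions: hypothetical array of shape (ticks, players, 2) holding the
    x/y coordinates of every player on one team at each sampled tick.
    """
    own = positions[:, player, :]                    # (ticks, 2)
    others = np.delete(positions, player, axis=1)    # (ticks, players - 1, 2)
    # Distance from the player to every teammate at every tick.
    dists = np.linalg.norm(others - own[:, None, :], axis=2)
    # Keep the closest teammate per tick, then average over ticks.
    return float(dists.min(axis=1).mean())


def time_alive_per_death(seconds_alive: float, deaths: int) -> float:
    """Aggression proxy: seconds survived per death (guards against zero deaths)."""
    return seconds_alive / max(deaths, 1)
```

A low nearest-teammate distance would suggest a player who stays with the pack, while a high value would point towards a more isolated style; how the real features were defined and normalised may differ.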

Data analysis methodology

I carried out extensive exploratory analysis on the engineered features. This included stability checks to ensure they were consistent across sufficient samples, correlation analysis to study relationships between them, and a linear regression on the positional metrics, stratified by role, to determine whether the two metrics captured distinct information. Principal component analysis (PCA) and t-SNE were used to visualise the feature space. I also ran preliminary classification experiments with Logistic Regression and Random Forests to test how well the features could predict role labels. Throughout this process I generated a large number of plots (distributions, correlations, regressions, and role comparisons) to visualise and validate the behaviour of the features.
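The role-stratified regression can be illustrated with a short sketch: fit a separate straight line per role and compare slopes and correlations. This is an assumption about the general shape of the analysis, not the project's code; `fit_by_role` and the toy numbers are hypothetical.

```python
import numpy as np


def fit_by_role(x, y, roles):
    """Fit y = slope * x + intercept separately within each role.

    Returns {role: (slope, intercept, pearson_r)}.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    roles = np.asarray(roles)
    results = {}
    for role in np.unique(roles):
        mask = roles == role
        slope, intercept = np.polyfit(x[mask], y[mask], deg=1)
        r = np.corrcoef(x[mask], y[mask])[0, 1]
        results[str(role)] = (float(slope), float(intercept), float(r))
    return results


# Toy values only: two made-up positional metrics for two made-up roles.
metric_a = [0.0, 1.0, 2.0, 0.0, 1.0, 2.0]
metric_b = [1.0, 3.0, 5.0, 4.0, 3.0, 2.0]
roles = ["A", "A", "A", "B", "B", "B"]
fits = fit_by_role(metric_a, metric_b, roles)
```

If the correlation were close to ±1 with a similar slope in every role, the two metrics would be largely redundant; differing slopes or weak correlations in some roles suggest they carry distinct information.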

I then moved on to several clustering methods (K-Means, Hierarchical, Gaussian Mixture Models) to see if natural “archetypes” emerged. Cluster quality was assessed using internal validation metrics such as the Silhouette Coefficient and Calinski–Harabasz Index, while external comparison with expert-assigned role labels was quantified using Normalized Mutual Information (NMI) and the Adjusted Rand Index (ARI). Together, these steps created a rigorous analytical framework that contextualised the clustering results and highlighted both the structure and limitations of the data.
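The evaluation described above can be sketched with scikit-learn. This is a simplified illustration on synthetic data, assuming K-Means and standardised features; the helper name `evaluate_clustering`, the role names used as labels, and the data are placeholders rather than the project's dataset or exact settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    adjusted_rand_score,
    calinski_harabasz_score,
    normalized_mutual_info_score,
    silhouette_score,
)
from sklearn.preprocessing import StandardScaler


def evaluate_clustering(features, role_labels, n_clusters, seed=0):
    """Cluster standardised features and score the result.

    Internal metrics (silhouette, Calinski-Harabasz) use only the data;
    external metrics (NMI, ARI) compare clusters with the given labels.
    """
    X = StandardScaler().fit_transform(features)
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    return {
        "silhouette": silhouette_score(X, clusters),
        "calinski_harabasz": calinski_harabasz_score(X, clusters),
        "nmi": normalized_mutual_info_score(role_labels, clusters),
        "ari": adjusted_rand_score(role_labels, clusters),
    }


# Synthetic stand-in for the player-level dataset: two well-separated groups.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(0.0, 0.5, (30, 3)), rng.normal(5.0, 0.5, (30, 3))])
role_labels = ["AWPer"] * 30 + ["Rifler"] * 30
scores = evaluate_clustering(features, role_labels, n_clusters=2)
```

On real playstyle data the external scores would be expected to sit well below 1, since roles overlap; the same function could be called with a hierarchical or Gaussian mixture model swapped in for K-Means.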

Figure: PCA projection of the T-side player features, with roles colour-labelled.

Interpretations

The analysis showed clear spectra for rifler roles, distinct separation of snipers (AWPers), and measurable though overlapping groupings that aligned with expert role labels from the CS community. In short, I was able to demonstrate that with just a few carefully engineered features, data-driven playstyle profiles corresponded with how analysts describe roles, while also highlighting ambiguities and role fluidity. T-side roles proved more distinguishable than the CT-side, where playstyles were less separable and clusters weaker — a finding that mirrors the greater individual flexibility demanded on the defensive side. Notably, T-side AWPers consistently formed the most distinct cluster, showing that their behavioural signature was strong enough to emerge from positioning and engagement patterns alone. Meanwhile, rifler roles often appeared along spectra (e.g. Spacetaker ↔ Lurker, Anchor ↔ Rotator) rather than as discrete groups, reinforcing that playstyles exist on a continuum rather than in fixed categories.

Conclusion

This project gave me experience with a full data workflow — messy raw data acquisition, custom feature engineering, validation, clustering, and analysis — and produced a unique dataset and methodology in a field that hasn’t seen much academic work yet. It was the most meaningful piece of coursework I did, combining technical depth with a domain I have relative expertise in.


Data Science at Scale — Vertex AI Workflow on Fashion-MNIST

Google Cloud · Vertex AI · Cloud Storage · TensorFlow · Python · Neural Networks · Machine Learning · Model Deployment · Deep Learning · Data Visualisation

This coursework used Fashion-MNIST to build a cloud-backed image classification pipeline and compare a deployed model with a locally loaded baseline and a small model I trained myself.

Figure: Google Cloud icons and sample images from the dataset.

I uploaded the provided artefacts to Cloud Storage, registered a pre-trained model in Vertex AI, and deployed it to an endpoint. From a Vertex AI Workbench notebook, I wrote Python to preprocess test images, call the endpoint, and record metrics — precision, recall, F1-score, and accuracy. I then loaded the second pre-trained model locally with TensorFlow/Keras and ran the same evaluation for a like-for-like comparison; both pre-trained paths delivered similar results (weighted F1 ≈ 0.88).
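For readers unfamiliar with the weighted F1 figure quoted above, the sketch below shows how it is defined: per-class F1 scores averaged with weights proportional to each class's share of the true labels. In practice a library routine would normally be used (it should agree with scikit-learn's `f1_score(..., average="weighted")`); the function name and the toy labels here are illustrative only.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)          # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total   # weight by the class's share of samples
    return score


# Toy labels, not Fashion-MNIST predictions.
example = weighted_f1([0, 0, 0, 1], [0, 0, 1, 1])
```

Running the same function over the predictions from the endpoint and from the locally loaded model is what makes the comparison like-for-like.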

To extend the work, I trained a compact convolutional neural network and exported it for reuse. The custom model reached a weighted F1 ≈ 0.89, a negligible lift over the baselines. Given the extra setup and higher credit spend, I favoured the pre-trained baseline for price–performance, and would deploy to a Vertex AI endpoint when a managed, real-time API is required.


Database Theory & Design

Database Design · ER Modelling · MySQL · SQL · Joins · Subqueries · Views · Stored Procedures · Transactions
Figure: Entity-relationship diagram for the ticket-booking system.

This coursework involved designing and implementing a ticket-booking database in MySQL, including entity-relationship modelling, relational schema design, data population, and the creation of queries and update operations to demonstrate its use. The system was structured to manage customers, events, venues, ticket types, vouchers, and bookings with defined primary and foreign keys.

After designing and initialising the database, I implemented queries and update operations to show the system in use. Reporting queries included availability checks, event listings within given timeframes, sales volumes by event, and revenue summaries, alongside booking-specific views. Update tasks modelled booking workflows such as creating and cancelling bookings, applying vouchers, and managing ticket stock. To support this I used joins, subqueries, views, stored procedures, transactions, and indexes, ensuring efficiency, accuracy, and data integrity.
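The booking workflow and a revenue report can be sketched as follows. Note the substitutions: the coursework used MySQL, whereas this example uses Python's built-in `sqlite3` module so that it runs without a server, and the two-table schema, column names, and function names are simplified, hypothetical stand-ins for the real design.

```python
import sqlite3


def create_schema(conn):
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
    conn.executescript(
        """
        CREATE TABLE ticket_type (
            id         INTEGER PRIMARY KEY,
            event_name TEXT    NOT NULL,
            price      REAL    NOT NULL,
            stock      INTEGER NOT NULL CHECK (stock >= 0)
        );
        CREATE TABLE booking (
            id             INTEGER PRIMARY KEY,
            ticket_type_id INTEGER NOT NULL REFERENCES ticket_type (id),
            quantity       INTEGER NOT NULL CHECK (quantity > 0)
        );
        """
    )


def book(conn, ticket_type_id, quantity):
    """Record a booking and reduce stock in one transaction.

    Returns False (and leaves the database unchanged) if a constraint fails,
    for example when there is not enough stock.
    """
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT INTO booking (ticket_type_id, quantity) VALUES (?, ?)",
                (ticket_type_id, quantity),
            )
            conn.execute(
                "UPDATE ticket_type SET stock = stock - ? WHERE id = ?",
                (quantity, ticket_type_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def revenue_by_event(conn):
    """Reporting query: total revenue per event, using a join and aggregation."""
    return conn.execute(
        """
        SELECT t.event_name, SUM(b.quantity * t.price)
        FROM booking AS b
        JOIN ticket_type AS t ON t.id = b.ticket_type_id
        GROUP BY t.event_name
        ORDER BY t.event_name
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
create_schema(conn)
with conn:
    conn.execute(
        "INSERT INTO ticket_type (id, event_name, price, stock) "
        "VALUES (1, 'Concert', 25.0, 3)"
    )
first = book(conn, 1, 2)    # enough stock
second = book(conn, 1, 2)   # would oversell, so the whole booking is rolled back
```

Putting the insert and the stock update in a single transaction is the point of the example: the failed second booking leaves neither a stray booking row nor negative stock.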

In future projects I aim to advance from local coursework implementations to deploying live databases in production, while also applying techniques such as CTEs and window functions to write more expressive and scalable queries.


Machine Learning and AI — Bayesian Neural Networks for Flood Hazard Prediction

Machine Learning · Deep Learning · Bayesian Neural Networks · Variational Approximation · MCMC · PyTorch · Pyro · Python

This coursework involved Bayesian deep learning for binary classification. The task was to compare Variational Approximation (VA) and Markov Chain Monte Carlo (MCMC) for Bayesian Neural Networks (BNNs) on a flood hazard dataset of ~67k samples, where only ~7% were labelled hazardous.

The data was cleaned, standardised, and one-hot encoded, and the models were trained with class weights to address imbalance. A small feed-forward BNN with two hidden layers and Normal priors was implemented in PyTorch/Pyro to keep the setup consistent across both methods.
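As an illustration of class weighting at this level of imbalance, the sketch below uses the common "balanced" heuristic (the same formula scikit-learn applies for `class_weight="balanced"`). Whether the coursework derived its weights this way is an assumption; the function name and the 7/93 toy split are illustrative.

```python
from collections import Counter


def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c), so rare classes weigh more."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels mirroring a roughly 7% positive rate.
labels = [1] * 7 + [0] * 93
weights = balanced_class_weights(labels)
```

With these weights each class contributes the same total weight to the loss, so a model that only predicts the majority class is penalised for every missed hazardous sample about thirteen times more heavily than for a false alarm.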

The VA model collapsed to predicting only the majority class, while the MCMC model converged and achieved a modest F1 ≈ 0.57 at heavy computational cost. The limited results likely reflected both the difficulty of the imbalanced dataset and the simplicity of the chosen architecture, highlighting the practical challenges of applying Bayesian inference in this setting.


Learning from Data — Sentiment Classification on Steam Reviews

Machine Learning · Natural Language Processing · Text Classification · TF-IDF · Word2Vec · Logistic Regression · Support Vector Machine · Random Forest · XGBoost · Scikit-Learn · Python

I carried out a self-directed sentiment classification project on Steam game reviews using Python and standard machine learning libraries. The task was to take raw, informal text and predict review polarity with a straightforward, reproducible setup. This module was designed as an introduction for MSc Data Science students and final-year computer science undergraduates, so the assignment was relatively simple.

The experiment involved text cleaning (normalisation, tokenisation, lemmatisation), feature extraction with TF-IDF and Word2Vec, and training models including Logistic Regression, Linear SVM, Random Forest, and XGBoost. Models were trained on a train/test split with class weighting applied for imbalance, and evaluated with precision, recall, F1-score, and accuracy. The work could have been improved with a validation split, though the aim was mainly to apply standard NLP techniques.
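One of the model configurations described above (TF-IDF features feeding a class-weighted Logistic Regression) might look roughly like this in scikit-learn. The reviews are invented toy examples, and the preprocessing is reduced to the vectoriser's built-in lowercasing and tokenisation, so this is a sketch of the approach rather than the coursework pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for Steam reviews (1 = positive, 0 = negative).
train_texts = [
    "great fun game",
    "love this game",
    "really great, love it",
    "terrible buggy game",
    "boring waste of money",
    "terrible and boring",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["great fun, love it", "boring and terrible"]

# Vectoriser and classifier are chained so the vocabulary and IDF weights
# are learned from the training texts only.
model = make_pipeline(
    TfidfVectorizer(lowercase=True),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(train_texts, train_labels)
predictions = model.predict(test_texts)
```

Wrapping both steps in one pipeline also makes it straightforward to add the validation split mentioned above, since the whole object can be passed to cross-validation utilities without leaking test vocabulary.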


Other Notable Coursework


Python · Pandas · NumPy · Matplotlib · Scikit-Learn · Data Cleaning · Feature Engineering · Vectorisation (TF–IDF) · Cross-Validation · Precision/Recall/F1 · Logistic Regression · Naïve Bayes · SVM · Gradient Boosting · Unsupervised Learning · Clustering · Dimensionality Reduction · Simulation Modelling · Agent-Based Modelling · Sentiment Analysis · NLP
  • Computational Intelligence: Built an agent-based simulation of gym usage in NetLogo, parameterised with real-world data and validated against observed patterns. Used the model to explore how equipment availability and arrival rates affect wait times, showing how simulation can provide operational insights.
    Figure: Interface of the agent-based gym simulation developed in the Computational Intelligence module.
  • Machine Learning and Data Science: Implemented a custom dimensionality reduction algorithm (Classical MDS) and applied K-Means clustering to visualise and analyse structure in the dataset. This gave hands-on experience with unsupervised learning and mathematical implementation of algorithms.

  • Data Science Group Project: Worked in a team to build sentiment classifiers for IMDB movie reviews. Applied TF–IDF vectorisation and trained models including Logistic Regression, Naïve Bayes, SVM, and Gradient Boosting, evaluating them with cross-validation and standard metrics.
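
For the Classical MDS implementation mentioned in the Machine Learning and Data Science item, the standard algorithm can be sketched in a few lines of NumPy. This follows the textbook formulation (double-centre the squared distances, then take the top eigenvectors) and is not the coursework submission; the function name and test points are illustrative.

```python
import numpy as np


def classical_mds(distances, n_components=2):
    """Embed points in n_components dimensions from a pairwise distance matrix."""
    D = np.asarray(distances, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centring matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip tiny negative eigenvalues caused by rounding or non-Euclidean input.
    top = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top)


# Four corners of a 3-by-4 rectangle; only their distances are passed in.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
embedding = classical_mds(D, n_components=2)
```

The embedding reproduces the original pairwise distances, though the coordinates themselves are only recovered up to rotation and reflection. K-Means could then be run on `embedding` to visualise cluster structure, as in the coursework description.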