Spotify Case Study
DATA SCIENCE / MACHINE LEARNING
Explored multiple machine learning paradigms on a 114k‑track Spotify dataset, progressing from classical OLS regression to log‑ and Box–Cox‑transformed models, then to Weighted Least Squares (WLS) to address heteroskedasticity. Framed one research question as a regression task (the impact of a hit song on album popularity) and another as a classification task (predicting collaborations), incorporating logistic‑style setups, hypothesis testing, and class‑imbalance strategies. Prioritized rigorous diagnostics and validation, using Q–Q plots, LOWESS‑smoothed residual plots, and the Breusch–Pagan heteroskedasticity test, together with error metrics (MSE, RMSE, MAE, R²), to benchmark and select models.
Data Science and Machine Learning
3 months
TECHNICAL SKILLS AND METHODS
Python Data Science Stack
Core Libraries:
pandas, NumPy, matplotlib, SciPy, scikit-learn, statsmodels
Data Preparation & Transformation
Data Cleaning: Handling missing values and duplicates
Feature Engineering: Creation and transformation of meaningful variables
Handling Missing Data: Imputation techniques and analysis
Outlier Analysis: Detection and treatment methods
Box–Cox Transformation: Normalization of non-normal data
Feature Creation (Classification): Example — “collaboration flag” variable
Class Imbalance Handling: Resampling (SMOTE, undersampling) or weighting strategies
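As a rough illustration of the Box–Cox step listed above, the sketch below applies scipy.stats.boxcox to a synthetic right-skewed array. The data, the variable names, and the +1 shift (Box–Cox needs strictly positive inputs, while a popularity-like score can be zero) are assumptions for illustration, not the project's actual code.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

rng = np.random.default_rng(0)

# Synthetic right-skewed stand-in for a popularity-like column (hypothetical data).
popularity = rng.gamma(shape=2.0, scale=10.0, size=1000)

# Box-Cox requires strictly positive values, so shift by 1 before transforming.
shifted = popularity + 1.0

# With no lambda supplied, boxcox estimates it by maximum likelihood
# and returns the transformed data together with the fitted lambda.
transformed, lam = stats.boxcox(shifted)

skew_before = stats.skew(shifted)
skew_after = stats.skew(transformed)

# The transform is invertible, so predictions can be mapped back to the original scale.
recovered = inv_boxcox(transformed, lam) - 1.0

print(f"lambda={lam:.3f}, skew before={skew_before:.3f}, skew after={skew_after:.3f}")
```

Comparing the skewness before and after is a quick numeric check that the transform moved the distribution closer to symmetric; a Q–Q plot gives the visual equivalent.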
Exploratory Data Analysis & Visualization
Exploratory Techniques: Summary statistics, correlation analysis
Data Visualization: Plotting with matplotlib and seaborn
Diagnostic Plots: Q–Q plots, residual plots, LOWESS smoothing
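The numbers behind the diagnostic plots listed above can be computed without drawing anything. The following is a minimal sketch on synthetic residuals (hypothetical data and variable names): scipy.stats.probplot returns the Q–Q coordinates plus a straight-line fit, and the statsmodels lowess function returns the smoothed residual trend that would be overlaid on a residual-vs-fitted plot.

```python
import numpy as np
from scipy import stats
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(1)

# Synthetic fitted values and well-behaved residuals (assumed for illustration).
fitted = rng.uniform(0, 100, size=500)
residuals = rng.normal(0, 5, size=500)

# Q-Q data: theoretical normal quantiles vs ordered residuals, plus a line fit.
# r close to 1 means the points lie near a straight line (approximately normal).
(theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(
    residuals, dist="norm"
)

# LOWESS trend of residuals against fitted values.
# Returns an (n, 2) array sorted by fitted value: column 0 is x, column 1 is the smooth.
smoothed = lowess(residuals, fitted, frac=0.3)

print(f"Q-Q correlation r={r:.4f}")
print(f"largest absolute LOWESS trend value={np.abs(smoothed[:, 1]).max():.3f}")
```

A LOWESS curve that stays close to zero suggests no systematic pattern in the residuals; a curve that bends or fans out points to misspecification or non-constant variance.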
Supervised Learning Frameworks
Regression Models:
Linear Regression (OLS, WLS)
Logistic Regression (for classification tasks)
Training Process: Train–test split and cross-validation
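A train–test split with cross-validation, scored with the MSE, RMSE, MAE, and R² metrics used in this project, might look like the sketch below. The synthetic features, coefficients, and split sizes are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(42)

# Synthetic regression problem: three features, linear signal plus noise (hypothetical).
X = rng.uniform(0, 1, size=(400, 3))
y = 10 + X @ np.array([5.0, -3.0, 2.0]) + rng.normal(0, 0.5, size=400)

# Hold out 25% of the rows for a final, untouched evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE, in the units of y
mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)

# 5-fold cross-validation on the training data only; scikit-learn reports
# negated RMSE so that "higher is better", hence the sign flip.
cv_rmse = -cross_val_score(
    LinearRegression(), X_train, y_train,
    cv=5, scoring="neg_root_mean_squared_error",
)

print(f"MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f} R2={r2:.3f}")
print("CV RMSE per fold:", np.round(cv_rmse, 3))
```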
Model Evaluation & Diagnostics
Error Metrics: MSE, RMSE, MAE, R²
Residual Diagnostics: Pattern assessment, normality, and variance checks
Heteroskedasticity Tests: Breusch–Pagan test and corrective approaches
KEY QUESTIONS EXPLORED
1. Hit songs and albums
The first question tested whether having a “hit” song (top 25% popularity) on an album boosts the average popularity of the album’s other tracks. Using regression (OLS, log-transformed, Box–Cox, WLS) and diagnostics, the analysis found only a weak, slightly negative relationship and no meaningful uplift for non-hit songs.
2. Collaborations vs solo tracks
The second question asked how musical attributes change when artists collaborate versus release solo tracks. A collaboration flag was engineered, feature distributions were explored, and a supervised learning setup (logistic-style) was built to test whether collaboration status is predictable from audio features.
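A minimal sketch of this setup is shown below. It assumes that an artists column joins multiple artist names with ";" (an assumption about the data format), and it uses synthetic danceability and valence values with a 10% collaboration rate to show one of the class-imbalance strategies, class weighting. None of the values come from the project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Collaboration flag: 1 if the (assumed) ";"-separated artists field lists more than one artist.
tracks = pd.DataFrame({
    "artists": ["Artist A", "Artist B;Artist C", "Artist D", "Artist E;Artist F;Artist G"],
})
tracks["is_collab"] = tracks["artists"].str.contains(";", regex=False).astype(int)

# Synthetic, imbalanced classification data: 100 collaborations out of 1000 tracks.
rng = np.random.default_rng(3)
n, n_collab = 1000, 100
y = np.zeros(n, dtype=int)
y[:n_collab] = 1
danceability = 0.5 + 0.3 * y + rng.normal(0, 0.1, size=n)  # shifted for collaborations
valence = rng.uniform(0, 1, size=n)                        # uninformative feature
X = np.column_stack([danceability, valence])

# Stratify so train and test keep the same 10% collaboration rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" reweights errors inversely to class frequency,
# so the minority class is not ignored by the fit.
clf = make_pipeline(StandardScaler(), LogisticRegression(class_weight="balanced"))
clf.fit(X_train, y_train)

minority_recall = recall_score(y_test, clf.predict(X_test))
print(tracks)
print(f"recall on collaborations: {minority_recall:.2f}")
```

With imbalanced classes, recall on the minority class (or balanced accuracy) is more informative than plain accuracy, since always predicting "solo" would already score 90% here.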
3. Toward recommendations
A third stretch goal was to explore the feasibility of building a simple recommendation system from user inputs and track attributes. This framed the dataset as a foundation for future recommender or similarity-based models rather than a full production system.
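One simple form such a similarity-based model could take is nearest-neighbour lookup by cosine similarity over standardized audio features. The sketch below uses NumPy only; the recommend function, the feature choice, and the four toy tracks are hypothetical and were not part of the project.

```python
import numpy as np

def recommend(features: np.ndarray, query_index: int, k: int = 1) -> list[int]:
    """Return indices of the k tracks most similar to the query track.

    Similarity is cosine similarity between z-scored feature rows.
    """
    # Standardize each feature column so no single attribute dominates.
    z = (features - features.mean(axis=0)) / features.std(axis=0)
    # Normalize rows to unit length; dot products are then cosine similarities.
    unit = z / np.linalg.norm(z, axis=1, keepdims=True)
    sims = unit @ unit[query_index]
    # Exclude the query track itself from the results.
    sims[query_index] = -np.inf
    return np.argsort(-sims)[:k].tolist()

# Toy feature matrix: columns are danceability, energy, valence (hypothetical values).
features = np.array([
    [0.80, 0.70, 0.90],  # track 0
    [0.75, 0.72, 0.85],  # track 1: close to track 0
    [0.20, 0.30, 0.10],  # track 2
    [0.30, 0.90, 0.20],  # track 3
])

nearest = recommend(features, query_index=0, k=1)
print("most similar to track 0:", nearest)
```

Standardizing before computing similarity is a design choice: without it, features with larger ranges (tempo or loudness, for example) would dominate the comparison.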
KEY FINDINGS
1. Hit songs and albums
Having a hit song on an album does not meaningfully increase the popularity of the album’s other tracks: the regressions show only a weak, slightly negative statistical relationship between hit and non‑hit popularity.
2. Collaborations vs solo tracks
Collaborative tracks differ systematically from solo tracks in their audio characteristics, and collaboration status is at least partially predictable from features like danceability, valence, and other Spotify audio metrics.
3. Toward recommendations
The dataset and feature engineering choices make it feasible to extend this work into simple recommendation or similarity-based systems, although building a full recommender was left as future work.







