Spotify Case Study

DATA SCIENCE / MACHINE LEARNING

Explored multiple machine learning paradigms on a 114k-track Spotify dataset, progressing from classical OLS regression to log- and Box–Cox-transformed models and then to Weighted Least Squares to address heteroskedasticity. Formulated one research question as a regression task (the impact of a hit song on album popularity) and another as a classification task (predicting collaborations), incorporating logistic-style setups, hypothesis testing, and class-imbalance strategies. Prioritized rigorous diagnostics and validation, using Q–Q plots, residual and heteroskedasticity checks (Breusch–Pagan, LOWESS), and error metrics (MSE, RMSE, MAE, R²) to benchmark and select models.

Data Science and Machine Learning

3 months

TECHNICAL SKILLS AND METHODS

Python Data Science Stack

  • Core Libraries: pandas, NumPy, matplotlib, SciPy, scikit-learn, statsmodels

Data Preparation & Transformation

  • Data Cleaning: Handling missing values and duplicates

  • Feature Engineering: Creation and transformation of meaningful variables

  • Handling Missing Data: Imputation techniques and analysis

  • Outlier Analysis: Detection and treatment methods

  • Box–Cox Transformation: Normalization of non-normal data

  • Feature Creation (Classification): For example, a collaboration flag variable

  • Class Imbalance Handling: Resampling (SMOTE, undersampling) or weighting strategies
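As an illustration of the Box–Cox step listed above, here is a minimal sketch on synthetic data. The lognormal sample is a stand-in rather than the Spotify data, and the variable names are hypothetical. It uses `scipy.stats.boxcox`, which estimates the transformation parameter lambda by maximum likelihood when none is supplied.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed, strictly positive data (a stand-in, not the Spotify data).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=5_000)

# Box-Cox requires strictly positive input; a column that contains zeros
# would need a shift first (for example x + 1).
x_bc, lam = stats.boxcox(x)  # lambda is estimated by maximum likelihood

skew_before = stats.skew(x)
skew_after = stats.skew(x_bc)
print(f"lambda={lam:.3f}, skew before={skew_before:.2f}, after={skew_after:.2f}")
```

A skewness close to zero after the transform suggests the variable is closer to symmetric, which is what the later normality diagnostics check more formally.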

Exploratory Data Analysis & Visualization

  • Exploratory Techniques: Summary statistics, correlation analysis

  • Data Visualization: Plotting with matplotlib and seaborn

  • Diagnostic Plots: Q–Q plots, residual plots, LOWESS smoothing

Supervised Learning Frameworks

  • Regression Models:

    • Linear Regression (OLS, WLS)

    • Logistic Regression (for classification tasks)

  • Training Process: Train-test split and cross-validation

Model Evaluation & Diagnostics

  • Error Metrics: MSE, RMSE, MAE, R²

  • Residual Diagnostics: Pattern assessment, normality, and variance checks

  • Heteroskedasticity Tests: Breusch–Pagan test and corrective approaches such as Weighted Least Squares

KEY QUESTIONS EXPLORED

1. Hit songs and albums
This question tested whether having a hit song (a track in the top 25% by popularity) on an album boosts the average popularity of the album's other tracks. Using regression (OLS, log-transformed, Box–Cox, WLS) and diagnostics, the analysis found only a weak, slightly negative relationship and no meaningful uplift for non-hit songs.
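One way the hit definition and the album-level target could be constructed is sketched below with pandas. The toy table and the column names `album_name` and `popularity` are assumptions for illustration, and the project's actual feature engineering may have differed.

```python
import pandas as pd

# Toy stand-in for the track table; column names are assumed, not confirmed.
tracks = pd.DataFrame({
    "album_name": ["A", "A", "A", "B", "B", "C", "C", "C"],
    "popularity": [90, 40, 50, 30, 20, 85, 60, 10],
})

# A "hit" is a track in the top 25% of popularity across the whole table.
threshold = tracks["popularity"].quantile(0.75)
tracks["is_hit"] = tracks["popularity"] >= threshold

# Per album: whether it contains any hit, and the mean popularity of its
# non-hit tracks (the quantity a hit might be expected to lift).
has_hit = tracks.groupby("album_name")["is_hit"].any().rename("has_hit")
non_hit_mean = (
    tracks.loc[~tracks["is_hit"]]
    .groupby("album_name")["popularity"]
    .mean()
    .rename("non_hit_mean_popularity")
)
albums = pd.concat([has_hit, non_hit_mean], axis=1)
print(albums)
```

The resulting album-level table is the kind of input the regression models could then be fitted on.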

2. Collaborations vs solo tracks
This question asked how musical attributes differ between collaborative and solo tracks. A collaboration flag was engineered, feature distributions were explored, and a supervised (logistic-style) classifier was set up to test whether collaboration status is predictable from audio features.
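The collaboration setup can be sketched as follows. Several things here are assumptions: that multiple artists appear in a single `artists` string separated by ";", that collaborations are the minority class, and the synthetic feature shifts, which are made up rather than measured. The classifier uses scikit-learn's `LogisticRegression` with `class_weight="balanced"` as one weighting strategy.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Assumed convention: multiple artists in one string, separated by ";".
toy = pd.DataFrame({"artists": ["Artist A", "Artist A;Artist B", "Artist C"]})
toy["is_collab"] = toy["artists"].str.contains(";", regex=False).astype(int)

# Synthetic, imbalanced stand-in for audio features (not real Spotify data).
rng = np.random.default_rng(7)
n = 2_000
y = (rng.random(n) < 0.15).astype(int)            # roughly 15% collaborations
danceability = rng.normal(0.55 + 0.10 * y, 0.15)  # hypothetical shift for collabs
valence = rng.normal(0.50 - 0.05 * y, 0.20)
X = np.column_stack([danceability, valence])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" reweights classes inversely to their frequency,
# an alternative to resampling with SMOTE or undersampling.
clf = LogisticRegression(class_weight="balanced").fit(X_train, y_train)
score = balanced_accuracy_score(y_test, clf.predict(X_test))
print(f"balanced accuracy: {score:.3f}")
```

Balanced accuracy is used here because plain accuracy can look high on imbalanced data even when the minority class is never predicted.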

3. Toward recommendations
A third stretch goal was to explore the feasibility of building a simple recommendation system from user inputs and track attributes. This framed the dataset as a foundation for future recommender or similarity-based models rather than a full production system.
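To make the similarity-based idea concrete, here is a minimal sketch using cosine similarity over a small feature matrix. The track names, feature values, and the `most_similar` helper are all hypothetical. Since the recommender was left as future work, this shows only one possible starting point.

```python
import numpy as np

# Hypothetical feature matrix: one row per track, columns are audio features
# already scaled to comparable ranges (values are made up for illustration).
track_names = ["track_a", "track_b", "track_c", "track_d"]
features = np.array([
    [0.80, 0.90, 0.70],
    [0.75, 0.85, 0.65],
    [0.20, 0.10, 0.30],
    [0.50, 0.40, 0.90],
])

def most_similar(query_index, features, top_k=2):
    """Return indices of the top_k rows most cosine-similar to the query row."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = unit @ unit[query_index]
    sims[query_index] = -np.inf  # exclude the query track itself
    return np.argsort(sims)[::-1][:top_k]

neighbours = most_similar(0, features)
print([track_names[i] for i in neighbours])
```

In this toy example the nearest neighbours of `track_a` are `track_b` and then `track_d`.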

KEY FINDINGS

1. Hit songs and albums

Having a hit song on an album does not meaningfully increase the popularity of the album's other tracks; the statistical relationship between hit and non-hit popularity is weak and slightly negative.

2. Collaborations vs solo tracks
Collaborative tracks differ systematically from solo tracks in their audio characteristics, and collaboration status is at least partially predictable from features like danceability, valence, and other Spotify audio metrics.

3. Toward recommendations

The dataset and feature engineering choices make it feasible to extend this work into simple recommendation or similarity-based systems, although building a full recommender was left as future work.

CONTACT@VOID.COM
