Algorithmic prediction of music trends
Creative Fields
Data Science, Machine Learning, Trend Forecasting
Responsibilities
Project Ideation, Data Scraping, Data Analysis
Location
Paris, FR
Year
2023
Keywords
music trends, algorithmic forecasting, feature extraction
Collaborators
Eva Koskova
As project manager of the team, I was responsible for creating a vision for the project and for facilitating communication between team members so that the project stayed aligned with that vision. Frustrated by the overwhelming complexity of most projects’ science communication, I set out to make my project as accessible as possible. This goal shaped every aspect of the work, from experimental design to website design.
Vision
The core vision of this project was to move beyond subjective cultural analysis and apply a data science lens to the music industry, challenging the notion that commercial success is purely accidental or dictated solely by marketing spend. We aimed to establish a statistical foundation for predicting musical virality by answering one question: can a song’s commercial success be forecast from its audio features alone? By treating the outcome as a binary classification problem (Hit or Not a Hit), we sought to build a model that not only predicts whether a track will become a hit but also shows which features drive that prediction, giving composers and producers a data-backed understanding of the attributes that resonate most strongly with modern audiences.
Methodology
The methodology centered on data preparation and supervised machine learning. We began by cleaning and preprocessing a dataset of over 41,000 tracks, each described by audio features such as danceability, energy, and loudness. After feature scaling, we trained and tuned several classification algorithms, notably Logistic Regression and K-Nearest Neighbors (KNN), to distinguish successful tracks from unsuccessful ones. Scaling matters most for KNN, whose distance calculations would otherwise be dominated by whichever features are measured on the largest numeric scale. We measured performance with standard metrics: precision, recall, and F1-score. To move the project beyond a black-box predictor, we added SHAP (SHapley Additive exPlanations) analysis, which quantifies and visualizes the contribution of each audio feature to an individual prediction.
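The workflow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes scikit-learn, uses a synthetic dataset from `make_classification` in place of the real track table, and picks hyperparameters such as `n_neighbors=15` arbitrarily.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the track table (assumption): 2,000 rows,
# 8 numeric "audio features", label 1 = hit, 0 = not a hit.
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=5, random_state=0
)

# Hold out a stratified test set so both classes appear in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Putting the scaler inside each pipeline means it is fitted on the
# training split only, so no test-set statistics leak into training.
models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "knn": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=15)
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, model.predict(X_test), average="binary"
    )
    scores[name] = {"precision": precision, "recall": recall, "f1": f1}
    print(f"{name}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With real data, `X` would hold the cleaned audio-feature columns and `y` the hit label. Tuning could then be added on top, for example with a cross-validated search over `n_neighbors`.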
Interpretability
The modeling phase yielded strong classification results, supporting the viability of predicting commercial success from precomputed audio features. The most significant outcome, however, was the interpretability provided by the SHAP analysis. It let us identify which audio characteristics (for example, a higher danceability score or a particular range of energy) pushed predictions most strongly toward or away from a hit. These findings amount to a data-driven “sonic blueprint” and offer evidence that success is not purely random. By reporting feature importance alongside its predictions, the project ended with a transparent, explainable model that exposes the statistical patterns associated with popularity in contemporary music.
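To make the idea of a per-feature contribution concrete, here is a small hand-written sketch. It does not use the `shap` library and is not the project's code. It relies on a known special case: for a linear model whose features are treated as independent, the SHAP value of feature i for a sample x is w_i * (x_i - mean(x_i)), measured in log-odds. The data is synthetic, and `feature_names` and `linear_shap_values` are placeholder names. The closed form applies to Logistic Regression only, not to KNN.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Placeholder feature names and synthetic data (assumptions for illustration).
feature_names = [f"feature_{i}" for i in range(6)]
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=4, random_state=0
)
X = StandardScaler().fit_transform(X)
model = LogisticRegression(max_iter=1000).fit(X, y)


def linear_shap_values(model, X_background, X_explain):
    """Closed-form SHAP values (log-odds) for a linear model,
    treating the features as independent of each other."""
    weights = model.coef_[0]
    return weights * (X_explain - X_background.mean(axis=0))


phi = linear_shap_values(model, X, X)

# Additivity: the base value plus all contributions recovers the model's
# log-odds output for every sample.
base_value = model.decision_function(X).mean()
reconstructed = base_value + phi.sum(axis=1)
print(np.allclose(reconstructed, model.decision_function(X)))

# Global importance: mean absolute contribution per feature, largest first.
mean_abs = np.abs(phi).mean(axis=0)
ranking = sorted(zip(feature_names, mean_abs), key=lambda p: p[1], reverse=True)
for name, value in ranking:
    print(f"{name}: {value:.3f}")
```

A positive entry in `phi` means the feature pushed that track's prediction toward "hit", and a negative entry pushed it away. Averaging absolute values across tracks gives the kind of feature-importance ranking discussed above.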
Conclusion
The project’s high classification accuracy showed that a song’s popularity is reasonably predictable from its audio features, which suggests that the parameters of commercial success are not random. The models’ effectiveness is also consistent with a broader pattern: popular music appears to be becoming less variable over time, with successful tracks clustering around specific, identifiable sonic profiles. Furthermore, a closer look at high-performing artists such as Michael Jackson showed that the features of their hits (e.g., energy, danceability, and loudness) often closely follow or define the average values for all hits within a given decade, which suggests that even genre-defining artists operate within predictable statistical boundaries.
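One way to run this kind of comparison is to build a per-decade profile of hits (mean and spread of each feature) and express an artist's hits as a z-score against it. The sketch below uses only the standard library. Every row is made up for illustration, and the artist names and helper functions are hypothetical.

```python
from statistics import mean, pstdev

FEATURES = ("energy", "danceability")

# Made-up rows for illustration only; none of these values come from the project.
tracks = [
    {"decade": "1980s", "artist": "Artist A", "hit": 1, "energy": 0.80, "danceability": 0.78},
    {"decade": "1980s", "artist": "Artist A", "hit": 1, "energy": 0.74, "danceability": 0.82},
    {"decade": "1980s", "artist": "Artist B", "hit": 1, "energy": 0.60, "danceability": 0.65},
    {"decade": "1980s", "artist": "Artist C", "hit": 1, "energy": 0.90, "danceability": 0.70},
    {"decade": "1980s", "artist": "Artist D", "hit": 0, "energy": 0.30, "danceability": 0.40},
    {"decade": "2010s", "artist": "Artist E", "hit": 1, "energy": 0.70, "danceability": 0.72},
    {"decade": "2010s", "artist": "Artist F", "hit": 1, "energy": 0.68, "danceability": 0.74},
    {"decade": "2010s", "artist": "Artist G", "hit": 1, "energy": 0.72, "danceability": 0.70},
]


def hit_profile(rows, decade):
    """Mean and population standard deviation of each feature over a decade's hits."""
    hits = [r for r in rows if r["decade"] == decade and r["hit"] == 1]
    return {
        f: (mean([r[f] for r in hits]), pstdev([r[f] for r in hits]))
        for f in FEATURES
    }


def artist_z_scores(rows, artist, decade):
    """How far an artist's average hit sits from the decade's average hit,
    in units of that decade's standard deviation."""
    profile = hit_profile(rows, decade)
    own = [
        r for r in rows
        if r["artist"] == artist and r["decade"] == decade and r["hit"] == 1
    ]
    return {
        f: (mean([r[f] for r in own]) - profile[f][0]) / profile[f][1]
        for f in FEATURES
    }


z = artist_z_scores(tracks, "Artist A", "1980s")
spread_1980s = hit_profile(tracks, "1980s")["energy"][1]
spread_2010s = hit_profile(tracks, "2010s")["energy"][1]
print(z, spread_1980s, spread_2010s)
```

A z-score close to zero means the artist's hits sit near the decade average. Comparing the per-decade standard deviations is one simple way to check whether hits are becoming less variable over time.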
As a fun add-on, we built a program that takes a song link and predicts whether the song would have been a hit in different decades. You can try it out by downloading the source code from our GitHub.
Discussion
While the predictive accuracy was high, the project identified several avenues for future refinement. Methodologically, alternatives to Logistic Regression, such as Support Vector Machines (SVM) or ensemble methods, could capture more complex, non-linear relationships between audio features and success. Statistically, the study would benefit from reporting confidence intervals for correlation coefficients, to quantify the uncertainty in the relationship between each feature and the target variable. Finally, to improve computational efficiency and address potential multicollinearity, future work should investigate dimensionality reduction, for example Principal Component Analysis (PCA), to distill the extensive feature set into a smaller number of non-redundant components.
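As an illustration of the confidence-interval suggestion, the sketch below computes an approximate interval for a Pearson correlation using the Fisher z-transformation. It is one standard approach, not something the project implemented. The function name `correlation_ci` is hypothetical, and because the hit label is binary, the interval should be read as an approximation.

```python
import math
from statistics import NormalDist


def correlation_ci(r, n, confidence=0.95):
    """Approximate confidence interval for a Pearson correlation r
    estimated from n samples, via the Fisher z-transformation."""
    if n <= 3:
        raise ValueError("need more than 3 samples")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)                      # Fisher transform of r
    standard_error = 1.0 / math.sqrt(n - 3)
    critical = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    lower = math.tanh(z - critical * standard_error)
    upper = math.tanh(z + critical * standard_error)
    return lower, upper


# The same correlation is far more certain on a large sample than a small one.
large_sample = correlation_ci(0.3, 41000)
small_sample = correlation_ci(0.3, 50)
print(large_sample, small_sample)
```

With a dataset of roughly 41,000 tracks, intervals computed this way are narrow, so even weak correlations can be told apart from zero. Whether a correlation is large enough to matter in practice is a separate question.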