
Algorithmic prediction of music trends

Binanox

iGEM Leiden

2022

Creative Fields

Data Science,
Machine Learning,
Trend Forecasting

Responsibilities

Project Ideation,
Data Scraping,
Data Analysis

Location

Paris, FR

Year

2023


A microbial factory to make nanoparticles for cancer therapy

Keywords

music trends, algorithmic forecasting, feature extraction

Collaborators

Eva Koskova

As project manager of the team, I was responsible for creating a vision for the project and for facilitating communication between team members so that the project remained aligned with this vision. Frustrated by the overwhelming complexity of most projects’ science communication, I set the goal of making my project as accessible as possible. This vision was applied to all aspects, from experimental design to website design.

Vision

The core vision of this project was to move beyond subjective cultural analysis and apply a data science lens to the music industry, challenging the notion that commercial success is purely accidental or dictated solely by marketing spend. We aimed to establish a statistical foundation for predicting musical virality by answering one question: can a song’s destiny be reliably forecast using only its intrinsic sonic features? Treating the outcome as a binary classification problem (Hit or Not a Hit), we sought a model that not only predicts success but also gives transparent insight into the mechanisms of mass musical appeal, so that composers and producers get a data-backed understanding of the attributes that resonate most strongly with modern audiences.

Overview of the methodological approach: setting up the machine learning models on the available data and assessing how well they predict hits.

Methodology

The methodology centered on careful data preparation and machine learning implementation. We began by cleaning and preprocessing a large dataset (over 41,000 tracks) described by audio features such as danceability, energy, and loudness. After feature scaling, we trained and tuned several classification algorithms, notably Logistic Regression and K-Nearest Neighbors (KNN), to distinguish successful tracks from unsuccessful ones. We measured performance with standard metrics: precision, recall, and F1-score. To move the project beyond a black-box predictor, we added SHAP (SHapley Additive exPlanations) analysis, a technique that quantifies and visualizes how much each audio feature contributes to an individual prediction.
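The scaling, KNN, and metric steps described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the project's actual code: the three-feature toy tracks are synthetic, and the helper names (`standardize`, `knn_predict`, `precision_recall_f1`, `fake_track`) and the choice of k = 5 are assumptions made for the example.

```python
import math
import random

def standardize(train, test):
    """Scale features to zero mean and unit variance using training statistics only."""
    n_features = len(train[0])
    means = [sum(r[j] for r in train) / len(train) for j in range(n_features)]
    stds = []
    for j in range(n_features):
        var = sum((r[j] - means[j]) ** 2 for r in train) / len(train)
        stds.append(math.sqrt(var) or 1.0)  # avoid dividing by zero for constant features

    def scale(rows):
        return [[(r[j] - means[j]) / stds[j] for j in range(n_features)] for r in rows]

    return scale(train), scale(test)

def knn_predict(train_x, train_y, query, k=5):
    """Majority vote among the k nearest training tracks (Euclidean distance)."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query))[:k]
    hit_votes = sum(label for _, label in nearest)
    return 1 if 2 * hit_votes > k else 0

def precision_recall_f1(y_true, y_pred):
    """Metrics for the positive class (1 = hit)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# SYNTHETIC toy data (not the project's dataset): danceability, energy, loudness in dB.
rng = random.Random(0)

def fake_track(is_hit):
    centre = (0.7, 0.7, -6.0) if is_hit else (0.5, 0.5, -10.0)
    spread = (0.1, 0.1, 2.0)
    return [rng.gauss(c, s) for c, s in zip(centre, spread)]

labels = [1] * 200 + [0] * 200
rng.shuffle(labels)
tracks = [fake_track(y) for y in labels]

# Hold out the last 100 tracks; scaling statistics come from the training part only.
train_x, test_x = standardize(tracks[:300], tracks[300:])
train_y, test_y = labels[:300], labels[300:]

predictions = [knn_predict(train_x, train_y, q) for q in test_x]
precision, recall, f1 = precision_recall_f1(test_y, predictions)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Scaling matters here because loudness (in dB) spans a much wider numeric range than danceability or energy; without it, the distance used by KNN would be dominated by loudness alone.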

Analysing different features in order to perform PCA

Interpretability

The modeling phase yielded solid classification results, supporting the viability of predicting commercial success from extracted audio features. However, the most significant outcome was the interpretability provided by the SHAP analysis. This technique allowed us to identify which specific audio characteristics (such as a higher danceability score or a specific range of energy) were the most powerful positive or negative predictors of a hit song. These findings amount to a data-driven “sonic blueprint,” offering tangible evidence that success is not random. By showing feature importance, the project ended with a transparent, explainable model that exposes the statistical patterns behind popularity in the contemporary music landscape.
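To make the idea behind SHAP concrete, the sketch below computes exact Shapley values by brute force for a tiny logistic model. It does not use the `shap` library and is not the project's analysis: the weights, the bias, the example track, and the all-zero baseline (the mean of standardized features) are invented for illustration. Features outside a coalition are replaced by their baseline value, which is one common way to define a "missing" feature.

```python
import math
from itertools import combinations

FEATURES = ["danceability", "energy", "loudness"]

# HYPOTHETICAL logistic model: weights and bias are made up for this example.
WEIGHTS = [1.2, 0.8, 0.5]
BIAS = -0.3

def predict_proba(x):
    """Probability of 'hit' under the toy logistic model."""
    z = BIAS + sum(w * v for w, v in zip(WEIGHTS, x))
    return 1.0 / (1.0 + math.exp(-z))

def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating every coalition (feasible only for few features)."""
    n = len(x)

    def value(coalition):
        # Features in the coalition take the track's value, the rest take the baseline.
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = math.factorial(size) * math.factorial(n - size - 1) / math.factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

track = [1.0, 0.5, -0.2]      # standardized feature values of one made-up track
baseline = [0.0, 0.0, 0.0]    # the "average track" after standardization

phi = shapley_values(predict_proba, track, baseline)
for name, contribution in zip(FEATURES, phi):
    print(f"{name}: {contribution:+.4f}")
print("sum of contributions:", sum(phi))
print("prediction - baseline:", predict_proba(track) - predict_proba(baseline))
```

The last two printed numbers agree: the contributions always add up to the gap between the track's prediction and the baseline prediction, which is what lets each feature's share be read as a signed push toward or away from "hit".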

Heatmap of prediction accuracy across decades: each cell shows how accurately a model trained on one decade’s trends classifies hits from another decade. For example, the model trained on 1980s data predicts 80s hits with 76% accuracy, 60s hits with 67%, and 2010s hits with 66%, making it the most flexible of the decade models.
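A train-on-one-decade, test-on-another matrix like the one in the heatmap can be assembled as follows. This is a sketch under stated assumptions: the data are synthetic, a nearest-centroid classifier stands in for the project's models, and the names `fit`, `predict`, `cross_decade_accuracy`, and `fake_decade` are hypothetical.

```python
import math
import random

def centroid(rows):
    """Mean feature vector of a list of tracks."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def fit(X, y):
    """Nearest-centroid stand-in classifier: one mean vector per class."""
    hits = [x for x, label in zip(X, y) if label == 1]
    flops = [x for x, label in zip(X, y) if label == 0]
    return centroid(hits), centroid(flops)

def predict(model, x):
    hit_centre, flop_centre = model
    return 1 if math.dist(x, hit_centre) < math.dist(x, flop_centre) else 0

def accuracy(model, X, y):
    return sum(predict(model, x) == label for x, label in zip(X, y)) / len(y)

def cross_decade_accuracy(splits):
    """matrix[trained_on][tested_on] = accuracy on the held-out part of tested_on."""
    models = {d: fit(s["X_train"], s["y_train"]) for d, s in splits.items()}
    return {
        trained_on: {
            tested_on: accuracy(models[trained_on], s["X_test"], s["y_test"])
            for tested_on, s in splits.items()
        }
        for trained_on in splits
    }

# SYNTHETIC decades: 'shift' moves both classes to mimic changing trends.
rng = random.Random(0)

def fake_decade(shift, n=200):
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss((0.65 if label else 0.45) + shift, 0.1) for _ in range(2)] for label in y]
    cut = n * 3 // 4  # 75% train, 25% held out
    return {"X_train": X[:cut], "y_train": y[:cut], "X_test": X[cut:], "y_test": y[cut:]}

splits = {"1960s": fake_decade(0.0), "1980s": fake_decade(0.1), "2010s": fake_decade(0.2)}
matrix = cross_decade_accuracy(splits)
for trained_on, row in matrix.items():
    print(trained_on, {tested_on: round(acc, 2) for tested_on, acc in row.items()})
```

Each decade keeps its own held-out split, so the diagonal cells are not inflated by testing a model on the tracks it was trained on.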

SHAP value analysis of the Loudness feature. A simple visualisation of the Loudness Wars, the phenomenon in which loudness has progressively become a crucial feature in determining whether a song will be popular.

Conclusion

The project’s classification accuracy showed that popular music is reasonably predictable from its fundamental audio features, which suggests that the parameters for commercial success are not random. The models’ effectiveness is also consistent with a broader hypothesis: that the space of popular music is becoming less variable over time, with successful tracks clustering tightly around specific, identifiable sonic profiles. Furthermore, a granular analysis of high-performing artists, such as Michael Jackson, showed that the features of their hits (e.g., energy, danceability, and loudness) often closely follow or define the average metrics for all hits within a given decade. Even genre-defining artists, in other words, operate within predictable statistical boundaries.

Image gallery: mj1, mj2, mj3

As a fun add-on, we built a program that takes a song link and predicts whether the song would have been a hit in different decades. You can try it out by downloading the source code from our GitHub.

Discussion

While the predictive accuracy was high, the project identified several avenues for future refinement. Methodologically, exploring alternatives to Logistic Regression, such as Support Vector Machines (SVM) or Ensemble methods, could potentially capture more complex, non-linear relationships between audio features and success. Statistically, the study would benefit from incorporating confidence intervals for correlation coefficients to better quantify the certainty around the relationship between specific features and the target variable. Finally, to optimize computational efficiency and address potential multicollinearity, future work should investigate techniques for reducing dimensionality, perhaps using Principal Component Analysis (PCA) to distill the essential, non-redundant predictive components from the extensive feature set.
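One standard way to obtain the confidence intervals mentioned above is the Fisher z-transformation of the Pearson correlation. The sketch below is an illustration, not part of the project: the function names and the example values (r = 0.3, n = 103) are assumptions, and the interval is only approximate, especially when one variable is a binary hit label.

```python
import math
from statistics import NormalDist

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def fisher_ci(r, n, confidence=0.95):
    """Approximate confidence interval for a correlation r estimated from n samples (n > 3)."""
    z = math.atanh(r)                    # Fisher transform: roughly normal in z-space
    se = 1.0 / math.sqrt(n - 3)          # standard error of z
    crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

# Made-up example: a feature correlates with the hit label at r = 0.3 over 103 tracks.
lower, upper = fisher_ci(0.3, 103)
print(f"r = 0.30, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

With a dataset of tens of thousands of tracks the interval becomes very narrow, so it would be most informative for per-decade or per-artist subsets, where sample sizes are much smaller.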