Project 3: Workout Song Classification
Why it matters
Listening to music on YouTube has become an integral part of my life, yet one challenge persists: the music I prefer during workouts differs from my regular listening choices. The YouTube recommendation system does not recognize this difference, so I end up switching songs frequently between my workout sets.
To address this, I decided to create my own music classifier. With it, I expect the distraction of changing songs to diminish, allowing me to focus more on my workout routine and use the time saved on song selection more effectively.
Overview
- Labeled 1300+ songs into 3 categories and collected audio feature data using the Spotify API in Python
- Applied feature engineering and Principal Component Analysis (PCA) to create a dataset of 114 features
- Achieved a weighted F1 score of 0.68 using a logistic regression model
- Created a workflow that adds new songs to a sqlite3 database and automatically updates a YouTube playlist using the YouTube Data API
Data Collection
First, I labeled 1300+ songs on my Spotify account into 3 categories: Workout, Nonworkout, and Dislike.

Then, using the Spotify API, I collected tracks’ audio features and created a dataframe.
danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | type | id | uri | track_href | analysis_url | duration_ms | time_signature | result |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.644 | 0.414 | 1 | -6.723 | 1 | 0.0419 | 0.64 | 0 | 0.11 | 0.273 | 132.117 | audio_features | 2H1kS1ZOSTTPAFYUKjRiGo | spotify:track:2H1kS1ZOSTTPAFYUKjRiGo | https://api.spotify.com/v1/tracks/2H1kS1ZOSTTPAFYUKjRiGo | https://api.spotify.com/v1/audio-analysis/2H1kS1ZOSTTPAFYUKjRiGo | 234875 | 4 | Nonworkout |
0.773 | 0.628 | 8 | -5.095 | 1 | 0.145 | 0.0543 | 1.2e-05 | 0.0725 | 0.42 | 77.502 | audio_features | 3K7WdPYz7vcHMCsyBjK9vL | spotify:track:3K7WdPYz7vcHMCsyBjK9vL | https://api.spotify.com/v1/tracks/3K7WdPYz7vcHMCsyBjK9vL | https://api.spotify.com/v1/audio-analysis/3K7WdPYz7vcHMCsyBjK9vL | 177653 | 4 | Workout |
0.516 | 0.768 | 9 | -4.964 | 1 | 0.0362 | 0.00852 | 8.49e-06 | 0.136 | 0.204 | 115.005 | audio_features | 46bkeaB7DA45q7PdKWLFkR | spotify:track:46bkeaB7DA45q7PdKWLFkR | https://api.spotify.com/v1/tracks/46bkeaB7DA45q7PdKWLFkR | https://api.spotify.com/v1/audio-analysis/46bkeaB7DA45q7PdKWLFkR | 241427 | 4 | Workout |
0.77 | 0.54 | 1 | -9.087 | 1 | 0.0325 | 0.0347 | 1.49e-05 | 0.0326 | 0.804 | 89.989 | audio_features | 7yLtWtDPEC1zZpvNpbE4UA | spotify:track:7yLtWtDPEC1zZpvNpbE4UA | https://api.spotify.com/v1/tracks/7yLtWtDPEC1zZpvNpbE4UA | https://api.spotify.com/v1/audio-analysis/7yLtWtDPEC1zZpvNpbE4UA | 214000 | 4 | Workout |
0.425 | 0.638 | 1 | -3.184 | 0 | 0.0759 | 0.426 | 0 | 0.177 | 0.45 | 81.396 | audio_features | 3590AAEoqH50z4UmhMIY85 | spotify:track:3590AAEoqH50z4UmhMIY85 | https://api.spotify.com/v1/tracks/3590AAEoqH50z4UmhMIY85 | https://api.spotify.com/v1/audio-analysis/3590AAEoqH50z4UmhMIY85 | 230667 | 4 | Dislike |
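The collection step can be reproduced with the spotipy client. Below is a minimal sketch under the assumption that each labeled category lives in its own Spotify playlist; the playlist IDs and credentials (read from environment variables by `SpotifyOAuth`) are placeholders:

```python
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyOAuth

# SpotifyOAuth reads SPOTIPY_CLIENT_ID / SECRET / REDIRECT_URI from the environment.
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(scope='playlist-read-private'))

# Placeholder playlist IDs, one per labeled category.
playlists = {'Workout': 'WORKOUT_PLAYLIST_ID',
             'Nonworkout': 'NONWORKOUT_PLAYLIST_ID',
             'Dislike': 'DISLIKE_PLAYLIST_ID'}

rows = []
for label, playlist_id in playlists.items():
    # playlist_items returns 100 tracks per page; pagination omitted for brevity.
    items = sp.playlist_items(playlist_id)['items']
    track_ids = [item['track']['id'] for item in items]
    # audio_features accepts up to 100 track ids per call.
    for features in sp.audio_features(track_ids):
        features['result'] = label
        rows.append(features)

df = pd.DataFrame(rows)
```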
Visualizations
Imbalanced target distribution
Observations
- Workout has the most songs, whereas dislike has the least.
- One reason is that I'm more familiar with songs I enjoy, so Dislike ended up with the fewest labeled songs.
- Need to take this imbalance into account when predicting, to avoid classifying every song into the dominant class (a quick check of the class counts is sketched below).
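A quick way to quantify the imbalance, assuming the labeled tracks are in a DataFrame `df` with the `result` column shown above:

```python
# Count and proportion of songs per class.
print(df['result'].value_counts())
print(df['result'].value_counts(normalize=True).round(3))
```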
Single variable distribution
Observations
- The distinction between Workout and Nonworkout songs is apparent.
- It's hard to identify Dislike songs since their distributions overlap with both the Workout and Nonworkout distributions.
Two-variable distribution
Observations
- The distinction between Workout and Nonworkout becomes more obvious than before.
- Workout songs tend to have higher energy and valence than Nonworkout songs (a plotting sketch follows this list).
- Since Dislike songs still overlap with the other two classes, I collected the tracks' audio analysis data and applied feature engineering to find more meaningful features.
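A sketch of how such a two-variable view could be drawn with seaborn, assuming the same `df` as above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Energy vs. valence, colored by class, to inspect the separation.
sns.scatterplot(data=df, x='valence', y='energy', hue='result', alpha=0.6)
plt.title('Energy vs. valence by class')
plt.show()
```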
Feature Engineering
A track's audio analysis data has 7 keys: `meta`, `track`, `bars`, `beats`, `sections`, `segments`, and `tatums`. Among these, I'm interested in `segments`.
According to the Spotify API documentation, a track is divided into `segments`, each of which contains a roughly consistent sound. Each segment holds information such as `start`, `duration`, `loudness_start`, `loudness_max`, `pitches`, and `timbre`. `pitches` is an array of 12 numbers, one per pitch class, with values ranging from 0 to 1 that describe each pitch class's relative dominance. `timbre` is also an array of 12 numbers, but each number represents a different quality of sound and is unbounded, centered around 0.
I first acquired descriptive statistics (mean, standard deviation, median, skewness, kurtosis, range, interquartile range, relative min, relative max) of loudness, pitches, and timbre across each track's `segments`. Then, after dividing the dataset into train and test sets, I calculated the mean of each of the 12 timbre values per target class on the training set and used the cosine similarity between each track's mean timbre vector and those class means as additional features.
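A minimal sketch of how the segment-level statistics could be computed; the helper name `segment_features` is mine, and it assumes the segments were fetched with `sp.audio_analysis(track_id)['segments']` (the relative min/max and cosine-similarity features are omitted for brevity):

```python
import numpy as np
from scipy import stats

def segment_features(segments):
    """Descriptive statistics of loudness, pitches, and timbre across a track's segments."""
    loudness = np.array([s['loudness_max'] for s in segments])[:, None]  # (n, 1)
    pitches = np.array([s['pitches'] for s in segments])                 # (n, 12)
    timbre = np.array([s['timbre'] for s in segments])                   # (n, 12)

    features = {}
    for name, values in [('loudness', loudness), ('pitch', pitches), ('timbre', timbre)]:
        for i in range(values.shape[1]):
            col = values[:, i]
            prefix = name if values.shape[1] == 1 else f'{name}_{i}'
            features[f'{prefix}_mean'] = col.mean()
            features[f'{prefix}_std'] = col.std()
            features[f'{prefix}_median'] = np.median(col)
            features[f'{prefix}_skew'] = stats.skew(col)
            features[f'{prefix}_kurtosis'] = stats.kurtosis(col)
            features[f'{prefix}_range'] = col.max() - col.min()
            features[f'{prefix}_iqr'] = stats.iqr(col)
    return features
```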
The resulting dataset has 331 features, with 900+ rows in the training set. To avoid the curse of dimensionality, I applied PCA for dimensionality reduction, keeping enough components to explain 95% of the variance, which left 114 features.
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

# Fit the scaler on the training set only, then reuse it for the other sets
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_scaled = scaler.transform(X)

# Include components up to 95% variability
pca = PCA(n_components=0.95)
pca.fit(X_train_scaled)
X_train_pca = pca.transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
X_pca = pca.transform(X_scaled)
print(f'Num of columns after PCA: {X_train_pca.shape[1]}')
>>> Num of columns after PCA: 114
Prediction
With the dataset of 114 features, I tried several models, including an XGBoost classifier and a support vector classifier. However, those models either fit too slowly or were too complex for this project, which led to overfitting. Thus, I chose logistic regression, a simpler and faster algorithm than the others.
Another reason for choosing logistic regression is its precision on the Workout class. While the logistic regression model has lower accuracy than the other models, it has a higher precision score for the Workout class, which is the primary goal of this project.
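A hedged sketch of how this comparison could be run, scoring cross-validated precision on the Workout class only; the candidate settings below are illustrative and XGBoost is omitted for brevity:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, precision_score

# Scorer that looks only at precision for the Workout class.
workout_precision = make_scorer(precision_score, labels=['Workout'],
                                average='macro', zero_division=0)

candidates = {
    'logistic_regression': LogisticRegression(solver='newton-cg',
                                              class_weight='balanced',
                                              max_iter=1000),
    'svc': SVC(class_weight='balanced'),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X_train_pca, y_train,
                             scoring=workout_precision, cv=5)
    print(f'{name}: mean Workout precision = {scores.mean():.3f}')
```

The chosen model and its evaluation on the test set: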
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, confusion_matrix

lr = LogisticRegression(solver='newton-cg', class_weight='balanced', max_iter=1000)
lr.fit(X_train_pca, y_train)
pred_test = lr.predict(X_test_pca)

# Calculating and printing the weighted f1 score
f1_test = f1_score(y_test, pred_test, average='weighted')
print('The f1 score for the testing data:', f1_test)

# Printing the confusion matrix
confusion_matrix(y_test, pred_test)
>>> The f1 score for the testing data: 0.6816972503451451
>>> array([[15, 15, 10],
[11, 66, 3],
[16, 31, 98]])
The high precision score for the Workout class means fewer non-workout songs end up in the recommended playlist. In the confusion matrix of the test data, the third column holds the songs predicted as Workout: 111 in total, of which 10 + 3 = 13 are actually Dislike or Nonworkout. So when I listen to 40 recommended songs during a workout, the expected number of skipped songs is about 4.685. $$\frac{10 + 3}{10 + 3 + 98} \cdot 40 \approx 4.685$$
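The same estimate can be read off the matrix programmatically (assuming the alphabetical class order Dislike, Nonworkout, Workout that scikit-learn uses by default):

```python
import numpy as np

cm = np.array([[15, 15, 10],
               [11, 66,  3],
               [16, 31, 98]])

predicted_workout = cm[:, 2]                              # third column: predicted Workout
precision_workout = cm[2, 2] / predicted_workout.sum()    # 98 / 111
expected_skips = (1 - precision_workout) * 40
print(f'Workout precision: {precision_workout:.3f}')          # ~0.883
print(f'Expected skips per 40 songs: {expected_skips:.3f}')   # ~4.685
```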
To confirm this, I recorded how many songs I skip using the new classification system.
Application
Using the YouTube Data API v3, I automatically update my YouTube workout playlist with 40 recommended songs for each workout.


When I skip a song, I manually remove it from the playlist; during the next update, the code detects the removed songs and excludes them from future recommendations.
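A condensed sketch of that update step, assuming a sqlite3 table `songs(video_id, label, skipped, recommended)` and an authenticated client built with `googleapiclient.discovery.build('youtube', 'v3', ...)`; the table schema, database file, and playlist ID are placeholders of my own rather than the exact implementation:

```python
import random
import sqlite3

def update_playlist(youtube, playlist_id, n_songs=40):
    """Mark removed songs as skipped, then refill the playlist with a
    random batch of Workout songs that have never been skipped."""
    conn = sqlite3.connect('songs.db')
    cur = conn.cursor()

    # 1. See what is still in the playlist (40 songs fit in a single page).
    response = youtube.playlistItems().list(
        part='snippet', playlistId=playlist_id, maxResults=50).execute()
    remaining_videos = set()
    playlist_item_ids = []
    for item in response['items']:
        remaining_videos.add(item['snippet']['resourceId']['videoId'])
        playlist_item_ids.append(item['id'])

    # 2. Songs recommended last time but no longer present were skipped.
    cur.execute("SELECT video_id FROM songs WHERE recommended = 1")
    for (video_id,) in cur.fetchall():
        if video_id not in remaining_videos:
            cur.execute("UPDATE songs SET skipped = 1 WHERE video_id = ?",
                        (video_id,))
    cur.execute("UPDATE songs SET recommended = 0")

    # 3. Empty the playlist and add a fresh random batch of Workout songs.
    for item_id in playlist_item_ids:
        youtube.playlistItems().delete(id=item_id).execute()
    cur.execute("SELECT video_id FROM songs "
                "WHERE label = 'Workout' AND skipped = 0")
    pool = [row[0] for row in cur.fetchall()]
    for video_id in random.sample(pool, min(n_songs, len(pool))):
        youtube.playlistItems().insert(
            part='snippet',
            body={'snippet': {'playlistId': playlist_id,
                              'resourceId': {'kind': 'youtube#video',
                                             'videoId': video_id}}}).execute()
        cur.execute("UPDATE songs SET recommended = 1 WHERE video_id = ?",
                    (video_id,))
    conn.commit()
    conn.close()
```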
Result
The following table records the skipped songs during workouts when relying on the YouTube algorithm (`exercise_time` is in minutes, and `skipped_songs_per_min` is `skipped_songs` divided by `exercise_time`),
day | target | songs_listened_to | skipped_songs | time_spent | exercise_time | num_songs_till_skip | skipped_songs_per_min |
---|---|---|---|---|---|---|---|
2023/6/11 | back | 33 | 13 | 30,3,18,6,10,17,19,17,24,10,35,10,13 | 121 | 2.53846 | 0.107438 |
2023/6/12 | leg | 24 | 11 | 30,12,3,6,6,18,6,16,6,14,17 | 82 | 2.18182 | 0.134146 |
2023/6/13 | back | 27 | 12 | 18,20,17,18,12,5,8,7,20,3,2,21 | 101 | 2.25 | 0.118812 |
2023/6/14 | chest | 29 | 15 | 16,7,20,8,20,13,10,6,9,27,9,4,6,10,4 | 110 | 1.93333 | 0.136364 |
2023/6/15 | back | 28 | 18 | 25,8,9,3,13,4,5,7,12,5,24,19,4,2,9,12,9,8 | 107 | 1.55556 | 0.168224 |
whereas this table contains results using the logistic regression model.
day | target | songs_listened_to | skipped_songs | exercise_time | num_songs_till_skip | skipped_songs_per_min |
---|---|---|---|---|---|---|
2023/8/19 | chest | 31 | 4 | 121 | 7.75 | 0.0330579 |
2023/8/21 | back | 35 | 6 | 121 | 5.83333 | 0.0495868 |
2023/8/22 | leg | 23 | 1 | 102 | 23 | 0.00980392 |
2023/8/23 | chest | 31 | 2 | 113 | 15.5 | 0.0176991 |
2023/8/24 | back | 31 | 1 | 121 | 31 | 0.00826446 |
As you can see in the `skipped_songs` columns of the two tables, my personal classification algorithm does a better job of recommending workout songs. However, comparing the raw values isn't a fair approach, since the number of skipped songs depends heavily on how long I exercise. Thus, instead of the raw values of `skipped_songs`, I plotted the distributions of `skipped_songs_per_min`.
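A sketch of that plot, assuming the two tables are loaded as DataFrames named `wo_rec` (YouTube algorithm, as in the code below) and `with_rec` (personal classifier):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Overlay the two distributions of skipped songs per minute of exercise.
sns.kdeplot(wo_rec['skipped_songs_per_min'], fill=True, label='YouTube algorithm')
sns.kdeplot(with_rec['skipped_songs_per_min'], fill=True, label='Personal algorithm')
plt.xlabel('skipped_songs_per_min')
plt.legend()
plt.show()
```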
As you can see, the new algorithm's distribution sits at lower values of `skipped_songs_per_min`, which means it does a better job of recommending workout songs than the original method. But by how much? To compare the two distributions with a single value, I calculated the mean `skipped_songs_per_min` of each distribution.
print('Youtube algorithm: ', wo_rec['skipped_songs_per_min'].mean())
>>> Youtube algorithm: 0.10841203784175568
print('Personal algorithm: ', with_rec['skipped_songs_per_min'].mean())
>>> Personal algorithm: 0.024278991866002086
When comparing the two values, I get the following.
$$\frac{0.108412}{0.024279} \approx 4.4653$$
Therefore, the new algorithm is about 4.5 times better at filtering songs I don’t listen to when I exercise.
Besides these quantitative results, I believe the workflow I built is better because it provides more diverse playlists. When I used the YouTube algorithm, it primarily recommended songs I had listened to recently, so the recommended songs did not differ much from day to day, which is one reason the number of skipped songs was high. My personal recommendation workflow, however, randomly selects songs from all songs classified into the Workout class, and I can add any new song to the database. These two features lead to more diverse playlists and likely contributed to reducing the number of `skipped_songs`.