A few weeks ago, I asked myself: how much do my emotions shape the lyrics of the songs I listen to? Sometimes I want to hear songs with a similar meaning, and I’d like to build such a playlist automatically---especially since my Spotify library has around 911 songs.
I also wondered: do I tend to like songs with more depressive lyrics, or maybe songs with more positive ones?
That’s when my investigative side kicked in. I decided to build a small project to help me answer these questions, so I began by researching its feasibility. It turns out Spotify is very developer-friendly, which is an immediate advantage---but song lyrics are often restricted due to copyright, and there’s no API you can simply tap into and call it a day. It looked like I’d hit a wall, but… what if, instead of using the lyrics directly, I used an LLM to extract keywords based on the title and artist? I tested it with a manageable sample, and it worked. GPT-4o mini proved surprisingly accurate at returning relevant keywords for songs---including modern or obscure ones, and even anime openings in different languages.
With that figured out, I started sketching my plan and ended up creating this little tutorial that combines sentiment analysis and clustering.
First Steps
First, let’s note the libraries we’ll be using:
- DotEnv to manage environment variables within .env
- Pandas to handle data as a dataframe for easier analysis
- Scikit-Learn, NumPy, etc., for data processing and analysis
- OpenAI for using the LLM
- Spotipy for connecting with Spotify’s API
- Flask/FastAPI---pick your favorite to expose a small web interface and handle the Spotipy callback
You can install everything I used with a single command:
pip install python-dotenv pandas transformers torch sentence-transformers scikit-learn numpy openai spotipy flask uvicorn fastapi watchfiles
You’ll also need a Spotify developer account, where you can create an
application and set up an endpoint to receive the callback. (If you need
more details, check Spotify’s official
documentation).
In my case, I handled the endpoint using Flask and exposed the URL
through Vercel with a personal domain, which relayed the callback to my
local URL running on port 8080. Here’s the example code I used to spin
up a small Next.js API route on Vercel:
import type { NextApiRequest, NextApiResponse } from "next";

export default async function handler(req: NextApiRequest, res: NextApiResponse) {
  const { code, state, error } = req.query;

  // Spotify reports a denied or failed authorization via the `error` param
  if (error) return res.status(400).send(`Authorization failed: ${error}`);
  if (!code) return res.status(400).send("Missing 'code' param.");

  // Configure your local receiver (defaults to 127.0.0.1:8080)
  const base = process.env.LOCAL_CALLBACK_BASE || "http://127.0.0.1:8080";

  // Relay the authorization code (and state, if present) to the local callback
  const url = new URL("/callback", base);
  url.searchParams.set("code", String(code));
  if (state) url.searchParams.set("state", String(state));

  res.writeHead(302, { Location: url.toString() }).end();
}
You can reuse this code and deploy it to Vercel via this GitHub repository.
LLM with OpenAI
For this project, I used GPT-4o mini as the LLM. You’ll need an API key, which you can create in OpenAI’s developer dashboard (you must register and add credits to your account).
Requirements Summary
Before moving on, make sure you have:
1. The Client ID of your Spotify app
2. A valid endpoint for the app’s authentication callback
3. An OpenAI API key
Overview
The project is already complete, and you can find the full code at the
provided link. Below, I’ll summarize each part. The project comes in two
versions: one built in a Jupyter notebook, and another as a full Python
pipeline (inside the /pipeline folder). This blog focuses on the
Jupyter version for simplicity, but both are identical in functionality.
Both versions use SQLite as the database, although there’s also a MySQL fork that’s easily adaptable. You can, of course, switch to your preferred DB engine.
Step 1: Environment Variables
Before running anything, create a .env file with the following
variables:
SPOTIFY_CLIENT_ID="YOUR SPOTIFY APP ID"
SPOTIPY_REDIRECT_URI="YOUR REDIRECT URL"
SPOTIPY_CACHE_PATH="CACHE FILE LOCATION"
OPENAI_API_KEY="YOUR OPENAI API KEY"
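For reference, here’s a minimal sketch of how these variables might be wired up. I’m assuming a PKCE authorization flow here, since only a client ID (no secret) is stored; the repo’s actual setup may differ slightly:

import os
import spotipy
from dotenv import load_dotenv
from spotipy.oauth2 import SpotifyPKCE
from spotipy.cache_handler import CacheFileHandler

load_dotenv()  # read the .env file into the process environment

# PKCE is assumed because only a client ID (no secret) is configured
auth_manager = SpotifyPKCE(
    client_id=os.environ["SPOTIFY_CLIENT_ID"],
    redirect_uri=os.environ["SPOTIPY_REDIRECT_URI"],
    scope="user-library-read",
    cache_handler=CacheFileHandler(cache_path=os.environ["SPOTIPY_CACHE_PATH"]),
)
sp = spotipy.Spotify(auth_manager=auth_manager)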
Step 2: Data Extraction
Run the following cell:
from pipeline import data_extraction
data_extraction.main()
This will:
1. Load Spotify authentication using Spotipy
2. Receive your app’s callback
3. Upon authorization, import your Spotify song data into an SQLite database across several tables
When finished, your songs will be stored in the database for further analysis.
You can inspect the tables with any SQLite viewer, or query them
directly within your code. Essentially, this step retrieves your tracks
via the get_spotify_client function and stores the track, artist, and
album information.
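As an illustration (not the repo’s exact schema or code), paging through the saved-tracks endpoint and writing the basics to SQLite looks roughly like this, where sp is an authenticated Spotipy client:

import sqlite3

def import_saved_tracks(sp, db_path="spotify.db"):
    """Page through the user's saved tracks and store the basics in SQLite."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS tracks ("
        "track_spotify_id TEXT PRIMARY KEY, title TEXT, "
        "artist_name TEXT, album_name TEXT)"
    )
    offset = 0
    while True:
        page = sp.current_user_saved_tracks(limit=50, offset=offset)
        for item in page["items"]:
            t = item["track"]
            con.execute(
                "INSERT OR REPLACE INTO tracks VALUES (?, ?, ?, ?)",
                (t["id"], t["name"], t["artists"][0]["name"], t["album"]["name"]),
            )
        if page["next"] is None:  # no more pages to fetch
            break
        offset += 50
    con.commit()
    con.close()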
Step 3: Extracting Keywords
Run the following cell:
import extract_keywords.extract_from_title_artist as keywords
keywords.extract_keywords_from_title_artist()
This function uses a custom LLM prompt to generate 20 keywords for each song. I chose 20 for a more diverse analysis, but you can adjust as needed.
As for cost, it depends on how many tracks you have. For my 911-song library, it cost roughly $0.40 to extract all keywords.
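To give a feel for the approach, here’s a stripped-down sketch of such a call using the official openai client; the prompt, helper name, and settings are illustrative, not the project’s actual code:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def keywords_for(title: str, artist: str, n: int = 20) -> list[str]:
    """Ask the model for n thematic keywords inferred from title and artist."""
    prompt = (
        f"List {n} comma-separated keywords describing the themes and mood "
        f"of the lyrics of '{title}' by {artist}. Reply with keywords only."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return [k.strip() for k in resp.choices[0].message.content.split(",")]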
Step 4: Sentiment Analysis
Exploratory Analysis
Here we perform some exploratory data analysis (EDA) to better
understand the dataset. We load the tracks into a dataframe containing
only the track_spotify_id, artist_name, title, and keywords.
You’ll likely see a wide range of keywords, including many near-duplicate variants.
We’ll normalize these keywords using embeddings: first, identify unique
words, then cluster them to create a dictionary that standardizes all
keyword variations.
After normalization, you’ll get a cleaner, easier-to-analyze dataframe.
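Conceptually, the normalization works along these lines (a sketch, not the notebook’s exact code): embed every unique keyword, cluster the embeddings, and map each variant to the most frequent member of its cluster:

from collections import Counter
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

def build_keyword_dictionary(all_keywords: list[str], threshold: float = 0.3) -> dict[str, str]:
    """Map each keyword variant to a canonical form chosen per cluster."""
    counts = Counter(k.strip().lower() for k in all_keywords)
    unique = list(counts)
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(unique, normalize_embeddings=True)
    # Cosine-distance clustering merges near-duplicates ("sad", "sadness")
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=threshold,
        metric="cosine", linkage="average",
    ).fit_predict(emb)
    mapping = {}
    for label in set(labels):
        members = [w for w, l in zip(unique, labels) if l == label]
        canonical = max(members, key=counts.get)  # most frequent variant wins
        mapping.update({m: canonical for m in members})
    return mapping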
Sentiment Analysis
Running all the cells under Sentiment Analysis will produce a
sentiment score for each song based on its keywords, stored in the
emotions field of the database.
The number of emotions depends on the model used; in this case, I used
joeddav/distilbert-base-uncased-go-emotions-student, which provides
28 emotion types.
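Under the hood this amounts to running a text-classification pipeline over each song’s keywords. A minimal sketch with the transformers pipeline API (the keyword string is illustrative; the notebook stores the full score vector in the emotions field):

from transformers import pipeline

# top_k=None returns a score for each of the 28 GoEmotions labels
classifier = pipeline(
    "text-classification",
    model="joeddav/distilbert-base-uncased-go-emotions-student",
    top_k=None,
)

scores = classifier("nostalgia, heartbreak, rain, longing, memories")[0]
top3 = sorted(scores, key=lambda s: s["score"], reverse=True)[:3]
print(top3)  # the three strongest emotions for this keyword set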
The analysis also generates an emotion table and dictionary. Using the LLM, we can categorize emotions into five broader groups:
- FP: Very Positive
- MP: Positive
- N: Neutral
- MN: Negative
- FN: Very Negative
This grouping simplifies reporting and visualization.
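In code, the grouping is just a lookup table from each GoEmotions label to one of the five buckets. An abbreviated, illustrative excerpt (the actual LLM-generated dictionary covers all 28 emotions, and its assignments may differ):

EMOTION_GROUPS = {
    "joy": "FP", "love": "FP", "excitement": "FP",
    "optimism": "MP", "gratitude": "MP",
    "neutral": "N", "curiosity": "N",
    "disappointment": "MN", "annoyance": "MN",
    "grief": "FN", "anger": "FN", "sadness": "FN",
}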
Step 5: Clustering
We now create clusters to group songs into thematic sets. The clustering section will (see the sketch after this list):
- Normalize keywords per track (lowercase, trim, deduplicate)
- Vectorize each keyword set using a Sentence Transformer (all-MiniLM-L6-v2)
- Average embeddings per song to get a single representative vector
- Cluster vectors with KMeans, measuring quality with the Silhouette Score
- Annotate the dataframe with a cluster_emb column
- Interpret clusters by listing top-N frequent keywords and representative “Artist --- Title” examples
- Assign a title and description to each cluster via the LLM
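Put together, the core of that loop looks roughly like this (a sketch under the same model assumptions; the helper name and k range are illustrative):

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_tracks(keyword_sets: list[list[str]], k_range=range(4, 16)):
    """Embed each track's keywords, average them, and pick k by silhouette."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    # One vector per song: the mean of its keyword embeddings
    vectors = np.array([
        model.encode(kws, normalize_embeddings=True).mean(axis=0)
        for kws in keyword_sets
    ])
    best_k, best_score, best_labels = None, -1.0, None
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(vectors)
        score = silhouette_score(vectors, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels  # best_labels becomes the cluster_emb column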
Step 6: Data Visualization
Once everything else is done, you can visualize the results. The project includes a Vue app that lets you both view the clusters and generate playlists from them. Running the final cell launches the interface.
Conclusions
This project confirmed something I already suspected: my emotions deeply influence what I listen to. My library of 911 songs isn’t random---it’s a clear reflection of my moods. Through sentiment analysis of the keywords, I found patterns between positive, neutral, and depressive lyrics that matched my listening habits.
I also learned that copyright restrictions weren’t an insurmountable wall. Even without an API providing full lyrics, using an LLM to extract keywords from titles and artists turned out to be practical, inexpensive, and remarkably accurate. For well under a dollar, I built the foundation for the entire analysis.
The pipeline I built ended up modular and reusable: data extraction, keyword generation, sentiment analysis, clustering, and visualization. Thanks to that structure, what began as a personal experiment can easily be adapted for other uses.
Clustering helped bring order to the chaos. Turning hundreds of scattered songs into thematic groups---with titles, descriptions, and representative examples---let me understand my own music better and build playlists that actually fit my moods.
Finally, visualization closed the loop. With the Vue app, I didn’t just analyze data---I built a tool that directly answers the initial question: do my musical tastes lean more toward the positive or the negative?
In the end, what started as curiosity turned into a prototype with real potential. I mixed data science, LLMs, and user experience to create something not just useful for me, but for anyone who wants to organize their music based on how they feel.