The Chorus Is the Part That Comes Back


Pulsar by Vangelis is a 12-minute track. The first seven minutes are slow, evolving, near-silence synthesis. Then it erupts.

Deep Cuts was telling me it sounded like AC/DC.

The bug was in how I'd been selecting audio windows for the CLAP embedding model: pick the three loudest 10-second excerpts in the track, run them through the neural net, average the result. Loud sections are salient, I'd reasoned. Salient is representative.

It isn't. Deep Cuts was throwing away the first half of Pulsar and embedding the second half — which, stripped of context, does share some acoustic properties with loud rock guitar. The embedding was technically correct. It was also completely wrong.

Fixing it properly turned into a much longer journey than I expected.


Smarter window selection

The waveform is already in the database: 128 RMS energy bins per track, computed at scan time. That's enough to see a track's shape.

The new algorithm first trims quiet tails — the 20th-percentile bin value defines the "body" of a track, and anything below it at the leading and trailing edges is excluded from window candidates. A track with a 90-second ambient intro stops having its windows dominated by the outro.
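The trimming step might look roughly like the sketch below. It is an illustration under stated assumptions, not the code from Deep Cuts: the function name `body_bounds`, the nearest-rank percentile and the strict `<` comparison are all guesses at details the post doesn't spell out.

```python
def body_bounds(bins, percentile=0.20):
    """Return (start, end) of the track body; end is exclusive.

    bins: RMS energy values, e.g. the 128 bins stored per track.
    """
    if not bins:
        return 0, 0
    ordered = sorted(bins)
    # Nearest-rank percentile (assumption: the exact method isn't stated).
    threshold = ordered[int(percentile * (len(ordered) - 1))]
    start = 0
    while start < len(bins) and bins[start] < threshold:
        start += 1  # skip quiet leading bins
    end = len(bins)
    while end > start and bins[end - 1] < threshold:
        end -= 1  # skip quiet trailing bins
    return start, end


# Fade-in over 10 bins, 110 loud bins, fade-out over 8 bins.
bins = [0.01 * i for i in range(10)] + [0.5] * 110 + [0.05, 0.04, 0.03, 0.02, 0.01, 0.0, 0.0, 0.0]
start, end = body_bounds(bins)
```

Only the edges are trimmed; a quiet passage in the middle of the track stays inside the body and remains a candidate.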

Then it measures the coefficient of variation (CV = σ/μ) of the body. High CV means the track is genuinely dynamic — Vangelis, jazz, classical. For these, the body is split into three energy terciles and one representative window is drawn from each: low, mid, high. The embedding now spans the full dynamic range. Low CV means the track is loud everywhere — brickwall-mastered, heavily compressed. For these, three windows are spaced evenly at 15%, 50%, and 85% of the body.

Pulsar now gets a window from its quiet evolution, its mid-section swell, and its loud finale.
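A minimal sketch of the two branches, assuming the body has already been trimmed. The `cv_threshold` value and the choice of the median-ranked bin as each tercile's "representative" are placeholders of mine; the post doesn't give either.

```python
import statistics


def coefficient_of_variation(values):
    """CV = population standard deviation divided by the mean."""
    mean = statistics.fmean(values)
    return statistics.pstdev(values) / mean if mean else 0.0


def window_centres(body, cv_threshold=0.5):
    """Return three indices into `body` to centre embedding windows on.

    cv_threshold is a made-up cut-off, not a value from Deep Cuts.
    """
    n = len(body)
    if n < 3:
        raise ValueError("need at least three bins")
    if coefficient_of_variation(body) < cv_threshold:
        # Evenly loud track: sample at 15%, 50% and 85% of the body.
        return [n * pct // 100 for pct in (15, 50, 85)]
    # Dynamic track: rank bins by energy, cut the ranking into terciles,
    # and take the median-ranked bin of each tercile (assumption).
    ranked = sorted(range(n), key=lambda i: body[i])
    third = n // 3
    terciles = [ranked[:third], ranked[third:2 * third], ranked[2 * third:]]
    return [group[len(group) // 2] for group in terciles]


flat = window_centres([0.5] * 100)
dynamic_body = [0.0] * 30 + [0.5] * 30 + [1.0] * 30
dynamic = window_centres(dynamic_body)
```

In the dynamic branch the indices come back ordered low, mid, high by energy, not chronologically.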

https://gist.github.com/robertolupi/1ee93ba69ac1d67627e1567b576c629f


SAX: 20-year-old timeseries tech that nobody's applying to music

While thinking about track structure, I remembered reading about SAX — Symbolic Aggregate approXimation, a technique from Lin et al., 2003, originally designed for sensor timeseries database indexing.

Three steps: divide the signal into equal segments and take segment means (PAA), z-normalise so amplitude doesn't dominate, then quantise each segment to a letter using Gaussian breakpoints. With a 5-letter alphabet — a through e, very quiet to very loud — you get a compact string that encodes the shape of the energy envelope.
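The three steps fit in a few lines. This sketch follows the order given above (PAA, then z-normalise); the original SAX paper normalises the raw series first. The 32-segment default is my inference from the 32-bytes-per-track figure, and `sax_word` is a hypothetical name.

```python
import bisect
import statistics
from statistics import NormalDist


def sax_word(bins, segments=32, alphabet="abcde"):
    """Encode an energy envelope as a SAX string, one letter per segment."""
    if len(bins) % segments:
        raise ValueError("bin count must be a multiple of the segment count")
    width = len(bins) // segments
    # 1. PAA: mean of each equal-width segment.
    paa = [statistics.fmean(bins[i:i + width]) for i in range(0, len(bins), width)]
    # 2. z-normalise so absolute loudness doesn't dominate.
    mean, std = statistics.fmean(paa), statistics.pstdev(paa)
    if std == 0:
        return alphabet[len(alphabet) // 2] * segments  # flat track: all middle letter
    z = [(v - mean) / std for v in paa]
    # 3. Quantise against breakpoints that split a standard normal
    #    into equally likely regions, one per letter.
    cuts = [NormalDist().inv_cdf(k / len(alphabet)) for k in range(1, len(alphabet))]
    return "".join(alphabet[bisect.bisect_right(cuts, v)] for v in z)


ramp = sax_word(list(range(128)))  # steadily rising energy
```

A steadily rising envelope produces a word that climbs from `a` to `e` without ever stepping back down.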

The waveform data was already there. Adding SAX cost 32 bytes per track and a few microseconds of computation. The waveform_sax column now lives alongside waveform_data in the tracks table.

I immediately used it for two things: coloring the waveform bars in the track list (the gradient goes blue → cyan → orange → red as energy rises — you can see a track's shape at a glance), and blending a 15% structural distance penalty into the similarity search via the MINDIST metric. Tracks that sound alike but have opposite energy shapes get pushed apart slightly in the rankings.
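For reference, MINDIST between two SAX words can be computed as below, using the letter-distance table from the SAX paper. The blending function is only one plausible reading of "a 15% penalty": scaling MINDIST to [0, 1] by its worst case and mixing it linearly with the embedding distance are my assumptions, as are the function names.

```python
import math
from statistics import NormalDist

ALPHABET = "abcde"
# Gaussian breakpoints for a 5-letter alphabet.
BETA = [NormalDist().inv_cdf(k / len(ALPHABET)) for k in range(1, len(ALPHABET))]


def letter_dist(x, y):
    """Identical or adjacent letters cost 0; otherwise the breakpoint gap."""
    r, c = sorted((ALPHABET.index(x), ALPHABET.index(y)))
    return 0.0 if c - r <= 1 else BETA[c - 1] - BETA[r]


def mindist(word_a, word_b, n=128):
    """MINDIST lower bound; n is the length of the original series."""
    if len(word_a) != len(word_b):
        raise ValueError("words must have equal length")
    total = sum(letter_dist(x, y) ** 2 for x, y in zip(word_a, word_b))
    return math.sqrt(n / len(word_a)) * math.sqrt(total)


def blended_distance(embedding_dist, word_a, word_b, weight=0.15):
    """Hypothetical blend: 85% embedding distance, 15% structural distance."""
    worst = mindist("a" * len(word_a), "e" * len(word_a))
    structural = mindist(word_a, word_b) / worst  # scaled to [0, 1]
    return (1 - weight) * embedding_dist + weight * structural
```

Because adjacent letters cost nothing, two tracks whose shapes differ by one energy level per segment are not penalised at all; only clearly opposite shapes are pushed apart.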

But the bigger question was whether SAX could support structural search. Find me tracks that start quietly and build to a loud chorus. Find me tracks with a drop in the middle. Find me tracks that never really rest.
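One way to phrase such queries is as regular expressions over a simplified SAX string. The sketch below is hypothetical: the mapping of the five letters onto L/M/H levels, the run-collapsing step and the specific patterns are all my guesses at a workable scheme, not the Deep Cuts implementation.

```python
import re
from itertools import groupby

# Assumed mapping from the 5-letter alphabet to three coarse levels.
LEVELS = {"a": "L", "b": "L", "c": "M", "d": "H", "e": "H"}

QUERIES = {
    "starts quietly": re.compile(r"^L"),
    "quiet start, loud finish": re.compile(r"^L.*H$"),
    "drop in the middle": re.compile(r"H[^H]*L[^H]*H"),
    "never rests": re.compile(r"^[MH]+$"),
}


def shape(word):
    """Map letters to levels and collapse runs: 'aabbccddee' -> 'LMH'."""
    return "".join(level for level, _ in groupby(LEVELS[ch] for ch in word))


def matching_queries(word):
    """Names of all queries whose pattern matches the track's shape."""
    collapsed = shape(word)
    return [name for name, pattern in QUERIES.items() if pattern.search(collapsed)]
```

Collapsing runs throws away duration, so a two-second dip and a two-minute breakdown look the same to these patterns.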

https://gist.github.com/robertolupi/ab2358e09bb109e987da8cb26028b899


The naive approach: energy alone doesn't work

I have 174 Downspiral tracks — my own music — each with a hand-written lyrics.txt containing section labels: [Intro], [Verse 1], [Chorus], [Bridge], [Outro]. 153 of them are in the Deep Cuts library with SAX computed.

The test: use line position in the lyrics file as a proxy for section position in the track, read the SAX letter at that position, check what energy level each section type actually lands on across 153 tracks.
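The lookup can be sketched as follows. The label regex, the stripping of trailing numbers ("Verse 1" becomes "Verse") and the function name are assumptions about how the lyrics files would be parsed.

```python
import re

LABEL = re.compile(r"^\[(.+?)\]$")


def section_letters(lyrics, word):
    """Pair each section label with the SAX letter at its relative position.

    lyrics: full text of a lyrics file with [Section] labels on their own lines.
    word:   the track's SAX string.
    """
    lines = lyrics.splitlines()
    hits = []
    for i, line in enumerate(lines):
        match = LABEL.match(line.strip())
        if not match:
            continue
        name = re.sub(r"\s*\d+$", "", match.group(1))  # "Verse 1" -> "Verse"
        position = i / len(lines)  # line position as a proxy for time
        index = min(int(position * len(word)), len(word) - 1)
        hits.append((name, word[index]))
    return hits


lyrics = "[Intro]\nla\n[Verse 1]\nla\n[Chorus]\nla\n[Outro]\nla"
pairs = section_letters(lyrics, "aabbddee")
```

The proxy assumes lyrics are spread evenly over the track's duration, so long instrumental passages will skew it.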

Intro → quiet: 99%. Good start.

Chorus → loud: 62%. Noisy: 22% of chorus positions land on L, the quiet level.

Verse → quiet: 57%. Barely better than chance.

Outro → quiet: 30%. My tracks end loud. The L$ pattern would miss about 70% of all outros.

Pattern recall confirmed it. Running "starts quietly" (^L) as a filter on the full 1,890-track library returned 1,688 matches — 89% of everything. Perfect recall. 5% precision. Useless as a search.

Energy describes loudness. It doesn't describe function.


The missing axis: repetition

Song sections aren't defined by loudness. They're defined by role. And role correlates with repetition.

A chorus is the section that comes back. A verse comes back too, but quieter. A bridge happens once. An intro is unique. Loudness is incidental — there are quiet choruses, loud verses, energetic bridges.

Self-Similarity Matrices (SSMs) are built for this. Divide the signal into short segments, measure pairwise similarity between all of them, arrange as a matrix. A section that stays internally consistent shows up as a block on the main diagonal; a section that repeats later shows up again as a matching block or stripe off the diagonal. Verse-chorus structure looks like a checkerboard. Unique sections have no off-diagonal match.

I divided each track's 128 waveform bins into 16 segments of 8 bins, computed cosine similarity between all pairs, and for each segment recorded its repetition score: mean cosine similarity to its 3 nearest non-adjacent neighbours, normalised to [0,1] within the track.
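The computation described above, as a minimal sketch. I read "nearest" as "most similar" and "non-adjacent" as more than one segment away; the zero-vector handling and the min-max normalisation are my choices.

```python
import math


def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def repetition_scores(bins, segments=16, k=3):
    """Per-segment repetition score, min-max normalised within the track."""
    width = len(bins) // segments
    segs = [bins[i * width:(i + 1) * width] for i in range(segments)]
    ssm = [[cosine(a, b) for b in segs] for a in segs]  # self-similarity matrix
    raw = []
    for i in range(segments):
        # Similarities to every segment that is neither i nor a direct neighbour.
        others = sorted(
            (ssm[i][j] for j in range(segments) if abs(i - j) > 1), reverse=True
        )
        raw.append(sum(others[:k]) / k)  # mean of the k most similar
    lo, hi = min(raw), max(raw)
    if hi == lo:
        return [0.0] * segments
    return [(r - lo) / (hi - lo) for r in raw]
```

With raw energy bins every vector is non-negative, so cosine similarities cluster near 1; the within-track normalisation is what spreads the scores out.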

Then I ran the lyrics-position experiment again, now with two features per section: energy and repetition.

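| Section    | Energy | Repetition |
|------------|--------|------------|
| Intro      | 0.016  | 0.466      |
| End        | 0.237  | 0.275      |
| Verse      | 0.338  | 0.706      |
| Pre-Chorus | 0.447  | 0.845      |
| Bridge     | 0.556  | 0.797      |
| Outro      | 0.617  | 0.588      |
| Chorus     | 0.647  | 0.875      |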

Intro: near-zero energy, middling repetition. Nothing else is that quiet, so it is cleanly isolated.

Chorus and Outro: nearly identical energy (0.647 vs 0.617). Completely separated by repetition (0.875 vs 0.588). The hook comes back. The outro doesn't.

Verse and Chorus: both repeat, different energy. In 2D they're distinguishable. In 1D they weren't.

The insight is simple once you see it: a chorus is high-energy and highly repeated. A verse is low-energy and highly repeated. An intro is low-energy and unique. Two numbers, seven sections, mostly clean.
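To see how far two numbers go, the table's averages can be treated as centroids, labelling a segment by whichever is closest. Nearest-centroid is my illustration here, not the method the post proposes (which uses threshold ranges); only the seven (energy, repetition) pairs come from the experiment.

```python
import math

# Mean (energy, repetition) per section, copied from the table above.
CENTROIDS = {
    "Intro": (0.016, 0.466),
    "End": (0.237, 0.275),
    "Verse": (0.338, 0.706),
    "Pre-Chorus": (0.447, 0.845),
    "Bridge": (0.556, 0.797),
    "Outro": (0.617, 0.588),
    "Chorus": (0.647, 0.875),
}


def nearest_section(energy, repetition):
    """Label a segment with the section whose centroid is closest."""
    point = (energy, repetition)
    return min(CENTROIDS, key=lambda name: math.dist(CENTROIDS[name], point))
```

Chorus and Outro sit about 0.03 apart on the energy axis and about 0.29 apart on repetition, which is why the second axis does the separating.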

https://gist.github.com/robertolupi/53bbf952958219ea64c8a868c998da84


What this changes

The block composer I'm planning — click Intro, Chorus, Outro and find matching tracks — can't work as a pure regex over SAX strings. It needs to compute a repetition score vector per track and match against 2D thresholds.

That's more backend work. But it also means the search actually asks the right question. "Does this track have a structurally repeated high-energy section?" is what a user means when they tap Chorus. A regex that matches H in a run-length-encoded (RLE) SAX string would fire on any loud section, repeated or not.

The revised block definitions anchor Intro and Outro by position, not energy. The data makes that clear: Outro positions landed on quiet only 30% of the time. Position is reliable. The rest of the blocks are defined by (energy range × repetition range) thresholds derived from the experiment.
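Since none of this is built yet, the following is purely a sketch of what such block definitions and an in-order match could look like. The `Block` class, the function names and every threshold value are hypothetical placeholders, not numbers derived from the experiment.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Block:
    name: str
    energy: Tuple[float, float] = (0.0, 1.0)      # inclusive range
    repetition: Tuple[float, float] = (0.0, 1.0)  # inclusive range
    anchor: Optional[str] = None                  # "start", "end" or None


# Placeholder definitions: Intro/Outro by position, Chorus by 2D thresholds.
INTRO = Block("Intro", anchor="start")
CHORUS = Block("Chorus", energy=(0.55, 1.0), repetition=(0.80, 1.0))
OUTRO = Block("Outro", anchor="end")


def find_block(block, energy, repetition, start=0):
    """First segment index >= start that satisfies the block, else None."""
    n = len(energy)
    if block.anchor == "start":
        candidates = [0] if start == 0 else []
    elif block.anchor == "end":
        candidates = [n - 1] if start <= n - 1 else []
    else:
        candidates = range(start, n)
    for i in candidates:
        if (block.energy[0] <= energy[i] <= block.energy[1]
                and block.repetition[0] <= repetition[i] <= block.repetition[1]):
            return i
    return None


def matches_sequence(blocks, energy, repetition):
    """True if the blocks occur in the given order within the track."""
    cursor = 0
    for block in blocks:
        hit = find_block(block, energy, repetition, cursor)
        if hit is None:
            return False
        cursor = hit + 1
    return True
```

A loud segment with low repetition fails the Chorus block here, which is exactly the case a regex over energy letters could not reject.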

This isn't built yet. But the data — SAX strings, waveform bins, and now a validated two-axis section model — is already there for every track in the library.


Most music apps treat a song as an atom. Title, artist, tempo, mood. The song is the unit.

But songs have internal structure. Producers think in sections. They reference by section. They build sets around energy arcs.

Deep Cuts is trying to make that structure queryable, without annotations, without stem separation, without network calls. Just from 128 energy bins and the self-similarity structure they reveal.

The block composer comes next.

Deep Cuts

Part 5 of 6

A local-first music intelligence desktop app that analyzes your audio library with machine learning — BPM, key, genre, mood, and semantic embeddings — so producers can filter, search, and discover reference tracks by sonic characteristics. Everything runs on your machine, with no cloud dependency.
