Why libraries get duplicated in the first place
Music libraries grow by accretion. Someone buys an album in 2014, gets the same track on a compilation in 2018, then it shows up again on a regional reissue in 2022 — three rows in your DB, three different files, one recording. Multiply by every user import you've ever ingested and the duplication rate stops being trivial. For a library of any real size, 15–25% of rows are duplicates of something already in the library.
The naive dedup, grouping by title and artist, fails fast. “Strobe” and “Strobe (Radio Edit)” have different titles. “Deadmau5” and “deadmau5” are the same artist with different casing. Then there is “feat. X” vs “ft. X” vs “featuring X”. Exact matching on title and artist does not survive contact with a real catalog.
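To make the failure modes concrete, here is a minimal sketch of a title-and-artist key. The `Row` shape and both key functions are hypothetical, written only for illustration. Normalizing case and "feat." spellings rescues some pairs, but an edit suffix still splits one recording into two groups.

```typescript
// Hypothetical row shape for illustration; only title and artist matter here.
interface Row {
  title: string;
  artist: string;
}

// Naive dedup key: exact artist and title, joined with a separator.
function naiveKey(row: Row): string {
  return `${row.artist}|${row.title}`;
}

// A slightly smarter key: lowercase, collapse "featuring"/"feat."/"ft." to
// one spelling, and squeeze whitespace.
function normalizedKey(row: Row): string {
  const norm = (s: string): string =>
    s
      .toLowerCase()
      .replace(/\b(featuring|feat\.?|ft\.?)(?=\s)/g, "feat")
      .replace(/\s+/g, " ")
      .trim();
  return `${norm(row.artist)}|${norm(row.title)}`;
}

const a: Row = { title: "Strobe", artist: "Deadmau5" };
const b: Row = { title: "Strobe", artist: "deadmau5" };
const c: Row = { title: "Strobe (Radio Edit)", artist: "deadmau5" };

console.log(naiveKey(a) === naiveKey(b));           // false: casing differs
console.log(normalizedKey(a) === normalizedKey(b)); // true: casing normalized
console.log(normalizedKey(b) === normalizedKey(c)); // false: the edit suffix still splits them
```

Every extra normalization rule fixes one class of mismatch and risks merging titles that really are different, which is why string keys are a poor foundation for dedup.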
The reliable fix is to resolve each row's ISRC to a canonical SonoVault track.id. That ID is your dedup key.
How the dedup key works
Every ISRC resolves to exactly one canonical SonoVault track. Two ISRCs for the same recording resolve to the same track:
```typescript
// Two variants of the same recording, with different ISRCs:
//   "Strobe (Radio Edit)" → USQX91300108
//   "Strobe"              → USUS11000123
// Both resolve to the same canonical SonoVault track.id.
const a = await fetch(`${BASE}/tracks/isrc/USQX91300108`, { headers });
const b = await fetch(`${BASE}/tracks/isrc/USUS11000123`, { headers });
const ta = await a.json();
const tb = await b.json();
console.log(ta.id === tb.id); // true: same canonical track
```
That property is the entire dedup story:
- Resolve every row in your library to its canonical track ID.
- Group rows by that ID. Each group is one recording.
- For each group with more than one row, apply your tie-breaker.
Resolve every row to a canonical track ID
Batch your ISRCs through /v1/tracks/resolve (100 per request) and build a rowID → trackID map. Rows with no ISRC stay out of the dedup; you handle those separately at the end.
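Before batching, it can help to clean the ISRCs you send. The sketch below is one possible pre-processing step, not something the API requires: it strips hyphens and whitespace, uppercases, checks the standard 12-character ISRC layout, and drops repeats. The helper names `normalizeIsrc` and `chunk` are hypothetical.

```typescript
// ISRC layout: 2-letter country code, 3 alphanumeric registrant characters,
// 2-digit year, 5-digit designation code (12 characters without hyphens).
const ISRC_PATTERN = /^[A-Z]{2}[A-Z0-9]{3}\d{2}\d{5}$/;

// Returns the cleaned ISRC, or null if the value does not look like one.
function normalizeIsrc(raw: string): string | null {
  const cleaned = raw.replace(/[\s-]/g, "").toUpperCase();
  return ISRC_PATTERN.test(cleaned) ? cleaned : null;
}

// Splits a list into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

const raw = ["US-QX9-13-00108", "usus11000123", "USQX91300108", "not-an-isrc"];
const unique = Array.from(
  new Set(raw.map(normalizeIsrc).filter((x): x is string => x !== null)),
);
const batches = chunk(unique, 100);

console.log(unique);         // ["USQX91300108", "USUS11000123"]
console.log(batches.length); // 1
```

If you deduplicate ISRCs like this, keep a map from each normalized ISRC back to the rows that carry it, so every row still receives its canonical track ID after resolution.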
Group by canonical track ID and pick a winner
Group the map by value (canonical track ID), then pick one row per group as the survivor. Here we use longest duration as the tie-breaker — usually that's the extended mix, which carries the most content. Swap for whatever your library cares about (bitrate, format, oldest date added).
```typescript
import fs from "node:fs";

const API_KEY = process.env.SONOVAULT_API_KEY!;
const BASE = "https://api.sonovault.now/v1";
const BATCH = 100;

interface LibraryRow {
  id: string;
  isrc?: string;
  title: string;
  duration: number;
}

async function resolveBatch(isrcs: string[]) {
  const res = await fetch(`${BASE}/tracks/resolve`, {
    method: "POST",
    headers: { "x-api-key": API_KEY, "content-type": "application/json" },
    body: JSON.stringify({ input_type: "isrc", items: isrcs }),
  });
  // Fail loudly on HTTP errors rather than parsing an error body as results.
  if (!res.ok) throw new Error(`resolve failed: HTTP ${res.status}`);
  return (await res.json()).results;
}

const rows: LibraryRow[] = JSON.parse(fs.readFileSync("./library.json", "utf-8"));

// 1. Build a map of (your row ID) → (canonical SonoVault track ID).
const rowToTrack = new Map<string, number>();
const withIsrc = rows.filter(r => r.isrc);
for (let i = 0; i < withIsrc.length; i += BATCH) {
  const chunk = withIsrc.slice(i, i + BATCH);
  const results = await resolveBatch(chunk.map(r => r.isrc!));
  // Assumes results come back in the same order as the submitted items.
  chunk.forEach((row, j) => {
    const track = results[j]?.track;
    if (track) rowToTrack.set(row.id, track.id);
  });
}

// 2. Group rows by canonical track ID. Each group is one recording.
const groups = new Map<number, LibraryRow[]>();
for (const row of rows) {
  const trackId = rowToTrack.get(row.id);
  // Unresolved or ISRC-less rows are skipped here and written to neither
  // output file; handle them separately.
  if (trackId === undefined) continue;
  if (!groups.has(trackId)) groups.set(trackId, []);
  groups.get(trackId)!.push(row);
}

// 3. For each group with >1 row, pick a winner. Here: longest duration wins.
const winners: LibraryRow[] = [];
const losers: LibraryRow[] = [];
for (const group of groups.values()) {
  if (group.length === 1) {
    winners.push(group[0]);
    continue;
  }
  const sorted = [...group].sort((a, b) => b.duration - a.duration);
  winners.push(sorted[0]);
  losers.push(...sorted.slice(1));
}

console.log(`${rows.length} input rows → ${winners.length} unique recordings (${losers.length} duplicates)`);
fs.writeFileSync("./library.deduped.json", JSON.stringify(winners, null, 2));
fs.writeFileSync("./library.duplicates.json", JSON.stringify(losers, null, 2));
```
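If longest duration is not the right rule for your library, the tie-breaker can be made pluggable. The following is a rough sketch under assumptions: the `bitrateKbps` and `addedAt` fields are hypothetical and are not part of the row shape used in the script above. It chains comparators so that the first one that distinguishes two rows decides.

```typescript
// Hypothetical row shape: bitrateKbps and addedAt are illustrative fields.
interface Candidate {
  id: string;
  duration: number;    // seconds
  bitrateKbps: number;
  addedAt: string;     // ISO 8601 date
}

type Comparator<T> = (a: T, b: T) => number;

// Chain comparators: the first non-zero result decides the order.
function byPriority<T>(...comparators: Comparator<T>[]): Comparator<T> {
  return (a, b) => {
    for (const cmp of comparators) {
      const result = cmp(a, b);
      if (result !== 0) return result;
    }
    return 0;
  };
}

const highestBitrate: Comparator<Candidate> = (a, b) => b.bitrateKbps - a.bitrateKbps;
const longestDuration: Comparator<Candidate> = (a, b) => b.duration - a.duration;
const oldestAdded: Comparator<Candidate> = (a, b) =>
  Date.parse(a.addedAt) - Date.parse(b.addedAt);

// Sort a copy of the group so the preferred row comes first, then take it.
function pickWinner(group: Candidate[], policy: Comparator<Candidate>): Candidate {
  return [...group].sort(policy)[0];
}

const group: Candidate[] = [
  { id: "r1", duration: 634, bitrateKbps: 320, addedAt: "2018-06-10" },
  { id: "r2", duration: 214, bitrateKbps: 320, addedAt: "2014-03-01" },
  { id: "r3", duration: 634, bitrateKbps: 256, addedAt: "2022-01-05" },
];

const policy = byPriority(highestBitrate, longestDuration, oldestAdded);
console.log(pickWinner(group, policy).id);      // "r1"
console.log(pickWinner(group, oldestAdded).id); // "r2"
```

Keeping the policy separate from the grouping means you can change what "best copy" means without touching the resolution step.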
Handle the rows with no ISRC
For the long tail of rows missing an ISRC, fall back to /v1/tracks/search with artist + title, take the top result's track ID, and feed it into the same group map. Be conservative: only adopt the result if its edit-distance-based similarity to your row is high (> 0.85, where 1.0 is an exact match). A bad fuzzy match here merges two genuinely different recordings, which is much worse than leaving them separate.
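One way to implement that acceptance check is a normalized Levenshtein similarity. This is a sketch under assumptions: the `SearchHit` shape is a hypothetical stand-in for whatever the search endpoint returns, and requiring both artist and title to clear the threshold is one possible policy among several.

```typescript
// Classic dynamic-programming Levenshtein distance, keeping two rows.
function levenshtein(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity in [0, 1]: 1 means identical after trimming and lowercasing.
function similarity(a: string, b: string): number {
  const x = a.trim().toLowerCase();
  const y = b.trim().toLowerCase();
  const longest = Math.max(x.length, y.length);
  return longest === 0 ? 1 : 1 - levenshtein(x, y) / longest;
}

// Hypothetical shape of a search result; real field names may differ.
interface SearchHit {
  artist: string;
  title: string;
}

// Accept the top hit only if both artist and title clear the threshold.
function acceptMatch(row: SearchHit, top: SearchHit, threshold = 0.85): boolean {
  return (
    similarity(row.artist, top.artist) > threshold &&
    similarity(row.title, top.title) > threshold
  );
}

const row = { artist: "deadmau5", title: "Ghosts n Stuff" };
console.log(acceptMatch(row, { artist: "Deadmau5", title: "Ghosts 'n' Stuff" }));    // true
console.log(acceptMatch(row, { artist: "Deadmau5", title: "Strobe (Radio Edit)" })); // false
```

Rows that fail the check should stay standalone rather than being forced into a group.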
Going further
- Remixes are not duplicates. A remix gets its own ISRC and its own canonical track. If you want “original + remix” collapsed under one entry, that's a product decision: look at artists[].is_remixer on the track to detect remix versions.
- Cross-platform IDs come along for free. Once you have the canonical track ID, /v1/tracks/links gives you Spotify, Beatport, Apple Music, Tidal, Discogs, and MusicBrainz IDs in one call. See cross-platform ID backfill.
- Run it incrementally. After the initial dedup, only newly added rows need resolving; anything that maps to an existing group is a duplicate of something you already have.
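The incremental run can be sketched as a pure classification step. Assume you have persisted a map from canonical track ID to the surviving row ID, and that new rows have already been resolved. The `NewRow` shape and `classifyNewRows` are hypothetical names for illustration.

```typescript
// Hypothetical shape: trackId is the canonical ID from resolution, or null
// when the row could not be resolved.
interface NewRow {
  id: string;
  trackId: number | null;
}

interface Classified {
  fresh: NewRow[];                                  // first row seen for a recording
  duplicates: { row: NewRow; duplicateOf: string }[];
  unresolved: NewRow[];
}

// `existing` maps canonical track ID → surviving row ID. It is updated in
// place so that duplicates inside the incoming batch are caught as well.
function classifyNewRows(existing: Map<number, string>, incoming: NewRow[]): Classified {
  const out: Classified = { fresh: [], duplicates: [], unresolved: [] };
  for (const row of incoming) {
    if (row.trackId === null) {
      out.unresolved.push(row);
      continue;
    }
    const survivor = existing.get(row.trackId);
    if (survivor !== undefined) {
      out.duplicates.push({ row, duplicateOf: survivor });
    } else {
      existing.set(row.trackId, row.id);
      out.fresh.push(row);
    }
  }
  return out;
}

const existing = new Map<number, string>([[101, "row-1"]]);
const result = classifyNewRows(existing, [
  { id: "row-7", trackId: 101 },
  { id: "row-8", trackId: 202 },
  { id: "row-9", trackId: 202 },
  { id: "row-10", trackId: null },
]);

console.log(result.fresh.map(r => r.id));                  // ["row-8"]
console.log(result.duplicates.map(d => d.duplicateOf));    // ["row-1", "row-8"]
console.log(result.unresolved.map(r => r.id));             // ["row-10"]
```

This version always keeps the existing survivor. If your tie-breaker could prefer the newcomer, compare the two rows before deciding which one to record in the map.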
Frequently asked questions
Won't two ISRCs always represent two different recordings?
Not in practice. Labels mint a fresh ISRC for every commercial variant — radio edit, extended mix, compilation reissue, remaster, regional release. SonoVault collapses those variants into one canonical track and keeps every ISRC. So two different ISRCs may resolve to the same SonoVault track.id, which is the real signal for dedup.
What if my source data is missing ISRCs entirely?
Fall back to /v1/tracks/search with artist + title, pick the top result, and use its canonical track ID. The result is less accurate than ISRC matching but covers the long tail. Score each candidate with a normalized edit-distance similarity (0 to 1) and skip anything below a threshold (e.g. 0.85) to avoid bad merges.
How do I pick which duplicate to keep?
Up to you. Common rules: longest-duration version (often the extended mix), highest-bitrate file, the one from a studio album over a compilation, or simply the oldest add to the library. The dedup logic just identifies groups — the merge policy is your call.
What about live versions and remixes?
A remix is technically a new recording with its own ISRC and its own SonoVault track. The artist credit will include the remixer with is_remixer: true. Live versions are usually distinct too. If you want to collapse a studio version and its remix into one entry, that's a product decision, not a metadata one. Use the artists[].is_remixer flag to find the remixes.
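As an illustration, remix detection might look like the sketch below. Only the `artists[].is_remixer` flag comes from the text above. The rest of the `Track` shape is an assumed simplification, and the sample data is invented.

```typescript
// Assumed, simplified track payload. Only artists[].is_remixer is taken from
// the article; the other fields are illustrative.
interface ArtistCredit {
  name: string;
  is_remixer: boolean;
}

interface Track {
  id: number;
  title: string;
  artists: ArtistCredit[];
}

// A track counts as a remix if any credited artist is flagged as a remixer.
function isRemix(track: Track): boolean {
  return track.artists.some(artist => artist.is_remixer);
}

// Separate originals from remixes so a product rule can decide how to
// present them together.
function splitRemixes(tracks: Track[]): { originals: Track[]; remixes: Track[] } {
  return {
    originals: tracks.filter(t => !isRemix(t)),
    remixes: tracks.filter(isRemix),
  };
}

// Invented sample data.
const tracks: Track[] = [
  { id: 1, title: "Song", artists: [{ name: "Artist A", is_remixer: false }] },
  {
    id: 2,
    title: "Song (Remixer B Remix)",
    artists: [
      { name: "Artist A", is_remixer: false },
      { name: "Remixer B", is_remixer: true },
    ],
  },
];

const { originals, remixes } = splitRemixes(tracks);
console.log(originals.map(t => t.id)); // [1]
console.log(remixes.map(t => t.id));   // [2]
```

How you link a remix back to its original (for example by shared primary artist and base title) is left open here, since the article treats that as a product decision.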