Koito/thoughts/plans/20260325_bulk_import_optimization.md
safierinx-a 8ce6ec494d Add bulk import optimization: track_lookup cache, batch inserts, BulkSubmitter
Adopts ListenBrainz-inspired patterns to speed up imports from ~24h to
under 30 minutes for 49k scrobbles.

Phase 1 - track_lookup cache table:
- New migration (000006) adds persistent entity lookup cache
- Maps normalized (artist, track, album) → (artist_id, album_id, track_id)
- SubmitListen fast path: cache hit skips 18 DB queries → 2 queries
- Cache populated after entity resolution, invalidated on merge/delete
- Benefits both live scrobbles and imports

Phase 2 - SaveListensBatch:
- New batch listen insert using pgx CopyFrom → temp table → INSERT ON CONFLICT
- Thousands of inserts per second vs one-at-a-time

Phase 3 - BulkSubmitter:
- Reusable import accelerator for all importers
- Pre-deduplicates scrobbles by (artist, track, album) in memory
- Worker pool (4 goroutines) for parallel entity creation on cache miss
- Batch listen insertion via SaveListensBatch

Phase 4 - Migrate importers:
- Maloja, Spotify, LastFM, ListenBrainz importers use BulkSubmitter
- Koito importer left as-is (already fast with pre-resolved IDs)

Phase 5 - Skip image lookups during import:
- GetArtistImage/GetAlbumImage calls fully skipped when SkipCacheImage=true
- Background tasks (FetchMissingArtistImages/FetchMissingAlbumImages) backfill

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 04:17:50 +05:30


Bulk Import Optimization — Implementation Plan

Overview

Optimize Koito's import pipeline from ~20 listens/min to thousands/min by adopting ListenBrainz-inspired patterns: a persistent entity lookup cache, batch DB writes, pre-deduplication, and deferred enrichment. The core insight from the ecosystem research is "write raw first, enrich async" — and the persistent lookup table benefits all scrobbles (live + import), not just bulk imports.

Current State Analysis

The Problem

Importing 49,050 Maloja scrobbles takes ~24 hours. The import is stable (our PR 1 fixes eliminated panics) but each scrobble runs through the full SubmitListen path:

  • GetArtist issues 6 DB queries per lookup (including rank computation via window functions)
  • GetAlbum issues 6 DB queries per lookup
  • GetTrack issues 5-6 DB queries per lookup
  • GetArtistImage / GetAlbumImage makes HTTP calls even when all image providers are disabled
  • SaveListen is a single INSERT — the only fast part

Per unique scrobble: ~18 DB round-trips + 2 image lookups. Per repeated scrobble: ~18 DB round-trips (no caching). With 5,589 unique artists, 2,628 unique albums, and 49,050 total scrobbles, this is massively redundant.

Key Discoveries

  • SubmitListenOpts.SkipSaveListen (catalog.go:43) can be used to create entities without recording a listen — useful for entity pre-creation
  • SubmitListenOpts.SkipCacheImage (catalog.go:46) controls image download but NOT image URL resolution — the HTTP calls still happen
  • The Koito native importer (importer/koito.go) already bypasses SubmitListen and does direct DB calls — a precedent for a faster import path
  • pgxpool.Pool is goroutine-safe — concurrent DB operations are safe at the pool level
  • SaveListen SQL uses ON CONFLICT DO NOTHING — re-importing is idempotent
  • No batch insert methods exist anywhere in the codebase
  • GetArtist/GetAlbum/GetTrack compute full stats (listen count, time listened, rank) on every call — unnecessary during import

Ecosystem Patterns (from research)

  • ListenBrainz: Stores raw scrobbles immediately, resolves MBIDs asynchronously via background worker + Typesense index. Uses MessyBrainz as a stable (artist, track, release) → ID mapping.
  • Maloja: Runs every import through the full normalize → dedup → cache-invalidate cycle. Works for live scrobbles, kills bulk import. This is exactly Koito's current problem.
  • Last.fm: Resolves metadata at write time (corrections), batches up to 50 scrobbles per request.
  • General: DB-level dedup via unique constraint + ON CONFLICT is the industry standard.

Desired End State

  1. A track_lookup table provides O(1) entity resolution for any (artist, track, album) tuple — both live and import scrobbles benefit
  2. All 5 importers use a shared BulkSubmitter that pre-deduplicates, creates entities in parallel, and batch-inserts listens
  3. Image/MBZ enrichment is fully deferred to existing background tasks during import
  4. 49k Maloja import completes in under 30 minutes (vs 24 hours currently)
  5. Live scrobbles are faster too — cache hit skips 18 DB queries, goes straight to 1 SELECT + 1 INSERT

Verification

  • go build ./... compiles
  • go test ./... passes (existing + new tests)
  • Manual: import 49k Maloja scrobbles in under 30 minutes on vo-pc
  • Manual: verify live scrobbles from multi-scrobbler still work correctly
  • Manual: verify album art appears after background image backfill runs

What We're NOT Doing

  • Replacing the DB engine (no TimescaleDB, no Redis) — Postgres is fine for self-hosted scale
  • Local MusicBrainz mirror or Typesense index — overkill for single-user; the live API + background enrichment is sufficient
  • Changing the live SubmitListen API path — the lookup cache makes it faster, but the logic stays the same
  • Parallelizing live scrobbles — only imports use the worker pool; live scrobbles remain single-threaded through SubmitListen
  • Changing the ListenBrainz/Last.fm relay — multi-scrobbler handles that independently

Implementation Approach

Adopt ListenBrainz's "MessyBrainz" pattern as a persistent Postgres table: a normalized (artist, track, album) tuple maps to resolved (artist_id, album_id, track_id). This is the foundational optimization — everything else builds on it.

Before (per scrobble):
  GetArtist (6 queries) → GetAlbum (6 queries) → GetTrack (5-6 queries) → SaveListen (1 query)
  = 18-19 queries

After (cache hit):
  SELECT FROM track_lookup (1 query) → SaveListen (1 query)
  = 2 queries

After (bulk import, cache hit):
  In-memory map lookup (0 queries) → batched SaveListen
  = amortized ~0.01 queries per scrobble
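The arithmetic behind these estimates can be sketched as a back-of-envelope calculation. Only the 49,050-scrobble total comes from this plan; the unique-tuple count and batch size below are illustrative assumptions, and the amortized figure depends heavily on the ratio of unique tuples to total scrobbles:

```go
package main

import "fmt"

func main() {
	const (
		totalScrobbles  = 49050 // from the Maloja export discussed above
		uniqueTuples    = 10000 // hypothetical unique (artist, track, album) count
		queriesPerMiss  = 19    // full SubmitListen resolution path
		insertBatchSize = 2000  // hypothetical SaveListensBatch flush size
	)

	// Before: every scrobble pays the full resolution cost.
	before := totalScrobbles * queriesPerMiss

	// First bulk import: one full resolution per unique tuple,
	// plus one batched insert statement per batch of listens.
	firstRun := uniqueTuples*queriesPerMiss + totalScrobbles/insertBatchSize

	// Warm cache (re-import): one SELECT per unique tuple + batched inserts.
	warmRun := uniqueTuples + totalScrobbles/insertBatchSize

	fmt.Println("before:", before)
	fmt.Println("first bulk run:", firstRun)
	fmt.Println("warm-cache run:", warmRun)
	fmt.Printf("warm amortized: %.3f queries/scrobble\n",
		float64(warmRun)/float64(totalScrobbles))
}
```

With a library whose unique-tuple count is small relative to total listens, the amortized per-scrobble cost falls toward the ~0.01 figure quoted above.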

Phase 1: track_lookup Cache Table

Overview

Add a persistent lookup table that maps normalized (artist_name, track_title, release_title) to resolved entity IDs. Integrate into SubmitListen so both live and import scrobbles benefit.

Changes Required

1. New Migration

File: db/migrations/000006_track_lookup.sql

-- +goose Up
CREATE TABLE track_lookup (
    lookup_key TEXT NOT NULL PRIMARY KEY,
    artist_id INT NOT NULL REFERENCES artists(id) ON DELETE CASCADE,
    album_id INT NOT NULL REFERENCES releases(id) ON DELETE CASCADE,
    track_id INT NOT NULL REFERENCES tracks(id) ON DELETE CASCADE,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_track_lookup_track_id ON track_lookup(track_id);
CREATE INDEX idx_track_lookup_artist_id ON track_lookup(artist_id);
CREATE INDEX idx_track_lookup_album_id ON track_lookup(album_id);

-- +goose Down
DROP TABLE IF EXISTS track_lookup;

The lookup_key is a normalized string: lower(artist) || '\x00' || lower(track) || '\x00' || lower(album). Using a single TEXT key with a null-byte separator is simpler and faster than a multi-column composite key with citext.

2. New SQL Queries

File: db/queries/track_lookup.sql

-- name: GetTrackLookup :one
SELECT artist_id, album_id, track_id
FROM track_lookup
WHERE lookup_key = $1;

-- name: InsertTrackLookup :exec
INSERT INTO track_lookup (lookup_key, artist_id, album_id, track_id)
VALUES ($1, $2, $3, $4)
ON CONFLICT (lookup_key) DO UPDATE SET
    artist_id = EXCLUDED.artist_id,
    album_id = EXCLUDED.album_id,
    track_id = EXCLUDED.track_id;

-- name: DeleteTrackLookupByArtist :exec
DELETE FROM track_lookup WHERE artist_id = $1;

-- name: DeleteTrackLookupByAlbum :exec
DELETE FROM track_lookup WHERE album_id = $1;

-- name: DeleteTrackLookupByTrack :exec
DELETE FROM track_lookup WHERE track_id = $1;

3. Regenerate sqlc

Run sqlc generate to create the Go bindings in internal/repository/.

4. DB Interface + Psql Implementation

File: internal/db/db.go — Add to interface:

// Track Lookup Cache
GetTrackLookup(ctx context.Context, key string) (*TrackLookupResult, error)
SaveTrackLookup(ctx context.Context, opts SaveTrackLookupOpts) error
InvalidateTrackLookup(ctx context.Context, opts InvalidateTrackLookupOpts) error

File: internal/db/opts.go — Add types:

type TrackLookupResult struct {
    ArtistID int32
    AlbumID  int32
    TrackID  int32
}

type SaveTrackLookupOpts struct {
    Key      string
    ArtistID int32
    AlbumID  int32
    TrackID  int32
}

type InvalidateTrackLookupOpts struct {
    ArtistID int32
    AlbumID  int32
    TrackID  int32
}

File: internal/db/psql/track_lookup.go — New file implementing the three methods.

5. Lookup Key Helper

File: internal/catalog/lookup_key.go — New file:

package catalog

import "strings"

// TrackLookupKey builds a normalized cache key for entity resolution.
func TrackLookupKey(artist, track, album string) string {
    return strings.ToLower(artist) + "\x00" + strings.ToLower(track) + "\x00" + strings.ToLower(album)
}
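A quick sanity check on the separator choice: plain concatenation lets distinct tuples collide, while the null byte keeps the three fields unambiguous. The tuples below are illustrative only:

```go
package main

import (
	"fmt"
	"strings"
)

// TrackLookupKey mirrors the helper above.
func TrackLookupKey(artist, track, album string) string {
	return strings.ToLower(artist) + "\x00" + strings.ToLower(track) + "\x00" + strings.ToLower(album)
}

// naiveKey shows the collision-prone alternative.
func naiveKey(artist, track, album string) string {
	return strings.ToLower(artist + track + album)
}

func main() {
	// Two different tuples that a naive concatenation cannot tell apart.
	fmt.Println("naive collide:",
		naiveKey("AC", "DC Thunderstruck", "Live") ==
			naiveKey("ACDC", " Thunderstruck", "Live")) // true

	fmt.Println("null-byte collide:",
		TrackLookupKey("AC", "DC Thunderstruck", "Live") ==
			TrackLookupKey("ACDC", " Thunderstruck", "Live")) // false

	// Case-insensitive matching still works.
	fmt.Println("case-insensitive:",
		TrackLookupKey("Boards of Canada", "ROYGBIV", "MHTRTC") ==
			TrackLookupKey("boards of canada", "roygbiv", "mhtrtc")) // true
}
```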

6. Integrate into SubmitListen

File: internal/catalog/catalog.go — Add fast path at the top of SubmitListen:

func SubmitListen(ctx context.Context, store db.DB, opts SubmitListenOpts) error {
    l := logger.FromContext(ctx)

    if opts.Artist == "" || opts.TrackTitle == "" {
        return errors.New("track name and artist are required")
    }

    opts.Time = opts.Time.Truncate(time.Second)

    // Fast path: check lookup cache for known entity combo
    if !opts.SkipSaveListen {
        key := TrackLookupKey(opts.Artist, opts.TrackTitle, opts.ReleaseTitle)
        cached, err := store.GetTrackLookup(ctx, key)
        if err == nil && cached != nil {
            l.Debug().Msg("Track lookup cache hit — skipping entity resolution")
            return store.SaveListen(ctx, db.SaveListenOpts{
                TrackID: cached.TrackID,
                Time:    opts.Time,
                UserID:  opts.UserID,
                Client:  opts.Client,
            })
        }
    }

    // ... existing SubmitListen logic (unchanged) ...

    // After successful entity resolution, populate the cache
    // (best-effort: a failed write only means a slower next lookup)
    store.SaveTrackLookup(ctx, db.SaveTrackLookupOpts{
        Key:      TrackLookupKey(opts.Artist, opts.TrackTitle, opts.ReleaseTitle),
        ArtistID: artists[0].ID,
        AlbumID:  rg.ID,
        TrackID:  track.ID,
    })

    // ... rest of existing logic ...
}

Note: The cache only applies when we have a direct artist+track+album text match. Scrobbles with MBZ IDs that resolve to different text representations will still go through the full path on first encounter, then be cached.

7. Invalidation on Merge/Delete

File: internal/db/psql/artist.go — In DeleteArtist and MergeArtists, add:

d.q.DeleteTrackLookupByArtist(ctx, id)

File: internal/db/psql/album.go — In DeleteAlbum and MergeAlbums, add:

d.q.DeleteTrackLookupByAlbum(ctx, id)

File: internal/db/psql/track.go — In DeleteTrack and MergeTracks, add:

d.q.DeleteTrackLookupByTrack(ctx, id)

Success Criteria

  • go build ./... compiles
  • go test ./... passes
  • New test: TestSubmitListen_LookupCacheHit — second identical scrobble uses cache
  • New test: TestSubmitListen_LookupCacheInvalidateOnDelete — deleting entity clears cache
  • Migration applies cleanly on fresh and existing databases
  • Live scrobbles from multi-scrobbler populate the cache on first hit, use it on second

Phase 2: SaveListensBatch DB Method

Overview

Add a batch listen insert method using pgx's CopyFrom for high-throughput listen insertion. This is the DB foundation for the BulkSubmitter.

Changes Required

1. New SQL + DB Interface

File: internal/db/db.go — Add to interface:

SaveListensBatch(ctx context.Context, opts []SaveListenOpts) (int64, error)

Returns the number of rows actually inserted (excluding ON CONFLICT duplicates).

2. Psql Implementation

File: internal/db/psql/listen.go — New method:

func (d *Psql) SaveListensBatch(ctx context.Context, opts []db.SaveListenOpts) (int64, error) {
    if len(opts) == 0 {
        return 0, nil
    }

    // Use a transaction with a temp table + INSERT ... ON CONFLICT pattern
    // since CopyFrom doesn't support ON CONFLICT directly
    tx, err := d.conn.BeginTx(ctx, pgx.TxOptions{})
    if err != nil {
        return 0, fmt.Errorf("SaveListensBatch: BeginTx: %w", err)
    }
    defer tx.Rollback(ctx)

    // Create temp table
    _, err = tx.Exec(ctx, `
        CREATE TEMP TABLE tmp_listens (
            track_id INT,
            listened_at TIMESTAMPTZ,
            user_id INT,
            client TEXT
        ) ON COMMIT DROP
    `)
    if err != nil {
        return 0, fmt.Errorf("SaveListensBatch: create temp table: %w", err)
    }

    // CopyFrom into temp table
    rows := make([][]interface{}, len(opts))
    for i, o := range opts {
        var client interface{}
        if o.Client != "" {
            client = o.Client
        }
        rows[i] = []interface{}{o.TrackID, o.Time, o.UserID, client}
    }

    _, err = tx.CopyFrom(ctx,
        pgx.Identifier{"tmp_listens"},
        []string{"track_id", "listened_at", "user_id", "client"},
        pgx.CopyFromRows(rows),
    )
    if err != nil {
        return 0, fmt.Errorf("SaveListensBatch: CopyFrom: %w", err)
    }

    // Insert from temp table with dedup
    tag, err := tx.Exec(ctx, `
        INSERT INTO listens (track_id, listened_at, user_id, client)
        SELECT track_id, listened_at, user_id, client FROM tmp_listens
        ON CONFLICT DO NOTHING
    `)
    if err != nil {
        return 0, fmt.Errorf("SaveListensBatch: insert: %w", err)
    }

    if err := tx.Commit(ctx); err != nil {
        return 0, fmt.Errorf("SaveListensBatch: Commit: %w", err)
    }

    return tag.RowsAffected(), nil
}

This uses the standard CopyFrom → temp table → INSERT ON CONFLICT pattern, which is the fastest bulk insert approach with pgx while still supporting deduplication.

Success Criteria

  • go build ./... compiles
  • go test ./... passes
  • New test: TestSaveListensBatch — insert 1000 listens, verify count
  • New test: TestSaveListensBatch_Dedup — insert duplicates, verify no double-counting
  • New test: TestSaveListensBatch_Empty — empty input returns 0, no error

Phase 3: BulkSubmitter Helper

Overview

A reusable import accelerator that all importers can use. Pre-deduplicates scrobbles in memory, resolves entities via the track_lookup cache (falling back to SubmitListen on cache miss), and batch-inserts listens.

Design

BulkSubmitter
├── Accept(SubmitListenOpts)     — buffer a scrobble
├── Flush() (int, error)         — process all buffered scrobbles
│   ├── Phase A: Deduplicate by (artist, track, album) key
│   ├── Phase B: Resolve entities
│   │   ├── Check track_lookup cache (single SELECT)
│   │   ├── On miss: call SubmitListen(SkipSaveListen=true) to create entities
│   │   ├── Worker pool: N goroutines for parallel entity creation
│   │   └── Populate track_lookup cache after creation
│   ├── Phase C: Map all scrobbles to track_ids via resolved cache
│   └── Phase D: SaveListensBatch
└── Progress callback for logging

Changes Required

1. New Package

File: internal/importer/bulk.go

package importer

import (
    "context"
    "fmt"
    "sync"
    "time"

    "github.com/gabehf/koito/internal/catalog"
    "github.com/gabehf/koito/internal/db"
    "github.com/gabehf/koito/internal/logger"
    "github.com/gabehf/koito/internal/mbz"
)

type BulkSubmitter struct {
    store      db.DB
    mbzc       mbz.MusicBrainzCaller
    ctx        context.Context
    buffer     []catalog.SubmitListenOpts
    workers    int
    onProgress func(imported, total int)
}

type BulkSubmitterOpts struct {
    Store      db.DB
    Mbzc       mbz.MusicBrainzCaller
    Workers    int                        // default 4
    OnProgress func(imported, total int)  // called every 500 items
}

func NewBulkSubmitter(ctx context.Context, opts BulkSubmitterOpts) *BulkSubmitter {
    workers := opts.Workers
    if workers <= 0 {
        workers = 4
    }
    return &BulkSubmitter{
        store:      opts.Store,
        mbzc:       opts.Mbzc,
        ctx:        ctx,
        workers:    workers,
        onProgress: opts.OnProgress,
    }
}

func (bs *BulkSubmitter) Accept(opts catalog.SubmitListenOpts) {
    bs.buffer = append(bs.buffer, opts)
}

func (bs *BulkSubmitter) Flush() (int, error) {
    l := logger.FromContext(bs.ctx)
    if len(bs.buffer) == 0 {
        return 0, nil
    }

    l.Info().Msgf("BulkSubmitter: Processing %d scrobbles", len(bs.buffer))

    // Phase A: Deduplicate — find unique (artist, track, album) tuples
    type entityKey = string
    unique := make(map[entityKey]catalog.SubmitListenOpts)
    for _, opts := range bs.buffer {
        key := catalog.TrackLookupKey(opts.Artist, opts.TrackTitle, opts.ReleaseTitle)
        if _, exists := unique[key]; !exists {
            unique[key] = opts
        }
    }
    l.Info().Msgf("BulkSubmitter: %d unique entity combos from %d scrobbles", len(unique), len(bs.buffer))

    // Phase B: Resolve entities — check cache, create on miss
    resolved := make(map[entityKey]int32) // key → trackID
    var mu sync.Mutex // guards resolved: cache-hit writes below can interleave with worker writes
    var wg sync.WaitGroup
    sem := make(chan struct{}, bs.workers)

    for key, opts := range unique {
        // Check track_lookup cache first
        cached, err := bs.store.GetTrackLookup(bs.ctx, key)
        if err == nil && cached != nil {
            mu.Lock()
            resolved[key] = cached.TrackID
            mu.Unlock()
            continue
        }

        // Cache miss — create entities via SubmitListen (with worker pool)
        wg.Add(1)
        sem <- struct{}{} // acquire worker slot
        go func(k entityKey, o catalog.SubmitListenOpts) {
            defer wg.Done()
            defer func() { <-sem }() // release worker slot

            o.SkipSaveListen = true
            o.SkipCacheImage = true
            err := catalog.SubmitListen(bs.ctx, bs.store, o)
            if err != nil {
                l.Err(err).Msgf("BulkSubmitter: Failed to create entities for '%s' by '%s'", o.TrackTitle, o.Artist)
                return
            }

            // Re-check cache (SubmitListen populates it in Phase 1's integration)
            cached, err := bs.store.GetTrackLookup(bs.ctx, k)
            if err == nil && cached != nil {
                mu.Lock()
                resolved[k] = cached.TrackID
                mu.Unlock()
            }
        }(key, opts)
    }
    wg.Wait()

    l.Info().Msgf("BulkSubmitter: Resolved %d/%d entity combos", len(resolved), len(unique))

    // Phase C: Build listen batch
    batch := make([]db.SaveListenOpts, 0, len(bs.buffer))
    skipped := 0
    for i, opts := range bs.buffer {
        key := catalog.TrackLookupKey(opts.Artist, opts.TrackTitle, opts.ReleaseTitle)
        trackID, ok := resolved[key]
        if !ok {
            skipped++
            continue
        }
        batch = append(batch, db.SaveListenOpts{
            TrackID: trackID,
            Time:    opts.Time.Truncate(time.Second),
            UserID:  opts.UserID,
            Client:  opts.Client,
        })
        if bs.onProgress != nil && (i+1)%500 == 0 {
            bs.onProgress(i+1, len(bs.buffer))
        }
    }
    if skipped > 0 {
        l.Warn().Msgf("BulkSubmitter: Skipped %d scrobbles with unresolved entities", skipped)
    }

    // Phase D: Batch insert listens
    inserted, err := bs.store.SaveListensBatch(bs.ctx, batch)
    if err != nil {
        return 0, fmt.Errorf("BulkSubmitter: SaveListensBatch: %w", err)
    }
    bs.buffer = nil // reset so the submitter can be reused

    l.Info().Msgf("BulkSubmitter: Inserted %d listens (%d duplicates skipped)", inserted, int64(len(batch))-inserted)
    return int(inserted), nil
}

2. TOCTOU Safety for Parallel Entity Creation

The worker pool creates entities via SubmitListen(SkipSaveListen=true). Two workers could race on the same artist name. The existing code uses a get-then-save pattern. Mitigations:

  • Pre-dedup in Phase A ensures each unique tuple is processed by exactly one goroutine — no TOCTOU within the worker pool
  • The only remaining race is between the import workers and live scrobbles from multi-scrobbler hitting the same SubmitListen path. This is already handled by the DB's unique constraints + ON CONFLICT clauses on join tables.
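The one-goroutine-per-unique-key property that removes TOCTOU within the pool can be exercised in isolation. This self-contained sketch (names are illustrative, no DB involved) follows the same shape as BulkSubmitter's Phase B — a semaphore-bounded worker pool writing results under a shared mutex:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// resolveUnique runs one worker per unique key, at most `workers` at a time,
// and records each key's result under a shared mutex.
func resolveUnique(unique map[string]int, workers int) (map[string]int, int64) {
	resolved := make(map[string]int)
	var mu sync.Mutex
	var wg sync.WaitGroup
	var creations int64
	sem := make(chan struct{}, workers)

	for key, val := range unique {
		wg.Add(1)
		sem <- struct{}{} // acquire a worker slot
		go func(k string, v int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot

			atomic.AddInt64(&creations, 1) // stand-in for entity creation
			mu.Lock()
			resolved[k] = v
			mu.Unlock()
		}(key, val)
	}
	wg.Wait()
	return resolved, creations
}

func main() {
	unique := map[string]int{"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}
	resolved, creations := resolveUnique(unique, 4)
	// Each unique key is created exactly once, regardless of worker count.
	fmt.Println(len(resolved), creations) // 5 5
}
```

Because Phase A has already collapsed the buffer to unique tuples, no two workers ever race on the same key; only cross-process races with live scrobbles remain, and those fall back on the DB's unique constraints.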

Success Criteria

  • go build ./... compiles
  • go test ./... passes
  • New test: TestBulkSubmitter_BasicImport — buffer 100 scrobbles, flush, verify all imported
  • New test: TestBulkSubmitter_Dedup — buffer 100 scrobbles with 10 unique combos, verify 10 entity creations
  • New test: TestBulkSubmitter_CacheHit — pre-populate track_lookup, verify no SubmitListen calls
  • New test: TestBulkSubmitter_PartialFailure — one entity creation fails, rest still imported

Phase 4: Migrate All Importers

Overview

Wire all 5 importers to use BulkSubmitter instead of calling SubmitListen directly in a loop.

Changes Required

1. Maloja Importer

File: internal/importer/maloja.go

Replace the per-item catalog.SubmitListen loop with:

func ImportMalojaFile(ctx context.Context, store db.DB, mbzc mbz.MusicBrainzCaller, filename string) error {
    l := logger.FromContext(ctx)
    // ... file reading and JSON parsing (unchanged) ...

    bs := NewBulkSubmitter(ctx, BulkSubmitterOpts{
        Store: store,
        Mbzc:  mbzc,
        OnProgress: func(imported, total int) {
            l.Info().Msgf("Maloja import progress: %d/%d", imported, total)
        },
    })

    for _, item := range items {
        // ... existing artist parsing, time window check (unchanged) ...

        bs.Accept(catalog.SubmitListenOpts{
            MbzCaller:      mbzc,
            Artist:         item.Track.Artists[0],
            ArtistNames:    artists,
            TrackTitle:     item.Track.Title,
            ReleaseTitle:   releaseTitle,
            Time:           ts.Local(),
            Client:         "maloja",
            UserID:         1,
            SkipCacheImage: true,
        })
    }

    count, err := bs.Flush()
    if err != nil {
        return fmt.Errorf("ImportMalojaFile: %w", err)
    }
    return finishImport(ctx, filename, count)
}

2. Spotify Importer

File: internal/importer/spotify.go — Same pattern: Accept into BulkSubmitter, Flush at end.

3. LastFM Importer

File: internal/importer/lastfm.go — Same pattern. Note: LastFM scrobbles include MBZ IDs, which will pass through to SubmitListen on cache miss for proper entity resolution.

4. ListenBrainz Importer

File: internal/importer/listenbrainz.go — Same pattern. ListenBrainz data is the richest (full MBZ IDs, MBID mappings) — cache hits will be common after first import.

5. Koito Importer

File: internal/importer/koito.go — This one currently bypasses SubmitListen with direct DB calls. Two options:

  • Option A: Migrate to BulkSubmitter (consistent, benefits from cache)
  • Option B: Leave as-is (it's already fast, Koito exports have pre-resolved IDs)

Recommend Option A for consistency, with the Koito importer becoming the simplest BulkSubmitter user since its data is pre-resolved.

Success Criteria

  • go build ./... compiles
  • go test ./... passes — all existing import tests still pass
  • TestImportMaloja — 38 listens imported correctly
  • TestImportMaloja_NullAlbum — null album handled
  • TestImportMaloja_ApiFormat — list format works
  • TestImportSpotify — duration data preserved
  • TestImportLastFM — MBZ IDs resolved
  • TestImportListenBrainz — MBID mappings applied
  • TestImportKoito — aliases preserved
  • Manual: 49k Maloja import on vo-pc completes in under 30 minutes

Phase 5: Skip Image Lookups During Import

Overview

Short-circuit GetArtistImage and GetAlbumImage calls when SkipCacheImage is true. Currently these functions still make HTTP calls (or call providers that return "no providers enabled") even when the result won't be used. The existing background tasks (FetchMissingArtistImages, FetchMissingAlbumImages) will backfill images after import.

Changes Required

1. Early Return in Associate Functions

File: internal/catalog/associate_artists.go

In resolveAliasOrCreateArtist (line ~248) and matchArtistsByNames (line ~304):

// Before:
imgUrl, err := images.GetArtistImage(ctx, images.ArtistImageOpts{...})
if err == nil && imgUrl != "" {
    imgid = uuid.New()
    if !opts.SkipCacheImage {
        // download image
    }
}

// After:
var imgUrl string
if !opts.SkipCacheImage {
    var err error
    imgUrl, err = images.GetArtistImage(ctx, images.ArtistImageOpts{...})
    if err == nil && imgUrl != "" {
        imgid = uuid.New()
        // download image
    }
}

File: internal/catalog/associate_album.go

Same pattern in createOrUpdateAlbumWithMbzReleaseID (line ~125) and matchAlbumByTitle (line ~220).

Success Criteria

  • go build ./... compiles
  • go test ./... passes
  • No GetArtistImage/GetAlbumImage calls during import (verify via log: no "No image providers" warnings)
  • Background tasks still fetch images after import completes
  • Live scrobbles (SkipCacheImage=false) still fetch images normally

Performance Estimates

| Scenario | Current | After Phase 1 | After All Phases |
|---|---|---|---|
| Repeated live scrobble | ~19 queries | 2 queries (cache hit) | 2 queries |
| New live scrobble | ~19 queries | ~19 queries + 1 cache write | ~19 queries + 1 cache write |
| 49k Maloja import | ~24 hours | ~12 hours (cache helps repeats) | ~15-30 minutes |
| 49k import (second run) | ~24 hours | ~20 minutes (all cache hits) | ~5 minutes |

Implementation Order

  1. Phase 1 (track_lookup) — standalone, immediate benefit for all scrobbles
  2. Phase 5 (skip image lookups) — standalone, no dependencies, quick win
  3. Phase 2 (SaveListensBatch) — DB layer, needed by Phase 3
  4. Phase 3 (BulkSubmitter) — the core, depends on Phase 1 + 2
  5. Phase 4 (migrate importers) — depends on Phase 3

Phases 1 and 5 can be done first as independent PRs. Phases 2-4 are one PR.

References