navidrome

mirror of https://github.com/navidrome/navidrome.git synced 2026-06-02 07:01:36 +00:00

History

Deluan Quintão 945d0ba1e2

fix(transcoding): cap concurrent transcodes to prevent ffmpeg DoS (#5522 )

* feat(transcoding): add MaxConcurrent and MaxConcurrentPerUser config

Introduce Transcoding.MaxConcurrent (default NumCPU()*2) and
Transcoding.MaxConcurrentPerUser (default 3) to support upcoming
concurrency limits on the streaming pipeline. No behavior change yet.

Refs #5246

* feat(transcoding): add TranscodeLimiter with global and per-user caps

Introduce a non-blocking limiter that gates concurrent transcodes. Returns
ErrTooManyTranscodes immediately when the cap is reached so callers can
translate it into a 429 response, rather than queuing requests.

The per-user reservation is taken first to avoid burning a global slot
that would only be rolled back when the per-user cap rejects the caller.
Release is idempotent so wrapping the transcoder reader's Close is safe.

Refs #5246

* feat(transcoding): cap concurrent transcodes in media streamer

Acquire a TranscodeLimiter slot before spawning ffmpeg in the transcoding
cache's read function, and release it when the resulting reader is closed.
Raw streams and cache hits bypass the limiter so a single saturating client
cannot block ordinary playback.

When the cap is reached, ErrTooManyTranscodes bubbles up through cache.Get,
ready for the HTTP layer to translate into a 429 response.

Refs #5246

* feat(transcoding): return HTTP 429 with Retry-After when transcode cap is hit

Map stream.ErrTooManyTranscodes to HTTP 429 in both the Subsonic API
(/stream, /download) and the public share endpoint, including a 5s
Retry-After hint. The Subsonic response still carries a failed-status
envelope so clients that ignore HTTP codes also see the failure.

Refs #5246

* feat(transcoding): default MaxConcurrent to 0 (disabled)

Ship the limiter opt-in so existing installations are not affected by a
behavior change on upgrade. Users hitting the DoS reported in #5246 can
enable it by setting Transcoding.MaxConcurrent to a positive value
(NumCPU()*2 is a reasonable starting point).

Refs #5246

* fix(transcoding): make global and per-user caps independent

Previously the limiter short-circuited to a no-op whenever MaxConcurrent
was zero, silently ignoring a configured MaxConcurrentPerUser. Treat each
cap independently so an operator can throttle per-user without enforcing
a global ceiling (or vice versa), and only fall back to the no-op limiter
when both caps are disabled.

* fix(archiver): abort archive download when the transcode limiter rejects

The album/artist/playlist zip writers were silently producing zip entries
with headers but no data when ms.NewStream returned ErrTooManyTranscodes,
because the per-file error was discarded by `_ = a.addFileToZip(...)`.
The client received HTTP 200 with a corrupt zip and no indication that
the server was rate-limited.

Now the zip loop bails out as soon as it sees ErrTooManyTranscodes, and
the Download handler swallows the error (the response status and
Content-Disposition are already flushed by the time the limit is hit, so
no 429 can be sent). The truncated zip surfaces the problem to the
client; operators see a clear "transcode cap reached" warning in the
server logs.

Refs #5246

* fix(transcoding): release limiter slot on client close, not ffmpeg EOF

Previously the slot was wrapped around the ffmpeg source reader, so it
was only released by the cache's background copyAndClose goroutine when
ffmpeg finished producing the file — meaning a client that disconnected
after a single byte still held the slot for the full transcode duration.
Under MaxConcurrent=N this serialized fresh requests behind abandoned
encodes for minutes.

Hand the release function back from the cache producer via the streamJob
struct and wire it into the consumer-side Stream.Close. The HTTP handler
already runs `defer stream.Close()`, so disconnect now frees the slot
immediately. Cache hits never enter the producer and still pay no slot,
and singleflight waiters on the same key correctly inherit no release
(only the original producer's job holds the slot).

Refs #5246

* fix(transcoding): skip per-user cap for anonymous requests

Public share viewers have no user in context, so userName(ctx) returned
the literal string "UNKNOWN" and the limiter mapped every anonymous
viewer to the same bucket. With MaxConcurrentPerUser=N, only N
unrelated anonymous clients could stream a viral share at any time —
the opposite of the fairness the per-user cap is meant to provide.

Introduce a limiterKey(ctx) helper that returns "" for anonymous
callers (userName(ctx) is unchanged for logs), and teach Acquire to
skip the per-user reservation when the key is empty. The global cap is
still enforced for anonymous traffic and remains the protection against
runaway anonymous load.

Refs #5246

* refactor(transcoding): tidy limiter struct and centralize Retry-After

Per review feedback:

- Drop the redundant maxConcurrent field on transcodeLimiter; the channel
  capacity already enforces the global cap and the field was only used
  inside the constructor.
- Only allocate the perUser map when MaxConcurrentPerUser > 0.
- Move the Retry-After value into core/stream as RetryAfterSeconds so the
  Subsonic API and public-share handlers cannot drift if the window is
  later tuned.

* fix(transcoding): do not log limiter rejections as cache failures

NewStream was emitting an error-level "Error accessing transcoding cache"
log whenever cache.Get returned anything non-nil, including the limiter's
ErrTooManyTranscodes — even though the producer had already logged the
rejection at warn level. The result was double logging and a misleading
"cache failure" classification that buries real cache problems.

Skip the error log when the cause is ErrTooManyTranscodes; the warn line
from the producer is the canonical signal.

* fix(archiver): open stream before writing zip entry header

Per review: addFileToZip previously called z.CreateHeader before
NewStream, so when the limiter rejected a transcode the zip already
contained a 0-byte entry for that track. Open the source first and only
write the header once the read side is ready; rejections now skip the
entry entirely.

The truncation comment in handleArchiveErr was also misleading — z.Close
finalises the central directory, so the client receives a well-formed
zip containing only the tracks written before the rejection, not a
"truncated" archive. Reword to match reality.

* fix(transcoding): hold slot for ffmpeg lifetime, force cancellable ctx

The previous release-on-consumer-close design let a client open many
unique transcodes, disconnect immediately, and still spawn the
configured cap's worth of ffmpeg processes — the cache writer goroutine
continued draining ffmpeg to disk after the client disappeared, defeating
the DoS protection the limiter is meant to provide.

Move the release back onto the source reader so the slot is freed only
when ffmpeg actually exits (either EOF or context cancellation). To keep
disconnects from leaking slots for the full transcode duration, force
the request context into ffmpeg whenever the limiter is enabled — so
client disconnect cancels the process and frees the slot promptly.

When the limiter is disabled, the legacy EnableTranscodingCancellation
behavior is preserved unchanged.

Reported by codex and Copilot reviewers on #5522.