clarify docs

config validator and docs
publish live transcriptions on their own topic instead of tracked_object_update
2026-05-03 06:50:58 +00:00 · 2025-05-27 10:15:53 -05:00 · 2025-05-27 10:02:32 -05:00 · 2025-05-27 09:50:17 -05:00 · 2025-05-27 08:52:46 -05:00
9 changed files with 57 additions and 65 deletions
--- a/docs/docs/configuration/audio_detectors.md
+++ b/docs/docs/configuration/audio_detectors.md
@@ -75,14 +75,16 @@ audio:

 ### Audio Transcription

-Frigate supports fully local text transcription using `sherpa-onnx` and OpenAI's fully local, open source Whisper models (using `faster-whisper`). Enable audio transcription features at the global level in your config:
+Frigate supports fully local audio transcription using either `sherpa-onnx` or OpenAI’s open-source Whisper models via `faster-whisper`. To enable transcription, it is recommended to only configure the features at the global level, and enable it at the individual camera level.

 ```yaml
 audio_transcription:
-  enabled: True
+  enabled: False
+  device: ...
+  model_size: ...
 ```

-Audio transcription can also be enabled for select cameras only at the camera level:
+Enable audio transcription for select cameras at the camera level:

 ```yaml
 cameras:
@@ -98,8 +100,11 @@ Audio detection must be enabled and configured as described above in order to us

 :::

-Optional config parameters that can be set at the global level include:
+The optional config parameters that can be set at the global level include:

+- **`enabled`**: Enable or disable the audio transcription feature.
+  - Default: `False`
+  - It is recommended to only configure the features at the global level, and enable it at the individual camera level.
 - **`device`**: Device to use to run transcription and translation models.
  - Default: `CPU`
  - This can be `CPU` or `GPU`. The `sherpa-onnx` models are lightweight and run on the CPU only. The `whisper` models can run on GPU but are only supported on CUDA hardware.
@@ -114,9 +119,11 @@ Optional config parameters that can be set at the global level include:
  - Transcriptions for `speech` events are translated.
  - Live audio is translated only if you are using the `large` model. The `small` `sherpa-onnx` model is English-only.

+The only field that is valid at the camera level is `enabled`.
+
 #### Live transcription

-The single camera Live view in the Frigate UI supports live transcription of audio for streams defined with the `audio` role.
+The single camera Live view in the Frigate UI supports live transcription of audio for streams defined with the `audio` role. Use the Enable/Disable Live Audio Transcription button/switch to toggle transcription processing. When speech is heard, the UI will display a black box over the top of the camera stream with text. The MQTT topic `frigate/<camera_name>/audio/transcription` will also be updated in real-time with transcribed text.

 Results can be error-prone due to a number of factors, including:

@@ -128,7 +135,7 @@ Results can be error-prone due to a number of factors, including:

 For speech sources close to the camera with minimal background noise, use the `small` model.

-If you have CUDA hardware, you can experiment with the `large` `whisper` model on GPU. Performance is not quite as fast as the `sherpa-onnx` `small` model, but live transcription is far more accurate.
+If you have CUDA hardware, you can experiment with the `large` `whisper` model on GPU. Performance is not quite as fast as the `sherpa-onnx` `small` model, but live transcription is far more accurate. Using the `large` model with CPU will likely be too slow for real-time transcription.

 #### Transcription and translation of `speech` audio events

--- a/docs/docs/integrations/mqtt.md
+++ b/docs/docs/integrations/mqtt.md
@@ -143,16 +143,6 @@ Message published for updates to tracked object metadata, for example:
 }
 ```

-#### Live Audio Transcription Update
-
-```json
-{
-  "type": "transcription",
-  "text": "Hello Johnny, are you home?",
-  "camera": "doorbell"
-}
-```
-
 ### `frigate/reviews`

 Message published for each changed review item. The first message is published when the `detection` or `alert` is initiated. When additional objects are detected or when a zone change occurs, it will publish a, `update` message with the same id. When the review activity has ended a final `end` message is published.
@@ -265,6 +255,12 @@ Publishes the rms value for audio detected on this camera.

 **NOTE:** Requires audio detection to be enabled

+### `frigate/<camera_name>/audio/transcription`
+
+Publishes transcribed text for audio detected on this camera.
+
+**NOTE:** Requires audio detection and transcription to be enabled
+
 ### `frigate/<camera_name>/enabled/set`

 Topic to turn Frigate's processing of a camera on and off. Expected values are `ON` and `OFF`.
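
Reviewer note: for consumers of the new `frigate/<camera_name>/audio/transcription` topic, the sketch below shows one way a client might route and hold incoming payloads. It is not part of this change set. `camera_from_topic` and `TranscriptBuffer` are hypothetical names, the `frigate` prefix is assumed to be the configured topic prefix, and treating an empty payload as "cleared" is an inference from the empty string that `frigate/events/audio.py` publishes when the processor is reset.

```python
from typing import Dict, Optional

TOPIC_PREFIX = "frigate"  # assumption: the MQTT topic prefix in use


def camera_from_topic(topic: str, prefix: str = TOPIC_PREFIX) -> Optional[str]:
    """Return the camera name for <prefix>/<camera>/audio/transcription, else None."""
    parts = topic.split("/")
    if (
        len(parts) == 4
        and parts[0] == prefix
        and parts[2:] == ["audio", "transcription"]
    ):
        return parts[1]
    return None


class TranscriptBuffer:
    """Hypothetical helper that keeps the most recent payload per camera."""

    def __init__(self) -> None:
        self.latest: Dict[str, str] = {}

    def on_message(self, topic: str, payload: str) -> None:
        camera = camera_from_topic(topic)
        if camera is None:
            return  # some other topic; ignore it
        if payload == "":
            # Assumed meaning: the transcription was reset for this camera.
            self.latest.pop(camera, None)
        else:
            self.latest[camera] = payload
```

The `on_message` method takes an already-decoded topic and payload, so it can be called from whichever MQTT client callback is in use.
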
--- a/frigate/config/config.py
+++ b/frigate/config/config.py
@@ -710,6 +710,21 @@ class FrigateConfig(FrigateBaseModel):
        self.model.create_colormap(sorted(self.objects.all_objects))
        self.model.check_and_load_plus_model(self.plus_api)

+        # Check audio transcription and audio detection requirements
+        if self.audio_transcription.enabled:
+            # If audio transcription is enabled globally, at least one camera must have audio detection enabled
+            if not any(camera.audio.enabled for camera in self.cameras.values()):
+                raise ValueError(
+                    "Audio transcription is enabled globally, but no cameras have audio detection enabled. At least one camera must have audio detection enabled."
+                )
+        else:
+            # If audio transcription is disabled globally, check each camera with audio_transcription enabled
+            for camera in self.cameras.values():
+                if camera.audio_transcription.enabled and not camera.audio.enabled:
+                    raise ValueError(
+                        f"Camera {camera.name} has audio transcription enabled, but audio detection is not enabled for this camera. Audio detection must be enabled for cameras with audio transcription when it is disabled globally."
+                    )
+
        if self.plus_api and not self.snapshots.clean_copy:
            logger.warning(
                "Frigate+ is configured but clean snapshots are not enabled, submissions to Frigate+ will not be possible./"
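
Reviewer note: the validation rule added in the hunk above can be exercised in isolation with the sketch below. It is not part of this change set. `Camera` and `check_audio_transcription` are hypothetical stand-ins for Frigate's config models, and the error messages are shortened. The branching mirrors the diff: with transcription enabled globally, one camera with audio detection is enough to pass; with it disabled globally, each camera that enables transcription must also enable audio detection.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Camera:
    """Hypothetical stand-in for a camera config entry."""

    name: str
    audio_enabled: bool = False
    transcription_enabled: bool = False


def check_audio_transcription(global_enabled: bool, cameras: Dict[str, Camera]) -> None:
    """Raise ValueError when transcription is enabled without audio detection."""
    if global_enabled:
        # Global case: at least one camera must have audio detection enabled.
        if not any(c.audio_enabled for c in cameras.values()):
            raise ValueError(
                "Audio transcription is enabled globally, "
                "but no cameras have audio detection enabled."
            )
    else:
        # Per-camera case: check only cameras that opt in to transcription.
        for c in cameras.values():
            if c.transcription_enabled and not c.audio_enabled:
                raise ValueError(
                    f"Camera {c.name} has audio transcription enabled, "
                    "but audio detection is not enabled for this camera."
                )
```

Note that in the global branch a camera with transcription enabled and audio detection disabled does not raise on its own, as long as some other camera has audio detection enabled.
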
--- a/frigate/data_processing/real_time/audio_transcription.py
+++ b/frigate/data_processing/real_time/audio_transcription.py
@@ -1,6 +1,5 @@
 """Handle processing audio for speech transcription using sherpa-onnx with FFmpeg pipe."""

-import json
 import logging
 import os
 import queue
@@ -13,7 +12,6 @@ import sherpa_onnx
 from frigate.comms.inter_process import InterProcessRequestor
 from frigate.config import CameraConfig, FrigateConfig
 from frigate.const import MODEL_CACHE_DIR
-from frigate.types import TrackedObjectUpdateTypesEnum
 from frigate.util.downloader import ModelDownloader

 from ..types import DataProcessorMetrics
@@ -44,7 +42,7 @@ class AudioTranscriptionRealTimeProcessor(RealTimeProcessorApi):

        if self.config.audio_transcription.model_size == "large":
            self.asr = FasterWhisperASR(
-                modelsize="tiny",  # could use 'base' for CPU, switch to 'small' or 'large-v2' for GPU
+                modelsize="tiny",
                device="cuda"
                if self.config.audio_transcription.device == "GPU"
                else "cpu",
@@ -205,14 +203,7 @@ class AudioTranscriptionRealTimeProcessor(RealTimeProcessorApi):
                logger.debug(f"Transcribed audio: '{text}', Endpoint: {is_endpoint}")

                self.requestor.send_data(
-                    "tracked_object_update",
-                    json.dumps(
-                        {
-                            "type": TrackedObjectUpdateTypesEnum.transcription,
-                            "text": text,
-                            "camera": obj_data["camera"],
-                        }
-                    ),
+                    f"{self.camera_config.name}/audio/transcription", text
                )

                self.audio_queue.task_done()
@@ -237,14 +228,8 @@ class AudioTranscriptionRealTimeProcessor(RealTimeProcessorApi):
            self.transcription_segments = []

            self.requestor.send_data(
-                "tracked_object_update",
-                json.dumps(
-                    {
-                        "type": TrackedObjectUpdateTypesEnum.transcription,
-                        "text": (output[2].strip()),
-                        "camera": camera,
-                    }
-                ),
+                f"{self.camera_config.name}/audio/transcription",
+                (output[2].strip() + " "),
            )

            # reset whisper
--- a/frigate/embeddings/maintainer.py
+++ b/frigate/embeddings/maintainer.py
@@ -179,12 +179,10 @@ class EmbeddingMaintainer(threading.Thread):
                )
            )

-        audio_transcription_cameras = [
-            c
+        if any(
+            c.enabled_in_config and c.audio_transcription.enabled
            for c in self.config.cameras.values()
-            if c.enabled_in_config and c.audio_transcription.enabled
-        ]
-        if audio_transcription_cameras:
+        ):
            self.post_processors.append(
                AudioTranscriptionPostProcessor(self.config, self.requestor, metrics)
            )
--- a/frigate/events/audio.py
+++ b/frigate/events/audio.py
@@ -1,7 +1,6 @@
 """Handle creating audio events."""

 import datetime
-import json
 import logging
 import random
 import string
@@ -37,7 +36,6 @@ from frigate.data_processing.real_time.audio_transcription import (
 from frigate.ffmpeg_presets import parse_preset_input
 from frigate.log import LogPipe
 from frigate.object_detection.base import load_labels
-from frigate.types import TrackedObjectUpdateTypesEnum
 from frigate.util.builtin import get_ffmpeg_arg_list
 from frigate.video import start_or_restart_ffmpeg, stop_ffmpeg

@@ -226,7 +224,6 @@ class AudioEventMaintainer(threading.Thread):

        # run audio transcription
        if self.transcription_processor is not None and (
-            # rms >= self.camera_config.audio.min_volume or self.is_endpoint is False
            self.camera_config.audio_transcription.live_enabled
        ):
            self.transcribing = True
@@ -316,14 +313,7 @@ class AudioEventMaintainer(threading.Thread):
                if self.transcription_processor is not None:
                    self.transcription_processor.reset(self.camera_config.name)
                    self.requestor.send_data(
-                        "tracked_object_update",
-                        json.dumps(
-                            {
-                                "type": TrackedObjectUpdateTypesEnum.transcription,
-                                "text": "",
-                                "camera": self.camera_config.name,
-                            }
-                        ),
+                        f"{self.camera_config.name}/audio/transcription", ""
                    )

    def expire_all_detections(self) -> None:
--- a/frigate/types.py
+++ b/frigate/types.py
@@ -27,4 +27,3 @@ class TrackedObjectUpdateTypesEnum(str, Enum):
    description = "description"
    face = "face"
    lpr = "lpr"
-    transcription = "transcription"
--- a/web/src/api/ws.tsx
+++ b/web/src/api/ws.tsx
@@ -440,6 +440,15 @@ export function useAudioActivity(camera: string): { payload: number } {
  return { payload: payload as number };
 }

+export function useAudioLiveTranscription(camera: string): {
+  payload: string;
+} {
+  const {
+    value: { payload },
+  } = useWs(`${camera}/audio/transcription`, "");
+  return { payload: payload as string };
+}
+
 export function useMotionThreshold(camera: string): {
  payload: string;
  send: (payload: number, retain?: boolean) => void;
--- a/web/src/views/live/LiveCameraView.tsx
+++ b/web/src/views/live/LiveCameraView.tsx
@@ -1,4 +1,5 @@
 import {
+  useAudioLiveTranscription,
  useAudioState,
  useAudioTranscriptionState,
  useAutotrackingState,
@@ -7,7 +8,6 @@ import {
  usePtzCommand,
  useRecordingsState,
  useSnapshotsState,
-  useTrackedObjectUpdate,
 } from "@/api/ws";
 import CameraFeatureToggle from "@/components/dynamic/CameraFeatureToggle";
 import FilterSwitch from "@/components/filter/FilterSwitch";
@@ -204,21 +204,17 @@ export default function LiveCameraView({

  const { payload: audioTranscriptionState, send: sendTranscription } =
    useAudioTranscriptionState(camera.name);
-  const { payload: wsUpdate } = useTrackedObjectUpdate();
+  const { payload: transcription } = useAudioLiveTranscription(camera.name);
  const transcriptionRef = useRef<HTMLDivElement>(null);

  useEffect(() => {
-    if (
-      wsUpdate &&
-      wsUpdate.type == "transcription" &&
-      wsUpdate.camera == camera.name
-    ) {
+    if (transcription) {
      if (transcriptionRef.current) {
        transcriptionRef.current.scrollTop =
          transcriptionRef.current.scrollHeight;
      }
    }
-  }, [wsUpdate, camera.name]);
+  }, [transcription]);

  useEffect(() => {
    return () => {
@@ -661,15 +657,12 @@ export default function LiveCameraView({
          </TransformComponent>
          {camera?.audio?.enabled_in_config &&
            audioTranscriptionState == "ON" &&
-            wsUpdate &&
-            wsUpdate.type === "transcription" &&
-            wsUpdate.camera === camera.name &&
-            wsUpdate.text !== "" && (
+            transcription != null && (
              <div
                ref={transcriptionRef}
                className="text-md scrollbar-container absolute bottom-4 left-1/2 max-h-[15vh] w-[75%] -translate-x-1/2 overflow-y-auto rounded-lg bg-black/70 p-2 text-white md:w-[50%]"
              >
-                {wsUpdate.text}
+                {transcription}
              </div>
            )}
        </div>
Author	SHA1	Message	Date
Josh Hawkins	7a1d5e018b	clarify docs	2025-05-27 10:15:53 -05:00
Josh Hawkins	59a7c79b88	config validator and docs	2025-05-27 10:02:32 -05:00
Josh Hawkins	8e56b132f1	publish live transcriptions on their own topic instead of tracked_object_update	2025-05-27 09:50:17 -05:00
Josh Hawkins	772190869f	tweaks	2025-05-27 08:52:46 -05:00