mirror of
https://github.com/blakeblackshear/frigate.git
synced 2026-05-03 06:50:58 +00:00
Compare commits
4 Commits
27bfc81a20
...
7a1d5e018b
| Author | SHA1 | Date |
|---|---|---|
| | 7a1d5e018b | |
| | 59a7c79b88 | |
| | 8e56b132f1 | |
| | 772190869f | |
@ -75,14 +75,16 @@ audio:
|
|||||||
|
|
||||||
### Audio Transcription
|
### Audio Transcription
|
||||||
|
|
||||||
Frigate supports fully local text transcription using `sherpa-onnx` and OpenAI's fully local, open source Whisper models (using `faster-whisper`). Enable audio transcription features at the global level in your config:
|
Frigate supports fully local audio transcription using either `sherpa-onnx` or OpenAI’s open-source Whisper models via `faster-whisper`. To enable transcription, it is recommended to only configure the features at the global level, and enable it at the individual camera level.
|
||||||
|
|
||||||
```yaml
|
```yaml
|
||||||
audio_transcription:
|
audio_transcription:
|
||||||
enabled: True
|
enabled: False
|
||||||
|
device: ...
|
||||||
|
model_size: ...
|
||||||
```
|
```
|
||||||
|
|
||||||
Audio transcription can also be enabled for select cameras only at the camera level:
|
Enable audio transcription for select cameras at the camera level:
|
||||||
|
|
||||||
```yaml
|
```yaml
|
||||||
cameras:
|
cameras:
|
||||||
@ -98,8 +100,11 @@ Audio detection must be enabled and configured as described above in order to us
|
|||||||
|
|
||||||
:::
|
:::
|
||||||
|
|
||||||
Optional config parameters that can be set at the global level include:
|
The optional config parameters that can be set at the global level include:
|
||||||
|
|
||||||
|
- **`enabled`**: Enable or disable the audio transcription feature.
|
||||||
|
- Default: `False`
|
||||||
|
- It is recommended to only configure the features at the global level, and enable it at the individual camera level.
|
||||||
- **`device`**: Device to use to run transcription and translation models.
|
- **`device`**: Device to use to run transcription and translation models.
|
||||||
- Default: `CPU`
|
- Default: `CPU`
|
||||||
- This can be `CPU` or `GPU`. The `sherpa-onnx` models are lightweight and run on the CPU only. The `whisper` models can run on GPU but are only supported on CUDA hardware.
|
- This can be `CPU` or `GPU`. The `sherpa-onnx` models are lightweight and run on the CPU only. The `whisper` models can run on GPU but are only supported on CUDA hardware.
|
||||||
@ -114,9 +119,11 @@ Optional config parameters that can be set at the global level include:
|
|||||||
- Transcriptions for `speech` events are translated.
|
- Transcriptions for `speech` events are translated.
|
||||||
- Live audio is translated only if you are using the `large` model. The `small` `sherpa-onnx` model is English-only.
|
- Live audio is translated only if you are using the `large` model. The `small` `sherpa-onnx` model is English-only.
|
||||||
|
|
||||||
|
The only field that is valid at the camera level is `enabled`.
|
||||||
|
|
||||||
#### Live transcription
|
#### Live transcription
|
||||||
|
|
||||||
The single camera Live view in the Frigate UI supports live transcription of audio for streams defined with the `audio` role.
|
The single camera Live view in the Frigate UI supports live transcription of audio for streams defined with the `audio` role. Use the Enable/Disable Live Audio Transcription button/switch to toggle transcription processing. When speech is heard, the UI will display a black box over the top of the camera stream with text. The MQTT topic `frigate/<camera_name>/audio/transcription` will also be updated in real-time with transcribed text.
|
||||||
|
|
||||||
Results can be error-prone due to a number of factors, including:
|
Results can be error-prone due to a number of factors, including:
|
||||||
|
|
||||||
@ -128,7 +135,7 @@ Results can be error-prone due to a number of factors, including:
|
|||||||
|
|
||||||
For speech sources close to the camera with minimal background noise, use the `small` model.
|
For speech sources close to the camera with minimal background noise, use the `small` model.
|
||||||
|
|
||||||
If you have CUDA hardware, you can experiment with the `large` `whisper` model on GPU. Performance is not quite as fast as the `sherpa-onnx` `small` model, but live transcription is far more accurate.
|
If you have CUDA hardware, you can experiment with the `large` `whisper` model on GPU. Performance is not quite as fast as the `sherpa-onnx` `small` model, but live transcription is far more accurate. Using the `large` model with CPU will likely be too slow for real-time transcription.
|
||||||
|
|
||||||
#### Transcription and translation of `speech` audio events
|
#### Transcription and translation of `speech` audio events
|
||||||
|
|
||||||
|
|||||||
@ -143,16 +143,6 @@ Message published for updates to tracked object metadata, for example:
|
|||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
#### Live Audio Transcription Update
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"type": "transcription",
|
|
||||||
"text": "Hello Johnny, are you home?",
|
|
||||||
"camera": "doorbell"
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
### `frigate/reviews`
|
### `frigate/reviews`
|
||||||
|
|
||||||
Message published for each changed review item. The first message is published when the `detection` or `alert` is initiated. When additional objects are detected or when a zone change occurs, an `update` message with the same id is published. When the review activity has ended, a final `end` message is published.
|
Message published for each changed review item. The first message is published when the `detection` or `alert` is initiated. When additional objects are detected or when a zone change occurs, an `update` message with the same id is published. When the review activity has ended, a final `end` message is published.
|
||||||
@ -265,6 +255,12 @@ Publishes the rms value for audio detected on this camera.
|
|||||||
|
|
||||||
**NOTE:** Requires audio detection to be enabled
|
**NOTE:** Requires audio detection to be enabled
|
||||||
|
|
||||||
|
### `frigate/<camera_name>/audio/transcription`
|
||||||
|
|
||||||
|
Publishes transcribed text for audio detected on this camera.
|
||||||
|
|
||||||
|
**NOTE:** Requires audio detection and transcription to be enabled
|
||||||
|
|
||||||
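For readers consuming the new per-camera topic, the sketch below shows one way a client might recognise `frigate/<camera_name>/audio/transcription` and decode its payload. It assumes the default `frigate` topic prefix and a plain UTF-8 text payload, which is what the Python hunks in this comparison appear to send. The helper names `camera_from_transcription_topic` and `decode_transcription` are hypothetical and not part of Frigate.

```python
import re
from typing import Optional

# Matches "<prefix>/<camera_name>/audio/transcription".
_TOPIC_RE = re.compile(r"^(?P<prefix>[^/]+)/(?P<camera>[^/]+)/audio/transcription$")


def camera_from_transcription_topic(topic: str, prefix: str = "frigate") -> Optional[str]:
    """Return the camera name for a transcription topic, or None if it is another topic.

    Hypothetical helper; the "frigate" prefix is an assumed default.
    """
    match = _TOPIC_RE.match(topic)
    if match is None or match.group("prefix") != prefix:
        return None
    return match.group("camera")


def decode_transcription(payload: bytes) -> str:
    """Decode the payload as plain text (assumed UTF-8, no JSON wrapper)."""
    return payload.decode("utf-8", errors="replace")
```

An MQTT client callback could call `camera_from_transcription_topic(msg.topic)` first and ignore the message when it returns `None`. An empty decoded string is worth handling explicitly, since the diff publishes `""` on this topic when the processor is reset.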
### `frigate/<camera_name>/enabled/set`
|
### `frigate/<camera_name>/enabled/set`
|
||||||
|
|
||||||
Topic to turn Frigate's processing of a camera on and off. Expected values are `ON` and `OFF`.
|
Topic to turn Frigate's processing of a camera on and off. Expected values are `ON` and `OFF`.
|
||||||
|
|||||||
@ -710,6 +710,21 @@ class FrigateConfig(FrigateBaseModel):
|
|||||||
self.model.create_colormap(sorted(self.objects.all_objects))
|
self.model.create_colormap(sorted(self.objects.all_objects))
|
||||||
self.model.check_and_load_plus_model(self.plus_api)
|
self.model.check_and_load_plus_model(self.plus_api)
|
||||||
|
|
||||||
|
# Check audio transcription and audio detection requirements
|
||||||
|
if self.audio_transcription.enabled:
|
||||||
|
# If audio transcription is enabled globally, at least one camera must have audio detection enabled
|
||||||
|
if not any(camera.audio.enabled for camera in self.cameras.values()):
|
||||||
|
raise ValueError(
|
||||||
|
"Audio transcription is enabled globally, but no cameras have audio detection enabled. At least one camera must have audio detection enabled."
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
# If audio transcription is disabled globally, check each camera with audio_transcription enabled
|
||||||
|
for camera in self.cameras.values():
|
||||||
|
if camera.audio_transcription.enabled and not camera.audio.enabled:
|
||||||
|
raise ValueError(
|
||||||
|
f"Camera {camera.name} has audio transcription enabled, but audio detection is not enabled for this camera. Audio detection must be enabled for cameras with audio transcription when it is disabled globally."
|
||||||
|
)
|
||||||
|
|
||||||
if self.plus_api and not self.snapshots.clean_copy:
|
if self.plus_api and not self.snapshots.clean_copy:
|
||||||
logger.warning(
|
logger.warning(
|
||||||
"Frigate+ is configured but clean snapshots are not enabled, submissions to Frigate+ will not be possible./"
|
"Frigate+ is configured but clean snapshots are not enabled, submissions to Frigate+ will not be possible./"
|
||||||
|
|||||||
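The validation added to `FrigateConfig` in the hunk above can be read as two rules: a global enable needs at least one camera with audio detection, and a per-camera enable needs audio detection on that same camera. The following is a minimal standalone sketch of those rules. `Toggle`, `Camera`, and `check_audio_transcription` are simplified, hypothetical stand-ins, not Frigate's actual config classes, and the error messages are shortened.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Toggle:
    """Stand-in for a config section that only carries an `enabled` flag."""
    enabled: bool = False


@dataclass
class Camera:
    """Hypothetical, simplified camera config (not Frigate's CameraConfig)."""
    name: str
    audio: Toggle = field(default_factory=Toggle)
    audio_transcription: Toggle = field(default_factory=Toggle)


def check_audio_transcription(global_enabled: bool, cameras: Dict[str, Camera]) -> None:
    """Raise ValueError when transcription is enabled without audio detection."""
    if global_enabled:
        # Global switch on: at least one camera must have audio detection.
        if not any(c.audio.enabled for c in cameras.values()):
            raise ValueError("no cameras have audio detection enabled")
    else:
        # Global switch off: check each camera that opts in individually.
        for c in cameras.values():
            if c.audio_transcription.enabled and not c.audio.enabled:
                raise ValueError(f"camera {c.name} needs audio detection enabled")
```

Note that, as in the diff, the per-camera check only runs in the `else` branch, so a camera that enables transcription without audio detection is not flagged when the global switch is on and some other camera has audio detection.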
@ -1,6 +1,5 @@
|
|||||||
"""Handle processing audio for speech transcription using sherpa-onnx with FFmpeg pipe."""
|
"""Handle processing audio for speech transcription using sherpa-onnx with FFmpeg pipe."""
|
||||||
|
|
||||||
import json
|
|
||||||
import logging
|
import logging
|
||||||
import os
|
import os
|
||||||
import queue
|
import queue
|
||||||
@ -13,7 +12,6 @@ import sherpa_onnx
|
|||||||
from frigate.comms.inter_process import InterProcessRequestor
|
from frigate.comms.inter_process import InterProcessRequestor
|
||||||
from frigate.config import CameraConfig, FrigateConfig
|
from frigate.config import CameraConfig, FrigateConfig
|
||||||
from frigate.const import MODEL_CACHE_DIR
|
from frigate.const import MODEL_CACHE_DIR
|
||||||
from frigate.types import TrackedObjectUpdateTypesEnum
|
|
||||||
from frigate.util.downloader import ModelDownloader
|
from frigate.util.downloader import ModelDownloader
|
||||||
|
|
||||||
from ..types import DataProcessorMetrics
|
from ..types import DataProcessorMetrics
|
||||||
@ -44,7 +42,7 @@ class AudioTranscriptionRealTimeProcessor(RealTimeProcessorApi):
|
|||||||
|
|
||||||
if self.config.audio_transcription.model_size == "large":
|
if self.config.audio_transcription.model_size == "large":
|
||||||
self.asr = FasterWhisperASR(
|
self.asr = FasterWhisperASR(
|
||||||
modelsize="tiny", # could use 'base' for CPU, switch to 'small' or 'large-v2' for GPU
|
modelsize="tiny",
|
||||||
device="cuda"
|
device="cuda"
|
||||||
if self.config.audio_transcription.device == "GPU"
|
if self.config.audio_transcription.device == "GPU"
|
||||||
else "cpu",
|
else "cpu",
|
||||||
@ -205,14 +203,7 @@ class AudioTranscriptionRealTimeProcessor(RealTimeProcessorApi):
|
|||||||
logger.debug(f"Transcribed audio: '{text}', Endpoint: {is_endpoint}")
|
logger.debug(f"Transcribed audio: '{text}', Endpoint: {is_endpoint}")
|
||||||
|
|
||||||
self.requestor.send_data(
|
self.requestor.send_data(
|
||||||
"tracked_object_update",
|
f"{self.camera_config.name}/audio/transcription", text
|
||||||
json.dumps(
|
|
||||||
{
|
|
||||||
"type": TrackedObjectUpdateTypesEnum.transcription,
|
|
||||||
"text": text,
|
|
||||||
"camera": obj_data["camera"],
|
|
||||||
}
|
|
||||||
),
|
|
||||||
)
|
)
|
||||||
|
|
||||||
self.audio_queue.task_done()
|
self.audio_queue.task_done()
|
||||||
@ -237,14 +228,8 @@ class AudioTranscriptionRealTimeProcessor(RealTimeProcessorApi):
|
|||||||
self.transcription_segments = []
|
self.transcription_segments = []
|
||||||
|
|
||||||
self.requestor.send_data(
|
self.requestor.send_data(
|
||||||
"tracked_object_update",
|
f"{self.camera_config.name}/audio/transcription",
|
||||||
json.dumps(
|
(output[2].strip() + " "),
|
||||||
{
|
|
||||||
"type": TrackedObjectUpdateTypesEnum.transcription,
|
|
||||||
"text": (output[2].strip()),
|
|
||||||
"camera": camera,
|
|
||||||
}
|
|
||||||
),
|
|
||||||
)
|
)
|
||||||
|
|
||||||
# reset whisper
|
# reset whisper
|
||||||
|
|||||||
@ -179,12 +179,10 @@ class EmbeddingMaintainer(threading.Thread):
|
|||||||
)
|
)
|
||||||
)
|
)
|
||||||
|
|
||||||
audio_transcription_cameras = [
|
if any(
|
||||||
c
|
c.enabled_in_config and c.audio_transcription.enabled
|
||||||
for c in self.config.cameras.values()
|
for c in self.config.cameras.values()
|
||||||
if c.enabled_in_config and c.audio_transcription.enabled
|
):
|
||||||
]
|
|
||||||
if audio_transcription_cameras:
|
|
||||||
self.post_processors.append(
|
self.post_processors.append(
|
||||||
AudioTranscriptionPostProcessor(self.config, self.requestor, metrics)
|
AudioTranscriptionPostProcessor(self.config, self.requestor, metrics)
|
||||||
)
|
)
|
||||||
|
|||||||
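The `EmbeddingMaintainer` hunk above replaces a temporary list with an `any(...)` generator expression. The sketch below illustrates that the two forms give the same answer, using `SimpleNamespace` objects as hypothetical stand-ins for camera configs; `has_transcription_camera` is an illustrative name, not a Frigate function.

```python
from types import SimpleNamespace


def has_transcription_camera(cameras) -> bool:
    """True if any camera is enabled in config and has transcription enabled."""
    # any() stops at the first match and builds no intermediate list.
    return any(c.enabled_in_config and c.audio_transcription.enabled for c in cameras)


def make_camera(enabled_in_config: bool, transcription: bool) -> SimpleNamespace:
    """Build a minimal stand-in camera object (assumed attribute names from the diff)."""
    return SimpleNamespace(
        enabled_in_config=enabled_in_config,
        audio_transcription=SimpleNamespace(enabled=transcription),
    )


cameras = [make_camera(True, False), make_camera(False, True), make_camera(True, True)]

# The older form: collect matches into a list, then test its truthiness.
matching = [c for c in cameras if c.enabled_in_config and c.audio_transcription.enabled]
same_result = bool(matching) == has_transcription_camera(cameras)
```

Since the list was only ever used for its truthiness, the generator form expresses the intent more directly.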
@ -1,7 +1,6 @@
|
|||||||
"""Handle creating audio events."""
|
"""Handle creating audio events."""
|
||||||
|
|
||||||
import datetime
|
import datetime
|
||||||
import json
|
|
||||||
import logging
|
import logging
|
||||||
import random
|
import random
|
||||||
import string
|
import string
|
||||||
@ -37,7 +36,6 @@ from frigate.data_processing.real_time.audio_transcription import (
|
|||||||
from frigate.ffmpeg_presets import parse_preset_input
|
from frigate.ffmpeg_presets import parse_preset_input
|
||||||
from frigate.log import LogPipe
|
from frigate.log import LogPipe
|
||||||
from frigate.object_detection.base import load_labels
|
from frigate.object_detection.base import load_labels
|
||||||
from frigate.types import TrackedObjectUpdateTypesEnum
|
|
||||||
from frigate.util.builtin import get_ffmpeg_arg_list
|
from frigate.util.builtin import get_ffmpeg_arg_list
|
||||||
from frigate.video import start_or_restart_ffmpeg, stop_ffmpeg
|
from frigate.video import start_or_restart_ffmpeg, stop_ffmpeg
|
||||||
|
|
||||||
@ -226,7 +224,6 @@ class AudioEventMaintainer(threading.Thread):
|
|||||||
|
|
||||||
# run audio transcription
|
# run audio transcription
|
||||||
if self.transcription_processor is not None and (
|
if self.transcription_processor is not None and (
|
||||||
# rms >= self.camera_config.audio.min_volume or self.is_endpoint is False
|
|
||||||
self.camera_config.audio_transcription.live_enabled
|
self.camera_config.audio_transcription.live_enabled
|
||||||
):
|
):
|
||||||
self.transcribing = True
|
self.transcribing = True
|
||||||
@ -316,14 +313,7 @@ class AudioEventMaintainer(threading.Thread):
|
|||||||
if self.transcription_processor is not None:
|
if self.transcription_processor is not None:
|
||||||
self.transcription_processor.reset(self.camera_config.name)
|
self.transcription_processor.reset(self.camera_config.name)
|
||||||
self.requestor.send_data(
|
self.requestor.send_data(
|
||||||
"tracked_object_update",
|
f"{self.camera_config.name}/audio/transcription", ""
|
||||||
json.dumps(
|
|
||||||
{
|
|
||||||
"type": TrackedObjectUpdateTypesEnum.transcription,
|
|
||||||
"text": "",
|
|
||||||
"camera": self.camera_config.name,
|
|
||||||
}
|
|
||||||
),
|
|
||||||
)
|
)
|
||||||
|
|
||||||
def expire_all_detections(self) -> None:
|
def expire_all_detections(self) -> None:
|
||||||
|
|||||||
@ -27,4 +27,3 @@ class TrackedObjectUpdateTypesEnum(str, Enum):
|
|||||||
description = "description"
|
description = "description"
|
||||||
face = "face"
|
face = "face"
|
||||||
lpr = "lpr"
|
lpr = "lpr"
|
||||||
transcription = "transcription"
|
|
||||||
|
|||||||
@ -440,6 +440,15 @@ export function useAudioActivity(camera: string): { payload: number } {
|
|||||||
return { payload: payload as number };
|
return { payload: payload as number };
|
||||||
}
|
}
|
||||||
|
|
||||||
|
export function useAudioLiveTranscription(camera: string): {
|
||||||
|
payload: string;
|
||||||
|
} {
|
||||||
|
const {
|
||||||
|
value: { payload },
|
||||||
|
} = useWs(`${camera}/audio/transcription`, "");
|
||||||
|
return { payload: payload as string };
|
||||||
|
}
|
||||||
|
|
||||||
export function useMotionThreshold(camera: string): {
|
export function useMotionThreshold(camera: string): {
|
||||||
payload: string;
|
payload: string;
|
||||||
send: (payload: number, retain?: boolean) => void;
|
send: (payload: number, retain?: boolean) => void;
|
||||||
|
|||||||
@ -1,4 +1,5 @@
|
|||||||
import {
|
import {
|
||||||
|
useAudioLiveTranscription,
|
||||||
useAudioState,
|
useAudioState,
|
||||||
useAudioTranscriptionState,
|
useAudioTranscriptionState,
|
||||||
useAutotrackingState,
|
useAutotrackingState,
|
||||||
@ -7,7 +8,6 @@ import {
|
|||||||
usePtzCommand,
|
usePtzCommand,
|
||||||
useRecordingsState,
|
useRecordingsState,
|
||||||
useSnapshotsState,
|
useSnapshotsState,
|
||||||
useTrackedObjectUpdate,
|
|
||||||
} from "@/api/ws";
|
} from "@/api/ws";
|
||||||
import CameraFeatureToggle from "@/components/dynamic/CameraFeatureToggle";
|
import CameraFeatureToggle from "@/components/dynamic/CameraFeatureToggle";
|
||||||
import FilterSwitch from "@/components/filter/FilterSwitch";
|
import FilterSwitch from "@/components/filter/FilterSwitch";
|
||||||
@ -204,21 +204,17 @@ export default function LiveCameraView({
|
|||||||
|
|
||||||
const { payload: audioTranscriptionState, send: sendTranscription } =
|
const { payload: audioTranscriptionState, send: sendTranscription } =
|
||||||
useAudioTranscriptionState(camera.name);
|
useAudioTranscriptionState(camera.name);
|
||||||
const { payload: wsUpdate } = useTrackedObjectUpdate();
|
const { payload: transcription } = useAudioLiveTranscription(camera.name);
|
||||||
const transcriptionRef = useRef<HTMLDivElement>(null);
|
const transcriptionRef = useRef<HTMLDivElement>(null);
|
||||||
|
|
||||||
useEffect(() => {
|
useEffect(() => {
|
||||||
if (
|
if (transcription) {
|
||||||
wsUpdate &&
|
|
||||||
wsUpdate.type == "transcription" &&
|
|
||||||
wsUpdate.camera == camera.name
|
|
||||||
) {
|
|
||||||
if (transcriptionRef.current) {
|
if (transcriptionRef.current) {
|
||||||
transcriptionRef.current.scrollTop =
|
transcriptionRef.current.scrollTop =
|
||||||
transcriptionRef.current.scrollHeight;
|
transcriptionRef.current.scrollHeight;
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}, [wsUpdate, camera.name]);
|
}, [transcription]);
|
||||||
|
|
||||||
useEffect(() => {
|
useEffect(() => {
|
||||||
return () => {
|
return () => {
|
||||||
@ -661,15 +657,12 @@ export default function LiveCameraView({
|
|||||||
</TransformComponent>
|
</TransformComponent>
|
||||||
{camera?.audio?.enabled_in_config &&
|
{camera?.audio?.enabled_in_config &&
|
||||||
audioTranscriptionState == "ON" &&
|
audioTranscriptionState == "ON" &&
|
||||||
wsUpdate &&
|
transcription != null && (
|
||||||
wsUpdate.type === "transcription" &&
|
|
||||||
wsUpdate.camera === camera.name &&
|
|
||||||
wsUpdate.text !== "" && (
|
|
||||||
<div
|
<div
|
||||||
ref={transcriptionRef}
|
ref={transcriptionRef}
|
||||||
className="text-md scrollbar-container absolute bottom-4 left-1/2 max-h-[15vh] w-[75%] -translate-x-1/2 overflow-y-auto rounded-lg bg-black/70 p-2 text-white md:w-[50%]"
|
className="text-md scrollbar-container absolute bottom-4 left-1/2 max-h-[15vh] w-[75%] -translate-x-1/2 overflow-y-auto rounded-lg bg-black/70 p-2 text-white md:w-[50%]"
|
||||||
>
|
>
|
||||||
{wsUpdate.text}
|
{transcription}
|
||||||
</div>
|
</div>
|
||||||
)}
|
)}
|
||||||
</div>
|
</div>
|
||||||
|
|||||||