assistant.vosk

`assistant.vosk`#

Description#

A voice assistant based on the Vosk offline speech recognition engine.

Vosk is a lightweight, offline speech recognition toolkit that supports multiple languages and runs on various platforms including Raspberry Pi.

Setup#

Install the plugin dependencies (pip install vosk sounddevice).
Either set the lang parameter (e.g. en, en-us, it, de) and the plugin will automatically download the best matching small model, or manually download a Vosk model from the Vosk models page and provide its path via model_path.

Models are stored by default under <PLATYPUSH_WORKDIR>/assistant.vosk/models.

Hotword detection#

This plugin does not include built-in hotword detection. You can pair it with a hotword detection plugin such as platypush.plugins.assistant.picovoice.AssistantPicovoicePlugin (with stt_enabled: false) or platypush.plugins.assistant.openwakeword.AssistantOpenwakewordPlugin.

Example configuration with OpenWakeWord for hotword detection:

assistant.openwakeword:
  models:
    - hey_jarvis

assistant.vosk:
  lang: en  # auto-downloads a small en-us model
  # or: model_path: /path/to/vosk-model-en-us-0.22

Then trigger the conversation on hotword detection:

from platypush import run, when
from platypush.message.event.assistant import HotwordDetectedEvent

@when(HotwordDetectedEvent)
def on_hotword_detected():
    run("assistant.vosk.start_conversation")

Speech recognition#

When a conversation is started (either programmatically via start_conversation() or after a hotword is detected), the plugin records audio from the microphone and processes it through Vosk in real-time. When speech is recognized, a platypush.message.event.assistant.SpeechRecognizedEvent is fired.
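The recognition loop described above can be sketched roughly as follows. This is not the plugin's actual code: the helper name recognized_phrases and the Recognizer protocol are hypothetical, introduced only for illustration. The three methods it relies on (AcceptWaveform, Result, FinalResult) are those exposed by vosk.KaldiRecognizer, which returns its results as JSON strings with a "text" field.

```python
import json
from typing import Iterable, Iterator, Protocol


class Recognizer(Protocol):
    """The subset of vosk.KaldiRecognizer used by this sketch."""

    def AcceptWaveform(self, data: bytes) -> bool: ...

    def Result(self) -> str: ...

    def FinalResult(self) -> str: ...


def recognized_phrases(
    recognizer: Recognizer, frames: Iterable[bytes]
) -> Iterator[str]:
    """Yield each non-empty utterance recognized from a stream of PCM frames.

    Hypothetical helper: the real plugin also applies timeouts, VAD and
    noise suppression, and fires events instead of yielding strings.
    """
    for frame in frames:
        # AcceptWaveform returns a truthy value once an utterance is complete.
        if recognizer.AcceptWaveform(frame):
            text = json.loads(recognizer.Result()).get("text", "")
            if text:
                yield text

    # Flush whatever is still buffered when the audio stream ends.
    text = json.loads(recognizer.FinalResult()).get("text", "")
    if text:
        yield text
```

With the vosk package installed, the recognizer argument could be built as KaldiRecognizer(Model(model_path), 16000), and frames could come from any source of 16-bit mono PCM chunks, such as a sounddevice input stream.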

You can hook into recognized speech:

from platypush import when, run
from platypush.message.event.assistant import SpeechRecognizedEvent

@when(SpeechRecognizedEvent, phrase='turn on (the)? lights?')
def on_turn_on_lights(event: SpeechRecognizedEvent, **context):
    run("light.hue.on")

Configuration#

assistant.vosk:
  # [Optional]
  # Path to the Vosk model directory. You can download
  # models from `<https://alphacephei.com/vosk/models>`_. Either
  # ``model_path`` or ``lang`` must be specified.
  # model_path:   # type=str | None

  # [Optional]
  # Language code (e.g. ``en``, ``en-us``, ``it``, ``de``,
  # ``fr``). If specified and ``model_path`` is not set, the plugin
  # will automatically download the best matching small model from
  # the Vosk model repository. Generic codes like ``en`` will match
  # the most common regional variant (e.g. ``en-us``).
  # lang:   # type=str | None

  # [Optional]
  # Directory where downloaded models are stored.
  # Default: ``<PLATYPUSH_WORKDIR>/assistant.vosk/models``.
  # models_directory:   # type=str | None

  # [Optional]
  # Audio sample rate in Hz (default: 16000). Most
  # Vosk models expect 16 kHz audio.
  # sample_rate: 16000  # type=int

  # [Optional]
  # Number of samples per audio frame (default: 2000).
  # With the default sample rate of 16000, this corresponds to 125 ms
  # per frame. Smaller values reduce latency but increase CPU usage.
  # frame_size: 2000  # type=int

  # [Optional]
  # Number of audio channels (default: 1). Vosk requires
  # mono audio.
  # channels: 1  # type=int

  # [Optional]
  # Audio input device to use for recording. Supported
  # formats: PortAudio/sounddevice device index, PortAudio/sounddevice
  # device name, or PulseAudio/PipeWire source name (e.g.
  # ``alsa_input.usb-...``; requires ``pactl``). Default: system
  # default input device.
  # input_device:   # type=int | str | None

  # [Optional]
  # Recording gain, as a percentage. ``100`` means
  # unchanged, values below ``100`` attenuate, and values above ``100``
  # amplify with clipping. Default: 100.
  # input_volume: 100  # type=float

  # [Optional]
  # Seconds to wait for speech after
  # starting a conversation before timing out (default: 5.0).
  # conversation_start_timeout: 5.0  # type=float

  # [Optional]
  # Seconds of silence after the last
  # detected speech before ending the conversation (default: 1.5).
  # conversation_end_timeout: 1.5  # type=float

  # [Optional]
  # Maximum conversation duration in
  # seconds (default: 15.0).
  # conversation_max_duration: 15.0  # type=float

  # [Optional]
  # If True, include per-word timing and confidence
  # information in the recognition results (default: False).
  # words: False  # type=bool

  # [Optional]
  # Whether to enable Speex-based noise
  # suppression (requires the ``speexdsp_ns`` package). Reduces
  # background noise and improves recognition, especially for distant
  # speech. Default: auto-enabled if the package is available.
  # enable_noise_suppression:   # type=bool | None

  # [Optional]
  # Whether to use Voice Activity Detection for
  # speech boundary detection (default: True). Uses ``webrtcvad`` if
  # available, otherwise falls back to energy-based detection. VAD
  # enables faster end-of-speech detection (~300 ms vs. relying on
  # Vosk partial result timeouts).
  # vad_enabled: True  # type=bool

  # [Optional]
  # WebRTC VAD aggressiveness mode, 0–3 (default: 2).
  # Higher values are more aggressive at filtering non-speech but
  # may miss distant or quiet speech.  Only used when ``webrtcvad``
  # is installed.
  # vad_mode: 2  # type=int

  # [Optional]
  # Fraction of VAD sub-frames within an
  # audio frame that must be classified as speech for the frame to
  # be considered as containing speech (default: 0.3).
  # vad_speech_threshold: 0.3  # type=float

  # [Optional]
  # RMS energy threshold for the
  # energy-based VAD fallback (used when ``webrtcvad`` is not
  # installed).  Voices at conversational distance typically
  # produce RMS > 300 on int16 scale (~-41 dBFS).  Lower values
  # improve sensitivity for distant or quiet speech at the cost
  # of more false positives.  Default: 300.
  # energy_vad_threshold: 300  # type=float

  # [Optional]
  # If set, the assistant will use this plugin (e.g.
  # ``tts``, ``tts.google`` or ``tts.mimic3``) to render the responses,
  # instead of using the built-in assistant voice.
  # tts_plugin:   # type=str | None

  # [Optional]
  # Optional arguments to be passed to the TTS
  # ``say`` action, if ``tts_plugin`` is set.
  # tts_plugin_args:   # type=Dict[str, Any] | None

  # [Optional]
  # If set, the assistant will play this
  # audio file when it detects speech. The sound file will be played
  # on the default audio output device. If not set, the assistant won't
  # play any sound when it detects speech.
  # conversation_start_sound:   # type=str | None

  # [Optional]
  # If set, the plugin will
  # prevent the default assistant response when a
  # `SpeechRecognizedEvent <https://docs.platypush.tech/platypush/events/assistant.html#platypush.message.event.assistant.SpeechRecognizedEvent>`_
  # matches a user hook with a condition on a ``phrase`` field. This is
  # useful to prevent the assistant from responding with a default "*I'm
  # sorry, I can't help you with that*" when e.g. you say "*play the
  # music*", and you have a hook that matches the phrase "*play the
  # music*" and handles it with a custom action. If set, and you wish
  # the assistant to also provide an answer if an event matches one of
  # your hooks, then you should call the ``render_response`` method
  # in your hook handler. If not set, then the assistant will always try
  # and respond with a default message, even if a speech event matches
  # the phrase of one of your hooks. In this case, if you want to prevent
  # the default response, you should call ``stop_conversation``
  # explicitly from your hook handler. Default: True.
  # stop_conversation_on_speech_match: True  # type=bool

  # [Optional]
  # How often the `RunnablePlugin.loop <https://docs.platypush.tech/platypush/plugins/.html#platypush.plugins.RunnablePlugin.loop>`_ function should be
  # executed (default: 15 seconds). *NOTE*: For back-compatibility
  # reasons, the `poll_seconds` argument is also supported, but it's
  # deprecated.
  # poll_interval: 15  # type=float | None

  # [Optional]
  # How long we should wait for any running
  # threads/processes to stop before exiting (default: 5 seconds).
  # stop_timeout: 5  # type=float | None

  # [Optional]
  # If set to True then the plugin will not monitor
  # for new events. This is useful if you want to run a plugin in
  # stateless mode and only leverage its actions, without triggering any
  # events. Defaults to False.
  # disable_monitor: False  # type=bool
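
As an illustration of what the input_volume percentage means for the recorded samples, here is a minimal sketch of a percentage gain with clipping on 16-bit audio. The function name apply_input_volume is hypothetical, and the plugin may implement the gain differently (for instance with numpy); only the semantics (100 unchanged, below 100 attenuates, above 100 amplifies with clipping) come from the description above.

```python
from array import array

INT16_MIN, INT16_MAX = -32768, 32767


def apply_input_volume(samples: array, input_volume: float = 100.0) -> array:
    """Scale int16 samples by a percentage gain, clipping to the int16 range.

    Hypothetical helper, not part of the plugin's public API.
    """
    gain = input_volume / 100.0
    return array(
        "h",
        (max(INT16_MIN, min(INT16_MAX, round(s * gain))) for s in samples),
    )
```

Values pushed beyond the int16 range are flattened at the limits, so very high gains distort loud passages rather than making them louder.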

Dependencies#

pip

pip install numpy speexdsp-ns webrtcvad platypush-vosk sounddevice

Alpine

apk add speexdsp ffmpeg py3-numpy swig

Debian

apt install ffmpeg swig libspeexdsp-dev python3-sounddevice python3-numpy

Fedora

yum install ffmpeg python-webrtcvad swig python-sounddevice python-numpy speexdsp

Arch Linux

pacman -S ffmpeg python-webrtcvad swig python-sounddevice python-numpy speexdsp

Triggered events#

platypush.message.event.assistant.MicUnmutedEvent

platypush.message.event.assistant.MicMutedEvent

platypush.message.event.assistant.SpeechRecognizedEvent

platypush.message.event.assistant.NoResponseEvent

platypush.message.event.assistant.ResponseEndEvent

platypush.message.event.assistant.ResponseEvent

platypush.message.event.assistant.ConversationTimeoutEvent

platypush.message.event.assistant.ConversationEndEvent

platypush.message.event.assistant.ConversationStartEvent

Actions#

assistant.vosk.is_detecting

assistant.vosk.is_muted

assistant.vosk.mute

assistant.vosk.pause_detection

assistant.vosk.render_response

assistant.vosk.resume_detection

assistant.vosk.send_text_query

assistant.vosk.start_conversation

assistant.vosk.status

assistant.vosk.stop_conversation

assistant.vosk.unmute

Module reference#

class platypush.plugins.assistant.vosk.AssistantVoskPlugin(*_, **__)[source]#

Bases: AssistantPlugin, RunnablePlugin

__init__(model_path: str | None = None, *, lang: str | None = None, models_directory: str | None = None, sample_rate: int = 16000, frame_size: int = 2000, channels: int = 1, input_device: int | str | None = None, input_volume: float = 100, conversation_start_timeout: float = 5.0, conversation_end_timeout: float = 1.5, conversation_max_duration: float = 15.0, words: bool = False, enable_noise_suppression: bool | None = None, vad_enabled: bool = True, vad_mode: int = 2, vad_speech_threshold: float = 0.3, energy_vad_threshold: float = 300, **kwargs)[source]#

Parameters:

model_path – Path to the Vosk model directory. You can download models from https://alphacephei.com/vosk/models. Either model_path or lang must be specified.
lang – Language code (e.g. en, en-us, it, de, fr). If specified and model_path is not set, the plugin will automatically download the best matching small model from the Vosk model repository. Generic codes like en will match the most common regional variant (e.g. en-us).
models_directory – Directory where downloaded models are stored. Default: <PLATYPUSH_WORKDIR>/assistant.vosk/models.
sample_rate – Audio sample rate in Hz (default: 16000). Most Vosk models expect 16 kHz audio.
frame_size – Number of samples per audio frame (default: 2000). With the default sample rate of 16000, this corresponds to 125 ms per frame. Smaller values reduce latency but increase CPU usage.
channels – Number of audio channels (default: 1). Vosk requires mono audio.
input_device – Audio input device to use for recording. Supported formats: PortAudio/sounddevice device index, PortAudio/sounddevice device name, or PulseAudio/PipeWire source name (e.g. alsa_input.usb-...; requires pactl). Default: system default input device.
input_volume – Recording gain, as a percentage. 100 means unchanged, values below 100 attenuate, and values above 100 amplify with clipping. Default: 100.
conversation_start_timeout – Seconds to wait for speech after starting a conversation before timing out (default: 5.0).
conversation_end_timeout – Seconds of silence after the last detected speech before ending the conversation (default: 1.5).
conversation_max_duration – Maximum conversation duration in seconds (default: 15.0).
words – If True, include per-word timing and confidence information in the recognition results (default: False).
enable_noise_suppression – Whether to enable Speex-based noise suppression (requires the speexdsp_ns package). Reduces background noise and improves recognition, especially for distant speech. Default: auto-enabled if the package is available.
vad_enabled – Whether to use Voice Activity Detection for speech boundary detection (default: True). Uses webrtcvad if available, otherwise falls back to energy-based detection. VAD enables faster end-of-speech detection (~300 ms vs. relying on Vosk partial result timeouts).
vad_mode – WebRTC VAD aggressiveness mode, 0–3 (default: 2). Higher values are more aggressive at filtering non-speech but may miss distant or quiet speech. Only used when webrtcvad is installed.
vad_speech_threshold – Fraction of VAD sub-frames within an audio frame that must be classified as speech for the frame to be considered as containing speech (default: 0.3).
energy_vad_threshold – RMS energy threshold for the energy-based VAD fallback (used when webrtcvad is not installed). Voices at conversational distance typically produce RMS > 300 on int16 scale (~-41 dBFS). Lower values improve sensitivity for distant or quiet speech at the cost of more false positives. Default: 300.
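
To make the interaction between frame_size, vad_speech_threshold and energy_vad_threshold more concrete, the sketch below classifies one audio frame by splitting it into sub-frames and counting how many look like speech. The function names and the 480-sample sub-frame size (30 ms at 16 kHz, one of the frame lengths webrtcvad accepts) are assumptions made for illustration, not the plugin's actual implementation. The default classifier is the energy-based fallback: a sub-frame counts as speech when its RMS exceeds the threshold.

```python
import math
from array import array
from typing import Callable, List, Optional


def rms(samples: array) -> float:
    """Root mean square of int16 samples (0.0 for an empty buffer)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def split_subframes(frame: array, subframe_size: int) -> List[array]:
    """Split a frame into complete sub-frames, dropping any trailing remainder."""
    count = len(frame) // subframe_size
    return [frame[i * subframe_size:(i + 1) * subframe_size] for i in range(count)]


def frame_has_speech(
    frame: array,
    subframe_size: int = 480,  # assumption: 30 ms sub-frames at 16 kHz
    is_speech: Optional[Callable[[array], bool]] = None,
    vad_speech_threshold: float = 0.3,
    energy_vad_threshold: float = 300.0,
) -> bool:
    """Return True if enough sub-frames of the frame are classified as speech."""
    if is_speech is None:
        # Energy-based fallback: compare the sub-frame RMS to the threshold.
        def is_speech(sub: array) -> bool:
            return rms(sub) > energy_vad_threshold

    subframes = split_subframes(frame, subframe_size)
    if not subframes:
        return False
    voiced = sum(1 for sub in subframes if is_speech(sub))
    return voiced / len(subframes) >= vad_speech_threshold
```

With the default frame_size of 2000 samples this yields four complete sub-frames, so under these assumptions a threshold of 0.3 requires at least two of them to be voiced. A WebRTC-based classifier could be passed through the is_speech argument instead of the energy fallback.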

is_detecting(*_, **__) → bool#

Returns:

True if the assistant is detecting, False otherwise.

is_muted(*_, **__) → bool#

Returns:

True if the microphone is muted, False otherwise.

mute(*_, **__)[source]#

Note

This plugin has no continuous hotword detection. Speech processing is on-demand via start_conversation() and stop_conversation(). Mute/unmute are no-ops.

pause_detection(*_, **__)#

Put the assistant on pause. No new conversation events will be triggered.

publish_entities(entities: Collection[Any] | None, callback: Callable[[Entity], Any] | None = None, **kwargs) → Collection[Entity]#

Publishes a list of entities. The downstream consumers include:

The entity persistence manager

The web server

Any consumer subscribed to platypush.message.event.entities.EntityUpdateEvent events (e.g. web clients)

It also accepts an optional callback that will be called when each of the entities in the set is flushed to the database.

You usually don’t need to override this method (but you may want to extend transform_entities() instead if your extension doesn’t natively handle Entity objects).

render_response(text: str, *_, with_follow_on_turn: bool | None = None, **__) → bool#

Render a response text as audio over the configured TTS plugin.

Parameters:

text – Text to render.
with_follow_on_turn – If set, the assistant will wait for a follow-up. By default, with_follow_on_turn will be automatically set to true if the text ends with a question mark.

Returns:

True if the assistant is waiting for a follow-up, False otherwise.
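
The default for with_follow_on_turn described above can be sketched as a small helper. The name resolve_follow_on_turn is hypothetical, and ignoring trailing whitespace before checking for the question mark is an assumption rather than documented behavior.

```python
from typing import Optional


def resolve_follow_on_turn(
    text: str, with_follow_on_turn: Optional[bool] = None
) -> bool:
    """Decide whether the assistant should wait for a follow-up.

    An explicit value always wins; otherwise a response that ends with a
    question mark is treated as expecting an answer.
    """
    if with_follow_on_turn is not None:
        return with_follow_on_turn
    # Assumption: trailing whitespace is ignored before the check.
    return text.strip().endswith("?")
```

Under this reading, a hook that renders "Which room?" keeps the conversation open without passing the flag, while passing with_follow_on_turn=False closes it regardless of punctuation.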

resume_detection(*_, **__)#

Resume the assistant hotword detection from a paused state.

send_text_query(*_, query: str, **__)[source]#

Send a text query to the assistant (emulates speech recognition).

Parameters:

query – The text query to process.

start_conversation(*_, **__)[source]#

Start a conversation with the assistant.

The conversation will be automatically stopped after conversation_max_duration seconds, or after conversation_start_timeout seconds of silence with no speech detected, or after conversation_end_timeout seconds of silence after the last speech, or when stop_conversation() is called.
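
The three automatic stop conditions can be read as a simple decision on timestamps. The sketch below is one possible interpretation, not the plugin's code: the function name stop_reason, its return labels, and the order in which the conditions are checked are all assumptions.

```python
from typing import Optional


def stop_reason(
    now: float,
    started_at: float,
    last_speech_at: Optional[float],
    conversation_start_timeout: float = 5.0,
    conversation_end_timeout: float = 1.5,
    conversation_max_duration: float = 15.0,
) -> Optional[str]:
    """Return why the conversation should stop, or None to keep listening.

    All times are in seconds on the same clock (e.g. time.monotonic()).
    last_speech_at is None until speech has been detected at least once.
    """
    if now - started_at >= conversation_max_duration:
        return "max_duration"
    if last_speech_at is None:
        # Nobody has spoken yet: apply the start timeout.
        if now - started_at >= conversation_start_timeout:
            return "start_timeout"
        return None
    # Speech was detected: stop after enough trailing silence.
    if now - last_speech_at >= conversation_end_timeout:
        return "end_timeout"
    return None
```

With the default values, a conversation in which nobody speaks ends after 5 seconds, one in which the speaker pauses for 1.5 seconds ends at that pause, and no conversation lasts longer than 15 seconds.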

status(*_, **__)#

Returns:

The current assistant status:

{
    "last_query": "What time is it?",
    "last_response": "It's 10:30 AM",
    "conversation_running": true,
    "is_muted": false,
    "is_detecting": true
}

stop_conversation(*_, **__)#

Programmatically stops a conversation.

toggle_mute(*_, **__)#

Toggle the mute state of the microphone.

transform_entities(entities: Collection[AssistantPlugin], **_)#

This method takes a list of entities in any (plugin-specific) format and converts them into a standardized collection of Entity objects. Since this method is called by publish_entities() before entity updates are published, you may usually want to extend it to pre-process the entities managed by your extension into the standard format before they are stored and published to all the consumers.

unmute(*_, **__)[source]#

Note

This plugin has no continuous hotword detection. Speech processing is on-demand via start_conversation() and stop_conversation(). Mute/unmute are no-ops.
