assistant.vosk#

Description#

A voice assistant based on the Vosk offline speech recognition engine.

Vosk is a lightweight, offline speech recognition toolkit that supports multiple languages and runs on various platforms including Raspberry Pi.

Setup#

  1. Install the plugin dependencies (pip install vosk sounddevice).

  2. Either set the lang parameter (e.g. en, en-us, it, de) to have the plugin automatically download the best matching small model, or manually download a Vosk model from the Vosk models page and provide its path via model_path.

Models are stored by default under <PLATYPUSH_WORKDIR>/assistant.vosk/models.
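
As a rough illustration of how a generic code such as en could resolve to a regional variant such as en-us, the sketch below picks a language from a list of candidates. The helper name resolve_lang, the candidate list and the "first listed variant wins" rule are assumptions made for illustration only; the plugin's actual matching logic may differ.

```python
from typing import Optional, Sequence


def resolve_lang(requested: str, available: Sequence[str]) -> Optional[str]:
    """Pick the entry of ``available`` that best matches ``requested``.

    Hypothetical helper, not part of the plugin API.
    """
    # Normalize case and separators, e.g. "EN_us" -> "en-us".
    wanted = requested.strip().lower().replace("_", "-")
    candidates = [lang.lower() for lang in available]

    # 1. An exact match always wins (e.g. "en-us" -> "en-us").
    if wanted in candidates:
        return wanted

    # 2. A generic code matches the first regional variant in the list
    #    (e.g. "en" -> "en-us" when "en-us" is listed before "en-in").
    for lang in candidates:
        if lang.split("-")[0] == wanted:
            return lang

    # 3. A regional code falls back to its generic language, if listed.
    base = wanted.split("-")[0]
    return base if base in candidates else None
```

With a candidate list of ["en-us", "en-in", "it", "de"], this sketch resolves en to en-us and returns None for a language that is not listed, such as fr.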

Hotword detection#

This plugin does not include built-in hotword detection. You can pair it with a hotword detection plugin such as platypush.plugins.assistant.picovoice.AssistantPicovoicePlugin (with stt_enabled: false) or platypush.plugins.assistant.openwakeword.AssistantOpenwakewordPlugin.

Example configuration with OpenWakeWord for hotword detection:

assistant.openwakeword:
  models:
    - hey_jarvis

assistant.vosk:
  lang: en  # auto-downloads a small en-us model
  # or: model_path: /path/to/vosk-model-en-us-0.22

Then trigger the conversation on hotword detection:

from platypush import run, when
from platypush.message.event.assistant import HotwordDetectedEvent

@when(HotwordDetectedEvent)
def on_hotword_detected():
    run("assistant.vosk.start_conversation")

Speech recognition#

When a conversation is started (either programmatically via start_conversation() or after a hotword is detected), the plugin records audio from the microphone and processes it through Vosk in real time. When speech is recognized, a platypush.message.event.assistant.SpeechRecognizedEvent is fired.

You can hook into recognized speech:

from platypush import when, run
from platypush.message.event.assistant import SpeechRecognizedEvent

@when(SpeechRecognizedEvent, phrase='turn on (the)? lights?')
def on_turn_on_lights(event: SpeechRecognizedEvent, **context):
    run("light.hue.on")

Configuration#

assistant.vosk:
  # [Optional]
  # Path to the Vosk model directory. You can download
  # models from `<https://alphacephei.com/vosk/models>`_. Either
  # ``model_path`` or ``lang`` must be specified.
  # model_path:   # type=str | None

  # [Optional]
  # Language code (e.g. ``en``, ``en-us``, ``it``, ``de``,
  # ``fr``). If specified and ``model_path`` is not set, the plugin
  # will automatically download the best matching small model from
  # the Vosk model repository. Generic codes like ``en`` will match
  # the most common regional variant (e.g. ``en-us``).
  # lang:   # type=str | None

  # [Optional]
  # Directory where downloaded models are stored.
  # Default: ``<PLATYPUSH_WORKDIR>/assistant.vosk/models``.
  # models_directory:   # type=str | None

  # [Optional]
  # Audio sample rate in Hz (default: 16000). Most
  # Vosk models expect 16 kHz audio.
  # sample_rate: 16000  # type=int

  # [Optional]
  # Number of samples per audio frame (default: 4000).
  # With the default sample rate of 16000, this corresponds to 250 ms
  # per frame.
  # frame_size: 4000  # type=int

  # [Optional]
  # Number of audio channels (default: 1). Vosk requires
  # mono audio.
  # channels: 1  # type=int

  # [Optional]
  # Seconds to wait for speech after
  # starting a conversation before timing out (default: 5.0).
  # conversation_start_timeout: 5.0  # type=float

  # [Optional]
  # Seconds of silence after the last
  # detected speech before ending the conversation (default: 1.5).
  # conversation_end_timeout: 1.5  # type=float

  # [Optional]
  # Maximum conversation duration in
  # seconds (default: 15.0).
  # conversation_max_duration: 15.0  # type=float

  # [Optional]
  # If True, include per-word timing and confidence
  # information in the recognition results (default: False).
  # words: False  # type=bool

  # [Optional]
  # If set, the assistant will use this plugin (e.g.
  # ``tts``, ``tts.google`` or ``tts.mimic3``) to render the responses,
  # instead of using the built-in assistant voice.
  # tts_plugin:   # type=str | None

  # [Optional]
  # Optional arguments to be passed to the TTS
  # ``say`` action, if ``tts_plugin`` is set.
  # tts_plugin_args:   # type=Dict[str, Any] | None

  # [Optional]
  # If set, the assistant will play this
  # audio file when it detects speech. The sound file will be played
  # on the default audio output device. If not set, the assistant won't
  # play any sound when it detects speech.
  # conversation_start_sound:   # type=str | None

  # [Optional]
  # If set, the plugin will
  # prevent the default assistant response when a
  # `SpeechRecognizedEvent <https://docs.platypush.tech/platypush/events/assistant.html#platypush.message.event.assistant.SpeechRecognizedEvent>`_
  # matches a user hook with a condition on a ``phrase`` field. This is
  # useful to prevent the assistant from responding with a default "*I'm
  # sorry, I can't help you with that*" when e.g. you say "*play the
  # music*", and you have a hook that matches the phrase "*play the
  # music*" and handles it with a custom action. If set, and you wish
  # the assistant to also provide an answer if an event matches one of
  # your hooks, then you should call the ``render_response`` method
  # in your hook handler. If not set, then the assistant will always try
  # and respond with a default message, even if a speech event matches
  # the phrase of one of your hooks. In this case, if you want to prevent
  # the default response, you should call ``stop_conversation``
  # explicitly from your hook handler. Default: True.
  # stop_conversation_on_speech_match: True  # type=bool

  # [Optional]
  # How often the `RunnablePlugin.loop <https://docs.platypush.tech/platypush/plugins/.html#platypush.plugins.RunnablePlugin.loop>`_ function should be
  # executed (default: 15 seconds). *NOTE*: For backward-compatibility
  # reasons, the ``poll_seconds`` argument is also supported, but it's
  # deprecated.
  # poll_interval: 15  # type=float | None

  # [Optional]
  # How long we should wait for any running
  # threads/processes to stop before exiting (default: 5 seconds).
  # stop_timeout: 5  # type=float | None

  # [Optional]
  # If set to True then the plugin will not monitor
  # for new events. This is useful if you want to run a plugin in
  # stateless mode and only leverage its actions, without triggering any
  # events. Defaults to False.
  # disable_monitor: False  # type=bool

Dependencies#

pip

pip install sounddevice numpy platypush-vosk

Debian

apt install python3-numpy

Fedora

yum install python-numpy

Arch Linux

pacman -S python-sounddevice python-numpy

Triggered events#

Actions#

Module reference#

class platypush.plugins.assistant.vosk.AssistantVoskPlugin(*_, **__)[source]#

Bases: AssistantPlugin, RunnablePlugin

A voice assistant based on the Vosk offline speech recognition engine.

__init__(model_path: str | None = None, *, lang: str | None = None, models_directory: str | None = None, sample_rate: int = 16000, frame_size: int = 4000, channels: int = 1, conversation_start_timeout: float = 5.0, conversation_end_timeout: float = 1.5, conversation_max_duration: float = 15.0, words: bool = False, **kwargs)[source]#
Parameters:
  • model_path – Path to the Vosk model directory. You can download models from https://alphacephei.com/vosk/models. Either model_path or lang must be specified.

  • lang – Language code (e.g. en, en-us, it, de, fr). If specified and model_path is not set, the plugin will automatically download the best matching small model from the Vosk model repository. Generic codes like en will match the most common regional variant (e.g. en-us).

  • models_directory – Directory where downloaded models are stored. Default: <PLATYPUSH_WORKDIR>/assistant.vosk/models.

  • sample_rate – Audio sample rate in Hz (default: 16000). Most Vosk models expect 16 kHz audio.

  • frame_size – Number of samples per audio frame (default: 4000). With the default sample rate of 16000, this corresponds to 250 ms per frame.

  • channels – Number of audio channels (default: 1). Vosk requires mono audio.

  • conversation_start_timeout – Seconds to wait for speech after starting a conversation before timing out (default: 5.0).

  • conversation_end_timeout – Seconds of silence after the last detected speech before ending the conversation (default: 1.5).

  • conversation_max_duration – Maximum conversation duration in seconds (default: 15.0).

  • words – If True, include per-word timing and confidence information in the recognition results (default: False).
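
The relationship between frame_size and sample_rate described above is simple arithmetic: the frame duration is the number of samples divided by the samples per second. A minimal sketch follows; the helper names frame_duration_ms and frame_size_for are illustrative and not part of the plugin.

```python
def frame_duration_ms(frame_size: int, sample_rate: int) -> float:
    """Return the duration of one audio frame in milliseconds."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return 1000.0 * frame_size / sample_rate


def frame_size_for(duration_ms: float, sample_rate: int) -> int:
    """Return the number of samples needed for a frame of the given duration."""
    return round(sample_rate * duration_ms / 1000.0)


# Defaults from the parameter list: 4000 samples at 16000 Hz.
print(frame_duration_ms(4000, 16000))  # 250.0
# A shorter 100 ms frame at the same rate needs 1600 samples.
print(frame_size_for(100, 16000))      # 1600
```

Smaller frames hand audio to the recognizer more often at the cost of more callbacks per second; the 250 ms default is the value documented above.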

is_detecting(*_, **__) → bool#
Returns:

True if the assistant is detecting, False otherwise.

is_muted(*_, **__) → bool#
Returns:

True if the microphone is muted, False otherwise.

mute(*_, **__)[source]#

Note

This plugin has no continuous hotword detection. Speech processing is on-demand via start_conversation() and stop_conversation(). Mute/unmute are no-ops.

pause_detection(*_, **__)#

Put the assistant on pause. No new conversation events will be triggered.

publish_entities(entities: Collection[Any] | None, callback: Callable[[Entity], Any] | None = None, **kwargs) Collection[Entity]#

Publishes a list of entities to the downstream consumers.

It also accepts an optional callback that will be called when each of the entities in the set is flushed to the database.

You usually don’t need to override this method (but you may want to extend transform_entities() instead if your extension doesn’t natively handle Entity objects).

render_response(text: str, *_, with_follow_on_turn: bool | None = None, **__) bool#

Render a response text as audio over the configured TTS plugin.

Parameters:
  • text – Text to render.

  • with_follow_on_turn – If set, the assistant will wait for a follow-up. By default, with_follow_on_turn will be automatically set to true if the text ends with a question mark.

Returns:

True if the assistant is waiting for a follow-up, False otherwise.
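
The default for with_follow_on_turn described above can be sketched as follows. This is an illustration of the documented rule only; the helper name resolve_follow_on_turn is hypothetical and the plugin's own implementation may handle edge cases (such as trailing whitespace) differently.

```python
from typing import Optional


def resolve_follow_on_turn(
    text: str, with_follow_on_turn: Optional[bool] = None
) -> bool:
    """Decide whether the assistant should wait for a follow-up."""
    # An explicit value always takes precedence.
    if with_follow_on_turn is not None:
        return with_follow_on_turn

    # Otherwise wait for a follow-up only if the response is a question.
    # Stripping trailing whitespace is an assumption of this sketch.
    return text.rstrip().endswith("?")
```

Under this rule a response such as "Which room?" keeps the conversation open, while "Lights on." ends it unless with_follow_on_turn is passed explicitly.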

resume_detection(*_, **__)#

Resume the assistant hotword detection from a paused state.

send_text_query(*_, query: str, **__)[source]#

Send a text query to the assistant (emulates speech recognition).

Parameters:

query – The text query to process.

start_conversation(*_, **__)[source]#

Start a conversation with the assistant.

The conversation is stopped automatically when any of the following occurs:

  • conversation_max_duration seconds have elapsed since the start.

  • No speech is detected within conversation_start_timeout seconds of the start.

  • conversation_end_timeout seconds of silence follow the last detected speech.

  • stop_conversation() is called.
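
The timing-based stop conditions above can be sketched as a single check over three timestamps. This is a minimal model of the documented behaviour, not the plugin's code: the function name stop_reason, the returned labels and the order in which the conditions are evaluated are assumptions. The defaults mirror the documented parameter defaults.

```python
from typing import Optional


def stop_reason(
    now: float,
    started_at: float,
    last_speech_at: Optional[float],
    *,
    start_timeout: float = 5.0,
    end_timeout: float = 1.5,
    max_duration: float = 15.0,
) -> Optional[str]:
    """Return why the conversation should stop, or None to keep listening.

    All times are in seconds on the same clock (e.g. time.monotonic()).
    ``last_speech_at`` is None until speech has been detected.
    """
    # Hard cap on the total conversation length.
    if now - started_at >= max_duration:
        return "max_duration"

    # No speech yet: give up after the start timeout.
    if last_speech_at is None:
        if now - started_at >= start_timeout:
            return "start_timeout"
        return None

    # Speech was detected: stop after enough trailing silence.
    if now - last_speech_at >= end_timeout:
        return "end_timeout"
    return None
```

A recording loop would call such a check once per audio frame and end the conversation as soon as it returns a non-None value.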

status(*_, **__)#
Returns:

The current assistant status:

{
    "last_query": "What time is it?",
    "last_response": "It's 10:30 AM",
    "conversation_running": true,
    "is_muted": false,
    "is_detecting": true
}

stop_conversation(*_, **__)#

Programmatically stops a conversation.

toggle_mute(*_, **__)#

Toggle the mute state of the microphone.

transform_entities(entities: Collection[AssistantPlugin], **_)#

This method takes a list of entities in any (plugin-specific) format and converts them into a standardized collection of Entity objects. Since this method is called by publish_entities() before entity updates are published, you may usually want to extend it to pre-process the entities managed by your extension into the standard format before they are stored and published to all the consumers.

unmute(*_, **__)[source]#

Note

This plugin has no continuous hotword detection. Speech processing is on-demand via start_conversation() and stop_conversation(). Mute/unmute are no-ops.