stt.deepspeech#

class platypush.plugins.stt.deepspeech.SttDeepspeechPlugin(model_file: str, lm_file: str, trie_file: str, lm_alpha: float = 0.75, lm_beta: float = 1.85, beam_width: int = 500, *args, **kwargs)[source]#

Bases: SttPlugin

This plugin performs speech-to-text and speech detection using the Mozilla DeepSpeech engine.

Requires:

  • deepspeech (pip install 'deepspeech>=0.6.0')

  • numpy (pip install numpy)

  • sounddevice (pip install sounddevice)

__init__(model_file: str, lm_file: str, trie_file: str, lm_alpha: float = 0.75, lm_beta: float = 1.85, beam_width: int = 500, *args, **kwargs)[source]#

In order to run the speech-to-text engine you’ll need to download the model files matching the version of the DeepSpeech engine that you have installed:

# Create the working folder for the models
export MODELS_DIR=~/models
mkdir -p $MODELS_DIR
cd $MODELS_DIR

# Download and extract the model files for your version of DeepSpeech. This may take a while.
export DEEPSPEECH_VERSION=0.6.1
wget https://github.com/mozilla/DeepSpeech/releases/download/v$DEEPSPEECH_VERSION/deepspeech-$DEEPSPEECH_VERSION-models.tar.gz
tar -xvzf deepspeech-$DEEPSPEECH_VERSION-models.tar.gz
x deepspeech-0.6.1-models/
x deepspeech-0.6.1-models/lm.binary
x deepspeech-0.6.1-models/output_graph.pbmm
x deepspeech-0.6.1-models/output_graph.pb
x deepspeech-0.6.1-models/trie
x deepspeech-0.6.1-models/output_graph.tflite
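
Once extracted, these files map directly onto the plugin’s constructor arguments documented below. In a regular deployment the same values go under the stt.deepspeech section of your Platypush configuration; the direct instantiation below is only a sketch to illustrate the mapping (the paths assume the models were extracted under ~/models as above):

import os

from platypush.plugins.stt.deepspeech import SttDeepspeechPlugin

models_dir = os.path.expanduser('~/models/deepspeech-0.6.1-models')

plugin = SttDeepspeechPlugin(
    model_file=os.path.join(models_dir, 'output_graph.pbmm'),  # output_graph.pb also works
    lm_file=os.path.join(models_dir, 'lm.binary'),
    trie_file=os.path.join(models_dir, 'trie'),
)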
Parameters:
  • model_file – Path to the model file (usually named output_graph.pb or output_graph.pbmm). Note that .pbmm files usually perform better and are smaller.

  • lm_file – Path to the language model binary file (usually named lm.binary).

  • trie_file – Path to the trie file built from the same vocabulary as the language model binary (usually named trie).

  • lm_alpha – The alpha hyperparameter of the CTC decoder (language model weight). See the Mozilla DeepSpeech documentation (https://github.com/mozilla/DeepSpeech) and the decoder sketch after this parameter list.

  • lm_beta – The beta hyperparameter of the CTC decoder (word insertion weight). See the Mozilla DeepSpeech documentation (https://github.com/mozilla/DeepSpeech).

  • beam_width – Decoder beam width (see beam scoring in KenLM language model).

  • input_device – PortAudio device index or name that will be used for recording speech (default: default system audio input device).

  • hotword – When this word is detected, the plugin will trigger a platypush.message.event.stt.HotwordDetectedEvent instead of a platypush.message.event.stt.SpeechDetectedEvent. You can use these events to hook into other assistants.

  • hotwords – Use a list of hotwords instead of a single one.

  • conversation_timeout – If hotword or hotwords are set together with conversation_timeout, the next speech detected within the timeout after a hotword will trigger a platypush.message.event.stt.ConversationDetectedEvent instead of a platypush.message.event.stt.SpeechDetectedEvent. You can attach custom hooks to this event to run any logic depending on the detected speech - it can emulate a kind of “OK, Google. Turn on the lights” interaction without using an external assistant.

  • block_duration – Duration of the acquired audio blocks (default: 1 second).
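
For context, lm_alpha, lm_beta and beam_width are handed over to the DeepSpeech CTC decoder. The following is only a rough sketch of how the 0.6.x Python bindings consume them, not the plugin’s actual code (the file names are the ones extracted above):

import numpy as np
import deepspeech

model = deepspeech.Model('output_graph.pbmm', 500)          # beam_width
model.enableDecoderWithLM('lm.binary', 'trie', 0.75, 1.85)  # lm_alpha, lm_beta

# DeepSpeech expects 16 kHz, 16-bit mono PCM samples.
audio = np.zeros(16000, dtype=np.int16)  # one second of silence as a placeholder
print(model.stt(audio))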

static convert_frames(frames: numpy.ndarray | bytes) numpy.ndarray[source]#

Conversion method for raw audio frames. It converts the captured raw frames into the format expected by detect_speech. Override it if required by your logic.

Parameters:

frames – Input audio frames, as bytes.

Returns:

The audio frames converted into the format expected by detect_speech. Override if a different conversion is required.
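
As an illustration only (not necessarily the plugin’s actual implementation), an override for an engine that expects 16-bit PCM samples as a NumPy array could look like this:

import numpy as np

from platypush.plugins.stt.deepspeech import SttDeepspeechPlugin

class MySttDeepspeechPlugin(SttDeepspeechPlugin):
    @staticmethod
    def convert_frames(frames):
        if isinstance(frames, np.ndarray):
            # Already converted upstream: pass the buffer through.
            return frames
        # Raw captured bytes -> 16-bit PCM samples as a NumPy array.
        return np.frombuffer(frames, dtype=np.int16)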

detect(audio_file: str) SpeechDetectedResponse[source]#

Perform speech-to-text analysis on an audio file.

Parameters:

audio_file – Path to the audio file.
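
A minimal usage sketch from a Python script or hook, assuming the plugin is configured and using the platypush.context.get_plugin helper (the audio path is a placeholder):

from platypush.context import get_plugin

stt = get_plugin('stt.deepspeech')
response = stt.detect(audio_file='/path/to/audio.wav')
print(response)  # SpeechDetectedResponse carrying the recognized text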

detect_speech(frames) str[source]#

Method called within the detection_thread when new audio frames have been captured. Must be implemented by the derived classes.

Parameters:

frames – Audio frames, as returned by convert_frames.

Returns:

Detected text, as a string. Returns an empty string if no text has been detected.
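
As a hedged sketch of the contract (frames in, recognized text out), a derived class could feed the converted frames to a DeepSpeech model initialized when detection starts; self._model and the use of self.model_file below are assumptions for illustration, not the plugin’s documented internals:

import deepspeech

from platypush.plugins.stt.deepspeech import SttDeepspeechPlugin

class MyDeepspeechPlugin(SttDeepspeechPlugin):
    def on_detection_started(self):
        # Assumed: build the model once, when the detection thread starts.
        self._model = deepspeech.Model(self.model_file, 500)

    def detect_speech(self, frames) -> str:
        # frames is the buffer returned by convert_frames; stt() returns the
        # recognized text, or an empty string if nothing was detected.
        return self._model.stt(frames)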

on_detection_ended()[source]#

Method called when the detection_thread stops. Clean up your context variables and models here.

on_detection_started()[source]#

Method called when the detection_thread starts. Initialize your context variables and models here if required.

on_speech_detected(speech: str) None[source]#

Hook called when speech is detected. Triggers the right event depending on the current context.

Parameters:

speech – Detected speech.
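
These are the events you would typically attach user hooks to (see the hotword and conversation_timeout parameters above). A hedged sketch of a Python event hook, assuming the @hook decorator from platypush.event.hook and that the recognized text is available among the event arguments:

from platypush.event.hook import hook
from platypush.message.event.stt import SpeechDetectedEvent

@hook(SpeechDetectedEvent)
def on_speech_detected_hook(event, **context):
    # Assumed: the recognized text is exposed as the `speech` event argument.
    print('Recognized speech:', event.args.get('speech'))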