Ways to control Node-RED with speech commands

Not sure if this is of interest, but I have installed the Rhasspy toolkit on an RPi 3B+.

I haven't fully tested it yet as I am awaiting delivery of a PS3 Eye camera (I'll be using its microphone). It works well on the TTS side (using picoTTS) and seems to be very configurable. Once I get the microphone, I can look at what the JSON string looks like and start developing Node-RED commands to drive my Opto22 PAC R1 controller.

You can integrate your own preferred 'engines' using the custom command feature should you wish.

I may be misunderstanding some stuff as this is all so new to me (including using Docker!!). But I reckon it might be worth a few moments of your time to explore its possibilities, as PocketSphinx appears to have Dutch language capabilities. (There are some nice tables showing compatibilities.)

A new forum has been set up in the last few days, and some of the people from the Home Assistant site seem to be very keen on the project.


Hi @mudwalker,
Do you know whether you can use this via a smartphone? I mean having wake-word detection (e.g. "hey rhasspy" or something like that) on a smartphone, and then all the other recognition on your Raspberry? So similar to how Google and Amazon do it ...

Just don't ask it to tell a joke, it may be German. :rofl:

(in reference to a particular German car manufacturer).


Hi @BartButenaers,
Looks like it can be used in this mode. There are two options: one is a client/server setup (with an installation example through Docker), the other uses the Hermes Audio Server, which was written to do exactly that.

The first link shows both methods.

I need to try to walk before I start trying to run with this. :sweat_smile: Two weeks ago I hadn't even considered voice control. I rely on the contributions made to the various forums by people with far more knowledge about these things than me to enable me to explore options! To these people I say a BIG thank you.


I actually built a whole voice assistant using just Node-RED and two very simple Python scripts: one for hotword detection and one for ASR.
For the ASR I use a small Python script utilizing the pocketsphinx Python bindings with a custom dictionary and language model. This works very well for German in my case. You also need an acoustic model, but they have one here:
As long as you have a way to get the audio from the phone to your Node-RED server, your approach could work.

Thanks for joining this discussion, and sharing your experiences!!

Last week I announced a beta version of my node-red-contrib-ui-media-recorder. I also want to add microphone capture to that node (based on the MediaRecorder API), which could help us for this purpose...

Very interesting! But unfortunately this is way out of my comfort zone. :woozy_face:
If you would find some spare time to explain - in dummy language - how you have accomplished this, that would be appreciated! This forum has a "share your project" category that would be ideal :wink:

Such a wake-word detection baked into my node would be an awesome feature...

please @JGKK give me a tutorial :wink:

I have been thinking about one for a while. Once I have some time I will try to write one. But unfortunately I can’t say when as I’m not really good at writing tutorials and it would take quite some effort.


@JGKK even if you just provided the flow and scripts that would be sharing your work with others and helping them along :slightly_smiling_face:


A lot has changed and I think I have found the perfect speech companion to Node-RED:

This is an open-source project by the same person that also makes the Rhasspy assistant project (https://rhasspy.readthedocs.io/en/latest/).
It includes most of the core features of the latter, but in a stripped-down version as a command-line tool.
Installation couldn't be any easier as there are prebuilt deb packages for download. There is already support for many languages, and all you have to do is download a profile to get one of them.
Many of the languages support Kaldi models by the great https://zamia.org/ project. Normally it's a huge pain to adapt a Kaldi model to your own domain-specific language model, but voice2json takes all this away and makes it a really easy and straightforward process.
As Kaldi can achieve sub-realtime performance on a Raspberry Pi 4 (especially if it's a little bit overclocked) and is way better accuracy-wise than pocketsphinx, this is awesome news.
You can create your intents by writing them in a very easy to understand template language that is based on the jsgf grammar format.
It took me less than two hours to move all my intents I had in my pocketsphinx language model to this template language.
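As an illustration, a minimal sentences.ini could look something like this (the intent name, rule names and wording here are made up; the syntax follows the voice2json template format, which is based on the JSGF grammar format):

```ini
[ChangeLightState]
light_name = (living room | kitchen) {name}
light_state = (on | off) {state}
turn <light_state> [the] <light_name> light
```

Each `[Section]` is an intent, `(a | b)` is an alternative, `[the]` is optional, and `{tag}` marks a slot that ends up in the recognized JSON.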
I'm amazed how well this tool works out of the box.
The best part: it does not only do speech to text, but also includes a tool for intent recognition that parses the intent out of your command. Because it's all command-line based you can easily integrate it using the exec node, and as the name suggests it outputs all results in JSON format, so it's very easy to work with in Node-RED.
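Since the output is JSON, handling it downstream is straightforward. As a rough sketch (the field names follow the voice2json docs, but the intent and slot values here are invented), the kind of flattening you would do in a Node-RED function node after the exec node looks like this in Python:

```python
import json

# Example result shaped like voice2json's recognize-intent output
# (field names per the voice2json docs; the values are invented).
result = json.loads("""
{
  "text": "turn on the kitchen light",
  "intent": {"name": "ChangeLightState", "confidence": 1.0},
  "slots": {"state": "on", "name": "kitchen"}
}
""")

def to_nodered_msg(result):
    # Flatten the intent result into the kind of msg object you would
    # build in a Node-RED change/function node: intent name as topic,
    # slots as payload.
    return {
        "topic": result["intent"]["name"],
        "payload": result["slots"],
        "text": result["text"],
    }

msg = to_nodered_msg(result)
print(msg["topic"], msg["payload"])
```

A switch node on `msg.topic` can then route each intent to the right part of your flow.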
The documentation is great.
So to round it up I would say this is the easiest to use and install fully offline linux speech to text/intent solution I have used and I recommend everybody go try it.

Stay healthy, Johannes


Thanks a lot for sharing this! It looks very interesting. I should give it a try.

Did you also use this for hotword detection, and if so, how exactly?

No, I use snowboy for hotword detection.
This is my workflow:
Download snowboy for Python (https://snowboy.kitt.ai/docspartials/docs/index.html#downloads) and get a hotword file. You can see an example here:
Then I wrote a Python script like this, called hotword.py, in my snowboy folder:

import snowboydecoder
import sys

models = ['/home/pi/hey_pips.pmdl']

# sensitivity comes in from Node-RED as the first command-line argument
detector = snowboydecoder.HotwordDetector(models, sensitivity=float(sys.argv[1]))

interrupted = False

def callbackfunc():
    # print "Hotword" so the Node-RED exec node sees it, then stop the loop
    global interrupted
    print("Hotword")
    interrupted = True

detector.start(detected_callback=callbackfunc, interrupt_check=lambda: interrupted, sleep_time=0.03)
# release the audio resources so sox can record from the microphone
detector.terminate()

I run this from Node-RED in an exec node using: python -u ~/snowboy/hotword.py with "append msg.payload" ticked, so the hotword sensitivity is passed from Node-RED via sys.argv. When a hotword is detected, the script prints Hotword and releases the audio resources. This triggers a second exec node in Node-RED that uses sox (http://sox.sourceforge.net/) to record audio in the right format.
The sox command in the exec node actually does a little bit more:

sox -t alsa default -r 16000 -c 1 -b 16 ~/asr.wav silence -l 0 1 2.0 2.0% trim 0 6 vad -p 0.2 reverse vad -p 0.6 reverse

It only records until it detects silence, up to a maximum of 6 seconds, and then applies additional VAD to trim silence/noise off the start and end. Once the recording is finished the hotword script restarts and, in my case, triggers the speech to text using voice2json, or on my satellites sends the recorded wav as a buffer over MQTT to do the STT on my main Node-RED instance. These are the basics of my hotword handling as a flow (my actual flow also has stuff like LEDs on the mic array and finding out which of several satellites heard the hotword first):

[{"id":"ce732054.cde6c","type":"exec","z":"ed7ded5f.9a395","command":"python -u ~/snowboy/hotword.py ","addpay":true,"append":"","useSpawn":"false","timer":"","oldrc":false,"name":"hotword.py","x":590,"y":660,"wires":[["6e159b37.21f8dc"],[],[]]},{"id":"a20a43a6.91b2","type":"inject","z":"ed7ded5f.9a395","name":"","topic":"","payload":"","payloadType":"date","repeat":"","crontab":"","once":false,"onceDelay":"30","x":220,"y":660,"wires":[["17f2a8f4.42c1cf"]]},{"id":"6e159b37.21f8dc","type":"switch","z":"ed7ded5f.9a395","name":"hotword?","property":"payload","propertyType":"msg","rules":[{"t":"regex","v":"Hotword","vt":"str","case":false}],"checkall":"true","repair":false,"outputs":1,"x":760,"y":660,"wires":[["2114741d.a8fd24"]]},{"id":"2114741d.a8fd24","type":"exec","z":"ed7ded5f.9a395","command":"sox -t alsa default -r 16000 -c 1 -b 16 ~/asr.wav silence -l 0 1 2.0 2.0% trim 0 6 vad -p 0.2 reverse vad -p 0.6 reverse","addpay":false,"append":"","useSpawn":"false","timer":"","oldrc":false,"name":"record","x":910,"y":660,"wires":[["17f2a8f4.42c1cf"],[],[]]},{"id":"17f2a8f4.42c1cf","type":"change","z":"ed7ded5f.9a395","name":"set sensitivity","rules":[{"t":"set","p":"payload","pt":"msg","to":"0.4","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":400,"y":660,"wires":[["ce732054.cde6c"]]}]
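To give an idea of the satellite side, here is a hedged sketch in Python: the envelope format, topic, broker name and siteId are all my own invention (the thread only says the wav is sent as a buffer over MQTT), and paho-mqtt is assumed to be installed; only the packaging helper is shown in runnable form.

```python
import base64
import json

def build_audio_message(site_id, wav_bytes):
    # Wrap raw wav bytes in a JSON envelope (base64 keeps the audio
    # JSON-safe) so the main instance knows which satellite sent it.
    return json.dumps({
        "siteId": site_id,
        "audio": base64.b64encode(wav_bytes).decode("ascii"),
    })

# On the satellite, after sox has written ~/asr.wav (needs a running
# broker and paho-mqtt, so only sketched here):
# import paho.mqtt.publish as publish
# with open("/home/pi/asr.wav", "rb") as f:
#     publish.single("voice/audio", build_audio_message("kitchen", f.read()),
#                    hostname="broker.local")
```

On the main instance, an MQTT-in node plus a function node can decode the base64 audio back into a buffer and hand it to voice2json.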



@janvda Unfortunately snowboy is shutting down all operations in December. There is Porcupine by Picovoice, which you can use directly from voice2json (http://voice2json.org/commands.html#wait-wake), but if you want to use personal models you have to retrain them every 30 days on their website.


Thanks for informing us about snowboy.

Not sure if the shutdown they announced in the readme really makes a lot of difference, as the GitHub repositories will remain open and they were hardly updated in the last 2 years.

Do keep in mind to read carefully through the license agreements and the conditions for use. I think I quoted from them in this topic some 8 months ago.


Yes, as you need their servers to train your personal wake word; once they shut down you will be stuck with their few universal wake words to choose from. But I just read some great news on the Rhasspy forum from the developer behind Rhasspy and voice2json:

He is working not just on integrating DeepSpeech as an option/alternative to Kaldi for STT, but also on integrating Mycroft Precise (https://github.com/MycroftAI/mycroft-precise) as a wake word engine, which would make for the first truly open-source alternative in the field of wake word engines.


Hi Johannes,
You got my attention... Thanks a lot for sharing this knowledge with us!
But unfortunately every community has a guy that keeps asking questions until you get nuts. In this case that will be me :wink:

I see a lot of terminology that I'm not familiar with:

I don't get this: voice2json seems to support 16 languages (among others Dutch, for me), but if I'm not mistaken that Zamia site only supports 3 languages. What then is the relation between them?

Kaldi is being used for the speech to text recognition, but is it also used for the training?
And how much memory does your RPi 4 have?

This is the first time that I have heard about sox... Do you know whether there are any advantages to using it, e.g. compared to ffmpeg? Or do you just use it because voice2json also uses it, in case the wav file format is not as expected?

I know that Mozilla is capturing voices in lots of languages via their Common Voice project. However, when reading this GitHub issue, they seem not to be very enthusiastic about delivering pretrained models in other languages. Does this mean that voice2json would then start limiting the number of languages it supports?
And is there any reason to replace Kaldi by DeepSpeech? I see that Kaldi is rather old, but it seems to have a large community, so I assume DeepSpeech is better?

From your enthusiasm I assume that this is a good evolution? I don't find anything about e.g. wasm, so I assume we could not use this in the future in a UI node (for the dashboard) to do wake word detection in the browser?

Thanks a lot !!!!

1st sox
I use sox because it's a great command-line tool for audio processing on Linux. They call themselves the swiss army knife of audio tools. It's just really fast and reliable and has been around for a long time. I was actually using it before I did anything with speech to text. I haven't used ffmpeg much, so I can't say much about how they compare.

2nd kaldi
Kaldi is by itself a project that provides software for speech researchers working on STT. It's actually quite the opposite of end-user friendly, but it gives great results with the right model, even on limited hardware.
It's a lot more accurate than older projects like CMU Sphinx/pocketsphinx.
But Kaldi is worthless if you don't have an acoustic model trained for it. Kaldi doesn't provide any by itself I think, except maybe an English one. But fortunately there are people out there who train those models, researchers and enthusiasts alike.
One of those is the team behind the Zamia project. They train mainly German and French models for the Kaldi engine. But there are others, like the people from the CMU project, who also train models for Kaldi. So there are different sources for Kaldi models.

3rd profiles in voice2json
Once you have installed voice2json you can download as many profiles as you want, for different languages and using different speech to text engines. Each profile includes things like the dictionary, the acoustic model, a base language model and a sentences.ini in which you put the sentences you want to train the model on. As you can specify which profile to use for each command-line tool of voice2json, and each profile is self-contained including its training data, you could use different languages, or even the same language with different STT engines, at the same time. So it's a very flexible approach.
So DeepSpeech wouldn't replace anything; there would for example be three profiles for German to choose from (pocketsphinx, Kaldi/Zamia or DeepSpeech) and you could use the one best suited for your hardware and needs. So adding DeepSpeech would just mean more choices.

4th deepspeech models
Mozilla only provides a prebuilt English model right now, but they say they will release more as the Common Voice data grows. For German, for example, there is a researcher who wrote a paper about building an end-to-end speech processing pipeline using DeepSpeech and about training a model from a number of publicly available training data sets, including the Common Voice set.
Fortunately he published the resulting model on GitHub (actually he just published a 2nd revision, as they are writing another paper).
So this is for example a candidate for a german deepspeech voice2json profile.

5th training
voice2json uses things like KenLM, OpenFst, scripts provided by the Zamia project, scripts and tools included in the different language processing toolkits, and a lot of its own scripts to adapt the language models to your domain-specific training set. That's the great thing about voice2json, I think. You don't have to install KenLM and make a language model and a trie and compile it to a binary or an FST. You don't have to write dictionaries and understand how each engine works. voice2json abstracts this away from you and unifies the training process, no matter which profile you choose and whatever the underlying specialities of its STT engine are.

6th mycroft precise
No, I don't think you can run it in a browser. It's in its early stages right now, and the last time I looked at it, it was quite complicated to try and train your own wake word with it. I know that the person behind Rhasspy and voice2json is in talks with them to maybe create something of a library where people could share their trained wake word models with each other and also add training data to enhance certain models, but this is all in the future for now.
Right now Snowboy or Porcupine will still be the best options.
But I'm hoping that with this cooperation and more interest in it, Precise will develop into a true alternative.

7th my Raspberry Pi is the 2 GB model and it's overclocked to 2 GHz

I hope this answers some of your questions, Johannes


Absolutely. Thanks for this detailed introduction!!!!
So there is a LOT going on under the covers of voice2json, since it seems to be a kind of glue between a lot of other projects. It is a good thing that there are pre-built binaries available, because building such a large group of dependent programs might result in quite some compilation errors on some platforms.
Please keep us updated of your nice work!


@JGKK Thank you Johannes.

I have been 'playing' with Rhasspy, but it is becoming very integrated with MQTT and Hermes. This looks like a much lighter-weight alternative, and I will have a go thanks to your explanations.

Much appreciated.
