Ways to control Node-RED with speech commands

Thanks a lot for sharing this! It looks very interesting. I should give it a try.

Did you also use this for hotword detection, and if so, how exactly?

No, I use snowboy for hotword detection.
This is my workflow:
Download snowboy for Python (https://snowboy.kitt.ai/docspartials/docs/index.html#downloads) and get a hotword file. You can see an example here:
https://pimylifeup.com/raspberry-pi-snowboy/
Then I wrote a Python script like this, called hotword.py, in my snowboy folder:

import snowboydecoder
import sys

# personal hotword model trained on snowboy.kitt.ai
models = ['/home/pi/hey_pips.pmdl']

# the sensitivity comes in from the Node-RED exec node as the first argument
detector = snowboydecoder.HotwordDetector(models, sensitivity=float(sys.argv[1]))

def callbackfunc():
    # print a marker for the switch node to react to, then
    # release the audio resources so sox can record afterwards
    print("Hotword")
    detector.terminate()
    sys.exit(0)

# block and listen until the hotword is heard
detector.start(detected_callback=callbackfunc,
               sleep_time=0.03)

I run this from Node-RED in an exec node using python -u ~/snowboy/hotword.py with "append msg.payload" ticked, which is how the hotword sensitivity gets passed from Node-RED via sys.argv. When a hotword is detected the script prints Hotword and releases the audio resources. This triggers a second exec node in Node-RED that uses sox (http://sox.sourceforge.net/) to record audio in the right format.
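One small note: as written, the script will error out if nothing is appended to the command. If you want it to be forgiving, you could guard the argument read by replacing the detector line in hotword.py above (the 0.4 default here just mirrors the set sensitivity node in the flow further down):

# hypothetical guard: fall back to a default sensitivity when
# the exec node doesn't append a payload
sensitivity = float(sys.argv[1]) if len(sys.argv) > 1 else 0.4
detector = snowboydecoder.HotwordDetector(models, sensitivity=sensitivity)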
The sox command in the exec node actually does a little bit more:

sox -t alsa default -r 16000 -c 1 -b 16 ~/asr.wav silence -l 0 1 2.0 2.0% trim 0 6 vad -p 0.2 reverse vad -p 0.6 reverse
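Broken down flag by flag, here is a commented sketch of the same call from Python, for anyone who wants to poke at it outside Node-RED (the hardcoded /home/pi path stands in for ~, which would not be expanded without a shell):

import subprocess

subprocess.run([
    "sox",
    "-t", "alsa", "default",   # capture from the default ALSA device
    "-r", "16000",             # 16 kHz sample rate,
    "-c", "1",                 # mono,
    "-b", "16",                # 16-bit samples
    "/home/pi/asr.wav",        # output wav file
    "silence", "-l",           # stop on silence (-l truncates rather than fully removes it)
    "0",                       # don't trim anything from the start
    "1", "2.0", "2.0%",        # stop after 2 s below the 2% volume threshold
    "trim", "0", "6",          # hard cap the recording at 6 seconds
    "vad", "-p", "0.2",        # voice activity detection trims the front (0.2 s pre-roll)
    "reverse",                 # flip the audio...
    "vad", "-p", "0.6",        # ...so vad can trim the (now leading) end
    "reverse",                 # and flip it back
])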

The sox command only records until it detects silence, for a maximum of 6 seconds, and then applies additional vad to trim silence/noise off the start and end. Once the recording is finished it restarts the hotword detection and, in my case, triggers the speech to text using voice2json, or on my satellites sends the recorded wav as a buffer over MQTT to do the stt on my main Node-RED instance. These are the basics of my hotword stuff as a flow (in my actual flow there is extra stuff, like LEDs on the mic array and working out which hotword came first across several satellites):

[{"id":"ce732054.cde6c","type":"exec","z":"ed7ded5f.9a395","command":"python -u ~/snowboy/hotword.py ","addpay":true,"append":"","useSpawn":"false","timer":"","oldrc":false,"name":"hotword.py","x":590,"y":660,"wires":[["6e159b37.21f8dc"],[],[]]},{"id":"a20a43a6.91b2","type":"inject","z":"ed7ded5f.9a395","name":"","topic":"","payload":"","payloadType":"date","repeat":"","crontab":"","once":false,"onceDelay":"30","x":220,"y":660,"wires":[["17f2a8f4.42c1cf"]]},{"id":"6e159b37.21f8dc","type":"switch","z":"ed7ded5f.9a395","name":"hotword?","property":"payload","propertyType":"msg","rules":[{"t":"regex","v":"Hotword","vt":"str","case":false}],"checkall":"true","repair":false,"outputs":1,"x":760,"y":660,"wires":[["2114741d.a8fd24"]]},{"id":"2114741d.a8fd24","type":"exec","z":"ed7ded5f.9a395","command":"sox -t alsa default -r 16000 -c 1 -b 16 ~/asr.wav silence -l 0 1 2.0 2.0% trim 0 6 vad -p 0.2 reverse vad -p 0.6 reverse","addpay":false,"append":"","useSpawn":"false","timer":"","oldrc":false,"name":"record","x":910,"y":660,"wires":[["17f2a8f4.42c1cf"],[],[]]},{"id":"17f2a8f4.42c1cf","type":"change","z":"ed7ded5f.9a395","name":"set sensitivity","rules":[{"t":"set","p":"payload","pt":"msg","to":"0.4","tot":"str"}],"action":"","property":"","from":"","to":"","reg":false,"x":400,"y":660,"wires":[["ce732054.cde6c"]]}]

Johannes

3 Likes

@janvda Unfortunately snowboy is shutting down all operations in December. There is porcupine by Picovoice, which you can use directly from voice2json (http://voice2json.org/commands.html#wait-wake), but if you want to use personal models you have to retrain them every 30 days on their website.

1 Like

Thanks for informing us about snowboy.

Not sure if the shutdown they announce in the readme really makes a lot of difference, as the GitHub repositories will remain open and they were hardly updated in the last 2 years.

Do keep in mind to read carefully through the license agreements and the conditions for use. I think I quoted from them in this topic some 8 months ago.

2 Likes

Yes, as you need their servers to train your personal wakeword: once they shut down you will be stuck with their few universal wake words to choose from. But I just read some great news on the Rhasspy forum, from the developer behind Rhasspy and voice2json:


he's working not just on integrating deepspeech as an option/alternative to Kaldi for stt, but also on integrating Mycroft Precise (https://github.com/MycroftAI/mycroft-precise) as a wakeword engine, which would make for the first truly open source alternative in the field of wakeword engines.

4 Likes

Hi Johannes,
You got my attention... Thanks a lot for sharing this knowledge with us!
But unfortunately every community has a guy that keeps asking questions until you go nuts. In this case that will be me :wink:

I see a lot of terminology that I'm not familiar with:

I don't get this. Voice2json seems to support 16 languages (among others Dutch, for me), but if I'm not mistaken the Zamia site only supports 3 languages. What then is the relation between them?

Kaldi is being used for the speech to text recognition, but is it also used for the training?
And how much memory does your RPi 4 have?

This is the first time I have heard about sox... Do you know whether there are any advantages to using it, e.g. compared to ffmpeg? Or do you just use it because voice2json also uses it, in case the wav file format is not as expected?

I know that Mozilla is capturing voices in lots of languages via their Common Voice project. However, reading this GitHub issue, they seem not to be very enthusiastic about delivering pretrained models in other languages. Does this mean that voice2json would then start limiting the number of languages it supports?
And is there any reason to replace Kaldi with Deepspeech? I see that Kaldi is rather old but seems to have a large community, so I assume Deepspeech is better?

From your enthusiasm I assume that is a good evolution? I don't find anything about e.g. wasm, so I assume we could not use this in the future in a UI node (for the dashboard) to do wake word detection in the browser?

Thanks a lot!!!!

1st sox
I use sox because it's a great command line tool for audio processing on Linux. They call themselves the swiss army knife of audio tools. It's just really fast and reliable and has been around for a long time. I was actually using it before I did anything with speech to text. I haven't used ffmpeg much so I can't say much about how they compare.

2nd kaldi
Kaldi is by itself a project that provides software for speech researchers working on stt. It's actually quite the opposite of end-user friendly, but it gives great results with the right model, even on limited hardware.
It's a lot more accurate than older projects like cmu sphinx/pocketsphinx.
But Kaldi is worthless if you don't have an acoustic model trained for it. Kaldi doesn't provide any by itself I think, except maybe an English one. But fortunately there are people out there who train those models, researchers and enthusiasts alike.
One of those is the people behind the Zamia project. They mainly train German and French models for the Kaldi engine. But there are other people, like the ones from the CMU project, who also train models for Kaldi. So there are different sources for Kaldi models.

3rd profiles in voice2json
Once you have installed voice2json you can download as many profiles as you want, for different languages and using different speech to text engines. Each profile includes things like the dictionary, the acoustic model, a base language model and a sentences.ini in which you put the sentences you want to train this model on. As you can specify which profile to use for each command line tool of voice2json, and each profile is self contained including its training data, you could use different languages, or even the same language with different stt engines, at the same time. So it's a very flexible approach.
So deepspeech wouldn't replace anything; there would for example be three profiles for German to choose from (pocketsphinx, Kaldi/Zamia or deepspeech) and you could use the one best suited to your hardware and needs. Adding deepspeech would just mean more choices.
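To make that concrete, here's a minimal sketch of calling voice2json from Python with an explicit profile per call (the profile paths are made-up examples; transcribe-wav prints one JSON object per input wav, per the voice2json docs):

import json
import subprocess

def transcribe(wav_path, profile):
    # transcribe-wav prints a JSON object containing the transcription
    result = subprocess.run(
        ["voice2json", "--profile", profile, "transcribe-wav", wav_path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)

# the same recording run through two hypothetical profiles side by side
print(transcribe("asr.wav", "/home/pi/profiles/de_kaldi-zamia")["text"])
print(transcribe("asr.wav", "/home/pi/profiles/de_pocketsphinx")["text"])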

4th deepspeech models
Mozilla only provides a prebuilt English model right now, but they say they will release more as the Common Voice data grows. For German, for example, there is a researcher who wrote a paper about building an end-to-end speech processing pipeline using deepspeech, and about training a model from a number of publicly available training data sets, including the Common Voice set.
Fortunately he published the resulting model on GitHub (actually he just published a 2nd revision, as they are writing another paper).
So this is for example a candidate for a German deepspeech voice2json profile.

5th training
voice2json uses things like KenLM, OpenFST, scripts provided by the Zamia project, scripts and tools included in the different language processing toolkits, and a lot of its own scripts to adapt the language models to your domain-specific training set. That's the great thing about voice2json I think: you don't have to install KenLM, make a language model and a trie and compile it to a binary or an fst, you don't have to write dictionaries, and you don't have to understand how each engine works. voice2json abstracts this away from you and unifies the training process, no matter which profile you choose and whatever the underlying specialities of its stt engine are.
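For a feel of what the training input looks like: sentences.ini is just a list of intents with example sentences under them (the intent names here are made up; the format is the one described in the voice2json docs):

[LightOn]
turn on the living room lamp
switch the light on

[GetTime]
what time is it
tell me the time

After editing it you rerun voice2json train-profile, and everything described above happens behind the scenes.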

6th mycroft precise
No, I don't think you can run it in a browser. It's in its early stages right now, and the last time I looked at it, it was quite complicated to try and train your own wakeword with it. I know that the person behind Rhasspy and voice2json is in talks with them to maybe create something like a library where people could share their trained wakeword models with each other, and also add training data to enhance certain models, but this is all in the future for now.
Right now Snowboy or Porcupine will still be the best options.
But I'm hoping that with this cooperation and more interest in it, Precise will develop into a true alternative.

7th my Raspberry Pi is the 2 GB model and it's overclocked to 2 GHz

I hope this answers some of your questions, Johannes

5 Likes

Absolutely. Thanks for this detailed introduction!!!!
So there is a LOT going on under the covers of voice2json, since it seems to be a kind of glue between a lot of other projects. It is a good thing that there are pre-built binaries available, because building such a large group of dependent programs might result in quite a few compilation errors on some platforms.
Please keep us updated of your nice work!

1 Like

@JGKK Thank you Johannes.

I have been 'playing' with Rhasspy, but it is becoming very integrated with MQTT and Hermes. This looks like a much lighter weight version and I will have a go thanks to your explanations.

Much appreciated.

1 Like

Yes, that is what made me choose voice2json over Rhasspy as well. I had already built a complete assistant with Node-RED as the backbone that tied everything together and did the intent handling, with pocketsphinx for the stt part. Rhasspy is a lot more complex and self contained, while voice2json works nearly like a Node-RED contrib.
I thought about making some subflows that package an exec node and a JSON parser with some environment variables to set the profile path. This would make voice2json feel very integrated with Node-RED, as you could have a subflow node for each voice2json command that you can just drop into your Node-RED flows.

5 Likes

@JGKK Hi Johannes,

Your setup sounds great. I was using snips.ai for some time but was searching for a solution just like the one you describe here.
I'm in the process of setting up a test satellite at the moment. Your example workflow was perfect to get started. Would you have a similar starting point for your voice2json-flow? That could be really helpful.

Kind regards
Simon

Hello,
@BartButenaers and I are working on something interesting in this direction: a simple Node-RED wrapper for some of the voice2json functionality. It's not quite ready for prime time, but here are some teaser images:

[teaser image]
So if you are a little patient... :upside_down_face:

8 Likes

Looks very interesting :clap: :clap: :clap:. Will hotword detection be part of this?

@JGKK That looks very cool. I'll certainly be patient and try to prepare my satellites. Would you have an example, or know of any documentation, for how to send the recorded wav files via MQTT? I never sent anything but JSON data...
Maybe I can also help with the documentation of the voice2json nodes. It could make sense in English and German :wink:

Looks great, I'd help with beta testing as well :slight_smile:
Kind regards,
Simon

Not as of now; I might implement it at a later point. Voice2json only supports Porcupine right now. There might be Mycroft Precise support in a future version. Right now I think snowboy, with the approach I described above, is still the best way to do hotword detection from Node-RED, at least until they shut down at the end of the year.
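If you do want to try the Porcupine route via voice2json in the meantime, its wait-wake command (linked above) listens to the microphone and prints a line of JSON per detection. A rough sketch of driving it from a script (or an exec node set to spawn mode) could look like this, assuming a default profile is already set up:

import subprocess

# run voice2json's wake word listener as a long-lived process
proc = subprocess.Popen(
    ["voice2json", "wait-wake"],
    stdout=subprocess.PIPE, text=True,
)
for line in proc.stdout:
    # each stdout line is a JSON detection event
    print("wake word detected:", line.strip())
    proc.terminate()   # free the microphone, e.g. for the sox recording step
    break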

Right now we are a little way from the beta but we will definitely post here when there is something worth trying.
For now the documentation will be in English, but when that is done we can think about a German translation. Hello from Berlin :wave:
For sending wavs over MQTT, just use a file-in node connected to an mqtt-out node, with the file node set to send the file as a single buffer object.
On the other side it's the other way around: an mqtt-in node connected to a file node set to write the buffer from the received msg.payload to a file.
But I'd really recommend reading and writing the files on both sides from a folder mounted as tmpfs, to minimize SD card wear.
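If you want to sanity-check the wav transport outside of Node-RED first, here is the same idea as a Python sketch using paho-mqtt (the broker address, topic and file paths are made-up examples; ideally the paths sit on a tmpfs mount, as suggested above):

import paho.mqtt.client as mqtt
import paho.mqtt.publish as publish

BROKER = "192.168.1.10"       # hypothetical broker address
TOPIC = "satellite1/audio"    # hypothetical topic

# satellite side: publish the recorded wav as a single binary payload
def send_wav(path="/tmp/asr.wav"):
    with open(path, "rb") as f:
        publish.single(TOPIC, payload=f.read(), hostname=BROKER)

# main instance side: write each received buffer back out as a wav file
def on_message(client, userdata, msg):
    with open("/tmp/asr.wav", "wb") as f:
        f.write(msg.payload)

def receive_forever():
    client = mqtt.Client()
    client.on_message = on_message
    client.connect(BROKER)
    client.subscribe(TOPIC)
    client.loop_forever()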

1 Like

@JGKK Hi Johannes,
That works perfectly fine, thank you for the help. I had never used the file node; it simply did not occur to me. Great stuff, this is very promising. Snowboy works pretty well so far, and I'm really excited to try out your wrapper nodes.
Maybe I'll try using voice2json "raw" until then. Let's see how easy this is...
Thank you for the inspiration. If I can be of some help, just let me know.

Voice2json is easy to use even without our wrappers.
The beta for 2.0 of voice2json is out now. Mike updated the docs to include the features of 2.0 and all the download links now point to 2.0 packages.
There are a lot of performance improvements and a few new features, so have fun exploring.

2 Likes

You are right @JGKK, it's a pretty good learning curve.
The only issue I'm running into is getting voice2json to run in my production environment, as I have Node-RED running with the official Docker image, which is based on Alpine.
I have difficulties building a custom image on top of that.
Do you have your setup running in Docker?