[Announce] node-red-contrib-voice2json (beta)

This tiny flow chart I made pretty much summarizes all the ways you can pass audio to the nodes, in which formats, and at what points.
Yes, for example I run wake word and record command on one Pi and then send the wav buffer over MQTT to my base for stt. I also have a Siri shortcut on my phone that records audio and sends it to Node-RED via an HTTP request, where it gets converted to the right format and fed to the stt node. It's all in the documentation :wink: but yes, you can use any of the nodes by themselves.


Yes, as all the processing happens in voice2json and not in Node-RED. Node-RED just provides the interface to easily tie the components together, so the nodes depend on a voice2json installation. Fortunately voice2json is very straightforward to install. It's really not that complicated, and Bart is working on an easy step-by-step tutorial right now to give an easy starting point.


I am using Rhasspy running on a separate RPi 3+ integrated with Node Red through MQTT, which then drives an automation controller.

I have used both the ReSpeaker Mic 2 and Mic 4 arrays. The Mic 2 works OK up to about 2 m away, but is not very immune to other sounds around it. The Mic 4 has better discrimination, and is currently just over 4 m from where we sit.

If the music/TV/Film is not too loud (as in a liveable volume rather than 'listening' volume :rofl:), it will pick up commands OK.

The Respeaker Mic 2 will go onto the Node Red unit I will test this node on.


Then you know what to expect, as the underlying libraries and the sentence syntax are pretty much the same as in Rhasspy 2.5, just a lot more modular. I hope it all works well, as I kept Mike very busy when beta testing for this release of voice2json and snuck in a feature or two and a few pull requests just to make the nodes work better :shushing_face:

I think Mike and the team have done a great job with 2.5.

Sure made me look more into MQTT instead of using Web Sockets. Glad I did.

I can't speak for Rhasspy, but I'm very happy with how voice2json 2.0 turned out, with features like the Precise integration and the slot programs. I'm in close contact with Mike, and these nodes will probably become part of the official voice2json documentation sometime in the future when there is a non-beta release.


This is fantastic - kudos to both Johannes & Bart! In under 30 minutes I was able to get a local voice agent added to my home & work demo setup. Currently running on a Pi 3+, but have already done a quick test on my 4 - noticeably faster performance so will be redeploying there. Of course I spent another hour playing with custom voice wake words -- but who wouldn't? Thank you to you both!


Thank you :pray:
Yes, the Pi 4 is noticeably faster, but I'm still surprised how usable a Pi 3B is.
There is one caveat right now when using Docker. I just found a bug: when you try to restart the wait-wake node it will hang and leave orphaned processes behind that can only be killed from htop.
Don't use wait-wake with Docker installs right now. If you do, be aware that you can't properly stop the process. This also applies when you redeploy or restart the flows or stop Node-RED: it will work, but you will have to kill the orphaned process manually from htop.
If possible just use the deb package, as this has none of those problems.

I'm stuck at this step of the voice2json installation:
" * Download a profile and extract it to $HOME/.config/voice2json"

After I installed the deb package, none of these directories exist on my raspi.
On the CLI, voice2json seems to work... or not?

pi@raspi4B:/usr/lib/voice2json/voice2json $ voice2json test-examples
WARNING:voice2json:/home/pi/.config/voice2json/profile.yml does not exist. Using default settings.
CRITICAL:voice2json.core:Missing /home/pi/.config/voice2json/intent.pickle.gz. Did you forget to run train-profile?
Traceback (most recent call last):
  File "__main__.py", line 6, in <module>
  File "asyncio/runners.py", line 43, in run
  File "asyncio/base_events.py", line 587, in run_until_complete
  File "voice2json/__main__.py", line 73, in main
  File "voice2json/test.py", line 27, in test_examples
AssertionError: Not trained
[13649] Failed to execute script __main__
pi@raspi4B:/usr/lib/voice2json/voice2json $ 

Did you download a language profile from the link in the documentation about language profiles after installing voice2json?:

The language profile is the folder whose path you have to enter in the config. This is one of the major principles of voice2json: all the data that belongs to a language comes in a downloadable profile, so you could even have more than one profile for the same language on one machine. To use it you first have to extract it, as it comes as a tar.gz compressed file. Then you need to do an initial training before you can start with any of the examples. I really recommend you read the documentation about the config node thoroughly next, as that is the most complicated part. Your voice2json installation looks good :+1:
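
For anyone who prefers to script the two steps above (extract the profile, then train), here is a minimal Python sketch. It is only an illustration under assumptions: the function names are made up for this post, the archive layout may differ per profile (some archives may unpack into their own sub-folder), and the `--profile` option is my understanding of the voice2json CLI, so please check `voice2json --help` on your install.

```python
import tarfile
from pathlib import Path
from typing import List


def extract_profile(archive: Path, target_dir: Path) -> Path:
    """Unpack a downloaded profile archive (tar.gz) into target_dir."""
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target_dir)
    return target_dir


def train_command(profile_dir: Path) -> List[str]:
    """Build the initial training command for a profile folder.

    Assumption: the CLI accepts `--profile <dir>` before the sub-command.
    """
    return ["voice2json", "--profile", str(profile_dir), "train-profile"]
```

You would call `extract_profile(...)` with the path of your downloaded archive, then pass the result of `train_command(...)` to `subprocess.run(..., check=True)`. The folder you extracted to is the path that goes into the config node.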


Mike from voice2json just accepted a pull request from me to the docker run script that should fix any issues with wait-wake. If anybody uses the Docker install, just update your existing script to the new one in the documentation.

I can only spend limited time to get it to run, here is what I did and what's my current state:

  • I use Tasker to send my voice / commands to the raspi. I used your example flow. The sox conversion to wav works well, at least in my tests where I wrote the file to the raspi SD card and played it with VLC. (http://puu.sh/G3e1I/d065b5d3a2.png)
  • I installed the DeepSpeech German profile package (using the official doc). "train-profile" worked, while "voice2json transcribe-stream" threw an "ALSA Error", probably because I don't have a microphone for my raspi (http://puu.sh/G3e2w/79a3b6a846.png). So I tried "voice2json transcribe-wav /home/pi/Downloads/test.wav", where test.wav is a wav file I saved with the sox converter. This throws the same error (http://puu.sh/G3e5q/55cf914844.png) as in my last post AND as I get when I use the sox-convert-example, which I configured to use the DeepSpeech sentences.ini. (I have to mention that I didn't add slots to the slots tab... is this necessary at this stage of testing?)
    I am very sorry if I made some dumb mistakes. I hope I'm getting closer to making it work :slight_smile:

I think the DeepSpeech model for German is not actually compatible with a Pi, because afaik voice2json uses this model: https://github.com/AASHISHAG/deepspeech-german which will not work on ARM. Unfortunately DeepSpeech works with two different model formats, and one of them doesn’t work on ARM due to missing TensorFlow features or something. I think only the English model works on the Pi for DeepSpeech. I can’t really recommend it for now anyway; see this post, Ways to control Node-RED with speech commands, where I talk about it in more detail. For German I really recommend the Kaldi profile, which I also use personally as it’s really fast. It’s the model from https://github.com/gooofy/zamia-speech which is a great project that has some of the best open source models for Kaldi in German and French.

If you want to use transcribe-stream from the command line, which our nodes don’t support, you will actually need a microphone and a properly configured asound.conf file.
We didn’t implement transcribe-stream, which is a combination of record-command and stt, because it didn’t actually offer any performance improvement. Every time you start transcribe-stream it has to load all the libraries for the stt part again, which is one of the biggest slowdown factors. I actually implemented a little trick in our stt node that wouldn’t work in transcribe-stream: we keep stt (transcribe-wav) running with all libraries already loaded once your first transcription is done. It will just idle in the background, but this way every subsequent transcription is a lot faster. It’s also a lot more stable and modular with record command and stt separated.
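
To illustrate why keeping things loaded helps, here is a small Python sketch of the general "load once, reuse afterwards" pattern. It is not the node's actual code (the node keeps a voice2json process running in the background); `WarmTranscriber` and `fake_loader` are hypothetical names, and the loader is a stand-in for the expensive library and model start-up.

```python
from typing import Callable, Optional

Model = Callable[[bytes], str]


class WarmTranscriber:
    """Pays the expensive load cost on the first call only, then reuses the model."""

    def __init__(self, load_model: Callable[[], Model]) -> None:
        self._load_model = load_model
        self._model: Optional[Model] = None
        self.load_count = 0  # how often the expensive start-up actually ran

    def transcribe(self, wav_bytes: bytes) -> str:
        if self._model is None:
            # First transcription: load everything and keep it around.
            self._model = self._load_model()
            self.load_count += 1
        return self._model(wav_bytes)


def fake_loader() -> Model:
    """Hypothetical stand-in for loading the speech-to-text libraries."""
    return lambda wav_bytes: "transcribed %d bytes" % len(wav_bytes)
```

With `stt = WarmTranscriber(fake_loader)`, calling `stt.transcribe(...)` many times runs the loader only once, which is the effect described above for every transcription after the first.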

No, you don’t need any slots at all. You could write everything just in the sentences.ini.

I hope this answers all of your questions for now.


By the way, if you find the time to take a few screenshots and explain how you set up Tasker to record audio and send it to Node-RED, feel free to send it to me or do a pull request on our documentation. This would be a very valuable addition to the possible setups documentation, as I unfortunately don't have any Android devices. :grinning:

I really recommend the Kaldi profile which i also use personally as it’s really fast.

I installed it and it seems to work very well. I guess it's not possible to recognize speech which is not defined in the sentences.ini?

I actually implemented a little trick in our stt node that wouldn’t work in transcribe-stream.

As soon as I have a microphone connected to my raspi I will try that out. Btw: is that the (recommended) way it works with a wake word then?

I hope this answers all of your questions for now.
It did! Many thanks again for your patience!

By the way, if you find the time to take a few screenshots and explain how you set up Tasker to record audio and send it to Node-RED, feel free to send it to me or do a pull request on our documentation. This would be a very valuable addition to the possible setups documentation, as I unfortunately don't have any Android devices.

I will write up some documentation which you can use any way you like. Currently the Tasker app records for a fixed time, which is not optimal. I would like to figure out whether silence detection is possible, or a better solution like holding a button for the duration of the recording, or something similar.

This depends. Have a read of the first two chapters in this section of our documentation and the included links: https://github.com/johanneskropf/node-red-contrib-voice2json#advanced-topics

All the nodes are made to work together without much additional configuration. You send a stream of raw audio buffers to the wait wake node, and as soon as it detects a wake word it will forward the buffers to its second output, if set that way.
You connect that to record command, which will, as soon as it detects no more voice activity, send a single wav buffer that can be fed straight to the stt node (don’t forget to use a change node to set the wait wake node back to listen mode so it stops forwarding the audio). The stt node emits a transcription, and that can be fed straight to the tti node for intent extraction.
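
The hand-off described above can be mimicked with a toy simulation in Python. This is only a sketch of the message flow, not how the nodes are implemented: `WakeGate`, `CommandRecorder` and `run_pipeline` are hypothetical names, wake word and voice activity detection are replaced by flags attached to each chunk, whether the real node forwards the chunk containing the wake word is not modelled, and the real record command node emits a wav buffer rather than plain joined bytes.

```python
from typing import List, Optional, Tuple

# (audio, contains_wake_word, contains_speech): the flags stand in for real detection
Chunk = Tuple[bytes, bool, bool]


class WakeGate:
    """Toy wait-wake stage: drops audio until a wake word, then forwards it."""

    def __init__(self) -> None:
        self.forwarding = False

    def feed(self, audio: bytes, is_wake: bool) -> Optional[bytes]:
        if self.forwarding:
            return audio
        if is_wake:
            self.forwarding = True  # later chunks go to the second output
        return None

    def listen(self) -> None:
        """Back to listen mode (the job of the change node in the flow)."""
        self.forwarding = False


class CommandRecorder:
    """Toy record-command stage: collects speech, emits one buffer on silence."""

    def __init__(self) -> None:
        self._parts: List[bytes] = []

    def feed(self, audio: bytes, is_speech: bool) -> Optional[bytes]:
        if is_speech:
            self._parts.append(audio)
            return None
        if not self._parts:
            return None  # nothing spoken yet, keep waiting
        command = b"".join(self._parts)
        self._parts = []
        return command


def run_pipeline(chunks: List[Chunk]) -> List[bytes]:
    gate, recorder = WakeGate(), CommandRecorder()
    commands: List[bytes] = []
    for audio, is_wake, is_speech in chunks:
        forwarded = gate.feed(audio, is_wake)
        if forwarded is None:
            continue
        command = recorder.feed(forwarded, is_speech)
        if command is not None:
            commands.append(command)  # this buffer would go on to stt, then tti
            gate.listen()
    return commands
```

Feeding it some noise, a wake word chunk, two speech chunks and then silence yields exactly one command buffer, and anything spoken after that is ignored until the next wake word.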

Nice thank you :pray:



I've been following Rhasspy (and a bit of voice2json) for some time because I thought it might fit perfectly with SEPIA Open Assistant especially since I saw that it supports the same Kaldi/Zamia system that is already working inside the SEPIA STT server.
The SEPIA STT server is working pretty well, but it is not very user friendly when it comes to customizing the language model (it is possible but not documented and the most painful part is to add missing words to the dictionary) and with all the other items on my to-do list for SEPIA I never really found the time to improve this.

Now I saw this Node-RED integration and all the effort that was put into improving on-device, open-source speech recognition and I was thinking about opportunities again to make SEPIA and voice2json compatible. The thing I'm most interested in is the STT node.
Since SEPIA can stream audio to any server (the SEPIA STT server is basically a small WebSocket server around Kaldi/Zamia), I was wondering what would be necessary to offer voice2json as a third speech recognition engine inside the SEPIA client app. To clarify this I'd like to ask you a few questions:

  • Does this node support live transcription? To be more precise, does it start transcribing audio as soon as the audio stream starts, or does it save/buffer the whole stream first and then start transcribing after the stream ends?
  • Does the STT server do VAD as well and send a 'stop' signal when it thinks speech input is finished?
  • What is the approximate real-time factor, meaning how long does it take from 3 s of speech input to the JSON result (just a rough estimate, let's say on an RPi 4)?

Thank you and keep up the great work :+1:


Neither right now. If you want to stream live raw audio you would have to feed it to the voice2json record-command node, which uses WebRTC VAD to determine when the speaker has finished the command and then emits a single wav buffer with the detected speech. This is what the stt node expects: a single buffer containing wav audio or a path to a wav file.

The real-time factor I see on a Pi 4 with Kaldi is about 0.7 to 1, depending on the amount of background noise. So a bit over 2 seconds for 3 seconds of input audio in the best case, and up to about 3 seconds in the worst case.
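
As a quick worked example of that arithmetic: the real-time factor is processing time divided by audio length, so the expected wait is simply the audio length multiplied by the factor. The helper name below is made up for illustration.

```python
def processing_time(audio_seconds: float, real_time_factor: float) -> float:
    """Expected transcription time for a clip, given a real-time factor."""
    return audio_seconds * real_time_factor


# The two ends of the range quoted above, for a 3 second command.
for rtf in (0.7, 1.0):
    print(f"RTF {rtf}: {processing_time(3.0, rtf):.1f} s")
```

This prints 2.1 s for a factor of 0.7 and 3.0 s for a factor of 1.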

I am working on live stream transcription with voice2json through Node-RED, but there are a couple of reasons why there is no node for that right now. There are a few bugs on the voice2json side, and especially with a voice2json Docker install there is a big speed caveat: starting up the stream transcription and loading the libraries takes long enough to nearly negate any speed advantage over the current way. But it will be implemented as a node in the future and is on my internal road map, as voice2json can in theory do it. It's just not quite ready yet.

If you want to integrate voice2json into SEPIA, probably the best way would be to write your own service in Python (or any other language, really) that interacts with it directly, as this would save a lot of overhead.

Hope this helps, Johannes

Thanks for the info.

Sorry, but I didn't quite get the argument here. Are you saying the "wav buffer" is different from the raw audio stream (=another buffer)?

That's pretty good! :slight_smile:

Somehow I was expecting this conclusion ^^ ... it's just so much fun to simply connect nodes in Node-RED, especially since I've started to build several nodes for SEPIA as well :smiley:
The text-to-intent part is probably still something that could be integrated easily into SEPIA custom services via Node-RED, but that's a story for another time :wink:

I'll try to read a bit more about voice2json and its interfaces. It would be really great to enable users to share their custom voice models between the systems.

Yes, a raw stream from a microphone is just buffers of PCM audio which have no information attached to them, like length or encoding. So any program you pass it to wouldn’t know what to do with it. Wav data on the other hand has RIFF headers embedded in it which provide all that information. That’s why, when you want to convert or play raw audio in any tool, you actually have to enter that information manually. The question is whether SEPIA is streaming the raw data from the microphone or actually streaming wav chunks that have headers?
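
To make the difference concrete, here is a minimal Python sketch that wraps raw PCM bytes in a wav container using the standard library `wave` module. The defaults (16 kHz, mono, 16-bit) are assumptions for the example and must match whatever your microphone or client actually delivers; `pcm_to_wav` is a made-up helper name.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 16000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Prepend a RIFF/WAVE header to raw PCM so other tools can interpret it."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        # This is exactly the information a raw stream does not carry.
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # bytes per sample, 2 means 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()


# One second of silence as raw 16-bit mono PCM at 16 kHz.
raw = b"\x00\x00" * 16000
wav_bytes = pcm_to_wav(raw)
print(wav_bytes[:4], len(wav_bytes) - len(raw))
```

The result starts with the `RIFF` marker and is a few dozen bytes longer than the raw input. Those extra bytes are the header that tells a player or an STT engine the sample rate, channel count and sample width.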

I really recommend you read the white paper about the voice2json pipeline by Mike, the developer:
