[Announce] node-red-contrib-voice2json (beta)

This depends. Have a read of the first two chapters in this section of our documentation and the included links: https://github.com/johanneskropf/node-red-contrib-voice2json#advanced-topics

All the nodes are made to work together without much additional configuration. You send a stream of raw audio buffers to the wait wake node, and as soon as it detects a wake word it will forward the buffers to its second output, if set that way.
You connect that output to the record command node, which will send a single wav buffer as soon as it detects no more voice activity. That buffer can be fed straight to the stt node (don’t forget to use a change node to set the wait wake node back to listen mode so it stops forwarding the audio). The stt node emits a transcription, which can be fed straight to the tti node for intent extraction.
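To make the hand-off between the nodes a bit more concrete, here is a minimal sketch of the same flow in plain Python. It is only a toy model and not the code of the nodes: `WakeGate`, `run_pipeline`, the wake word check and the silence check are all hypothetical stand-ins, and whether the real wait wake node forwards the chunk containing the wake word itself is not modelled here.

```python
from typing import Callable, Iterable, List, Optional


class WakeGate:
    """Toy stand-in for the wait wake node (hypothetical, not the real node)."""

    def __init__(self, is_wake_word: Callable[[bytes], bool]) -> None:
        self.is_wake_word = is_wake_word
        self.forwarding = False

    def feed(self, chunk: bytes) -> Optional[bytes]:
        # In forward mode every chunk is passed on to the next stage.
        if self.forwarding:
            return chunk
        # In listen mode chunks are only inspected for the wake word.
        if self.is_wake_word(chunk):
            self.forwarding = True
        return None

    def listen(self) -> None:
        # Same role as the change node that resets wait wake to listen mode.
        self.forwarding = False


def run_pipeline(
    chunks: Iterable[bytes],
    gate: WakeGate,
    is_silence: Callable[[bytes], bool],
) -> List[bytes]:
    """Collect one buffer per spoken command, like record command would."""
    commands: List[bytes] = []
    current: List[bytes] = []
    for chunk in chunks:
        forwarded = gate.feed(chunk)
        if forwarded is None:
            continue
        if is_silence(forwarded):
            # End of speech: emit a single buffer and go back to listening.
            commands.append(b"".join(current))
            current = []
            gate.listen()
        else:
            current.append(forwarded)
    return commands


stream = [b"noise", b"WAKE", b"turn", b"on", b"SIL",
          b"ignored", b"WAKE", b"off", b"SIL"]
gate = WakeGate(lambda c: c == b"WAKE")
commands = run_pipeline(stream, gate, lambda c: c == b"SIL")
```

Each entry in `commands` corresponds to the single buffer that would go on to the stt node, and audio arriving between commands is dropped because the gate is back in listen mode.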

Nice thank you :pray:



I've been following Rhasspy (and a bit of voice2json) for some time because I thought it might fit perfectly with SEPIA Open Assistant especially since I saw that it supports the same Kaldi/Zamia system that is already working inside the SEPIA STT server.
The SEPIA STT server is working pretty well, but it is not very user friendly when it comes to customizing the language model (it is possible but not documented and the most painful part is to add missing words to the dictionary) and with all the other items on my to-do list for SEPIA I never really found the time to improve this.

Now I saw this Node-RED integration and all the effort that was put into improving on-device, open-source speech recognition and I was thinking about opportunities again to make SEPIA and voice2json compatible. The thing I'm most interested in is the STT node.
Since SEPIA can stream audio to any server (the SEPIA STT server is basically a small WebSocket server around Kaldi/Zamia) I was wondering what would be necessary to offer voice2json as a third speech recognition engine inside the SEPIA client app. To clarify this I'd like to ask you a few questions:

  • Does this node support live transcription? To be more precise, does it start transcribing audio as soon as the audio stream starts, or does it save/buffer the whole stream first and then start transcribing after the stream ends?
  • Does the STT server do VAD as well and send a 'stop' signal when it thinks speech input is finished?
  • What is the approximate real-time factor, meaning how long does it take from 3s of speech input to the JSON result (just a rough estimate, let's say on an RPi4)?

Thank you and keep up the great work :+1:


Neither right now. If you want to stream live raw audio you would have to feed it to the voice2json record-command node, which uses WebRTC VAD to determine when the spoken command has finished and then emits a single wav buffer with the detected speech. This is what the stt node expects: a single buffer containing wav audio or a path to a wav file.

The real-time factor I see on a Pi 4 with Kaldi is about 0.7 to 1, depending on the amount of background noise. So between a bit over 2 seconds and about 3 seconds for 3 seconds of input audio.
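As a quick worked example of that arithmetic: the real-time factor is just processing time divided by audio duration, so the expected wait follows directly from it. The function name below is made up for illustration.

```python
def processing_time(audio_seconds: float, real_time_factor: float) -> float:
    """Estimated transcription time for a clip of the given length."""
    return audio_seconds * real_time_factor


# The range quoted above for a Pi 4 with Kaldi, applied to a 3 second command.
best_case = processing_time(3.0, 0.7)
worst_case = processing_time(3.0, 1.0)
print(f"{best_case:.1f}s to {worst_case:.1f}s")
```

Anything below a factor of 1 means the transcription finishes faster than the command took to speak.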

I am working on live stream transcription with voice2json through Node-RED, but there are a couple of reasons why there is no node for that right now. There are a few bugs on the voice2json side, and especially when used with a voice2json docker install there is a big speed caveat: starting up the stream transcription and loading the libraries takes long enough to nearly negate any speed advantage over the current way. But it will be implemented as a node in the future and is on my internal roadmap, as voice2json can in theory do it. It is just not quite ready yet.

If you want to integrate voice2json into SEPIA, probably the best way would be to write your own service in Python (or any other language, really) that interacts with it, as this would save a lot of overhead.

Hope this helps, Johannes

Thanks for the info.

Sorry, but I didn't quite get the argument here. Are you saying the "wav buffer" is different from the raw audio stream (=another buffer)?

That's pretty good! :slight_smile:

Somehow I was expecting this conclusion ^^ ... it's just so much fun to simply connect nodes in Node-RED, especially since I've started to build several nodes for SEPIA as well :smiley:
The text-to-intent part is probably still something that could be integrated easily into SEPIA custom services via Node-RED, but that's a story for another time :wink:

I'll try to read a bit more about voice2json and its interfaces. It would be really great to enable users to share their custom voice models between the systems.

Yes, a raw stream from a microphone is just buffers of PCM audio which have no information attached to them, like sample rate, channel count or encoding. So any program you pass it to wouldn’t know what to do with it. Wav data, on the other hand, has RIFF headers embedded in it which provide all that information. That’s why you have to enter that information manually when you want to convert or play raw audio in any tool. The question is whether SEPIA is streaming the raw data from the microphone or whether it is actually streaming wav chunks that have headers.
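To illustrate the difference, here is a small sketch that wraps raw PCM in a wav container using only the Python standard library. The defaults of 16 kHz, 16 bit, mono are an assumption for the example, so use whatever your microphone stream actually delivers; `pcm_to_wav` is a made-up helper name.

```python
import io
import wave


def pcm_to_wav(
    pcm: bytes,
    sample_rate: int = 16000,  # assumed, must match the real stream
    channels: int = 1,         # assumed mono
    sample_width: int = 2,     # bytes per sample, 2 = 16 bit
) -> bytes:
    """Prepend a RIFF/WAVE header describing the raw PCM data."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


# One second of silence as raw PCM: nothing in these bytes says what they are.
raw = b"\x00\x00" * 16000
wav_bytes = pcm_to_wav(raw)

# Reading it back shows the header now carries the missing information.
with wave.open(io.BytesIO(wav_bytes), "rb") as check:
    info = (check.getframerate(), check.getnchannels(), check.getsampwidth())
```

The audio samples themselves are unchanged; the only addition is the header at the front that tells the receiving program how to interpret them.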

I really recommend you read the white paper about the voice2json pipeline by Mike, the developer:


@JGKK done in a hurry:
tasker tutorial
Use it however you want: edit it, copy it, whatever, for example if you want to update your voice2json documentation. No credits needed.


Hi all,

This looks like a very interesting add-on. I want to try it to send voice commands to my home automation server. I want to run it on a separate RPi just for voice recognition + mic, installed in the best place from an audio perspective.
I have one question: for the ReSpeaker there is also a 6-mic array available. Is it better than the 4-mic one? Will it work with this project? I like the idea of a flat cable going to the RPi, as I need to have it a bit separated. BTW, will it work with an RPi 3 too?

Another question - are there any housings available for ReSpeaker?


I don’t have any experience with the 6 mic, unfortunately. I have worked with the 4 mic Pi hat, the 2 mic Pi hat and the USB mic array v2.
The 2 mic is a good entry point for trying it out and for development. The 4 mic Pi hat is actually quite good in terms of performance and distance, and very easy to work with.
In production I use the USB mic, as it has by far the best performance and far-field capabilities, but unfortunately also the highest price.
So overall I think the 4 mic Pi hat gives the best balance of price versus performance, and if I were to start over I might base my whole setup on it. The only thing to keep in mind is that the 4 mic hat doesn’t have a speaker output like some of the other ReSpeaker mics.

There are several housing models that people have designed available on Thingiverse. I used some of those in the past and they were quite good. I got them 3D printed via one of the print-on-demand services like Treatstock. Just search for ReSpeaker on Thingiverse.
There is a really nice one for a Pi 3 + 4 mic Pi hat.

It will work with a Pi 3 just fine, but you will not get faster than real-time performance like on a Pi 4. So for a 3 second speech command a Pi 3 will take about 3 seconds to process it. You could also split the process into parts running on different machines. This is what I do. I run the wake word and the audio capture for the command on some Raspberry Pi 3s and then send the recorded command over MQTT to a central overclocked Pi 4 for speech to text and intent processing. This way I only needed one Pi 4 and could use the Pi 3s I still had lying around for the audio capture.

Hope this helped, Johannes


Excited to use this node you guys created. I am having a problem getting it going however. I followed the instructions and the result of the training process provides the following error:

Command failed: voice2json --profile /home/pi/Public/en-us_kaldi-zamia-2.0 train-profile
ngramcount: /lib/arm-linux-gnueabihf/libm.so.6: version GLIBC_2.27' not found (required by ngramcount)
ngramcount: /lib/arm-linux-gnueabihf/libm.so.6: version GLIBC_2.27' not found (required by /usr/lib/voice2json/lib/libngram.so.134)
[13296] Failed to execute script __main__
Traceback (most recent call last):
  File "__main__.py", line 6, in <module>
  File "asyncio/runners.py", line 43, in run
  File "asyncio/base_events.py", line 587, in run_until_complete
  File "voice2json/__main__.py", line 73, in main
  File "voice2json/__main__.py", line 733, in train
  File "voice2json/core.py", line 65, in train_profile
  File "voice2json/train.py", line 299, in train_profile
  File "rhasspyasr_kaldi/train.py", line 92, in train
  File "rhasspynlu/arpa_lm.py", line 55, in graph_to_arpa
  File "rhasspynlu/arpa_lm.py", line 70, in fst_to_arpa
  File "rhasspynlu/arpa_lm.py", line 345, in run_task
  File "s...

$PATH includes home/pi/Public, which is where the en-us_kaldi-zamia-2.0 files are located. Any suggestions?

That's a voice2json error outside our nodes.
Are you using the deb package or the docker image? Did you see any errors while installing, if it was the deb package?
Which operating system are you on? When I googled a little bit about GLIBC 2.27, this seems to be a problem with some Debian-based distributions, like older Ubuntu versions, that ship an older glibc version.
So it's probably inherent to your operating system and not voice2json.
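If you want to check this before reinstalling anything, something like the following sketch compares the glibc version on the machine against the 2.27 named in the error message. The helper names are made up, and `platform.libc_ver()` can return empty strings on some systems, in which case the check simply reports False.

```python
import platform
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version like '2.27' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def glibc_is_at_least(required: str, found: Optional[str] = None) -> bool:
    """Check the given (or the local) glibc version against a minimum."""
    if found is None:
        library, found = platform.libc_ver()
        if library != "glibc" or not found:
            return False  # not glibc, or the version could not be detected
    return parse_version(found) >= parse_version(required)


# GLIBC_2.27 is the version the error message asks for.
print("local glibc new enough:", glibc_is_at_least("2.27"))
```

For reference, Debian 9 "Stretch" ships glibc 2.24 and Debian 10 "Buster" ships 2.28, so a Stretch-based system would fail this check.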
If you installed using the deb package you could try to uninstall it and use the docker install instead, which should be agnostic to the library versions of the host system.
If that doesn't work please open an issue with Mike, the voice2json developer, directly on the voice2json GitHub:


Thanks Johannes,
I am using the deb package and saw no errors when installing the pre-compiled packages. The O/S is Raspbian GNU/Linux 9.11 (stretch). The CPU is armhf. I would really prefer to avoid using Docker at this point (something else to learn). I will do an O/S update / upgrade to ensure I have the latest.

EDIT: Aha, I see there is now a v10, "Buster". We'll try that.


Yes, I think they moved to Buster last year. Please let me know, because then I should add that to the requirements :+1:t2:
Although the docker install of voice2json is fortunately quite painless if you follow their instructions.

Version 0.6.0

  • a number of small fixes
  • added a transcribe-stream node (@sepia-assistant):
    • This is in principle a wrapper around the voice2json transcribe-stream functionality that was introduced in recent versions and is a combination of record-command and stt.
    • The transcription is started as soon as raw audio starts arriving. Because the transcription happens while the input is still running, this gives nearly instantaneous results even on hardware like a Raspberry Pi.
    • For this to work you need either the latest deb package or docker container, as there were some bugs in the previous versions of voice2json that will prevent this node from working.
    • The node will only work with raw audio from a microphone, same as the record command node. (So the stt node is still the way to go if you receive your audio from any other source, for example a phone.)
    • The usage is described in the info tab of the node, but it's not in the readme in the repository yet.
  • The suite of nodes now has proper version numbers in the package.json, so I don’t know if an update straight from the repository will work, as the version is now effectively below the one you have installed if you installed previously. So if it doesn’t show 0.6.0 after an npm update, uninstall the node and reinstall it.

As always I look forward to your feedback and I will try to give the readme some love next week but unfortunately I have been a bit short on time the past weeks.



Thank you for the update. I still have no microphone array but some time to get back to voice2json :slight_smile:
The recognition of my voice is not so good. I still have not trained because I still have no microphone for my Raspi (Tasker only). Here come the questions: Which microphone can you recommend? I think I need a USB microphone, since I googled the "respeaker 2 mic pi hats" (you recommended before) and they are mounted / connected via the GPIO, which I think will be a problem with my cooling solution. The other question: What if my gf tries the speech recognition when it is trained to my voice? Will this work, or will I get something thrown at my head when the recognition keeps failing?

EDIT: Another thing. I am too dumb to get how the training works. I start it with: msg.payload = "train"
Well... and then? The similar software I know expects me to read a given text and wants a voice sample of that text. Hmmmpf, but this one seems to work some other way.


sounds great, especially the fact that it starts transcribing right away :sunglasses:
Are there transient results as well? ^^
I guess this does not require the wav-header right? (I have both available, just in case).

I think for SEPIA I should still access the voice2json endpoint directly though ... but maybe I could offer a SEPIA node that just streams the audio buffer from any SEPIA client and then that could be connected to the v2j node (for whatever ideas that come up) :slight_smile:

When you talk about similar software I expect you mean something like Dragon NaturallySpeaking. This is not how voice2json works. Most modern speech recognition systems like Kaldi or DeepSpeech work by training both an acoustic model and a statistical language model on large corpora of audio plus transcriptions of that audio, using corpora like the one from the Mozilla Common Voice project or similar.
When you train voice2json, only the statistical language model part gets trained on your sentences and slots. The acoustic model doesn’t get touched and shouldn’t need to be, so apart from the sentences that you defined no further input is needed.
For more details and also some limitations of the approach of projects like voice2json, please do read the section about how the transcription works in the documentation of the nodes and, for much more detail, the linked white paper by Mike about the voice2json workflow.
So the training is in principle speaker independent, but your results may vary depending on the model used, as it always depends on how well all genders, ages and accents were represented in the corpora used to train the acoustic model.
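To give a rough feeling for what "training a statistical language model on your sentences" means, here is a toy bigram counter. This is only an illustration of the idea and not what voice2json actually does internally (the real training builds a proper n-gram model with dedicated tools); the two example sentences and the function names are made up.

```python
from collections import Counter
from typing import Iterable, Tuple

Bigram = Tuple[str, str]


def bigram_counts(sentences: Iterable[str]) -> "Counter[Bigram]":
    """Count neighbouring word pairs, with sentence start/end markers."""
    counts: "Counter[Bigram]" = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        counts.update(zip(words, words[1:]))
    return counts


def next_word_probability(counts: "Counter[Bigram]", prev: str, word: str) -> float:
    """Estimate P(word | prev) from the raw counts (no smoothing)."""
    total = sum(n for (first, _), n in counts.items() if first == prev)
    return counts[(prev, word)] / total if total else 0.0


counts = bigram_counts(["turn on the light", "turn off the light"])
p_on = next_word_probability(counts, "turn", "on")
p_light = next_word_probability(counts, "the", "light")
```

No audio is involved anywhere in this step, which is why no voice samples are needed: the model only learns which word sequences are likely, and the pre-trained acoustic model does the listening.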

For USB microphones it really depends on your budget. You can achieve good results with some of the cheap USB conference microphones, but they often don't have the best signal-to-noise ratio or far-field capabilities.
Then there is the ReSpeaker mic array v2 / USB mic array, which gives far superior results and has built-in LEDs that can be controlled with some simple Python scripts, but it is more expensive than a Pi 4 by itself.



No, only once a finished command has been detected.

You would have to build your own endpoint service, as voice2json by itself really only offers the separate command-line tools to bootstrap an application and will always need users to build the connecting infrastructure around it themselves. That is one of the reasons I started work on the nodes: to make this easily doable with Node-RED.

Why don’t you offer a WebSocket or an MQTT topic to subscribe to in order to receive the audio, like Rhasspy or Snips used to do? It is very easy to connect to from something like Node-RED, and no extra node is needed.


I've seen your node code and I think I could adapt the SEPIA STT server to use those terminal commands instead of the Kaldi ones. Let's see :slight_smile:

Actually this is what the SEPIA STT server is using. To be more precise it sends the audio buffer via socket and waits for results on the same channel.

Well, I'm not making much progress just trying to run voice2json with my USB mic. It calls arecord, but arecord fails with my mic (a Shure MV5) with the parameters it sends, in particular -c 1. If I remove that, arecord records fine. I've opened an issue on the GitHub page for voice2json and I'll see what synesthesiam has to say.

Hmm, the 1 channel mono format is what speech transcription systems like Kaldi use, so Mike will not have much choice there. You can change the command it calls in the profile.yml, I think. If sox works with your mic you could have voice2json call that instead, or use the stdin argument and pipe the audio in directly using unix pipes.

Or use the voice2json nodes and the sox convert node to convert to the right format after recording if it doesn’t accept a mono setting straight away while recording :wink:
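For completeness, the downmix itself is simple if you end up doing it in your own code. The sketch below assumes 16 bit signed, interleaved stereo PCM in the machine's native byte order (little-endian on a Pi), and `stereo_to_mono` is a made-up helper name; sox or the convert node will of course do this for you.

```python
from array import array


def stereo_to_mono(pcm: bytes) -> bytes:
    """Average left/right 16 bit samples of interleaved stereo PCM."""
    samples = array("h")  # signed 16 bit, native byte order (assumed)
    samples.frombytes(pcm)
    mono = array(
        "h",
        # Samples come in left/right pairs, so step through two at a time.
        ((samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples) - 1, 2)),
    )
    return mono.tobytes()


# Two stereo frames: (100, 200) and (-100, -300).
stereo = array("h", [100, 200, -100, -300]).tobytes()
mono = array("h")
mono.frombytes(stereo_to_mono(stereo))
```

The result has half as many samples as the input, one per stereo frame, which is the single channel layout the transcription side expects.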