Hi @JGKK
I've been following Rhasspy (and a bit of voice2json) for some time because I thought it might fit perfectly with SEPIA Open Assistant, especially since I saw that it supports the same Kaldi/Zamia system that is already working inside the SEPIA STT server.
The SEPIA STT server works pretty well, but it is not very user friendly when it comes to customizing the language model (it is possible, but not documented, and the most painful part is adding missing words to the dictionary), and with all the other items on my to-do list for SEPIA I never really found the time to improve this.
Now I saw this Node-RED integration and all the effort that has been put into improving on-device, open-source speech recognition, and I started thinking again about ways to make SEPIA and voice2json compatible. The thing I'm most interested in is the STT node.
Since SEPIA can stream audio to any server (the SEPIA STT server is basically a small WebSocket server around Kaldi/Zamia), I was wondering what would be necessary to offer voice2json as a third speech recognition engine inside the SEPIA client app. To clarify this, I'd like to ask you a few questions:
- Does this node support live transcription? To be more precise, does it start transcribing as soon as the audio stream starts, or does it save/buffer the whole stream first and only start transcribing after the stream ends?
- Does the STT server do VAD as well and send a 'stop' signal when it thinks the speech input is finished?
- Roughly what is the real-time factor, i.e. how long does it take to get from 3 s of speech input to the JSON result (just a rough estimate, let's say on a RPi4)?
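For context, here is a minimal sketch of the two things I'm asking about: how a SEPIA-style client frames raw PCM audio into chunks for WebSocket streaming, and how I compute the real-time factor. The frame size, sample rate, and function names are my own assumptions for illustration, not voice2json's actual protocol:

```python
# Hypothetical sketch (assumed names/sizes, not voice2json's real API):
# a SEPIA-style client chops raw 16-bit PCM into fixed-size frames and
# sends them over a WebSocket one by one as the user speaks.

def frame_pcm(pcm_bytes: bytes, frame_size: int = 2048):
    """Split raw PCM audio into fixed-size frames for streaming."""
    for i in range(0, len(pcm_bytes), frame_size):
        yield pcm_bytes[i:i + frame_size]

def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF < 1.0 means transcription is faster than real time."""
    return processing_seconds / audio_seconds

# 3 s of 16 kHz mono 16-bit audio = 3 * 16000 * 2 bytes (silence here)
audio = bytes(3 * 16000 * 2)
frames = list(frame_pcm(audio))
print(len(frames))                 # number of frames to send
print(real_time_factor(1.5, 3.0))  # e.g. if decoding those 3 s took 1.5 s
```

So by "live transcription" I mean: can the engine consume frames like these while they arrive, instead of waiting for the full buffer?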
Thank you, and keep up the great work!
Florian