Neither, right now. If you want to stream live raw audio you would have to feed it to the voice2json record-command node, which uses WebRTC VAD to determine when the speaker has finished the command and then emits a single WAV buffer with the detected speech. This is what the stt node expects: a single buffer containing WAV audio, or a path to a WAV file.
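To illustrate the idea, here is a minimal sketch of what that segmentation step does: collect incoming 16-bit mono PCM frames, and once speech has been followed by a run of silence, emit the whole utterance as one WAV buffer. This is not the record-command node's actual code or API; it uses a naive energy threshold as a stand-in for the real WebRTC VAD, and all names and thresholds here are illustrative assumptions.

```python
import io
import struct
import wave

SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000  # 480 samples per frame
SILENCE_FRAMES_TO_STOP = 10  # ~300 ms of silence ends the command


def frame_is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    """Crude energy-based VAD: mean absolute amplitude above a threshold.
    (The real node uses WebRTC VAD instead of this heuristic.)"""
    samples = struct.unpack("<%dh" % (len(frame) // 2), frame)
    energy = sum(abs(s) for s in samples) / max(len(samples), 1)
    return energy > threshold


def segment_command(frames):
    """Collect frames until speech has started and then gone silent,
    then return a single WAV buffer, like the record-command node emits."""
    collected = []
    in_speech = False
    silence_run = 0
    for frame in frames:
        if frame_is_speech(frame):
            in_speech = True
            silence_run = 0
        elif in_speech:
            silence_run += 1
        collected.append(frame)
        if in_speech and silence_run >= SILENCE_FRAMES_TO_STOP:
            break
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)   # mono
        w.setsampwidth(2)   # 16-bit samples
        w.setframerate(SAMPLE_RATE)
        w.writeframes(b"".join(collected))
    return buf.getvalue()
```

The resulting bytes can then be handed to whatever consumes WAV audio, e.g. the stt node, which expects exactly one such buffer per command.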
The realtime factor I see on a Pi 4 with Kaldi is about 0.7 to 1, depending on the amount of background noise. So a bit over 2 seconds for 3 seconds of input audio.
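As a quick back-of-the-envelope check of those numbers (realtime factor = processing time divided by audio duration):

```python
# With a realtime factor of ~0.7, as reported above for Kaldi on a Pi 4,
# 3 s of input audio takes roughly 2.1 s to transcribe.

def transcription_time(audio_seconds: float, realtime_factor: float = 0.7) -> float:
    """Estimated processing time for a given clip length."""
    return audio_seconds * realtime_factor

print(round(transcription_time(3.0), 2))  # → 2.1
```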
I am working on live stream transcription with voice2json through Node-RED, but there are a couple of reasons why there is no node for that right now. There are a few bugs on the voice2json side, and especially with a voice2json Docker install there is a big speed caveat: starting the stream transcription and loading the libraries takes long enough to nearly negate any speed advantage over the current approach. It will be implemented as a node in the future and is on my internal roadmap, as voice2json can in theory do it. Just not quite ready yet.
If you want to integrate voice2json into SEPIA, probably the best way would be to write your own service, in Python or really any other language, that interacts with it directly, as this would save a lot of overhead.
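One simple way to interact with it from your own service is to shell out to the voice2json CLI. A minimal sketch, assuming voice2json is installed and a profile has been trained (the profile path below is just a placeholder): `transcribe-wav` prints one JSON object per input file, with the transcription in its "text" field.

```python
import json
import subprocess

PROFILE = "~/.config/voice2json"  # placeholder; point this at your trained profile


def build_command(wav_path: str, profile: str = PROFILE) -> list:
    """Assemble the transcribe-wav invocation without running it."""
    return ["voice2json", "--profile", profile, "transcribe-wav", wav_path]


def transcribe(wav_path: str, profile: str = PROFILE) -> str:
    """Run voice2json on one WAV file and return the recognized text."""
    out = subprocess.run(
        build_command(wav_path, profile),
        capture_output=True, text=True, check=True,
    ).stdout
    # One JSON line per input file; take the first.
    return json.loads(out.splitlines()[0])["text"]
```

A wrapper like this could sit behind a small HTTP endpoint in your service, with no Node-RED in the loop; keeping voice2json running in stream mode instead of spawning it per request would cut the startup overhead further, once that path is stable.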
Hope this helps, Johannes