Had a nice discussion last week about voice controlling Node-RED. Based on that discussion, I created a prototype node, node-red-contrib-deepspeech, on GitHub. It is not released on NPM, since it is just an experiment.
I just wanted to test two things:
Can I use DeepSpeech with Node-RED, i.e. can I convert speech to text with decent quality without a cloud service (based on a trained deep learning model, rather similar to Google's cloud service)?
And if that works, how fast or slow is it?
I'm not going to repeat all the information, since I have described it on the readme page above. The summary:
Quality seems to be good, but converting an audio sample of only 1.975 seconds takes 50.17 seconds on a Raspberry Pi 3 Model B, i.e. roughly 25 times slower than real time. However, all calculations are done on the CPU, while a neural network should really be executed on at least a GPU...
Don't know if it is useful to continue with this node? All 'constructive' feedback is again welcome.
Would be nice if users could do some testing on other hardware (corresponding to the DeepSpeech hardware recommendations), to see whether we can achieve real-time STT without needing a complete datacenter.
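In case it helps with comparing hardware, here is a rough standalone Node.js sketch of the kind of call the node makes under the hood, with a simple real-time-factor calculation. Treat it as an illustration only: the file names ('model.pbmm', 'sample.raw') are placeholders, and the constructor and stt() signatures shown are those of more recent deepspeech bindings, which differ from older releases.

```js
// Rough standalone timing test using the 'deepspeech' Node.js bindings.
// NOTE: the API shown matches recent releases of the bindings; older releases
// needed extra constructor arguments and a sample rate passed to stt().
// File names are placeholders: point them at the model and audio you have.
const DeepSpeech = require('deepspeech');
const fs = require('fs');

const SAMPLE_RATE = 16000;                        // the English models expect 16 kHz, 16-bit mono PCM
const model = new DeepSpeech.Model('model.pbmm'); // path to the acoustic model

const audio = fs.readFileSync('sample.raw');      // raw 16-bit mono PCM at 16 kHz
const audioSeconds = audio.length / 2 / SAMPLE_RATE;

const started = Date.now();
const text = model.stt(audio);
const elapsedSeconds = (Date.now() - started) / 1000;

console.log('Transcript      :', text);
console.log('Audio length    :', audioSeconds.toFixed(3), 's');
console.log('Decode time     :', elapsedSeconds.toFixed(2), 's');
// On the Pi 3 measurement above: 50.17 s / 1.975 s is roughly a 25x real-time factor.
console.log('Real-time factor:', (elapsedSeconds / audioSeconds).toFixed(1), 'x');
```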
I can see the reasoning behind wanting to have confidence in system security (I expressed my view in another post), but maybe we need to look in more detail at what the risks actually are in using Google's infrastructure, and at other contributions such as Andrei Tatar's work.
If we are using the service to control security- or mission-critical processes, I agree that we need total control, but for general home automation maybe a balanced compromise is needed?
I've been trying @andrei-tatar's node for the past week or so, and I'm slowly seeing the potential unfold, especially as I've added 'Continuous Conversation' to my Home device (by switching the language to English (US)).
You guys know much more than me; what do you see as the main threats to such solutions?
If this were run on a fairly robust machine that could serve as an offload engine for multiple audio inputs (similar to many voice assistants today, but hosted on the LAN rather than across the WAN), I see a lot of potential. In fact, at the hospital where I work, we use a setup much like that to allow physicians to dictate into digital charts in real time (the recognition engine is highly optimized for medical terminology and does an absolutely stunning job of coping with different accents).
That all said, if performance on low-power hardware is a priority, it may be best to focus on phonetic analysis rather than true speech to text. Both avenues have major advantages: true STT may allow for more fluid syntax and perhaps even context analysis, while phonetic analysis would be lightweight but require more rigidly defined command syntax and a more explicit corpus of available utterances.
Hi Paul, I will remove the privacy-related sentence from my readme page. I'm sure there are plenty of other good reasons to have local processing instead of a cloud solution. I had to write the readme page very quickly, so I added the privacy reason since that was the only thing that came to mind at that moment... If anybody has other reasons, please let me know and I will write them down.
At least I now know that the proof of concept works; I would just like to know whether somebody can run it in real time on affordable hardware, for example a neural network USB stick or whatever... I have no clue at the moment whether that is possible. Mozilla has also planned on their roadmap to make it run on mobile devices, so I assume it will become much faster in the future.
Jason, does such analysis already exist in Node-RED, or did you have some NPM module in mind that I could use? And could you please explain "require more rigidly defined command syntax and a more explicit corpus of available utterances" a bit more, because I'm new to this area... Thanks!
I'm not familiar with any such work in our neck of the woods, but that doesn't mean there's nothing out there.
I've seen a couple of projects out there that do simpler waveform comparison against pre-defined available words (a corpus being a collection of known words) in a sort of fuzzy approach. You define a syntax (in the linguistic sense, not the programming sense) in which certain parts of an utterance (statement) are valid at specific times; essentially a grammar. If you have a limited number of terms that may be used at the relevant part of a command, you can do a simple comparison against an ideal waveform for each candidate and assume that, as long as the word used meets a certain threshold of similarity to any available word, the most similar word is the one being said. You then consider that part of the statement satisfied, move on to the next list of valid words, and interpret accordingly. From what I understand, this is fairly similar to what is being done under the hood in Snips.
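To make that a bit more concrete, here is a rough JavaScript sketch of the grammar-plus-similarity idea. Everything in it is illustrative: the command grammar, the 0.6 threshold, and especially the similarity measure, which here just compares transcribed words with a plain Levenshtein distance where a real system would compare waveforms or phoneme sequences.

```js
// Rough sketch of grammar-constrained matching: each "slot" in a command only
// accepts a small list of valid words, and the candidate most similar to what
// was heard (above a threshold) wins. The grammar, the threshold and the
// similarity measure are illustrative placeholders; a real system would
// compare waveforms or phoneme sequences instead of plain strings.
const grammar = [
  ['turn', 'switch', 'set'],          // slot 1: action words
  ['on', 'off'],                      // slot 2: desired state
  ['kitchen', 'bedroom', 'hallway']   // slot 3: location
];

// Plain Levenshtein edit distance, used as a stand-in similarity measure.
function distance(a, b) {
  const d = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)));
  for (let i = 1; i <= a.length; i++)
    for (let j = 1; j <= b.length; j++)
      d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                         d[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1));
  return d[a.length][b.length];
}

function similarity(a, b) {
  return 1 - distance(a, b) / Math.max(a.length, b.length);
}

// Match each heard word against the words that are valid at that slot.
function interpret(words, threshold = 0.6) {
  if (words.length !== grammar.length) return null;   // utterance does not fit the grammar
  const command = [];
  for (let i = 0; i < grammar.length; i++) {
    let best = null, bestScore = 0;
    for (const candidate of grammar[i]) {
      const score = similarity(words[i], candidate);
      if (score > bestScore) { best = candidate; bestScore = score; }
    }
    if (bestScore < threshold) return null;            // nothing close enough, so reject
    command.push(best);
  }
  return command;
}

// Example: a slightly mangled utterance still resolves to a valid command.
console.log(interpret(['tern', 'of', 'kichen']));      // [ 'turn', 'off', 'kitchen' ]
```

In a real flow the words (or phoneme sequences) would of course come from the audio front end rather than typed strings, but the slot-by-slot matching logic stays the same.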
Hey Paul, it seems you will need a very large array of Raspberries. I don't know if you are married, but I won't get the budget for such a hardware purchase in a million years.
I have added an extra hardware section to the readme page. I'm going to put this node in the freezer until DeepSpeech has been optimized, or the price of a Raspberry Pi drops below 1 euro...
But now at least we can say we did try to implement it ...