Although I have a very large todo list, I get distracted every time I see something interesting. So this forum is a real hell for me. I hate you guys with all your brilliant ideas and cool IoT stuff.
Saw multiple discussions lately about voice control, so I have read some basic articles about the topic. Below is a list of possible solutions I found. If anybody has other options, preferences, dislikes, or whatever 'constructive' feedback, please let me know ... It would be nice if interested users could find here a list of all possible setups.
Using dedicated hardware, which can (among other things) do both text-to-speech and speech-to-text. If I understand correctly, these are the two major competitors, with Google on top (e.g. due to the larger number of languages supported):
Amazon Alexa (voice recognition system), which is integrated in Amazon's Echo speaker and can be connected as in Dave's tutorial:
Via the Node-RED dashboard app combined with the native browser speech recognition functionality. Currently the dashboard's Audio-out node supports TTS (text-to-speech), which is supported widely by all major browsers. However, STT (speech-to-text) is only supported by Chrome, and it needs a connection to their cloud platform (if I'm not mistaken). It would have been a simple solution: just talk into your microphone (of a wall-mounted tablet running the dashboard), let the browser convert the speech to text, and send the text to the Node-RED flow (see the sketch after this list). But it doesn't seem to be a good idea ...
Via the Node-RED dashboard app running some third-party speech recognition software locally in the browser. There seem to be a lot of open source projects available, but not all are well maintained. The following might (?) be worth looking at:
PocketSphinx.js: this is a wasm (WebAssembly) build, so it can run in any browser. The disadvantage is that the wasm file is about 3 MB.
DeepSpeech (see details below) will publish a client version in the near future.
With speech recognition software running locally in the Node-RED flow. In this case an audio stream (e.g. from the dashboard ...) is sent to the Node-RED flow, where a speech recognition module is running. Again there seem to be a lot of open source projects available:
DeepSpeech from Mozilla, which is based on neural networks in TensorFlow. Here is a demo. They have also created a website which allows everybody to contribute speech fragments to train the system in their own language. The more training data they can collect, the better it will become. The disadvantage I see is a non-automatic installation procedure, so it is not possible (I think) to install it simply from the Node-RED palette manager.
PocketSphinx.js: since Node.js versions above 8 are able to execute WebAssembly files, this module can also be run in the Node-RED flow...
By calling a cloud service like Google, Amazon, ... It seems that Google is unbeatable in quality, since they use deep neural networks with huge (private) training databases in lots of languages. But you have to pay per 15 seconds of audio. That is the reason I personally prefer the local non-cloud solutions...
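To make the dashboard/browser option above a bit more concrete, here is a minimal sketch (TypeScript, not tested) of what the browser side could look like. It uses Chrome's webkitSpeechRecognition, and it assumes there is an HTTP In node listening on POST /voice in the flow to receive the recognised text; the endpoint path and the language are just example choices.

```typescript
// Minimal sketch: browser-side speech-to-text with Chrome's Web Speech API.
// Assumes an HTTP In node listening on POST /voice in the Node-RED flow
// (the endpoint name is just an example).

const SpeechRecognitionImpl =
  (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;

if (!SpeechRecognitionImpl) {
  console.warn('Speech recognition is not supported in this browser (Chrome only).');
} else {
  const recognition = new SpeechRecognitionImpl();
  recognition.lang = 'en-US';         // example language
  recognition.interimResults = false; // only deliver final results
  recognition.continuous = false;     // one phrase per activation

  recognition.onresult = (event: any) => {
    const text: string = event.results[0][0].transcript;
    // Forward the recognised text to the Node-RED flow.
    fetch('/voice', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ command: text }),
    }).catch((err) => console.error('Could not reach Node-RED:', err));
  };

  recognition.onerror = (event: any) => console.error('STT error:', event.error);

  // Start listening, e.g. wired to a button on the dashboard.
  recognition.start();
}
```

Keep in mind that, as said above, Chrome still sends the recorded audio to Google's cloud to do the actual recognition, so this runs in the browser but it is not a local solution.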
You can create a "flow" in Shortcuts that opens a URL within the shortcut. A shortcut can also be added to Siri by recording a Siri phrase. The phone has to be on the same network as Node-RED, and the "Running your shortcut" response is a tad annoying, but I tend to use this to trigger stuff around the house more than Alexa (turn on / turn off etc).
I have Amazon Alexa set up with Node-RED to query the train times in the morning, then announce the next two trains using Amazon's text-to-speech service. The most difficult thing is remembering the syntax of Alexa queries.
A good summary Bart. Getting speech into Node-RED is only part of the problem though; understanding the words and what they mean is another challenge.
The approaches range from 'the user said exactly this' to 'the user said something that sort of sounded like they could have meant this'.
The most basic approach is to do simple text comparisons to specific phrases. It relies more on the user learning how to speak to the interface rather than the interface being able to interpret what the user is saying.
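For example, a 'the user said exactly this' matcher can literally be a lookup table. A small TypeScript sketch (the phrases and command names are made up for illustration):

```typescript
// 'The user said exactly this': a fixed phrase-to-command lookup
// (phrases and command names are made up for illustration).
const commands: Record<string, string> = {
  'turn on the bedroom light': 'bedroom/light/on',
  'turn off the bedroom light': 'bedroom/light/off',
  'simon says lights off': 'house/lights/off',
};

function handleUtterance(text: string): string | undefined {
  // Anything that is not an exact (case-insensitive) match is simply ignored.
  return commands[text.toLowerCase().trim()];
}

console.log(handleUtterance('Turn on the bedroom light')); // -> 'bedroom/light/on'
```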
A more advanced approach is to use natural language processing which applies a set of syntactic and vocab rules to match what the user is saying. node-red-contrib-ecolect is a natural language parser and matcher that uses fuzzy logic to match input text to known phrases and extract values from the text. In the home automation area, where there are only a limited number of ways to say 'turn on the bedroom light', this approach is quite useful.
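This is not how ecolect itself is used (check the node's readme for that), but just to illustrate the idea of matching a phrase template and extracting values from it, here is a rough TypeScript sketch with a made-up template syntax:

```typescript
// Rough sketch of template matching with slot extraction (NOT the ecolect API).
// A template like 'turn {action} the {device}' is matched against the input text
// and the named slots are returned.

type Match = { intent: string; slots: Record<string, string> } | null;

function matchPhrase(template: string, intent: string, text: string): Match {
  // Turn 'turn {action} the {device}' into a regex with named capture groups.
  const pattern = template.replace(/\{(\w+)\}/g, '(?<$1>.+)');
  const result = new RegExp(`^${pattern}$`, 'i').exec(text.trim());
  if (!result) return null;

  const slots: Record<string, string> = {};
  for (const [name, value] of Object.entries(result.groups ?? {})) {
    slots[name] = value;
  }
  return { intent, slots };
}

// Example with a made-up home-automation phrase:
const match = matchPhrase('turn {action} the {device}', 'set-power',
                          'turn on the bedroom light');
// -> { intent: 'set-power', slots: { action: 'on', device: 'bedroom light' } }
console.log(match);
```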
The most advanced approach is to use neural networks to determine the context of a sentence. This requires a lot of training with a huge user base. Google's Dialogflow uses this approach to implement complex speech-based apps usable by the general public. Imagine how many different ways people could interact with an airline reservation system. Dialogflow can be trained about context in conversations, so it can understand that if a user previously asked about flights to London, they probably want to book a flight there when they next say they want to buy a ticket.
Hey Dean, I thought your ecolect node is meant to extract commands from text. But I would first have to convert the audio stream to text, and then inject that text into your node. Or am I wrong?
Mark, I wasn't even aware that there exists some kind of query syntax for Alexa (and Google Home ?). Then those devices are a 'little' bit less amazing
I'd love to have a simple way of using a Pi Zero with a USB microphone to recognise simple IoT-type commands (e.g. 'Simon says lights off') without a need to use a cloud service
That's right Bart, first part of the problem is turning the speech into text and the second part is understanding the meaning of the text.
Google and Alexa offer solutions to both parts of the problem. With Google Assistant, Actions does the speech-to-text part and Dialogflow does the understanding part.
Hashtag me too...
Dean has shared a nice link (!!), but again you need a Raspberry Pi 3, i.e. a single-core Raspberry Pi Zero will not be sufficient. But then the price is getting closer to the Amazon and Google devices. And they have a mic, speaker, a nice case... I'm afraid that they are going to win this battle
Hey Jan, thanks for the link! Are you happy with the device capabilities?
Again hashtag me too ...
But for those among us who don't care about their privacy and want to create a cheap system (I 'think' a Raspberry Pi Zero might be sufficient), I found another way:
As mentioned above, Chrome is the only browser that offers native Speech Recognition (i.e. speech to text).
How do they do that? It seems they haven't added a library to Chrome to do local processing; they just send the audio samples to their cloud service behind your back! That is how they managed to get it implemented in Chrome before the others...
Since these audio samples are sent 'by Chrome', Chrome will not have to pay for the recognition processing. You can probably guess what the next step will be ...
Here is a great article by a guy who has reverse engineered the whole thing, and he explains how you can act as if you are Chrome. The result is that you get free speech recognition in the cloud ...
The above is purely informational, out of technical interest. I don't know if this is legal, so try it at your own risk!!!
And be aware that all the voices in your house go to big brother. But that is also the case when you buy a Google Home device, because when you are offline the device will answer your commands with "I'm sorry, I don't have an internet connection right now".
And for those with low bandwidth, keep in mind that you are sending a massive amount of data (audio samples) across the internet ...
A Google Home device in a nutshell: Google sells you a microphone that you put into your living room. Everything you say will be sent to Google, where they convert it to text. From that text they know exactly what you are talking about, and they can analyse your conversations. Based on that analysis they execute the 'required' actions, whatever those might be...
Summarizing it like that makes those devices a bit less attractive to me at the moment ....
... only the commands that you issue to the Google Home. Google 'tells us' that it does not eavesdrop, and the LEDs indicate whether it's listening for commands or not (other than OK/Hey Google).
But yes, they could get to know quite a lot about us.
Note that Google is not constantly listening to what is said in the room:
In the case of the Voice Kit: a Python script running on the Raspberry Pi listens for the hotword "OK Google" or "Hey Google", which requires no interaction with the Google cloud. Only when it has recognized the hotword is the following spoken sentence streamed to Google.
You can even disable the hotword trigger and use a button press to trigger the cloud conversation with Google.
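In other words, the flow is roughly the pattern below (a conceptual TypeScript sketch only, not the actual Voice Kit script, which is Python; nextAudioChunk, detectHotwordLocally and streamToAssistant are hypothetical placeholders):

```typescript
// Conceptual sketch of the hotword gating described above. This is NOT the real
// Voice Kit script; all three helpers are hypothetical placeholders that only
// exist to show where local and cloud processing happen.

async function nextAudioChunk(): Promise<Buffer> {
  return Buffer.alloc(1600); // placeholder: would normally read from the microphone
}

function detectHotwordLocally(chunk: Buffer): boolean {
  return false; // placeholder: a local hotword detector, no cloud involved
}

async function streamToAssistant(): Promise<string> {
  return 'assistant reply'; // placeholder: the only step that talks to Google
}

async function main(): Promise<void> {
  while (true) {
    const chunk = await nextAudioChunk();
    if (detectHotwordLocally(chunk)) {          // stage 1: stays on the Pi
      const reply = await streamToAssistant();  // stage 2: cloud conversation
      console.log('Assistant said:', reply);
    }
  }
}

main();
```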
It's even simpler than that. It's listening for anything with a pattern starting with the ⟨e⟩ sound shared by the end of "ok" and "hey", followed by just about any pulmonic consonant stop (g, b, d, etc. in English), followed by the ⟨u⟩ sound ("oo" in "google"), another pulmonic stop, and just about any vowel.
That simplicity allows a Google Home device to handle the wake function with waveform matching in hardware, done by a chip that passes audio to the main board only after its trigger condition has been satisfied. Give it a try sometime: a Google Home will respond to similar things that match the pattern, such as Yogi Bear's "hey Booboo!"
It is easy enough to test whether this is true: hang a packet trace off the device. Trust me, if it were really doing this, it would be splashed all over the world's press.
Still, I wouldn't be saying really sensitive things around my Google Home - a little paranoia is OK.