Ways to control Node-RED with speech commands

Yes, it fully parses your text, though only inside a “session”: the dialogue context that starts after the hotword is detected and lasts until the end-of-session is signalled. Outside of a session it won’t. If you answer something it doesn’t recognise as a command, however, you can still force it to parse via an MQTT command and store the result locally as data.
Here’s one example of someone who did that, though I’m not a fan of their blog format: it’s mostly a demonstration video and sometimes a link to (messy) code. https://laurentchervet.wordpress.com/2018/03/08/project-alice-arbitrary-text/
Here are the official docs for the MQTT commands; the ASR section shows how to tell it to start recording, stop recording, and parse the recorded speech to text:
https://docs.snips.ai/reference/hermes
This is a low level API you normally don’t have to worry about.
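
To give an idea, here’s a minimal sketch of driving the ASR over those MQTT topics from Python with paho-mqtt. The topic names come from the Hermes reference above, but the exact payload fields may differ per version, so treat it as an illustration rather than a drop-in script:

```python
# Sketch: force the ASR to transcribe outside a dialogue session,
# assuming a local Snips/Hermes MQTT broker on localhost:1883.
import json
import paho.mqtt.client as mqtt

def on_text_captured(client, userdata, msg):
    # The ASR publishes its transcription on hermes/asr/textCaptured.
    payload = json.loads(msg.payload.decode())
    print("Heard:", payload.get("text"), "likelihood:", payload.get("likelihood"))

client = mqtt.Client()
client.connect("localhost", 1883)
client.subscribe("hermes/asr/textCaptured")
client.message_callback_add("hermes/asr/textCaptured", on_text_captured)

# Tell the ASR to start capturing audio on the default site...
client.publish("hermes/asr/startListening", json.dumps({"siteId": "default"}))
# ...and later, to stop again:
# client.publish("hermes/asr/stopListening", json.dumps({"siteId": "default"}))

client.loop_forever()
```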


Thanks for sharing, most interesting. Like you, I love the concept of voice control but balked at trust and security issues with most (all really) online platforms.

Hi! Just found a promising new option: https://picovoice.ai/

Here's their GitHub: https://github.com/Picovoice

They have a wake word engine, a speech-to-text engine and a speech-to-intent engine.
It's multi-platform, can run on a Raspberry Pi, and doesn't need the cloud.

Haven't tried it yet, but if it works well, I sure hope we can make some nodes out of it!!!


That memory footprint is looking nice, certainly. I might do a comparative test with Snips some time, but for now Snips will stay my primary choice, their out-of-the-box MQTT support being just one of the reasons :slight_smile:

Quick question about Snips: do you think it's possible to make it always listen (remove wake word detection)?
We are building some interactive installations, and wake words are a no-go experience-wise.
Thanks!

Yes, through MQTT commands. It will involve a bit more coding, but it should work. You can also dynamically start dialogues, which is what I already do: based on conditions combined with the time of day, it will ask me for pain levels, suggest I drink water, and so on. The pain-levels one is an interactive dialogue, and programming it out isn’t easy :stuck_out_tongue:
Start by working out flow charts of the conversation on paper, then code them out. I’m going to see if I can write and post a Snips skill running directly in NR later; as in, one where the subscribe/publish handlers are set up in NR, with code in between to act on them.
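
In the meantime, here’s a rough sketch of what starting a dialogue programmatically looks like over MQTT. The startSession payload shape follows the Hermes reference; the intent name in the filter is made up, so swap in one from your own assistant:

```python
# Sketch: start a dialogue session from code, no wake word involved.
# "user:GivePainLevel" is a hypothetical intent name for illustration.
import json
import paho.mqtt.client as mqtt

client = mqtt.Client()
client.connect("localhost", 1883)

client.publish("hermes/dialogueManager/startSession", json.dumps({
    "siteId": "default",
    "init": {
        "type": "action",                       # "action" expects an answer, "notification" just speaks
        "text": "How are your pain levels right now?",
        "canBeEnqueued": True,
        "intentFilter": ["user:GivePainLevel"]  # only accept this intent as the reply
    }
}))
```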


Snips runs offline, but you will have to pre-train it using their online service, which builds the package.
Picovoice also looks interesting, and the implementation looks simple too.

Picovoice does indeed look interesting, although their website looks a bit :crazy_face: Anyone know what their "Request Access" gets you?

All documentation is on their GitHub repos, but you can test a live demo here without requesting access.

I don't know for sure, but it appears to be for commercial/enterprise usage. However, the open-source repositories on GitHub come with small print: part of the functionality may only be used/unlocked under a commercial license.
Picovoice comes as 3 projects bundled together:

  1. Rhino, the on-device speech-to-intent engine; GitHub - Picovoice/rhino: On-device Speech-to-Intent engine powered by deep learning
    "Custom contexts are only provided with the purchase of the commercial license. To enquire about commercial licensing, contact us."
    "A context defines the set of spoken commands that users of the application might say. Additionally, it maps each spoken command to users' intent. For example, when building a smart lighting system the following are a few examples of spoken commands: ..."
  2. Porcupine, on-device wake word detection; GitHub - Picovoice/porcupine: On-device wake word detection powered by deep learning
    "Custom wake-words for Linux, Mac, and Windows can be generated using the optimizer tool only for non-commercial and evaluation purposes . The use of the optimizer tool and the keyword files generated using it in commercial products without acquiring a commercial licensing agreement from Picovoice is strictly prohibited."
    "Custom wake-words for other platforms must be generated by the Picovoice engineering team and are only provided with the purchase of the Picovoice evaluation or commercial license. To enquire about the Picovoice evaluation and commercial license terms and fees, contact us."
  3. Cheetah, on-device speech to text engine; GitHub - Picovoice/cheetah: On-device streaming speech-to-text engine powered by deep learning
    "This repository is provided for non-commercial use only. Please refer to LICENSE for details. The license file in this repository is time-limited. Picovoice assures that the license is valid for at least 30 days at any given time."
    "If you wish to use Cheetah in a commercial product, please contact us. The following table depicts the feature comparison between the free and commercial versions."

While Snips, too, requires an enterprise license for commercial usage, it allows you to write your own contexts for non-commercial use. If you take a look at the context files in the Rhino repository, you'll notice that these are in a binary, undocumented format. Combining the licensing terms of these 3 repositories, it appears that to use Picovoice you need an enterprise license for anything more than running the demos they supply on their site/repositories. Furthermore, running Cheetah, the on-device speech-to-text engine, on anything but regular Linux in a non-embedded environment (for example, the Raspberry Pis shown in every example and photo) requires a commercial license.

And, for further comparison: if you want a different hotword than "hey Snips", Snips allows you to record one of your own and generate models for it. They provide a tool for that on their GitHub, with an Apache 2.0 license and proper documentation on how to use it: GitHub - snipsco/snips-record-personal-hotword. This is also referenced in the regular documentation.

Thanks for the summary. Mmmm, they appear to like making life difficult with their licensing, don't they? :thinking:

Thanks for pointing out the devil in the fine print indeed! I'll reach out to Picovoice to enquire if they would consider offering a more permissive non-commercial free licence. In the meantime, Snips it is!

Sick(er) in bed than usual, so the writing/posting has to wait. It won’t leave my mind though, I’m just not capable of working it out at the moment. Makes me wonder, could I get it to drag nodes onto the flow and program them with my voice? I had another speech-to-text engine try to help me write Python code 5 years ago. It wasn’t much of a success, mostly because of my horrible pronunciation. I’ve improved a lot in everything over the last years. Maybe it works better now :joy:

Questions about snips speech recognition:

  • Do you have to train it so it recognizes your voice?
  • Once trained (or not), does it work with other voices?
  • From my experience, higher voices (kids, etc.) are often harder to transcribe... how does Snips perform in that area?
  • Can Snips support speaker identification?
  • Is it capable of transcribing everything you say "out-of-the-box", or do you need to teach it words?

Typing from my phone, I’ll edit and add references with links for each later.

Unless you want to use your own wake word, no.

Yes, but for custom wake words only a single-voice model is used, meaning that if you want others to use that same custom wake word, more training/modelling is required.

A known issue: since children’s voices are less available in data sets large enough (and consensually collected) to train models on, it performs less well in that area. I saw a topic on their Discourse about it last week, will link it later.

Yes, it can do that. It allows you to use a base station running the main engines, and satellite stations elsewhere that pick up the audio. It is capable of knowing which satellite (or the base) a request is coming from, and you can combine that with custom parameters in your code, such as “speaker A is in room X”, to perform actions based on it. Say you have the base in the living room and a satellite in your bedroom: it is capable of associating “turn on the lights” with the lights in the room the request came from. You have to code that logic yourself, but the information is passed on. I’ll link the relevant docs later.
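
To illustrate, a small sketch of what that routing logic can look like. Every Hermes intent message carries a siteId field; the intent name and the site-to-lights mapping below are made up for illustration:

```python
# Sketch: pick the right lights based on the siteId carried in an intent message.
# The mapping and the example payload are illustrative, not from a real assistant.
import json

SITE_TO_LIGHTS = {"default": "living_room_lights", "bedroom": "bedroom_lights"}

def lights_for(intent_payload: dict) -> str:
    # siteId identifies the base/satellite device the audio came from.
    return SITE_TO_LIGHTS.get(intent_payload.get("siteId", "default"), "living_room_lights")

example = json.loads('{"siteId": "bedroom", "intent": {"intentName": "user:TurnOnLights"}, "slots": []}')
print(lights_for(example))   # -> bedroom_lights
```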

At the moment it has 5 or 6 languages available as acoustic models, on which the intent models are built. If I remember correctly, several more languages, including Dutch (which is why I was looking it up), are expected to be released in 2020. Japanese, French, German and English are already supported; the others I don’t know off the top of my head.

The Speech Recognition (ASR) is done through the acoustic models for each language, and can be used for things it was not specifically trained on (for example grocery lists) or driven directly through MQTT commands (I’m planning to post a sample of both here when I feel slightly better and can sit up for longer). The ASR recognises speech and transcribes it to text, but no meaning is attached to it. That’s where the NLU comes in, the Natural Language Understanding. That part of the model has to be specifically trained, but it is trained on written text, as that is what the ASR outputs and what goes into the NLU.
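
To make that split concrete, here’s a hedged sketch of handing a piece of text straight to the NLU over MQTT. The topics are from the Hermes reference; the exact response fields may vary per version:

```python
# Sketch: send already-transcribed text to the NLU and read back the parsed intent.
import json
import time
import paho.mqtt.client as mqtt

def on_parsed(client, userdata, msg):
    payload = json.loads(msg.payload.decode())
    intent = payload.get("intent") or {}
    print("Intent:", intent.get("intentName"), "slots:", payload.get("slots"))

client = mqtt.Client()
client.connect("localhost", 1883)
client.subscribe("hermes/nlu/intentParsed")
client.subscribe("hermes/nlu/intentNotRecognized")
client.on_message = on_parsed
client.loop_start()

# Hand the NLU a sentence, as if it came from the ASR.
client.publish("hermes/nlu/query", json.dumps({
    "input": "add milk to my grocery list",
    "id": "demo-1"
}))
time.sleep(5)   # give the NLU a moment to answer before the script exits
```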

I got the following image from the docs, will properly link it later:

[image from the Snips docs: wake word → ASR → NLU → intent handling pipeline]

As you can see, the wake word is used to make it start listening, but an alternative is to send a command over MQTT to tell it to start listening. The raw audio is then passed to the ASR, which transcribes it. The resulting text is passed on to the NLU, which extracts the intent from it. The action/skill code you write yourself can hook onto (read: subscribe to the relevant MQTT topic for) parsed intents with their slots, plus data such as which satellite sent the request or custom data specified at an earlier stage, and then execute the actual actions. Since everything is MQTT based, NR can easily be integrated into the process, and so far I’m pleasantly surprised by how easily everything connects.
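
Putting that together, a minimal hedged sketch of such a skill in Python. The intent name and slot name are made up; the topic layout follows the Hermes docs. In NR the equivalent would be an mqtt-in node on hermes/intent/#, a function node for the logic, and an mqtt-out node publishing the endSession message:

```python
# Sketch: a minimal "skill" that subscribes to one parsed intent, reads a slot,
# acts on it, and ends the session with a spoken reply.
# "user:SetLightColor" and the "color" slot are hypothetical names.
import json
import paho.mqtt.client as mqtt

INTENT_TOPIC = "hermes/intent/user:SetLightColor"

def on_intent(client, userdata, msg):
    payload = json.loads(msg.payload.decode())
    # Slots arrive as a list; pick out the one we care about.
    color = next((s["value"]["value"] for s in payload.get("slots", [])
                  if s["slotName"] == "color"), "white")
    print("Setting lights to", color)   # replace with the real action
    client.publish("hermes/dialogueManager/endSession", json.dumps({
        "sessionId": payload["sessionId"],
        "text": "Setting the lights to " + color
    }))

client = mqtt.Client()
client.connect("localhost", 1883)
client.subscribe(INTENT_TOPIC)
client.message_callback_add(INTENT_TOPIC, on_intent)
client.loop_forever()
```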

For comparison, throughout my adult life my voice has on occasion been described as “husky 12-year-old something”, not exactly what you want to hear as an adult, but hey, so far Snips is able to handle it just fine. I’ve had success with it correctly recognising the names of medication that both my general practitioner and my pharmacist have trouble pronouncing. I should add that I had the names of several types of my medication coded into a “slot”, so it would know they were a kind of medication. If you don’t want information like that ending up in the models, it is possible to inject a list of additional words (in text form, for the NLU rather than the ASR) into the runtime.
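
For that injection, a hedged sketch of what I mean; the hermes/injection/perform topic and the operations payload are as I recall them from the Snips docs, and the entity name and values below are placeholders:

```python
# Sketch: inject extra vocabulary (e.g. medication names) into the running
# assistant instead of baking it into the trained model.
# The entity name "medication" and the values are placeholders; check the
# injection docs for the exact payload your version expects.
import json
import paho.mqtt.client as mqtt

client = mqtt.Client()
client.connect("localhost", 1883)

client.publish("hermes/injection/perform", json.dumps({
    "operations": [
        ["addFromVanilla", {                 # "addFromVanilla" replaces earlier injections, "add" appends
            "medication": ["naproxen", "pregabalin", "amitriptyline"]
        }]
    ]
}))
```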

Edit: oh, and in response to the Alexa topic going on elsewhere on this forum, it is capable of doing text-to-speech and even conversations, but you have to code them properly. It won’t be a fully natural conversation, but good enough for my needs at least.
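
The conversation part hinges (as far as I understand the dialogue API) on continueSession: instead of ending the session after an intent, you keep it open with a follow-up question. A hedged sketch, with made-up intent names:

```python
# Sketch: keep a dialogue going by answering an intent with continueSession
# instead of endSession. "user:GivePainLevel" and "user:ConfirmYesNo" are
# hypothetical intent names.
import json
import paho.mqtt.client as mqtt

def on_pain_level(client, userdata, msg):
    payload = json.loads(msg.payload.decode())
    # Ask a follow-up question in the same session, only accepting a yes/no intent back.
    client.publish("hermes/dialogueManager/continueSession", json.dumps({
        "sessionId": payload["sessionId"],
        "text": "Noted. Did you already take your medication?",
        "intentFilter": ["user:ConfirmYesNo"]
    }))

client = mqtt.Client()
client.connect("localhost", 1883)
client.subscribe("hermes/intent/user:GivePainLevel")
client.message_callback_add("hermes/intent/user:GivePainLevel", on_pain_level)
client.loop_forever()
```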


Many thanks for this detailed answer! A couple of other questions :wink:

  • How does Snips' SR compare to Google's (accuracy, range of words it understands -- your medication example is already impressive --, etc.)? (for example: https://dictation.io -- needs Google Chrome browser)
  • Can Snips learn new words for SR? For example, if one of your medication names isn't recognized, can you teach Snips?
  • How long does it take to train Snips on your own wake word?
  • Google SR is rather impressive (at least to me) when there's background noise, even more so when the background noise is music. How does Snips handle background noise and background music?

For the rest I’m going to need my computer to look up benchmarks, but this one depends on the microphone you use. I picked one that had decent benchmarks and has a set of chips onboard that handle the noise cancellation. It is capable of separating music from voice commands, and during my initial tests I was able to have it pick up my voice while I was outside, with the microphone inside about 3-4 metres away, while the neighbours were having a party. It was also able to correctly relay the direction my voice was coming from; I’ve a video recording of that on my laptop. In theory I can repeat that experiment in a couple of hours, as there’s another match being played, so that alone is worth a party and much shouting. Who knows, I might set it up again for it :stuck_out_tongue:

It would be great to know the model when you have a chance.

You’re welcome :stuck_out_tongue:


Now from my laptop, with quotes and links to the official Snips forums... I have to retract a couple of my statements. The ASR is trained the same way as the NLU, meaning it is trained on the intents and the training sentences you give it, and the more accurate those examples are, the more accurate the ASR gets too. So yes, you kind of need to teach it words, but it uses acoustic models of how the language sounds, plus grammar models, to assist. There is a generic model available for the ASR, but its use is discouraged, and last I saw it was deprecated and will be removed in the near future.

A couple of relevant links:
ASR without intents: https://forum.snips.ai/t/use-asr-transcription-without-having-intents/1549 and https://forum.snips.ai/t/snips-generic-asr-model-for-french/1870
Two-year-old benchmarks comparing the NLU to other providers, including Amazon Alexa and Google: Benchmarking Natural Language Understanding Systems: Google, Facebook, Microsoft, Amazon, and Snips | by Alice Coucke | Snips Blog | Medium
Paper from 2018 on the ML architecture behind the model training: [1805.10190] Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces
