writing my own ok google/alexa/siri type system...

gtoal

This post should probably go in the “what I'm doing outside of Scratch” forum, but since there's a version of the parser I'm using on Scratch as well, maybe you'll forgive me for posting here…

I built a home-made Google Assistant using a Raspberry Pi, and it turns out to be easy to get the text that is recognised after “ok google” fed into my own code. So I'm writing a parser to accept requests, and the Pi will act on them. I have all my home automation stuff accessible from the Pi, and I have text to speech on it, so it can also respond to me by voice.

Here is my first crude attempt at a grammar. I finished writing the parser half an hour ago, so this is clearly just a prototype for demo purposes, but it should give you a rough idea of how it will work.

The reason I'm posting here is to ask if anyone knows of usable BNF grammars for things like time of day or other common constructs that might be used in a voice control system. Maybe unit conversions, currency conversions, that kind of thing.

I'll be developing this piecemeal as I add the commands that my wife and I use with our Alexa, but some rigor in strongly typed items of data wouldn't hurt…

P<PROGRAM> = <PREFIX_PLEASE> <SIMPLE> <EOF> | <SIMPLE> <POSTFIX_PLEASE> <EOF> ;
P<PREFIX_PLEASE> = “please” | “would” “you” | ;
P<POSTFIX_PLEASE> = “please” | ;
P<SIMPLE> = <TIME> | <DATE> | <DAY> | <LIGHTS> | <ALARM> | <TIMER>;
P<TIME> = “what” “time” “is” “it” | “whats” “the” “time” | “time” ;
P<DATE> = “whats” “the” “date” | “whats” “todays” “date” | “date” ;
P<DAY> = “what” “day” “is” <THISDAY> | “what” “days” <THISDAY> | “day” “of” “the” “week” | “day” ;
P<THISDAY> = “it” | “this” | “today” ;
P<ALARM> = “set” “alarm” “for” <ABSTIME>;
P<TIMER> = <DELTATIME> “timer”;
P<DELTATIME> = <NUM> “minute”;
P<ABSTIME> = <NUM> “oclock” | <NUM> “a” “m” <TODAYTOMORROW>
| <NUM> “p” “m” <TODAYTOMORROW> | <NUM> “in” “the” <TIMEPERIOD> | <NUM> “tomorrow” <TIMEPERIOD> | <NUM> ;
P<TODAYTOMORROW> = “today” | “tomorrow” | ;
P<TIMEPERIOD> = “morning” | “afternoon” | “evening” ;
P<LIGHTS> = <IDENT> <OFF_OR_ON> | “turn” “the” <IDENT> <OFF_OR_ON> | “turn” <OFF_OR_ON> “the” <IDENT> | <DIM>;
P<DIM> = <IDENT> <UP_OR_DOWN> | “dim” “the” <IDENT> | “dim” <IDENT> ;
P<OFF_OR_ON> = “off” | “on” ;
P<UP_OR_DOWN> = “up” | “down” | “dim” | “low” | “bright” ;
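
For anyone who wants to experiment with a grammar like the one above, here is a minimal sketch in Python (not the C parser from the post) of a backtracking matcher over a small subset of the rules. The dictionary layout and the names `GRAMMAR`, `match` and `parse` are assumptions made for illustration, and `<NUM>` and `<IDENT>` are left out entirely.

```python
# Hypothetical sketch: a tiny backtracking matcher for a subset of the grammar.
# Each rule maps to a list of alternatives; each alternative is a list of
# symbols. Names in angle brackets are rules, anything else is a literal word.
GRAMMAR = {
    "<PROGRAM>": [["<PREFIX_PLEASE>", "<SIMPLE>"], ["<SIMPLE>", "<POSTFIX_PLEASE>"]],
    "<PREFIX_PLEASE>": [["please"], ["would", "you"], []],
    "<POSTFIX_PLEASE>": [["please"], []],
    "<SIMPLE>": [["<TIME>"], ["<DAY>"]],
    "<TIME>": [["what", "time", "is", "it"], ["whats", "the", "time"], ["time"]],
    "<DAY>": [["what", "day", "is", "<THISDAY>"], ["day", "of", "the", "week"], ["day"]],
    "<THISDAY>": [["it"], ["this"], ["today"]],
}


def match(symbol, words, pos):
    """Yield every position reachable after matching symbol at words[pos:]."""
    if symbol not in GRAMMAR:  # literal word
        if pos < len(words) and words[pos] == symbol:
            yield pos + 1
        return
    for alternative in GRAMMAR[symbol]:
        positions = [pos]
        for part in alternative:
            # Advance every surviving position through the next symbol.
            positions = [q for p in positions for q in match(part, words, p)]
        yield from positions


def parse(text):
    """Return True if the whole utterance matches <PROGRAM> (this is the EOF check)."""
    words = text.split()
    return any(end == len(words) for end in match("<PROGRAM>", words, 0))
```

With these rules, `parse("time please")` succeeds and `parse("what time is it now")` is rejected because a word is left over. A real version would also need to return what was matched, not just whether it matched.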

bobbybee

I do wonder, will parsing the input with a grammar like this be too robotic (i.e. as opposed to fancier NLP techniques, probably driven by machine learning)?

blob8108

Have you heard of https://mycroft.ai ? Seems relevant :-)

gtoal

bobbybee wrote:
I do wonder, will parsing the input with a grammar like this be too robotic (i.e. as opposed to fancier NLP techniques, probably driven by machine learning)?

No, because the Google voice recognition is *really* good, and the output from their speech recognition engine doesn't need much tweaking to parse with a strict grammar. I have several dozen sample utterances already, and they're good enough for an old-style parser as long as the grammar is fairly encompassing. I do actually know a little about the state of the art in natural language parsing, but for my purposes an old-fashioned strict grammar is good enough (as long as the speech recognition itself is good: the Alexa recognition isn't, but the Google recognition is just amazing).

I originally intended to do an Eliza-level parser, but I decided against it and went with a Dungeon-style parser because I would rather have an error reported than have fuzzy matching guess at the wrong action. It has driven me nuts that the Amazon Echo has done things like power down my Linux system when I was just trying to turn a light off.

(An option that is tempting, but would require more work than I'm willing to put in right now, would be to adapt an actual Dungeon parser (e.g. TADS) to speech input and output. Might be fun playing IF by voice!)

This parser is adapted from a simple programming-language parser. It maps the fixed parts of the grammar to what would be keywords, and the variable slots to the equivalent of variable names (e.g. “bedroom” and “living room” would be variables, and “turn on the … lights” would be a sequence of keywords).
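
The keyword-and-slot idea can be sketched roughly as follows. This is an illustrative Python sketch, not the C code from the post: the pattern list, the device names and the function `parse_lights` are all assumptions. Anything that does not match exactly is reported as an error rather than guessed at.

```python
# Hypothetical sketch: fixed keywords plus one variable slot, strict matching.
KNOWN_DEVICES = {"bedroom", "living room", "kitchen"}  # assumed device names

# Each pattern is (words before the slot, words after the slot, action).
PATTERNS = [
    (["turn", "on", "the"], ["lights"], "on"),
    (["turn", "off", "the"], ["lights"], "off"),
    (["turn", "the"], ["lights", "on"], "on"),
    (["turn", "the"], ["lights", "off"], "off"),
]


def parse_lights(text):
    """Return (device, action), or raise ValueError if nothing matches exactly."""
    words = text.split()
    for before, after, action in PATTERNS:
        n_slot = len(words) - len(before) - len(after)
        if n_slot < 1:
            continue  # no room left for a device name
        if words[:len(before)] != before or words[len(words) - len(after):] != after:
            continue  # the fixed keywords do not line up
        device = " ".join(words[len(before):len(before) + n_slot])
        if device in KNOWN_DEVICES:
            return device, action
    raise ValueError("not understood: " + text)
```

For example, `parse_lights("turn on the living room lights")` gives `("living room", "on")`, while an unknown device name raises an error instead of triggering some other action.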

After tweaking for my own environment I do expect the performance to be better than an off-the-shelf tool like Alexa, but because it is likely to be quite customised I don't have any intention of releasing it for others. (Well, I'll put the code out there, but only for programmers like myself who are willing to tweak it to do what they want too.) For instance, the grammar is embedded in C code, so to make quick changes I just edit the code and type make, and it instantly understands some new sentence. Why bother inventing config file formats when all they boil down to is extra code to read the format into the same program structure anyway… :-) (By the way, there are formats out there that are suitable for this kind of grammar, JSGF for example. Overkill for what I want…)

In terms of the Alexa world, doing it this way lets me implement multiple skills on the fly within a single command interpreter, without all the overhead of registering Alexa skills. It also allows any utterance at any time, without the stupid Alexa method of asking a skill by name to perform an action.

G

Last edited by gtoal (June 25, 2017 18:01:51)

gtoal

blob8108 wrote:
Have you heard of https://mycroft.ai ? Seems relevant :-)

I know about it. And Jarvis and all the others. I still prefer to write my own, although for now I'm happy to let someone else do the speech to text part. (Talking of which, how is OpenTTS coming along?) I wouldn't have taken this on if I hadn't accidentally discovered that the Raspberry Pi version of Google Assistant prints out the recognised text on stdout as a diagnostic, letting me piggyback on their engine (I just turn the speaker off, let it respond silently, and handle my own response on another Pi), and that the recognition rate was phenomenally good. Until then, I had been using an Alexa and wasn't too impressed with either the speech recognition or the API for adding your own commands.

At some point, if anyone ever supplies a good speech to phoneme program, I'd quite like to tackle phoneme stream -> text …

G

blob8108

Now I want to try this! Parsing is fun, and our clockwork kitchen timer just stopped working…clearly this is the solution

Thanks for posting about this, Graham!

I wonder if it's possible to use the Google cloud APIs directly? https://cloud.google.com/speech/ It's free for an hour of usage a month, which for small control snippets seems like plenty. I'm not sure how you'd trigger it, though…

gtoal

blob8108 wrote:
Now I want to try this! Parsing is fun, and our clockwork kitchen timer just stopped working…clearly this is the solution Thanks for posting about this, Graham!

I wonder if it's possible to use the Google cloud APIs directly? https://cloud.google.com/speech/ It's free for an hour of usage a month, which for small control snippets seems like plenty. I'm not sure how you'd trigger it, though…

Sorry, don't know, haven't looked into it.

I did previously look into the Alexa equivalent, and the biggest hassle was that you had to either use a physical button to activate it or use a local activation-word recogniser from some other source.

Hey… here's my log of test utterances as fed back by Google (commented where they didn't quite match). I think you'll agree that the hit rate is nothing short of astounding. Note that I don't have things like electronic cat door locks; I just said these things as a test. By the way, I'm using a PS3 Eye webcam as the microphone: it has a 4-mic array, so it gives better rendition than a simple USB mic. Also they are *dirt cheap* right now. I got 20 on eBay for $30…

By the way I'm stripping apostrophes before passing the text to my code. Google actually gets them fairly accurately.

hello world
what time is it
what time is it
tell me something interesting
execute Unix command for me LX => execute unix command for me ls
turn the television off please
say hello to the pussy cat
local the heights doors and turn the lights off => lock all the house doors and turn the lights off
look older outside doors => lock all the outside doors
wreck a nice Beach => wreck a nice speech
recognize speech
whats for breakfast
whats up
what stricken => whats ticking
whats my name
its time for bed
evening pools => evening pills
take my pills
repeat after me
repeat after me
set the burglar alarm
Okay Google repeat after me
repeat after me
open the count door => open the cat door
turn off the living room light
turn the lights on
what time is it
Im going to the Kure => Im going to the Kurai
wheres my phone
talk to me
quit copying me
what time is it
turn off the kitchen light
are you awake
reboot the hive
what time is it
whats your name
what time is it
wheres my wife
whats in the fridge
set the alarm for 3:15 a.m.
15 minute timer
set an alarm for quarter past 3 p.m.
whats 500 pounds in dollars
convert 98 degrees Fahrenheit into Centigrade
square root of 507
call my mom on Skype => call my mum on skype
turn on BBC America
turn the volume down
LS Dodge Carolina – no idea. maybe something with ‘oscar lima’ in it?
ring of Ciara please – no idea. ciara could have been ‘sierra’ - phonetic alphabet for radio signallers
I spell alpha bravo charlie
Lennox I spell Limon Sierra and => linux I spell lima sierra end – spelling out a command ‘ls’
lets play Scrabble
whats currently playing on BBC America
2 minute timer
cancel
Ive taken my pills
remind me to check eBay in the morning
cancel alarm
recommend a good movie is on tonight => recommend a good movie thats on tonight
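
The apostrophe stripping mentioned before the log could look something like this in Python. It is only a sketch: the function name `normalise` is made up, and the lowercasing is an extra assumption (the log shows mixed case such as “BBC America”, while the grammar's keywords are all lowercase).

```python
# Hypothetical sketch: normalise recognised text before it reaches the parser.
def normalise(text):
    """Strip apostrophes, lowercase, and split into words."""
    for apostrophe in ("'", "\u2019"):  # straight and typographic apostrophes
        text = text.replace(apostrophe, "")
    return text.lower().split()
```

The resulting word list is the kind of input a keyword-based grammar expects, e.g. “whats” rather than “what's”.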

Last edited by gtoal (June 25, 2017 20:52:12)

blob8108

gtoal wrote:
At some point, if anyone ever supplies a good speech to phoneme program, I'd quite like to tackle phoneme stream -> text …

What, and write your own decoder?! Sounds hard! :-) http://www.voxforge.org/home/docs/faq/faq/what-is-an-acoustic-model looks like a relevant page…

I'm pretty sure I can use the Google API directly, and from trying the demo on https://cloud.google.com/speech/ it seems pretty good—not perfect, but still quite impressive. I just need to figure out how to implement a wake word…

gtoal

blob8108 wrote:
gtoal wrote:
At some point, if anyone ever supplies a good speech to phoneme program, I'd quite like to tackle phoneme stream -> text …
What, and write your own decoder?! Sounds hard! :-) http://www.voxforge.org/home/docs/faq/faq/what-is-an-acoustic-model looks like a relevant page…

It is hard, but I've done something extremely similar once before in a spelling correction project so I know how to do it. I can take a sentence where all the words are run together and extract all the different ways it could have been composed from individual words, then pick the most likely path through the graph using probabilities of word combinations and grammar fragments. But that's for the stage when you have a stream of phonemes. Getting the phonemes is a much harder task and one that I'm not especially interested in.
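
As a rough illustration of the run-together-words problem described above, here is a small Python sketch that finds the most probable split of a string using dynamic programming. It is a simplification: it scores single words only, whereas the approach described in the post also uses probabilities of word combinations and grammar fragments. The word list and its probabilities are invented for the example.

```python
import math

# Hypothetical toy unigram probabilities; a real system would estimate these
# from a corpus and would also score word pairs.
WORD_PROB = {
    "turn": 0.2, "the": 0.3, "light": 0.1, "lights": 0.1,
    "off": 0.1, "of": 0.1, "f": 0.001, "s": 0.001, "on": 0.1,
}


def segment(text):
    """Return the most probable split of text into known words, or None."""
    # best[i] = (log probability, word list) for the best split of text[:i]
    best = [None] * (len(text) + 1)
    best[0] = (0.0, [])
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is None or word not in WORD_PROB:
                continue
            score = best[start][0] + math.log(WORD_PROB[word])
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[-1][1] if best[-1] else None
```

Here `segment("turnthelightsoff")` prefers “turn the lights off” over alternatives such as “turn the light s off”, because every extra unlikely word lowers the total score.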

blob8108 wrote:
I'm pretty sure I can use the Google API directly, and from trying the demo on https://cloud.google.com/speech/ it seems pretty good—not perfect, but still quite impressive. I just need to figure out how to implement a wake word…

https://snowboy.kitt.ai/
https://www.hackster.io/shiva-siddharth/install-alexa-on-raspberry-pi-with-wake-word-and-airplay-15fad4

Last edited by gtoal (June 25, 2017 21:14:33)

gtoal

The Google Speech API demo literally does nothing when I try it. The little coloured dots wobble to suggest it is listening, but when I hit stop I get no output. And this is using Google Chrome, so surely it's a compatible browser! Well, enough being sidetracked, I need to get back to repairing my Raspberry Pi. The SD card just failed :-( Fortunately I am backed up to about 24 hrs ago. UNfortunately the parser I wrote for this project was all written within the last 24 hrs and not backed up :-( Oh well, it wasn't that big a job. I can do it again, maybe a bit better this time :-(

EDIT: Phew… did have a copy of the parser after all… http://gtoal.com/hamish/hamish.c.html

The only code lost was a trivial script to scrape the debug output of Google Home on the Pi and send it to syslog.

(syslog then sends it to another machine, where a program scrapes the syslog for voice commands and sends them to my parser - seemed easier than writing a new protocol for communicating between machines… I just use ‘logger’ :-) - no actual code to be written!)
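
The scraping step on the receiving machine could be sketched like this in Python. It is not the lost script or the author's program: it assumes the sender tags its messages with something like `logger -t voice "..."`, and the tag name `voice` and the function `extract_commands` are made up for the example.

```python
import re

# Hypothetical sketch: pull voice commands out of syslog lines.
# Matches "voice: text" and "voice[pid]: text" (the pid form appears when
# the sender logs its process id).
LINE = re.compile(r"\bvoice(?:\[\d+\])?: (?P<command>.*)$")


def extract_commands(lines):
    """Yield the command text from each syslog line carrying the voice tag."""
    for line in lines:
        found = LINE.search(line.rstrip("\n"))
        if found:
            yield found.group("command")
```

Lines from other sources that share the same log simply fail the match and are skipped, which is what makes the several-writers, one-reader arrangement cheap.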

G

Last edited by gtoal (June 25, 2017 21:30:25)

bobbybee

gtoal wrote:
(syslog then sends it to another machine, where a program scrapes the syslog for voice commands and sends them to my parser - seemed easier than writing a new protocol for communicating between machines… I just use ‘logger’ :-) - no actual code to be written!)

Consider just using netcat?

gtoal

bobbybee wrote:
gtoal wrote:
(syslog then sends it to another machine, where a program scrapes the syslog for voice commands and sends them to my parser - seemed easier than writing a new protocol for communicating between machines… I just use ‘logger’ :-) - no actual code to be written!)

Consider just using netcat?

syslog works pretty well *and you get free logging*… not to mention free cycling and purging of the logs. Also it makes multiple writer single reader trivial…

It fits well into my overall strategy. For example, I have a couple of dozen PIR sensors on separate machines which send motion reports via syslog. Likewise, the ‘motion’ netcam monitor logs via syslog anyway, so I get alerts from it for free too.

I know it's not the intended use of syslog, but it works out pretty well and has saved a bunch of code writing. I've written enough TCP/IP servers to know how to do it properly but also *when* to do it properly, and for this app, doing it properly is overkill. For example, why bother adding security to connections when it's entirely within your own home behind a firewall? So no login handshakes, no SSL etc. If someone is behind my firewall already, I have much bigger problems than worrying about them turning my lights off…

G

iamunknown2

Maybe you could find a list of ways to ask for the time, find their syntax tree equivalents, and use them for the supervised learning of a parser. Here are some samples:

"what time is it" >>
(is
  (time
    (what)
  )
  (it)
)

"what is the time now" >>
(is
  (what)
  (now
    (the
      (time)
    )
  )
)

"please tell me the time" >>
(please
  (tell
    (me)
    (the
      (time)
    )
  )
)

The first component of a bracket is the command/modifier, the rest are the parameters (for the verbs, the parameters are the subject/object). “What” is used to represent an unknown.

However, it seems that it might not be simple - in the first case (“what time is it?”), “is” means “belongs to”. On the other hand, in the second case (“what is the time now?”), “is” means “equal to”.
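
To work with trees written in the bracket notation above, they first need to be read into a data structure. Here is a minimal Python sketch that turns one into nested lists; the function name `read_tree` and the nested-list representation are assumptions, not part of any existing tool.

```python
# Hypothetical sketch: read the bracket notation into nested Python lists.
def read_tree(text):
    """Parse '(is (time (what)) (it))' into ['is', ['time', ['what']], ['it']]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        node = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read_node())  # nested parameter
            else:
                node.append(tokens[pos])  # command/modifier word
                pos += 1
        if pos >= len(tokens):
            raise ValueError("missing ')'")
        pos += 1  # consume ')'
        return node

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("unexpected text after tree")
    return tree
```

The first element of each list is the command/modifier and the remaining elements are its parameters, matching the convention described above.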

Also, if you do get your system to work, I suggest you call it MUTHUR

Last edited by iamunknown2 (June 27, 2017 12:38:59)

iamunknown2

On the other hand, maybe it would be easier if you made a bunch of pre-made sentences, then selected the one that most closely matches what was spoken.
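
A quick way to try the closest-match idea is Python's standard `difflib` module. This is only a sketch: the canned sentences, the action names and the 0.8 cutoff are all arbitrary choices for illustration.

```python
import difflib

# Hypothetical canned sentences mapped to action names (both are assumptions).
SENTENCES = {
    "what time is it": "tell_time",
    "turn the lights on": "lights_on",
    "turn the lights off": "lights_off",
}


def closest_action(spoken, cutoff=0.8):
    """Return the action for the most similar canned sentence, or None."""
    hits = difflib.get_close_matches(spoken, list(SENTENCES), n=1, cutoff=cutoff)
    return SENTENCES[hits[0]] if hits else None
```

Note that “turn the lights on” and “turn the lights off” are themselves very similar strings, so the cutoff matters: set it too low and this kind of matcher will happily guess at the wrong action.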
