Hermod Voice Dialog Suite

Hi, I mentioned a while back that I’ve been tinkering with a voice toolset, hermod, to support building an Alexa-like device or a voice-integrated web page.

I’ve put up a demo using voice to help solve crossword puzzles. https://edison.syntithenai.com

Inspired by Snips, the architecture involves a central MQTT server as the only point of communication between a suite of services that support a dialog flow. It relies on RASA for NLU, routing and session management.

It started out as a nodejs application, but I’ve rebuilt the whole thing in Python for better cross-platform availability of core libraries (particularly on ARM).

It can be used alongside an existing RASA installation via the HTTP API, or, for hardware-constrained environments, RASA and all the other required services can be run in a single asyncio loop.
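For the HTTP path, the integration is essentially the standard RASA REST endpoint. A minimal sketch, assuming a stock RASA server running on its default port 5005:

```python
# Minimal sketch: send a transcribed utterance to an existing RASA server
# via its HTTP API (assumes a RASA server on the default port 5005).
import requests


def parse_text(text, rasa_url="http://localhost:5005"):
    """Return RASA's NLU parse (intent + entities) for a transcription."""
    response = requests.post(f"{rasa_url}/model/parse", json={"text": text})
    response.raise_for_status()
    return response.json()


print(parse_text("1 down is strawberry"))
```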

While there is some learning and performance overhead in using network communications for a suite of services on a single computer, the approach has the following advantages:

  • services can be distributed over multiple hardware devices for high-concurrency applications or LAN solutions (low-power satellites + central RASA/ASR/TTS)
  • the service contract is clearly defined in a handful of possible messages, making it easier to extend without side effects or to provide alternate implementations, e.g. Google, IBM and Deepspeech implementations are provided for speech recognition (see the sketch after this list for the general shape of such a service).
  • a web browser using MQTT over websockets can be a first-class member of the suite.
  • with messaging in place, the encapsulated-service-over-MQTT approach can be applied to other elements of an application
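To give a feel for what an encapsulated service looks like, here is a minimal sketch using paho-mqtt. The topic names and payload fields are invented for illustration and are not necessarily hermod’s actual message contract:

```python
# Illustrative only: topic names and payload fields here are made up for the
# example and do not necessarily match hermod's real message contract.
import json
import paho.mqtt.client as mqtt

SITE = "default"


def on_connect(client, userdata, flags, rc):
    # Each service subscribes only to the messages it cares about.
    client.subscribe(f"hermod/{SITE}/asr/text")


def on_message(client, userdata, msg):
    payload = json.loads(msg.payload.decode("utf-8"))
    text = payload.get("text", "")
    # A hypothetical NLU service: take the transcription and publish a parse result.
    result = {"text": text, "intent": "solve_clue", "entities": []}
    client.publish(f"hermod/{SITE}/nlu/parse", json.dumps(result))


client = mqtt.Client()
client.on_connect = on_connect
client.on_message = on_message
client.connect("localhost", 1883)
client.loop_forever()
```

Because the service only ever talks to the broker, an alternate implementation (or a browser client over websockets) just needs to honour the same topics and payloads.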

Juste did some work integrating Deepspeech and RASA from a web browser with less overhead, communicating directly with a Deepspeech HTTP streaming server and the RASA HTTP API (Build an AI voice assistant with Rasa Open Source and Mozilla tools | Rasa). It made me question whether I was overcomplicating things and creating an unnecessary processing burden.

Picovoice stands out as offering WebAssembly solutions for hotword, speech recognition (ASR) and natural language understanding (NLU) running purely in a browser, but they aren’t entirely open source. https://picovoice.ai/tutorials/using-picovoice-engines-with-react/

At the end of the day, the crossword example is running on an AWS t3.nano with 1 processor and 1G of RAM. It does offload voice processing to Google, but it runs RASA and all the other services and seems to hold up under mild concurrent load, so I don’t think the messaging overhead is a choke point. For my use cases, the features of a messaging-based approach are worth the added complexity. Horses for courses.

Any feedback much appreciated.

cheers Steve

PS. I have in the past installed RASA on ARM (Raspberry Pi 3), but my recent attempts to install have failed due to missing libraries. Deepspeech works great on a Pi 4. I’d love to see open source standalone speech dialog on a Pi 4. If anyone could share a Dockerfile to build on ARM, that would be AWESOME.

I’ve just played with the system. I’d say it’s pretty cool!

  • User interface wise, it took me a while to find the “help” button. It might be nice to make that a bit easier to find.
  • I noticed that when the voice-to-text works it’s a neat experience. However, there are a few hiccups. It sometimes takes a few tries for it to figure out that I’m saying “14” instead of “10”.
  • Another common hiccup I found is that I said “1 down is a strawberry” and it was marked wrong because the answer was “strawberry”, not “a strawberry”.

As far as architecture goes, I’d say that your approach of “keeping it small and light” sounds right. One thing I wonder … do you receive the text that folks say and the labels? As in, have you got a labelling pipeline?

Hey Vincent,

Funny thing: I did have the help content as the home page, but in my initial acceptance testing I found that people were unwilling to read the five lines of help text and wanted an obvious sign of what they were supposed to do, so I pushed the crossword list to the front.

Your “14” heard as “10” might be a function of the very small delay that is required after the hotword.

Indeed, the NLU model is not perfect at identifying entities, which is the thinking behind the purple NLU fixer buttons that allow a user to select the text they want treated as an entity.

Text transcriptions, NLU parse responses and NLU corrections are all saved to Mongo in RASA md format. Stories are not captured yet.

I haven’t got around to doing anything with the captured data yet, but it will be pretty easy to dump it into a text file and include it as training data (roughly as in the sketch below).
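For anyone curious, dumping the captured examples into RASA’s Markdown training format could look roughly like this. The Mongo database, collection and field names are assumptions for the sake of the example, not hermod’s actual schema:

```python
# Rough sketch: export captured parses/corrections to a RASA Markdown NLU file.
# Database, collection and field names are illustrative assumptions only.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
examples = client["hermod"]["nlu_corrections"].find()

# Group example texts by intent; each record is assumed to carry the spoken
# text with entities already marked up in md form, e.g. "[strawberry](answer) is 1 down".
by_intent = {}
for example in examples:
    by_intent.setdefault(example["intent"], []).append(example["text"])

with open("nlu.md", "w") as out:
    for intent, texts in by_intent.items():
        out.write(f"## intent:{intent}\n")
        for text in texts:
            out.write(f"- {text}\n")
        out.write("\n")
```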

cheers

Steve

My main advice, then, would be to give Rasa X a spin. It should help out a fair bit with labelling and re-deploying. You can also hook it up to a git repo for CI/CD. It’s free, and you can use the documentation here to get started.

Very interesting Steve, I would find a good use case for it. Can you share a little about how you did that with RASA and the Mozilla text reader? Thanks in advance, Maxime

Hi Maxime,

Detailed documentation is at GitHub - syntithenai/hermod: voice services stack from audio hardware through hotword, ASR, NLU, AI routing and TTS bound by messaging protocol over MQTT

cheers Steve
