Hi, I mentioned a while back that I’ve been tinkering with a voice toolset, hermod, to support building an Alexa-like device or a voice-integrated web page.
I’ve put up a demo using voice to help solve crossword puzzles. https://edison.syntithenai.com
Inspired by Snips, the architecture uses a central MQTT server as the only point of communication between a suite of services that support a dialog flow. It relies on RASA for NLU, routing and session management.
Originally a nodejs application, it has been rebuilt entirely in Python for better cross-platform availability of core libraries (particularly on ARM).
It can be used alongside an existing RASA installation via the HTTP API, or, for hardware-constrained environments, RASA and all other required services can be run in a single asyncio loop.
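To illustrate the single-event-loop idea, here is a minimal sketch of several cooperating services sharing one asyncio loop and passing messages through queues. The service names and message shapes are purely illustrative, not hermod’s actual modules:

```python
import asyncio

async def service(name, inbox, outbox):
    # A stand-in for one voice service (ASR, NLU, TTS, ...): consume
    # messages from its inbox, do some work, publish to its outbox.
    while True:
        msg = await inbox.get()
        if msg is None:  # shutdown sentinel
            break
        await outbox.put(f"{name}:{msg}")

async def main():
    # All "services" run as tasks on the same event loop, so a single
    # low-power machine only pays for one Python process.
    q_in, q_out = asyncio.Queue(), asyncio.Queue()
    task = asyncio.create_task(service("asr", q_in, q_out))
    await q_in.put("hello")
    result = await q_out.get()
    await q_in.put(None)  # ask the service to shut down
    await task
    return result

print(asyncio.run(main()))  # asr:hello
```

In the real toolset the queues would be replaced by MQTT topics, but the concurrency shape is the same: one loop, many long-running tasks.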
While there is some learning and performance overhead in using network communications for a suite of services on a single computer, the approach has several advantages:
- services can be distributed over multiple hardware devices for high-concurrency applications or LAN solutions (low-power satellites + central rasa/ASR/TTS)
- the service contract is clearly defined in a handful of possible messages, making it easier to extend without side effects or to provide alternate implementations. E.g. Google, IBM and Deepspeech implementations are provided for speech recognition.
- a web browser using MQTT over websockets can be a first-class member of the suite.
- with messaging in place, the encapsulated-service-over-MQTT approach can be applied to other elements of an application.
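To make the “handful of possible messages” point concrete, here is a hypothetical sketch of a topic/payload contract, with the topic layout loosely modeled on Snips’ hermes convention. The topic names and payload fields here are assumptions for illustration, not hermod’s actual contract:

```python
import json

def make_message(site_id, topic_suffix, payload):
    """Build a (topic, json_body) pair for a per-site dialog message.

    Hypothetical convention: topics are namespaced by toolset and site,
    e.g. hermod/<site>/asr/text, and every payload carries its siteId so
    any service (or a browser over websockets) can filter by site.
    """
    topic = f"hermod/{site_id}/{topic_suffix}"
    return topic, json.dumps({"siteId": site_id, **payload})

topic, body = make_message("kitchen", "asr/text", {"text": "five across"})
print(topic)                       # hermod/kitchen/asr/text
print(json.loads(body)["text"])    # five across
```

Because every service only sees these topic/payload pairs, swapping the Google ASR service for a Deepspeech one is just a matter of subscribing and publishing to the same topics.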
Juste did some work integrating Deepspeech and RASA from a web browser with less overhead, communicating directly with a Deepspeech HTTP streaming server and the RASA HTTP API (Build an AI voice assistant with Rasa Open Source and Mozilla tools | Rasa). It made me question whether I was overcomplicating things and creating an unnecessary processing burden.
Picovoice stands out as offering webassembly solutions for hotword, speech recognition (ASR) and natural language understanding (NLU) running purely in a browser, but they aren’t entirely open source. https://picovoice.ai/tutorials/using-picovoice-engines-with-react/
At the end of the day, the crossword example is running on an AWS t3.nano with 1 processor and 1G RAM. It does offload voice processing to Google, but it runs RASA and all the other services and seems to hold up under mild concurrent load, so I don’t think the messaging overhead is a choke point. For my use cases, the added complexity and features of a messaging-based approach are beneficial. Horses for courses.
Any feedback much appreciated.
cheers Steve
PS. I have in the past installed RASA on ARM (Raspberry Pi 3), but my recent attempts to install have failed due to missing libraries. Deepspeech works great on a Pi 4. I’d love to see open source standalone speech dialog on a Pi 4. If anyone could share a Dockerfile to build on ARM, that would be AWESOME.