- Started Sara - Rasa Demo bot: https://localhost:8080 opens and the page refreshes with the assistant ready to talk. But when I press the mic button and say “Hello there”, there is no response. I can see it is capturing the voice, but I get no output. Log attached: out.log (3.4 MB)
- The quality of the test audio, and of the STT and TTS, is not good at all. How can we fine-tune it to be as good as Google’s? https://europe1.discourse-cdn.com/flex013/uploads/rasa/original/2X/6/68a98a188ce8f2fa333a885af7ae8e4c834e2bed.wav
Hi @harishruparel. On your second point, regarding quality: this isn’t an easy task, but here are some things to consider.
Firstly, both the TTS and DeepSpeech are improving continually, so it’s worth moving to the latest versions.
I know the repo/blog says to use DeepSpeech 0.5, but changing it to use 0.6, whilst requiring a little bit of a rewrite (sketched below), should help, as they used a lot more training data for the acoustic model released with 0.6.0. Bear in mind, though, that it will struggle somewhat for non-US English speakers, as most of the training data is US English.
It manages acceptably with my British English accent but still slips up (partly my fault, I’m sure!). With the upcoming releases I believe they’re planning to train it with more background noise (“augmentation”), which may help it cope better with challenging input.
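For illustration, here’s a minimal sketch of the client-side rewrite, assuming 16 kHz, 16-bit mono WAV input; the file paths are just placeholders, so check the release notes for your exact versions:

```python
# Sketch of the 0.5 -> 0.6 client change; file paths are placeholders.
import wave

import numpy as np
from deepspeech import Model

with wave.open("test.wav", "rb") as w:  # 16 kHz, 16-bit mono WAV
    audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)

# DeepSpeech 0.5.x needed the feature geometry and alphabet, plus the
# sample rate on every call:
#   ds = Model("output_graph.pbmm", 26, 9, "alphabet.txt", 500)
#   print(ds.stt(audio, 16000))

# DeepSpeech 0.6.x drops those arguments; the sample rate comes from
# the model file itself:
ds = Model("output_graph.pbmm", 500)  # model path, beam width
print(ds.stt(audio))
```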
In theory one could fine-tune it for a particular accent, but this is a major undertaking: it would need a lot of good-quality audio data (with accurate transcriptions) and a lot of computing resources to train it repeatedly until the results are good. The details are covered here: https://github.com/mozilla/DeepSpeech/blob/master/TRAINING.rst
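To give a feel for the data side: DeepSpeech’s training script consumes CSV files with the header wav_filename,wav_filesize,transcript. A minimal sketch for building one, assuming a clips/ directory of WAVs each with a matching .txt transcript (my own layout, not something the docs mandate):

```python
# Build a DeepSpeech training CSV from clips/*.wav + matching *.txt files.
import csv
import os
from pathlib import Path

with open("train.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["wav_filename", "wav_filesize", "transcript"])
    for wav in sorted(Path("clips").glob("*.wav")):
        transcript = wav.with_suffix(".txt").read_text().strip().lower()
        writer.writerow([str(wav), os.path.getsize(wav), transcript])
```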
Two more immediate things to look at are:
1. Mic quality - ensure that you’ve got a decent mic, and listen to the audio the bot actually receives (if it’s poor quality or indistinct, the STT is going to struggle).
2. Language model - if people will speak to your bot using a fairly narrow vocabulary, you might get better results by training a custom language model. The downside is that it will then only accept near-variants of the sentences you trained it on, but it might help. It’s still useful to include as many relevant training sentences as you can, as this helps the LM figure out which word combinations are more likely.
There are details of how the LM was trained here: https://github.com/mozilla/DeepSpeech/tree/master/data/lm - you’d need to substitute your own sentence list for the dataset their script downloads, but it’s a fairly short script, so it’s easy to see what it’s doing.
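If it helps, here’s a rough sketch of that pipeline driven from Python, assuming KenLM’s lmplz and build_binary tools are on your PATH and that sentences.txt (my placeholder name) holds one lower-cased sentence per line:

```python
# Build a custom KenLM language model from your own sentence list.
import subprocess

with open("sentences.txt") as src, open("lm.arpa", "w") as arpa:
    # 5-gram ARPA model; --discount_fallback helps on small corpora
    subprocess.run(["lmplz", "--order", "5", "--discount_fallback"],
                   stdin=src, stdout=arpa, check=True)

# Convert the ARPA file to KenLM's compact binary format
subprocess.run(["build_binary", "lm.arpa", "lm.binary"], check=True)

# DeepSpeech 0.6 also needs a matching trie, built with the
# generate_trie tool from the native client:
#   generate_trie alphabet.txt lm.binary trie
```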
As a quick experiment, I tried this with the vocabulary from Sara (i.e. all the text under https://github.com/RasaHQ/rasa-demo/blob/master/data/nlu/nlu.md, stripping out the entity markup and removing the intent “header lines” - see the sketch below). It certainly helps when saying things similar to those sentences, but even so, it still struggles with more obscure input (e.g. it would be hard to get it to work with email addresses without a lot more work normalising the text to how it is actually spoken).
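The stripping itself is simple enough; a rough sketch of what I mean (the regex handles the Markdown-style [word](entity) markup, and the output file feeds the LM build above):

```python
# Turn rasa-demo's nlu.md into a plain sentence list for LM training.
import re

sentences = []
with open("nlu.md") as f:
    for line in f:
        line = line.strip()
        if not line.startswith("- "):  # skip "## intent:..." headers, blanks
            continue
        text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", line[2:])  # [word](entity) -> word
        sentences.append(text.lower())

with open("sentences.txt", "w") as out:
    out.write("\n".join(sentences) + "\n")
```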
Anyway, I hope those pointers give you food for thought. As I say, it’s not an easy task, but progress does seem to be happening gradually.
Kind regards, Neil