AMA with Dr. Catherine Breslin, L3-AI Speaker Series (Opens June 1 - AMA on June 4)

The L3-AI conference brings together speakers from all over the world who are experts in the conversational AI community. During the conference, they’ll be sharing their work building truly interactive AI assistants.

But before we kick off L3-AI on June 18th, we want to give you a chance to get to know some of our speakers by hosting a series of Ask Me Anything (AMA) sessions in the forum.

How does it work?

On Monday, June 1, we’ll open this thread to pre-submitted questions. Once we open the thread, you’re free to ask our speaker anything (especially as it relates to conversational interfaces and NLU :wink:). On Thursday, June 4, 8am-9am PDT/5pm-6pm CEST, Dr. Breslin will be available live for one hour to answer both pre-submitted and live questions in this forum thread. Be sure to react to other questions you’re interested in, so speakers can see which questions have the most community interest :cowboy_hat_face: At the end of the AMA, we’ll close the thread, but you can catch Dr. Breslin again at L3-AI!

About Dr. Breslin:

Dr. Catherine Breslin is a machine learning scientist and manager. Since completing her PhD at the University of Cambridge, she has gained commercial and academic experience in automatic speech recognition, natural language understanding, and human-computer dialogue systems. Previously, she led the Amazon Alexa AI team as a Manager of Applied Science, and she currently works with Cobalt Speech and Language, advising companies and building high-performing voice and language technology.


Hi! I’ve got a question about voice. For teams building conversational AI, what do you think are the most important differences between building ‘voice’ applications on Google Home & Alexa versus building ‘full stack’ voice applications (with your own speech-to-text and text-to-speech)?


A post was split to a new topic: What are the best resources to fetch data for training NLU?


I’d like to know if you are aware of any recently discovered or developed ways to collect or generate training data for ASR. Especially for languages other than English, this seems kind of tricky if you want to do it on your own. I am currently struggling with German in this case.

Also, which training approach would you recommend? I am using DeepSpeech, but the results are not good at all.

Kind regards and thanks! Julian


@karen-white: Actually, my question was for Dr. Catherine. What are the best resources for training data? Where do researchers find data to train their models? Do we have to purchase data from other vendors, or is it publicly available?

Thanks for the clarification on intent classification. I will definitely go through Rasa NLU in Depth: Intent Classification.


My question is: I am using an entity that appears in multiple intents, and the bot gets confused about which intent has been called, so it gives wrong utterances in many scenarios. Is it possible to define an intent that consists only of the entity?

My question is: I am building a support chatbot, so the main entities here would be the nature of the problem and the ID of the product affected by it. Similar kinds of questions apply to several products, and the bot has to recognize the relevant issue and lead the user to the corresponding solution. I am confused about whether I should use slots and forms, or lookup tables and entities.

If lookup tables apply, how do I shift the focus of the bot from one solution to another if I have hundreds of products?

A possibly newbie question on voice interaction and the processing of non-verbal parts of speech, including hesitations, breaks in speech, and repetitions. These seem pretty common, and they vary with local speech patterns. Most speech engines can filter them out to output “cleaned” text, but I am wondering how important they could be in understanding user intents.

So, to sum up: have you seen specific work on taking these non-verbal parts of speech into account, and how important could they be in improving the conversation?


Hi all!

Thanks Rasa for inviting me here ahead of your L3-AI conference in a couple of weeks’ time - I’m looking forward to speaking there!

Today I’m here for the next hour or so, answering your questions about voice & language technology so please ask away and I’ll do my best to answer :slight_smile:


Hi @amn41 :wave:t2:

I think that Amazon and Google have done a great job of making voice technology available via Alexa and the Google Assistant. They let you build up voice experiences (or skills) which are a bit like apps for your smartphone, and you don’t need to know too much about what goes on under the hood. Building & deploying a skill for Alexa makes it available for many users of these devices worldwide.

But there are times when those platforms can be limiting - for example, if you want to deploy something in a language that isn’t supported, or if you’re concerned about privacy or connectivity and so want your technology to run on a local machine rather than in the cloud. In those cases, you may want the flexibility of a custom tech stack.

However, a lot of expertise goes into building speech technology, and not every company can hire the right team to do so. So there’s a trade-off here between ease of use and flexibility.


@surendra_koritala & @JulianGerhard :wave:t2:

Like all machine learning, ASR needs data. Speech recognition usually has two sources of data - transcribed audio and written text. The audio data is used to train the acoustic model, and the text data the language model. Audio data is the most difficult to source, though there are some open sources of data in various languages, and I’ve come across several lists of open datasets for ASR.

Many folks purchase data - the LDC (Linguistic Data Consortium) has a large selection available.

Failing that, doing your own data collection and having it transcribed is another option. There are many companies that will do the transcription for you.

To alleviate data issues, model training techniques like transfer learning are often used. They allow you to start with a model trained for a language where there is a lot of data (e.g. US English), and transfer to another language where less data is available (e.g. Swahili).
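To make the transfer learning idea concrete, here is a minimal PyTorch sketch - the model, layer sizes, and phone counts are all made up for illustration. The encoder, pretrained on a high-resource language, is frozen, and only a fresh output layer for the low-resource language is left trainable:

```python
import torch.nn as nn

# Toy acoustic model: an encoder (imagine it was trained on thousands of
# hours of English audio) followed by a per-language output layer.
class ToyAcousticModel(nn.Module):
    def __init__(self, n_mels=80, hidden=256, n_phones=40):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_mels, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.output = nn.Linear(hidden, n_phones)

    def forward(self, x):
        return self.output(self.encoder(x))

model = ToyAcousticModel(n_phones=40)

# Transfer to a low-resource language: freeze the encoder so its learned
# acoustic representations are kept, and swap in a fresh output layer
# sized for the new language's phone set (55 phones is an invented number).
for p in model.encoder.parameters():
    p.requires_grad = False
model.output = nn.Linear(256, 55)

# Only the new output layer will be updated during fine-tuning.
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['output.weight', 'output.bias']
```

In practice the frozen encoder would be a real pretrained network and the fine-tuning loop would run on the target-language audio, but the freeze-and-replace pattern is the same.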


Hi @CatherineBreslin!

Super excited to see you in the conference!

My question is:

At our company, we are trying to integrate our chatbot with voice. Currently, we use only the Chrome browser for taking voice input because it has some built-in support for speech-to-text. But we are struggling with inputs that contain some kind of code.

For example, the user might say: “My employee code is A6B798”. We are facing difficulty getting the text out of speech for the ‘A6B798’ part. The built-in speech-to-text library either puts spaces inside the code or does not get it right at all.

So, I wanted to ask, what would you suggest we do in this case? Should we build our own model?

Also, I’m from India, so there could be a difference in the accent :sweat_smile: of the language that people speak.


Hi @JulianGerhard :grinning:

To answer the question about training I polled a few of my colleagues at Cobalt Speech for their experiences.

There are two main approaches to training, which are called “hybrid” and “end-to-end”. DeepSpeech is an example of the end-to-end paradigm.

Hybrid systems combine an acoustic model, a lexicon, and a language model, and I’ve written about this approach before. They require some in-depth expertise to train, and they also require a lexicon (which tells you how each word in your language is pronounced phonetically). However, they can give reasonably good results with small amounts of audio data - even a few tens of hours.

End-to-end models take audio as input and output characters, words, or subwords. They have no need of a lexicon, though they’re typically paired with an external language model. These may be easier to train in principle, being a single model, but they usually need larger amounts of audio data - perhaps 10,000+ hours - to get good performance.

Hence, the best training approach depends very much on what sort of data you have and what your task is.
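The answer above mentions that end-to-end models are typically paired with an external language model - a model that scores how likely a word sequence is. A minimal sketch of the idea, using a toy bigram model with add-one smoothing over an invented three-sentence corpus:

```python
from collections import defaultdict
import math

# Toy text corpus for the language model (illustrative only).
CORPUS = [
    "turn on the lights",
    "turn off the lights",
    "turn on the radio",
]

# Count bigrams and the contexts they follow.
bigrams = defaultdict(int)
contexts = defaultdict(int)
for sentence in CORPUS:
    words = ["<s>"] + sentence.split()
    for prev, cur in zip(words, words[1:]):
        bigrams[(prev, cur)] += 1
        contexts[prev] += 1

VOCAB = {w for s in CORPUS for w in s.split()} | {"<s>"}

def log_prob(sentence):
    """Log-probability of a sentence under a bigram model with
    add-one smoothing, so unseen bigrams still get a small probability."""
    words = ["<s>"] + sentence.split()
    lp = 0.0
    for prev, cur in zip(words, words[1:]):
        lp += math.log((bigrams[(prev, cur)] + 1) /
                       (contexts[prev] + len(VOCAB)))
    return lp

# The model prefers word sequences it has seen before - this is what
# lets an ASR decoder favour fluent hypotheses over jumbled ones.
assert log_prob("turn on the lights") > log_prob("lights the on turn")
```

Real systems use far larger n-gram or neural language models, but the role is the same: rank competing recognition hypotheses by how plausible they are as word sequences.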


@ggdupont :wave:t2:

Until now, most of the research into speech recognition has focused on understanding the words that are spoken. That’s been a hard enough problem. But, as you rightly say, there’s more to what we say than just the words! Not just hesitations, but also the prosody is important - that’s how we say something. Linguists and conversational analysts have been interested in the non-verbal aspects of communication for a long time. As we move towards more interactions by voice, rather than just transcription, then we’ll want computers to understand more about how those non-verbal cues affect someone’s meaning.

We’re also interested in this topic not just for ASR, but also for Text-to-speech (TTS), because we want our devices to speak naturally.

I recently attended ICASSP, which is one of the largest conferences covering speech technology. Professor Mari Ostendorf gave one of the keynotes there about her research on exactly this topic - how we should be understanding more of the non-verbal content of speech.


@mishra-atul5001 @martinavalogia :wave:t2:

In the past few years, an entire industry has sprung up around conversational design. It’s not always an easy problem to solve! Also, the terminology isn’t entirely consistent between vendors, which sometimes makes it hard to transfer knowledge.

The specific answers to your questions depend on the platforms you’re using. But roughly, we model language as intents (which I’ve sometimes seen called actions) and slots (also sometimes called entities or concepts). The intent is the meaning behind someone’s request, and the job of the computer is to decide which intent a user’s utterance was expressing. If you have two intents which are really difficult to tell apart, then the computer will also have trouble distinguishing them. Sometimes you try a particular design of intents, but when you try it out in the real world it doesn’t work well, so you have to go back to the drawing board, look closely at all the user utterances, and redesign your system to work better for your real usage data.
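To illustrate the intent-classification step described above - deciding which intent an utterance expresses - here is a deliberately tiny sketch using bag-of-words similarity against example utterances. The intent names and examples are made up; real systems (Rasa NLU included) use trained models rather than nearest-example matching:

```python
from collections import Counter
import math

# Toy training data: each intent has a few example utterances.
EXAMPLES = {
    "check_balance": ["what is my balance", "show me my account balance"],
    "transfer_money": ["send money to alice", "transfer 50 dollars to bob"],
    "greet": ["hello there", "hi how are you"],
}

def bow(text):
    """Bag-of-words vector: a Counter of lowercase tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def classify(utterance):
    """Return the intent whose examples best match the utterance."""
    vec = bow(utterance)
    scores = {
        intent: max(cosine(vec, bow(ex)) for ex in examples)
        for intent, examples in EXAMPLES.items()
    }
    return max(scores, key=scores.get)

print(classify("please transfer money to carol"))  # transfer_money
```

It also shows why near-identical intents cause trouble: if two intents share most of their example wording, their similarity scores end up almost equal, and the classifier's choice becomes arbitrary.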


Hi @saurabh-m523 :wave:t2:

Voice recognition of things like “A6B798” is often tricky for speech recognition systems! The problem is that the language model of a speech recognition system - the model that tells you which word sequences are likely - isn’t much help here. This is a place where a custom model might help, because you can constrain it with knowledge about exactly which employee codes are valid. We’ve built similar ASR models to cope with specialised language and seen them perform well.
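Short of building a custom model, one lightweight mitigation is to post-process the recognizer's output against the known code format. A sketch, assuming a hypothetical letter-digit-letter-three-digits pattern like “A6B798” (adjust the regex to your real codes):

```python
import re

# Hypothetical employee-code format: letter, digit, letter, three digits.
CODE_PATTERN = re.compile(r"[A-Z]\d[A-Z]\d{3}")

def recover_code(asr_text):
    """Collapse the spaces an off-the-shelf recognizer tends to insert
    (e.g. "a 6 b 798") and validate against the known code format.
    Returns the cleaned code, or None if it doesn't fit the pattern."""
    candidate = asr_text.replace(" ", "").upper()
    return candidate if CODE_PATTERN.fullmatch(candidate) else None

print(recover_code("a 6 b 798"))  # A6B798
print(recover_code("a six b"))    # None
```

This only repairs spacing and casing - it can't fix a digit the recognizer actually misheard - but rejecting outputs that don't match the format at least lets you re-prompt the user instead of storing a wrong code.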

And, as you point out, the accent of your speakers may not match the accent the ASR system was trained to recognise - but I can’t say for sure that this is the issue without diving deeper!


Thank you so much to @CatherineBreslin for sharing her expertise, and thank you to the community for your questions! That’s a wrap for today’s AMA.

You can catch Dr. Breslin’s talk at the L3-AI conference on June 18. Register here to get your free ticket.

And be sure to check out our other upcoming AMAs: