Rasa train taking a lot of time

Hi, I am using Rasa version 1.10. We have 215 intents (including smalltalk) and around 48,000 examples in total. When running rasa train, the Core part trains in much less time than the NLU part. It seems the DIETClassifier is the culprit here: the following step has been running for almost 8 hours (32 GB RAM, 8 CPUs).

```
2020-07-15 10:23:13 INFO     rasa.nlu.model  - Starting to train component DIETClassifier

Epochs: 59%|███████████████████████ | 59/100 [4:59:41<2:41:11, 235.90s/it, t_loss=1.988, i_loss=0.030, entity_loss=0.000, i_acc=0.997, entity_f1=0.793]
```

It is becoming really difficult to work with that kind of training time. Please help. Here are the details of my config file.

```yaml
# Configuration for Rasa NLU.
# Components
language: en
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
```

```yaml
# Configuration for Rasa Core.
# Policies
policies:
  - name: MemoizationPolicy
  - name: TEDPolicy
    max_history: 5
    epochs: 100
  - name: MappingPolicy
  - name: TwoStageFallbackPolicy
    nlu_threshold: 0.3
    core_threshold: 0.3
    fallback_core_action_name: "action_default_fallback"
    fallback_nlu_action_name: "action_default_fallback"
    deny_suggestion_intent_name: "out_of_scope"
  - name: FormPolicy
```

Need your help

Thanks

That could well be right. What type of machine are you training on? In previous versions of Rasa (< 1.6), I noticed that cloud computers are a lot slower than a dedicated machine with a similar configuration, due to storage speed, I think. Training appears to be a very disk-intensive activity.

There is some fine-tuning that can be done in the configuration file: our bot is mostly “question and answer”, so our max_history is 1, which makes training a lot faster.
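For example, the TED policy entry would look something like this (just a sketch of that one setting; the other policies stay as they are):

```yaml
policies:
  - name: TEDPolicy
    max_history: 1   # a Q&A bot rarely needs long dialogue history
```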

How many intents do you have (with 48 000 examples)? What language are you training for?

@samscudder Thanks for the prompt reply. As I said, we have around 215 intents, each with between 200 and 300 examples, so the total number of examples is approximately 48,000.

My machine is a virtual machine (VM) running Windows Server 2016.

I am training for English only. Again, I would reiterate that the Core part trains quickly, whereas the NLU part takes most of the time (7 to 8 hours), specifically the DIETClassifier component.

Wow… that seems like a huge number of examples for each intent. Any specific reason you need that many?

I have 5-20 per intent. Our biggest NLU model has around 570 intents and 5000 examples overall.

OK… my understanding is that if the number of intents grows, we need to increase the number of examples as well so that model confidence is boosted. Please correct me if I am wrong.

Now I come to a very basic question: how many examples are enough per intent for NLU training? Is there an ideal examples-per-intent ratio?

Certainly, reducing the number of examples will improve speed.

But do you really think that, even for this dataset size, it should take 8 hours to train?

Please suggest any modifications needed to reduce training time.

@samscudder, I am interested in this as well. So on average you are using around 9 examples per intent? How specific are your intents, and how accurate is your bot with that approach?

I am under the same impression that the more examples for each intent, the more accurate it will be. I have about 45 intents with around 3400 examples total, around 75 examples per intent.

We are achieving around 95% accuracy, in Portuguese using Spacy.

There are only so many ways you can say something, even in Portuguese :slightly_smiling_face:. You don’t need to keep repeating the same phrase over and over again to increase accuracy. Can you post one of the intents with its sample phrases so I can have a look?

@ASINGH, you don’t need to double the number of examples if you double the number of intents.

I can’t fathom why you’d need 200-300 examples for a single intent. Something is not right there.

@samscudder… I get your point… this is a Chatito-generated dataset. Frankly speaking, I have no explanation for it; it was not a thoughtful decision to use that many examples. The assumption was: the more intents, the more examples.

So is there a benchmark for the number of examples per intent?

Can you please point me to an article/blog on this?

I’ll try reducing the number of examples and see how the training time changes.

I hope my NLU configuration in the config file doesn’t require modification.

@samscudder, if you don’t mind, can you share maybe two intents that are similar but are still classified successfully?

You mentioned using Spacy; is this how you can get away with so few examples per intent?

This is going to be long… but I hope this helps.

First of all, I’m assuming you don’t have anything in your phrases that should be treated as an entity. If you have examples like this:

```md
## intent: order
- I want a loaf of bread
- I want a cake
- I want a glass of water
- I want some butter
- I want some sugar
```
I would change it to entities, and the variations would be “I want xxx”, “I’d like yyy”, “Could you get me zzz”, “Can I order aaaa”, with the xxx, yyy, zzz, aaa etc… all being treated as an entity. I don’t need 4x4=16 examples. I can get by with 4. Like I mentioned, there are only so many ways you can order something. You can order several different items, but the intent is the same.
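For illustration (a sketch only, using a hypothetical entity name `item`), the same intent rewritten with entity annotations in Rasa’s Markdown NLU format would look something like:

```md
## intent: order
- I want [a loaf of bread](item)
- I'd like [a cake](item)
- Could you get me [a glass of water](item)
- Can I order [some sugar](item)
```

Every phrasing still maps to the single `order` intent; the item is picked up as an entity instead of being baked into extra examples.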

On that note, I haven’t seen any blog posts that say “you need X examples”, but a good place to start is to look at the demo files here: rasa/examples at master · RasaHQ/rasa · GitHub

Explaining a little about accuracy… by 95%, I meant that our chatbot responds correctly to 95% of the test phrases we throw at it (the other 5% triggers a fallback, and is acceptable to our customer). The average accuracy over all the sample phrases is around 88%.

You don’t need to have 100% confidence on each intent. We use a threshold of 0.6, so anything under 0.6 drops us into our fallback action. We started off at 0.8, then tested with 0.7 and finally with 0.6. Yours is set at 0.3, which I feel is a bit low: at that level you’ll probably let through some misclassifications that should have triggered a fallback instead.
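In your policies, that change would just be the threshold line (a sketch; keep the other fallback settings you already have):

```yaml
policies:
  - name: TwoStageFallbackPolicy
    nlu_threshold: 0.6   # anything classified below this confidence falls back
```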

The top ten intents for a test phrase thrown at Rasa can be found using the API (check out the parse function here: HTTP API). This gives you a JSON document with a ranking of the top 10 intent classifications. The confidences of all the ranked intents sum to 100%, correct? Testing all the intents, I found that if the top intent comes back with a confidence of 0.6, the second in the ranking is around 0.2 and the third around 0.1. A top confidence of 90% or higher leaves the second and third at around 1%, so they are a world away. With a confusing phrase and a threshold of 0.3, you can get more than one intent above the threshold, and it’s easier to end up with the wrong intent.
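Here’s a minimal sketch of that kind of check, assuming a locally running Rasa server started with the API enabled (`rasa run --enable-api`) and a made-up list of test phrases:

```python
import requests

# Hypothetical test phrases; in practice, load them from your test set.
test_phrases = [
    "I want a loaf of bread",
    "Can I order some butter",
]

PARSE_URL = "http://localhost:5005/model/parse"  # default host/port, adjust as needed

for phrase in test_phrases:
    result = requests.post(PARSE_URL, json={"text": phrase}).json()
    # "intent_ranking" holds the ranked intent classifications with their confidences.
    top3 = result.get("intent_ranking", [])[:3]
    ranking = ", ".join(f"{i['name']}={i['confidence']:.2f}" for i in top3)
    print(f"{phrase!r} -> {ranking}")
```

Dumping those rankings into a spreadsheet is essentially what the analysis below is built on.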

After you have some idea of the ranking, you can see which are the closest intents, and what examples need tweaking.

Like I mentioned, without taking a look at an intent to see what you’re doing with your examples, my diagnosis is limited to my own experience. But at the moment it looks like you’re basically trying to brute-force the chatbot, assuming that a ton of examples will give you a good model, and just piling them into your NLU data, making it work harder. And don’t forget: a model trained with 200-300 examples per intent is also more compute-intensive when classifying!

I don’t think you need so many. I would back up your training data, take 50 intents, leave only, say, 30 significant examples of each one, train this, and make a report showing where there’s confusion (see below). Figure out what is causing the confusion and tweak the examples, then add another 50 intents, and repeat again and again until you have all intents back in your training data. This will be a lot of work, but you’ll end up with a better model.

Here’s a bit of our spreadsheet (I hope it’s OK to link to an external image here), so you can see what I analyze. I had to hide the names of the intents because of confidentiality issues with our customer, but each line is a test phrase thrown at Rasa. Column E is the confidence. Column F is a go/no-go (OK/RUIM). If it’s the correct intent and over 0.7, I mark it in green. If it’s over 0.6 and less than 0.7, I mark it yellow, as it could slip under the threshold if I’m careless with examples from other intents. Under 0.6 it’s marked in red. Columns K, M and O are the top three intent classifications for the test phrase (hidden columns J, L and N have the intent names so I know where the problem is).

If I see a phrase that needs improving, I look at all the examples of the intent that’s causing confusion, and at which phrases are too similar. I then add a couple of examples that emphasize specific words that can differentiate the two intents. A single example can impact all the intents (even if it’s by 0.00001%). That’s why I mark some in yellow: they’re the ones I need to keep an eye on. If an intent is at 90%, it’s unlikely that I mess it up unless the exact example is in another intent.

Sometimes, looking at the expected response from the chatbot, we propose to our customer that two intents be merged, and if we can do that, we just add the examples from one to the other (we have a few intents with 30 examples, but none over that).

Why don’t I use the rasa test report? Because it only shows wrongly classified intents, not whether the correct intent came in below the threshold, and I can’t tell whether second place in the ranking is at nearly the same confidence or irrelevant.

Cheers


@samscudder… Thanks for your patience :slightly_smiling_face: Wonderful explanation. Actually, I don’t need an article now.

I’ll certainly go through the steps you have described. I’m already using the HTTP API you mentioned, but hadn’t thought of using it like this.

A humble request: for the benefit of the community, including myself, can you please share the sample code to generate the kind of Excel sheet you have shared?

Thanks & Regards

I re-wrote it and made up a blog post about it. Read it here: Better report for Rasa chatbot model analysis


@samscudder… What a gem you are… Your suggestions worked for me. I have reduced the total number of examples from 48,000 to approximately 9,000 for 215 intents, and training time has come down to 90 minutes. I’ll see if there is further scope to reduce examples; so far I am seeing no drop in the chatbot’s performance.

The command below is a wonderful one that I had ignored for a long time. It gives you clear insight into the accuracy of your model. It helped a lot.

`rasa test`

There are many posts around the internet about creating chatbots with Rasa, but very few related to analyzing a model’s performance and accuracy. Your post will go a long way: Better report for Rasa chatbot model

Thanks

@samscudder Thanks for this answer and your subsequent blog post. I am having the same problem @ASINGH was hitting up against. From a practical point of view, fewer examples would get the job done, but the engagement gain from minimizing fallbacks is huge over time. Fallbacks really can undermine a bot’s ability to hook the user during a first visit.

Of course, having said that, my almost new MBP was huffing and puffing just a little too hard for my liking.

Thanks again, Simon

Hello @samscudder,

Do you have an idea of the expected training time for 100 intents? I have 100 intents with 30 training utterances each.

Training 5 intents with 30 utterances each takes 10 minutes. We have a lot of utterances with lookup values. Earlier, with Rasa 1.7.1, it used to take a minute or less. The other day we tried training all 100 intents and it didn’t finish in 10 hours, so we had to abort it.

As we know, the DIET classifier does two things, entity recognition and intent classification, and so it takes time. So we thought of setting entity recognition to false and using CRFEntityExtractor instead. Is this a good idea?

It takes 3 minutes if DIET’s entity recognition is set to false and CRFEntityExtractor is used.
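For anyone following along, the pipeline change described here would look roughly like this (a sketch; other components unchanged):

```yaml
pipeline:
  # ... tokenizers and featurizers as before ...
  - name: CRFEntityExtractor       # handles entity extraction
  - name: DIETClassifier
    epochs: 100
    entity_recognition: False      # DIET now only does intent classification
```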

Please let me know your thoughts.

@akelad - can you provide your inputs on this please?

Hello @dakshvar22

Here you go.

@mpbalshetwar When you say you have utterances with a lot of lookup values, can you please be more specific about what you mean? Do you mean you have large lookup tables and many tokens in the utterances come from the values of those lookup tables? It would help if you could give an example as well.

This is for @dakshvar22