Rasa nlu train with a large dataset is stuck

I am training Rasa NLU on a large 3 GB training file in the markdown format. The training has been running for 12 hours, but it appears to be stuck and has consumed so many resources that the machine no longer accepts new SSH connections. Are there considerations I should keep in mind when training on such large files, or are there options for rasa train nlu to use multiple CPU cores or the GPU?

Hi Kamyar,

That’s a huge dataset. I’m curious how many intents and utterances are in your dataset and how you generated it.

Either way, that dataset is too big.


Hi Greg, and thank you for the response. I am using Rasa NLU for sequence tagging and intent extraction for address geocoding. The intents are the place types: for each map place in a country I have generated 2 or 3 sentences (similar sentence types for each place), each of which is an address. There are about 20 intents, like restaurant, shopping center, shop, street, etc., and sentence tags such as neighborhood, city, etc. As a result, the .md file has grown to about 3 GB. I have also set the pipeline to use TensorFlow embeddings.

So I was wondering how I can configure the training phase to reduce the epochs from 300 to something like 10, and maybe use a GPU for training, so that the system does not get stuck while loading the data. I don’t know what the problem is with training NLU on large data. By the way, I haven’t used lookup tables yet, as I need the model to learn to handle misspellings too; with lookups, the tables would be rather large, but the sentences would reduce to those 2-3 types.

So for a small number of epochs it trains fine? What version are you using?

Yeah, actually I have set the TF epochs to 10 now, but the initial loading of the 3 GB file makes it get stuck. The Rasa version is 1.8.

So it cannot start training?

It seems to get stuck in the data loading phase. I run the training in a Linux screen session on a machine that I reach over SSH. Some time after I start the training, I lose the SSH connection and have to restart the machine. When I check the screen logs afterwards, I see no log output at all; it seems the training did not even start before the system hung and we had to reboot it.

Well… 3 GB is a lot of text data. It might run out of memory while loading or featurizing it.

The machine has 32 GB of RAM and 8 fast CPU cores, and it is not a weak system at all, with or without a GPU. So, do you have a suggestion for working with such large data when training an NLU model for sequence labeling and intent extraction with Rasa?

For such a huge amount of data, it would have to be loaded into memory in batches, meaning the whole Rasa NLU pipeline would need to be updated.

What is this data? Is it generated? I would try with a smaller amount of real data first.

Is batch loading being developed, or should I wait, or maybe contribute?

Yeah, the data is generated for each place from defined tagged sentences in the format Rasa NLU accepts. With about 100 MB of our data it works fine, but with a large file it does not work.

We are not currently working on online loading of data. I’d recommend reducing the amount of generated data to what fits in the memory of your machine.
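
One way to cut generated data down to a size that fits in memory is to keep a random sample of examples per intent. Below is a minimal sketch, assuming the file uses the `## intent:<name>` section headers and `- example` lines of the Rasa markdown format; the function names `downsample_examples` and `to_markdown` and the `max_per_intent` parameter are hypothetical, not part of Rasa. It reads the input line by line, so the full file never has to be held in memory.

```python
import random
from collections import defaultdict


def downsample_examples(lines, max_per_intent, seed=0):
    """Keep a uniform random sample of at most max_per_intent examples per intent."""
    rng = random.Random(seed)
    kept = defaultdict(list)  # intent name -> sampled example lines
    seen = defaultdict(int)   # intent name -> examples seen so far
    intent = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("## "):
            # Only "## intent:<name>" sections are sampled. Other sections
            # (synonym, regex, lookup) are skipped in this sketch.
            if line.startswith("## intent:"):
                intent = line[len("## intent:"):]
            else:
                intent = None
        elif intent is not None and line.startswith("- "):
            seen[intent] += 1
            if len(kept[intent]) < max_per_intent:
                kept[intent].append(line)
            else:
                # Reservoir sampling: the new example replaces a kept one
                # with probability max_per_intent / seen[intent].
                j = rng.randrange(seen[intent])
                if j < max_per_intent:
                    kept[intent][j] = line
    return dict(kept)


def to_markdown(kept):
    """Render the sampled examples back into markdown sections."""
    parts = []
    for intent, examples in kept.items():
        parts.append("## intent:" + intent)
        parts.extend(examples)
        parts.append("")
    return "\n".join(parts)
```

Passing an open file object as `lines` streams the data, for example `downsample_examples(open("nlu.md", encoding="utf-8"), 50000)`, where both the file name and the limit are placeholders to adjust.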

Do you know the estimated machine configuration for about 3 GB of data with about half a dozen tags and 15-20 intents? Or maybe for 1 GB or so? Should I test each of them?

Hi @KamyarGhajar

I have trained my model on around 5 GB of markdown data with around 500 different intents. I did it when rasa-nlu and rasa-core were separate, i.e. on rasa-nlu version 0.15, so I am sharing my experience of how I did it with that amount of data:

First, split the data into separate files of a size that your system memory can handle during training, and train on these files sequentially (don’t train in parallel, as that takes the same memory as the complete data file). Keep all the models in a single directory.
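
The splitting step could look roughly like the sketch below. It assumes the Rasa markdown layout of `## ...` section headers followed by `- example` lines; the function name `split_nlu_markdown`, the `part_0000.md` naming, and the `max_bytes` parameter are my own choices for illustration. The source file is streamed line by line, and the current section header is repeated at the top of each new part so that examples keep their intent.

```python
import os


def split_nlu_markdown(src_path, out_dir, max_bytes):
    """Stream src_path into part files of roughly max_bytes each."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    out = None
    written = 0
    header = None  # most recent "## ..." section header line
    with open(src_path, encoding="utf-8") as src:
        for line in src:
            is_header = line.startswith("## ")
            if is_header:
                header = line
            # Only start a new part at a header or an example line.
            can_split = is_header or line.startswith("- ")
            if out is None or (written >= max_bytes and can_split):
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, "part_%04d.md" % len(paths))
                paths.append(path)
                out = open(path, "w", encoding="utf-8")
                written = 0
                # Repeat the active header so the examples stay labeled.
                if header is not None and not is_header:
                    out.write(header)
                    written += len(header.encode("utf-8"))
            out.write(line)
            written += len(line.encode("utf-8"))
    if out is not None:
        out.close()
    return paths
```

Note that if the source file is grouped by intent, each part will cover only a few intents, so it may be worth shuffling or interleaving the examples before splitting.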

Now load every model from the model directory and make your prediction.

  1. You may have to monkey patch the Rasa model loading code. I did it on version 0.15, but I am not sure whether you have to do it in the latest Rasa version.
  2. Your prediction accuracy will be lower than that of a single model. I don’t know why, but the accuracy will still be good; just tune the hyperparameters for the best accuracy.
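
Combining the predictions of several models could be sketched as follows. This is only an illustration under assumptions: each model is represented by a plain callable that takes a text and returns a dict with an `intent` entry holding `name` and `confidence`, which is the general shape of Rasa NLU parse results. How the callables are obtained from the loaded models depends on the Rasa version and is not shown; `best_prediction` is a hypothetical helper name.

```python
def best_prediction(parsers, text):
    """Run text through every parser and return the most confident result."""
    best = None
    for parse in parsers:
        result = parse(text)
        confidence = result["intent"]["confidence"]
        # Keep the result whose top intent has the highest confidence so far.
        if best is None or confidence > best["intent"]["confidence"]:
            best = result
    return best
```

Picking the maximum confidence is just one possible rule. Confidences from separately trained models are not necessarily comparable with each other, which may be one reason for the lower accuracy mentioned in point 2.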

Thank you very much for your most informative response, Abhishak. I will try your way too, then. The major problem here is that you say I should use a previous version of NLU. Are you sure it won’t work in the new versions?

No, I never said it won’t work on the latest version; I said I used it on a previous version. I haven’t tested it on the latest version, but I think it will work there too. The Rasa developers are great coders and have modified many things, but I think you may have to monkey patch the load_model function according to your needs. If you need any help, just ping me anytime; I’ll be happy to help.
Best of luck for your work.
Enjoy & Happy Coding

Oh, I see. Thanks again Abhishak. :ok_hand:t2::wink::rose:

Happy to help @KamyarGhajar :blush: