Rasa Training Issue: Large Dataset on Tesla V100S

I hope this message finds you well. I am currently facing a challenge related to memory usage and featurizer selection in my Rasa NLU pipeline, and I am reaching out to seek your valuable insights and expertise.

I have a substantial dataset comprising 19 intents, each with 30,000 examples, totaling 570,000 training examples. My system is equipped with 45 GB of RAM and a 32 GB Tesla V100S GPU. Despite these resources, I am encountering out-of-memory (OOM) issues during training, particularly just after the DIETClassifier component starts.

Additionally, my vocabulary size is substantial, with 2,185,209 vocabulary items created for the text attribute during training. From the training logs: `rasa.nlu.featurizers.sparse_featurizer.count_vectors_featurizer - 2185209 vocabulary items were created for text attribute`. I have experimented with various featurizers, including CountVectorsFeaturizer, SpacyFeaturizer, ConveRTFeaturizer, and others, but I am struggling to find the right balance between memory efficiency and feature richness.
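
For example, the direction I have been considering for the CountVectorsFeaturizer is sketched below; the `min_df` / `max_features` values are illustrative guesses on my part, not tuned recommendations (these options are passed through to scikit-learn's CountVectorizer):

```yaml
pipeline:
  - name: WhitespaceTokenizer
  - name: CountVectorsFeaturizer
    min_df: 5              # ignore tokens appearing in fewer than 5 training examples
    max_features: 50000    # hard cap on the vocabulary size
  - name: CountVectorsFeaturizer
    analyzer: char_wb      # character n-grams within word boundaries
    min_ngram: 1
    max_ngram: 4
```

My thinking is that, at this scale, rare tokens add little signal while dominating the sparse feature dimension, though I am unsure how much accuracy this trades away.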

I am seeking advice on the following aspects:

  1. Memory Optimization: Are there specific techniques or configurations I can implement to reduce memory usage during training? I have already experimented with batch sizes and model complexity; a sketch of what I have tried is included after this list.
  2. Featurizer Selection: Considering the complexity of my dataset, which featurizer(s) would be most suitable to extract informative features while optimizing memory usage? I have explored options like SpacyFeaturizer, CountVectorsFeaturizer, among others.
  3. Vocabulary Size Management: My substantial vocabulary size is a significant contributor to memory usage. Are there effective strategies to manage vocabulary size without compromising the quality of the model?
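
To make the first point concrete, the DIETClassifier settings I have been experimenting with are sketched below; the numbers are illustrative starting points, not tuned values:

```yaml
  - name: DIETClassifier
    epochs: 50                        # down from the default 300
    batch_size: [32, 128]             # linearly increasing batch size
    number_of_transformer_layers: 1   # default is 2
    transformer_size: 128             # default is 256
    entity_recognition: false         # skip if only intent classification is needed
```

These are the knobs I meant by "batch sizes and model complexity" above.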

I would greatly appreciate any advice, best practices, or experiences you can share regarding similar challenges you have faced or solutions you have implemented. Your expertise will be immensely helpful in guiding me toward an effective resolution.

Thank you in advance for your time and assistance. I am looking forward to learning from your experiences and insights. I’ve also attached the pipeline configurations!

Best regards

I would look at reducing the number of examples per intent from 30,000 to 100-200.

Thank you @stephens for your quick response and valuable suggestion. I tried reducing the number of examples per intent to 10,000 and 20,000, and training succeeded. However, I still ran into trouble when attempting to train the model with a higher number of examples per intent.

I’m curious if there are specific strategies or best practices you recommend when dealing with large datasets in Rasa.

The reason is that my data is growing every day, so capping the dataset might affect the accuracy of the model.

Are there any techniques for optimizing memory usage, especially when the number of examples per intent needs to be substantial? Any insights you can provide regarding training such models would be immensely helpful.

Thank you once again for your assistance. I truly appreciate your expertise and willingness to share your knowledge.

With only 19 intents, I would test the assumption that you need this many examples for accuracy.

Yes @stephens, as the examples grow, the number of intents will also increase. Currently I am at 19, but soon there might be 50-60 or more intents, each with their own examples.

Still having the issue. @stephens any suggestions?

If you share your repo, I'll take a look.