Rasa NLU intent recognition models in Portuguese

Hi, so I am trying the DIETClassifier model with help from this guide (https://towardsdatascience.com/how-do-chatbots-understand-87227f9f96a7), but I found that it doesn't work well with Portuguese. I basically took the guide's model, changed the language from en to pt, and inserted my own test and train data, but now I have very low accuracy (15-21% at most). Is there anything else I would need to do to make my classifier models work well for Portuguese? I also don't think it's the training data, because I copy-pasted the guide's example data (which in English had close to 100% accuracy), translated it to pt, and still got a maximum of 21% accuracy.

Here's my config file:

```yaml
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    entity_recognition: false
    intent_classification: true
    epochs: 100
  - name: classifier2.LRIntentClassifier
    max_iter: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 200
    constrain_similarities: true
    entity_recognition: false
  - name: FallbackClassifier
    threshold: 0.7
    ambiguity_threshold: 0.1
```
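
For context, my training data is in the standard Rasa NLU YAML format. A shortened, made-up sample of the kind of phrases I have (the greeting/goodbye intents here are just illustrative, not my real data):

```yaml
version: "3.1"   # or "2.0", depending on the Rasa version
nlu:
  - intent: saudacao    # "greeting"
    examples: |
      - olá
      - bom dia
      - boa tarde, tudo bem?
  - intent: despedida   # "goodbye"
    examples: |
      - tchau
      - até logo
      - tenho que ir, até amanhã
```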

Rasa is language agnostic. That said, how many examples do you have in each intent?


Hi, on my own dataset I have 4 intents and a total of 95 examples (ranging from 17 to 28 examples per intent). All the examples are short phrases, and none contain ~ (tilde) or ç (Portuguese-specific characters). That dataset gives me only 15% accuracy.

Then I translated the small English dataset provided by the guide, which has 3 intents and 23 examples (4-11 examples per intent). That smaller dataset gave me 21% accuracy in PT but close to 100% in EN.

The only change I am making for Portuguese, besides the dataset, is in the config file: language: pt
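
i.e. the top of my config.yml now looks like this, with the same pipeline as above:

```yaml
language: pt

pipeline:
  - name: WhitespaceTokenizer
  # ... rest of the pipeline shown above
```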

I heard I could use a spaCy language model to try to get better PT performance, something like this:

```yaml
pipeline:
  - name: SpacyNLP            # use spaCy, as it supports Portuguese
    model: "pt_core_news_sm"  # Portuguese language model
  - name: SpacyTokenizer      # tokenizer that uses the spaCy model's tokenization
  - name: SpacyFeaturizer     # featurizer that uses word vectors from spaCy
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
```

or alternatively using pt_core_news_lg. But that bypasses the guide's Python code that saves the model, and I don't know exactly how that would work. How would I save this model to use later?
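
If I understand the docs right, that pipeline would also still need an intent classifier at the end (the spaCy components only tokenize and featurize), and the spaCy model has to be installed first with `python -m spacy download pt_core_news_sm`. My guess at a complete config would be something like this (untested, with the epochs and n-gram settings just copied over from my current config):

```yaml
language: pt

pipeline:
  - name: SpacyNLP
    model: "pt_core_news_sm"   # or pt_core_news_lg for larger word vectors
  - name: SpacyTokenizer
  - name: SpacyFeaturizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier       # still needed: the components above only featurize
    epochs: 100
```

And as far as I can tell, `rasa train nlu` saves the trained model as a .tar.gz under models/ either way, which can be loaded later with `rasa shell nlu --model models/<file>.tar.gz`, so maybe that replaces the guide's Python saving code?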

Since I am quite new to NLP, would you mind helping me? Maybe we can talk on Telegram or Discord.