NLU only best practices

We are using the NLU feature of Rasa to train intents with data from our messaging system, and I would like to know the best practices for what to store as valid training data. We are using the supervised_embeddings pipeline.

Specifically:

  1. What is the maximum recommended character length for any one training example? Are long training examples discouraged?
  2. Does the Python API allow for testing against strings of any length (e.g. interpreter.parse(long_text))?
  3. Is it recommended to filter out certain characters or text such as URLs, hashes, etc.? We are training on raw message data from our system, which generally contains HTML and URLs in the message body.

Any documentation on these limits/recommendations would really help.

I would not worry about the length of the text right away. As for the URLs, HTML content, etc., they could be valuable cues for the NLU model, but preprocessing them and replacing them with normalized tokens is a good idea to start with.
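As a starting point, a preprocessing step like the one suggested above could look something like this. This is only a sketch using the standard library, not anything Rasa provides; the function name `normalize_message` and the placeholder tokens `__url__` and `__hash__` are my own choices, and you would tune the patterns to your actual message data:

```python
import re
from html import unescape

# Hypothetical normalization step applied to raw messages before they
# are stored as NLU training examples (or passed to interpreter.parse).
URL_RE = re.compile(r"https?://\S+")        # bare URLs
HASH_RE = re.compile(r"\b[0-9a-f]{32,}\b")  # long hex hashes (md5/sha1/...)
TAG_RE = re.compile(r"<[^>]+>")             # HTML tags

def normalize_message(raw: str) -> str:
    text = unescape(raw)                  # decode entities like &amp;
    text = TAG_RE.sub(" ", text)          # drop HTML markup
    text = URL_RE.sub("__url__", text)    # normalize URLs to one token
    text = HASH_RE.sub("__hash__", text)  # normalize hashes to one token
    return re.sub(r"\s+", " ", text).strip()

print(normalize_message("See https://example.com/a?b=1 now"))
```

Replacing rather than deleting the URLs/hashes keeps the signal that "this message contained a link" available to the model while preventing the pipeline from memorizing one-off tokens.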

You could first try it like this and see how it goes; it would then be easier to examine what is going wrong if something is not working very well :slight_smile: .
