NLU only best practices

We are using the NLU feature of Rasa to train intents with data from our messaging system, and I would like to know the best practices for what to store as valid training data. We are using the supervised_embeddings pipeline.

Specifically:

  1. What is the maximum recommended character length for any one training example? Are long training examples discouraged?
  2. Does the Python API allow for testing against strings of any length (e.g. interpreter.parse(long_text))?
  3. Is it recommended to filter out certain characters or text such as URLs, hashes, etc.? We are training on raw message data from our system, which generally contains HTML and URLs in the message body.

Any documentation on these limits/recommendations would really help.

I would not worry about the length of the text right away. As for the URLs, HTML content, etc., they could be valuable cues for the NLU model, but preprocessing them and replacing them with normalized tokens is a good idea to start with.
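As a starting point, a preprocessing step like the one suggested above could look something like this. This is only a sketch using the standard library, not anything Rasa provides; the function name `normalize_message` and the placeholder tokens `__url__` and `__hash__` are my own choices, and you would tune the patterns to your actual message data:

```python
import re
from html import unescape

# Hypothetical normalization step applied to raw messages before they
# are stored as NLU training examples (or passed to interpreter.parse).
URL_RE = re.compile(r"https?://\S+")        # bare URLs
HASH_RE = re.compile(r"\b[0-9a-f]{32,}\b")  # long hex hashes (md5/sha1/...)
TAG_RE = re.compile(r"<[^>]+>")             # HTML tags

def normalize_message(raw: str) -> str:
    text = unescape(raw)                  # decode entities like &amp;
    text = TAG_RE.sub(" ", text)          # drop HTML markup
    text = URL_RE.sub("__url__", text)    # normalize URLs to one token
    text = HASH_RE.sub("__hash__", text)  # normalize hashes to one token
    return re.sub(r"\s+", " ", text).strip()

print(normalize_message("See https://example.com/a?b=1 now"))
```

Replacing rather than deleting the URLs/hashes keeps the signal that "this message contained a link" available to the model while preventing the pipeline from memorizing one-off tokens.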

You could first try it like this and see how it goes; it would then be easier to examine what is going wrong if something is not working very well :slight_smile: .
