Advice for creating a data set

datistiquo · August 15, 2018, 9:08pm

I think you have all had the same experience. If you have multiple intents and add examples to one intent to improve its recognition, the classification of the other intents might suffer afterwards.

It is a chicken-and-egg problem.

Sometimes sentences for different intents differ in only one word, for example "How much I get" versus "How I get…".

Any advice?

akelad · August 24, 2018, 1:28pm

Could you provide an example of intents where this would be the case?

datistiquo · August 24, 2018, 1:49pm

No, it is a general question. I start building the data set from an empty set and fill it sentence by sentence. I found that the quantity of examples also influences the classification, and I struggle to find the right number of examples per intent. Of course, you can also have intents whose examples share similar structure or words, as in the short example above.

Also, when you have a more common intent, you need to oversample that intent a little. Otherwise the misclassification rate is too high.

How does imbalance within one intent, for a specific sentence type (structure or count of specific words), influence the algorithm? Say you have one intent and add various sentences with different structures to it, but you have more examples of one type than of the others.

akelad · August 27, 2018, 11:02am

“In general” is always hard to answer.

It depends. If you really need to handle these differently, then create a new intent. But if you have something that can be distinguished by entities, then use entities. If you have a lot of intents that are hard to distinguish between, then you should probably reconsider your intent classes. The tensorflow embedding pipeline is better at handling these cases though.

As for more examples for an intent, you should try to keep the number of examples a little balanced, but it’s not a huge problem.

But really, you shouldn’t worry too much about these things in general, only when these issues actually arise.
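
To make "a little balanced" concrete, here is a minimal sketch that counts examples per intent and flags intents that dwarf the smallest one. The function name `intent_balance`, the `max_ratio` threshold of 3 and the example intents are all made up for illustration; nothing here is part of Rasa.

```python
from collections import Counter


def intent_balance(examples, max_ratio=3.0):
    """Return per-intent counts and the intents that look over-represented.

    `examples` is a list of (text, intent) pairs. An intent is flagged when
    it has more than `max_ratio` times as many examples as the smallest
    intent. The threshold is an arbitrary illustration, not a Rasa rule.
    """
    counts = Counter(intent for _, intent in examples)
    if not counts:
        return counts, []
    smallest = min(counts.values())
    flagged = sorted(i for i, n in counts.items() if n > max_ratio * smallest)
    return counts, flagged


# Hypothetical training examples for two near-identical intents.
examples = [
    ("How much do I get?", "ask_amount"),
    ("How much will I receive?", "ask_amount"),
    ("What amount do I get?", "ask_amount"),
    ("How much is it?", "ask_amount"),
    ("How do I get it?", "ask_process"),
]

counts, flagged = intent_balance(examples)
print(dict(counts))
print(flagged)
```

Running a check like this after each batch of new sentences shows when one intent starts to drift far ahead of the others.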

AnilLG · September 10, 2018, 10:54am

I have many sentences like,

Can I pay with PayPal?
Can I pay with Visa?

My entity name is payment_channels. Is it possible to write just one sentence to stand for many sentences that are identical except for the entity value?

akelad · September 12, 2018, 7:48am

Yeah you can do that
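
One way to read this, as a sketch only: keep a single template sentence and generate the variants from a list of values, annotated in the `[value](entity)` style of the Rasa NLU markdown training format. The helper `expand_template` and the `{payment_channels}` slot syntax are assumptions made for this example, not a Rasa feature.

```python
def expand_template(template, entity, values):
    """Fill the `{entity}` slot with each value, annotated as [value](entity)."""
    slot = "{" + entity + "}"
    return [template.replace(slot, f"[{value}]({entity})") for value in values]


lines = expand_template(
    "Can I pay with {payment_channels}?",
    "payment_channels",
    ["PayPal", "Visa"],
)
for line in lines:
    print("- " + line)
```

The printed lines can be pasted under the relevant intent in the training file.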

datistiquo · September 21, 2018, 3:04pm

Hey,

I need some tips for NER_CRF. I have 1000 sentences and a list of entities. I want to use them to train my NER, since intent classification works well so far. The way to go now would be to feed the sentences and the list of entities to chatito to generate training data. But you need to take care to balance the different sentence structures. It would not be good to throw all the sentences into the training data, duplicated with other entity values, right? Then the problem arises that you need to find all the different sentence structures/types (manually), take only one of each, and use it with different entity values. So you get, let’s say, 20 sentences of each type, just with different entities.

Would this be the right strategy?

Since this is manual labour, I thought you could use simple document classification, such as topic modelling, to filter out all the different sentence structures, so that I end up with maybe 50 different structures from my 1000 examples. Those 50 I can use to sample with entities…

@akelad
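
A much simpler alternative to topic modelling, sketched here under the assumption that the entity list is complete: replace every known entity value with a placeholder and group the sentences by the resulting skeleton. This is plain string matching, not topic modelling, and the names `skeleton` and `group_by_structure` are made up for this example.

```python
import re
from collections import defaultdict


def skeleton(sentence, entity_values, placeholder="{entity}"):
    """Replace known entity values with a placeholder and lowercase the rest."""
    out = sentence
    # Longest values first, so a multi-word value wins over a shorter one inside it.
    for value in sorted(entity_values, key=len, reverse=True):
        pattern = r"\b" + re.escape(value) + r"\b"
        out = re.sub(pattern, placeholder, out, flags=re.IGNORECASE)
    return out.lower()


def group_by_structure(sentences, entity_values):
    """Map each skeleton to the original sentences that share it."""
    groups = defaultdict(list)
    for sentence in sentences:
        groups[skeleton(sentence, entity_values)].append(sentence)
    return dict(groups)


sentences = [
    "Can I pay with PayPal?",
    "Can I pay with Visa?",
    "Do you accept PayPal here?",
    "can i pay with visa?",
]
groups = group_by_structure(sentences, ["PayPal", "Visa"])
for structure, members in groups.items():
    print(len(members), structure)
```

Each skeleton can then be refilled with a fixed number of entity values, which keeps the structures balanced. Sentences that differ only by a filler word still land in separate groups, so some manual merging would remain.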

datistiquo · September 24, 2018, 4:42pm

It is really difficult to train NER_CRF. In my experience, an entity is often only recognised if both of its surrounding words also appear in the training data. I use the ngram features, but it still generalizes very badly.

datistiquo · September 27, 2018, 12:41pm

I found the following results:

The feature low for NER_CRF is very bad because it then pays too much attention to entities it has already seen. Removing it improves the results drastically.
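
A minimal sketch of what removing it could look like, assuming the features are given as a list of three windows (previous token, current token, next token). The feature names in the list are only illustrative and should be checked against the ner_crf documentation for the installed version; `drop_feature` is a made-up helper.

```python
def drop_feature(features, name):
    """Remove `name` from every position window of a feature list."""
    return [[f for f in window if f != name] for window in features]


# Illustrative feature windows: [previous token, current token, next token].
# Assumption: not necessarily the defaults of any particular Rasa NLU version.
features = [
    ["low", "title", "upper"],
    ["bias", "low", "suffix3", "upper", "title", "digit"],
    ["low", "title", "upper"],
]

without_low = drop_feature(features, "low")
print(without_low)
```

The resulting list is what would go under the features setting of the CRF component in the pipeline config.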

I have many sentences with words surrounding the entity, but I also trained on typical sentences where the entity is at the end of the sentence. But in testing, entities at the end of a sentence are rarely recognized!

It is really difficult to see why this happens. Maybe I have to oversample the sentences with the entity at the end?
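
Before oversampling, it may help to measure how often entities actually sit at the end of a sentence in the training data. The sketch below assumes examples annotated in the `[value](entity)` markdown style and treats trailing punctuation as not counting. The function `entity_positions` and the sample sentences are made up for illustration.

```python
import re
from collections import Counter

# Matches annotations of the form [value](entity_name).
ANNOTATION = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def entity_positions(annotated_sentences):
    """Count whether each annotated entity sits at the start, middle or end."""
    counts = Counter()
    for sentence in annotated_sentences:
        for match in ANNOTATION.finditer(sentence):
            before = sentence[:match.start()].strip()
            # Ignore whitespace and trailing punctuation after the entity.
            after = sentence[match.end():].strip(" \t.?!,")
            if not after:
                counts["end"] += 1
            elif not before:
                counts["start"] += 1
            else:
                counts["middle"] += 1
    return counts


data = [
    "Can I pay with [PayPal](payment_channels)?",
    "Is [Visa](payment_channels) accepted here?",
    "Do you take [PayPal](payment_channels) or [Visa](payment_channels)?",
]
positions = entity_positions(data)
print(dict(positions))
```

If the "end" count is small compared with "middle", that would support the oversampling idea. If it is already large, the cause is probably somewhere else.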
