Advice for creating a data set

I think you have all had the same experience: if you have multiple intents and add examples to one intent to improve its recognition, the classification of the other intents may suffer afterwards.

It is a chicken-and-egg problem.

Sometimes sentences for different intents differ by only one word, for example "How much I get" versus "How I get".

Any advice?

Could you provide an example of intents where this would be the case?

No, it is a general question. I start from an empty set and fill it sentence by sentence, and I found that the number of examples also influences the classification. I struggle to find the right number of examples per intent. Of course, you can have intents whose examples share a similar structure or similar words, like the short example above.

Also, when one intent is more common than the others, you need to oversample it a little. Otherwise the misclassification rate is too high.

How does imbalance between sentence types (structure, count of specific words) within one intent influence the algorithm? Say you have one intent and add various sentences with different structures to it, but you have more examples of one structure than of the others.

“In general” is always hard to answer.

It depends. If you really need to handle these differently, then create a new intent. But if you have something that can be distinguished by entities, then use entities. If you have a lot of intents that are hard to distinguish from each other, then you should probably reconsider your intent classes. The tensorflow embedding pipeline is better at handling these cases, though.

As for more examples for an intent, you should try to keep the number of examples a little balanced, but it’s not a huge problem.
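A quick way to see how balanced the data is would be to count the examples per intent. This is a minimal sketch, assuming the training data is available as (text, intent) pairs; the function name `intent_balance` and the toy examples are made up for illustration.

```python
from collections import Counter

def intent_balance(examples):
    """Return (counts, ratio), where ratio = largest intent size / smallest intent size."""
    counts = Counter(intent for _, intent in examples)
    ratio = max(counts.values()) / min(counts.values())
    return counts, ratio

# Hypothetical toy data: (text, intent) pairs.
examples = [
    ("How much do I get?", "ask_amount"),
    ("How much will I receive?", "ask_amount"),
    ("What amount do I get?", "ask_amount"),
    ("How do I get it?", "ask_procedure"),
]

counts, ratio = intent_balance(examples)
for intent, n in counts.most_common():
    print(intent, n)
print("largest/smallest ratio:", ratio)
```

A ratio close to 1 means the intents are roughly balanced; what ratio is still acceptable is something you have to find out on your own data.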

But really, you shouldn’t worry too much about these things in general, only when these issues actually arise.

I have many sentences like:

  1. Can I pay with PayPal?
  2. Can I pay with Visa?

My entity name is payment_channels. Is it possible to write one sentence to cover many sentences in which only the entity value changes?

Yeah you can do that
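One way to do this by hand is to fill a template with each entity value. This is a small sketch, assuming the markdown training format in which an entity is annotated as `[value](entity_name)`; the helper `expand_template` and the `{value}` slot are hypothetical names, not part of any library.

```python
def expand_template(template, entity, values):
    """Fill the {value} slot once per value, annotated as [value](entity)."""
    return [template.format(value=f"[{v}]({entity})") for v in values]

lines = expand_template(
    "Can I pay with {value}?",
    "payment_channels",
    ["PayPal", "Visa"],
)

for line in lines:
    print("- " + line)
```

The printed lines can be pasted under the matching intent in the training file.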


I need some tips for NER_CRF. I have 1000 sentences and a list of entities. I want to use them to train my NER, since intent classification works well so far. The plan would be to feed the sentences and the list of entities to chatito to generate training data. But you need to take care to balance the different sentence structures. It would not be good to throw all sentences into the training data, each duplicated with other entity values, right? The problem then is that you need to find all the different sentence structures/types (manually), take only one sentence of each, and combine it with different entity values. That way you get, let’s say, 20 sentences of each type that differ only in their entities.

Would this be the right strategy?

Since this is manual labour, I thought you could use simple document classification, like topic modelling, to filter out the different sentence structures, so that I end up with maybe 50 different structures out of my 1000 examples. Those 50 I can use to sample with entities…
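A much simpler stand-in for topic modelling would be to replace every known entity value with a placeholder and group the sentences by what is left. This sketch only finds exact structural duplicates, so it is not topic modelling; the function names and the `<ENTITY>` placeholder are assumptions for illustration.

```python
import re
from collections import defaultdict

def structure_of(sentence, entity_values, placeholder="<ENTITY>"):
    """Lowercase the sentence and replace each known entity value with a placeholder."""
    result = sentence.lower()
    # Longest values first, so a longer value is matched before a shorter one it contains.
    for value in sorted(entity_values, key=len, reverse=True):
        pattern = r"\b" + re.escape(value.lower()) + r"\b"
        result = re.sub(pattern, placeholder, result)
    return result

def group_by_structure(sentences, entity_values):
    """Map each structure to the list of original sentences that share it."""
    groups = defaultdict(list)
    for sentence in sentences:
        groups[structure_of(sentence, entity_values)].append(sentence)
    return dict(groups)

sentences = ["Can I pay with PayPal?", "Can I pay with Visa?", "Is Visa accepted?"]
groups = group_by_structure(sentences, ["PayPal", "Visa"])

for structure, members in groups.items():
    print(len(members), structure)
```

Each key is one candidate structure; you could then keep one sentence per key and sample it with different entity values.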

:grinning: @akelad

It is really difficult to train NER_CRF. In my experience, an entity is often recognised only if both surrounding words appear in the training data. I use the ngram features, but it still generalizes very badly.

I found the following results:

The feature low for NER_CRF is very bad, because it then pays too much attention to entities it has already seen. Removing it improves results drastically.
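To illustrate what removing the feature means, here is a sketch that drops one feature name from a nested feature list. It assumes the features are configured as three lists (previous token, current token, next token); the feature names other than low are placeholders, so check them against your own pipeline config.

```python
def drop_feature(features, name):
    """Return a copy of the nested feature lists without the given feature name."""
    return [[f for f in group if f != name] for group in features]

# Assumed layout: [previous token, current token, next token].
features = [
    ["low", "title", "upper"],
    ["bias", "low", "suffix3", "upper", "title", "digit"],
    ["low", "title", "upper"],
]

without_low = drop_feature(features, "low")
print(without_low)
```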

I have many sentences with words surrounding the entity, but I also trained typical sentences where the entity is at the end of the sentence. But in testing, entities at the end of a sentence are rarely recognised!

It is really difficult to see why this happens. Maybe I have to oversample the sentences with the entity at the end?
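Mechanically, the oversampling could look like the sketch below. It assumes training lines annotated as `[value](entity_name)` and treats an entity as "at the end" when only punctuation follows it; the regex, the function name and the factor of 3 are my own choices for illustration. Whether this actually helps the CRF is exactly the open question.

```python
import re

# An annotated entity followed only by non-word characters (e.g. "?" or ".") up to the end of the line.
ENTITY_AT_END = re.compile(r"\[[^\]]+\]\([^)]+\)\W*$")

def oversample_entity_final(lines, factor=3):
    """Repeat each line whose entity is sentence-final so it appears `factor` times in total."""
    out = []
    for line in lines:
        repeats = factor if ENTITY_AT_END.search(line) else 1
        out.extend([line] * repeats)
    return out

lines = [
    "Can I pay with [PayPal](payment_channels)?",
    "Is [Visa](payment_channels) accepted here?",
]

result = oversample_entity_final(lines, factor=3)
for line in result:
    print(line)
```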