Tools to help create good datasets

How do you create well-balanced datasets for intent and entity classification in practice? Do you offer any tools for that in Rasa Platform, @akelad? A tool that only lets you add data is really not enough. You have to balance the different sentence types for each intent and each entity. I have several thousand examples and struggle to find the useful ones and to delete similar sentences. Would it be a good strategy to have a tool that gives you an overview of the sentence structures in your data? I imagine a simple bag-of-words topic clustering to group my examples, so I can choose the right ones for the TensorFlow embedding. For entities, I think simple statistics over the context words would be enough: you could see how many examples you have for each type of context around an entity and so avoid overfitting. Would that be a good approach?
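To make the idea concrete, here is a rough sketch of what I have in mind. It is not an existing Rasa tool. It assumes the examples are annotated in the markdown-style `[value](entity)` form, and it uses a greedy grouping by Jaccard overlap of word sets as a simple stand-in for real topic clustering. The function names, the `threshold` of 0.6 and the `window` size are all made up for illustration.

```python
import re
from collections import Counter, defaultdict

# Assumed annotation style: "fly to [Berlin](city)"
ENTITY_RE = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")
TOKEN_RE = re.compile(r"\w+")


def tokens(text):
    """Lowercased word tokens, with entity markup reduced to its surface text."""
    return TOKEN_RE.findall(ENTITY_RE.sub(r"\1", text).lower())


def jaccard(a, b):
    """Overlap of two word sets, between 0.0 and 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_similar(examples, threshold=0.6):
    """Greedy grouping: an example joins the first group whose
    first member (the representative) is similar enough."""
    groups = []  # list of (representative word set, member examples)
    for example in examples:
        bag = frozenset(tokens(example))
        for representative, members in groups:
            if jaccard(bag, representative) >= threshold:
                members.append(example)
                break
        else:
            groups.append((bag, [example]))
    return [members for _, members in groups]


def entity_context_counts(examples, window=1):
    """Count the words directly before and after each entity, per entity type."""
    counts = defaultdict(Counter)
    for example in examples:
        for match in ENTITY_RE.finditer(example):
            entity = match.group(2)
            left = tokens(example[:match.start()])
            right = tokens(example[match.end():])
            for word in left[max(0, len(left) - window):]:
                counts[entity][("before", word)] += 1
            for word in right[:window]:
                counts[entity][("after", word)] += 1
    return counts


examples = [
    "book a flight to [Berlin](city)",
    "book a flight to [Paris](city)",
    "I want to fly from [Rome](city)",
    "what is the weather like today",
]

groups = group_similar(examples)
contexts = entity_context_counts(examples)

for members in groups:
    print(len(members), "example(s), e.g.:", members[0])
for entity, counter in contexts.items():
    print(entity, counter.most_common())
```

Large groups would point to near-duplicate sentences that could be thinned out, and a context count dominated by one word (here "to" before a city) would show where an entity only ever appears in the same surroundings.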
