Intent refinement experiment feedback request

I have a decent-sized intent/multi-intent NLU dataset of ~50k responses. I’m using TensorBoard in Rasa to track model performance, following guidance from this forum.

Using clustering, I observed slightly more pronounced delineation between clusters when I filtered out samples labeled with less prevalent intents or multi-intents (treating each multi-intent combination as its own class), so I decided to test intent/multi-intent frequency as a dataset adjustment. Effectively, I treated the filtered datasets as a sort of hyperparameter. I filtered samples at various frequency thresholds (2, 10, 50, 100, 200, and 500). For example, a threshold of 10 means that any sample labeled with an intent or multi-intent that appears in 10 or fewer samples (user responses for training) is filtered out of the dataset. I also trained a baseline model with no filtering, and I started fresh for each training run by deleting the existing model and the Rasa cache.
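For concreteness, the per-threshold filtering is roughly equivalent to this sketch (assuming the NLU data has already been flattened into (text, label) pairs, with each multi-intent combination kept as its own label; the function and variable names are just illustrative):

```python
from collections import Counter

def filter_by_label_frequency(samples, min_count):
    """Drop samples whose label (a solo intent or a multi-intent combination
    such as 'greet+ask_hours', treated as its own class) appears min_count
    times or fewer across the dataset."""
    counts = Counter(label for _, label in samples)
    return [(text, label) for text, label in samples if counts[label] > min_count]

# One filtered dataset per threshold, plus the unfiltered baseline.
thresholds = [2, 10, 50, 100, 200, 500]
# filtered_sets = {t: filter_by_label_frequency(samples, t) for t in thresholds}
```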

The baseline already reaches high training accuracy, exceeding 98% at convergence, with validation accuracy at a commendable 86%. Its training loss settles close to 0.2, while validation loss consistently levels off around 1.4.

Surprisingly (or perhaps not?), none of the filtered datasets matched the baseline model’s accuracy or achieved lower loss.

I’m wondering if I could get feedback on my experimental approach. Is this sound, or does it make more sense to focus only on the solo intents themselves - meaning filter out samples whose solo intent falls below the frequency threshold (along with any multi-intent samples that use that solo intent)? Or is it all for naught given the baseline performance? I still feel like some improvement can be squeezed out of the validation set.
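To be explicit about that alternative, one possible reading is something like the following (same (text, label) assumption as above, using Rasa’s default '+' multi-intent separator; names are illustrative):

```python
from collections import Counter

def filter_by_solo_intent_frequency(samples, min_count, sep="+"):
    """Count each solo intent's frequency, including its appearances inside
    multi-intent labels, then drop any sample whose label involves a solo
    intent occurring min_count times or fewer."""
    counts = Counter()
    for _, label in samples:
        counts.update(label.split(sep))
    return [(text, label) for text, label in samples
            if all(counts[intent] > min_count for intent in label.split(sep))]
```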

Any input would be greatly appreciated!
