I’m new to NLP and NLU and I was looking for a way to do topic modeling, like extracting the topic of a sentence and extracting its tags.
Is Rasa a good choice for that (and if so, how), or are there other libraries/frameworks better suited to this task?
There are two “kinds” of topic models I guess.
One kind of topic modelling tries to provide tags to text. In a lot of cases these tags will be known ahead of time and the task is to attach a tag to a new text. For example it might say “this newspaper article is about politics” vs “about tech”. This use-case falls into the realm of “supervised learning” and I might call this “classification” or “tagging”.
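To make the supervised case concrete, here is a minimal sketch using scikit-learn rather than Rasa. The example texts, the two tags and the `train_tagger` helper are all made up for illustration, and a real project would need far more labelled data than this.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labelled data (invented for this example): each text has a known tag.
train_texts = [
    "the senate passed a new tax bill",
    "the president met with foreign ministers",
    "voters head to the polls for the election",
    "the new phone has a faster processor",
    "the startup released an open source database",
    "developers adopt the latest programming language",
]
train_tags = ["politics", "politics", "politics", "tech", "tech", "tech"]


def train_tagger(texts, tags):
    """Fit a bag-of-words classifier that maps a text to one known tag."""
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(texts, tags)
    return model


tagger = train_tagger(train_texts, train_tags)

# Attach a tag to a text the model has not seen before.
predicted = tagger.predict(["parliament will vote on the election bill"])
print(predicted[0])
```

The important part is that the set of tags is fixed up front: the model can only ever answer with one of the labels it was trained on.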
The other kind of topic modelling is unsupervised. Here there are no labels and you’d be more interested in figuring out if there are clusters in the texts that have been provided. Commonly, you don’t just want to have clusters but you’d also be interested in having some interpretation for each cluster as well.
Rasa comes with algorithms to do the former (supervised), not the latter (unsupervised). The main use-case for us is that we want to detect the “intent” of a message that comes to our assistant.
For unsupervised topic modelling there are a lot of techniques. It’s kind of its own field. I’ve done a bit of work in this field and my favourite “trick” is the one demonstrated in my video on Bulk Labelling. Another popular technique is latent Dirichlet allocation (LDA) and you can find an implementation of it in scikit-learn.
One thing to remember about unsupervised topic modelling (and this also holds for unsupervised learning in general): it’s incredibly hard to argue whether the topics that were found are appropriate. Without labels that represent ground truth, it can be very hard to quantify how well an approach works.