Implementing TFIDF as a custom component? [necro]

I’m really sorry for necro-ing and reviving this topic, but implementing it is really important and I’m still stuck here.

OP
As part of a project at work, I’m building a bot that can answer a predefined set of FAQs. Given the large volume of questions we have, writing training data for them all (including implementation with RasaX) will take a lot of time.

I’ve found that simple tf-idf vectorization produces really good results for answering FAQs that are worded similarly but have entirely different answers. E.g.:

What is an escrow account?
What is an escrow cushion?

TF-IDF tells these two apart very accurately (it is designed to focus on distinctive words, of course), whereas Rasa needs a substantial amount of training data to differentiate between the two acceptably.
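To make the idea concrete, here is a minimal sketch of TF-IDF matching on the two example questions, written in plain Python so it runs without extra packages. The function names (`tokenize`, `build_idf`, `best_match`), the smoothed IDF formula, and the sample query are assumptions made for illustration; in practice a library vectorizer such as scikit-learn's `TfidfVectorizer` would probably be the more sensible choice.

```python
import math
from collections import Counter

# The two FAQ questions from the post.
questions = [
    "What is an escrow account?",
    "What is an escrow cushion?",
]


def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    words = (w.strip("?.,!").lower() for w in text.split())
    return [w for w in words if w]


def build_idf(docs):
    """Smoothed inverse document frequency (an assumed, common variant)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}


def vectorize(tokens, idf):
    """Sparse TF-IDF vector as a dict; terms unseen at fit time are ignored."""
    tf = Counter(t for t in tokens if t in idf)
    return {t: count * idf[t] for t, count in tf.items()}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(query, questions, idf, vectors):
    """Return the stored question most similar to the query, with its score."""
    q = vectorize(tokenize(query), idf)
    scores = [cosine(q, v) for v in vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return questions[best], scores[best]


tokenized = [tokenize(q) for q in questions]
idf = build_idf(tokenized)
vectors = [vectorize(t, idf) for t in tokenized]

match, score = best_match("how big is the escrow cushion", questions, idf, vectors)
print(match, round(score, 3))
```

The shared words ("what", "is", "an", "escrow") get the lowest weight because they appear in every question, so "account" and "cushion" decide the match.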

I’ve read the tutorial on designing custom components, but there doesn’t seem to be a way to really approach this particular problem.

How should I approach this?

@ActuallyAcey have you tried using the ResponseSelector for this?

As for custom components, which part is unclear? You can use tf-idf vectorization as a featurizer, and then e.g. the SklearnClassifier. Is that what you’re after?

I mean, honestly I don’t know how exactly to start. HOW do I implement the vectorizer? What would the “train” method on TFIDF, an algorithm that loads and processes data on the spot, even do? And how would I set it to actually provide an intent as an output rather than entities?

I had a look at the ResponseSelector, but it seems targeted towards small talk rather than being a full-fledged approach.

What do you mean by “not really as a full-fledged approach”? It works very well for Q&A type interactions.

You can simply `pass` in the train method; it doesn’t have to be implemented. E.g. this custom spell checker component I built as an example a while ago doesn’t use the train method:

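from autocorrect import spell
from rasa.nlu.components import Component  # import path as of Rasa 1.x


class RasaSpellChecker(Component):
    """Spell-corrects tokens, leaving the values of "name" entities untouched."""

    defaults = {}
    requires = ["tokens"]
    provides = ["tokens"]
    name = "rasa_spell_checker"

    def __init__(self, component_config=None):
        super(RasaSpellChecker, self).__init__(component_config)

    def train(self, training_data, cfg, **kwargs):
        # Nothing to learn from the training data, so this is a no-op.
        pass

    def process(self, message, **kwargs):
        entity_list = message.get("entities")
        donot_replace = []
        if entity_list:
            # Note: this clears the entities already set on the message.
            message.set("entities", [])
            for e in entity_list:
                # Names should not be "corrected" by the spell checker.
                if e["entity"] == "name":
                    donot_replace.append(e["value"])

        tokens = [t.text for t in message.get("tokens")]
        correct_tokens = [spell(t) if t not in donot_replace else t for t in tokens]

        # Write the corrected text back onto the existing token objects.
        for i, t in enumerate(message.get("tokens")):
            t.text = correct_tokens[i]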

In this case it sets the tokens of the message. In your case you would set the “features” instead, as the spaCy featurizer does, for example: rasa/spacy_featurizer.py at master · RasaHQ/rasa · GitHub
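On the question of what `train` would do for TF-IDF: one reasonable reading is that it fits the vocabulary and IDF weights on the training examples, while `process` only applies them to an incoming message. The sketch below illustrates that split in plain Python. It does not use Rasa's real classes: `FakeMessage` and `TfidfFeaturizerSketch` are hypothetical stand-ins, and the `"features"` key is an assumption, since the exact key and base class depend on your Rasa version.

```python
import math
from collections import Counter


class FakeMessage:
    """Hypothetical stand-in for a Rasa message: text plus a key/value store."""

    def __init__(self, text):
        self.text = text
        self.data = {}

    def get(self, key, default=None):
        return self.data.get(key, default)

    def set(self, key, value):
        self.data[key] = value


class TfidfFeaturizerSketch:
    """Illustrative only; a real component would subclass Rasa's Component."""

    def __init__(self):
        self.vocab = []
        self.idf = {}

    @staticmethod
    def _tokens(text):
        words = (w.strip("?.,!").lower() for w in text.split())
        return [w for w in words if w]

    def train(self, examples):
        # Fit step: learn the vocabulary and smoothed IDF weights.
        docs = [self._tokens(m.text) for m in examples]
        n = len(docs)
        df = Counter(term for doc in docs for term in set(doc))
        self.vocab = sorted(df)
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
        # Featurize the training examples so a later classifier can use them.
        for message in examples:
            self.process(message)

    def process(self, message):
        # Transform step: reuse the fitted weights, never refit here.
        counts = Counter(self._tokens(message.text))
        message.set("features", [counts[t] * self.idf[t] for t in self.vocab])


training = [
    FakeMessage("What is an escrow account?"),
    FakeMessage("What is an escrow cushion?"),
]
featurizer = TfidfFeaturizerSketch()
featurizer.train(training)

incoming = FakeMessage("explain the escrow cushion")
featurizer.process(incoming)
print(dict(zip(featurizer.vocab, incoming.get("features"))))
```

A downstream classifier component would then read those features and set the intent, which is how the output ends up being an intent rather than entities.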
