Hello community! I am using the CountVectorsFeaturizer component for word featurization in my pipeline, and I noticed that it has an `OOV_token` parameter that can handle words the model didn't see during training. So I created an intent, `INT_oov`, filled it with some random sentences, and injected "oov" tokens so that whenever the model sees words that were not in the training data, they get treated as the OOV token and therefore predicted as `INT_oov`.

The problem is that this approach seems to have made the prediction quality much worse: if the model finds even one word that wasn't in the training data, it automatically classifies the whole sentence as `INT_oov`, even though the rest of the words in the sentence could be matched to other, relevant intents.

Am I using `OOV_token` wrong? And what is the best way to deal with unseen words in sentences at prediction time? Thank you