I am trying to find a way to extract entities that I haven’t trained my model on.
E.g. “install python”, which has the intent, let’s say, install_software.
I have more than 20 examples of a person asking to install software in NLU.md,
however my model only recognises the software names that I have explicitly mentioned.
It should also be able to understand “install slack” and extract slack from it, to be saved in the slot software_name later.
I am currently using CRFEntityExtractor, which to my knowledge is for supervised embeddings.
I guess I need some unsupervised extractor too…
Any suggestions?
Just to clarify: Let’s assume your training data looks like this:
## intent:install_software
- can you install [python](software_name)
- please install [rasa](software_name)
- install [rasa-x](software_name)
- ...
You train your bot and when you start the bot, the bot recognizes python as software_name but not slack, for example. Is that correct?
Normally that should not happen. Your CRFEntityExtractor should generalize and also recognize software names that were not mentioned explicitly in the training data.
How much training data do you have? If you only have a couple of examples that include a software_name, it might be hard for the CRFEntityExtractor to generalize. So, maybe try to add more examples to your training data.
Another thing that could help would be lookup tables (see Training Data Format). If you have a list of software names that you want to detect, you can add it as a lookup table to your training data. Lookup tables basically add a new feature to the CRFEntityExtractor, which should help improve performance.
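To make the lookup table idea a bit more concrete, here is a rough Python sketch of how a list of names can be turned into one extra yes/no feature per token. This is not Rasa's actual implementation; the names `SOFTWARE_NAMES`, `build_lookup_pattern` and `lookup_feature` are made up for illustration, and the list contents are just assumed examples.

```python
import re

# Hypothetical lookup list; in a real project this would be your own list of names.
SOFTWARE_NAMES = ["slack", "zoom", "python", "outlook", "rasa-x"]

def build_lookup_pattern(names):
    """Compile one case-insensitive regex that matches any lookup entry."""
    alternatives = "|".join(re.escape(name) for name in names)
    return re.compile(alternatives, re.IGNORECASE)

def lookup_feature(tokens, pattern):
    """Return one boolean per token: True if the whole token is a lookup entry."""
    return [pattern.fullmatch(token) is not None for token in tokens]

pattern = build_lookup_pattern(SOFTWARE_NAMES)
flags = lookup_feature("please install Slack today".split(), pattern)
```

The extractor can then learn that a token with this flag set is likely a software_name, even if that exact word never appeared in an annotated example.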
It seems like you are just using “zoom”, “python” and “outlook” as software name examples in the training data. Can you try to use more diverse software names and retrain? Your current NER might just overfit to those software names. However, as I mentioned the NER should be able to generalize. I guess you just need to confront it with more diverse training data.
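If writing many varied examples by hand is tedious, one option is to generate them from a few sentence templates and a longer list of names. The sketch below is only an illustration: the templates, the names and the helper `make_examples` are assumptions, and the output uses the same markdown annotation style as the training data shown earlier in this thread.

```python
from itertools import product

# Assumed sentence templates and software names; replace with your own.
TEMPLATES = [
    "can you install {}",
    "please install {}",
    "install {}",
    "I need {} on my laptop",
]
SOFTWARE_NAMES = ["slack", "docker", "visual studio code", "7-zip", "git"]

def make_examples(templates, names, entity="software_name"):
    """Yield one annotated example line per (template, name) pair."""
    for template, name in product(templates, names):
        yield "- " + template.format(f"[{name}]({entity})")

lines = ["## intent:install_software", *make_examples(TEMPLATES, SOFTWARE_NAMES)]
training_block = "\n".join(lines)
```

Printing `training_block` gives a section you could paste into your NLU file. Generated data is repetitive, so it is probably best mixed with real user messages rather than used on its own.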
Another thing you might try is to remove the “low” feature from the middle word. The low, prefix and suffix features encourage memorization. If the surrounding words are usually the same, relying on them instead should help the model generalize.
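Here is a small sketch of why dropping the “low” feature for the middle word can help. It is a toy feature function in plain Python, not the extractor's real code; the function name `token_features` and the exact feature keys are assumptions chosen to resemble CRF-style features.

```python
def token_features(tokens, i, use_word_identity=True):
    """Build a CRF-style feature dict for tokens[i] with a one-word window."""
    word = tokens[i]
    feats = {"bias": 1.0, "0:title": word.istitle(), "0:digit": word.isdigit()}
    if use_word_identity:
        # These features tie the prediction to the exact word that was seen.
        feats["0:low"] = word.lower()
        feats["0:prefix2"] = word[:2]
        feats["0:suffix3"] = word[-3:]
    # Context features: lowercased neighbours, with markers at sentence edges.
    feats["-1:low"] = tokens[i - 1].lower() if i > 0 else "BOS"
    feats["+1:low"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "EOS"
    return feats

seen = "please install python".split()
unseen = "please install slack".split()

with_identity = (token_features(seen, 2), token_features(unseen, 2))
without_identity = (token_features(seen, 2, False), token_features(unseen, 2, False))
```

With the identity features switched on, “python” and “slack” produce different feature dicts, so the model can fall back on memorizing the words it saw. With them switched off, both tokens look identical to the model, and the decision rests on the context (“install” right before the word).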