Rasa: Extracting unsupervised entities

I am trying to find ways to extract entities that I haven't trained my model on.

E.g. "install python", which maps to an intent, let's say install_software. I have more than 20 examples of a person asking to install software in my NLU.md, however my model only recognises the software names that I have explicitly mentioned.

It should also be able to understand "install slack" and extract slack, which will later be saved in the slot software_name. I'm currently using CRFEntityExtractor, which to my knowledge is for supervised embeddings.

I guess I need some unsupervised extractor too… Any suggestions?

Just to clarify: let's assume your training data looks like this:

## intent:install_software
- can you install [python](software_name)
- please install [rasa](software_name)
- install [rasa-x](software_name)
- ...

You train your bot, and when you start it, the bot recognizes python as software_name but not slack, for example. Is that correct?

Normally that should not happen. The CRFEntityExtractor should generalize and also recognize software names that were not explicitly mentioned in the training data. How much training data do you have? If you just have a couple of examples that include a software_name, it might be hard for the CRFEntityExtractor to generalize. So maybe try to add more examples to your training data. Another thing that could help is lookup tables (see Training Data Format). If you have a list of software names that you want to detect, you can add them as a lookup table to your training data. Lookup tables basically add a new feature to the CRFEntityExtractor, which should help improve performance.
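For example, a lookup table can be added to the Markdown training data alongside the intent (the software names below are just arbitrary examples):

```
## intent:install_software
- can you install [python](software_name)
- please install [rasa](software_name)
- install [rasa-x](software_name)

## lookup:software_name
- slack
- zoom
- python
- outlook
```

Note that lookup tables only add a feature; they do not guarantee extraction on their own, so the annotated examples above are still needed.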


You have understood my issue correctly. I have tried having more than 20 examples, as you can see in the screenshot I took from Botfront.

Here is my config.

Here is how it understands a trained entity.

Here is how it doesn’t understand an untrained entity.


I am also aware of lookup tables; however, I want to extract entities generically.

It seems like you are just using "zoom", "python" and "outlook" as software name examples in your training data. Can you try using more diverse software names and retraining? Your current NER might just be overfitting to those software names. However, as I mentioned, the NER should be able to generalize. I guess you just need to confront it with more diverse training data.
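To illustrate, diversifying could mean varying both the software names and the surrounding phrasing (the names and wordings below are just made-up examples):

```
## intent:install_software
- can you install [python](software_name)
- please set up [slack](software_name) for me
- i need [docker](software_name) installed on my laptop
- could you get [chrome](software_name) running
- install [visual studio code](software_name)
```

Varying the context words as well as the entity values gives the CRF more signal from surrounding tokens rather than from the entity strings themselves.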


@Tanja you are bang on target. Overfitting was the problem here. Thanks a lot for the help!

Another thing you might try is removing the "low" feature from the middle word. The low, prefix and suffix features accentuate memorization. If the surrounding words are usually the same, that should help it generalize.
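A sketch of how that could look in config.yml, assuming a standard pipeline otherwise: the features option takes three lists (previous word, current word, next word), and here "low", "prefix" and "suffix" features are dropped from the middle list so the current word's own surface form is weighted less:

```yaml
pipeline:
  - name: WhitespaceTokenizer
  - name: CRFEntityExtractor
    features:
      - ["low", "title", "upper"]
      - ["bias", "title", "upper", "digit", "pattern"]
      - ["low", "title", "upper"]
```

This keeps context features for the neighbouring words intact, so the model leans on phrases like "install ..." rather than memorizing the software names themselves.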