Entity roles: experimentation and issues

Hey!

Our team has lately been experimenting with entity roles in Rasa, as they make certain dialogue capabilities much easier to implement. While we think it’s a great feature, and we know it’s still fairly new and experimental, we are running into some problems when implementing models that use it.

Before going into detail, here is our setup.

  • Rasa 1.10.14 (NLU only, we use our own dialogue manager)
  • Language: Swedish (we adapt the examples in this post to English, however)
  • Config:
language: "sv"
pipeline:
  - name: HFTransformersNLP
    model_name: "bert"
    model_weights: "models/tokenization-models/bert-base-swedish-cased/"
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "word"
    min_ngram: 1
    max_ngram: 5
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 10
  - name: DIETClassifier
    epochs: 300
  - name: EntitySynonymMapper

What we want:

  • Some user input to be labeled without a role. E.g. the utterance “Bill Smith” (entity: person name).
  • Some user input to be labeled with role(s). E.g. the utterance “My name is Amy Andersson” (entity: person name, role: caller).

Our problem is that we sometimes get an undesired mix of entities with and without roles in the output. For instance:

  • In the utterance “Bill Smith”, “Bill” is labeled without a role (entity: person name) and “Smith” with a role (entity: person name, role: caller).

However, what we want is “Bill Smith” without a role. Sometimes in a dialogue you want entity roles because the context provides enough information to disambiguate them, but in other cases the content of the utterance alone is not enough to assign a role.

To give more context, the data used to train Rasa looks as follows:

## intent:call
- my name is [Amanda]{"entity": "person_name", "role": "caller"} and I want to talk to [Frederick]{"entity": "person_name", "role": "contact"}
- I want to talk to [Amy Smith]{"entity": "person_name", "role": "contact"}
[...]

## intent:answer
- [Ellie](person_name)
- [Johan](person_name)
- [Johnson](person_name)
- [Emily Torres](person_name)
- [Bill Armstrong](person_name)
[...]

As you can see in this extract of the data, examples with and without roles are separated into two different intents. The intents themselves are predicted correctly, but the roles get mixed up and even leak into the intent that should not have them. As a result, we sometimes get interpretations with mixed entities (with and without roles) like the one described above.

Is there any way in Rasa (NLU) to avoid behaviour like this? Something like making entities and roles (or their absence, in this case) more dependent on the predicted intent. Or maybe there is some component in the pipeline that we have missed.

Or maybe there is some helpful new tool or improvement in Rasa 2.x that can help us with this. We haven’t had the time to experiment with the new version yet, but we plan to do so!

We would be really glad to hear your input and ideas on this topic. Thanks!

Hi @jcanosan! Thanks for your interest in entity roles :slight_smile: You have quite an interesting use case. I’ll try to answer all your questions.

In general your pipeline looks good; you are not missing any component. One thing that I would try first is increasing the number of epochs of the DIETClassifier. In our experience, entity roles need a bit longer to train. So maybe increase the number of epochs to 400 and check again; your problem might already be solved.
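For reference, only the DIETClassifier entry in your config would change (a minimal sketch of just the relevant lines):

```yaml
  - name: DIETClassifier
    epochs: 400
```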

Our problem is that we sometimes get an undesired mix of entities with and without roles in the output.

I am afraid we don’t have a solution for that. We have a method that “cleans up” some of those mixed entity predictions, but we don’t have it in place for roles. That might be something we should add.

Something like making entities and roles (or their absence, in this case) more dependent on the predicted intent.

You cannot influence that directly. As the DIETClassifier trains intents and entities together, it can “learn” that some entities only occur with one intent. So the model architecture allows for that, but there is no model parameter you can tune to enforce this behaviour further.

Or maybe there is some helpful new tool or improvement in Rasa 2.x that can help us with this.

Entity roles did not change in Rasa 2.x, so upgrading should not improve their performance.

So please try increasing the number of epochs, and let me know if that worked.

Hi again! Thank you so much for your informative answer @Tanja . We have been experimenting with increasing the number of epochs to 400, 500, and 1000 (the last one just for comparison). Unfortunately, that didn’t seem to have much impact on the resulting models.

It would be really nice to have such a method available for roles. We definitely encourage adding it; it would be a really nice feature.

As of now we are trying a few things, such as improving the quality and size of the data, and making the models more robust to person names that are not always capitalized in text or ASR input. We have also started investigating the migration to Rasa 2.1, as we see potential for certain entities in the recently added “RegexEntityExtractor”.
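For reference, this is a sketch of how we might add it to a 2.x pipeline, with parameter names as we understand them from the Rasa docs (values are illustrative; the extractor matches entities against regex patterns and lookup tables defined in the training data):

```yaml
  - name: RegexEntityExtractor
    # match case-insensitively, useful for lowercased ASR output
    case_sensitive: false
    # use lookup tables and regex patterns from the training data
    use_lookup_tables: true
    use_regexes: true
```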

We also see potential in the fact that the “LanguageModelFeaturizer” now includes the tokenizer behaviour. If I understand the Rasa docs correctly, that means you can stack another tokenizer (e.g. “WhitespaceTokenizer”) on top of the “LanguageModelFeaturizer” (in our case, with the base Swedish model we use), and it would then use both the “WhitespaceTokenizer” and the model’s own tokenizer from the “LanguageModelFeaturizer”, right? It seems worth investigating, because we have sometimes observed words in entities being split in two. For instance, for the utterance “call anna kronlid” we had a quite extreme case:

"entities": [
    {
        "entity": "person_name",
        "start": 6,
        "end": 15,
        "value": "anna kron",
        "extractor": "DIETClassifier"
    },
    {
        "entity": "person_name",
        "start": 15,
        "end": 18,
        "role": "contact",
        "value": "lid",
        "extractor": "DIETClassifier"
    }
]

Thanks again for the ideas! :slight_smile:

It would be really nice to have such a method available for roles. We definitely encourage adding it; it would be a really nice feature.

It would be great if you could create an issue on GitHub pointing to this thread and including an example. Thanks.

As of now we are trying a few things, such as improving the quality and size of the data, and making the models more robust to person names that are not always capitalized in text or ASR input.

That is always a good idea. Bad data could also be a reason why entity roles are not performing well.

We also see potential in the fact that the “LanguageModelFeaturizer” now includes the tokenizer behaviour. If I understand the Rasa docs correctly, that means you can stack another tokenizer (e.g. “WhitespaceTokenizer”) on top of the “LanguageModelFeaturizer” (in our case, with the base Swedish model we use), and it would then use both the “WhitespaceTokenizer” and the model’s own tokenizer from the “LanguageModelFeaturizer”, right?

Not exactly. The LanguageModelFeaturizer does not actually tokenize the text; you need a tokenizer earlier in your pipeline, and that tokenizer splits the text into tokens. The LanguageModelFeaturizer uses a pretrained language model that expects tokens in a specific format, so it might split tokens into smaller subtokens for featurization. However, the original tokens are not modified; this is just an internal process for computing the feature vectors. What changed is that you can now decide what kind of tokenizer to use up front. Before, this was fixed, and it did not work well with some languages. The entity split-up in the example you posted is due to the “cleaning method” I mentioned earlier.
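To illustrate the idea, here is a simplified sketch (not Rasa’s actual implementation; the tiny subword vocabulary is made up for the example): the tokenizer produces the “real” tokens that entity predictions are aligned to, while a featurizer may internally split each token into subword pieces and then combine the piece vectors back into one vector per original token.

```python
def whitespace_tokenize(text):
    """Produce the 'real' tokens with their character offsets."""
    tokens, start = [], 0
    for word in text.split():
        start = text.index(word, start)
        tokens.append({"text": word, "start": start, "end": start + len(word)})
        start += len(word)
    return tokens

# Hypothetical subword vocabulary, standing in for a wordpiece vocab.
VOCAB = {"call", "anna", "kron", "##lid"}

def subword_split(token):
    """Greedy longest-match split of one token into subword pieces."""
    pieces, rest = [], token
    while rest:
        for i in range(len(rest), 0, -1):
            candidate = rest[:i] if not pieces else "##" + rest[:i]
            if candidate in VOCAB:
                pieces.append(candidate)
                rest = rest[i:]
                break
        else:
            return [token]  # unknown piece: fall back to the whole token
    return pieces

for token in whitespace_tokenize("call anna kronlid"):
    print(token["text"], "->", subword_split(token["text"]))
    # "kronlid" -> ["kron", "##lid"], but the token itself stays whole:
    # the featurizer only uses the pieces internally and pools their
    # vectors back into one feature vector per original token.
```

Because the entity labels are assigned to the original tokens, a correct alignment would label “kronlid” as one unit rather than “kron” and “lid” separately.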

I will do so, probably at some point today. It would be really nice to have that indeed. :slight_smile:

Ok, I see, and that makes a lot of sense. We have indeed observed the model doing weird things when tokenizing (we are working with Swedish), so this is a very good addition. Yesterday we trained an experimental model without roles on similar data and still got predictions with split words, so it doesn’t seem like the cleanup method was working properly, at least not for Swedish (in Rasa 1.10.14) and/or with this tokenizer.

But when we tried Rasa 2.1.3 for the first time with the same data later on, the problem disappeared, at least for entities without roles. The data is exactly the same, and the only difference in the pipeline is that we add the WhitespaceTokenizer:

language: "sv"
pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    model_name: "bert"
    model_weights: "models/tokenization-models/bert-base-swedish-cased/"
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "word"
    min_ngram: 1
    max_ngram: 5
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 10
  - name: DIETClassifier
    epochs: 500
  - name: EntitySynonymMapper

This makes me think that being forced to use only the LanguageModelTokenizer was what caused words to be split in two, like the example in my last post: “Anna Kronlid” > “Anna Kron” | “lid”. As you say, it does seem like that was causing issues for our Swedish model, so it’s really good that we can now add another tokenizer.

Thanks!