Evaluation results on entity extraction

When supervised_embeddings (which uses CRFEntityExtractor) is used in the config (without any additional options) and evaluated, how are the metrics (precision, recall, F1 and accuracy) computed? I am asking because I found two contradictory pieces of information.

According to https://rasa.com/docs/rasa/user-guide/evaluating-models/, entity scoring seems to be based on a simple tag-based approach which splits multiword entities and evaluates performance per individual token. However, when I look at the code (rasa/crf_entity_extractor.py at master · RasaHQ/rasa · GitHub), the default setting of supervised_embeddings sets BILOU_flag to True. When BILOU_flag is true, the extracted entities are not individual tokens but full sequences of them (rasa/crf_entity_extractor.py at ab382d049471c8f8468547f6f69f3a11a76600aa · RasaHQ/rasa · GitHub).

Which one is right? When entity extraction is evaluated with the default supervised_embeddings pipeline, is each entity an individual token or the whole BIL sequence?

Hi @onue5, we are using the first approach you mentioned:

According to Evaluating Models, entity scoring seems to be based on a simple tag-based approach which splits multiword entities and evaluates performance per individual token.

The BILOU flag is only used to determine the tagging data format. It does not influence the evaluation. The method you mentioned (https://github.com/RasaHQ/rasa/blob/ab382d049471c8f8468547f6f69f3a11a76600aa/rasa/nlu/extractors/crf_entity_extractor.py#L310) just converts the BILOU tag schema into a simple schema. For example, it removes B- from an entity label.
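For illustration, a minimal sketch of what that conversion does (the helper name `strip_bilou_prefix` is made up here, not Rasa's actual function):

```python
def strip_bilou_prefix(tag: str) -> str:
    """Turn a BILOU tag like 'B-per' into the plain entity label 'per'."""
    if tag[:2] in ("B-", "I-", "L-", "U-"):
        return tag[2:]
    return tag  # 'O' (no entity) stays unchanged


print([strip_bilou_prefix(t) for t in ["O", "B-per", "I-per", "L-per", "O"]])
# ['O', 'per', 'per', 'per', 'O']
```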

@Tanja, Thanks for the answer.

To clarify, suppose the tagging is ["O", "B-per", "I-per", "L-per", "O"]. Would the corresponding entity (considered during the evaluation) be {"entity": "per", "start": 1, "end": 3}, where "entity", "start", and "end" represent the entity label, the start index, and the end index respectively? Or would it be {"entity": "per", "start": 1, "end": 1}, {"entity": "per", "start": 2, "end": 2}, {"entity": "per", "start": 3, "end": 3}?

I looked at the code again (rasa/crf_entity_extractor.py at ab382d049471c8f8468547f6f69f3a11a76600aa · RasaHQ/rasa · GitHub). The method calls self._handle_bilou_label, which then calls self._find_bilou_end (rasa/crf_entity_extractor.py at ab382d049471c8f8468547f6f69f3a11a76600aa · RasaHQ/rasa · GitHub). This method seems to find the ending index of a multiword entity when one is tagged. If that is true, aren't the whole sequences of the multiword entities considered during the evaluation?

Can you point me to the part of the code where the multiword entities are split into individual tokens for the evaluation?

Let’s look at an example:

(Rasa Technologies GmbH)[company] is based in (Berlin)[location].

This is equivalent to

{
  "entities": [
    {
      "start": 0,
      "end": 22,
      "entity": "company",
      "value": "Rasa Technologies GmbH"
    },
    {
      "start": 35,
      "end": 41,
      "entity": "location",
      "value": "Berlin"
    }
  ]
}

During evaluation we convert the above sentence to

["company", "company", "company", "no-entity", "no-entity", "no-entity", "location"]

The same is done for the predictions, so you end up with an array of gold labels and an array of predicted labels. We use the evaluation metrics from sklearn to obtain the f-score, precision, recall, etc. (see, for example, sklearn.metrics.classification_report — scikit-learn 0.19.2 documentation).
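As a rough, self-contained sketch of that per-token evaluation (the label arrays are the ones from the example above, and the prediction is made up; this is not the exact code in rasa/test.py):

```python
from sklearn.metrics import classification_report

# Per-token gold labels for "Rasa Technologies GmbH is based in Berlin"
gold = ["company", "company", "company", "no-entity", "no-entity", "no-entity", "location"]
# Hypothetical prediction that only caught part of the multiword entity
pred = ["company", "company", "no-entity", "no-entity", "no-entity", "no-entity", "location"]

# Every token counts as one classification decision, so a partially
# recognised multiword entity still contributes partial credit.
print(classification_report(gold, pred))
```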

Regarding your questions:

  1. The first one is correct: {"entity": "per", "start": 1, "end": 3}. The B, I, and L tags would be merged together (see the sketch after this list).
  2. Not 100% sure what you mean. We convert the BILOU format to the JSON format you can see above, and during evaluation we use the array you see above, i.e. the entity is split into tokens again. Does that clarify your question?
  3. Take a look at rasa/test.py at master · RasaHQ/rasa · GitHub
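To make point 1 concrete, here is a hypothetical sketch of merging a B-/I-/L- run into a single entity with inclusive token indices, as in your example (this is not Rasa's actual _handle_bilou_label / _find_bilou_end implementation):

```python
def merge_bilou(tags):
    """Merge B-/I-/L- runs (and single U- tags) into entity dicts with inclusive token indices."""
    entities, start = [], None
    for i, tag in enumerate(tags):
        if tag.startswith("U-"):          # single-token entity
            entities.append({"entity": tag[2:], "start": i, "end": i})
        elif tag.startswith("B-"):        # beginning of a multiword entity
            start = i
        elif tag.startswith("L-") and start is not None:  # last token of the entity
            entities.append({"entity": tag[2:], "start": start, "end": i})
            start = None
    return entities


print(merge_bilou(["O", "B-per", "I-per", "L-per", "O"]))
# [{'entity': 'per', 'start': 1, 'end': 3}]
```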