Entity extraction fails if entity to be extracted is too long

tantris · July 10, 2022, 8:15am

Hello,

I am currently using Rasa Open Source to train a chatbot that handles multiple intents and entities. Intent classification and entity extraction work well for all but one of my intents. That intent covers sentences in which users ask for information about legal documents, which they refer to by title, and these titles are often very long. I need to extract the titles as entities in order to generate a proper answer.

Here is my problem: when an entity is too long, the trained DIETClassifier model fails to extract it, even if that exact entity appears in the training set. Intent detection is not affected: even when entity extraction fails, the model still classifies the sentence into the correct intent.

Here is one example that I have added to my training set:

“Which national law transposes the [directive (EU) 2016/680 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data by competent authorities for the purposes of the prevention, investigation, detection or prosecution of criminal offences or the execution of criminal penalties]{"entity": "directive"} ?”

After training, when I input this exact sentence, the model correctly infers the intent but fails to extract any entity.

Now for the weird part: entity extraction works when I input a new sentence with a shortened legal title (without modifying the training set or retraining the model). Here is one working example:

“Which national law transposes the [directive (EU) 2016/680 of the European Parliament and of the Council of 27 April 2016 on the processing of personal data]{"entity": "directive"} ?”

Here is my pipeline in more detail:

pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 6
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
  - name: FallbackClassifier
    threshold: 0.3
    ambiguity_threshold: 0.1
  - name: "DucklingEntityExtractor"
    url: "http://localhost:8000"
    dimensions: ["time", "number"]
    timezone: "Europe/Paris"
    timeout: 3

I am very curious as to why this affects entity extraction specifically, since DIET seems to handle intents properly regardless of input sequence length. Additionally, if this cannot be solved with my current pipeline, does anyone have an idea for working around the problem with another entity extraction strategy (instead of DIET entity extraction)?

Thanks in advance for your help!

tantris · August 23, 2022, 4:14pm

I managed to solve the problem, and I’m posting the solution here in case anyone runs into the same difficulty. I had simply failed to notice two things:

Entity extraction was not reliable on input sentences containing infrequent characters like “(”, “/”, etc.
Entity extraction did work on some long entities, but each of these long entities was split into several smaller entities, which I just had to concatenate to rebuild the final entity.

It seems that nothing was wrong per se with my pipeline.
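
For anyone who needs the concatenation step, here is a minimal sketch of one way to do it in Python, not necessarily the exact code used in this thread. It assumes the entities arrive as a list of dicts with "entity", "start", "end" and "value" keys, as in Rasa's parsed message output. The names merge_adjacent_entities, max_gap and span are hypothetical, and the fragment boundaries in the example are made up for illustration.

```python
from typing import Dict, List


def merge_adjacent_entities(text: str, entities: List[Dict], max_gap: int = 1) -> List[Dict]:
    """Merge consecutive entities of the same type that are separated
    by at most `max_gap` characters in `text` (hypothetical helper)."""
    merged: List[Dict] = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev["entity"] == ent["entity"]
            and ent["start"] - prev["end"] <= max_gap
        ):
            # Extend the previous span and rebuild its value from the raw text,
            # so the separating whitespace is preserved.
            prev["end"] = max(prev["end"], ent["end"])
            prev["value"] = text[prev["start"]:prev["end"]]
        else:
            # Copy the dict so the caller's list is left untouched.
            merged.append(dict(ent))
    return merged


def span(text: str, piece: str, entity: str) -> Dict:
    """Build an example entity dict by locating `piece` in `text`."""
    start = text.index(piece)
    return {"entity": entity, "start": start, "end": start + len(piece), "value": piece}


text = "Which national law transposes the directive (EU) 2016/680 of the European Parliament?"

# Illustrative fragments: one long title returned as two smaller entities.
fragments = [
    span(text, "directive (EU)", "directive"),
    span(text, "2016/680 of the European Parliament", "directive"),
]

merged = merge_adjacent_entities(text, fragments)
print(merged[0]["value"])
```

Rebuilding the value from the original text, rather than joining the fragment values with spaces, keeps the exact spacing and punctuation of the input. A larger max_gap would also bridge a skipped token between fragments, at the risk of merging two entities that are genuinely separate.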
