Entity extraction fails if entity to be extracted is too long

Hello,

I am currently using Rasa Open Source to train a chatbot handling multiple intents and entities. Intent classification and entity extraction work perfectly well for all my intents but one. The problematic intent is made up of sentences in which users ask for information about legal documents referred to by their title, which is often very long. I need to extract these legal document titles as entities in order to properly generate an answer.

Here is my problem: the trained DIETClassifier model fails to extract some of these entities when they are too long, even if the exact sentence is present in the training set. Intent detection is not a problem: even when entity extraction fails, the model still classifies the sentence into the correct intent.

Here is one example that I have added to my training set:

“Which national law transposes the [directive (EU) 2016/680 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data by competent authorities for the purposes of the prevention, investigation, detection or prosecution of criminal offences or the execution of criminal penalties]{"entity": "directive"} ?”

After training the model, and inputting this exact sentence, the model correctly infers the intent but fails to extract an entity.

Now for the weird part: entity extraction works when I input a new sentence with a shortened legal title (without modifying the training set or retraining the model). Here is one example that works:

“Which national law transposes the [directive (EU) 2016/680 of the European Parliament and of the Council of 27 April 2016 on the processing of personal data?]{"entity": "directive"} ?”

Here is my pipeline in more detail:

pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 6
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
  - name: FallbackClassifier
    threshold: 0.3
    ambiguity_threshold: 0.1
  - name: "DucklingEntityExtractor"
    url: "http://localhost:8000"
    dimensions: ["time", "number"]
    timezone: "Europe/Paris"
    timeout: 3

I am very curious as to why this happens with entity extraction specifically, since DIET seems to handle intents properly regardless of input sequence length. Additionally, in case this cannot be solved with my current pipeline, does anyone have an idea of how to work around the problem with another entity extraction strategy (instead of DIET entity extraction)?

Thanks in advance for your help!

I managed to solve the problem, and I’m posting the solution here in case anyone runs into the same difficulty. I had simply failed to notice that:

  • Entity extraction was not reliable on input sentences containing infrequent characters like “(”, “/”, etc.
  • Entity extraction worked on some long entities, but these long entities were split into several smaller entities, which I just had to concatenate to form the final entity.

It seems that nothing was wrong per se with my pipeline.
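
For anyone who wants to do the same, the concatenation step could look roughly like the sketch below. This is not my exact code, just a minimal illustration. It assumes the entities come as a list of dicts with the "entity", "start", "end" and "value" keys that Rasa returns in its parse output. The function name merge_adjacent_entities and the max_gap parameter are made up for this example.

```python
from typing import Any, Dict, List


def merge_adjacent_entities(
    text: str,
    entities: List[Dict[str, Any]],
    max_gap: int = 1,
) -> List[Dict[str, Any]]:
    """Merge consecutive fragments of the same entity type into one span.

    Two fragments are merged when they share the same "entity" label and are
    separated by at most `max_gap` characters (hypothetical parameter; 1
    covers a single space between fragments).
    """
    merged: List[Dict[str, Any]] = []
    # Process fragments in reading order so neighbours are compared pairwise.
    for ent in sorted(entities, key=lambda e: e["start"]):
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev["entity"] == ent["entity"]
            and ent["start"] - prev["end"] <= max_gap
        ):
            # Extend the previous span and rebuild its value from the
            # original message so the separating characters are preserved.
            prev["end"] = max(prev["end"], ent["end"])
            prev["value"] = text[prev["start"]:prev["end"]]
        else:
            # Copy so the caller's dicts are not modified in place.
            merged.append(dict(ent))
    return merged
```

You would call it with the original message text and the list of extracted entities, then read the "value" of each merged entry. Whether a gap of one character is the right threshold depends on how the fragments come out for your data, so check it against a few real predictions.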