Hello,
I am currently using Rasa Open Source to train a chatbot that handles multiple intents and entities. Intent classification and entity extraction work well for all my intents but one. That intent covers sentences in which users ask for information about legal documents, which they refer to by title, and these titles are often very long. I need to extract the document titles as entities in order to generate a proper answer.
Here is my problem: when the entity is very long, the trained DIETClassifier model fails to extract it, even when the exact sentence is present in the training set. Intent detection is not affected: even when entity extraction fails, the model still classifies the sentence into the correct intent.
Here is one example that I have added to my training set:
“Which national law transposes the [directive (EU) 2016/680 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data by competent authorities for the purposes of the prevention, investigation, detection or prosecution of criminal offences or the execution of criminal penalties]{"entity": "directive"}?”
After training, when I input this exact sentence, the model correctly infers the intent but extracts no entity.
Now for the weird part: entity extraction works when I input a new sentence with a shortened legal title, without modifying the training set or retraining the model. Here is one working example, with the extracted span marked:
“Which national law transposes the [directive (EU) 2016/680 of the European Parliament and of the Council of 27 April 2016 on the processing of personal data]{"entity": "directive"}?”
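To put a rough number on "too long", the two titles above can be compared by whitespace token count. This is only a sketch: `count_tokens` is a hypothetical helper that splits on whitespace, which approximates but does not reproduce what WhitespaceTokenizer does with punctuation.

```python
# Compare the two directive titles quoted above by whitespace token count.
# count_tokens is a hypothetical helper, only a rough stand-in for
# Rasa's WhitespaceTokenizer (which also handles punctuation).

LONG_TITLE = (
    "directive (EU) 2016/680 of the European Parliament and of the Council "
    "of 27 April 2016 on the protection of natural persons with regard to "
    "the processing of personal data by competent authorities for the "
    "purposes of the prevention, investigation, detection or prosecution of "
    "criminal offences or the execution of criminal penalties"
)
SHORT_TITLE = (
    "directive (EU) 2016/680 of the European Parliament and of the Council "
    "of 27 April 2016 on the processing of personal data"
)


def count_tokens(text: str) -> int:
    """Return the number of whitespace-separated tokens in text."""
    return len(text.split())


long_count = count_tokens(LONG_TITLE)
short_count = count_tokens(SHORT_TITLE)
print(f"long title:  {long_count} tokens")
print(f"short title: {short_count} tokens")
```

The failing title is more than twice as long as the working one, so every one of its tokens has to be tagged correctly for the full span to come out as a single entity.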
Here is my pipeline in more detail:
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 6
  - name: EntitySynonymMapper
  - name: ResponseSelector
    epochs: 100
  - name: FallbackClassifier
    threshold: 0.3
    ambiguity_threshold: 0.1
  - name: "DucklingEntityExtractor"
    url: "http://localhost:8000"
    dimensions: ["time", "number"]
    timezone: "Europe/Paris"
    timeout: 3
I am very curious as to why this affects entity extraction specifically, since DIET seems to classify intents correctly regardless of input sequence length. Additionally, if this cannot be solved with my current pipeline, does anyone have an idea for working around the problem with another entity extraction strategy, in place of DIET entity extraction?
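To make the second question concrete, the kind of non-DIET fallback I have in mind is a rule-based pass over the raw message. Below is a minimal sketch in plain Python, outside of Rasa. The pattern, the function name `extract_directive`, and the assumption that the title always starts with "directive (EU) <year>/<number>" and runs up to the question mark are all hypothetical and untested on real user input.

```python
import re
from typing import Optional

# Hypothetical pattern: "directive (EU) <year>/<number>" followed by everything
# up to (but not including) the first question mark or the end of the text.
DIRECTIVE_PATTERN = re.compile(
    r"directive\s+\(EU\)\s+\d{4}/\d+[^?]*",
    flags=re.IGNORECASE,
)


def extract_directive(text: str) -> Optional[dict]:
    """Return the first directive title found in text, or None if there is none."""
    match = DIRECTIVE_PATTERN.search(text)
    if match is None:
        return None
    # Drop whitespace left between the title and the question mark.
    value = match.group(0).rstrip()
    start = match.start()
    return {
        "entity": "directive",
        "value": value,
        "start": start,
        "end": start + len(value),
    }


if __name__ == "__main__":
    message = (
        "Which national law transposes the directive (EU) 2016/680 of the "
        "European Parliament and of the Council of 27 April 2016 on the "
        "processing of personal data ?"
    )
    print(extract_directive(message))
```

Unlike a sequence tagger, such a rule does not depend on the title length, but it only covers titles that follow this exact naming scheme. Wiring it into the Rasa pipeline (for example through a custom component) is not shown here.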
Thanks in advance for your help!