Rasa version: 2.3.0
Python version: 3.8.10
Operating system: Ubuntu 20.04
Issue:
I trained a model on an entity consisting of multiple words, but at prediction time the model sometimes splits that entity apart: some of its words are assigned to one instance of the entity and the rest to another. Sometimes it even splits it word by word, with each word predicted as a separate instance of the same entity.
I have read issues where this happened to people using lookup tables; however, I am not using any. It just happens from time to time. Which solution would you recommend?
Example:
- Training data: "[Force Majeure Event](my_event) means any act or event, whether foreseen or unforeseen, that satisfies all of the following criteria."
- Prediction (one of the following splits):
  - Force Majeure - my_event, Event - my_event
  - Force - my_event, Majeure Event - my_event
  - Force - my_event, Majeure - my_event, Event - my_event
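For context, here is a minimal sketch of how such an example is annotated in my training data, using the Rasa 2.x YAML format (the intent name define_term is just a placeholder, not my real intent):

version: "2.0"
nlu:
- intent: define_term  # placeholder intent name
  examples: |
    - [Force Majeure Event](my_event) means any act or event, whether foreseen or unforeseen, that satisfies all of the following criteria.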
Content of configuration file (config.yml):
language: en
pipeline:
  - name: "WhitespaceTokenizer"
  - name: "CountVectorsFeaturizer"
    analyzer: "word"
  - name: "CountVectorsFeaturizer"
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: "DIETClassifier"
    random_seed: 42
    intent_classification: True
    entity_recognition: False
    epochs: 50
    learning_rate: 0.0002
    embedding_dimension: 60
    number_of_transformer_layers: 1
    batch_size: 64
    hidden_layers_sizes:
      text: [256, 128]
    drop_rate: 0.3
    weight_sparsity: 0.9
  - name: "LexicalSyntacticFeaturizer"
    features: [
      ["prefix5", "prefix2", "suffix5", "suffix3", "suffix2", "digit"],
      ["prefix5", "prefix2", "suffix5", "suffix3", "suffix2", "digit"],
      ["prefix5", "prefix2", "suffix5", "suffix3", "suffix2", "digit"],
      ["low", "prefix5", "prefix2", "suffix5", "suffix3", "suffix2", "title", "digit"],
      ["prefix5", "prefix2", "suffix5", "suffix3", "suffix2", "digit"],
      ["prefix5", "prefix2", "suffix5", "suffix3", "suffix2", "digit"],
      ["prefix5", "prefix2", "suffix5", "suffix3", "suffix2", "digit"]
    ]
  - name: "DIETClassifier"
    random_seed: 42
    intent_classification: False
    entity_recognition: True
    epochs: 200
    learning_rate: 0.0002
    embedding_dimension: 60
    number_of_transformer_layers: 1
    batch_size: 32
    hidden_layers_sizes:
      text: [256, 128]
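For reference, when one of the bad splits above happens, the parse result contains several separate entity entries instead of a single one. Here is a sketch of the entities part of the output for the first split, shown as YAML rather than the raw JSON, with illustrative character offsets that assume the phrase starts the message:

entities:
- entity: my_event
  value: "Force Majeure"   # first fragment of the split entity
  start: 0
  end: 13
  extractor: "DIETClassifier"
- entity: my_event
  value: "Event"           # second fragment, predicted as a separate entity
  start: 14
  end: 19
  extractor: "DIETClassifier"

What I would expect instead is a single entry with value "Force Majeure Event" covering the whole span.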