In my bot, I’m using CRF for entity extraction and DIET for intent classification, plus Duckling to extract time-based entities. The issue arises when a message contains an entity that is a MongoDB ObjectId: CRF extracts the entity correctly, but Duckling sometimes mistakes part of it for a datetime.
I tried adding a prefix to the entity and a regex for it, so the user message looks like this:
some_text MONGOID//612a1731af13ee4e235e5ead
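To make the prefixed format concrete, here is a minimal sketch of the kind of pattern meant here. It assumes an ObjectId is always exactly 24 hexadecimal characters; the pattern and the name MONGO_ID_PATTERN are illustrative, not the exact regex from my training data.

```python
import re

# Assumption: an ObjectId is exactly 24 hex characters and always follows
# the literal prefix "MONGOID//". MONGO_ID_PATTERN is an illustrative name.
MONGO_ID_PATTERN = re.compile(r"MONGOID//([0-9a-fA-F]{24})\b")

message = "some_text MONGOID//612a1731af13ee4e235e5ead"
match = MONGO_ID_PATTERN.search(message)

# group(1) is the bare ID without the prefix; None if nothing matched.
object_id = match.group(1) if match else None
print(object_id)
```

A pattern like this only helps the components that consume regex features (here, CRF via the "pattern" feature); Duckling does not see it.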
Duckling sometimes sees this as a datetime: it extracts “1731” as the year and apparently reads the leading “612a” as 6:12 a.m. Specifically, this is the result:
{
    'start': 99,
    'end': 107,
    'text': '612a1731',
    'value': '1731-01-01T06:12:00.000-07:53',
    'confidence': 1.0,
    'additional_info': {
        'values': [{
            'value': '1731-01-01T06:12:00.000-07:53',
            'grain': 'minute',
            'type': 'value'
        }, {
            'value': '1731-01-02T06:12:00.000-07:53',
            'grain': 'minute',
            'type': 'value'
        }, {
            'value': '1731-01-03T06:12:00.000-07:53',
            'grain': 'minute',
            'type': 'value'
        }],
        'value': '1731-01-01T06:12:00.000-07:53',
        'grain': 'minute',
        'type': 'value'
    },
    'entity': 'time',
    'extractor': 'DucklingEntityExtractor'
}
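The span in this result (start 99, end 107) covers only the first 8 characters of the 24-character ID, so it should sit inside whatever span CRF extracts for the ID. One conceivable post-processing check is therefore to drop any Duckling entity whose span is contained in another extractor's span. This is only a sketch under assumptions: the entity dicts are shaped like the output above, and the CRF entity (its name mongo_id and its span 99 to 123) is hypothetical. Where such a filter would run, for example in a custom component placed after DucklingEntityExtractor, is left open.

```python
def drop_contained_duckling(entities, extractor="DucklingEntityExtractor"):
    """Drop entities from `extractor` whose span lies inside another extractor's span."""
    others = [e for e in entities if e.get("extractor") != extractor]
    kept = []
    for e in entities:
        contained = any(
            o["start"] <= e["start"] and e["end"] <= o["end"] for o in others
        )
        if e.get("extractor") == extractor and contained:
            continue  # e.g. the spurious 'time' entity inside the ID
        kept.append(e)
    return kept


entities = [
    # Hypothetical CRF result: entity name and span are assumed for illustration.
    {"start": 99, "end": 123, "entity": "mongo_id", "extractor": "CRFEntityExtractor"},
    # Span and extractor taken from the Duckling output above.
    {"start": 99, "end": 107, "entity": "time", "extractor": "DucklingEntityExtractor"},
]

filtered = drop_contained_duckling(entities)
print([e["entity"] for e in filtered])
```

Genuine time expressions elsewhere in the message are unaffected, since they do not fall inside a CRF span.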
My pipeline is this:
pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: CRFEntityExtractor
    "BILOU_flag": True
    "features": [
      ["low", "title", "upper"],
      ["bias", "low", "prefix5", "prefix2", "suffix5", "suffix3", "suffix2",
       "upper", "title", "digit", "pattern"],
      ["low", "title", "upper"]
    ]
    "max_iterations": 50
    "featurizers": []
  - name: DucklingEntityExtractor
    url: "http://localhost:8000"
    locale: "en_US"
    dimensions: ["time", "duration", "ordinal"]
  - name: DIETClassifier
    epochs: 500
    entity_recognition: False
    constrain_similarities: true
  - name: EntitySynonymMapper
  - name: FallbackClassifier
    threshold: 0.7
    ambiguity_threshold: 0.1
Since the trainable components (CRF and DIET) are behaving correctly and I can’t exactly train Duckling, does anyone have suggestions for what I can do?
Thanks in advance.