Returned entity getting formatted


I’ve got rasa_nlu integrated into my Python app. I’m passing it a glob of characters, ‘${webAddress}’ in this example, and I’d like to get that glob of characters back as an entity. For some reason, ner_crf is adding spaces to the entity value, even though they are not in the response text. How do I make it stop doing that?

I’m using the spacy_sklearn pipeline for training. I have included several very similar examples in my training data (substituting some other string for webAddress), and it does recognize the entity. Just…just…stop it with the spaces!

 $ curl -XPOST localhost:5000/parse -d '{"q":"navigate to ${webAddress}"}'
 {
   "intent": {
     "name": "navigate",
     "confidence": 0.1911308126544064
   },
   "entities": [
     {
       "start": 12,
       "end": 25,
       "value": "$ { webaddress }",
       "entity": "url",
       "confidence": 0.5229620578330448,
       "extractor": "ner_crf"
     }
   ],
   "text": "navigate to ${webAddress}",
   "project": "default",
   "model": "model_20190409-153615"
 }

I just ran into another example. The text Double-click-me, which I want to get back as Double-click, is instead returned as double - click - me.

I guess a more direct way of asking the question is, “How to stop entities formatting?”

I fixed the part where spaces are being added. In crf_entity_extractor.py, I found _create_entity_dict() and removed the space from the join() that creates ‘value’.
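
To illustrate what that change does, here is a minimal sketch. The token list is made up to mirror how the tokenizer seems to split ${webAddress}; it is not taken from the Rasa source, and join_tokens is a hypothetical helper rather than anything in crf_entity_extractor.py.

```python
# Hypothetical tokens, mirroring the split seen in the response above.
tokens = ["$", "{", "webaddress", "}"]

def join_tokens(tokens, separator):
    """Build an entity value by joining token texts with `separator`."""
    return separator.join(tokens)

print(join_tokens(tokens, " "))  # with the space:    $ { webaddress }
print(join_tokens(tokens, ""))   # without the space: ${webaddress}
```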


Oh, I’m facing the same issue with entities that have a hyphen in them. Thanks for the solution.

Eventually, I ran into another problem. Given my solution above, a multiword entity came back concatenated. The solution to that was to modify the _create_entity_dict() method as below:

def _create_entity_dict(self, tokens, start, end, entity, confidence):
    if self.pos_features:
        # spaCy path: slicing the Doc gives a span whose text keeps
        # the original spacing.
        _start = tokens[start].idx
        _end = tokens[start:end + 1].end_char
        value = tokens[start:end + 1].text
    else:
        _start = tokens[start].offset
        _end = tokens[end].end
        text_array = [t.text for t in tokens[start:end + 1]]
        value = text_array[0]
        # Walk every token after the first. Single-character tokens
        # (e.g. '$', '{', '-') are glued to their neighbours; all other
        # tokens are separated by a space.
        for i in range(1, len(text_array)):
            if len(text_array[i - 1]) == 1 or len(text_array[i]) == 1:
                value += text_array[i]
            else:
                value += ' ' + text_array[i]
    return {
        'start': _start,
        'end': _end,
        'value': value,
        'entity': entity,
        'confidence': confidence
    }

It’s a little hokey, and likely very error-prone, but this is the best I’ve been able to do so far.
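
To see where the single-character rule holds up and where it breaks, the joining logic can be pulled out into a standalone function and tried on plain strings. This is only a sketch: join_entity_tokens is a made-up name, and the token lists are hand-written guesses at how the tokenizer splits the text.

```python
def join_entity_tokens(texts):
    """Join token texts, gluing single-character tokens to their neighbours."""
    value = texts[0]
    for prev, cur in zip(texts, texts[1:]):
        if len(prev) == 1 or len(cur) == 1:
            value += cur          # no space around one-character tokens
        else:
            value += " " + cur    # ordinary word boundary
    return value

print(join_entity_tokens(["$", "{", "webaddress", "}"]))        # ${webaddress}
print(join_entity_tokens(["double", "-", "click", "-", "me"]))  # double-click-me
print(join_entity_tokens(["new", "york"]))                      # new york
print(join_entity_tokens(["plan", "a", "meeting"]))             # planameeting
```

The last case is the weak spot: any genuine one-letter word gets glued to the words around it.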

Yes, I expected it to cause issues with multi-word entities. So I’m just modifying my actions.py to account for the extra spaces. I’ll try this out.
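
Another option on the actions.py side, sketched here under assumptions: the response carries start and end character offsets alongside the original text, so the entity can be recovered by slicing instead of stripping spaces out of value. original_entity_text is a hypothetical helper, and the dict is trimmed down from the curl output earlier in the thread. It only works if the offsets are accurate, as they appear to be in that example.

```python
def original_entity_text(text, entity):
    """Return the entity exactly as typed, using its character offsets."""
    return text[entity["start"]:entity["end"]]

# Trimmed copy of the parse response shown earlier in the thread.
response = {
    "text": "navigate to ${webAddress}",
    "entities": [
        {"start": 12, "end": 25, "value": "$ { webaddress }", "entity": "url"}
    ],
}

for entity in response["entities"]:
    print(original_entity_text(response["text"], entity))  # ${webAddress}
```

This also keeps the original casing, which value loses.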