Entity extraction for hostnames with punctuation (< - | >)

Hi @akelad @Ghostvv @erohmensing

The problem with the Slack platform is that it sends the input hostname.net to slackclient as <http://hostname.net|hostname.net>.
I would like to extract <http://hostname.net|hostname.net> as a hostname entity and later convert it to hostname.net in my actions.py. My nlu.md contains the training data below, and my config.yml, also below, uses CRFEntityExtractor to extract entities.

But I am getting a misaligned entity error when running rasa interactive:

Misaligned entity annotation for 'http://abc.net abc.net' in sentence 'pls remove blackout for <http://abc.net|abc.net> please' with intent 'stop_blackout+host_or_db'. Make sure the start and end values of the annotated training examples end at token boundaries (e.g. don't include trailing whitespaces or punctuation).

config.yml

policies:
  - name: MemoizationPolicy
  - name: KerasPolicy
  - name: MappingPolicy
  - name: FormPolicy

pipeline: 
- name: "WhitespaceTokenizer"
  case_sensitive: false
- name: "RegexFeaturizer"
- name: "CRFEntityExtractor"
- name: "ner_synonyms"
- name: "intent_featurizer_count_vectors"
  "stop_words": ['pls','lol','hmm','uggh','ok','then']
- name: "intent_classifier_tensorflow_embedding"
  intent_tokenization_flag: true
  intent_split_symbol: "+"
  path: "./models/"
  data: "./data/"
- name: "ner_duckling_http"
  url: "http://localhost:8081"
  locale: "en"
  dimensions: ["time", "duration","unit-of-duration"]

nlu.md

## intent:host_or_db

- blackout for [<http://hostname1.net|hostname1.net>](hostname)
- for [<http://hostname2.net|hostname2.net>](hostname)
- for [<http://hostname3.net|hostname3.net>](hostname)

Can you please give me some insight into how I can solve this issue?

It seems like it would make more sense for this to be handled in the Slack input channel. It is similar to the problem described here: Entity extraction problem due to Slack adding link

That is, the Slack input channel should look for a pattern such as

<http://{something}|{something}>

and turn it back to {something} before it gets to Rasa. There’s already a similar one in the input channel that strips the user tags from the input. Would you be interested in working on this problem and PRing a fix?
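A minimal sketch of that sanitization step, assuming the Slack markup always has the form <http://target|label> or <https://target|label>. The names SLACK_LINK and strip_slack_links are hypothetical and only for illustration; this is not the code in Rasa's Slack channel, and the pattern may need adjusting for other link formats Slack produces.

```python
import re

# Hypothetical pattern: matches <http://target|label> or <https://target|label>
# and captures the label (the text after the pipe).
SLACK_LINK = re.compile(r"<https?://[^|>\s]+\|([^>\s]+)>")


def strip_slack_links(text: str) -> str:
    """Replace each Slack auto-link with its plain label."""
    return SLACK_LINK.sub(r"\1", text)


print(strip_slack_links("pls remove blackout for <http://abc.net|abc.net> please"))
# prints: pls remove blackout for abc.net please
```

With the message cleaned up like this before it reaches NLU, the training examples can annotate the plain hostname, e.g. [hostname1.net](hostname), so the entity should no longer start and end on the < and > characters.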

@erohmensing let me work on it. Will submit a PR once done

Awesome! Can you claim this issue on GitHub? improve slack sanitization · Issue #4418 · RasaHQ/rasa

@erohmensing here is the PR