Lookup table for a language that has no spaces to separate words

Hello, I am currently creating a chatbot for a language that doesn’t use spaces to separate words. The normal intent classifier and entity extractor work really well, since spaCy has a tokenizer for the language. However, when I tried using lookup tables to support the entity extractor, it didn’t work out of the box. If I recall correctly, there was an error because there were no spaces separating the entities, or something like that (the last time I tried was a few months ago). As a quick fix, I transformed all the training data to include spaces, using the same spaCy tokenizer that I use in the NLU pipeline, and trained with the lookup tables. That works well, but I then need to preprocess the input to insert spaces before passing it to the NLU model.

This time I want to use Rasa Core as well (currently I only use Rasa NLU). Since I’m planning to use Core from Python and create my own custom server, I can do something like this:

import spacy

nlp = spacy.load('lang')  # load the model once, not on every call

def preprocess(text):
    tokens = nlp.tokenizer(text)
    return ' '.join([t.text for t in tokens])

>>> agent = Agent.load('path/to/model.tar.gz', action_endpoint=action_endpoint)
>>> response = await agent.handle_text(text, message_preprocessor=preprocess)

However, this is quite troublesome when I want to do interactive learning or use rasa shell to test only the responses, since the preprocessing happens outside the pipeline. I have to paste each message into an open Python shell and copy the output of that function into the Rasa interactive/shell window.

My question is: how can I incorporate that preprocessor so that feature extraction with lookup tables works inside rasa interactive, shell, and run? If the answer is adding a custom pipeline component, that seems redundant with SpacyTokenizer, but having SpacyTokenizer alone doesn’t fix the issue.

RegexFeaturizer and our entity extractors operate on the token level, so if SpacyTokenizer splits the tokens, it should work.

What version and what config are you using?

There was an error when I used Rasa 1.4. Now I’m using Rasa 1.9.0; there is no error anymore, but it can’t extract entities correctly compared to preprocessing beforehand.

For example (I’ll write it in English): I want to extract a movie title from the sentence “AgeofUltronreleasedate”. Assuming the spaCy tokenizer can properly separate it into “Age of Ultron release date”, I put “AgeofUltron” into the lookup table but not into the training data, and I make sure the other movie titles in the training data also exist in the lookup table, so the model can learn the regex feature.

I did the above steps, but it cannot extract “AgeofUltron” as an entity. However, if I reformat the training data and lookup table to be tokenized beforehand (“Age of Ultron” instead of “AgeofUltron”), train again, and then preprocess the input as in my first comment, it extracts “Age of Ultron” correctly.
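To illustrate the failure mode, here’s a minimal standalone sketch (plain Python `re`, no Rasa needed). The pattern shape mirrors the word-boundary form that RegexFeaturizer’s _generate_lookup_regex builds from lookup table entries; the example strings are from my case above:

```python
import re

# Word-boundary pattern of the shape RegexFeaturizer builds from a
# lookup table entry: (?i)(\bentry\b)
pattern = re.compile(r"(?i)(\bAgeofUltron\b)")

# No match: 'n' and 'r' are both word characters, so there is no
# word boundary between "AgeofUltron" and "releasedate".
print(pattern.search("AgeofUltronreleasedate"))  # None

# Matches once the title is followed by a space (a word boundary).
print(pattern.search("AgeofUltron release date").group())  # AgeofUltron
```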

My config is:

- name: SpacyNLP
- name: SpacyTokenizer
- name: SpacyFeaturizer
- name: RegexFeaturizer
- name: CRFEntityExtractor
- name: DIETClassifier
  entity_recognition: False
  epochs: 50
- name: EntitySynonymMapper

I’m still using CRFEntityExtractor because, after testing, it gives better accuracy/F1 scores than DIETClassifier (in both the combined and separate intent settings) for the language I’m working on.

Please tell me if my explanation wasn’t clear enough.

I think the problem is in the lookup table. The text there should contain spaces.

Thanks for pointing that out. It seems to be kind of working, but another problem arises when I use two lookup tables (in my case, movie and person) for one NLU model. With my original approach, the model seems able to differentiate between the lookup tables, but if I only add spaces in the lookup tables and not in the training data and input, movies are always extracted as persons. I’m not sure whether RegexFeaturizer processes a space-separated sentence differently from the original sentence, but since you mentioned RegexFeaturizer works on the token level, there shouldn’t be any difference.

For context, I generated the training data with Chatette, and to try your solution I only removed the spaces in the template files, so apart from the space separation both sets of training data should be the same.

Another option I have is to pass the input to a movie-search custom action whenever the detected intent is movie-related.

I think the solution would be to process text in lookup table through the tokenizer in the nlu pipeline. Could you please create an issue for that? Would you like to work on the PR?

Will it cause any problems for other languages if the lookup table is processed through the tokenizer in the NLU pipeline? If not, I will try to create an issue, and I’ll probably look into the lookup table handling in the Rasa main code over the weekend. As for working on the PR, I don’t really have the confidence, but I’ll try to look into it.

I think there should be no problem

I’ve created an issue for this problem, but I’m not sure I have enough time to create a PR.

However, I’ve found an alternative solution that doesn’t modify the Rasa main code: convert the contents of the lookup tables into regex_features in the training data instead. I took the _generate_lookup_regex method from RegexFeaturizer, changed regex_string = "(?i)(\\b" + "\\b|\\b".join(elements_sanitized) + "\\b)" into regex_string = "(?i)(" + "|".join(elements_sanitized) + ")", processed the lookup tables before training, and put the generated lookup table patterns into the regex_features of the NLU training data. It works quite well.
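As a self-contained sketch of that change (illustrative names; I’m assuming the sanitization step is a plain re.escape, which may differ slightly from the real method):

```python
import re

def generate_lookup_regex(elements):
    # Variant of RegexFeaturizer._generate_lookup_regex without the \b
    # word boundaries, so entries can match inside unsegmented text.
    elements_sanitized = [re.escape(e) for e in elements]
    return "(?i)(" + "|".join(elements_sanitized) + ")"

movie_pattern = generate_lookup_regex(["AgeofUltron", "Avengers"])
print(re.search(movie_pattern, "AgeofUltronreleasedate").group())  # AgeofUltron
```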

This is because, even though RegexFeaturizer operates on the token level, the regex patterns still have to be matched against the original input string. That becomes a problem if we change the original string to include spaces, since the entity indices computed on the space-edited string would no longer match the original.
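A quick sketch of that index mismatch: a character span found on the space-inserted string points at the wrong characters in the original.

```python
raw = "AgeofUltronreleasedate"
spaced = "Age of Ultron release date"

# Entity span computed on the spaced string...
start = spaced.find("Age of Ultron")
end = start + len("Age of Ultron")
print(spaced[start:end])  # Age of Ultron

# ...does not line up with the original, unsegmented string.
print(raw[start:end])  # AgeofUltronre
```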