Multiple NLU files

Hello guys,

I have a big project with multiple nlu-*.md files. All these files live under the data directory.

I have built a custom tokenizer (for debugging purposes, it prints every detected token). What I have observed is that during training, only the words from the nlu.md file are being processed (i.e. printed). It seems like the content of the other two files doesn't go through my tokenizer.

I have also tried placing these files in both the data and data/nlu directories, but the result is the same.

Is there a way to have the tokenizer applied to all NLU files during training?

Note: intent classification works for all NLU data files.

Hi @Nikos, that sounds strange. Can you run the bot in debug mode (--debug) and paste the log output here? I think that might help. I assume you are running rasa train?

Hello @Tanja.

Indeed, I am running rasa train to train my model. These are the logs I get during training: logs rasa train (2.4 KB)

And these are the logs I get after running rasa run --debug: Rasa run --debug logs (5.1 KB)

My problem (if there is one) is during training.

Thanks a lot for your time :slight_smile:

The training log does not seem to be complete; can you take a look at that again? I am particularly interested in the first lines, where the training data is read. Thanks.

This file should be complete: logs rasa train (11.3 KB)

Looks like the data is read as expected. You also mentioned that intent classification considers all files, so reading the training data should not be the problem. Can you share the code of your custom tokenizer and your config.yml file? Thanks.

Here is my config file,

# Config (675 Bytes)

and my tokenizer,

# Tokenizer (1.6 KB)

Thank you for all your help, @Tanja.

@Nikos Not exactly sure why this happens. What Rasa version are you using? Here are a couple of things that I noticed:

In your config.yml (a corrected pipeline sketch follows this list):

  • The EntitySynonymMapper should always be placed at the end of your pipeline or at least after an entity extractor. Otherwise it will not do anything.
  • LexicalSyntacticFeaturizer does not have any token_pattern option.
  • The CountVectorsFeaturizer does not have the options intent_tokenization_flag and intent_split_symbol, those are only present for the tokenizers.
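
For illustration, a corrected pipeline could look roughly like this. This is only a minimal sketch, since your actual config.yml is not shown here; the component choices, the custom tokenizer's module path, and the epochs value are assumptions:

language: en
pipeline:
  - name: SpacyNLP
  - name: "custom_tokenizer.SpacyTokenizer"  # assumed module path for your custom tokenizer
    intent_tokenization_flag: false          # intent_* options belong on the tokenizer
    intent_split_symbol: "_"
  - name: LexicalSyntacticFeaturizer         # no token_pattern option here
  - name: CountVectorsFeaturizer             # no intent_* options here
  - name: DIETClassifier                     # classifies intents and extracts entities
    epochs: 100
  - name: EntitySynonymMapper                # placed after the entity extractor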

Your tokenizer itself:

I am not sure what Rasa version you are using, but if you are on one of the latest versions, I recommend that you only override the tokenize method in your tokenizer and don't override the train and process methods. Basically, you could do something like this:

import typing
from typing import Text, List, Type
import unicodedata

from rasa.nlu.tokenizers.tokenizer import Token, Tokenizer
from rasa.nlu.components import Component
from rasa.nlu.utils.spacy_utils import SpacyNLP
from rasa.nlu.training_data import Message

from rasa.nlu.constants import SPACY_DOCS

if typing.TYPE_CHECKING:
    from spacy.tokens.doc import Doc  # pytype: disable=import-error

class SpacyTokenizer(Tokenizer, Component):
    @classmethod
    def required_components(cls) -> List[Type[Component]]:
        return [SpacyNLP]

    defaults = {
        # Flag to check whether to split intents
        "intent_tokenization_flag": False,
        # Symbol on which intent should be split
        "intent_split_symbol": "_",
        # Regular expression to detect tokens
        "token_pattern": None,
    }

    def get_doc(self, message: Message, attribute: Text) -> "Doc":
        return message.get(SPACY_DOCS[attribute])

    def tokenize(self, message: Message, attribute: Text) -> List[Token]:
        doc = self.get_doc(message, attribute)

        list_of_tokens = []
        for t in doc:
            # Decompose each lemma into base characters plus combining marks
            # (NFKD), then drop the combining marks to strip accents.
            nfkd_form = unicodedata.normalize("NFKD", t.lemma_)
            cleaned_token = "".join(c for c in nfkd_form if not unicodedata.combining(c))
            list_of_tokens.append(Token(cleaned_token, t.idx))

        return list_of_tokens

Consider updating Rasa if you are on an older version.

As the tokenizer gets the same input as the intent classifier, I am not 100% sure why your tokenizer is not processing all data. How exactly did you recognize that your tokenizer is not processing all files?

Hello @Tanja,

I am using Rasa version 1.10.8. Thank you for your answer, I will definitely follow your instructions.

Regarding your question: I had an extra line of code in my tokenizer, printing every processed token. By inspecting the printed tokens, I saw that only tokens from the nlu.md file were getting printed.

That seemed strange to me… and this is why I started this thread.
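
Roughly, the debug version looked something like this (a minimal sketch, not my exact code; the class name and the printed format are made up for illustration):

from typing import Text, List

from rasa.nlu.constants import SPACY_DOCS
from rasa.nlu.tokenizers.tokenizer import Token, Tokenizer
from rasa.nlu.training_data import Message


class DebugSpacyTokenizer(Tokenizer):
    """Hypothetical sketch of a tokenizer with the extra debug line."""

    def tokenize(self, message: Message, attribute: Text) -> List[Token]:
        doc = message.get(SPACY_DOCS[attribute])
        tokens = []
        for t in doc:
            print(f"detected token: {t.text}")  # the extra debug line
            tokens.append(Token(t.text, t.idx))
        return tokens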

I just created a small dummy project to test this. Everything is working as expected. See the code attached. Can you maybe try again after you have updated your code and double-check? Thanks.

Archive.zip (4.4 KB)

May I know how to train the NLU, config, stories, and domain files with one command?