WhitespaceTokenizer ignored from pipeline

sourcedexter · April 17, 2022, 3:58am

I am trying to train an NLU model. Training was working fine, but it suddenly started throwing errors when I tried to train more models. I have been trying out various pipeline options to improve the NLU model.

Currently, it fails at the CountVectorsFeaturizer component. The error I am getting is:

AttributeError: 'CountVectorizer' object has no attribute 'vocabulary_'

I suspect the reason for this error is that the featurizer is not getting any tokens. The RegexFeaturizer and LexicalSyntacticFeaturizer seem to be working fine, as I can see the “Finished training Component” log for both of them.

I also noticed that when I run “rasa train nlu”, I don’t get a log message saying “training component: Whitespace Tokenizer”. So I am trying to figure out whether that is why the CountVectorsFeaturizer is failing. Any pointers on what I could try?

Here’s the pipeline config file that I am using:

language: "en"  # your two-letter language code

pipeline:
  - name: WhitespaceTokenizer
    intent_tokenization_flag: True
    intent_split_symbol: "+"
  - name: RegexFeaturizer
    case_sensitive: False
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "words"
    min_ngram: 1
    max_ngram: 3
  - name: DIETClassifier
    epochs: 50
    number_of_transformer_layers: 2
    intent_tokenization_flag: true
    intent_split_symbol: "+"
  - name: EntitySynonymMapper
  - name: RegexEntityExtractor
    # text will be processed with case insensitive as default
    case_sensitive: False
    # use lookup tables to extract entities
    use_lookup_tables: True
    # use regexes to extract entities
    use_regexes: False
    # use match word boundaries for lookup table
    use_word_boundaries: True
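
For context, this is my rough understanding of what the tokenizer settings at the top of the pipeline are meant to do, written as a simplified sketch in plain Python. It is not Rasa's implementation: the function names `tokenize_text` and `tokenize_intent` are made up for illustration, and the real tokenizer also handles punctuation and other details that I am ignoring here.

```python
from typing import List


def tokenize_text(text: str) -> List[str]:
    """Split a user message on whitespace (simplified: no punctuation handling)."""
    return text.split()


def tokenize_intent(
    intent: str, split_intents: bool = True, split_symbol: str = "+"
) -> List[str]:
    """Split a multi-intent label on the split symbol when the flag is set.

    The parameters loosely mirror intent_tokenization_flag and
    intent_split_symbol from the config; this mapping is an assumption.
    """
    if split_intents:
        # Drop empty parts so a stray leading or trailing symbol is ignored.
        return [part for part in intent.split(split_symbol) if part]
    return [intent]


print(tokenize_text("book a flight to Berlin"))
print(tokenize_intent("greet+ask_weather"))
```

If no tokens like these reach the featurizer, I would expect it to have nothing to build a vocabulary from, which is what I am trying to confirm.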

Rasa details:

rasa version: 3.1.0
Operating system: Ubuntu 20.04
Python Version: 3.8.10
