Featurizer for DIET

giorgibaghdavadze · April 3, 2020, 12:47pm

Hello, I have upgraded from an older version of Rasa to version 1.9.5.
I have the following issue: I wrote a custom featurizer component based on word2vec and sent2vec. Each sentence has a feature vector of length 576 (e.g. [0.12, … , 0.2]). In the previous version I called self._combine_with_existing_text_features to make the embedding_classifier classify sentences by their word2vec features, but the newer version of Rasa no longer has this method. It has two methods instead: _combine_with_existing_sparse_features and _combine_with_existing_dense_features. When I used _combine_with_existing_sparse_features to concatenate the sparse (or dense) features, I got the following error:

features = self._combine_with_existing_sparse_features(message, optio_features, 'text_sparse_features')
File "/home/optio/rasaprojects/rasaenv/lib/python3.6/site-packages/rasa/nlu/featurizers/featurizer.py", line 89, in _combine_with_existing_sparse_features
if message.get(feature_name).shape[0] != additional_features.shape[0]:
AttributeError: 'list' object has no attribute 'shape'

It seems that this method does not accept my 576-dimensional feature vector.

Can you help me make the DIET classifier use my word2vec features for classification? Or can you share some examples of how to write a custom featurizer for the DIET classifier?

vitalyuf · April 27, 2020, 7:24am

I also have the same problem. Have you solved it?

vitalyuf · April 29, 2020, 7:14am

I solved the problem.

It happened because “Prior to DIET, Rasa’s NLU pipeline used a bag of words model where there was one feature vector per user message.”

For DIET, however, you should provide a separate vector for each token of a user message, following the tokenization produced by the tokenizer component in the same pipeline.

Tanja · May 6, 2020, 9:12am

@giorgibaghdavadze We divided our featurizers into sparse and dense featurizers (see docs). Your word2vec featurizer falls into the category of dense featurizers, so you should use the method _combine_with_existing_dense_features. We also introduced a __CLS__ token: all tokenizers add an additional token __CLS__ to the end of the list of tokens when tokenizing text and responses. Make sure to account for that in your featurizer. Also, as mentioned by vitalyuf, each featurizer returns a sequence of features, i.e. one feature vector per token. Did you consider that? If you still run into issues, could you share the code of your custom component?
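The shape requirement described above can be sketched with plain NumPy. This is a minimal illustration, not Rasa code: `featurize_tokens` and the `vectors` lookup dict are hypothetical names, the token list is assumed to exclude `__CLS__`, and mean pooling for the `__CLS__` row is an assumption (the SpacyFeaturizer fragment quoted later in this thread uses a configurable pooling operation).

```python
import numpy as np


def featurize_tokens(tokens, vectors, dim):
    """Return a (len(tokens) + 1, dim) array: one row per token, plus a final __CLS__ row.

    `vectors` is a hypothetical word -> vector lookup (e.g. loaded from word2vec).
    Unknown words fall back to a zero vector.
    """
    rows = [vectors.get(token, np.zeros(dim)) for token in tokens]
    token_matrix = np.array(rows, dtype=float).reshape(len(tokens), dim)
    if len(tokens) > 0:
        # Assumption: the __CLS__ vector is the mean of the token vectors.
        cls_row = token_matrix.mean(axis=0, keepdims=True)
    else:
        cls_row = np.zeros((1, dim))
    return np.concatenate([token_matrix, cls_row])


toy_vectors = {"hello": np.array([1.0, 0.0]), "world": np.array([0.0, 1.0])}
features = featurize_tokens(["hello", "world"], toy_vectors, dim=2)
print(features.shape)
```

The resulting 2-dimensional array is the kind of input that would then be passed to `_combine_with_existing_dense_features` inside the custom component.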

giorgibaghdavadze · May 7, 2020, 9:32pm

Hi, thank you for your response. Actually, I don’t have a feature vector per token; I have a single feature vector for the whole sentence (576 dimensions). I only want to concatenate it with the existing dense features. When I used features = self._combine_with_existing_dense_features( message, features, DENSE_FEATURE_NAMES[TEXT] ) to combine my features with the existing dense features, the DIET classifier raised the following error:

File "/home/optio/rasaprojects/rasaenv/lib/python3.6/site-packages/rasa/nlu/classifiers/diet_classifier.py", line 383, in _extract_features
f"Sequence dimensions for sparse and dense features "
ValueError: Sequence dimensions for sparse and dense features don't coincide in 'ბარათი დამებლოკა და რა ვქნა?' for attribute 'text'.

This is my code:

features = np.array(custom_features)
features = self._combine_with_existing_dense_features(
    message, features, DENSE_FEATURE_NAMES[TEXT]
)
message.set(DENSE_FEATURE_NAMES[TEXT], features)

where custom_features is a list of real numbers (e.g. [1.123, 2.23, … , 123.23], length 576).

This is a fragment of the code in spacy_featurizer.py:

features = self._features_for_doc(message_attribute_doc)
cls_token_vec = self._calculate_cls_vector(features, self.pooling_operation)
features = np.concatenate([features, cls_token_vec])
features = self._combine_with_existing_dense_features(
    message, features, DENSE_FEATURE_NAMES[attribute]
)

Tanja · May 11, 2020, 6:27am

The resulting features of the SpacyFeaturizer are 2-dimensional: length of sequence x size of features. E.g. if your sentence contains 3 tokens, we append the __CLS__ token, so that it has a total sequence length of 4. The dense features from the SpacyFeaturizer then have a dimensionality of 4x300, as spaCy is using GloVe as feature vectors.

You mentioned that you have just a feature vector for the complete message. Thus, you try to concatenate a 1-dimensional array with a 2-dimensional array. To fix that you need to create feature vectors for every token.
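The mismatch can be reproduced outside Rasa with NumPy alone. In this sketch `combine_dense` is a hypothetical stand-in for the combine step, assumed to join arrays along the last (feature) axis; the shapes follow the 4x300 example above and the 576-dimensional vector from the question.

```python
import numpy as np


def combine_dense(existing, additional):
    # Hypothetical stand-in for the combine step: join along the feature axis.
    return np.concatenate((existing, additional), axis=-1)


existing = np.zeros((4, 300))    # 3 tokens + __CLS__, 300-dim vectors (2-D)
sentence_vector = np.zeros(576)  # one vector for the whole message (1-D)
per_token = np.zeros((4, 576))   # one 576-dim vector per token (2-D)

try:
    combine_dense(existing, sentence_vector)
    mismatch_error = None
except ValueError as err:
    # NumPy refuses to concatenate a 2-D array with a 1-D array.
    mismatch_error = err

combined = combine_dense(existing, per_token)
print(combined.shape)
```

With one row per token, the two arrays share the same sequence length and only the feature dimension grows.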

giorgibaghdavadze · May 14, 2020, 8:00am

If I don’t want to have one vector per token, is there a way to feed it into the DIET classifier?

Tanja · May 15, 2020, 12:00pm

@giorgibaghdavadze Not sure what you mean. You need to have a vector per token, otherwise training DIETClassifier will fail.
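One possible way to satisfy the per-token requirement while starting from a single sentence embedding is to repeat that embedding for every position in the sequence. This is only a sketch of the shape manipulation and not something suggested in this thread or confirmed to train well with DIET; `broadcast_sentence_vector` is a hypothetical helper, and `num_tokens` is assumed to exclude the `__CLS__` token.

```python
import numpy as np


def broadcast_sentence_vector(sentence_vector, num_tokens):
    """Repeat one sentence embedding for every token plus the __CLS__ position.

    Returns an array of shape (num_tokens + 1, len(sentence_vector)).
    """
    seq_len = num_tokens + 1  # assumption: one extra row for __CLS__
    vector = np.asarray(sentence_vector, dtype=float)
    return np.tile(vector, (seq_len, 1))


sentence_embedding = np.linspace(0.0, 1.0, 576)  # placeholder 576-dim vector
token_features = broadcast_sentence_vector(sentence_embedding, num_tokens=3)
print(token_features.shape)
```

Every row carries the same information, so this only makes the shapes line up; genuine per-token word2vec vectors would give the classifier more to work with.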
