Semantic Hashing with DIETClassifier

Hi! I am trying to understand best practices for the DIETClassifier. For simplicity, I will only consider intent classification, not NER. Here is my config:

```yaml
- name: WhitespaceTokenizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 100
  constrain_similarities: true
```

As you can see, I followed the recommended defaults and added two CountVectorsFeaturizers: one at the word level and another at the character level. The inclusion of character n-grams should lead to a better approximation of out-of-vocabulary words and also add robustness against misspellings at inference time.
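To illustrate what I mean, here is a minimal sketch using scikit-learn directly (not Rasa internals), with the same analyzer settings as the second featurizer above. A typo still shares most of its character n-grams with the correct word:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same analyzer settings as the char-level CountVectorsFeaturizer above.
vec = CountVectorizer(analyzer="char_wb", ngram_range=(1, 4))
analyze = vec.build_analyzer()

correct = set(analyze("goodbye"))
typo = set(analyze("goodbey"))  # a misspelling at inference time

shared = correct & typo
print(f"shared n-grams: {len(shared)} of {len(correct | typo)}")
# Most n-grams survive the typo, so the two sparse vectors stay close.
```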

@koaning I would love to see a video lecture on Semantic Hashing. I also have a couple of questions:

  • Are the features from the two CountVectorsFeaturizers simply concatenated?
  • How does the positional encoding work with this?

Thanks!

I’m not 100% familiar with the term “semantic hashing”, but after googling it, it sounds like it’s essentially what most embedding models do: they “hash” tokens to a numeric representation where the distance between the hashed tokens is a proxy for similarity.

To answer your questions:

  1. Yes, they are simply concatenated. All sparse features are concatenated together, and so are all dense features. The diagram below (from this blogpost) shows this nicely; there is also a small code sketch of the concatenation after this list.

  2. To clarify, you’re talking about the positional encoding in the transformer layer of DIET? If so, it’s worth pointing out that the effect of the positional encoding isn’t that great when we’re dealing with short texts. Second, the positional encoding applies to the tokens going in. It’s effectively just “a vector that we add depending on whether the token is the 1st or the nth token in the utterance” (see the second sketch below for the classic recipe). Feel free to ask for more details, but just to check, have you seen this part of our series on the attention mechanism?
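To make the concatenation concrete, here is a minimal sketch using scikit-learn and scipy directly, so not DIET’s actual code; the two vectorizers stand in for the two CountVectorsFeaturizers in your pipeline:

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

texts = ["hello there", "good morning"]

# Stand-ins for the word-level and char-level CountVectorsFeaturizers.
word_vec = CountVectorizer(analyzer="word").fit(texts)
char_vec = CountVectorizer(analyzer="char_wb", ngram_range=(1, 4)).fit(texts)

word_feats = word_vec.transform(texts)  # sparse, (2, n_word_features)
char_feats = char_vec.transform(texts)  # sparse, (2, n_char_features)

# The two sparse matrices are simply stacked side by side.
combined = sp.hstack([word_feats, char_feats])
print(combined.shape)  # (2, n_word_features + n_char_features)
```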
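And here is a rough sketch of the classic sinusoidal positional encoding from the original transformer paper. DIET’s exact implementation may differ, but the idea of adding a position-dependent vector to each token embedding is the same:

```python
import numpy as np

def positional_encoding(seq_len: int, dim: int) -> np.ndarray:
    """Classic sinusoidal encoding: one fixed vector per position."""
    positions = np.arange(seq_len)[:, None]                       # (seq_len, 1)
    freqs = np.exp(-np.log(10000.0) * (np.arange(0, dim, 2) / dim))
    enc = np.zeros((seq_len, dim))
    enc[:, 0::2] = np.sin(positions * freqs)                      # even dims
    enc[:, 1::2] = np.cos(positions * freqs)                      # odd dims
    return enc

# Each token's embedding gets the vector for its position added to it.
token_embeddings = np.random.randn(5, 16)  # 5 tokens, embedding dim 16
with_positions = token_embeddings + positional_encoding(5, 16)
```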

Thanks, @koaning! Very helpful. I actually got the term Semantic Hashing and the corresponding paper directly from the Rasa code: rasa/count_vectors_featurizer.py at 41e3b227101e6ace3f85c2d99a7f48f4528a8b93 · RasaHQ/rasa · GitHub
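For anyone who lands here later: the subword semantic hashing trick that paper describes boils down to roughly the following. This is a rough sketch; the exact boundary marker and n-gram size are my assumptions, so check the paper for specifics:

```python
def semantic_hash(token: str, n: int = 3) -> list[str]:
    # Wrap the token in boundary markers, then emit character n-grams.
    # These subword pieces act as the "hash" features for counting.
    padded = f"#{token}#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(semantic_hash("hello"))
# ['#he', 'hel', 'ell', 'llo', 'lo#']
```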