Hey there everybody!
I’m trying to get a better understanding of the pipeline in general and of how the DIETClassifier works. I consider myself to have a good basic knowledge, but I need some help to solidify my understanding of a few concepts.
I’ve already read through a lot of resources: the Rasa Masterclass ebook, the documentation, plenty of Rasa YouTube videos (both the Masterclass and the Algorithm Whiteboard series), and I also had a look at the DIETClassifier paper. But there are some aspects I’m having a hard time wrapping my head around, so I thought I’d ask the community.
I will try to state some facts about an example pipeline and then ask the associated questions, to help clarify the concepts. Thanks!
The pipeline contains multiple components that process data sequentially. Every utterance is processed by these components during training as well as in production. The chatbot learns how to classify intents and extract entities during the training phase, and afterwards uses the learned model to do exactly that. So far so good. Let’s consider the pipeline described in the documentation @ Sensible Starting Pipelines:
```yml
- name: SpacyNLP
- name: SpacyTokenizer
- name: SpacyFeaturizer
- name: RegexFeaturizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
- name: DIETClassifier
- name: EntitySynonymMapper
- name: ResponseSelector
```
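To make the “components process data sequentially” idea concrete, here is a toy sketch. This is *not* Rasa’s actual API — the component functions, the message dict, the tiny vocabulary, and the hard-coded “classifier” are all made-up stand-ins — it only illustrates how each component in the list receives the message, adds its piece, and passes it on:

```python
# Toy sketch of sequential pipeline processing -- NOT Rasa's real API.
# Each "component" takes the message dict, adds its piece, and passes it on.

def tokenizer(message):
    # Stand-in for a tokenizer: naive whitespace split.
    message["tokens"] = message["text"].lower().split()
    return message

def count_featurizer(message):
    # Stand-in for CountVectorsFeaturizer: bag-of-words counts over a tiny vocab.
    vocab = ["book", "a", "flight", "hotel"]
    message["features"] = [message["tokens"].count(w) for w in vocab]
    return message

def classifier(message):
    # Stand-in for DIETClassifier: a hard-coded rule instead of a trained model.
    message["intent"] = "book_flight" if message["features"][2] > 0 else "other"
    return message

pipeline = [tokenizer, count_featurizer, classifier]

def process(text):
    message = {"text": text}
    for component in pipeline:  # components run in the order they are listed
        message = component(message)
    return message

print(process("Book a flight")["intent"])  # -> book_flight
```

The key point the sketch tries to show: later components (the classifier) consume what earlier components (tokenizer, featurizer) attached to the message, which is why the ordering in the pipeline matters.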
In the first step we load the pre-trained embeddings of a spaCy-supported model. This allows our model to get a certain “feeling” for a language without us having to define a lot of training data ourselves. The tokenizer splits utterances into tokens, which the featurizers convert into numerical feature vectors that try to represent the information of each word (dense vectors in the case of the SpacyFeaturizer, sparse ones for the count- and regex-based featurizers). These features are then fed into our DIETClassifier, which is trained on them to classify intents and extract entities.
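Here is a minimal sketch of what the dense-featurization step amounts to. The embedding table below is entirely made up for illustration — in reality the SpacyFeaturizer looks each token up in the loaded spaCy model’s pretrained vector table:

```python
# Toy dense featurizer: map tokens to fixed dense vectors.
# The 3-dimensional embeddings here are invented; real spaCy vectors
# have hundreds of dimensions and come from the pretrained model.
EMBEDDINGS = {
    "book":   [0.2, -0.1, 0.7],
    "flight": [0.5,  0.3, -0.2],
}
UNKNOWN = [0.0, 0.0, 0.0]  # fallback for out-of-vocabulary tokens

def featurize(tokens):
    """Return one dense vector per token."""
    return [EMBEDDINGS.get(t, UNKNOWN) for t in tokens]

print(featurize(["book", "flight", "xyz"]))
# -> [[0.2, -0.1, 0.7], [0.5, 0.3, -0.2], [0.0, 0.0, 0.0]]
```

Because the vectors are pretrained on large corpora, similar words end up with similar vectors — that is the “feeling for the language” the model inherits without extra training data.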
I hope what I stated so far is correct. Now to my further questions:
What happens to my own training data within the pipeline, in contrast to the data the spaCy language model provides? Is all the training data just thrown together and then fed through the pipeline, or do the different kinds of training data somehow use different components?
Do multiple featurizers mean that the data is being shaped further down the road (since it’s a pipeline), or do they create separate features? The documentation also states that the SpacyFeaturizer provides pre-trained word embeddings from GloVe or fastText… how do I know which one is being used? I couldn’t find further information on this.
Now, in the paper and the associated YouTube video it is stated that the DIETClassifier can use pretrained embeddings such as BERT, GloVe or ConveRT during training. Is spaCy used for this in our case, or are these two things not to be confused with each other? How does this relate to the SpacyFeaturizer?
I think looking through all these resources has left me more confused rather than with a true understanding. I’m grateful to anyone who helps me connect the dots.