Why does incorporating spaCy decrease model performance?

I am building an FAQ response selector model with about 200 intents. The bot answers questions regarding medication usage, general information about the medication, and the general abilities of the chatbot.

I developed two pipelines:

spaCy

recipe: default.v1
language: en

pipeline:
# # No configuration for the NLU pipeline was provided. The following default pipeline was used to train your model.
# # If you'd like to customize it, uncomment and adjust the pipeline.
# # See https://rasa.com/docs/rasa/tuning-your-model for more information.
  - name: SpacyNLP
    model: en_core_web_md
  - name: SpacyTokenizer
  - name: SpacyFeaturizer
  - name: RegexFeaturizer
  # - name: RegexEntityExtractor #uncomment when model done -- very slow while training 
  # - name: LanguageModelFeaturizer
  #   model_name: "xlnet"
  #   model_weights: "xlnet-base-cased"
  #   cache_dir: null
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4    
  - name: "DucklingEntityExtractor"
    # url of the running duckling server
    url: "http://localhost:8000"
    # dimensions to extract
    dimensions: ["time","email","number"]
    # allows you to configure the locale, by default the language is
    # used
    #locale: "de_DE"
    # if not set the default timezone of Duckling is going to be used
    # needed to calculate dates from relative expressions like "tomorrow"
    timezone: "America/New_York"
    # Timeout for receiving response from http url of the running duckling server
    # if not set the default timeout of duckling http url is set to 3 seconds.
    timeout: 20
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector   
    epochs: 100
    retrieval_intent: faq
  - name: FallbackClassifier
    threshold: 0.7
    ambiguity_threshold: 0.1

policies:
- name: AugmentedMemoizationPolicy
- name: TEDPolicy
  epochs: 40
- name: RulePolicy

No spaCy (baseline)

recipe: default.v1
language: en

pipeline:
# # No configuration for the NLU pipeline was provided. The following default pipeline was used to train your model.
# # If you'd like to customize it, uncomment and adjust the pipeline.
# # See https://rasa.com/docs/rasa/tuning-your-model for more information.
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  # - name: RegexEntityExtractor #uncomment when model done -- very slow while training 
  # - name: LanguageModelFeaturizer
  #   model_name: "xlnet"
  #   model_weights: "xlnet-base-cased"
  #   cache_dir: null
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4    
  - name: "DucklingEntityExtractor"
    # url of the running duckling server
    url: "http://localhost:8000"
    # dimensions to extract
    dimensions: ["time","email","number"]
    # allows you to configure the locale, by default the language is
    # used
    #locale: "de_DE"
    # if not set the default timezone of Duckling is going to be used
    # needed to calculate dates from relative expressions like "tomorrow"
    timezone: "America/New_York"
    # Timeout for receiving response from http url of the running duckling server
    # if not set the default timeout of duckling http url is set to 3 seconds.
    timeout: 20
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector   
    epochs: 100
    retrieval_intent: faq
  - name: FallbackClassifier
    threshold: 0.7
    ambiguity_threshold: 0.1

policies:
- name: AugmentedMemoizationPolicy
- name: TEDPolicy
  epochs: 40
- name: RulePolicy

The only difference between the two is that the first pipeline incorporates spaCy's pre-trained embeddings. However, I've noticed that with spaCy the model does a much worse job of classifying the correct intent.
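
For reference, the only extra thing the spaCy pipeline contributes is dense features built from en_core_web_md's static word vectors: one vector per token plus an averaged vector for the whole utterance, which SpacyFeaturizer hands to DIET and the ResponseSelector alongside the sparse features. A minimal sketch to confirm the vectors are actually there and to see their shape (the utterance is just an illustration):

import spacy

# en_core_web_md ships static word vectors (typically 300-dimensional);
# SpacyFeaturizer turns them into dense features for DIET / ResponseSelector.
nlp = spacy.load("en_core_web_md")
print(nlp.vocab.vectors.shape)  # (number of stored vectors, vector dimension)

doc = nlp("what is the difference between you and googling")
print(doc.vector.shape)                        # sentence vector = average of token vectors
print([(t.text, t.has_vector) for t in doc])   # tokens without a vector fall back to all zeros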

For example, I have an intent faq/googling where all the training examples relate to Google. When I ask "what is the difference between you and googling", the bot gives an incorrect response. When I remove spaCy, the response is correct.

Why does the model perform worse with spaCy added? It doesn't make sense to me: I only have one intent about googling, and the training examples under that intent vary enough to help the model generalize while still clearly relating to googling.
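
One thing I can check is how similar the failing query looks to utterances from other intents once everything is reduced to averaged word vectors, since that is roughly the sentence-level feature SpacyFeaturizer produces. A quick sketch, where the comparison utterances are made-up placeholders rather than my real training examples:

import spacy

nlp = spacy.load("en_core_web_md")
query = nlp("what is the difference between you and googling")

# Placeholder utterances standing in for examples from *other* intents.
candidates = [
    "what is the main ingredient",
    "what are the side effects",
    "will my cravings come back",
    "how is this different from searching google",
]

# Doc.similarity is cosine similarity of the averaged word vectors,
# close to what SpacyFeaturizer hands DIET as the sentence-level feature.
for text in candidates:
    print(f"{text!r:50} {query.similarity(nlp(text)):.3f}")

If these scores come out high and close together, that would at least be consistent with the dense spaCy features washing out the word-level signal that the character n-gram CountVectorsFeaturizer otherwise picks up.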

I also looked at the confidence ranking output, and the confidence for every intent is close to zero.

What does this all mean?

      "ranking": [
        {
          "confidence": 0.02295936644077301,
          "intent_response_key": "faq/main_ingredient"
        },
        {
          "confidence": 0.021990343928337097,
          "intent_response_key": "faq/will_cravings_return"
        },
        {
          "confidence": 0.02169930748641491,
          "intent_response_key": "faq/side_effects"

@stephens Any chance you know why my model is performing so poorly?