How does a spaCy language model component improve performance?

Hi, sorry if this is a trivial question (I’m still a Rasa “newbie”).

I’m experimenting with Rasa 2.8 for conversational apps in Italian. As suggested in the Rasa documentation, I added the SpacyNLP component with an Italian language model. Here is my config.yml:

language: it

pipeline:

  # pip3 install rasa[spacy]
  # python3 -m spacy download it_core_news_sm
  # python3 -m spacy download it_core_news_lg
  - name: "SpacyNLP"
    # language model to load
    # italian large model: it_core_news_lg
    # italian small model: it_core_news_sm
    model: "it_core_news_sm"
    case_sensitive: false

  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4

  - name: DIETClassifier
    epochs: 100
    constrain_similarities: true

  - name: EntitySynonymMapper

  - name: ResponseSelector
    epochs: 100
    constrain_similarities: true

  - name: FallbackClassifier
    threshold: 0.3
    ambiguity_threshold: 0.1


policies:

  - name: MemoizationPolicy

  - name: RulePolicy
    core_fallback_threshold: 0.4
    core_fallback_action_name: "action_default_fallback"
    enable_fallback_prediction: True

  - name: TEDPolicy
    max_history: 10
    epochs: 100
    constrain_similarities: true 

Now, what is not clear to me is how this component improves Rasa NLU. Reading Vincent’s recent article “Non English Tools for Rasa”, I understand there are three possible helpers: tokenizer, featurizer, and entity extractor. I can broadly see the added value of the tokenizer and the entity extractor (how the featurizer helps is less clear to me). Anyway, I have three questions:

Q1. Is there any practical example or “benchmark” showing how much an external model (spaCy or others) helps Rasa NLU work better? Is there an article that goes deeper into this topic?

Q2. About spaCy NER: I know that with the SpacyNLP component in the pipeline I can get spaCy entities under spaCy’s own labels. For example, spaCy detects my first name and surname as a PERSON entity when I check the sentence: mi chiamo Giorgio Robino e vivo a Genova in Italia.

Now, if I test the same sentence in one of my Rasa chatbots, I don’t see the expected PERSON entity:

$ rasa shell nlu
2021-08-24 15:52:07 INFO     rasa.model  - Loading model models/20210823-171315.tar.gz...
2021-08-24 15:52:24 INFO     rasa.nlu.components  - Added 'SpacyNLP' to component cache. Key 'SpacyNLP-it_core_news_sm'.
NLU model loaded. Type a message and press enter to parse it.
Next message:
Mi chiamo Giorgio Robino e vivo a Genova, in Italia
{
  "text": "Mi chiamo Giorgio Robino e vivo a Genova, in Italia",
  "intent": {
    "id": 5671084092719348945,
    "name": "goodbye",
    "confidence": 0.5075552463531494
  },
  "entities": [
    {
      "entity": "oxygen_saturation",
      "start": 25,
      "end": 26,
      "confidence_entity": 0.9176041483879089,
      "value": "e",
      "extractor": "DIETClassifier"
    }
  ],

Is it because I have to add a PERSON entity, and at least one intent containing that entity, to my training data? Or am I missing something in my configuration file?

Q3. A related question: can I augment the entities from spaCy (or another external entity extractor) with entities defined in my Rasa training data? For example, suppose I would like to extend the spaCy PERSON entity set with other application-defined names (say, extending Italian names with Arabic or English names). How would I do that? What if I define in Rasa an entity lookup table named PERSON? Would I get the union of the spaCy PERSON set and the Rasa PERSON set?

Thanks

Ah, so a few things!

  1. It’s best to use it_core_news_md or it_core_news_lg instead of it_core_news_sm. I’m comparing apples to pears a bit here because I’m basing this on my experience with Dutch and English, but I’ve always noticed that the vectors in the small model usually don’t cover enough of the language. This is also listed as a reason in the NLU blogpost. When you use spaCy I think you don’t need to add an example in your training data but I’m not 100% sure.
  2. In terms of benchmarks, I’ve run some myself, but I should stress that the benchmarks I ran might not be relevant for your situation. I’ve never run them against an Italian assistant and I’ve mainly run my tools against the Rasa datasets, which may also differ from yours. What I can share: in my experience thus far, the vectors mainly seem to cause an uplift in DIET’s entity detection. The intents don’t really gain much of an uplift.
  3. Yeah name detection is super hard in general. I wrote a big blogpost and made an algorithm whiteboard video to explain why. There is a solution that I mention at the end which involves just doing a name-lookup instead. If that’s a path you’d like to consider, you may enjoy the baby-name-lists mentioned here.
  4. If you’re interested in using a lookup-based approach, you may enjoy the trick explained here. There’s an experimental Rasa component for that trick available here.

I’m also adding a link to this blogpost on our NLU pipeline. It might explain how the moving parts connect in more detail.
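
To make the point about vectors concrete: `SpacyNLP` on its own only loads the model, and in the config posted above nothing after it consumes that model. A minimal sketch of an NLU pipeline that does, assuming Rasa 2.x component names and that `it_core_news_md` has been downloaded (the medium model and the `pooling` value are assumptions here, not settings tested on Italian data):

```yaml
language: it

pipeline:
  # Loads the spaCy model; by itself it adds no features and no entities.
  - name: SpacyNLP
    model: "it_core_news_md"
    case_sensitive: false
  # Uses spaCy's tokenization instead of splitting on whitespace.
  - name: SpacyTokenizer
  # Feeds the pretrained word vectors to DIET as dense features.
  - name: SpacyFeaturizer
    pooling: mean
  # Sparse featurizers kept as in the original config.
  - name: RegexFeaturizer
  - name: LexicalSyntacticFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
    constrain_similarities: true
```

Comparing `rasa test nlu` results with and without the two spaCy components on your own data is probably the most relevant benchmark you can get.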

Thanks a lot Vincent, for all the useful info/links!

Regarding the first-name lookup table, I opened a pull request adding some open-data common Italian first names to the GitHub repo: https://github.com/RasaHQ/rasa-nlu-examples/tree/main/data/namelists. Maybe it could be useful to Rasa devs: Italian names by solyarisoftware · Pull Request #144 · RasaHQ/rasa-nlu-examples · GitHub

Just out of curiosity: it seems to me that the list of Italian family names is much bigger than the list of first names! I’m discovering an explosion of variants for every word that comes to mind :) Here is an (incomplete) list of ~600000 Italian surnames: https://github.com/napolux/paroleitaliane/blob/master/paroleitaliane/lista_cognomi.txt.

About Q2, my question was why Giorgio Robino is not detected as the spaCy entity PERSON, but maybe it is just because I did not explicitly add at least one PERSON entity example to my intents. I’ll double check.

The “Approach 2: NameLists” solution you propose in your article (Proper Name Detection | The Rasa Blog | Rasa) helped me understand how to do it. Thanks! Still, the more I think about it, the more I am convinced that the third approach, using interactive confirmation (you call it “UI”), is probably my preferred solution.

BTW, just a note: IMHO the Rasa docs lack details and descriptions of the spaCy component attributes. For example, I discovered the dimensions attribute only because your article mentions it:
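
For reference, a minimal sketch of how the name-list approach could look in Rasa 2.x, using a lookup table plus `RegexEntityExtractor`. The file name, the `inform_name` intent, the `person` entity name, and the handful of names are all hypothetical placeholders. As far as I understand, the extractor only applies a lookup table whose name matches an entity that is also annotated in the training examples, which is why two annotated sentences are included:

```yaml
# --- data/nlu.yml (hypothetical training data) ---
version: "2.0"
nlu:
  - intent: inform_name
    examples: |
      - mi chiamo [Giorgio](person)
      - il mio nome è [Maria](person)
  # Lookup table: its name must match the entity name used above.
  - lookup: person
    examples: |
      - Giorgio
      - Maria
      - Ahmed
      - John
---
# --- config.yml (only the relevant pipeline entry) ---
pipeline:
  - name: RegexEntityExtractor
    case_sensitive: false
    use_lookup_tables: true
    use_regexes: false
```

Because the list is just training data, extending Italian names with Arabic or English ones means appending entries to the lookup table.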

- name: SpacyEntityExtractor
  dimensions: ["PERSON"]
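
A slightly fuller sketch of where that entry sits, assuming Rasa 2.x: `SpacyEntityExtractor` needs `SpacyNLP` earlier in the pipeline and returns the labels of whichever spaCy model is loaded. One assumption worth checking on the model's page: as far as I know, the Italian `it_core_news_*` models label people as `PER` rather than `PERSON` (the label used by the English models), in which case `dimensions: ["PERSON"]` would match nothing for Italian.

```yaml
pipeline:
  - name: SpacyNLP
    model: "it_core_news_sm"
  # ... tokenizer, featurizers and DIETClassifier as in your own config ...
  - name: SpacyEntityExtractor
    # Keep only the listed spaCy labels; omit `dimensions` to keep all of them.
    # "PER" is assumed to be the person label of the Italian models.
    dimensions: ["PER"]
```

Since the spaCy NER is pretrained, entities found this way should show up in the parse output with `"extractor": "SpacyEntityExtractor"`, without annotated examples in the training data.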