Is it possible to define entity values upfront, instead of learning them from examples?

I know you can annotate words in intent examples with an entity, e.g.:

- intent: add.drink
  examples: |
    - can I get a [small](drink.size) [latte](drink.type)

But adding (at least) 1 example for every value of every entity in my database would make the NLU file huge!

Is it possible to define entity values elsewhere, thereby making the NLU file (c)leaner?

Things I have tried, to no avail:

Lookup tables:

- lookup: drink.size
  examples: |
    - small
    - medium
    - large

Categorical slots:

slots:
  drink.size:
    type: categorical
    influence_conversation: true
    values:
    - small
    - medium
    - large

My pipeline:

pipeline:
  - name: WhitespaceTokenizer
  - name: RegexFeaturizer
  - name: DIETClassifier
    epochs: 100
    constrain_similarities: true
  - name: RegexEntityExtractor
    case_sensitive: false
    use_lookup_tables: true
    use_regexes: true
    use_word_boundaries: true

How did you implement lookup tables and categorical slots?


As above really. Lookup tables in data/nlu.yml and categorical slots in domain.yml.


If your goal is to make the NLU files smaller/cleaner, you could implement your own custom slot type in Python (see: https://rasa.com/docs/rasa/domain#custom-slot-types). That would let you load a lookup table from a text file instead of dumping the whole thing in your domain file.

Alternatively, you can break your domain file down into several:

The domain can be defined as a single YAML file or split across multiple files in a directory. (See Domain)

Finally, you do not really need to set at least one example for every value. Most likely using a pretrained language model for entity extraction will help pick these small nuances up. Another thing you could do to avoid having examples for every single thing sprinkled across your training data is to use synonyms (see here: https://rasa.com/docs/rasa/nlu-training-data#synonyms).


Thanks for all the ideas. Much appreciated! Let me comment on each:

you could implement your own custom slot type (see: Domain)

This looks like a cool feature, but also too cool (aka overkill) for what I’m trying to achieve. All I really need is that Rasa matches something from a fixed set of possibilities, which is why the lookup table approach without custom code, i.e. only via yaml files, seems more attractive: it’s simpler to understand/maintain and should do the trick.

Alternatively, you can break your domain file down into several:

I suppose you meant to break down my “nlu” file? That’s where (I believe) one would annotate tokens with entity types. The size of the file is a concern, but I am actually more concerned with having to make up (or repeat) intent examples, just to squeeze in a new value for an entity type.

Most likely using a pretrained language model for entity extraction will help pick these small nuances up.

I am building a system for a real client, who is not very keen on error margins =) Which is why (again) a lookup table approach is more desirable: it guarantees matching. An LM would be an interesting approach, I’d think, for wrong spellings/transcriptions, but maybe fuzzy matching takes care of that? In any case, wrong spellings/transcriptions are a problem for v2.0 =)

Another thing you could do to avoid having examples for every single thing sprinkled across your training data is to use synonyms

Ah, better not =) I do need synonyms but in the true sense of the word, not as a workaround. My list of entities will have a, b, c… which in turn should map to their respective synonyms, a1, a2, a3, b1, b2, b3…
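
A minimal sketch of how that mapping might look in nlu.yml. The surface forms "tiny" and "little" are invented for illustration, and the sketch assumes EntitySynonymMapper is listed after the entity extractor in config.yml:

```yaml
nlu:
# Every surface form has to be in the lookup table,
# otherwise the extractor has nothing to match on.
- lookup: drink.size
  examples: |
    - small
    - tiny
    - little
    - medium
    - large
# Extracted values "tiny" and "little" are then mapped to "small".
- synonym: small
  examples: |
    - tiny
    - little
```

The lookup table does the matching and the synonym block only normalises the matched value, so the two serve different purposes.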

All that being said, I finally got it to work with a small modification to my initial setup. Will post the solution below.

Got it to work with almost the same setup as before:

Lookup tables in nlu.yml:

- lookup: drink.size
  examples: |
    - small
    - medium
    - large

No categorical slots in domain.yml; my slots are now what they should be: text, list, etc.
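
As a rough sketch, the matching part of domain.yml could look like the following. It assumes Rasa 3.x, where each slot needs explicit mappings; as far as I know, on 2.x a slot is filled automatically from an entity of the same name and the mappings key is not used:

```yaml
entities:
- drink.size

slots:
  drink.size:
    type: text
    # Assumption: the slot value should not steer dialogue prediction.
    influence_conversation: false
    mappings:
    # Fill the slot from the entity found by the lookup table.
    - type: from_entity
      entity: drink.size
```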

I also removed the DIETClassifier from the pipeline in config.yml:

pipeline:
- name: WhitespaceTokenizer
- name: RegexFeaturizer
- name: RegexEntityExtractor
  case_sensitive: false
  use_lookup_tables: true
  use_regexes: true
  use_word_boundaries: true
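
If intents still need to be classified, one alternative to dropping DIET entirely might be to keep it for intents only. This is a sketch I have not verified against this project: entity_recognition is a documented DIETClassifier option, while the added CountVectorsFeaturizer is my assumption about what the classifier needs as input features:

```yaml
pipeline:
- name: WhitespaceTokenizer
- name: RegexFeaturizer
# Assumption: bag-of-words features for intent classification.
- name: CountVectorsFeaturizer
- name: DIETClassifier
  epochs: 100
  constrain_similarities: true
  # DIET predicts intents only; entities come from the extractor below.
  entity_recognition: false
- name: RegexEntityExtractor
  case_sensitive: false
  use_lookup_tables: true
  use_regexes: true
  use_word_boundaries: true
```

With this split, only the RegexEntityExtractor emits entities, so the learned tagger can no longer contradict the lookup table.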

Apparently one can also extract the lookup table to a text file (as per this tutorial at 42:16), but I haven’t tried that yet.
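
Short of a plain text file, a middle ground might be to give each lookup table its own YAML file, since Rasa reads all training data files under the data directory. A sketch, where the path data/lookups/drink_size.yml is a made-up example:

```yaml
# data/lookups/drink_size.yml (hypothetical path)
nlu:
- lookup: drink.size
  examples: |
    - small
    - medium
    - large
```

That keeps the main nlu.yml limited to intent examples, and the value lists can be regenerated from the database without touching it.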