Ideal ratio between training questions and lookup file values?

I have a lookup file with nearly 1 million values. Is there an ideal number of training questions I should have in my training dataset? I've found that when I have only 100 training examples for my intent, the resulting model detects only those lookup table values that actually appear in the training file, which renders my lookup file essentially useless. I figured I could replicate the questions as many times as needed so that every value appears in the training file, but that defeats the purpose of having a lookup file. Here is the config I am using:

language: "en"

pipeline:
- name: "tokenizer_whitespace"
- name: "intent_featurizer_count_vectors"
- name: "intent_entity_featurizer_regex"
- name: "ner_crf"
  features: [["low", "title", "upper"], ["bias", "low", "prefix5", "prefix2", "suffix5", "suffix3", "suffix2", "upper", "title", "digit", "pattern"], ["low", "title", "upper"]]
- name: "intent_classifier_tensorflow_embedding"

Unfortunately, there is no ideal ratio. It is better to find the optimal number of training examples experimentally for your particular case.
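
One way to run that experiment is sketched below. It is only an illustration under stated assumptions: the lookup values, the question templates, and the `product` entity name are hypothetical stand-ins, and the `[value](entity)` annotation assumes your training data uses the markdown format. The idea is to put only a random sample of lookup values into the training questions, hold the rest out, and check how well the trained model detects values it has never seen.

```python
import random


def split_lookup(values, n_train, seed=0):
    """Shuffle the lookup values and split them into a training sample and a held-out set."""
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def make_examples(templates, values, entity, seed=0):
    """Fill one randomly chosen template per value, annotated as [value](entity)."""
    rng = random.Random(seed)
    return [rng.choice(templates).format(f"[{v}]({entity})") for v in values]


# Hypothetical data: stand-ins for the real lookup file and training questions.
lookup_values = [f"product_{i}" for i in range(1000)]
templates = [
    "how much does {} cost",
    "do you have {} in stock",
    "tell me about {}",
]

for n_train in (50, 100, 200, 400):
    train_vals, held_out = split_lookup(lookup_values, n_train)
    train_examples = make_examples(templates, train_vals, "product")
    # Test questions use only values that never appear in the training set.
    test_examples = make_examples(templates, held_out[:100], "product", seed=1)
    # Train a model on train_examples here, then measure how many entities
    # it finds in test_examples (that step depends on your setup).
    print(n_train, len(train_examples), len(test_examples))
```

If detection of the held-out values stops improving beyond some training size, that size is probably enough for your data. If it never gets off the ground, the problem is more likely in the features than in the number of examples.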