Some entities in lookup could not be extracted for Chinese NLU Model

Thanks to the Rasa team for their wonderful work. I am a new Rasa user and am learning some basic usage.

I’m building a Chinese weather NLU model that includes a city slot.

After training the model, some entities that are in the lookup table but not in the training examples, such as '西安', are tagged as the common.city entity, but others, such as '日照', are not tagged at all. I am confused by this result. With the same context and the same regex feature value, why is the output different for these two entities?

Maybe I missed something in the config for Chinese? Or do I need to add more data?

Here is my data and config.

data/nlu.yml

nlu:
- intent: ask_weather
  examples: |
    - 查一下 [上海](common.city) 天气
    - 查一下 [苏州](common.city) 天气
    - 查一下 [无锡](common.city) 天气
    - 查一下 [杭州](common.city) 天气
    - [上海](common.city) 天气
    .....

- lookup: common.city
  examples: |
    - 上海
    - 北京
    - 苏州
    - 西安
    - 广州
    - 纽约
    - 日照
    ...
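To double-check the "same regex feature" claim: as far as I understand, the lookup table is compiled into a single alternation pattern, so both cities should receive the same regex feature. A quick hand-built sketch (only approximating what RegexFeaturizer actually generates) confirms both match:

```python
import re

# Hypothetical pattern built from the lookup entries above;
# Rasa's RegexFeaturizer produces something equivalent.
cities = ["上海", "北京", "苏州", "西安", "广州", "纽约", "日照"]
pattern = re.compile("(" + "|".join(map(re.escape, cities)) + ")")

# Both the city that was tagged and the one that was not
# match the same lookup pattern.
print(bool(pattern.search("查一下西安天气")))  # True
print(bool(pattern.search("查一下日照天气")))  # True
```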

config.yml

    - name: JiebaTokenizer
    - name: RegexFeaturizer 
      use_word_boundaries: False
    - name: CountVectorsFeaturizer
    - name: CountVectorsFeaturizer
      analyzer: "char_wb"
      min_ngram: 1
      max_ngram: 4
    - name: DIETClassifier
      epochs: 100
    - name: EntitySynonymMapper
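For context, the second CountVectorsFeaturizer counts character n-grams of each token. A rough illustrative sketch (not Rasa's actual implementation) of what it sees for a city name:

```python
def char_ngrams(token: str, n_min: int = 1, n_max: int = 4):
    """Return all character n-grams of a token, roughly what a
    char-level CountVectorsFeaturizer counts (illustrative only)."""
    return [token[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(token) - n + 1)]

print(char_ngrams("日照"))  # ['日', '照', '日照']
```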

Ni Hao @chaoyang ! :slight_smile: Can you please share the complete config file, with both the pipeline and the policies?

Thanks!

Below is my whole config file. I removed the comment lines.

language: zh

pipeline:
    - name: JiebaTokenizer
    - name: RegexFeaturizer 
      use_word_boundaries: False
    - name: CountVectorsFeaturizer
    - name: CountVectorsFeaturizer
      analyzer: "char_wb"
      min_ngram: 1
      max_ngram: 4
    - name: DIETClassifier
      epochs: 100
    - name: EntitySynonymMapper

policies:

@chaoyang Thanks for sharing, but the policies are missing in the above-mentioned config.

Yes, this is all of the config.yml content. I think the policies config is used for training dialogue models from stories data? Please correct me if I am wrong.

Since I only need to train the NLU model, I didn't add any custom policies.

rasa train nlu --domain data/

@chaoyang Please see this link: Model Configuration. Your config file is not complete.

@chaoyang you also need default policies for training the model.

@chaoyang When you create a basic rasa init project, you need to check the default config.yml.

Thanks, but the config.yml says that if no custom policy is added, the defaults will be used.

And why is the policy config needed if I only train the NLU part?

# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
# # No configuration for policies was provided. The following default policies were used to train your model.
# # If you'd like to customize them, uncomment and adjust the policies.
# # See https://rasa.com/docs/rasa/policies for more information.
#   - name: MemoizationPolicy
#   - name: RulePolicy
#   - name: UnexpecTEDIntentPolicy
#     max_history: 5
#     epochs: 100
#   - name: TEDPolicy
#     max_history: 5
#     epochs: 100
#     constrain_similarities: true

@chaoyang I just asked to see the complete config file. If you comment everything out, it will also take the default config.yml, but you had customised it for your use case, so it will not be the default. Does it make sense now?

Thanks. Below is the complete file. Sorry that I showed the content without the comments; I thought they would not impact the training. Based on this complete file, do you have any suggestions?

# Configuration for Rasa NLU.
# https://rasa.com/docs/rasa/nlu/components/
language: zh

pipeline:
# # No configuration for the NLU pipeline was provided. The following default pipeline was used to train your model.
# # If you'd like to customize it, uncomment and adjust the pipeline.
# # See https://rasa.com/docs/rasa/tuning-your-model for more information.
    - name: JiebaTokenizer
    - name: RegexFeaturizer 
      use_word_boundaries: False
    - name: CountVectorsFeaturizer
    - name: CountVectorsFeaturizer
      analyzer: "char_wb"
      min_ngram: 1
      max_ngram: 4
    - name: DIETClassifier
      epochs: 100
    - name: EntitySynonymMapper

# Configuration for Rasa Core.
# https://rasa.com/docs/rasa/core/policies/
policies:
# # No configuration for policies was provided. The following default policies were used to train your model.
# # If you'd like to customize them, uncomment and adjust the policies.
# # See https://rasa.com/docs/rasa/policies for more information.
  # - name: MemoizationPolicy
  # - name: RulePolicy
  # - name: UnexpecTEDIntentPolicy
  #   max_history: 5
  #   epochs: 100
  # - name: TEDPolicy
  #   max_history: 5
  #   epochs: 100
  #   constrain_similarities: true


@chaoyang As your examples are in Chinese, it's honestly very difficult for me, so please bear with me. You are creating a weather chatbot. First, change common.city to just city for entities like 上海, 苏州, etc., and use city in the lookup table as well. Whatever city names you put in the lookup table are the only ones it will fetch; apart from those, it will not return anything. Try this for now and tell me the result.

I really appreciate your careful and frequent replies on this Chinese-related problem. But I don't quite understand your suggestion. Does that mean I need to change the entity name from common.city to city? Would you give me some more explanation on this? Thanks!

@chaoyang Yes, and please provide more training examples, and also add more cities to the lookup table.

@nik202 Thank you so much! It works!

I just added more city entries (from 18 to 26 city names) to the common.city lookup table.

The model trained with these 26 cities works well. All entities in the lookup table can be tagged.

Then I reduced the city count from 26 back to 18 and trained the NLU model again. This model also works well!

But when I removed the nlu.*.tar.gz file from the models/ directory and retrained the model with the 18 entries, it did not work.

So does this mean Rasa NLU training does not start from random initialization, but instead updates the parameters of the last trained model?

I don't get you on this; can you please explain more?

@chaoyang If your issue is solved, please close this thread as solution for other.

  • Model 1: the lookup table has 18 entries. Train the NLU model. It does not work well; some entities are not tagged by the model.
  • Model 2: add 8 more entries, so the lookup table has 26 entries. Train the NLU model. It works well; all entities in the lookup table are tagged by the model.
  • Model 3: remove the 8 entries, so the lookup table has the same 18 entries as Model 1. Train the NLU model. It still works well; all entities in the lookup table are tagged by the model.

@chaoyang Nice, cool. Model 2 works well, then; try providing more training and lookup examples, delete the older models, and re-train. If you have any issue please let me know :slight_smile: Xièxiè

It could just be pure luck due to the randomness of machine learning :slight_smile: If you want to accurately compare two pipeline components or policies across multiple trainings, you can set a seed for DIET, ResponseSelector, and TED, for example:

- name: DIETClassifier
  random_seed: 1
  # other parameters

I also suggest you use TensorBoard to make comparisons and choose an optimal configuration. This is also doable for DIET, ResponseSelector, and TED, for example:

- name: DIETClassifier
  # other parameters
  evaluate_on_number_of_examples: 200
  evaluate_every_number_of_epochs: 5
  tensorboard_log_directory: ./tensorboard/DIET
  tensorboard_log_level: epoch

Try to set evaluate_on_number_of_examples to about 20% of your total number of examples (of course, these examples will not be used for training, so you will have to provide a few more). You can use this script I wrote to count the number of examples you have.
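That script isn't shown here, but a minimal sketch of such a counter (naive line-based parsing, assuming the nlu.yml layout shown earlier; lookup tables are skipped) could look like this:

```python
def count_nlu_examples(nlu_text: str) -> int:
    """Count intent training examples in a Rasa nlu.yml file.

    Naive sketch: counts '- ' bullet lines inside an 'examples: |'
    block that belongs to an intent (lookup tables are skipped).
    """
    count = 0
    in_examples = False
    in_lookup = False
    for line in nlu_text.splitlines():
        stripped = line.strip()
        # Track which kind of block we are in.
        if stripped.startswith("- lookup:"):
            in_lookup = True
            in_examples = False
        elif stripped.startswith("- intent:"):
            in_lookup = False
            in_examples = False
        # Count bullets only inside an intent's examples block.
        if stripped.startswith("examples:"):
            in_examples = not in_lookup
        elif in_examples and stripped.startswith("- "):
            count += 1
    return count

sample = """\
nlu:
- intent: ask_weather
  examples: |
    - 查一下 [上海](common.city) 天气
    - [上海](common.city) 天气
- lookup: common.city
  examples: |
    - 上海
    - 北京
"""
print(count_nlu_examples(sample))  # 2
```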