Bad accuracy in rasa shell

It's weird: when I train and test the model, Rasa predicts intents and entities correctly. However, when I talk to the bot in shell mode, the accuracy is bad.

My configuration:

language: fr
pipeline:
- name: WhitespaceTokenizer
  case_sensitive: false
- name: CRFEntityExtractor
  BILOU_flag: true
  features:
  - - low
    - title
    - upper
  - - bias
    - low
    - prefix5
    - prefix2
    - suffix5
    - suffix3
    - suffix2
    - upper
    - title
    - digit
    - pattern
  - - low
    - title
    - upper
- name: EntitySynonymMapper
- name: CountVectorsFeaturizer
  intent_tokenization_flag: true
  intent_split_symbol: +
- name: EmbeddingIntentClassifier
- name: RegexFeaturizer
- name: "DucklingHTTPExtractor"
  url: "http://localhost:8000"
  dimensions: ["time", "number", "amount-of-money", "distance"]
  locale: "fr_FR"
  timezone: "Europe/Paris"
  timeout: 3
policies:
- name: KerasPolicy
  epochs: 700
  batch_size: 100
  featurizer:
  - name: MaxHistoryTrackerFeaturizer
    max_history: 5
    state_featurizer:
    - name: BinarySingleStateFeaturizer
- name: MemoizationPolicy
  max_history: 5
- name: FallbackPolicy
  nlu_threshold: 0.7
  core_threshold: 0.4
  fallback_action_name: utter_oupsomethingfailed
- name: FormPolicy

rasa test result:

valentin@mbp-de-valentin archelot % rasa test
2020-01-27 20:09:25 INFO     absl  - Entry Point [tensor2tensor.envs.tic_tac_toe_env:TicTacToeEnv] registered with id [T2TEnv-TicTacToeEnv-v0]
2020-01-27 20:09:25 INFO     rasa.core.policies.ensemble  - MappingPolicy not included in policy ensemble. Default intents 'restart and back will not trigger actions 'action_restart' and 'action_back'.
Processed Story Blocks:   0%|                                                                 | 0/29 [00:00<?, ?it/s, # trackers=1]/usr/local/lib/python3.7/site-packages/rasa/core/slots.py:217: UserWarning: Categorical slot 'sexe' is set to a value ('femmme') that is not specified in the domain. Value will be ignored and the slot will behave as if no value is set. Make sure to add all values a categorical slot should store to the domain.
  f"Categorical slot '{self.name}' is set to a value "
Processed Story Blocks: 100%|███████████████████████████████████████████████████████| 29/29 [00:00<00:00, 629.86it/s, # trackers=1]
2020-01-27 20:09:25 INFO     rasa.core.test  - Evaluating 14 stories
Progress:
100%|██████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:01<00:00, 10.88it/s]
2020-01-27 20:09:26 INFO     rasa.core.test  - Finished collecting predictions.
2020-01-27 20:09:26 INFO     rasa.core.test  - Evaluation Results on CONVERSATION level:
2020-01-27 20:09:26 INFO     rasa.core.test  - 	Correct:          12 / 14
2020-01-27 20:09:26 INFO     rasa.core.test  - 	F1-Score:         0.923
2020-01-27 20:09:26 INFO     rasa.core.test  - 	Precision:        1.000
2020-01-27 20:09:26 INFO     rasa.core.test  - 	Accuracy:         0.857
2020-01-27 20:09:26 INFO     rasa.core.test  - 	In-data fraction: 0.976
2020-01-27 20:09:26 INFO     rasa.core.test  - Evaluation Results on ACTION level:
2020-01-27 20:09:26 INFO     rasa.core.test  - 	Correct:          244 / 246
2020-01-27 20:09:26 INFO     rasa.core.test  - 	F1-Score:         0.992
2020-01-27 20:09:26 INFO     rasa.core.test  - 	Precision:        0.994
2020-01-27 20:09:26 INFO     rasa.core.test  - 	Accuracy:         0.992
2020-01-27 20:09:26 INFO     rasa.core.test  - 	In-data fraction: 0.976
2020-01-27 20:09:26 INFO     rasa.core.test  - 	Classification report: 
                                   precision    recall  f1-score   support

                     utter_thanks       1.00      1.00      1.00         6
            utter_ask_precision_s       1.00      1.00      1.00         6
          utter_favoris_ask_train       1.00      1.00      1.00        14
           utter_onboarding_crush       1.00      1.00      1.00        14
         utter_onboarding_mission       1.00      1.00      1.00        14
                    utter_goodbye       1.00      1.00      1.00         3
                form_find_someone       1.00      1.00      1.00         6
   action_reset_slot_find_someone       1.00      1.00      1.00         6
           utter_onboarding_limit       1.00      1.00      1.00        14
                  utter_show_menu       1.00      0.86      0.92        14
                      utter_greet       1.00      1.00      1.00        14
utter_interest_find_someone_false       1.00      1.00      1.00         6
 utter_interest_find_someone_true       1.00      1.00      1.00         8
            utter_onboarding_goal       1.00      1.00      1.00        14
          action_ask_favoris_city       1.00      1.00      1.00        14
                    utter_iamabot       1.00      1.00      1.00         1
                    action_listen       1.00      1.00      1.00        72
         utter_resume_favoris_all       0.75      1.00      0.86         6
          action_check_itineraire       1.00      1.00      1.00         8
        utter_resume_favoris_city       1.00      1.00      1.00         6

                        micro avg       0.99      0.99      0.99       246
                        macro avg       0.99      0.99      0.99       246
                     weighted avg       0.99      0.99      0.99       246

2020-01-27 20:09:27 INFO     rasa.nlu.test  - Confusion matrix, without normalization: 
[[14  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  8  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0 72  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  6  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  6  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  6  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0 14  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  3  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0 14  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  6  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  8  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0 14  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0 14  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0 14  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 14  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  6  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  6  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  2  0 12  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  6]]
2020-01-27 20:09:31 INFO     rasa.nlu.test  - Running model for predictions:
100%|████████████████████████████████████████████████████████████████████████████████████████████| 295/295 [00:03<00:00, 75.70it/s]
2020-01-27 20:09:35 INFO     rasa.nlu.test  - Intent evaluation results:
2020-01-27 20:09:35 INFO     rasa.nlu.test  - Intent Evaluation: Only considering those 295 examples that have a defined intent out of 295 examples
2020-01-27 20:09:35 INFO     rasa.nlu.test  - Classification report saved to results/intent_report.json.
2020-01-27 20:09:35 INFO     rasa.nlu.test  - Incorrect intent predictions saved to results/intent_errors.json.
2020-01-27 20:09:35 INFO     rasa.nlu.test  - Confusion matrix, without normalization: 
[[16  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  5  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0 26  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0 20  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0 19  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0 10  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0 10  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0 67  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0 12  0  1  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0 15  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0 15  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  8  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0 13  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  1  0  0  0  0  4  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  6  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  8  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  8  0  0]
 [ 0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0 10  0]
 [ 0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0 19]]
2020-01-27 20:09:38 INFO     rasa.nlu.test  - Entity evaluation results:
2020-01-27 20:09:38 INFO     rasa.nlu.test  - Evaluation for entity extractor: CRFEntityExtractor 
2020-01-27 20:09:38 INFO     rasa.nlu.test  - Classification report for 'CRFEntityExtractor' saved to 'results/CRFEntityExtractor_report.json'.
2020-01-27 20:09:38 INFO     rasa.nlu.test  - Incorrect entity predictions saved to results/CRFEntityExtractor_errors.json.

If the predictions are this good in testing, why isn't that the case when I chat with the bot? Is my setup bad? Would spaCy be better? What similar configuration would you recommend?

Thanks for any tips.

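Did you create a train/test split first? If rasa test runs on the same data the model was trained on, the scores will look much better than what you see in the shell. You can take a look at Evaluating Models to understand how to evaluate your models correctly.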
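
To make the idea of a held-out split concrete, here is a minimal sketch in plain Python. It is not Rasa's implementation; the function name split_by_intent, the 20% test fraction and the sample utterances are all assumptions made for illustration. It holds out a share of each intent's examples so the model is scored on sentences it never saw during training.

```python
import random
from collections import defaultdict


def split_by_intent(examples, test_fraction=0.2, seed=42):
    """Stratified split: hold out roughly `test_fraction` of each intent.

    `examples` is a list of (text, intent) pairs. Hypothetical helper,
    not part of Rasa.
    """
    by_intent = defaultdict(list)
    for text, intent in examples:
        by_intent[intent].append((text, intent))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for intent in sorted(by_intent):
        items = by_intent[intent]
        rng.shuffle(items)
        if len(items) > 1:
            # Hold out at least one example, but always keep one for training.
            n_test = min(len(items) - 1, max(1, round(len(items) * test_fraction)))
        else:
            # A single example cannot be split; keep it for training.
            n_test = 0
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


if __name__ == "__main__":
    # Made-up utterances, not taken from the data in this thread.
    examples = [(f"bonjour {i}", "greet") for i in range(10)]
    examples += [(f"au revoir {i}", "goodbye") for i in range(5)]
    train, test = split_by_intent(examples)
    print(len(train), "training examples,", len(test), "test examples")
```

In practice you would not write this yourself: the Rasa CLI has a rasa data split nlu command for this purpose. The point is only that the test examples must be kept out of training, otherwise the reported scores say little about how the bot behaves on new input in the shell.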