Diacritics and declension of nouns

Hi,

I have some questions and I would be very grateful if you could help me. I am working on a chatbot in Czech and I am facing two issues caused by the grammar of the Czech language.

  1. Diacritical marks - I can write “Chtěl bych najít [foťák] (query)”, but quite commonly also without diacritics, “Chtel bych najit [fotak] (query)”, or even mixed, “Chtel bych najit [foták] (query)”. How should I deal with situations like that? Rasa NLU cannot classify the intent correctly if the input message lacks diacritical marks while the training data contains the exact same sentence with diacritics. This affects not only intent classification, but MAINLY entity extraction.

  2. Declension of nouns (different word endings). I have about 10,000 entities (nouns) in a lookup table. All entries are in the first case (nominative singular), but during a dialogue they may also occur in one of the 6 other cases the language defines for the singular, plus 7 cases for the plural. So in the worst case, one entity may have up to 14 different word endings. Moreover, entities can consist of 2, 3, 4 or 5 words, and the endings of each word differ… I have more than 400 labelled training sentences with examples of entities from the lookup table, and these sentences contain entities in different cases. The problem appears during the dialogue: an entity in a different case than in the training data is not recognized by Rasa NLU (in other words, if the training data contains an entity in the second case and the user’s message contains the same entity in the first case, it is not recognized).

**Example**:
Training: "Chtěl bych najít [foťák] (query)"
Dialogue example 1: "Chtěl bych najít nové foťáky"
Dialogue example 2 (other noun case): "Chtěl bych najít seznam foťáku" 
Dialogue example 3 (no diacritics): "Chtel bych najit mobil s fotakem"

I guess you can imagine how much training data I would need if I had to combine all 14 noun cases, with and without diacritics, for 10,000 entities.

I am using the tensorflow_embedding pipeline with this setup:

```yaml
pipeline:
- name: tokenizer_whitespace
- name: intent_entity_featurizer_regex
- name: ner_crf
- name: ner_synonyms
- name: intent_featurizer_count_vectors
  OOV_token: oov
  max_ngram: 2
- name: intent_classifier_tensorflow_embedding
  intent_tokenization_flag: true
  intent_split_symbol: "+"

language: cs
```

What is the best approach to these cases? I came up with these solutions (are they best practice?):

  • Add all possible variations of each entity to the lookup table (10,000 entities * 14 noun cases * 2 with/without diacritics = 280,000 entries in total), group them into synonyms so that all variations of an entity map to the same meaning, and then train the model. Creating this dataset is time-consuming, and training takes a long time.
  • Have an action for this type of intent that loads the lookup table file every time and, for every word in the user’s input, checks for a partial or exact match in the lookup table (fuzzy matching? see the sketch below), then uses some threshold to filter out inaccurate matches. This may be slow, though.
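To illustrate the second option, here is a minimal sketch of the fuzzy-matching idea using only the Python standard library (`difflib`). The file name, the 0.8 cutoff and the helper names are assumptions for the example, not part of any Rasa API; a dedicated fuzzy-matching library would likely be faster for 10,000 entries, and stripping diacritics from both sides first (see the component sketch further down) should improve the match rate.

```python
import difflib

def load_lookup_table(path):
    """Read one lookup entry (nominative form) per line."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def fuzzy_match(user_message, lookup_entries, cutoff=0.8):
    """Return (token, best matching lookup entry) pairs above the cutoff."""
    matches = []
    for token in user_message.lower().split():
        # get_close_matches ranks candidates by difflib's SequenceMatcher ratio
        candidates = difflib.get_close_matches(token, lookup_entries, n=1, cutoff=cutoff)
        if candidates:
            matches.append((token, candidates[0]))
    return matches

# Hypothetical usage:
# entries = load_lookup_table("lookup_table.txt")
# print(fuzzy_match("Chtěl bych najít seznam foťáku", entries))
```

Note that simple token-by-token matching like this will not handle multi-word entities on its own; those would need matching over word n-grams instead of single tokens.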

Thank you for your help!

@akelad any suggestions, please?

It may be worth trying to preprocess your data to remove the accents. You can do that in a few minutes by adding a custom component to the pipeline: implement both the train and the process method so that diacritics are stripped from the training examples and from incoming messages.
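For reference, a minimal sketch of such a component, assuming the rasa_nlu 0.x custom Component API (the import path, the Message attributes and the way the component is registered in the pipeline should be checked against your installed version); the diacritics removal itself uses only the standard library:

```python
import unicodedata

from rasa_nlu.components import Component

def remove_diacritics(text):
    # NFKD decomposition splits e.g. "ě" into "e" + combining caron; drop the marks
    return "".join(
        c for c in unicodedata.normalize("NFKD", text)
        if not unicodedata.combining(c)
    )

class DiacriticsRemover(Component):
    """Strips diacritics from training examples and incoming messages."""

    name = "diacritics_remover"
    provides = []
    requires = []

    def train(self, training_data, cfg, **kwargs):
        for example in training_data.training_examples:
            example.text = remove_diacritics(example.text)

    def process(self, message, **kwargs):
        message.text = remove_diacritics(message.text)
```

The component would go first in the pipeline (referenced by its module path, e.g. `- name: "my_module.DiacriticsRemover"`), so that the tokenizer and featurizers only ever see accent-free text. Since each accented Czech character maps to a single base character, the text length does not change and the entity annotation offsets in the training examples should stay valid.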