Hindi entity extraction. Tokenizer issue

I’m trying to extract entities for Hindi, and most of my training examples produce the following warning: UserWarning: Misaligned entity annotation in message ‘2?? ??? ??? ??? ??? ??? ??? ?? ???’ with intent ‘order’. Make sure the start and end values of entities in the training data match the token boundaries (e.g. entities don’t include trailing whitespaces or punctuation). More info at Training Data Format
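The warning says that an entity's start and end offsets must fall on token boundaries. A minimal sketch of that check, assuming plain whitespace tokenization, can help find which annotations are really off and which are flagged by the tokenizer. The helper names `token_spans` and `misaligned_entities`, the example sentence, and the `item` entity are hypothetical and are not part of Rasa's API or the original data.

```python
import re


def token_spans(text):
    """Return (start, end) character offsets of whitespace-separated tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def misaligned_entities(text, entities):
    """Return the entities whose start or end is not on a token boundary."""
    spans = token_spans(text)
    starts = {start for start, _ in spans}
    ends = {end for _, end in spans}
    return [e for e in entities if e["start"] not in starts or e["end"] not in ends]


# Hypothetical example sentence and entity, not taken from the original data.
text = "मुझे 2 समोसे चाहिए"
value = "समोसे"
start = text.index(value)

aligned = {"start": start, "end": start + len(value), "entity": "item"}
# Same entity, but the span wrongly includes the trailing space.
trailing_space = {"start": start, "end": start + len(value) + 1, "entity": "item"}

# Only the annotation with the trailing space is reported.
print(misaligned_entities(text, [aligned, trailing_space]))
```

If annotations pass a check like this and the warning still appears, the tokenizer is probably splitting the text differently from plain whitespace.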

The following is my pipeline:

pipeline:
  - name: HFTransformersNLP
    # Name of the language model to use
    model_name: "bert"
    # Pre-trained weights to be loaded
    model_weights: "bert-base-multilingual-cased"
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
  - name: RegexFeaturizer
  - name: CRFEntityExtractor
    BILOU: True
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: LexicalSyntacticFeaturizer
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper

The following is a sample of my training data:

intent:order

intent:deny

  • नहीं चाहिए
  • नहीं चाहिए
  • नहीं चाहिए
  • नहीं
  • बिलकुल नहीं
  • बिलकुल नहीं चाहिए
  • मुझे नहीं चाहिए
  • नहीं चाहिए मुझे
  • बादमे कॉल कीजिये
  • बादमे कॉल करना
  • बादमे
  • अभी नहीं बादमे

Hi @007sk!

I think your config is missing a language key:

language: "hi"

This should be added at the very top of the config file.

Hi @007sk, thanks for reporting this. Looks like there is a bug in the WhitespaceTokenizer which is internally used by the HFTransformersNLP component. I have opened an issue here
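The thread does not say what the bug is. One way a tokenizer can go wrong on Devanagari, shown here as an assumption and not as Rasa's actual code, is that Python's `re` module does not treat dependent vowel signs (matras) or the anusvara as word characters. A pattern built on `\w` therefore cuts a word where plain whitespace splitting keeps it whole, which would shift token boundaries away from the annotated entity offsets.

```python
import re

word = "नहीं"  # appears in the training examples above

# Splitting on whitespace keeps the word as a single token.
whitespace_tokens = word.split()

# A pattern built on \w stops at the vowel sign and the anusvara,
# because Python's re does not treat combining marks as word characters.
word_char_tokens = re.findall(r"\w+", word)

print(whitespace_tokens == [word])  # True
print(word_char_tokens == [word])   # False
```

Whether this is the cause here depends on the tokenizer's internals, so the linked issue remains the place to confirm it.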