Hindi entity extraction. Tokenizer issue

007sk · June 9, 2020, 2:27am

I’m trying to extract entities for Hindi, and most of the data gives the following warning: UserWarning: Misaligned entity annotation in message ‘2?? ??? ??? ??? ??? ??? ??? ?? ???’ with intent ‘order’. Make sure the start and end values of entities in the training data match the token boundaries (e.g. entities don’t include trailing whitespaces or punctuation). More info at Training Data Format
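For reference, the condition the warning describes can be sketched in a few lines of Python. This is a minimal illustration, not Rasa's actual implementation: it assumes tokens are split on whitespace only, and the helper names `whitespace_token_spans` and `is_aligned` are hypothetical. The entity offsets are made up for the example, using one sentence from the training sample.

```python
import re


def whitespace_token_spans(text):
    """Return (start, end) character offsets of whitespace-separated tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def is_aligned(text, start, end):
    """True if the span [start, end) starts at a token start and ends at a token end."""
    spans = whitespace_token_spans(text)
    starts = {s for s, _ in spans}
    ends = {e for _, e in spans}
    return start in starts and end in ends


text = "दस वाली कैडबरी डेरीमिल्क के 2 बॉक्स दे दीजिये"

# Hypothetical annotation for the quantity "2".
start = text.index("2")

# The span covers exactly the token "2".
aligned = is_aligned(text, start, start + 1)

# The span also swallows the trailing space, so it no longer ends on a token boundary.
misaligned = is_aligned(text, start, start + 2)

print(aligned, misaligned)
```

If annotations pass a check like this and the warning still appears, the tokenizer is probably producing different token boundaries than plain whitespace splitting would.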

The following is my pipeline:

pipeline:
  - name: HFTransformersNLP
    # Name of the language model to use
    model_name: "bert"
    # Pre-Trained weights to be loaded
    model_weights: "bert-base-multilingual-cased"
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
  - name: RegexFeaturizer
  - name: CRFEntityExtractor
    BILOU: True
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: LexicalSyntacticFeaturizer
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper

The following is a sample of my input training file:

intent:order

2 दस वाली कैडबरी डेरीमिल्क के बॉक्स दे दीजिये
10 दस रुपये के डेरी मिल्क बॉक्स देना प्लीज़
210 प्लीज़ 10 रुपये वाली डेरी मिल्क के बॉक्स दे सकते है
2 प्लीज़ ₹10 रुपये वाली डेरी मिल्क के बॉक्स दे सकते है
4 दस रुपये वाली डेरी मिल्क के बॉक्स दीजिये
15 10 रुपये वाली डेरी मिल्क के बॉक्स देना
50 क्या आपके पास डेरी मिल्क 10 वाले बॉक्स मिल सकते है
दस वाली कैडबरी डेरीमिल्क के 2 बॉक्स दे दीजिये

intent:deny

नहीं चाहिए
नहीं चाहिए
नहीं चाहिए
नहीं
बिलकुल नहीं
बिलकुल नहीं चाहिए
मुझे नहीं चाहिए
नहीं चाहिए मुझे
बादमे कॉल कीजिये
बादमे कॉल करना
बादमे
अभी नहीं बादमे

saurabh-m523 · June 11, 2020, 5:54am

Hi @007sk!

I think your config is missing the language key:

language: "hi"

This should be added at the very top of the config file.

dakshvar22 · June 11, 2020, 1:05pm

Hi @007sk, thanks for reporting this. Looks like there is a bug in the WhitespaceTokenizer which is internally used by the HFTransformersNLP component. I have opened an issue here
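The thread does not say what the bug is. One plausible cause (an assumption, not confirmed here) is that a tokenizer which cleans or splits text with `\w`-based patterns from Python's built-in `re` module will break Hindi words apart, because `\w` does not match Devanagari vowel signs (matras), which Unicode classifies as combining marks rather than letters. A small sketch of that behaviour, using a word from the training sample:

```python
import re
import unicodedata

word = "वाली"  # from the training sample in the first post

# Python's built-in `re` treats \w as "alphanumeric or underscore".
# The vowel signs in this word are combining marks (category Mc),
# so \w+ stops at each of them and the word falls apart.
pieces = re.findall(r"\w+", word)

# Unicode general category of each code point in the word.
categories = [unicodedata.category(ch) for ch in word]

# Splitting on whitespace alone keeps the word in one piece.
whole = word.split()

print(pieces)      # two single-consonant fragments
print(categories)  # ['Lo', 'Mc', 'Lo', 'Mc']
print(whole)       # the original word, unbroken
```

If the tokenizer splits words this way, the token boundaries no longer line up with the whitespace-delimited entity spans in the training data, which would produce exactly the misalignment warning reported above.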
