How do the regex patterns generated from a lookup table work for multi-token entities?
Hi! I am creating a chatbot and I want it to recognize Mexican locations. To do that, I want to use Lookup tables and Forms: ask the user for the first-, second- and third-level geopolitical divisions, with every division having its own table.
However, I am struggling with the first-level locations, where some states have very similar names, e.g. Baja California and Baja California Sur.
When I test the NLU model with these two entities, it correctly extracts Baja California as a state, but it splits Baja California Sur into two parts, Baja and California Sur: Baja is extracted as the state and California Sur is recognized as a name, even though there is no standalone Baja entry in my lookup table. I think that the generated regex first finds Baja and extracts it. However, I am not sure how the lookup table and the regex work in this case.
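To make my guess concrete, here is a minimal Python sketch of what I assume happens. It is not Rasa's actual code: `build_lookup_pattern` and `tokens_flagged_by_pattern` are hypothetical helpers, and I am assuming the lookup table is turned into one case-insensitive alternation with word boundaries around each entry, and that whitespace tokens overlapping a match get flagged as a feature.

```python
import re


def build_lookup_pattern(entries):
    """Hypothetical helper: join lookup entries into one alternation regex.

    Assumption: each entry is escaped and wrapped in word boundaries,
    and the whole pattern is case-insensitive.
    """
    escaped = [re.escape(entry) for entry in entries]
    return re.compile(r"(?i)(\b" + r"\b|\b".join(escaped) + r"\b)")


def tokens_flagged_by_pattern(pattern, text):
    """Hypothetical helper: whitespace tokens that overlap any regex match."""
    match_spans = [m.span() for m in pattern.finditer(text)]
    flagged = []
    for token in re.finditer(r"\S+", text):
        overlaps = any(
            token.start() < end and token.end() > start
            for start, end in match_spans
        )
        if overlaps:
            flagged.append(token.group(0))
    return flagged


text = "Vivo en Baja California Sur"

# Same two entries, listed in a different order.
longest_first = build_lookup_pattern(["Baja California Sur", "Baja California"])
shortest_first = build_lookup_pattern(["Baja California", "Baja California Sur"])

# In an alternation, the first alternative that matches at a position wins.
print(longest_first.search(text).group(0))   # Baja California Sur
print(shortest_first.search(text).group(0))  # Baja California

print(tokens_flagged_by_pattern(longest_first, text))
# ['Baja', 'California', 'Sur']
print(tokens_flagged_by_pattern(shortest_first, text))
# ['Baja', 'California']
```

If this assumption is right, the order of the entries decides whether Sur is covered by the match. As far as I understand, the regex match is only a feature passed to the CRFEntityExtractor, so the final entity boundaries might still come from the CRF and my training examples rather than from the regex alone.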
Here is the pipeline I am using:
pipeline:
- name: "WhitespaceTokenizer"
- name: "RegexFeaturizer"
- name: "CRFEntityExtractor"
- name: "EntitySynonymMapper"
- name: "CountVectorsFeaturizer"
- name: "EmbeddingIntentClassifier"
  intent_tokenization_flag: true
  intent_split_symbol: "+"
and the Lookup table:
Aguascalientes
Baja California Sur
Baja California
Campeche
Chiapas
Chihuahua
Ciudad de México
Coahuila de Zaragoza
Colima
Durango
Estado de México
Guanajuato
Guerrero
Hidalgo
Jalisco
Michoacán de Ocampo
Morelos
Nayarit
Nuevo León
Oaxaca
Puebla
Querétaro
Quintana Roo
San Luis Potosí
Sinaloa
Sonora
Tabasco
Tamaulipas
Tlaxcala
Veracruz de Ignacio de la Llave
Yucatán
Zacatecas
gto
sonora
guerero
morelos
campeche
tamaulipas
michoacan
slp
chiapas
zacatecas
yucatan
sinaloa
tabasco
qro
Hidalgo
baja california
jalisco
chihuahua
cdmx
nuevo Leon
aguascalientes
nayarit
estado de mex
bjs
veracuz
oaxaca
puebla
ags
san luis potosí
Queretaro
Coahuila
queretaro
Nuevo leon
ciudad de mexico
tlaxcala
quinatana roo
bcs
baja sur
baja california sur
san luis potosi
San Luis potosi
San Luis
san luis
I have also added the synonyms and examples to my NLU (data.md) file, as indicated in the forums. I hope you can help me understand why I am getting these results.