Tokenization

ramyanarwa · December 20, 2021, 7:47am

Hi,

I’m using white space tokenization and when I give any text with an entity which is supposed to be tokenized collectively, it is being split and identified as different entities.

For example, if ‘ammonium bromide’ is an entity to be identified as X, it is split into ‘ammonium’ and ‘bromide’, which are then identified as X and Y. ‘ammonium bromide’ is part of a lookup table for entity X.

Can you please help me with a tokenizer that can split entities correctly, either based on the lookup provided or something else?

Appreciate any kind of help.

Thanks, Ramya

MatthiasLeimeister · December 21, 2021, 5:40pm

Hi @ramyanarwa, can you please share your Rasa version (from rasa --version), your config.yml and your training data files (nlu.yml, lookup tables, etc.). This would help a lot with further debugging your problem. Please enclose the file contents in code blocks (using ```). Thanks!

ramyanarwa · December 22, 2021, 1:54am

Rasa Version : 2.8.16

Config:

- name: WhitespaceTokenizer
- name: RegexFeaturizer
  case_sensitive: false
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 100
  constrain_similarities: true
- name: addons.my_custom_components.EntityTypoFixer
- name: EntitySynonymMapper
- name: ResponseSelector
  epochs: 100
  constrain_similarities: true
- name: FallbackClassifier
  threshold: 0.3
  ambiguity_threshold: 0.1

nlu:

- Who is the teacher of math
- Which subjects does Ramya sree teach?
- who leads cricket team?
- which teams does Alex lead?

lookup: I have text files with the names of the teachers, leaders, etc., each consisting of first name, middle name and last name.

I have a database from which I’m getting the info. For example, for the query “Which subjects does Ramya sree teach?”, “Ramya sree” should be considered one entity, which is teacher. Instead, “Ramya” is extracted as one entity, “Teacher”, and “sree” as another, “Leader”, because I have Ramya sree in both lookups.

I want that to be considered a single entity. I can validate later in the custom actions whether it is a leader or a teacher.

MatthiasLeimeister · December 22, 2021, 8:15am

Hi, you don’t seem to use RegexEntityExtractor in your pipeline. This is the component that makes use of the lookup tables in order to detect entities. DIETClassifier will learn from the entity annotations in your NLU training data. You could try adding it to your pipeline, like this:

- name: RegexEntityExtractor
  use_lookup_tables: true

ramyanarwa · December 22, 2021, 9:07am

I have added RegexEntityExtractor and changed the NLU data to the new format. I created the lookups in YAML, but I am getting an error when I try to train the model.

The lookup table name is not read correctly. The first element is being taken as the lookup table name.

MatthiasLeimeister · December 22, 2021, 10:03am

From the stacktrace, it seems that when loading the lookup table, it is checking if the first entry points to a file. In the code, this is happening here.

I’m not on Windows, so I cannot test that, but it could be that because your regex contains a /, pathlib thinks it is a directory, but then cannot find it.

One thing to try would be to put a regex at the first entry that does not have any slashes (like your third one), and test if this fixes it.
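
As a rough illustration of that reordering, here is a minimal sketch of a lookup table whose first entry contains no slash. The table name materialname and the entry "ammonium bromide" come from this thread; the slash-containing entry is a made-up placeholder, since the actual regexes were not posted.

```yaml
version: "2.0"
nlu:
# The first entry has no slash, so it should be less likely to be
# mistaken for a file path when the table is loaded.
# The second entry is a hypothetical placeholder for an entry with a slash.
- lookup: materialname
  examples: |
    - ammonium bromide
    - sodium/potassium alloy
```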

ramyanarwa · December 22, 2021, 10:13am

Yes, the issue is with _load_lookup_table, where it replaces the name with the first entry. But why is this happening?

Is this supposed to work this way?

MatthiasLeimeister · December 22, 2021, 10:21am

I think this is because in principle you could specify an external file containing your regexes in the same NLU format. Something like this:

nlu:
- lookup: materialname
  examples: |
    - C:\Some Path\materialname.txt

where the text file then contains the actual regexes. Therefore this function tries to distinguish whether the first entry points to a file or whether the regexes are already given inline in the NLU format. In your case, I suspect that the slashes trigger the file loading, since they are a typical component of a file path. But I cannot test this on Windows, and it does not seem to happen in the same way on Mac.

Did you try to switch the order and put your third regex in the first place? Did this help?

ramyanarwa · December 22, 2021, 10:31am

Yes switching the order helped. Thank you!

To add text files, can we do it similarly to the markdown format in the previous version? I tried this earlier but the lookups didn’t function as expected.

- lookup: materialname
  examples: |
    - data/materialname.txt

MatthiasLeimeister · December 22, 2021, 10:40am

Actually, looking at the documentation the txt format no longer seems to be supported. Sorry for the confusion.

In Rasa 2, you can put your lookup table either in your main nlu.yml file, or in a separate file and put this in your data folder next to your nlu.yml. You don’t have to point to the lookup file from your nlu.yml, it will be found automatically. But you have to annotate some examples for the entity in your NLU file.

See this post for an example folder structure.
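
A minimal sketch of the setup described above, assuming an entity called teacher and an intent called ask_subjects (both names are placeholders, not taken from the actual project). The example sentence and the name are the ones from earlier in this thread.

```yaml
version: "2.0"
nlu:
# Some annotated examples are needed so the entity appears in the training data.
# The intent name ask_subjects is a placeholder.
- intent: ask_subjects
  examples: |
    - Which subjects does [Ramya sree](teacher) teach?

# The lookup table name matches the entity name used in the annotation.
# This block could instead live in its own YAML file in the data folder,
# with its own version and nlu keys.
- lookup: teacher
  examples: |
    - Ramya sree
```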

ramyanarwa · December 22, 2021, 10:43am

Yep, I have already looked into that.

One more question related to lookups: if I have the full name in the lookup but give only the first name in the query, it does not pick up the entity. Is there a way to make this possible apart from adding examples of only the first name to the lookup?

MatthiasLeimeister · December 23, 2021, 10:01am

Not that I’m aware of, since regexes are matched completely. I would suggest adding the first names to the lookup table as well, if you expect those to be given in isolation. If you want to match them back to the full names (assuming they are unique) you could use synonyms and EntitySynonymMapper.
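
As a sketch of that suggestion, assuming a teacher entity and lookup table (the entity name is a placeholder; the person's name is the one used earlier in this thread): the first name is added to the lookup so it can be matched on its own, and a synonym entry maps it back to the full name. This only makes sense if the first name is unique.

```yaml
version: "2.0"
nlu:
# Both the full name and the bare first name are listed,
# since lookup entries are matched completely.
- lookup: teacher
  examples: |
    - Ramya sree
    - Ramya

# With EntitySynonymMapper in the pipeline, an extracted value "Ramya"
# is replaced by the synonym name "Ramya sree".
- synonym: Ramya sree
  examples: |
    - Ramya
```

EntitySynonymMapper has to come after the entity extractors in the pipeline for the mapping to be applied to their results.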
