I have my own tokenizer and would like to use it in place of tokenizer_whitespace from the official docs.
Here is my LextoTokenizer
class
The problem occurs when _from_json_to_crf
is executed: self.pos_features
becomes False,
then tokens
becomes None,
and from there the program raises
TypeError: 'NoneType' object is not iterable
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-3-249a0d3f08af> in <module>
13
14 # train the model!
---> 15 interpreter = trainer.train(training_data)
16
17 # store it for future use
~/.pyenv/versions/3.6.8/envs/rasa/lib/python3.6/site-packages/rasa_nlu/model.py in train(self, data, **kwargs)
194 component.prepare_partial_processing(self.pipeline[:i], context)
195 updates = component.train(working_data, self.config,
--> 196 **context)
197 logger.info("Finished training component.")
198 if updates:
~/.pyenv/versions/3.6.8/envs/rasa/lib/python3.6/site-packages/rasa_nlu/extractors/crf_entity_extractor.py in train(self, training_data, config, **kwargs)
140 # this will train on ALL examples, even the ones
141 # without annotations
--> 142 dataset = self._create_dataset(filtered_entity_examples)
143
144 self._train_model(dataset)
~/.pyenv/versions/3.6.8/envs/rasa/lib/python3.6/site-packages/rasa_nlu/extractors/crf_entity_extractor.py in _create_dataset(self, examples)
149 for example in examples:
150 entity_offsets = self._convert_example(example)
--> 151 dataset.append(self._from_json_to_crf(example, entity_offsets))
152 return dataset
153
~/.pyenv/versions/3.6.8/envs/rasa/lib/python3.6/site-packages/rasa_nlu/extractors/crf_entity_extractor.py in _from_json_to_crf(self, message, entity_offsets)
454 tokens = message.get("tokens")
455 import pdb; pdb.set_trace()
--> 456 ents = self._bilou_tags_from_offsets(tokens, entity_offsets)
457
458 if '-' in ents:
~/.pyenv/versions/3.6.8/envs/rasa/lib/python3.6/site-packages/rasa_nlu/extractors/crf_entity_extractor.py in _bilou_tags_from_offsets(tokens, entities, missing)
474 def _bilou_tags_from_offsets(tokens, entities, missing='O'):
475 # From spacy.spacy.GoldParse, under MIT License
--> 476 starts = {token.offset: i for i, token in enumerate(tokens)}
477 ends = {token.end: i for i, token in enumerate(tokens)}
478 bilou = ['-' for _ in tokens]
TypeError: 'NoneType' object is not iterable
config.yml
language: "en"
pipeline:
- name: "sentiment.LextoTokenizer"
- name: "ner_crf"
- name: "ner_synonyms"
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"
nlu.md
## intent:greet
- hey
- hello [Peter](PERSON)
- hi [Andrew](PERSON)
- รถ [Benz](CAR) สวย
- รถ [Acura](CAR) งามแท้
## intent:query
- สี[เขียว](COLOR)เท่าไหร่
- สี[เขียว](COLOR)ราคาเท่าไหร่
How do I correctly implement my own tokenizer?
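From the traceback, ner_crf reads message.get("tokens") and gets None, which suggests the custom tokenizer never attached tokens to the messages. Below is a minimal, self-contained sketch of what a tokenizer must produce, assuming the Rasa NLU 0.x component API visible in the traceback. Token is redefined here only so the example runs on its own; in a real component you would use rasa_nlu.tokenizers.Token. segment() is a hypothetical stand-in for the actual LexTo call, not the real API.

```python
class Token(object):
    """Mirrors rasa_nlu.tokenizers.Token: ner_crf's _bilou_tags_from_offsets
    reads token.offset and token.end to align entity spans with tokens."""
    def __init__(self, text, offset):
        self.text = text
        self.offset = offset
        self.end = offset + len(text)


def segment(text):
    # Hypothetical stand-in for LexTo; replace with real Thai word
    # segmentation. Whitespace splitting is used only for illustration.
    return text.split()


def tokenize(text):
    # Build Token objects with correct character offsets into the original
    # text -- this is what must end up in message.set("tokens", ...).
    tokens = []
    running = 0
    for word in segment(text):
        start = text.index(word, running)  # locate word to get its offset
        tokens.append(Token(word, start))
        running = start + len(word)
    return tokens


# Inside the component, both train() and process() must attach the tokens,
# otherwise message.get("tokens") returns None and ner_crf raises the
# TypeError shown above (sketch, following the tokenizer_whitespace pattern):
#
#     def train(self, training_data, config, **kwargs):
#         for example in training_data.training_examples:
#             example.set("tokens", self.tokenize(example.text))
#
#     def process(self, message, **kwargs):
#         message.set("tokens", self.tokenize(message.text))
```

Note that a Thai segmenter returns words without surrounding spaces, so the offset bookkeeping above (searching from the last match position) is what keeps token.offset and token.end consistent with the entity offsets in nlu.md.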