SpacyTokenizer token_pattern

I’m looking for documentation explaining how to set a token_pattern for the SpacyTokenizer, but all I can find is the default “None”:

pipeline:
- name: "SpacyTokenizer"
  # Regular expression to detect tokens
  "token_pattern": None

In particular, I would like to configure SpacyTokenizer so that it also splits numbers from letters when there is no space between them, e.g. so I can label the size and shape parts of user utterances like 8x12 or 8.5x11.

Advice?

Mohd Shukri Hasan found this page with a good example:

token_pattern: "(\\d+|\\D+)"
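To sanity-check what that pattern does, you can try it in plain Python (my understanding is that Rasa compiles token_pattern with the standard regex engine and uses it to sub-split each whitespace-separated token, so re.findall on a single token approximates the result):

```python
import re

# YAML's "(\\d+|\\D+)" reaches the tokenizer as the regex (\d+|\D+):
# runs of digits, alternating with runs of non-digits.
pattern = re.compile(r"(\d+|\D+)")
tokens = pattern.findall("8x12")
print(tokens)  # ['8', 'x', '12']
```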

The final tokenization regex I settled with is this:

token_pattern: "(\\d+|[^\\s\\d\\W]+|[^\\w\\s]+)"
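A quick check of this final pattern in plain Python, under the same assumption that Rasa applies the compiled regex to each whitespace token. The three alternatives match runs of digits, runs of letters/underscores (word characters excluding digits), and runs of punctuation; note that it splits a decimal like 8.5 into three tokens, with the point on its own:

```python
import re

# (\d+ | word chars minus digits | punctuation runs)
pattern = re.compile(r"(\d+|[^\s\d\W]+|[^\w\s]+)")
size = pattern.findall("8x12")
paper = pattern.findall("8.5x11")
print(size)   # ['8', 'x', '12']
print(paper)  # ['8', '.', '5', 'x', '11']
```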

Hi @tom, did you find any documentation on how to write patterns?

@kassem404 I did not find documentation, but the examples above were enough to get me going with the expected syntax. They answered questions like how to escape special characters (use a double backslash instead of a single backslash), which character-class symbols are available (\d, \W, and the other usual options), and whether the regex needs to be surrounded by parentheses (for capturing groups) and/or double quotes. Beyond that, it was a matter of experimenting to find the right regex semantics; existing regex documentation online can help with that. Let me know if you have specific questions.
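On the escaping point: the double backslash is a consequence of putting the pattern in a double-quoted YAML string, where "\\d" is unescaped to the two characters \d before the regex engine ever sees it. A small Python check of that equivalence (Python string literals escape the same way):

```python
# "\\d" in a double-quoted string is the two characters backslash + d,
# which is exactly what the raw string r"\d" expresses directly.
yaml_style = "(\\d+|\\D+)"
raw_style = r"(\d+|\D+)"
print(yaml_style == raw_style)  # True
```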


Thank you @tomp for replying, that’s really helpful!