SpacyTokenizer token_pattern

I’m looking for documentation explaining how to set a token_pattern for the SpacyTokenizer, but all I can find is the default “None”:

pipeline:
- name: "SpacyTokenizer"
  # Regular expression to detect tokens
  "token_pattern": None

In particular, I would like to configure SpacyTokenizer so that it also splits numbers from letters when there is no space between them, e.g. so I can label the size and shape parts of user utterances like 8x12 or 8.5x11.

Advice?

Mohd Shukri Hasan found this page with a good example:

token_pattern: "(\\d+|\\D+)"
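To sanity-check what that pattern does, you can try it in plain Python (my understanding is that Rasa compiles token_pattern with the standard regex engine and uses it to sub-split each whitespace-separated token, so re.findall on a single token approximates the result):

```python
import re

# YAML's "(\\d+|\\D+)" reaches the tokenizer as the regex (\d+|\D+):
# runs of digits, alternating with runs of non-digits.
pattern = re.compile(r"(\d+|\D+)")
tokens = pattern.findall("8x12")
print(tokens)  # ['8', 'x', '12']
```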

The final tokenization regex I settled with is this:

token_pattern: "(\\d+|[^\\s\\d\\W]+|[^\\w\\s]+)"
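A quick check of this final pattern in plain Python, under the same assumption that Rasa applies the compiled regex to each whitespace token. The three alternatives match runs of digits, runs of letters/underscores (word characters excluding digits), and runs of punctuation; note that it splits a decimal like 8.5 into three tokens, with the point on its own:

```python
import re

# (\d+ | word chars minus digits | punctuation runs)
pattern = re.compile(r"(\d+|[^\s\d\W]+|[^\w\s]+)")
size = pattern.findall("8x12")
paper = pattern.findall("8.5x11")
print(size)   # ['8', 'x', '12']
print(paper)  # ['8', '.', '5', 'x', '11']
```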

Hi @tom, did you find any documentation on how to write patterns?

@kassem404 I did not find documentation, but the examples above were enough to get me going with the expected syntax. They answered questions like how to escape special characters (use a double backslash instead of a single backslash), which character-class symbols are available (\d, \W, and the other usual options), and whether the regex needs to be surrounded by parentheses (for capturing groups) and/or double quotes. Beyond that, it was a matter of experimenting to find the right regex semantics; existing regex documentation online can help with that. Let me know if you have specific questions.
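On the escaping point: the double backslash is a consequence of putting the pattern in a double-quoted YAML string, where "\\d" is unescaped to the two characters \d before the regex engine ever sees it. A small Python check of that equivalence (Python string literals escape the same way):

```python
# "\\d" in a double-quoted string is the two characters backslash + d,
# which is exactly what the raw string r"\d" expresses directly.
yaml_style = "(\\d+|\\D+)"
raw_style = r"(\d+|\D+)"
print(yaml_style == raw_style)  # True
```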


Thank you @tomp for replying, that’s really helpful!