As @ganeshv already pointed out, we have a regex in place that splits words on those characters into separate tokens, so this will happen whenever you use the WhitespaceTokenizer. If you want to keep the words intact, you can either switch to a different tokenizer or write a custom tokenizer with an updated regex (you can use the WhitespaceTokenizer as a starting point and just adjust the regex there).
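To illustrate the idea, here is a minimal sketch of such a custom tokenizer, assuming a Rasa 2.x-style NLU component API. The class name `KeepPunctuationTokenizer` is just illustrative, and the import paths / `Token` and `Message` classes may differ in your Rasa version, so treat this as a starting point rather than a drop-in component:

```python
import re
from typing import List, Text

from rasa.nlu.tokenizers.tokenizer import Token
from rasa.nlu.tokenizers.whitespace_tokenizer import WhitespaceTokenizer
from rasa.shared.nlu.training_data.message import Message


class KeepPunctuationTokenizer(WhitespaceTokenizer):
    """Illustrative custom tokenizer: splits on whitespace only,
    without stripping the characters the default regex removes."""

    def tokenize(self, message: Message, attribute: Text) -> List[Token]:
        text = message.get(attribute)

        # Split on whitespace only; adjust this pattern instead of the
        # default punctuation-stripping regex if you need finer control.
        words = [w for w in re.split(r"\s+", text) if w]

        # Build Token objects with the correct character offsets.
        tokens = []
        offset = 0
        for word in words:
            start = text.index(word, offset)
            tokens.append(Token(word, start))
            offset = start + len(word)
        return tokens
```

If you save this in a module that is importable from your project (e.g. a file next to your config), you can then reference it in the pipeline by its module path instead of `WhitespaceTokenizer`; the exact registration steps depend on your Rasa version, so check the custom component docs for the version you are on.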