In RegexFeaturizer, the regex_string seems unsuitable for Chinese?

In the RegexFeaturizer

    # regex matching elements with word boundaries on either side
    regex_string = "(?i)(\\b" + "\\b|\\b".join(elements_sanitized) + "\\b)"
    return regex_string

It seems the above code is not suitable for Chinese.

    # regex matching elements with word boundaries on either side
    regex_string = "(?i)(\\b" + "\\b|\\b".join(elements_sanitized) + "\\b)"
    if self.language == 'zh':
        regex_string = "(?i)(" + "|".join(elements_sanitized) + ")"
    return regex_string

Would the above be better?
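As a rough alternative sketch (hypothetical, not tested against the actual RegexFeaturizer), the `\b` anchors could be dropped per element whenever the element contains CJK characters, instead of switching on a language flag. The helper name `build_regex` and the Unicode range used are my own assumptions:

```python
import re

# Sketch only: check each element for CJK characters instead of relying
# on a language flag. The \u4e00-\u9fff range covers the main CJK
# Unified Ideographs block (an assumption; extend if needed).
CJK_RE = re.compile(r"[\u4e00-\u9fff]")

def build_regex(elements_sanitized):
    parts = []
    for elem in elements_sanitized:
        if CJK_RE.search(elem):
            # no \b anchors: CJK text has no whitespace word boundaries
            parts.append(elem)
        else:
            parts.append(r"\b" + elem + r"\b")
    return "(?i)(" + "|".join(parts) + ")"

print(build_regex(["南京", "Beijing"]))
```

This way a mixed element list would keep word boundaries for Latin-script entries while still matching the Chinese ones.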

@journey Can you maybe share an example which does not work? If it turns out to be a problem, you can also open a GitHub issue. Thanks.


@Tanja Thank you for your reply.

Below is the test code, which shows that the regex_string is not suitable for Chinese.

import re

elements_sanitized = ["南京", "北京", "上海"]
text = "买一张从南京到上海的火车票"

# With \b word boundaries: no matches, because \b never fires
# between adjacent CJK characters
regex_string1 = "(?i)(\\b" + "\\b|\\b".join(elements_sanitized) + "\\b)"
print(regex_string1)
matches1 = list(re.finditer(regex_string1, text))
print("matches1 len is:%d" % len(matches1))  # prints 0

# Without \b: 南京 and 上海 are both found
regex_string2 = "(?i)(" + "|".join(elements_sanitized) + ")"
print(regex_string2)
matches2 = list(re.finditer(regex_string2, text))
print("matches2 len is:%d" % len(matches2))  # prints 2

@journey Thanks for providing the example. Looks like something we should fix in our code. Can you please open a GitHub issue and link to this forum thread? Thanks. If you want to solve it yourself, just create a PR, we would appreciate it :slight_smile: Feel free to also tag me in the GitHub issue.

Hi @Tanja, I opened a GitHub issue here:

English words are separated by whitespace by default, while Chinese words are not. So you just need to remove the `\b` in the regex.
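A quick way to see why `\b` fails here: in Python 3, `\w` is Unicode-aware and matches CJK characters, so a run of Chinese text is one unbroken sequence of word characters and `\b` can never match inside it:

```python
import re

# CJK characters count as word characters (\w) in Python 3's re module,
# so 从南京到 is one unbroken run of \w characters and \b cannot fire
# between any two of them.
print(re.match(r"\w", "南") is not None)           # True
print(re.search(r"\b南京\b", "从南京到") is None)   # True: \b never fires
print(re.search(r"南京", "从南京到") is not None)   # True: plain match works
```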