In RegexFeaturizer, the regex_string seems unsuitable for Chinese?

In the RegexFeaturizer

    # regex matching elements with word boundaries on either side
    regex_string = "(?i)(\\b" + "\\b|\\b".join(elements_sanitized) + "\\b)"
    return regex_string

It seems the above code is not suitable for Chinese.

    # regex matching elements with word boundaries on either side
    regex_string = "(?i)(\\b" + "\\b|\\b".join(elements_sanitized) + "\\b)"
    if self.language == 'zh':
        regex_string = "(?i)(" + "|".join(elements_sanitized) + ")"
    return regex_string

Would the above be better?
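As a rough alternative sketch (hypothetical, not tested against the actual RegexFeaturizer), the `\b` anchors could be dropped per element whenever the element contains CJK characters, instead of switching on a language flag. The helper name `build_regex` and the Unicode range used are my own assumptions:

```python
import re

# Sketch only: check each element for CJK characters instead of relying
# on a language flag. The \u4e00-\u9fff range covers the main CJK
# Unified Ideographs block (an assumption; extend if needed).
CJK_RE = re.compile(r"[\u4e00-\u9fff]")

def build_regex(elements_sanitized):
    parts = []
    for elem in elements_sanitized:
        if CJK_RE.search(elem):
            # no \b anchors: CJK text has no whitespace word boundaries
            parts.append(elem)
        else:
            parts.append(r"\b" + elem + r"\b")
    return "(?i)(" + "|".join(parts) + ")"

print(build_regex(["南京", "Beijing"]))
```

This way a mixed element list would keep word boundaries for Latin-script entries while still matching the Chinese ones.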

@journey Can you maybe share an example which does not work? If it turns out to be a problem, you can also open a GitHub issue. Thanks.


@Tanja Thank you for your reply.

Below is the test code, which shows that the regex_string is not suitable for Chinese.

import re

elements_sanitized = ["南京", "北京", "上海"]
text = "买一张从南京到上海的火车票"

# With \b word boundaries: no matches, because \b never fires
# between adjacent CJK characters
regex_string1 = "(?i)(\\b" + "\\b|\\b".join(elements_sanitized) + "\\b)"
print(regex_string1)
matches1 = list(re.finditer(regex_string1, text))
print("matches1 len is:%d" % len(matches1))  # prints 0

# Without \b: 南京 and 上海 are both found
regex_string2 = "(?i)(" + "|".join(elements_sanitized) + ")"
print(regex_string2)
matches2 = list(re.finditer(regex_string2, text))
print("matches2 len is:%d" % len(matches2))  # prints 2

@journey Thanks for providing the example. Looks like something we should fix in our code. Can you please open a GitHub issue and link to this forum thread? Thanks. If you want to solve it yourself, just create a PR, we would appreciate it :slight_smile: Feel free to also tag me in the GitHub issue.

Hi @Tanja, I opened a GitHub issue here:

English words are separated by whitespace by default, while Chinese words are not. So you just need to remove the `\b` in the regex.
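A quick way to see why `\b` fails here: in Python 3, `\w` is Unicode-aware and matches CJK characters, so a run of Chinese text is one unbroken sequence of word characters and `\b` can never match inside it:

```python
import re

# CJK characters count as word characters (\w) in Python 3's re module,
# so 从南京到 is one unbroken run of \w characters and \b cannot fire
# between any two of them.
print(re.match(r"\w", "南") is not None)           # True
print(re.search(r"\b南京\b", "从南京到") is None)   # True: \b never fires
print(re.search(r"南京", "从南京到") is not None)   # True: plain match works
```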