journey
(zhouli)
July 22, 2019, 6:16am
1
In the RegexFeaturizer
# regex matching elements with word boundaries on either side
regex_string = "(?i)(\\b" + "\\b|\\b".join(elements_sanitized) + "\\b)"
return regex_string
It seems the above code is not suitable for Chinese.
# regex matching elements with word boundaries on either side
regex_string = "(?i)(\\b" + "\\b|\\b".join(elements_sanitized) + "\\b)"
if self.language == 'zh':
    regex_string = "(?i)(" + "|".join(elements_sanitized) + ")"
return regex_string
Would the above be better?
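For illustration, here is a self-contained sketch of that change outside the featurizer. The function name `build_lookup_regex` and its `language` parameter are hypothetical stand-ins for the method and `self.language` above, not Rasa's actual API:

```python
import re


def build_lookup_regex(elements_sanitized, language="en"):
    """Hypothetical helper: join lookup entries into one alternation regex."""
    if language == "zh":
        # Chinese text has no spaces between words, so skip the \b anchors.
        return "(?i)(" + "|".join(elements_sanitized) + ")"
    # Default: require a word boundary on either side of each entry.
    return "(?i)(\\b" + "\\b|\\b".join(elements_sanitized) + "\\b)"


cities = ["南京", "北京", "上海"]
text = "买一张从南京到上海的火车票"

zh_matches = [m.group(0) for m in re.finditer(build_lookup_regex(cities, "zh"), text)]
en_matches = [m.group(0) for m in re.finditer(build_lookup_regex(cities, "en"), text)]

print(zh_matches)  # ['南京', '上海']
print(en_matches)  # []
```

One trade-off of dropping the boundaries: an entry can then also match inside a longer string, which is why this sketch keeps `\b` for every other language.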
Tanja
(Tanja Bunk)
July 22, 2019, 7:39am
2
@journey Can you maybe share an example which does not work? If it turns out to be a problem, you can also open a GitHub issue. Thanks.
journey
(zhouli)
July 22, 2019, 11:48am
3
@Tanja Thank you for your reply.
Below is the test code, which shows that the regex_string is not suitable for Chinese.
import re
elements_sanitized = ["南京", "北京", "上海"]
text = "买一张从南京到上海的火车票"
regex_string1 = "(?i)(\\b" + "\\b|\\b".join(elements_sanitized) + "\\b)"
print(regex_string1)
matches1 = re.finditer(regex_string1, text)
matches1 = list(matches1)
print("matches1 len is:%d" % len(matches1))
regex_string2 = "(?i)(" + "|".join(elements_sanitized) + ")"
print(regex_string2)
matches2 = re.finditer(regex_string2, text)
matches2 = list(matches2)
print("matches2 len is:%d" % len(matches2))
Tanja
(Tanja Bunk)
July 22, 2019, 1:40pm
4
@journey Thanks for providing the example. Looks like something we should fix in our code. Can you please open a GitHub issue and link to this forum thread? Thanks. If you want to solve it yourself, just create a PR; we would appreciate it. Feel free to also tag me in the GitHub issue.
journey
(zhouli)
July 23, 2019, 12:51am
5
Hi @Tanja, I opened a GitHub issue here:
opened 11:55AM - 22 Jul 19 UTC
closed 10:36AM - 28 Jan 21 UTC
type:enhancement
help wanted
area:rasa-oss
area:rasa-oss/ml 👁
area:rasa-oss/ml/nlu-components
**Rasa version**:
"1.1.2"
**Rasa X version** (if used & relevant):
**Python version**:
3.7.3
**Operating system** (windows, osx, ...):
windows
**Issue**:
In RegexFeaturizer, the regex_string seems not to be suitable for Chinese.
sdu-2044
(Sdu 2044)
September 29, 2019, 9:32am
6
English words are separated by whitespace by default, while Chinese words are not. So you just need to remove the “\b” from the regex.
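A small check of why this happens, using only Python's standard `re` module (the spaced-out sentence is made up for illustration): in Python 3, Chinese characters count as word characters (`\w`), so a run of Chinese text has no `\b` position between characters.

```python
import re

text = "买一张从南京到上海的火车票"

# Every character in the sentence is a word character under Unicode matching.
all_word_chars = all(re.fullmatch(r"\w", ch) for ch in text)

# \b only matches where a word character meets a non-word character
# (or the string edge), so here only at the very start and very end.
boundary_positions = [m.start() for m in re.finditer(r"\b", text)]
print(boundary_positions)  # [0, 13]

# With whitespace around the entry, the \b version does match.
spaced_matches = re.findall(r"\b南京\b", "从 南京 到 上海")
print(spaced_matches)  # ['南京']
```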