First of all, thank you for such an amazing product.
I am working on a bot project that needs to extract movie titles and TV stations from users' utterances, for example:
- "Please, I want to watch <movie_title>"
- "<movie_title>" (in this case, the user directly provides the movie title)
- "Put on the tv <tv_station>"
- "Watch <tv_station>"
- "<tv_station>"

Although I have written the utterances in English, the users' language is Spanish. The pipeline that I intend to use is the following:

language: "es"
pipeline:
- name: "WhitespaceTokenizer"
- name: "RegexFeaturizer"
- name: "CRFEntityExtractor"
  features: [
    ["low", "title", "upper"],
    ["bias", "low", "prefix5", "prefix2", "suffix5", "suffix3", "suffix2", "upper", "title", "digit", "pattern"],
    ["low", "title", "upper"]
  ]
- name: "EntitySynonymMapper"
- name: "CountVectorsFeaturizer"
- name: "CountVectorsFeaturizer"
  analyzer: "char_wb"
  min_ngram: 1
  max_ngram: 4
- name: "EmbeddingIntentClassifier"

I have several problems that I am not sure how to resolve:
- The user can request a tv_station or a movie title directly without using a verb (e.g. saying only "<movie_title>").
- The set of tv_stations is limited and short (about 200), but some of the names include common Spanish words (e.g. the English equivalent would be the article "the").
- The set of movie titles is limited (we have a list), but it is very large (about 100,000) and its vocabulary is completely open.
For the tv_station entity, I think we can try lookup tables. Because a lookup table does exact matching, we could add some expected misspellings to it. However, I don't think lookup tables are a good solution for the movie titles, because the match is exact and titles can contain any word in the language.

Any suggestions about how to proceed in this case? For example, would a custom entity extractor make sense? If so, any recommendations about how to build it? Also, any suggestions for the case in which the user does not use a verb and directly says the name of the TV station and/or the movie title?
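
To make the idea more concrete, this is roughly the kind of matching I have in mind as an alternative to listing every misspelling by hand. It is only a sketch using Python's standard `difflib` module, not a Rasa component: the station names, the helper names (`normalize`, `build_index`, `match_entity`) and the 0.8 cutoff are all assumptions made up for illustration. It also scans the whole list for every word n-gram, so it would probably be too slow for the 100,000 movie titles without some kind of index.

```python
import difflib
import unicodedata

# Hypothetical station names, for illustration only.
TV_STATIONS = ["La Sexta", "Antena 3", "Telecinco", "La 1"]


def normalize(text):
    """Lowercase and strip accents so 'Película' and 'pelicula' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).strip()


def build_index(names):
    """Map each normalized name back to its canonical spelling."""
    return {normalize(name): name for name in names}


def match_entity(utterance, index, cutoff=0.8):
    """Return the canonical name closest to any word n-gram of the utterance, or None."""
    tokens = normalize(utterance).split()
    keys = list(index)
    # Only try n-grams up to the length of the longest known name.
    max_len = max(len(key.split()) for key in keys)
    best_name, best_score = None, 0.0
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            candidate = " ".join(tokens[i:i + n])
            # get_close_matches keeps only keys whose similarity ratio >= cutoff.
            for key in difflib.get_close_matches(candidate, keys, n=1, cutoff=cutoff):
                score = difflib.SequenceMatcher(None, candidate, key).ratio()
                if score > best_score:
                    best_name, best_score = index[key], score
    return best_name


if __name__ == "__main__":
    index = build_index(TV_STATIONS)
    print(match_entity("pon la sesta", index))   # misspelled station after a verb
    print(match_entity("telecinko", index))      # bare, misspelled station name
```

Because the function looks at every n-gram of the utterance, it treats "<tv_station>" alone and "Watch <tv_station>" the same way, which is the behaviour I am after for the case without a verb. I do not know whether something like this should live inside a custom extractor or outside the pipeline.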
Thanks very much in advance.