Extracting movie titles and TV stations

Hi guys,

First of all, thank you for such an amazing product.

I am working on a bot project that needs to extract movie titles and TV stations from users' utterances, for example:

  • "Please, I want to watch "
  • “” (in this case, the user directly provides the movie title)
  • "Put on the tv "
  • "Watch <tv_station>"
  • "<tv_station>"

Although I have written the utterances in English, the users' language is Spanish. The pipeline that I intend to use is the following, with language "es":


  • name: "WhitespaceTokenizer"
  • name: "RegexFeaturizer"
  • name: "CRFEntityExtractor"
    features: [["low", "title", "upper"], ["bias", "low", "prefix5", "prefix2", "suffix5", "suffix3", "suffix2", "upper", "title", "digit", "pattern"], ["low", "title", "upper"]]
  • name: "EntitySynonymMapper"
  • name: "CountVectorsFeaturizer"
  • name: "CountVectorsFeaturizer"
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  • name: "EmbeddingIntentClassifier"

I have several problems that I am not sure how to resolve:

  • The user can request a tv_station or a movie title directly without using a verb (e.g. saying only “<movie_title>”).
  • The set of tv_stations is limited and short (about 200), but some of them include ordinary Spanish words (e.g. the equivalent of the English article “the”).
  • The set of movie_titles is limited (we have a list), but it is very large (about 100,000) and its vocabulary is totally open.

For tv_station, I think we can try lookup tables. Because a lookup table does “exact matching”, we could inject some expected misspellings into it. However, I don't think lookup tables are a good solution for movie titles (the match is exact and titles can include any word of the language). Any suggestions on how to proceed in this case? For example, writing a custom entity extractor? If so, any recommendations on how to do it? Also, any suggestions for the case in which the user does not use a verb and directly says the name of the TV station and/or the movie title?
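To make the "inject expected misspellings" idea concrete, here is a minimal sketch of how lookup-table entries could be expanded with a few cheap variants (lowercase, accents stripped, spaces removed). The station names and the helper names `strip_accents` and `lookup_variants` are hypothetical, and the choice of variants is only an assumption about which misspellings are likely.

```python
import unicodedata
from typing import List


def strip_accents(text: str) -> str:
    """Remove diacritics, e.g. 'Televisión' -> 'Television'."""
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks (category "Mn") carry the accents after NFD decomposition.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def lookup_variants(name: str) -> List[str]:
    """Return the name plus a few spelling variants, in order, without duplicates."""
    candidates = [
        name,
        name.lower(),
        strip_accents(name),
        strip_accents(name).lower(),
        name.replace(" ", ""),
    ]
    seen = set()
    variants = []
    for candidate in candidates:
        if candidate not in seen:
            seen.add(candidate)
            variants.append(candidate)
    return variants


# Hypothetical station names, used only for illustration.
stations = ["MTV", "La Sexta", "Televisión Canaria"]
entries = [variant for station in stations for variant in lookup_variants(station)]
```

The resulting `entries` list could then be written, one entry per line, to whatever file the lookup table is loaded from.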

Thanks very much in advance.

Hi again,

There are some mistakes in the utterance examples. The correct ones are the following (of course, this is just a sample):

  • “Please, I want to watch <movie_title>”
  • “<movie_title>” (the user does not say the verb)
  • “<tv_station>” (the user does not say the verb)
  • “Put on the tv <tv_station>”

Welcome to the forum @benikenobi!

A lookup table for tv_station looks like a good idea. Please keep in mind that we don’t do an exact match when lookup tables are used, i.e. we do not directly mark an exact match as an entity. Instead, we use those matches as features that go into our machine learning models. Thus, it might still be a good idea to add a lookup table for movie_title. As the entity extractor only gets the user message, i.e. no surrounding utterances, it might be really hard to figure out whether something like “<movie_title>” is actually a movie_title or not. You should make sure to add some examples to your training data that consist of just the movie title.
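As a rough illustration of "matches as features", the sketch below turns a lookup list into one regular expression and then flags each token that overlaps a match. This is a simplified stand-in and not the library's actual implementation: the function names, the whitespace tokenization, and the case-insensitive flag are all assumptions made for the example.

```python
import re
from typing import Dict, List


def build_lookup_pattern(entries: List[str]) -> "re.Pattern":
    """Compile one alternation pattern from all lookup entries."""
    # Longest entries first, so a longer name wins over a shorter name it contains.
    ordered = sorted(entries, key=len, reverse=True)
    alternation = "|".join(re.escape(entry) for entry in ordered)
    # Case-insensitive matching is an assumption of this sketch.
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def token_features(text: str, pattern: "re.Pattern") -> List[Dict[str, object]]:
    """Return one feature dict per whitespace token.

    'in_lookup' is True when the token overlaps a lookup match. A downstream
    model would see this flag as one feature among others, not as a decision.
    """
    spans = [match.span() for match in pattern.finditer(text)]
    features = []
    for token in re.finditer(r"\S+", text):
        overlaps = any(
            token.start() < end and start < token.end() for start, end in spans
        )
        features.append({"token": token.group(), "in_lookup": overlaps})
    return features


# Hypothetical lookup entries and utterance.
pattern = build_lookup_pattern(["MTV", "La Sexta"])
features = token_features("Pon la sexta ahora", pattern)
```

Because the flag is only a feature, the model still needs annotated training examples to learn when a flagged token really is a tv_station.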

If you want to create a custom entity extractor, please have a look at Custom NLU Components.

Hi, thanks for the response. What I meant by exact matching is that if I add the entry “MTV” as a tv_station and the user says “Put mtv”, there will be no match :( . I think this can be handled for tv_stations (is it possible to provide a regexp in the lookup table itself?). Anyway, I would really appreciate any advice on the following:

  • We have around 200 tv_stations. Is there any rough estimate of how many training samples to provide?
  • We have around 30,000 movie titles (I made a mistake in the previous message). Is there any rough estimate of how many training samples to provide?
  • Do you recommend having different intents for utterances related to tv_stations and movie_titles, although sometimes the same verb (e.g. “put”) is used?

Thanks again

Hi, if you want a component that extracts entities based on exact matches against a lookup table, you should write your own custom component.
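The matching logic of such a component could look roughly like the sketch below. It is plain Python and deliberately not wired into the custom component interface; the function name `extract_exact_matches` and the keys of the returned dicts are assumptions for illustration. Matching is case-insensitive here, so an entry “MTV” also matches “Put mtv”.

```python
import re
from typing import Dict, List


def extract_exact_matches(text: str, lookup: Dict[str, List[str]]) -> List[dict]:
    """Find whole-word, case-insensitive occurrences of lookup values in text.

    `lookup` maps an entity name to its list of known values. Overlapping
    matches are not resolved in this sketch.
    """
    entities = []
    for entity_name, values in lookup.items():
        for value in values:
            # Lookarounds keep "mtv" from matching inside a longer word.
            pattern = re.compile(
                r"(?<!\w)" + re.escape(value) + r"(?!\w)", re.IGNORECASE
            )
            for match in pattern.finditer(text):
                entities.append(
                    {
                        "entity": entity_name,
                        "value": value,  # canonical spelling from the lookup
                        "start": match.start(),
                        "end": match.end(),
                    }
                )
    return sorted(entities, key=lambda entity: entity["start"])


# Hypothetical lookup data.
found = extract_exact_matches("Put mtv", {"tv_station": ["MTV"]})
```

With a list of around 30,000 titles, compiling one pattern per value on every message would be slow, so a real component would probably precompile the patterns once or use a dictionary of normalized titles.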

(1) Regarding the number of training examples, it is really hard to say. As always, the more the better. Start with 100 examples for each and if the results are not as expected, increase the number of examples where you see the most errors. You can use rasa test for that purpose (Evaluating Models).

(2) Do you need the different intents for your stories? The entity extraction is independent from the intent classification. So there is no benefit in splitting them up just because you want to improve your entity recognition.

Hope that helps.

Thanks again for your response. The problem with providing 100 examples is that it is difficult to come up with 100 different ways of saying the same thing, “Put <tv_station>” :( I mean, I would not want to overfit by using the same sentence and only varying the entity value.

Thanks again for the support

mmhh… in order for the lookup table to work, you should have a few examples in your training data containing some of the entities listed in the lookup table. It is always hard to say that you need x training examples to get the bot to work. I’m afraid you simply have to try it out and test your bot until you are satisfied.
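One possible way to get such examples without writing each by hand is to combine several sentence templates with values sampled from the lists. The sketch below is only that, a sketch: the templates and values are hypothetical, and the `[value](entity)` annotation style is assumed to be the format the training data uses. Using many different templates, including ones that consist only of the entity, helps avoid repeating a single sentence with different values.

```python
import random
from typing import Dict, List


def generate_examples(
    templates: List[str], values: Dict[str, List[str]], n: int, seed: int = 0
) -> List[str]:
    """Fill <entity> placeholders in templates and return n annotated examples."""
    combos = []
    for template in templates:
        for entity, entity_values in values.items():
            placeholder = "<" + entity + ">"
            if placeholder not in template:
                continue
            for value in entity_values:
                annotated = "[{}]({})".format(value, entity)
                combos.append(template.replace(placeholder, annotated))
    # Shuffle with a fixed seed so the sample is reproducible.
    random.Random(seed).shuffle(combos)
    return combos[:n]


# Hypothetical templates and values, for illustration only.
templates = ["Pon <tv_station>", "<tv_station>", "Quiero ver <movie_title>", "<movie_title>"]
values = {"tv_station": ["MTV", "La Sexta"], "movie_title": ["Titanic"]}
examples = generate_examples(templates, values, n=4)
```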


Welcome to the forum! Glad to hear you’re diving into such an interesting project. Your setup looks pretty solid, and I can see how extracting movie titles and TV stations, especially in a language like Spanish, can be a bit of a challenge. As for theatrical releases, perhaps you could integrate a component that identifies phrases like “movies recently released” to help pick out newer titles from your large list.

For the direct mentions of movie titles and TV stations without verbs, maybe you could leverage some context clues or patterns in the user’s input to infer the action. It might not always be foolproof, but could help in most cases. As for the vast array of movie titles, a custom entity extractor sounds like a good plan. You could train it using a mix of your existing dataset and maybe some additional data scraped from sources like IMDb or theatrical releases to keep it updated with movies recently released. Regarding TV stations, lookup tables seem handy, but yeah, handling potential misspellings can be tricky. Perhaps integrating some fuzzy matching algorithms could help accommodate variations.
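As a small illustration of the fuzzy matching idea, the sketch below uses `difflib.get_close_matches` from the Python standard library to map a misspelled station name to the closest known one. The station list, the function name `fuzzy_station`, and the 0.8 cutoff are assumptions that would need tuning on real data.

```python
import difflib
from typing import List, Optional


def fuzzy_station(
    candidate: str, stations: List[str], cutoff: float = 0.8
) -> Optional[str]:
    """Return the known station closest to `candidate`, or None if none is close enough."""
    # Compare in lowercase, but return the canonical spelling.
    by_lower = {station.lower(): station for station in stations}
    matches = difflib.get_close_matches(
        candidate.lower(), list(by_lower), n=1, cutoff=cutoff
    )
    return by_lower[matches[0]] if matches else None


# Hypothetical station list.
stations = ["Telecinco", "Antena 3", "MTV"]
best = fuzzy_station("telecinko", stations)
```

A lower cutoff accepts more misspellings but also raises the risk of mapping ordinary words to station names, which matters for stations that contain common Spanish words.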