Web scraping or creating a DB? Opinion from a professional perspective needed

heeey community

i am working on a project: a chatbot for the university to respond to questions and requests from students. i'm a bit confused between using web scraping on the university site to show results for the chatbot, or creating a database to extract data from. any guidance or advice please :slight_smile: . i really need an opinion on what's better from a professional standpoint.

professional perspectives :slight_smile:

With scraping you may want to double check if you’re allowed to just copy a website. The university website may allow it, but there’s plenty of situations where scraping for professional use is illegal.

Independent of where the data is from, the data that you’ll use will most likely appear in either of two places:

  • in an nlu.md text file for training
  • in some storage layer where it needs to be retrieved via an action, potentially a CustomAction.

A CustomAction can be defined to retrieve data from a file or from a database that you’re hosting. It’s up to you. If the dataset is very large a database might make more sense, but you can definitely have a custom action written in Python that retrieves information from a csv file. The documentation for custom actions can be found here.
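To make the csv option concrete, here is a minimal sketch of the lookup logic such a custom action could wrap. The file contents, intent names, and answers below are made up for illustration, and the rasa_sdk boilerplate is left out so the snippet stands alone:

```python
import csv
import io

# Hypothetical two-column dataset; in practice this would live in a
# faq.csv file on disk next to your action server.
FAQ_CSV = """intent,answer
faq_library_hours,The library is open 8am-10pm on weekdays.
faq_semester_start,The fall semester starts on September 15.
"""

def load_faq(csv_text):
    """Parse the CSV into a dict mapping intent name -> answer."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["intent"]: row["answer"] for row in reader}

def lookup_answer(faq, intent_name):
    """Return the stored answer, or a fallback if the intent is unknown."""
    return faq.get(intent_name, "Sorry, I don't know that one yet.")

faq = load_faq(FAQ_CSV)
print(lookup_answer(faq, "faq_library_hours"))
```

Inside a real custom action, `lookup_answer` would be called from the action's `run()` method with the intent taken from the tracker, and the result sent back via `dispatcher.utter_message(...)`.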

Depending on what you want to do with your career, I’d say both solutions are used. But in this case in particular, if you are scraping a single website, I’d just scrape it once, put the relevant information in a db, and query from it, as you’ll know for sure that it won’t break if someone changes the website. Unless you are fetching data that changes over time; in that case, scraping on demand would be the better option. I wouldn’t worry too much if I were you, I’d pick the one that interests me more.


thaaank you so much for your reply @ZordoC, i really appreciate it.

Hello Youssef ! I’m kinda working on a similar project right now and I’m also lost between the two scenarios. Can you please share your final decision and feedback ? And maybe a tutorial you used to achieve your goal ? I’m a newbie here, thanks a lot !

hey @forwitai, welcome to the community

well for my project i ended up using both. i divided my chatbot responses into two sections: one section we can call query-type responses, and another section i called live data, or real-time data, responses.

for the first type i used a database to store the information and for the second i used scraping.

so it really depends on what type of data you are working with.

for my case i used Grakn as a graph database, and Selenium and BeautifulSoup for the web scraping. :smiley:

Hello Youssef,

Thank you so much for the reply ! Can you give me more details, especially for the web scraping part ? How do I implement response generation in RASA using web scraping ? I don’t really know how to proceed.

Thanks.

hey @forwitai,

You can write web scraping code inside a CustomAction, and just show the data retrieved. You can watch tutorials on web scraping, they are really helpful.
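A rough sketch of the scraping step is below. It uses only the standard library's html.parser on a made-up HTML snippet so it stands alone; a real custom action would first download the live page (e.g. with requests) and could just as well use BeautifulSoup, as mentioned earlier in the thread:

```python
from html.parser import HTMLParser

# Made-up stand-in for a university news page; in a real action you
# would fetch the current page instead of hardcoding it.
SAMPLE_HTML = """
<div class="news">
  <a href="/news/42">Scholarship deadline extended</a>
  <a href="/news/43">New student club fair</a>
</div>
"""

class NewsParser(HTMLParser):
    """Collect (title, link) pairs from <a> tags."""
    def __init__(self):
        super().__init__()
        self.items = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Only keep text that directly follows an opening <a> tag.
        if self._href and data.strip():
            self.items.append((data.strip(), self._href))
            self._href = None

parser = NewsParser()
parser.feed(SAMPLE_HTML)
for title, link in parser.items:
    print(f"{title} -> {link}")
```

The custom action would then pass each title and link to `dispatcher.utter_message(...)` instead of printing them.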

Hey Youssef, I’m thinking about using web scraping to build a CSV file, for example one with two columns, intent and answer; then use RASA NLU and a custom action that would look up an answer in the csv file according to the detected intent. What do you think about this approach ?

Thanks.

hey @forwitai, what i understand from your words is that you will have a csv file with two columns: you’ll always match the intent using RASA NLU against the first column, then choose the answer provided in the second column. for this i think there is no need for web scraping; all you need is to create a knowledge base you can query data from. you can take a look at how to create Knowledge Base Actions. i hope that i got your point and helped you.

@Youssef-0 I will check it out, thanks a lot !

Hello Youssef,

One more question please ! What is the difference between the data you stored in the graph database and the data you look up with web scraping ? What were the separation criteria ? Also, how did you manage to build a graph database from the content of your university website ? It must have entities related to each other somehow, but university websites contain mostly general, independent information.

Thank you for your help.

hey @forwitai,

sorry for my late reply, well

  • the difference is that the data stored in the graph is data that can be easily found and stored and doesn’t change frequently, such as student names, teachers, phone numbers, clubs and so on. the data i look up is mostly highly changeable data that i don’t have access to, like news about the university, scholarships, and events. for this type of data i used web scraping, since the administration would not give me permission to access their database. basically, there are two ways to get information from a website: the first is using an API, the second is web scraping. since the university has no API, i used web scraping.
  • the separation criteria is the availability of the data in my database, and as i said, if i’m using web scraping to extract data from the website, that means the data changes frequently.
  • to build the graph database i used an OpenData approach, which means i stored in my graph every piece of information that is accessible without restriction and provided by the university, using the natural relations between entities.
  • the content of the university website changes all the time, so there is no need to store it. as you said, the pieces of information are independent, so all you need to do is scrape the website every time the user asks.

i hope that helped you .

Hello Youssef,

It did help, thanks a lot ! I am trying to do the same by storing the permanent information in a csv file, and going with web scraping for the content that is likely to change in the future.

I am working now on a custom action that will be triggered for every question from the FAQ. I was wondering if you found a simple way to do it, rather than having to write a story for every intent (there are too many) with the same custom action as a response, and also having to list all the intents in the domain file.

Thanks.

Also, another question: when using web scraping, how do you determine which keywords to look for exactly from the user’s intent ?

Thanks.

hey again @forwitai,

custom action that will be triggered for every question from the FAQ

for this case i also worked on a custom action that is triggered for every question. i will try to explain it as best i can, and i will give you the source code later this week.

i named all the intents of question type FAQ in the same way, for example:

faq_askweather
faq_askname
faq_askage

then i created one story that defines all the use cases of faq questions using OR:

* faq_askweather OR faq_askname OR faq_askage
  - action_faq

in this action action_faq i extracted the intent name and replaced faq_askage with utter_askage, with

intent = tracker.latest_message['intent'].get('name')

this action takes the name of the faq question and responds with the correct answer. hope that i helped you :slight_smile:
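The renaming trick described above can be sketched as a plain function. The intent names and the fallback template here are made up; inside a real custom action the intent name would come from `tracker.latest_message['intent'].get('name')` as shown above:

```python
def faq_response_template(intent_name):
    """Map a faq_* intent to its utter_* response template,
    mirroring the renaming trick (faq_askage -> utter_askage)."""
    if intent_name.startswith("faq_"):
        return "utter_" + intent_name[len("faq_"):]
    # Hypothetical fallback template for non-FAQ intents.
    return "utter_default"

print(faq_response_template("faq_askage"))  # -> utter_askage
```

The action would then pass the returned template name to the dispatcher so Rasa utters the matching response from the domain file.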

That was the answer I was looking for, because writing a story with the same action for every intent didn’t seem like the best thing to do. Thanks a lot.

That’s exactly what I did.

Also, can you please answer this question ? How can you be sure that the scraped “sentence” is the right answer to give to the user ? I was thinking about returning the whole link, which would be safer. But again, I need to identify the right keywords to look for. I thought about POS tagging, but it could return words that are not really useful; building a list of frequent, not-very-useful words might do the trick, but may not be efficient. So how did you determine the right keyword in the user’s intent ?

Thank you so much for your replies, I really appreciate it.

well for my case, when i used scraping i scraped the whole block. as i said, my project was a chatbot for the university, so after the intent classification and entity extraction, the chatbot will scrape a block of information and respond with the title and a link to check for more information: for example news, emails, clubs and so on …

i didn’t use scraping to search for specific information, i used it just to facilitate navigation and automate certain tasks to improve the user experience. :slight_smile:

Ah okay I see, thanks a lot !