How can I train a FAQ model with a large dataset?

Hi there, I’ve been working on a FAQ model recently, and I have a very large dataset to train on. I have tried two solutions. (Edit: I’m not training on the whole dataset yet, only a subset of about 10,000 examples, and I already ran into OOM.)

The first solution is creating a separate mapping from each intent to its utterance, like this:

In domain.yml:

intents:
- 26a4f255-ecdc-311f-8bee-d1445315b941:
    triggers: action_faq
templates:
  utter_26a4f255-ecdc-311f-8bee-d1445315b941:
  - text: <utter>
actions:
- utter_26a4f255-ecdc-311f-8bee-d1445315b941

In nlu.md:

## intent:26a4f255-ecdc-311f-8bee-d1445315b941
- <intent>

“action_faq” is my custom action, which maps an intent to its corresponding utterance, e.g. 26a4f255-ecdc-311f-8bee-d1445315b941 to utter_26a4f255-ecdc-311f-8bee-d1445315b941.
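The action itself is just a name lookup by convention. A minimal sketch of what I mean (using rasa_sdk; exact method names vary a little between versions, e.g. older rasa_sdk versions use dispatcher.utter_template(name, tracker) instead):

from typing import Any, Dict, List, Text

from rasa_sdk import Action, Tracker
from rasa_sdk.executor import CollectingDispatcher


class ActionFaq(Action):
    # Maps the predicted intent to utter_<intent> by naming convention.

    def name(self) -> Text:
        return "action_faq"

    def run(
        self,
        dispatcher: CollectingDispatcher,
        tracker: Tracker,
        domain: Dict[Text, Any],
    ) -> List[Dict[Text, Any]]:
        intent = tracker.latest_message["intent"].get("name")
        # e.g. 26a4f255-... -> utter_26a4f255-...
        dispatcher.utter_message(template="utter_{}".format(intent))
        return []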

Training with this solution is very, very slow, about 5 hours per epoch, probably because my training data is very large: 140,000+ intent/utterance pairs. What’s worse, it ran out of memory after two days. So I found the second solution on the forum.


The second solution uses the ResponseSelector, so the training data looks like this:

In nlu.md:

## intent:faq/26a4f255-ecdc-311f-8bee-d1445315b941
- <intent>

In domain.yml:

actions:
- respond_faq
intents:
- faq:
    triggers: respond_faq

In nlg_nlu.md:

## 
* faq/26a4f255-ecdc-311f-8bee-d1445315b941
- <utter>
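For context, the ResponseSelector also has to be enabled in the NLU pipeline. A minimal config.yml sketch (I’m not showing my full config here; this assumes a simple CountVectors-based pipeline, and retrieval_intent just restricts the selector to the faq intent):

language: en
pipeline:
  - name: WhitespaceTokenizer
  - name: CountVectorsFeaturizer
  - name: ResponseSelector
    retrieval_intent: faq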

The second solution also runs out of memory; the log looks like this:

MemoryError: Unable to allocate array with shape (156368, 13826) and data type int64


         [[{{node PyFunc}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[IteratorGetNext]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[softmax_cross_entropy_loss/num_present/broadcast_weights/assert_broadcastable/AssertGuard/Assert/data_5/_165]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: MemoryError: Unable to allocate array with shape (156368, 13826) and data type int64
Traceback (most recent call last):

  File "/env/miniconda3/envs/rasa/lib/python3.6/site-packages/tensorflow_core/python/ops/script_ops.py", line 235, in __call__
    ret = func(*args)

  File "/env/miniconda3/envs/rasa/lib/python3.6/site-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 594, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))

  File "/env/miniconda3/envs/rasa/lib/python3.6/site-packages/rasa/utils/train_utils.py", line 202, in gen_batch
    session_data = balance_session_data(session_data, batch_size, shuffle)

  File "/env/miniconda3/envs/rasa/lib/python3.6/site-packages/rasa/utils/train_utils.py", line 184, in balance_session_data
    X=np.concatenate(new_X),

  File "<__array_function__ internals>", line 6, in concatenate

MemoryError: Unable to allocate array with shape (156368, 13826) and data type int64


         [[{{node PyFunc}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[IteratorGetNext]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.
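A quick back-of-the-envelope check shows why this fails: the traceback ends in np.concatenate inside balance_session_data, which tries to build one dense int64 array of the reported shape, and that single array alone needs about 17 GB:

num_examples, num_features = 156368, 13826
bytes_needed = num_examples * num_features * 8  # int64 = 8 bytes per element
print(bytes_needed / 1e9)  # ~17.3 GB for one dense copy of the featurized data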

So how can I train a FAQ model with a large dataset? Any help is very much appreciated.

Creating an intent and utterance for every question seems inefficient and could cause errors in the responses. Try using full-text search with a database: search in a question column and return the matching response.
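For example, a rough sketch with SQLite’s built-in FTS5 module (the table and column names here are placeholders, nothing Rasa-specific):

import sqlite3

conn = sqlite3.connect("faq.db")
# Full-text index over the question column; the answer is stored but not indexed.
conn.execute(
    "CREATE VIRTUAL TABLE IF NOT EXISTS faq USING fts5(question, answer UNINDEXED)"
)
conn.execute(
    "INSERT INTO faq (question, answer) VALUES (?, ?)",
    ("How do I reset my password?", "Use the 'Forgot password' link on the login page."),
)

def answer(user_question):
    # MATCH runs a full-text query; ORDER BY rank returns the best match first.
    # Note: FTS5 has its own query syntax, so raw user input may need quoting/escaping.
    row = conn.execute(
        "SELECT answer FROM faq WHERE faq MATCH ? ORDER BY rank LIMIT 1",
        (user_question,),
    ).fetchone()
    return row[0] if row else None

print(answer("reset password"))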

Hi @Gehova, thanks for taking the time to reply. I’d like to build my FAQ system with the Rasa framework rather than with database queries. Both solutions I tried work well until the dataset gets very large, so what I need help with is training on such a large dataset, either with one of the two solutions above or with some other idea. Can you help me again? For both solutions I have written Python scripts to preprocess the dataset, so preparing the training data (e.g. nlu.md, domain.yml) from my raw dataset is not a problem.
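To give an idea of the kind of script I mean, generating the ResponseSelector files is conceptually just this (a simplified sketch; loading of my raw data is omitted, and uuid3 here is just one way to get stable ids like the ones above):

import uuid

# Placeholder raw data: in reality this is loaded from my dataset.
faq_pairs = [
    ("How do I reset my password?", "Use the 'Forgot password' link."),
]

with open("nlu.md", "w", encoding="utf-8") as nlu, open("nlg_nlu.md", "w", encoding="utf-8") as nlg:
    for question, answer in faq_pairs:
        faq_id = uuid.uuid3(uuid.NAMESPACE_DNS, question)  # deterministic id per question
        nlu.write("## intent:faq/{}\n- {}\n\n".format(faq_id, question))
        nlg.write("##\n* faq/{}\n  - {}\n\n".format(faq_id, answer))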