Parallel Inference

I am currently building an NLU system with Rasa and I noticed that inference is very slow. This is definitely partially due to a heavy model, but performing inference sequentially doesn't help either.

This is a problem not only when using Rasa NLU on its own, but also when performing cross validation during chatbot development. I do inference with the following code snippet. Training is significantly faster than inference here (probably due to the sequential approach).

import rasa.nlu.model
from tqdm import tqdm

# Load the trained model once, then parse each utterance one by one.
interpreter = rasa.nlu.model.Interpreter.load(model_path)
for index, instance in tqdm(data.items()):
    pred = interpreter.parse(instance["text"])

Is there a way to parallelize this? The components are obviously capable of processing batches, since that is what happens during training.

Thanks in advance.

I mean, you could use the standard multiprocessing packages in Python to parallelize inference.

Another nice tool I have used to parallelize processes in Python is Dask (see the Dask documentation).
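To make the multiprocessing suggestion concrete, here is a minimal sketch using the standard library's `multiprocessing.Pool` with a per-worker initializer, so the heavy model is loaded once per process rather than once per utterance. `StubInterpreter` is a made-up stand-in for `rasa.nlu.model.Interpreter` (which I can't load here); in a real pipeline you would replace it with `Interpreter.load(model_path)`.

```python
from multiprocessing import Pool

class StubInterpreter:
    """Hypothetical stand-in for the heavy Rasa interpreter."""
    def parse(self, text):
        # A real interpreter returns intent/entity predictions for the text.
        return {"text": text, "intent": {"name": "stub", "confidence": 1.0}}

_interpreter = None  # one interpreter instance per worker process

def _init_worker():
    # Load the (heavy) model once per worker, not once per utterance.
    global _interpreter
    _interpreter = StubInterpreter()  # e.g. Interpreter.load(model_path)

def _parse_one(text):
    return _interpreter.parse(text)

def parse_parallel(texts, processes=4):
    # Fan the utterances out over the worker pool; order is preserved by map.
    with Pool(processes=processes, initializer=_init_worker) as pool:
        return pool.map(_parse_one, texts)
```

As noted further down in the thread, this trades memory for speed: each worker holds its own copy of the model, since processes can't share the interpreter object.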

@souvikg10 thanks for your reply.

This is of course true, although I doubt that this would bring major improvements, since the full pipeline still has to run for each utterance, even if you could run, say, four pipelines concurrently. Also, this would heavily increase memory usage, since processes can't share objects.

If you could give the interpreter a batch of texts, I would expect a strong performance boost. After all, the components are built to process batches, right? At least that's my only explanation for why training is so much faster.

BTW, I looked it up, and cross validation also processes the utterances sequentially (rasa.nlu.test, line 1271):

for example in tqdm(test_data.nlu_examples):
    result = interpreter.parse(example.get(TEXT), only_output_properties=False)

I also had a look at the implementation of the interpreter, and there is definitely no way to pass it a batch. So this might be a useful feature request, wouldn't it? If all components accept batches, this shouldn't be hard to implement.

Aha, I got it now, I thought you were only using the NLU interpreter as a Python package. I think one of the arguments would be that Rasa is not an NLP library but rather a chatbot system with NLP embedded. Using the platform for batch NLP processing can be overkill.

I would in fact suggest using Rasa as a Python library and importing the interpreter to run such large-scale tasks.

The system is in fact designed to handle one input message from a user at a time.

On the other hand, a PR for such a feature could be useful.

I think there is the concept of lock stores for processes to share state between them, using Redis for example. That is in fact meant for scaling the Rasa server behind a load balancer, but again, the pipeline runs inference sequentially.

I don’t know your use case, but why are you using the Rasa NLU pipeline for large-scale text processing? Couldn’t spaCy help you with that?

I do use Rasa as a Python package only (no CLI and no server) and import the necessary modules. In that manner, I do cross validation, training and inference.

My use case is an academic NLU challenge: predicting intents and entities. I had used Rasa before to develop a chatbot and was impressed by the speed of development and training (compared to, e.g., fine-tuning BERT) while achieving very competitive results.

Rasa has indeed shifted its focus from also being usable as a general NLP library to being more of an end-to-end chatbot framework. Still, I really hope that the Python API remains maintained, so that users can apply the great components to other NLU tasks as well.

I'm not quite sure about the capabilities of spaCy, but for me, the DIET architecture and the fast prototyping in general are the reasons to use Rasa for this use case.

If you use Rasa as a Python package, then why can’t you use multiprocessing?

If you want to spread out the pipeline components, I don’t know whether that would make sense, because each step depends on the output of the previous one.

The best you could do is run the full pipeline for every utterance in a subprocess, using the multiprocessing package.

I could of course use multiprocessing, I just doubt that it would result in a very big performance boost. But most importantly, not being able to do inference in batches is odd, because the same components use batches (or the whole dataset) during training, which includes inference (the forward pass).
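The intuition behind that last point can be shown with a toy model: if every `parse` call pays a fixed per-call overhead (feature pipeline setup, graph dispatch, etc.), a single batched call amortizes that overhead over all utterances, while a sequential loop pays it once per utterance. `ToyModel` below is entirely invented for illustration; the overhead is modelled as a counter rather than real work.

```python
class ToyModel:
    """Invented toy model: counts fixed per-call overhead explicitly."""

    def __init__(self):
        self.setup_calls = 0

    def _setup(self):
        # Stands in for fixed per-call overhead (pipeline setup, dispatch, ...).
        self.setup_calls += 1

    def predict_one(self, x):
        self._setup()            # overhead paid for every single input
        return x * 2

    def predict_batch(self, xs):
        self._setup()            # overhead paid once for the whole batch
        return [x * 2 for x in xs]

# Sequential loop: 100 inputs -> 100 setup calls.
loop_model = ToyModel()
loop_results = [loop_model.predict_one(x) for x in range(100)]

# Batched call: same 100 inputs -> 1 setup call, identical results.
batch_model = ToyModel()
batch_results = batch_model.predict_batch(list(range(100)))
```

Multiprocessing only divides the per-utterance overhead across workers; batching removes most of it, which would explain the training/inference speed gap.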