To compare pipelines, we need a lot of data in the NLU file. Should this data contain training examples for one intent or for all intents?
I would probably compare across all intents, in case a specific pipeline happens to work particularly well on one intent but not on the others. For example, a particular tokenizer + entity extractor combination might produce more errors for only one entity, and so affect only the intents that use that entity.
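To make that concrete, here is a rough sketch of how you could look at per-intent scores instead of a single overall number. It is plain Python and not tied to any particular NLU library; the intent names, the label lists, and the `per_intent_f1` helper are all made up for illustration, and in practice you would fill the lists from each pipeline's predictions on the same held-out test set.

```python
from collections import Counter


def per_intent_f1(true_labels, predicted_labels):
    """Return {intent: F1} computed from two parallel lists of intent labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true, pred in zip(true_labels, predicted_labels):
        if true == pred:
            tp[true] += 1
        else:
            fn[true] += 1  # the true intent was missed
            fp[pred] += 1  # the predicted intent was wrongly assigned
    scores = {}
    for intent in set(true_labels):
        denom = 2 * tp[intent] + fp[intent] + fn[intent]
        scores[intent] = 2 * tp[intent] / denom if denom else 0.0
    return scores


# Hypothetical test-set labels and predictions from two pipelines.
true = ["greet", "greet", "book_flight", "book_flight", "book_flight", "goodbye"]
pipeline_a = ["greet", "greet", "book_flight", "book_flight", "book_flight", "goodbye"]
pipeline_b = ["greet", "greet", "goodbye", "book_flight", "greet", "goodbye"]

f1_a = per_intent_f1(true, pipeline_a)
f1_b = per_intent_f1(true, pipeline_b)

# Print the scores side by side so a weakness on a single intent stands out.
for intent in sorted(f1_a):
    print(f"{intent:12s} A={f1_a[intent]:.2f} B={f1_b[intent]:.2f}")
```

In this made-up data, pipeline B only mislabels `book_flight` examples, which is the kind of difference you would not see if the test data covered a single intent.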