What really happens during `rasa test`

I have an intent named `nlu_fallback` with examples of out-of-scope ("trash") messages inside the `nlu.yml` file.
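For context, the training data looks roughly like this (a sketch; the actual examples differ):

```yaml
nlu:
  - intent: nlu_fallback
    examples: |
      - asdf qwerty zxcv
      - blah blah blah
      - lorem ipsum dolor
```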

I want it to be predicted both when intent confidence is low and when users write messages irrelevant to the story flow. The bot works fine: when I enter gibberish, `rasa interactive` predicts the `nlu_fallback` intent perfectly. The problem is that when I evaluate the bot with `rasa test nlu --cross-validation --folds 3`, it hardly ever predicts `nlu_fallback`; instead, it predicts other intents even when their confidence is as low as ~0.0005. I found the following code snippet in `rasa/nlu/test.py`:
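For reference, the fallback mechanism in question is enabled via `FallbackClassifier` in my pipeline, roughly like this (the threshold value is illustrative; `threshold` is the confidence cutoff below which `nlu_fallback` is predicted):

```yaml
pipeline:
  - name: WhitespaceTokenizer
  - name: CountVectorsFeaturizer
  - name: DIETClassifier
    epochs: 100
  - name: FallbackClassifier
    threshold: 0.3
```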

```python
async def get_eval_data(
    processor: MessageProcessor, test_data: TrainingData
) -> Tuple[
    # <some code>
]:
    # <some code>

    for example in tqdm(test_data.nlu_examples):
        result = await processor.parse_message(
            UserMessage(text=example.get(TEXT)), only_output_properties=False
        )
        _remove_entities_of_extractors(result, PRETRAINED_EXTRACTORS)
        if should_eval_intents:
            if fallback_classifier.is_fallback_classifier_prediction(result):
                # Revert fallback prediction to not shadow
                # the wrongly predicted intent
                # during the test phase.
                result = fallback_classifier.undo_fallback_prediction(result)
            intent_prediction = result.get(INTENT, {})
            # <some code using example.get(INTENT, "")>
```


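As far as I can tell, "undoing" the fallback means something like the sketch below (my own illustration, not the actual Rasa implementation): the synthetic `nlu_fallback` entry is dropped and the next-best real intent is promoted, so the evaluation scores the classifier's own underlying guess.

```python
# Hypothetical sketch of reverting a fallback prediction.
# All names here are my own; only "nlu_fallback" and the
# "intent"/"intent_ranking" keys come from Rasa's parse output.

FALLBACK_INTENT = "nlu_fallback"

def undo_fallback_prediction(parse_result: dict) -> dict:
    """Drop the fallback entry and promote the next-ranked intent."""
    ranking = parse_result.get("intent_ranking", [])
    if not ranking or ranking[0].get("name") != FALLBACK_INTENT:
        return parse_result  # nothing to undo

    new_ranking = ranking[1:]  # remove the fallback entry
    result = dict(parse_result)
    result["intent_ranking"] = new_ranking
    # The second-ranked intent becomes the evaluated prediction.
    result["intent"] = new_ranking[0] if new_ranking else {}
    return result
```

If that reading is right, it would explain why my cross-validation results show low-confidence "real" intents instead of `nlu_fallback`.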
Could somebody explain why the fallback prediction needs to be undone? I do not really understand this comment. Do I need to change my intent name to something like `out_of_scope`?

Thanks for your help in advance