Custom Transformer-based Featurizer producing inconsistent outputs vs HuggingFace's version


I am following the lm_featurizer (Rasa's LanguageModelFeaturizer) to implement a dense featurizer that uses PaddleNLP's Transformer API (the PaddlePaddle/PaddleNLP repo on GitHub, develop branch).

While the BERT models are the same (in terms of weights), and the methods look largely similar, I am getting unstable results: running the same input more than once through the NLU shell, the confidence fluctuates. This never happened with the HuggingFace LanguageModelFeaturizer version.
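One way I thought of narrowing this down is to check whether the raw sequence embeddings themselves fluctuate between runs, independent of the downstream classifier. A sketch of the check, where embed() is a hypothetical stand-in for my featurizer call (it would actually run the Paddle model on the tokens and attention mask and return the embedding as a numpy array):

```python
import numpy as np

# Hypothetical placeholder for the real featurizer call; deterministic
# by construction here, purely to show the shape of the check.
def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal((1, 768))

a = embed("book me a flight")
b = embed("book me a flight")
# If the featurizer is deterministic, repeated calls must match exactly;
# if this prints a non-trivial max difference, the instability is in the
# featurizer itself, not in the downstream classifier.
print(np.allclose(a, b), float(np.abs(a - b).max()))
```

If the embeddings match across runs but the confidence still fluctuates, the problem would instead be downstream of the featurizer.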

I am not an expert in this space (still trying to learn), so I am looking for ideas on how that could happen. I suspect it largely depends on how I did the integration (I ported everything up to the point where the model is run on the tokens and attention masks to generate the sequence embeddings). I am happy to share the code repo so someone can help diagnose what delta is needed to make the outputs consistent.
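One guess I had while reading around: I believe Paddle layers default to train mode, so if the model is never switched to eval mode, dropout stays active at inference and the same input produces different embeddings each call. I am not sure this is my actual bug, but here is a toy numpy sketch of the effect (made-up dimensions and dropout rate, not the real model):

```python
import numpy as np

P_DROP = 0.5  # made-up dropout rate, for illustration only

def featurize(x, w, *, train=True, rng=None):
    """Toy dense layer with inverted dropout (active only in train mode)."""
    h = x @ w
    if train:
        rng = rng if rng is not None else np.random.default_rng()
        mask = rng.random(h.shape) >= P_DROP  # fresh random mask each call
        h = h * mask / (1.0 - P_DROP)
    return h

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 32))
w = rng.standard_normal((32, 16))

a = featurize(x, w)  # train mode: random dropout mask on every call
b = featurize(x, w)
print(np.allclose(a, b))  # almost surely False -> fluctuating embeddings

c = featurize(x, w, train=False)
d = featurize(x, w, train=False)
print(np.allclose(c, d))  # True -> deterministic at inference
```

If that is the cause, switching the Paddle model to eval mode before featurizing should make the embeddings reproducible.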