Train on a very small set?

Hi everyone,

We are trying to understand the underlying model and have two main questions:

  1. We understand that this is a transformer-based architecture. Was it pre-trained on any dataset (e.g., Wikipedia)?
  2. If we understand correctly, the intent classification is then a fine-tuning task on top of that transformer. How come it works with such small training sets?
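
To make question 2 concrete, here is a toy sketch of what we *assume* is happening: a frozen pre-trained encoder supplies rich features, and only a tiny classifier head is trained on the labeled intents, so very few parameters need fitting and very few examples suffice. (The "encoder" below is a fake keyword-count stand-in, purely for illustration; we know the real system would use actual transformer embeddings.)

```python
import math

# Fake stand-in for a frozen pre-trained encoder. In a real system this
# would be a transformer producing a dense sentence embedding; here we
# use keyword counts so the example runs without any ML libraries.
KEYWORDS = ["book", "flight", "weather", "rain", "hotel", "forecast"]

def encode(text):
    words = text.lower().split()
    return [float(words.count(k)) for k in KEYWORDS]

# Tiny labeled set: 0 = travel booking, 1 = weather query.
train = [
    ("book a flight to paris", 0),
    ("book a hotel room", 0),
    ("i need a flight tomorrow", 0),
    ("will it rain today", 1),
    ("what is the weather forecast", 1),
    ("is rain expected this weekend", 1),
]

# Train only a small logistic-regression "head" on the frozen features.
# This is the point of the sketch: the encoder's knowledge is reused,
# and only these few weights are learned from the small set.
w = [0.0] * len(KEYWORDS)
b = 0.0
lr = 0.5
for _ in range(200):
    for text, y in train:
        x = encode(text)
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        p = 1.0 / (1.0 + math.exp(-z))   # P(intent = weather query)
        g = p - y                         # gradient of the log-loss
        w = [wi - lr * g * xi for wi, xi in zip(w, x)]
        b -= lr * g

def predict(text):
    z = sum(wi * xi for wi, xi in zip(w, encode(text))) + b
    return 1 if z > 0 else 0

print(predict("book me a flight"))         # expected: 0 (travel booking)
print(predict("weather forecast please"))  # expected: 1 (weather query)
```

With six examples the head separates the two intents, because the hard part (turning text into useful features) was already done upstream. Is that roughly the right mental model for the product?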

We'd appreciate any insights!

Thanks,
Lior