We are trying to understand the underlying model and have two main questions:
- We understand this is a transformer-based architecture. Was it pre-trained on any dataset (e.g. Wikipedia)?
- If we understand correctly, intent classification is then a fine-tuning task on top of that transformer. How does it work so well with such small training sets?
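To make our second question concrete, here is a stdlib-only toy of the pattern we *think* is in play: a frozen encoder plus a small trainable classification head. Everything here (the bag-of-words "encoder", the example utterances, the intent labels) is made up for illustration and is obviously not the actual model.

```python
# Toy sketch of the setup we think is used: a frozen encoder plus a small
# trainable classification head. The real encoder would be the pretrained
# transformer; a bag-of-words stands in here, purely for illustration.

# A handful of labeled utterances, as in few-shot intent classification.
TRAIN = [
    ("book a flight to paris", 0),     # intent 0: travel
    ("reserve a plane ticket", 0),
    ("what is the weather today", 1),  # intent 1: weather
    ("will it rain tomorrow", 1),
]

VOCAB = sorted({tok for text, _ in TRAIN for tok in text.split()})

def encode(text):
    """Frozen 'encoder': a fixed-length feature vector, never updated."""
    return [float(text.split().count(tok)) for tok in VOCAB]

# Only the head is trained: 2 classes x len(VOCAB) weights, so a few
# examples can fit it -- the heavy lifting (good representations) is
# assumed to come from pretraining.
weights = [[0.0] * len(VOCAB) for _ in range(2)]

def scores(vec):
    return [sum(w * v for w, v in zip(ws, vec)) for ws in weights]

def classify(text):
    vec = encode(text)
    return max(range(2), key=lambda c: scores(vec)[c])

# Perceptron-style updates on the head only ("fine-tuning" in miniature).
for _ in range(20):
    for text, label in TRAIN:
        vec = encode(text)
        pred = classify(text)
        if pred != label:
            for i, v in enumerate(vec):
                weights[label][i] += 0.1 * v
                weights[pred][i] -= 0.1 * v
```

If that picture is right, only the tiny head has to be learned from the small intent dataset, which would explain the sample efficiency. Is that roughly how it works here, or is the whole transformer updated too?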
We'd appreciate any insights!