Feedback: Upgrading to TensorFlow 2.6

For the OOM issues, you could try lowering the batch_size.

Yes, that could work, depending on what stage the OOM occurs. It’s worth a try @nonola
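Something like this, for example (just a sketch; the exact values depend on your data and memory, and both DIETClassifier and TEDPolicy accept a batch_size parameter, so adjust whichever component is training when the OOM happens):

pipeline:
- name: DIETClassifier
  epochs: 100
  # lower the upper bound to reduce peak memory during training
  batch_size: [8, 32]

policies:
- name: TEDPolicy
  max_history: 5
  # same idea if the OOM happens during TED training
  batch_size: [8, 32]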

Hi there!

With Rasa 3, I still got an OOM without lowering the batch_size. With batch_size: [8, 32] and this config.yml,

language: pt

pipeline:
- name: WhitespaceTokenizer
- name: RegexFeaturizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 30
  learning_rate: 0.005
  batch_size: [8, 32]
  constrain_similarities: true
- name: EntitySynonymMapper
- name: ResponseSelector
  epochs: 100
- name: FallbackClassifier
  threshold: 0.3
  ambiguity_threshold: 0.1

policies:
- name: MemoizationPolicy
- name: TEDPolicy
  max_history: 5
  epochs: 10


I got this result:

Epochs: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 30/30 [1:34:37<00:00, 189.26s/it, t_loss=100, i_acc=0.98, e_f1=0.869, g_f1=0.944]

Why is t_loss=100?

Thanks!

Just to confirm, are you getting the OOM issue with the DIET classifier or with TED?

With the DIET classifier.

I have some findings to share here. For me, the big impact I have seen is the model load time upon startup. Since I run the model inference on k8s, which quite often restarts the pod, the uptick in model load time is a problem. I tested 2.8.9 and 3.0.x, which use TensorFlow 2.6, and it seems like this is the problem.

I did not try an alternative instead of DIET. I can give that a go as well, but the training time increase wasn’t the real issue for me, tbh.

Part of me wonders … did you add the UnexpecTEDIntentPolicy in your pipeline? That component is now part of the rasa init pipeline and may help explain the extra memory/startup time.
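For reference, it shows up in the policies section of the config that rasa init generates, roughly like this (a sketch; the exact defaults can differ between Rasa versions):

policies:
- name: MemoizationPolicy
- name: RulePolicy
- name: UnexpecTEDIntentPolicy
  max_history: 5
  epochs: 100
- name: TEDPolicy
  max_history: 5
  epochs: 100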

@koaning - my pipeline is purely an NLU one :slightly_frowning_face:

But taking the suggestion of using CRF instead of DIET for entity recognition did bring things back to normal.
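In case it helps others, a sketch of what that swap can look like in a pipeline (hypothetical example; entity_recognition: false turns off DIET's entity head so it only does intent classification, and CRFEntityExtractor handles entities instead):

pipeline:
- name: WhitespaceTokenizer
- name: RegexFeaturizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
# sklearn-based extractor, no TensorFlow involved
- name: CRFEntityExtractor
- name: DIETClassifier
  epochs: 100
  # intent classification only; entities are handled by the CRF above
  entity_recognition: false
  constrain_similarities: true
- name: EntitySynonymMapper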

The TensorFlow Addons package might be the culprit. I revisited your videos on DIET, and you mentioned that the CRF in DIET is fed the output of the transformer layer and is built using TensorFlow Addons.

It looks like both the memory explosion and the increased load time happen because of that, and removing entity recognition solves it.

You’re probably right that something about TensorFlow is the culprit, since our CRFEntityExtractor uses sklearn. Our working theory is that the culprit is tf.scan (see the issue here).

Hi @joancipria, did you solve the problem of increased training time? I ran into the same issue when moving from Rasa 2.4.2 to 3.1.6.

Hi @fkoerner, I encountered the problem of increased training time when I upgraded Rasa from 2.4.2 to 3.1.6.

Taking the officially provided formbot as an example, training on the GPU takes less than 1 minute in Rasa 2.4 but more than 2.5 minutes in Rasa 3.1, and I found that GPU utilization is very low. We are using an RTX 3090.

Is this normal? If not, is there any solution? Thanks!