I have been experimenting with BERT for Q&A systems. One thing I noticed fairly quickly is the latency involved in answering a question: simply put, it's nowhere near fast enough for an interactive system. I understand that using GPUs for inference could speed things up quite a bit, but is that the only lever we have? Also, the examples I followed limit the context to a few hundred words/tokens. Has anyone attempted to build a system over a large document with thousands of words?
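For context, the kind of long-document approach I have in mind is splitting the text into overlapping windows and running the QA model on each, keeping the highest-scoring answer. Here is a minimal sketch of the windowing step only (whitespace tokens stand in for real subword tokens, and the sizes are arbitrary assumptions, not values from any particular model):

```python
def chunk_document(text, max_tokens=384, stride=128):
    """Split a long document into overlapping token windows.

    Whitespace splitting stands in for a real subword tokenizer.
    With a real QA model, each window would be paired with the
    question and the highest-scoring answer span kept.
    """
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return [" ".join(tokens)]
    chunks = []
    start = 0
    while start < len(tokens):
        window = tokens[start:start + max_tokens]
        chunks.append(" ".join(window))
        if start + max_tokens >= len(tokens):
            break
        # Overlap windows so an answer spanning a boundary isn't lost.
        start += max_tokens - stride
    return chunks
```

Is something along these lines what people actually do in practice, or is there a better-established pattern (e.g. a retrieval step that narrows down candidate passages first)?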
Any pointers will be greatly appreciated!