Language models keep growing in size, largely because model quality scales extremely well with model size. As a result, delivering these models to end users is becoming increasingly challenging, and how to make serving them faster and more cost-effective is a constant question.