Language models keep growing in size, driven by the fact that model quality scales remarkably well with model size. As a result, delivering these models to end users is becoming increasingly challenging, and making serving faster and more cost-effective is a constant question.
With this evolving space in mind, Cohere has developed an in-house solution, The Inference Framework (TIF), to help address these challenging problems. We want TIF to deliver the fastest inference possible on our models, as well as to maintain extensibility and the flexibility to incorporate new technology, deep learning engines, and frameworks. In this blog post, we’ll walk through the high-level structure of the TIF system architecture and some of the methods that help us efficiently serve massive language models.
Supporting a variety of model architectures and frameworks
There are vibrant communities and open-source frameworks for deep learning models, and each has its own features and advantages. The question is not simply PyTorch or TensorFlow; it is which framework is best for a given model architecture, model size, and the hardware needed to run the model. New frameworks also emerge frequently. TIF is designed to remain agile and extensible, so that our team can incorporate new technology and start experimenting as quickly as possible, with minimal disruption to the model production pipeline.
Our process for managing models and runtimes can be thought of in three major steps:
1) The first step is model ingestion and translation.
This step accepts trained deep learning models from a variety of frameworks (TensorFlow, PyTorch, in-house, etc.). It analyzes the model's format and extracts its variable structures and parameter tensors.
2) The second step, abstract model architecture, stores abstract definitions of our language models, which are not tied to any framework.
Given the spec and model parameters ingested from the previous step, abstract models are created with variables pointing to loaded parameters.
3) The last step manages the concrete model runtimes.
This step contains the model architecture and layer implementations for each runtime that TIF supports (e.g., TensorFlow, PyTorch, FasterTransformer, JAX, ONNX, and TensorRT). Parameters are populated throughout the network, and the model is ready to run!
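The three steps above can be sketched in miniature. This is an illustrative outline only, not TIF's actual API: the class and function names are hypothetical, and plain dicts and lists stand in for framework checkpoints and parameter tensors.

```python
from dataclasses import dataclass

@dataclass
class AbstractLayer:
    name: str
    params: dict  # variable name -> parameter tensor (here: nested lists)

@dataclass
class AbstractModel:
    """Step 2: a framework-agnostic model definition whose variables
    point at the loaded parameters."""
    layers: list

def ingest(checkpoint: dict) -> AbstractModel:
    """Step 1: extract variable structures and parameter tensors from a
    framework-specific checkpoint (here modeled as a plain dict)."""
    layers = [AbstractLayer(name, params) for name, params in checkpoint.items()]
    return AbstractModel(layers)

def materialize(model: AbstractModel, runtime: str) -> dict:
    """Step 3: populate a concrete runtime with the abstract model's
    parameters. A real implementation would build runtime-native layers."""
    return {f"{runtime}/{layer.name}": layer.params for layer in model.layers}

checkpoint = {"attn": {"w_q": [[1.0]]}, "ffn": {"w_in": [[2.0]]}}
abstract = ingest(checkpoint)
concrete = materialize(abstract, "pytorch")
print(sorted(concrete))  # ['pytorch/attn', 'pytorch/ffn']
```

The key design point is the middle step: because the abstract model is not tied to any framework, adding a new runtime only requires a new `materialize`-style backend, not a new ingestion path per source framework.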
To increase the time and compute efficiency of our inference platform, we need to continually invest in how we optimize models before serving them. Here are a few methods we use to optimize the resources needed to run these models.
Post-training model optimization
Large language models are computationally intensive, which translates to extremely high prediction latency, as well as prohibitively high serving costs. Model speed optimization and parameter compression are key to making LLMs viable for business demands. After extensive experimentation and ongoing research, we've adopted the following strategies to let TIF strike the best balance between model quality and speed.
1) Reducing the size of weight matrices with low rank re-parametrization
Transformer parameters consist mostly of weight matrices. These live in the large dense layers for self-attention and the subsequent feedforward layers. Low rank re-parametrization decomposes each of these giant matrices into the product of two smaller matrices.
This results in a significant reduction in model size, which can be tuned by choosing the rank of the two smaller matrices. In our experiments, we have been able to keep model quality uncompromised while reducing the number of parameters by up to 30%.
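To illustrate the parameter savings, here is a minimal sketch that factors a dense weight matrix into two low-rank factors via truncated SVD. The sizes and rank are illustrative; in practice the factors are trained rather than derived from a fixed matrix (see the white paper linked below).

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8                       # hidden size and chosen rank
W = rng.normal(size=(d, d))        # original dense weight matrix

# Truncated SVD gives the best rank-r approximation W ≈ A @ B.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]               # (d, r) factor
B = Vt[:r, :]                      # (r, d) factor

full_params = W.size               # d * d      = 4096
low_rank_params = A.size + B.size  # 2 * d * r  = 1024
print(full_params, low_rank_params)  # 4096 1024

# At inference time, x @ W is replaced by (x @ A) @ B: two thin
# matrix multiplies instead of one large one.
x = rng.normal(size=(d,))
y_low = (x @ A) @ B
```

Note that the parameter count drops from d² to 2dr, so the savings grow as the rank r shrinks relative to the hidden size d.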
2) Quantization

Neural network quantization is a common technique for speeding up model inference. TIF supports both fp16 and int8 (in development) quantization.
We have also improved our Transformer architectures to have more stable activation distributions, adding extra robustness when operating in lower-precision, lower-range numerical environments.
3) Sparse attention
The autoregressive nature of the Transformer decoder, combined with the quadratic scaling of the attention mechanism, makes it extremely slow to generate long sequences of text. This makes it important to support a variety of attention patterns, from local window to global banded attention.
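To show why a local window helps, here is a small sketch of a local-window attention mask combined with the causal mask an autoregressive decoder already requires. The function name and sizes are illustrative:

```python
import numpy as np

def local_causal_mask(seq_len, window):
    """Position i may attend only to positions j with i - window < j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = local_causal_mask(seq_len=6, window=3)

# With full causal attention, row i has i + 1 allowed positions, so total
# work grows quadratically with sequence length. With a local window, each
# row is capped at `window` positions, so work grows only linearly.
print(mask.sum(axis=1).tolist())  # [1, 2, 3, 3, 3, 3]
```

Banded or global patterns can be expressed the same way, as different boolean masks over the attention score matrix.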
4) Model parallelism
Running prediction on models of hundreds of billions of parameters requires efficient model parallelism. TIF supports two key model parallelism techniques:
1) Pipeline-based parallelism, where a model is split vertically into partitions and a batch is broken into micro-batches to hide pipeline bubbles.
2) Tensor-sharding-based parallelism, where a model is split horizontally along the hidden state into multiple shards with each shard residing in a separate device.
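The tensor-sharding idea can be sketched with plain arrays. In this toy example, a dense layer's weight matrix is split column-wise across "devices" (here, just array slices); each shard computes its slice of the output independently, and the slices are gathered at the end. This is a generic column-parallel illustration, not TIF's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_shards = 8, 16, 2
x = rng.normal(size=(d_in,))
W = rng.normal(size=(d_in, d_out))

# Split the weight along the hidden (output) dimension: each shard holds
# d_out / n_shards columns and would reside on a separate device.
shards = np.split(W, n_shards, axis=1)

# Each device computes its partial output independently...
partial = [x @ w for w in shards]

# ...and an all-gather concatenates the slices into the full activation.
y_sharded = np.concatenate(partial)

# The sharded result matches the unsharded matrix multiply.
y_full = x @ W
assert np.allclose(y_full, y_sharded)
```

In a real deployment the gather step is a cross-device collective, which is why the inter-device bandwidth mentioned below matters so much.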
Additionally, we continuously integrate the latest runtimes from hardware providers to maximize the bandwidth available for communicating parameters and activations.
Efficient inference of large language models is an evolving area, and the industry as a whole is still learning more about it. This article covers a few of the things we’ve seen work, and some things that we’ve learned over years of running these models in production. If you want to learn more about the low-ranking method we describe above, read the white paper: On Low Rank Training of Deep Neural Networks.
If you have any questions or would like to share your experience running large language models, we’d love to hear from you in our Community Discord!
Thanks to contributors and reviewers Bharat Venkitesh, Jeremy Udit, Linus Chui, and Sally Vedros.