Unify API

Compression

January 6, 2024

Machine learning has witnessed a surge of interest in recent years, driven by several factors including the availability of large datasets, advancements in transfer learning, and the development of more potent neural network architectures, all giving rise to powerful models with a wide range of applications.

However, the size of these models has been growing at an unprecedented rate, with some models exceeding billions of parameters. This growth in size has led to increased computational costs, making it difficult to train and deploy these models at scale on current GPUs. To address this challenge, researchers have been exploring various model compression techniques to reduce their size and computational requirements.

**Model compression** refers to a set of algorithms that aim to reduce the size and memory requirements of neural networks without significantly impacting their accuracy. This can help make models more efficient and cost-effective, allowing them to be deployed in various environments, such as edge devices and cloud services.

These algorithms are typically integrated into various libraries that provide bespoke APIs for applying compression techniques to the user’s models. However, the rapid pace of development has made it difficult for user-facing tools to integrate all available techniques in a timely manner. Further, because some algorithms are better suited to specific compiler toolchains, tools typically have to be selective about which techniques they integrate, given their own focus and design choices.

As a result, the landscape of model compression utilities has become complex, with requirements that only partially overlap across the available tools, making it difficult for users to get a clear sense of which tools and techniques are most relevant to their use cases.

In this series of blog posts, we will provide an overview of the landscape for model compression techniques and tools. For each broad category of techniques, we will take some time to explain the core idea behind it before discussing some of the main libraries offering relevant tools, including how they extend PyTorch’s built-in capabilities as well as the unique features they provide, while presenting the high-level intuition for some of the algorithms involved. We then proceed to compare the libraries to provide guidelines in choosing which one to use for different needs and use-cases.

By the end of this series, you will have a better understanding of how model compression can be used to optimise your model for edge AI and other deep learning applications!

Several high level frameworks provide third party libraries with model compression features. Initially, TensorFlow had a significant advantage in the deployment race due to its well-established ecosystem of tools and its own Model Optimization Toolkit.

However, PyTorch has been rapidly closing the gap in the deployment space with the introduction of torch.compile and built-in quantization support in PyTorch 2.0, allowing for the efficient deployment of machine-learning models. These advancements have positioned PyTorch as a strong contender in the deployment arena. Moreover, PyTorch boasts a vast collection of models. At the time of writing this post, there are 10,270 TensorFlow models on Hugging Face, compared with 148,605 PyTorch models. This indicates the growing popularity and adoption of PyTorch in the machine-learning community. In addition to its native capabilities, PyTorch benefits from a thriving ecosystem of third-party compression tools developed specifically for PyTorch models. This ecosystem provides a wide range of options for compressing and optimising PyTorch models, further enhancing their deployment efficiency.

Considering all these factors, it is evident that PyTorch's native quantization and the accompanying third-party tools have become a focal point in the field. Therefore, these posts will primarily focus on exploring PyTorch's quantization capabilities and some of the tools developed around it. While not comprehensive, we strive to cover as many of the major tools per compression technique as possible. That being said, we will be focusing slightly more on quantization and pruning given the wider array of available tools for these techniques compared to tensorization and knowledge distillation.

Quantization is a model compression technique that reduces the precision of the weights and activations of a neural network. In other words, it involves representing the weights and activations of a neural network using fewer bits than their original precision. For example, instead of using 32-bit floating-point numbers to represent weights and activations, quantization may use 8-bit integers. This transformation significantly reduces storage requirements and computational complexity. Although some precision loss is inherent, careful quantization techniques can achieve substantial model compression with only minimal accuracy degradation.
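As a rough back-of-the-envelope illustration of the storage saving, the sketch below compares 32-bit and 8-bit weights. The 7-billion-parameter count is a hypothetical figure, and the estimate ignores the small overhead of storing scales and zero points.

```python
def model_size_mb(num_params: int, bits_per_param: int) -> float:
    """Approximate storage for the weights alone, in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bits_per_param / 8 / 1e6

num_params = 7_000_000_000  # hypothetical model size, chosen only for illustration

fp32_mb = model_size_mb(num_params, 32)
int8_mb = model_size_mb(num_params, 8)

print(f"fp32: {fp32_mb:,.0f} MB")
print(f"int8: {int8_mb:,.0f} MB")
print(f"reduction: {fp32_mb / int8_mb:.0f}x")
```

The ratio depends only on the bit widths, so the same 4x reduction applies to a model of any size.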

Essentially, quantization maps values from a continuous space to a discrete one: a quantization map function transforms each full-precision value into one of a finite set of lower bit-width values, called quantization levels.

Ideally, quantization should be used when the model's size and computational requirements are a concern, and when the reduction in precision can be tolerated without significant loss in performance. This is often the case with LLMs in tasks such as text classification, sentiment analysis, and other NLP tasks, where the models are massive and resource-intensive. Quantization is also best used when deploying on resource-constrained devices such as mobile phones, IoT and edge devices. By reducing the precision of weights and activations, quantization can significantly reduce the memory footprint and computational requirements of a neural network, making it more efficient to run on these devices.

Quantization can be categorised into two main approaches:

**1. Quantization Aware Training (QAT):** Quantization-aware training emulates inference-time quantization by using quantized values in the forward pass, while the backward pass still relies on non-quantized values. This creates a model that downstream tools then use to produce the actual quantized model. The quantized model uses lower precision (e.g. 8-bit integers instead of 32-bit floats), leading to benefits during deployment. This approach leads to better accuracy preservation, as the model learns to adapt to the precision loss during training.

**2. Post-Training Quantization (PTQ):** Unlike quantization-aware training, where the model learns to adjust to lower precision during training, post-training quantization directly applies this reduction in bit width after the model has been trained. While it may not offer the same level of accuracy preservation as quantization-aware training, it is widely used in practice since it doesn’t require retraining the model.
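To make the QAT idea concrete, here is a minimal toy sketch in plain Python. It assumes a one-weight model `y = w * x`, a fixed hypothetical scale, and a straight-through estimator for the backward pass (a common choice, though not the only one). The forward pass uses the quantized weight, while the update is applied to the full-precision weight.

```python
def fake_quantize(w: float, scale: float, qmin: int = -128, qmax: int = 127) -> float:
    """Quantize then immediately dequantize: a float that lies on the int8 grid."""
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale

scale = 0.1             # hypothetical fixed quantization step
x, target = 1.0, 0.73   # single training example for the toy model y = w * x
w = 0.0                 # full-precision weight that the optimiser updates
lr = 0.1

for _ in range(200):
    w_q = fake_quantize(w, scale)  # forward pass sees the quantized weight
    error = w_q * x - target
    # Straight-through estimator (assumption): treat d(w_q)/d(w) as 1, so the
    # gradient of the squared error is applied to the full-precision weight.
    grad = 2 * error * x
    w -= lr * grad

print(f"full-precision w = {w:.3f}, quantized w = {fake_quantize(w, scale):.1f}")
```

Post-training quantization would instead train `w` without `fake_quantize` in the loop and apply the rounding once at the end.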

These quantization approaches can further be divided based on the following factors:

**1. Quantization Granularity**:

**Per Tensor Quantization**: in this type, the entire tensor has the same quantization parameters.

**Per Channel Quantization**: here each channel has different quantization parameters.

The latter approach generally leads to better accuracy preservation but it requires more parameters.
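The difference can be illustrated with a small sketch, assuming symmetric int8 quantization and a hypothetical two-channel weight matrix whose channels have very different ranges. The helper names are illustrative only.

```python
def quantize_dequantize(values, scale, qmax=127):
    """Symmetric int8 round trip: float -> integer grid -> float."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

# Hypothetical weights: one row per output channel, with very different ranges.
weights = [
    [0.010, -0.020, 0.015],
    [1.000, -2.000, 1.500],
]
flat = [v for row in weights for v in row]

# Per tensor: a single scale derived from the largest magnitude in the tensor.
tensor_scale = max(abs(v) for v in flat) / 127
per_tensor = [quantize_dequantize(row, tensor_scale) for row in weights]

# Per channel: one scale per row, so the small-valued channel keeps its resolution.
channel_scales = [max(abs(v) for v in row) / 127 for row in weights]
per_channel = [quantize_dequantize(row, s) for row, s in zip(weights, channel_scales)]

per_tensor_err = mean_abs_error(flat, [v for row in per_tensor for v in row])
per_channel_err = mean_abs_error(flat, [v for row in per_channel for v in row])

print(f"per-tensor error:  {per_tensor_err:.6f} (1 scale stored)")
print(f"per-channel error: {per_channel_err:.6f} ({len(channel_scales)} scales stored)")
```

The per-channel variant stores one scale per channel instead of one per tensor, which is the extra parameter cost mentioned above.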

**2. Static or Dynamic Quantization of Activations**: Since weights are always known before inference, they can be quantized offline. Activations, however, depend on the input to the model, and there are two ways of quantizing them:

**Static Quantization:** In this type, the min and max ranges of activations are calculated using a calibration/fine-tuning dataset. For PTQ, 200 samples in the calibration dataset are usually enough to determine these ranges. This technique is typically used when both memory bandwidth and compute savings are important, with CNNs being a typical use case.

**Dynamic Quantization:** In this type, the min and max ranges are calculated on the fly during inference. This is useful when model execution time is dominated by loading weights from memory rather than by computing the matrix multiplications, which is the case for LSTM and Transformer-type models with small batch sizes, for example.
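The sketch below contrasts the two, assuming unsigned 8-bit activations and simple min/max range estimation. The calibration batches and helper names are hypothetical. The static path fixes its range from calibration data, so an unusually large activation at inference gets clipped, while the dynamic path recomputes the range from each input.

```python
def affine_params(lo, hi, qmin=0, qmax=255):
    """Scale and zero point mapping the float range [lo, hi] onto [qmin, qmax]."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)       # keep 0.0 exactly representable
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero data
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def quantize(values, scale, zero_point, qmin=0, qmax=255):
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]

def dequantize(q_values, scale, zero_point):
    return [scale * (q - zero_point) for q in q_values]

# Static: the range is measured once on calibration data, then frozen.
calibration_batches = [[0.0, 1.0, 2.0], [0.5, 3.0, 4.0]]  # hypothetical activations
seen = [v for batch in calibration_batches for v in batch]
static_scale, static_zp = affine_params(min(seen), max(seen))

# An inference-time input containing a value outside the calibrated range.
activations = [0.5, 6.0]

static_out = dequantize(
    quantize(activations, static_scale, static_zp), static_scale, static_zp
)

# Dynamic: the range is recomputed from the actual input at inference time.
dyn_scale, dyn_zp = affine_params(min(activations), max(activations))
dynamic_out = dequantize(quantize(activations, dyn_scale, dyn_zp), dyn_scale, dyn_zp)

print("static :", [round(v, 3) for v in static_out])
print("dynamic:", [round(v, 3) for v in dynamic_out])
```

The dynamic path pays for its flexibility with the extra min/max computation on every call.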

**3. Symmetric and Affine Quantization:**

**Affine Quantization:** Incorporates scaling (scale S) and shifting (zero point Z), allowing a flexible representation of a wide range of values by adjusting magnitude and position before rounding. The formula for affine quantization is x = S * (x_q - Z), where x_q represents the quantized value and x represents the full-precision value. Equivalently, x_q = round(x / S) + Z, clamped to the target integer range.

**Symmetric Quantization:** A special case of affine quantization in which values are mapped to a symmetric range, i.e. [-a, a]. In this case, the integer space is usually [-127, 127], meaning that -128 is dropped from the regular [-128, 127] signed int8 range. The reason is that making both the real and integer ranges symmetric allows Z = 0. While one of the 256 representable values is lost, this can provide a speedup since the additional operations involving the zero point can be skipped.
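Both schemes can be sketched with the formula x = S * (x_q - Z) from above. The example assumes signed int8 and a hypothetical asymmetric value range of [-1, 3]. The function names are illustrative rather than taken from any library.

```python
def quantize(x, scale, zero_point, qmin, qmax):
    """x_q = round(x / S) + Z, clamped to the integer range."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))

def dequantize(x_q, scale, zero_point):
    """x = S * (x_q - Z)"""
    return scale * (x_q - zero_point)

lo, hi = -1.0, 3.0                     # hypothetical range of the real values
values = [-1.0, -0.3, 0.0, 1.7, 3.0]

# Affine: the full [-128, 127] range is stretched over [lo, hi].
affine_scale = (hi - lo) / (127 - (-128))
affine_zp = round(-128 - lo / affine_scale)

# Symmetric: [-a, a] is mapped onto [-127, 127], which forces Z = 0.
a = max(abs(lo), abs(hi))
sym_scale = a / 127
sym_zp = 0

worst_error = {}
for name, s, z, qmin in [
    ("affine", affine_scale, affine_zp, -128),
    ("symmetric", sym_scale, sym_zp, -127),
]:
    restored = [dequantize(quantize(v, s, z, qmin, 127), s, z) for v in values]
    worst_error[name] = max(abs(v - r) for v, r in zip(values, restored))
    print(f"{name:9s} S={s:.5f} Z={z:4d} worst round-trip error={worst_error[name]:.5f}")
```

For this asymmetric range the symmetric scheme ends up with a coarser step, since part of its integer range covers values below -1 that never occur. That is the trade-off for dropping the zero point.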

Pruning is a model compression technique that involves removing the unnecessary connections or weights from a neural network. The goal of pruning is to reduce the size of the network while maintaining its accuracy. Pruning can be done in different ways, such as removing the smallest weights, or removing the weights that have the least impact on the output of the network.

From a technical perspective, pruning involves three main steps: (1) training the original neural network, (2) identifying the connections or weights to prune, and (3) fine-tuning the pruned network. The first step involves training the original neural network to a desired level of accuracy. The second step involves identifying the connections or weights to prune based on a certain criterion, such as the magnitude of the weights or their impact on the output of the network. The third step involves fine-tuning the pruned network to restore its accuracy.
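The second step can be sketched as follows, assuming the magnitude criterion and a hypothetical list of already-trained weights. The resulting mask would typically be kept fixed during the fine-tuning step so that pruned weights stay at zero.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that removes the `sparsity` fraction of smallest-|w| weights."""
    k = int(len(weights) * sparsity)  # number of weights to prune
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    prune_idx = set(by_magnitude[:k])
    return [0 if i in prune_idx else 1 for i in range(len(weights))]

# Step 1 (assumed already done): weights of a trained layer, hypothetical values.
trained = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]

# Step 2: decide what to prune based on magnitude.
mask = magnitude_mask(trained, sparsity=0.5)
pruned = [w if m else 0.0 for w, m in zip(trained, mask)]

# Step 3 (not shown): fine-tune while re-applying `mask` after every update.
print("mask:  ", mask)
print("pruned:", pruned)
```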

There are two main types of pruning techniques:

**Unstructured Pruning**, in which there are no constraints on which weights can be pruned, resulting in sparse weights, and **Structured Pruning**, which focuses on pruning larger structures such as whole neurons, convolution filters, or channels, and directly leads to a reduction in memory (which should translate to an inference-time speedup as well, given the reduction in the number of arithmetic operations). It should be noted that most frameworks and hardware cannot accelerate sparse matrix computation. However, there are inference engines such as **Neural Magic’s** CPU accelerator **DeepSparse Engine**, **Apache TVM**, and **Nvidia’s TensorRT** that are able to leverage the sparsity in networks to reduce memory requirements as well as provide significant speedups.
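A small sketch of the two types, using a hypothetical 4x3 weight matrix where each row stands for one neuron. The magnitude threshold and the L2-norm ranking are assumptions chosen for illustration. Note that the unstructured result keeps its shape and merely contains zeros, whereas the structured result is a genuinely smaller dense matrix.

```python
import math

# Hypothetical layer: 4 neurons (rows) with 3 inputs each.
weights = [
    [0.90, -0.20, 0.40],
    [0.05, 0.02, -0.03],
    [-0.70, 0.60, 0.08],
    [0.01, -0.04, 0.02],
]

def unstructured_prune(matrix, threshold):
    """Zero out individual weights whose magnitude is below the threshold."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in matrix]

def structured_prune(matrix, keep):
    """Keep only the `keep` rows (neurons) with the largest L2 norm."""
    norms = [math.sqrt(sum(w * w for w in row)) for row in matrix]
    ranked = sorted(range(len(matrix)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original neuron order
    return [matrix[i] for i in kept]

sparse = unstructured_prune(weights, threshold=0.1)
dense_small = structured_prune(weights, keep=2)

num_zeros = sum(w == 0.0 for row in sparse for w in row)
print(f"unstructured: shape {len(sparse)}x{len(sparse[0])}, {num_zeros} zeros")
print(f"structured:   shape {len(dense_small)}x{len(dense_small[0])}")
```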

- Pruning can also be divided into **Iterative Pruning**, in which pruning is performed over several iterations while training/fine-tuning a model, and **OneShot**, where the model and/or weights are pruned in just one pass. The former is generally recommended, as the one-shot method tends to cause considerable accuracy degradation.

- Finally, pruning can also be divided based on whether it is performed **locally**, i.e. only within specific layers, or **globally** across the whole network.
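The local versus global distinction can be sketched as below, assuming magnitude-based pruning and two hypothetical layers with very different weight scales. With the same overall sparsity, local pruning removes half of each layer, while global pruning ranks all weights together and can remove an entire low-magnitude layer.

```python
def local_prune(layers, sparsity):
    """Prune the smallest-magnitude weights within each layer separately."""
    pruned = {}
    for name, w in layers.items():
        count = int(len(w) * sparsity)
        drop = set(sorted(range(len(w)), key=lambda i: abs(w[i]))[:count])
        pruned[name] = [0.0 if i in drop else v for i, v in enumerate(w)]
    return pruned

def global_prune(layers, sparsity):
    """Rank all weights of all layers together and prune the smallest ones."""
    flat = [(name, i, v) for name, w in layers.items() for i, v in enumerate(w)]
    count = int(len(flat) * sparsity)
    drop = {(name, i) for name, i, _ in sorted(flat, key=lambda t: abs(t[2]))[:count]}
    return {
        name: [0.0 if (name, i) in drop else v for i, v in enumerate(w)]
        for name, w in layers.items()
    }

def remaining(pruned):
    return {name: sum(v != 0.0 for v in w) for name, w in pruned.items()}

# Hypothetical layers with very different magnitudes.
layers = {
    "layer_a": [0.9, -0.8, 0.7, 0.6],
    "layer_b": [0.05, -0.04, 0.03, 0.02],
}

local_left = remaining(local_prune(layers, 0.5))
global_left = remaining(global_prune(layers, 0.5))
print("local :", local_left)
print("global:", global_left)
```

In practice, global criteria are often combined with per-layer normalisation or minimum-density limits to avoid wiping out a whole layer, though that is beyond this sketch.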

Pruning is ideally used to speed up the training and/or inference of neural networks by reducing the number of parameters that need to be updated during each iteration. However, it is important to note that pruning may not always lead to a significant reduction in the size of the network and may require careful tuning to achieve the desired level of accuracy. Also, it must be noted that hardware support for pruned models is limited at the moment.

Tensorization is a model compression technique that involves decomposing the weight tensors of a neural network into smaller tensors with lower ranks. In machine learning, it is used to reveal underlying patterns and structures within the data whilst also reducing its size. Tensorization has many practical use cases in ML, such as detecting latent structure in the data (e.g. when representing temporal or multi-relational data) and latent variable modelling.

The goal of tensorization is to reduce the number of parameters in the network while maintaining its accuracy. Tensorization can be done using different methods, such as singular value decomposition (SVD), tensor train decomposition, or Tucker decomposition.

Tensorization is most useful when a model can be optimised at a mathematical level, i.e. when the model’s layers can be further broken down into lower-rank tensors to reduce the number of parameters needed for computation.
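As a minimal sketch of why lower-rank factors save parameters, consider replacing an m x n weight matrix W with the product of an m x r matrix A and an r x n matrix B. The layer sizes and the rank below are hypothetical, and the tiny example uses a matrix that is exactly rank 1 so the factorised computation matches the full one without approximation error.

```python
def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def param_counts(m, n, rank):
    """Parameters of a full m x n matrix versus its rank-`rank` factorisation."""
    return m * n, rank * (m + n)

# Hypothetical 512 x 512 layer factorised at rank 32.
full, factored = param_counts(512, 512, 32)
print(f"full: {full}, factored: {factored}, reduction: {full / factored:.0f}x")

# Tiny exact example: W = A @ B has rank 1.
A = [[1], [2], [3]]   # 3 x 1
B = [[4, 5, 6]]       # 1 x 3
W = matmul(A, B)      # 3 x 3
x = [1, 0, -1]

y_full = matvec(W, x)                  # uses 9 parameters
y_factored = matvec(A, matvec(B, x))   # uses 6 parameters
print(y_full, y_factored)
```

For real weight matrices the factorisation is only approximate, and the rank controls the trade-off between compression and accuracy.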

Various tensor decomposition algorithms aim to factorise a given tensor into a set of smaller tensors, which represent different aspects or modes of the data. Some of the most important algorithms include Canonical Polyadic Decomposition (CP or PARAFAC), which decomposes a tensor into a sum of rank-one tensors, each capturing one mode of variation; Tensor Train Decomposition (TT), which factorises a tensor into a network of smaller tensors, making it efficient for high-dimensional data representation; and Tucker Decomposition, which decomposes a tensor into a core tensor and factor matrices for each mode, allowing for efficient approximation of high-dimensional data while preserving structure.

Knowledge distillation is a model compression technique that involves transferring the knowledge from a large, complex neural network (teacher network) to a smaller, simpler neural network (student network). The goal of knowledge distillation is to reduce the size of the network while maintaining its accuracy by leveraging the knowledge learned by the teacher network.

From a high level perspective, knowledge distillation involves two main steps:

- Training the teacher network: Involves training the teacher network to a desired level of accuracy

- Training the student network: Using the outputs of the teacher network as soft targets, which are probability distributions over the classes instead of hard labels

The student network is trained to mimic the behaviour of the teacher network by minimising the difference between the soft targets and its own predictions.
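One common way to measure that difference is the KL divergence between the teacher's soft targets and the student's predicted distribution. The sketch below assumes this loss and also a softmax temperature, a frequently used knob that is not discussed above. The logits are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature gives softer targets."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    soft_targets = softmax(teacher_logits, temperature)    # from the teacher
    student_probs = softmax(student_logits, temperature)   # student predictions
    return kl_divergence(soft_targets, student_probs)

teacher = [4.0, 1.0, 0.2]        # hypothetical teacher logits for 3 classes
student_far = [0.5, 2.0, 1.0]    # student that disagrees with the teacher
student_near = [3.5, 1.2, 0.3]   # student that mostly agrees

loss_far = distillation_loss(teacher, student_far)
loss_near = distillation_loss(teacher, student_near)
print(f"far: {loss_far:.4f}, near: {loss_near:.4f}")
```

In many setups this term is combined with a standard loss on the hard labels, with a weighting factor between the two.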

Knowledge Distillation techniques generally pertain to one of the following categories:

**Offline Distillation**: The most common distillation technique, which uses a pre-trained teacher model to guide the student model. It is also the easiest to implement.

**Online Distillation** (or deep mutual learning): Used when a pre-trained teacher model isn’t available. In this technique, both the teacher and student models are updated simultaneously in a single end-to-end training process.

**Self Distillation**: A special case of online distillation that uses the same model as both the teacher and the student. In this type, knowledge from the deeper layers of the network is used to guide the shallower layers.

Throughout this blog post, we have given an overview of some model compression techniques that shine in different situations, use cases, architectures and hardware devices. However, there is much more to it! This is just the introduction of our Model Compression blog series, so stay tuned for the next posts where we will dive deeper into each one of these techniques, the corresponding tools and workflows to apply them to your models, and when to apply each one.
