Vendor-specific APIs provide an interface for defining customized operations on hardware from a specific vendor. Their libraries are written exclusively for that vendor's hardware, so the code is neither generalized nor intended to be. These APIs are typically used by higher-level, multi-vendor compilers and frameworks, and most machine learning practitioners never interface with them directly.
Built on top of CUDA, TensorRT is a C++ library for high-performance inference on NVIDIA GPUs and deep learning accelerators. It is integrated with PyTorch and TensorFlow, and for models trained in a proprietary or custom framework, the TensorRT C++ API can be used to import and accelerate them. Several optimizations contribute to its performance: reduced precision (such as FP16 or INT8) maximizes throughput; layer and tensor fusion combines operations into single kernels to make better use of GPU memory and bandwidth; kernel auto-tuning selects the best data layouts and algorithms for the target GPU; time fusion optimizes recurrent neural networks across time steps; multi-stream execution processes multiple input streams in parallel; and dynamic tensor memory reuses tensor memory to reduce the memory footprint.
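To illustrate why layer and tensor fusion saves memory traffic, the following sketch contrasts an unfused bias add followed by ReLU, which writes and then rereads a full intermediate tensor, with a fused version that does both in one pass. It is plain C++ running on the CPU, not TensorRT's API: the function names `bias_relu_unfused` and `bias_relu_fused` are hypothetical, and TensorRT applies this kind of fusion automatically when it builds an engine (its fusions also cover larger patterns such as convolution followed by bias and ReLU).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Unfused: two separate element-wise passes ("kernels").
// The intermediate tensor `tmp` is fully written to memory, then read back.
std::vector<float> bias_relu_unfused(const std::vector<float>& x, float bias) {
    std::vector<float> tmp(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        tmp[i] = x[i] + bias;  // pass 1: bias add
    }
    std::vector<float> out(tmp.size());
    for (std::size_t i = 0; i < tmp.size(); ++i) {
        out[i] = std::max(tmp[i], 0.0f);  // pass 2: ReLU
    }
    return out;
}

// Fused: a single pass computes the bias add and ReLU together,
// so no intermediate tensor is allocated or stored.
std::vector<float> bias_relu_fused(const std::vector<float>& x, float bias) {
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = std::max(x[i] + bias, 0.0f);
    }
    return out;
}
```

Both functions return the same values; the fused version simply never materializes `tmp`. On a GPU, where element-wise kernels are usually limited by memory bandwidth, skipping that round trip and the extra kernel launch is where the savings come from.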
Compute Unified Device Architecture (CUDA) is a parallel computing platform and application programming interface (API) that allows software to use NVIDIA graphics processing units (GPUs) for general-purpose processing, an approach called general-purpose computing on GPUs (GPGPU). It is a software layer that gives direct access to the GPU’s virtual instruction set and parallel computational elements for executing compute kernels. It is designed to work with programming languages such as C, C++, and Fortran.
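As a rough illustration of how a compute kernel maps onto CUDA's parallel elements, the following C++ sketch emulates on the CPU a vector-add kernel launched over a grid of thread blocks. In real CUDA the kernel would be declared `__global__`, launched as `vec_add<<<blocks, threads>>>(...)`, and each thread would compute its element index from the built-in variables `blockIdx`, `blockDim`, and `threadIdx`. Here those built-ins are replaced by a hypothetical `Dim` struct, and the names `vec_add_thread` and `launch_vec_add` are likewise illustrative. The nested loops run sequentially, whereas a GPU would execute the threads in parallel.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for CUDA's built-in index types (only the x dimension is used).
struct Dim {
    unsigned x;
};

// Per-thread body of a vector-add kernel. In CUDA this would be a __global__
// function, and blockIdx/blockDim/threadIdx would be built-ins, not parameters.
void vec_add_thread(Dim blockIdx, Dim blockDim, Dim threadIdx,
                    const float* a, const float* b, float* c, std::size_t n) {
    // Global index of this thread across the whole grid.
    const std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < n) {  // the last block may contain threads past the end of the data
        c[i] = a[i] + b[i];
    }
}

// Sequential CPU emulation of vec_add<<<blocks, threads_per_block>>>(a, b, c, n).
// Assumes a.size() == b.size() and threads_per_block > 0.
std::vector<float> launch_vec_add(const std::vector<float>& a,
                                  const std::vector<float>& b,
                                  unsigned threads_per_block) {
    const std::size_t n = a.size();
    std::vector<float> c(n);
    // Enough blocks to cover n elements: ceil(n / threads_per_block).
    const std::size_t blocks = (n + threads_per_block - 1) / threads_per_block;
    for (std::size_t bx = 0; bx < blocks; ++bx) {
        for (unsigned tx = 0; tx < threads_per_block; ++tx) {
            // On a GPU, every (block, thread) pair here would run concurrently.
            vec_add_thread(Dim{static_cast<unsigned>(bx)}, Dim{threads_per_block}, Dim{tx},
                           a.data(), b.data(), c.data(), n);
        }
    }
    return c;
}
```

For example, with 5 elements and 2 threads per block the launch uses 3 blocks (6 threads), and the `i < n` guard keeps the surplus thread from writing out of bounds.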