The LLM landscape is moving incredibly fast, with new models coming out every week. In the last few weeks alone, Mamba showed that structured state space models can be more parameter-efficient than transformers, Mixtral showed that mixture-of-experts models can outperform monolithic models such as Llama 2, and MIQU (leaked from Mistral) suggests that open-source LLMs may be fast approaching GPT-4 capabilities. With so many providers offering different LLMs, and different endpoints for the same LLMs, all with varying costs and runtime performance, there has never been a more urgent need for objective benchmarks.
Valiant efforts have recently been made, such as Anyscale’s LLMPerf leaderboard and Martian Router’s LLM Inference Provider Leaderboard. However, these benchmarks take the form of static tables. In this post, we argue that static benchmarks are simply not enough, and that benchmarking data must be presented across time for any meaningful comparison to be made.
When benchmarking hardware, it’s fair to assume that the runtime performance of the hardware will be the same whether you test it today or tomorrow. Static benchmarks such as MLPerf rely on this assumption. The metrics measured can be assigned to the hardware once, and then these static metrics and scores are considered intrinsic to the hardware.
However, public endpoints for LLMs do not behave like this at all. From the perspective of an end user, their runtime performance varies drastically over time, and for a number of reasons. Unlike a piece of hardware, an endpoint is not a static system; it is instead a gateway into a black-box system that is opaque to the user. Factors which can (and will) change over time, and which affect the runtime performance of this black-box system, include:
- the overall traffic hitting the endpoint,
- the number of GPUs reserved for it at any moment in time,
- the network speed, among others.
As a result, the runtime performance of these endpoints is best thought of as time-series data rather than as a fixed, static metric. To illustrate this point, consider the data presented below, which shows the tokens/second for several different providers of Llama 2 70B throughout a single day.
Had we taken a single set of measurements at 03:30 AM, we would have concluded that Together AI had the fastest endpoints. However, had we instead taken the measurements at 12:30 PM on the very same day, we would have concluded that Together AI was slower than Anyscale, Perplexity, and Replicate.
Each data point presented above is averaged over large input sequences and over several concurrent requests, so the variations observed throughout the day are not measurement noise. They come from transience in the underlying system itself, which is affected by factors such as the overall traffic to the endpoint, the number of GPUs reserved at that moment in time, the network speed, and so on.
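For illustration, here is a minimal sketch (not the actual AI Bench implementation) of how such a time series could be collected. The `stream_fn` callable is a hypothetical stand-in for whichever provider's streaming client is being benchmarked, assumed to yield output tokens as they arrive:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, Iterable


def tokens_per_second(stream_fn: Callable[[str], Iterable[str]], prompt: str) -> float:
    """Time one streamed request and return output tokens/second."""
    start = time.perf_counter()
    n_tokens = sum(1 for _ in stream_fn(prompt))
    return n_tokens / (time.perf_counter() - start)


def sample(stream_fn: Callable[[str], Iterable[str]], prompt: str, concurrency: int = 8) -> dict:
    """One time-series data point: the mean rate across concurrent requests."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        rates = list(pool.map(lambda p: tokens_per_second(stream_fn, p),
                              [prompt] * concurrency))
    return {"timestamp": time.time(), "tokens_per_second": mean(rates)}


# Collected every few minutes over a day, such samples form a time series
# like the one plotted above, rather than a single static score.
```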
We observe similar trends to the graph above across all metrics, models, and providers. With such transient data being the norm, this raises the question: do static runtime scoreboards make any sense at all? In our view, static scoreboards for runtime performance are not especially helpful, and they can disguise the fact that the metrics are constantly changing, sometimes on an hour-by-hour basis.
Our benchmarks present the raw data across time for the key metrics: input cost, output cost, time-to-first-token (TTFT), output tokens per second, inter-token latency (ITL), end-to-end latency (E2E), and cold-start time, with tables presenting the most recent values.
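To make the latency metrics concrete, the sketch below derives TTFT, ITL, E2E latency, and output tokens per second from the token arrival times of a single streamed response. The `StreamTiming` structure is a hypothetical stand-in for whatever timestamps the benchmarking client records, and the exact definitions (plus cold-start and cost tracking) are covered in the methodology:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamTiming:
    request_sent: float          # wall-clock time the request was sent (seconds)
    token_arrivals: List[float]  # wall-clock arrival time of each output token


def latency_metrics(t: StreamTiming) -> dict:
    """Derive the core latency metrics from a single streamed response."""
    first, last = t.token_arrivals[0], t.token_arrivals[-1]
    n = len(t.token_arrivals)
    ttft = first - t.request_sent                     # time-to-first-token
    e2e = last - t.request_sent                       # end-to-end latency
    itl = (last - first) / (n - 1) if n > 1 else 0.0  # mean inter-token latency
    tps = (n - 1) / (last - first) if last > first else float("nan")
    return {"ttft": ttft, "itl": itl, "e2e": e2e, "output_tokens_per_second": tps}


# With made-up timestamps: first token after 0.35 s, then one token every 0.05 s.
timing = StreamTiming(request_sent=0.0, token_arrivals=[0.35, 0.40, 0.45, 0.50])
print(latency_metrics(timing))
# -> TTFT 0.35 s, ITL ~0.05 s, E2E 0.5 s, ~20 output tokens/second
```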
Given the inherently transient and highly volatile nature of these metrics, we avoid scoreboards and instead present the raw data, leaving it to users to leverage this data and make genuinely informed decisions about the endpoint they would like to use for their application.
In reality, every application depends on each of these metrics to a different extent. For some time-critical applications, the ITL is of paramount importance. For other, non-time-critical applications, it’s all about minimizing the cost. Going further, for input-heavy applications such as document summarization, the input cost is the most important thing to minimize, while for output-heavy applications such as content creation, it is the output cost. For yet other applications, the TTFT matters most, where there is a need to create a very responsive feel for the end user. If these raw metrics are ever combined to create “scores”, then this should always be done in a task-specific manner. We therefore leave all such “scoring” to the user for now.
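For readers who do want to combine the raw metrics themselves, a minimal, hypothetical sketch of a task-specific score might look like the following. The weights and bounds are user-supplied assumptions, and nothing here is part of the benchmarks themselves:

```python
# Hypothetical task-specific scoring: every metric below is "lower is better",
# so the score is a weighted sum of min-max-normalised metric values.
def task_score(metrics: dict, weights: dict, bounds: dict) -> float:
    """metrics: e.g. {"ttft": 0.4, "itl": 0.03, "output_cost": 0.9} for one endpoint
    weights: relative importance of each metric for the task at hand
    bounds:  (min, max) of each metric observed across endpoints, for normalisation"""
    score = 0.0
    for name, weight in weights.items():
        lo, hi = bounds[name]
        normalised = (metrics[name] - lo) / (hi - lo) if hi > lo else 0.0
        score += weight * normalised
    return score  # lower is better


# A latency-sensitive chat assistant might weight TTFT and ITL heavily...
chat_weights = {"ttft": 0.4, "itl": 0.4, "output_cost": 0.2}
# ...while a document-summarization pipeline might weight input cost instead.
summarization_weights = {"input_cost": 0.7, "e2e": 0.2, "output_cost": 0.1}
```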
Our benchmarking logic, named AI Bench, is fully open source. The full benchmarking methodology is also explained in detail here. We strongly welcome and encourage feedback from the community!
As a next step, we plan to create dynamic and customizable scoring systems, where users can specify the relative importance of each metric, along with soft and hard constraints, before we propose the best model for them based on their preferences and their specific use case.
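As a rough, hypothetical sketch of what such a preference-driven recommendation could look like (none of this reflects a finalised design), hard constraints might filter endpoints out entirely while soft preferences rank whatever remains:

```python
# Hypothetical preference-driven selection: hard constraints filter endpoints,
# soft preference weights rank the survivors (lower metric values are better).
def recommend(endpoints: dict, hard: dict, soft_weights: dict) -> str:
    """endpoints:    {name: {metric: latest_value}}
    hard:         {metric: maximum acceptable value}
    soft_weights: {metric: relative importance}"""
    feasible = {name: m for name, m in endpoints.items()
                if all(m[k] <= limit for k, limit in hard.items())}
    if not feasible:
        raise ValueError("no endpoint satisfies the hard constraints")
    # In practice the metrics would be normalised first, as in the scoring sketch above.
    return min(feasible, key=lambda name: sum(
        w * feasible[name][k] for k, w in soft_weights.items()))


best = recommend(
    endpoints={"provider_a": {"ttft": 0.6, "output_cost": 0.8},
               "provider_b": {"ttft": 0.3, "output_cost": 1.2}},
    hard={"ttft": 1.0},                       # e.g. TTFT must stay under 1 second
    soft_weights={"ttft": 0.7, "output_cost": 0.3},
)
print(best)  # "provider_b": its lower TTFT outweighs its higher output cost
```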
Stay tuned for more updates as we work on our dynamic scoring and recommendation features! For feedback, please email us at hello@unify.ai or simply tag us on Twitter if you have any feature requests, comments, or suggestions. We’d love to hear from you 😊
Use the Unify API to send your prompts to the best LLM endpoints and get your LLM applications flying