Why LLM Benchmarks Can Be Misleading - AWQ vs. GPTQ
Local LLMs Series: In this part, I show a comparison of different benchmarks with quantized models. The results of official benchmarks can differ a lot from custom benchmarks.
tl;dr - What is the Post About?
- Official benchmarks, like IFEval, often do not reflect real-world performance accurately.
- Quantized models (AWQ, GPTQ) can be run efficiently on consumer GPUs like the RTX 3090 but exhibit different behaviors depending on the benchmark type.
- While IFEval showed almost identical performance across all models, custom benchmarks revealed that GPTQ performed significantly worse than full-precision and AWQ models.
- The AWQ variant showed performance indistinguishable from the full-precision (bf16) model in both benchmarks.
- GPTQ's weaker performance may be due to overfitting its calibration data.
- For applications requiring high reliability, prefer AWQ over GPTQ. Always run custom benchmarks tailored to your specific use case instead of relying solely on leaderboard scores.
About the Series
This is the first part of the small Local LLMs series, created for the talk "Energy Efficiency in AI: Use of Quantized Language Models" at the CNCF Sustainability Week Stuttgart 2024 event on October 10, 2024. A big thank you to Red Hat for hosting such a great event and for giving me the opportunity to speak on this topic.
In this post, I will share a comparison of different benchmarks using quantized models. It's interesting to note how the official benchmark results can vary significantly from custom benchmarks.
Introduction
In this blog post, I will compare the performance of quantized models with the full precision model on the IFEval benchmark and a custom benchmark. The goal is to show that the results of open benchmarks can differ significantly from custom benchmarks. This is important because benchmarks may not always reflect the real-world performance of a model. Let's face it, all major companies optimize their models for benchmarks.
For this comparison, I used Meta's Llama 3.1 8B model, along with its quantized versions using the AWQ and GPTQ quantization algorithms. All variants can easily be run on consumer-grade hardware (an RTX 3090).
For the official evaluation, I chose the IFEval benchmark, as it is not just a multiple-choice test and consists of several different task types. However, the quality of the questions could be better (see this discussion on Twitter).
To create a custom benchmark, I generated synthetic emails with varying structures in German. German was chosen intentionally to evaluate the model’s ability to handle out-of-domain data. Below is an example email conversation and the desired output.
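The dataset itself is not reproduced here, so the example below is a hypothetical stand-in for the synthetic emails; the sender, items, and dates are made up purely for illustration (it is an order for three packs of printer paper and two toner cartridges, with a requested delivery date):

```text
Betreff: Bestellung Büromaterial

Sehr geehrte Damen und Herren,

hiermit bestelle ich 3 Packungen Druckerpapier (A4, 80 g) sowie 2 Tonerkartuschen
für den Drucker HP LaserJet Pro. Bitte liefern Sie die Ware bis zum 15.09.2024 an
unsere Niederlassung in Stuttgart.

Mit freundlichen Grüßen
Max Mustermann
Beispiel GmbH
```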
Desired JSON Output:
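A structure roughly along these lines, where the exact field names are my assumption for illustration:

```json
{
  "sender": "Max Mustermann",
  "company": "Beispiel GmbH",
  "delivery_date": "15.09.2024",
  "items": [
    { "name": "Druckerpapier A4 80 g", "quantity": 3 },
    { "name": "Tonerkartusche HP LaserJet Pro", "quantity": 2 }
  ]
}
```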
In a future dataset, I would also include pricing information, especially with discounts or additional calculations. The task is not too complex, so there is certainly room for improvement, but I just wanted a quick, realistic use case. In total, I generated around 50 emails with different structures.
Inference with Quantized and Non-Quantized Models
For inference, I used the open-source vllm engine, which supports multiple quantization algorithms such as AWQ and GPTQ. All experiments were run on a machine with two RTX 3090 GPUs; however, only one RTX 3090 was used for the actual inference. Below are the commands I used to serve the models:
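Roughly along the following lines; the quantized checkpoint names are an assumption on my part, so adjust them to whichever AWQ/GPTQ weights you use:

```bash
# Full precision (bf16)
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --dtype bfloat16 --max-model-len 8192

# AWQ variant (checkpoint name is an assumption)
vllm serve hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 --quantization awq --max-model-len 8192

# GPTQ variant (checkpoint name is an assumption)
vllm serve hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 --quantization gptq --max-model-len 8192
```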
During testing, I used the sglang library to get structured output, as vllm didn't have grammar support at that time. SGLang enables constrained decoding and structured generation. In contrast to the reduced functionality of the OpenAI models, open-source models offer much more fine-grained control, such as regex patterns, categories, constrained lengths, and more. It has evolved into a great alternative to vllm. The command for serving is:
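Again only a sketch; the port and checkpoint names are assumptions:

```bash
# Full precision (bf16)
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000

# Quantized variants (checkpoint names are assumptions)
python -m sglang.launch_server --model-path hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 --quantization awq --port 30000
python -m sglang.launch_server --model-path hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 --quantization gptq --port 30000
```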
You will see later why the IFEval benchmarks were also run with SGLang. 😉
Results: IFEval Benchmark
For this evaluation, I chose the IFEval benchmark due to its diverse set of tasks. While some tasks in the benchmark are of lower quality, it's still a more practical and realistic test compared to multiple-choice questions. IFEval includes a variety of task types, such as:
- Keywords:
  - Include keywords `{keyword1}`, `{keyword2}` in your response
  - In your response, the word `{word}` should appear `{N}` times.
- Length Constraints:
  - Answer with at least/around/at most `{N}` words
  - Your response should contain `{N}` paragraphs
- Formatting Requirements:
  - Your answer must contain exactly `{N}` bullet points
  - Your answer must contain a title in double angular brackets
- Language and Style:
  - Your ENTIRE response should be in `{language}`
  - Your entire response should be in all lowercase/uppercase letters
- Content Structure:
  - Your response must have `{N}` sections with specific markers
  - At the end, add a postscript starting with `{postscript marker}`
When running the benchmark, you'll get four metrics: `inst_level_loose_acc`, `inst_level_strict_acc`, `prompt_level_loose_acc`, and `prompt_level_strict_acc`. According to the Llama 3.1 paper by Meta, the results should be summarized by averaging the instruction-level and prompt-level accuracies. Here's how they describe it:
We report the average of prompt-level and instruction-level accuracy, under strict and loose constraints in Table 2.
So, to follow the same standard, we'll calculate the mean of the instruction-level and prompt-level accuracy for both strict and loose constraints.
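In code, the aggregation is as simple as the following sketch; the dictionary keys match the metric names listed above:

```python
# Average instruction-level and prompt-level accuracy, as described in the Llama 3.1 paper
def ifeval_summary(results: dict) -> dict:
    return {
        "strict": (results["inst_level_strict_acc"] + results["prompt_level_strict_acc"]) / 2,
        "loose": (results["inst_level_loose_acc"] + results["prompt_level_loose_acc"]) / 2,
    }
```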
I served the model as an OpenAI-compatible server and used the lm-evaluation-harness from EleutherAI with the following command:
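Something along these lines; the endpoint URL, concurrency, and output path are assumptions:

```bash
lm_eval --model local-chat-completions \
  --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,base_url=http://localhost:8000/v1/chat/completions,num_concurrent=8 \
  --tasks ifeval \
  --apply_chat_template \
  --output_path results/
```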
A single benchmark run was relatively short, taking around two minutes on a single RTX 3090. The results are shown in the diagram below.
It may seem odd that the AWQ model outperformed the base model on the IFEval benchmark. This is not what I expected, as quantization usually leads to a decrease in performance. However, the difference is not significant, and the results are within the margin of error.
Results: Custom Benchmark
For the custom benchmark, I used the synthetic emails I generated earlier. I served the model with SGLang and created multiple different strategies for the extraction. Your best bet for extraction and classification would probably be to set the temperature to 0; however, I wanted to see how the models react with a temperature of 1.0.
Let's start by creating the regex patterns for the extraction with SGLang. The regex patterns define the structure of the output generated from the emails. Here are the patterns I used:
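Illustrative versions of such patterns; the exact field names and formats are my assumptions, based on the extraction task described above:

```python
# Regex constraints for the structured generation (illustrative sketch)
DATE_REGEX = r"\d{2}\.\d{2}\.\d{4}"                 # e.g. 15.09.2024 (German date format)
QUANTITY_REGEX = r"\d{1,4}"                         # small integer quantities
NAME_REGEX = r"[A-Za-zÄÖÜäöüß\-\. ]{2,60}"          # person or company names with a length limit
```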
You can see that, in contrast to OpenAI's structured generation, SGLang supports complex regex patterns. This is a huge advantage for structured output generation: you can, for example, define the exact length of a string, the allowed characters, and much more. This is especially useful for things like date formats or specific identifiers. With the OpenAI models, you can only specify `String`, `Number`, `Boolean`, `Integer`, `Object`, `Array`, `Enum`, and `anyOf`.
You could get a similar result with a more complex structure, for example modeling the date as an object with day, month, and year as integers, but it is not as convenient as a regex.
For the prompt, I intentionally used one without an explicit example, as prompting with examples will be part of a later blog post.
I then created the SGLang function as follows:
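A condensed sketch of such a function, using SGLang's frontend language; the prompt wording, field names, port, and regex constraints are simplified assumptions rather than the exact code I used:

```python
import sglang as sgl

# Connect to the running SGLang server (port is an assumption)
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

# Illustrative regex constraints (same idea as the patterns shown above)
DATE_REGEX = r"\d{2}\.\d{2}\.\d{4}"
QUANTITY_REGEX = r"\d{1,4}"

@sgl.function
def extract_order(s, email: str):
    # Zero-shot instruction followed by regex-constrained generation per field
    s += sgl.user("Extract the order information from the following German email:\n\n" + email)
    s += sgl.assistant_begin()
    s += "Delivery date: " + sgl.gen("delivery_date", regex=DATE_REGEX, max_tokens=8)
    s += "\nQuantity: " + sgl.gen("quantity", regex=QUANTITY_REGEX, max_tokens=4)
    s += sgl.assistant_end()

# Run a single extraction with temperature 1.0, as in the experiments
state = extract_order.run(email="...", temperature=1.0)
print(state["delivery_date"], state["quantity"])
```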
As mentioned before, the extraction is done with a temperature of 1.0. Therefore, I ran the function over 200 times across the ~50 emails and logged the results to MLflow using the following snippets:
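A minimal sketch of the logging loop; the experiment name, metric name, run structure, and the accuracy computation over the hypothetical `emails` and `labels` lists are assumptions, and `extract_order` refers to the sketch above:

```python
import mlflow

# `emails` and `labels` stand for the ~50 synthetic emails and their expected field values;
# `extract_order` is the SGLang function sketched above.
mlflow.set_experiment("llama-3.1-8b-custom-benchmark")  # experiment name is an assumption

with mlflow.start_run(run_name="gptq-temperature-1.0"):
    # Repeat the extraction to capture sampling variance at temperature 1.0
    for step in range(200):
        predictions = [extract_order.run(email=e, temperature=1.0) for e in emails]
        accuracy = sum(
            p["delivery_date"] == label["delivery_date"] for p, label in zip(predictions, labels)
        ) / len(labels)
        mlflow.log_metric("accuracy", accuracy, step=step)
```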
The results are shown in the boxplot below. The corresponding values for the median, the 25th and 75th percentiles, and the whiskers were extracted from the MLflow plots.
Llama 3.1 8B - Custom Benchmark Comparison
The results show that the GPTQ model performs significantly worse than the full-precision model on the custom benchmark.
Keep in mind that the y-axis here is scaled from 0.5 to 1.0. The custom benchmark results for GPTQ are quite different from the IFEval results: the GPTQ model performed significantly worse than the full-precision model. This seemed odd, and at first I thought the issue might be caused by the SGLang library. However, I reran the IFEval benchmark with the SGLang backend instead of vllm, and the results were the same.
Conclusion
Official benchmarks and the real-world performance of LLMs can be quite different, similar to how official fuel consumption ratings for cars rarely match real driving conditions. This became clear in my testing: while the IFEval benchmark suggested that all model versions would perform equally, the custom benchmark showed that the GPTQ model actually performed much worse than the full-precision or AWQ ones. This difference wasn't related to the SGLang library, as I reran the benchmarks with both inference libraries after seeing such large deviations.
The performance gap might be explained by an observation from the ZeroQuant(4+2) quantization paper: GPTQ tends to overfit its calibration data. Because of this, when choosing between AWQ and GPTQ models, AWQ should be the preferred choice.
Just as the official fuel consumption values in the automotive industry are often not achievable in real life, the same can be true for LLM benchmarks. The scores you see on the LLM Leaderboard might not reflect how well a model will work for your specific needs. Instead of chasing benchmark scores, it's safer to stick with proven models like Llama 3.3, Qwen, or Mistral, and to run custom benchmarks tailored to your use case.