Why LLM Benchmarks Can Be Misleading - AWQ vs. GPTQ
Local LLMs Series: In this part, I show a comparison of different benchmarks with quantized models. The results of official benchmarks can differ a lot from custom benchmarks.
tl;dr - What is the Post About?
- Official benchmarks, like IFEval, often do not reflect real-world performance accurately.
- Quantized models (AWQ, GPTQ) can be run efficiently on consumer GPUs like the RTX 3090 but exhibit different behaviors depending on the benchmark type.
- While IFEval showed almost identical performance across all models, custom benchmarks revealed that GPTQ performed significantly worse than full-precision and AWQ models.
- The AWQ variant showed performance indistinguishable from the full-precision (bf16) model in both benchmarks.
- GPTQ's weaker performance may be due to overfitting its calibration data.
- For applications requiring high reliability, prefer AWQ over GPTQ. Always run custom benchmarks tailored to your specific use case instead of relying solely on leaderboard scores.
About the Series
This is the first part of the small Local LLMs series, created for the talk "Energy Efficiency in AI: Use of Quantized Language Models" at the CNCF Sustainability Week Stuttgart 2024 event on October 10, 2024. A big thank you to Red Hat for hosting such a great event and for giving me the opportunity to speak on this topic.
In this post, I will share a comparison of different benchmarks using quantized models. It's interesting to note how the official benchmark results can vary significantly from custom benchmarks.
Introduction
In this blog post, I will compare the performance of quantized models with the full precision model on the IFEval benchmark and a custom benchmark. The goal is to show that the results of open benchmarks can differ significantly from custom benchmarks. This is important because benchmarks may not always reflect the real-world performance of a model. Let's face it, all major companies optimize their models for benchmarks.
For this comparison, I used Meta's Llama 3.1 8B model, along with its quantized versions using the AWQ and GPTQ quantization algorithms. All variants can easily be run on consumer-grade hardware (an RTX 3090).
For the official evaluation, I chose the IFEval benchmark, as it is not just a multiple-choice test and consists of several different task types. However, the quality of the questions could be better (see this discussion on Twitter).
To create a custom benchmark, I generated synthetic emails with varying structures in German. German was chosen intentionally to evaluate the model’s ability to handle out-of-domain data. Below is an example email conversation and the desired output.
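The dataset itself is not reproduced here, so the example below is a hypothetical stand-in for the synthetic emails; the sender, items, and dates are made up purely for illustration (it is an order for three packs of printer paper and two toner cartridges, with a requested delivery date):

```text
Betreff: Bestellung Büromaterial

Sehr geehrte Damen und Herren,

hiermit bestelle ich 3 Packungen Druckerpapier (A4, 80 g) sowie 2 Tonerkartuschen
für den Drucker HP LaserJet Pro. Bitte liefern Sie die Ware bis zum 15.09.2024 an
unsere Niederlassung in Stuttgart.

Mit freundlichen Grüßen
Max Mustermann
Beispiel GmbH
```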
Desired JSON Output:
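A structure roughly along these lines, where the exact field names are my assumption for illustration:

```json
{
  "sender": "Max Mustermann",
  "company": "Beispiel GmbH",
  "delivery_date": "15.09.2024",
  "items": [
    { "name": "Druckerpapier A4 80 g", "quantity": 3 },
    { "name": "Tonerkartusche HP LaserJet Pro", "quantity": 2 }
  ]
}
```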
In a future dataset, I would also include pricing information, especially with discounts or additional calculations. The task is not too complex, so there is certainly room for improvement, but I just wanted a quick, realistic use case. In total, I generated around 50 emails with different structures.
Inference with Quantized and Non-Quantized Models
For inference, I used the open-source vllm engine, which supports multiple quantization algorithms such as AWQ and GPTQ. All experiments were run on a machine with two RTX 3090 GPUs; however, only one RTX 3090 was used for the actual inference. Below are the commands I used to serve the models:
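Roughly along the following lines; the quantized checkpoint names are an assumption on my part, so adjust them to whichever AWQ/GPTQ weights you use:

```bash
# Full precision (bf16)
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --dtype bfloat16 --max-model-len 8192

# AWQ variant (checkpoint name is an assumption)
vllm serve hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 --quantization awq --max-model-len 8192

# GPTQ variant (checkpoint name is an assumption)
vllm serve hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 --quantization gptq --max-model-len 8192
```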
During testing, I used the sglang library to get structured output, as vllm didn't have grammar support at that time. SGLang enables constrained decoding and structured generation. In contrast to the reduced functionality of the OpenAI models, open-source models offer much more fine-grained control, such as regex patterns, categories, constrained lengths, and more. It has evolved into a great alternative to vllm. The command for serving is:
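Again only a sketch; the port and checkpoint names are assumptions:

```bash
# Full precision (bf16)
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000

# Quantized variants (checkpoint names are assumptions)
python -m sglang.launch_server --model-path hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 --quantization awq --port 30000
python -m sglang.launch_server --model-path hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 --quantization gptq --port 30000
```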
You will see later why the IFEval benchmarks were also run with SGLang. 😉
Results: IFEval Benchmark
For this evaluation, I chose the IFEval benchmark due to its diverse set of tasks. While some tasks in the benchmark are of lower quality, it's still a more practical and realistic test compared to multiple-choice questions. IFEval includes a variety of task types, such as:
- Keywords:
  - Include keywords `{keyword1}`, `{keyword2}` in your response
  - In your response, the word `{word}` should appear `{N}` times.
- Length Constraints:
  - Answer with at least/around/at most `{N}` words
  - Your response should contain `{N}` paragraphs
- Formatting Requirements:
  - Your answer must contain exactly `{N}` bullet points
  - Your answer must contain a title in double angular brackets
- Language and Style:
  - Your ENTIRE response should be in `{language}`
  - Your entire response should be in all lowercase/uppercase letters
- Content Structure:
  - Your response must have `{N}` sections with specific markers
  - At the end, add a postscript starting with `{postscript marker}`
When running the benchmark, you'll get four metrics: `inst_level_loose_acc`, `inst_level_strict_acc`, `prompt_level_loose_acc`, and `prompt_level_strict_acc`. According to the Llama 3.1 paper by Meta, the results should be summarized by averaging the instruction-level and prompt-level accuracies. Here's how they describe it:
We report the average of prompt-level and instruction-level accuracy, under strict and loose constraints in Table 2.
So, to follow the same standard, we'll calculate the mean of the instruction-level and prompt-level accuracy for both strict and loose constraints.
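In code, the aggregation is as simple as the following sketch; the dictionary keys match the metric names listed above:

```python
# Average instruction-level and prompt-level accuracy, as described in the Llama 3.1 paper
def ifeval_summary(results: dict) -> dict:
    return {
        "strict": (results["inst_level_strict_acc"] + results["prompt_level_strict_acc"]) / 2,
        "loose": (results["inst_level_loose_acc"] + results["prompt_level_loose_acc"]) / 2,
    }
```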
I served the model as an OpenAI-compatible server and used the lm-evaluation-harness from EleutherAI with the following command:
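Something along these lines; the endpoint URL, concurrency, and output path are assumptions:

```bash
lm_eval --model local-chat-completions \
  --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,base_url=http://localhost:8000/v1/chat/completions,num_concurrent=8 \
  --tasks ifeval \
  --apply_chat_template \
  --output_path results/
```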
A single benchmark run was relatively short, taking around two minutes on a single RTX 3090. The results are shown in the diagram below.
It may seem odd that the AWQ model outperformed the base model on the IFEval benchmark. This is not what I expected, as quantization usually leads to a decrease in performance. However, the difference is not significant, and the results are within the margin of error.
Results: Custom Benchmark
For the custom benchmark, I used the synthetic emails I generated earlier. I served the model with SGLang and created multiple different strategies for the extraction. Your best bet for extraction and classification would probably be to set the temperature to 0; however, I wanted to see how the models react with a temperature of 1.0.
Let's start by creating the regex patterns for the extraction with SGLang. The regex patterns define the structure of the output generated from the emails. Here are the patterns I used:
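Illustrative versions of such patterns; the exact field names and formats are my assumptions, based on the extraction task described above:

```python
# Regex constraints for the structured generation (illustrative sketch)
DATE_REGEX = r"\d{2}\.\d{2}\.\d{4}"                 # e.g. 15.09.2024 (German date format)
QUANTITY_REGEX = r"\d{1,4}"                         # small integer quantities
NAME_REGEX = r"[A-Za-zÄÖÜäöüß\-\. ]{2,60}"          # person or company names with a length limit
```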
You can see that, in contrast to OpenAI's structured generation, SGLang supports complex regex patterns. This is a huge advantage for structured output generation: you can, for example, define the exact length of a string, the allowed characters, and much more. This is especially useful for things like date formats or specific identifiers. With the OpenAI models, you can only specify `String`, `Number`, `Boolean`, `Integer`, `Object`, `Array`, `Enum`, and `anyOf`.
You could get a similar result with a more complex structure, for example modeling the date as an object with day, month, and year as integers, but it is not as convenient as a regex.
For the prompt, I intentionally used one without an explicit example, as prompting with examples will be part of a later blog post.
I then created the SGLang function as follows:
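A condensed sketch of such a function, using SGLang's frontend language; the prompt wording, field names, port, and regex constraints are simplified assumptions rather than the exact code I used:

```python
import sglang as sgl

# Connect to the running SGLang server (port is an assumption)
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

# Illustrative regex constraints (same idea as the patterns shown above)
DATE_REGEX = r"\d{2}\.\d{2}\.\d{4}"
QUANTITY_REGEX = r"\d{1,4}"

@sgl.function
def extract_order(s, email: str):
    # Zero-shot instruction followed by regex-constrained generation per field
    s += sgl.user("Extract the order information from the following German email:\n\n" + email)
    s += sgl.assistant_begin()
    s += "Delivery date: " + sgl.gen("delivery_date", regex=DATE_REGEX, max_tokens=8)
    s += "\nQuantity: " + sgl.gen("quantity", regex=QUANTITY_REGEX, max_tokens=4)
    s += sgl.assistant_end()

# Run a single extraction with temperature 1.0, as in the experiments
state = extract_order.run(email="...", temperature=1.0)
print(state["delivery_date"], state["quantity"])
```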
As mentioned before, the extraction is done with a temperature of 1.0. Therefore, I ran the function over 200 times across the ~50 emails and logged the results to MLflow using the following snippets:
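A minimal sketch of the logging loop; the experiment name, metric name, run structure, and the accuracy computation over the hypothetical `emails` and `labels` lists are assumptions, and `extract_order` refers to the sketch above:

```python
import mlflow

# `emails` and `labels` stand for the ~50 synthetic emails and their expected field values;
# `extract_order` is the SGLang function sketched above.
mlflow.set_experiment("llama-3.1-8b-custom-benchmark")  # experiment name is an assumption

with mlflow.start_run(run_name="gptq-temperature-1.0"):
    # Repeat the extraction to capture sampling variance at temperature 1.0
    for step in range(200):
        predictions = [extract_order.run(email=e, temperature=1.0) for e in emails]
        accuracy = sum(
            p["delivery_date"] == label["delivery_date"] for p, label in zip(predictions, labels)
        ) / len(labels)
        mlflow.log_metric("accuracy", accuracy, step=step)
```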
The results are shown in the boxplot below. The corresponding values for the median, the 25th and 75th percentiles, and the whiskers were extracted from the MLflow plots.
Llama 3.1 8B - Custom Benchmark Comparison
The results show that the GPTQ model performs significantly worse than the full-precision model on the custom benchmark.
Keep in mind that the y-axis here is scaled from 0.5 to 1.0. The custom benchmark results for GPTQ are quite different from the IFEval results: the GPTQ model performed significantly worse than the full-precision model. This seemed odd, and at first I thought the issue might be caused by the SGLang library. However, I reran the IFEval benchmark with the SGLang backend instead of vllm, and the results were the same.
Conclusion
Official benchmarks and the real-world performance of LLMs can be quite different, similar to how official fuel consumption ratings for cars rarely match real driving conditions. This became clear in my testing: while the IFEval benchmark suggested that all model versions would perform equally, the custom benchmark showed that the GPTQ model actually performed much worse than the full-precision or AWQ ones. This difference wasn't related to the SGLang library, as I reran the benchmarks with both inference libraries after seeing such large deviations.
The performance gap might be explained by an observation from the ZeroQuant(4+2) quantization paper: GPTQ tends to overfit its calibration data. Because of this, when choosing between AWQ and GPTQ models, AWQ should be the preferred choice.
Just as the official fuel consumption values in the automotive industry are often not achievable in real life, the same can be true for LLM benchmarks. The scores you see on the LLM Leaderboard might not reflect how well a model will work for your specific needs. Instead of chasing benchmark scores, it's safer to stick with proven models like Llama 3.3, Qwen, or Mistral, and to run custom benchmarks tailored to your use case.