LLM leaderboards are platforms that rank Large Language Models (LLMs) based on their performance across various evaluation benchmarks. These leaderboards provide a comparative look at leading models, showing how each performs in tasks ranging from language understanding to content generation. They serve as a reference for researchers, developers, and users to gauge the capabilities and improvement areas of different LLMs.
These leaderboards are updated regularly, incorporating the latest models and benchmarks, making them a dynamic resource in the rapidly advancing field of Natural Language Processing (NLP). They highlight the top-performing models and provide insights into trending technologies and techniques within the community.
This is part of a series of articles about best LLMs.
LLM leaderboards play a crucial role in the rapidly developing field of generative AI.
Below is a list of the leading LLM leaderboards as of the time of this writing. For each leaderboard, we provide a direct link, the number of models compared, and the evaluation benchmarks used. In the following sections we explain each benchmark in more detail. Related content: Read our guide to the Open LLM Leaderboard.
**Direct link:** https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
**Number of models compared:** Hundreds of models, with more added daily
**Evaluation benchmarks:** ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K
Huggingface’s Open LLM Leaderboard aims to foster open collaboration and transparency in the evaluation of language models. It supports a range of datasets and tasks, and encourages contributions from developers, promoting diversity in model entries and continual improvement in benchmarking methods.
The leaderboard uses six widely accepted benchmarks for evaluating LLMs and computes an average of these benchmarks for each model. It also enables easy filtering of models according to model size, precision, and other attributes.
Learn more in our detailed guide to open LLM leaderboard (coming soon)
**Direct link:** https://chat.lmsys.org/?leaderboard
**Number of models compared:** 95+
**Evaluation benchmarks:** Bradley-Terry Human Pairwise Comparisons
The LMSYS Chatbot Arena Leaderboard specifically targets the assessment of models in conversational AI contexts. It tests various chatbots' abilities to handle complex and nuanced dialogues, making it useful for developers focused on enhancing human-computer interaction through language. It uses a benchmarking method that relies on human evaluation of conversational responses.
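To make the ranking method concrete, here is a minimal sketch of fitting Bradley-Terry strengths from pairwise "model A beat model B" votes. It is a simplified illustration, not LMSYS's actual pipeline, and the model names and vote data are made up.

```python
# Sketch: fit Bradley-Terry strengths from pairwise human votes
# using the standard minorization-maximization (MM) update.
from collections import defaultdict

def bradley_terry(battles, iters=200):
    """battles: list of (winner, loser) model-name pairs."""
    wins = defaultdict(int)    # total wins per model
    games = defaultdict(int)   # battles played per unordered pair
    models = set()
    for winner, loser in battles:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            denom = sum(
                games[frozenset((i, j))] / (strength[i] + strength[j])
                for j in models if j != i
            )
            updated[i] = wins[i] / denom if denom else strength[i]
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}  # normalize
    return strength

# Hypothetical votes: "gpt-x" beat "llama-y" twice and lost once.
print(bradley_terry([("gpt-x", "llama-y"), ("gpt-x", "llama-y"), ("llama-y", "gpt-x")]))
```

Under the Bradley-Terry model, the probability that model i beats model j is strength_i / (strength_i + strength_j), so the fitted strengths can be sorted directly into a leaderboard ranking.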
**Direct link:** https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard
**Number of models compared:** 60+
**Evaluation benchmarks:** HumanEval, MultiPL-E
The Big Code Models Leaderboard focuses on evaluating the performance of models intended for programming language processing. It benchmarks models on tasks like code generation, documentation, and bug detection, providing insights into each model's utility in software development environments. It uses a dataset of human-written programming problems created by OpenAI (HumanEval), automatically translated into multiple programming languages (MultiPL-E).
**Direct link:** https://huggingface.co/spaces/mteb/leaderboard
**Number of models compared:** 300+
**Evaluation benchmarks:** Massive Text Embedding Benchmark (MTEB)
The MTEB leaderboard focuses on the performance of text embeddings in LLMs. Embedding is the process of converting text into numerical vectors that can be processed by machine learning algorithms. The MTEB benchmark tests multiple embedding tasks across 58 datasets and 112 languages, measuring how well models convert text into meaningful mathematical representations.
**Direct link:** https://huggingface.co/spaces/mike-ravkine/can-ai-code-results
**Number of models compared:** 300+
**Evaluation benchmarks:** CanAICode Benchmark
The CanAiCode Leaderboard benchmarks models on their ability to handle programming-related tasks, from code generation to problem solving in various programming languages. It uses programming interview questions written by humans, and automatically tests AI-generated code using inference scripts and sandbox environments. This provides a realistic test of AI code quality.
**Direct link:** https://evalplus.github.io/leaderboard.html
**Number of models compared:** 110+
**Evaluation benchmarks:** HumanEval, Mostly Basic Python Programming (MBPP) Benchmark
EvalPlus ranks LLMs on extended versions of HumanEval, OpenAI's dataset of human-written programming problems, and MBPP, a complementary dataset of basic Python programming problems, both augmented with additional test cases to better catch incorrect code.
**Direct link:** https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
**Number of models compared:** 40+
**Evaluation benchmarks:** Word Error Rate (WER), Real Time Factor (RTF)
The Open ASR (Automatic Speech Recognition) Leaderboard evaluates models based on their ability to transcribe spoken language accurately. This task is particularly challenging due to the nuances of speech such as accents, dialects, and background noise, making this leaderboard an important tool for developers in the voice technology sector.
The sections below summarize the evaluation benchmarks used in the top LLM leaderboards listed above and provide more information about each one.
Used in leaderboard: HuggingFace Open LLM
The MMLU Benchmark (Massive Multitask Language Understanding) is a comprehensive evaluation framework designed to measure the multitask accuracy of large language models (LLMs) in both zero-shot and few-shot settings. Introduced to assess AI performance across diverse tasks, MMLU covers a wide array of subjects, from simple mathematics to complex legal reasoning, providing a standardized approach for evaluating LLM capabilities.
MMLU consists of 57 tasks spanning various domains, requiring models to demonstrate a broad knowledge base and problem-solving skills. These tasks cover topics such as elementary mathematics, US history, computer science, and law. Each task evaluates the model's ability to understand and generate language accurately across different contexts.
Key attributes of MMLU tasks include:
MMLU assesses model performance based on several criteria:
These criteria ensure that the models are evaluated on their ability to generate coherent, relevant, detailed, and clear responses across various tasks.
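To make the mechanics concrete, the sketch below shows how an MMLU-style task reduces to multiple-choice accuracy. It is a simplified stand-in for the real evaluation harness: `model_answer` is a placeholder for an actual (typically few-shot prompted) LLM call, and the two questions are illustrative only.

```python
# Sketch: scoring an MMLU-style multiple-choice task by accuracy.
questions = [
    # (question, options, index of the correct option) -- illustrative items
    ("What is 7 * 8?", ["54", "56", "64", "48"], 1),
    ("Which US constitutional amendment abolished slavery?",
     ["10th", "13th", "15th", "19th"], 1),
]

def model_answer(question, options):
    """Placeholder for an LLM call that returns the chosen option index."""
    return 1  # a real evaluation would prompt the model and parse its choice

correct = sum(model_answer(q, opts) == gold for q, opts, gold in questions)
print(f"accuracy = {correct / len(questions):.2f}")  # 1.00 for this toy example
```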
Used in leaderboard: HuggingFace Open LLM
TruthfulQA is a benchmark designed to evaluate the truthfulness of language models by measuring their ability to avoid generating human-like falsehoods. Developed by researchers from the University of Oxford and OpenAI, TruthfulQA comprises two main tasks: a generation task and a multiple-choice task, each aimed at assessing different aspects of model performance in generating and recognizing true statements.
TruthfulQA consists of two tasks that utilize the same set of questions and reference answers:
The evaluation criteria for TruthfulQA are focused on two primary objectives:
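As a rough illustration of the multiple-choice side, the snippet below computes an MC2-style score: the normalized probability mass a model assigns to the set of true reference answers for a question. This is a simplified sketch rather than the official implementation, and the probabilities shown are invented.

```python
# Sketch: MC2-style truthfulness score for one question.
def mc2(true_answer_probs, false_answer_probs):
    """Share of probability mass the model places on the true reference answers."""
    true_mass = sum(true_answer_probs)
    false_mass = sum(false_answer_probs)
    return true_mass / (true_mass + false_mass)

# Hypothetical model probabilities for 2 true and 3 false reference answers:
print(mc2([0.30, 0.10], [0.25, 0.20, 0.15]))  # 0.4
```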
Used in leaderboard: HuggingFace Open LLM
The Winograd Schema Challenge (WSC) has long been a benchmark for commonsense reasoning, consisting of expert-crafted pronoun resolution problems designed to be challenging for statistical models. To address the limitations of scale and inadvertent biases in WSC, researchers introduced WINOGRANDE, a large-scale dataset with 44,000 problems inspired by the original WSC but improved to enhance both scale and difficulty.
WINOGRANDE's tasks involve pronoun resolution problems where a sentence contains an ambiguous pronoun that must be correctly resolved. Each task consists of pairs of nearly identical sentences (called "twins"), with a trigger word flipping the correct answer between the sentences. For example, in the twins "The trophy doesn't fit in the suitcase because it is too big" and "The trophy doesn't fit in the suitcase because it is too small," changing the trigger word (big/small) changes whether "it" refers to the trophy or the suitcase. This design tests the model's ability to use commonsense reasoning rather than relying on statistical patterns.
Key attributes of WINOGRANDE tasks include:
WINOGRANDE evaluates model performance based on:
Used in leaderboard: HuggingFace Open LLM
GSM8K, or Grade School Math 8K, is a dataset of 8,500 high-quality, linguistically diverse grade school math word problems. The dataset is specifically designed to challenge state-of-the-art language models by requiring multi-step reasoning to solve basic mathematical problems. Despite the conceptual simplicity of the problems, the diversity and linguistic complexity make it a significant benchmark for evaluating large language models (LLMs).
The GSM8K dataset is structured to evaluate a model's ability to understand and solve mathematical word problems that involve multiple steps of reasoning. Each problem requires between 2 and 8 steps to solve and primarily involves performing sequences of elementary calculations using basic arithmetic operations.
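A minimal sketch of how GSM8K-style answers are typically checked is shown below: reference solutions end with a "#### <answer>" marker, and the final number in the model's output is compared against it. The problem and model output here are invented, and real harnesses apply more robust answer normalization.

```python
# Sketch: extract the final numeric answer and compare with the reference.
import re

reference_solution = "Each pack has 5 pencils, so 4 packs have 4 * 5 = 20 pencils.\n#### 20"
model_output = "4 packs times 5 pencils per pack is 20 pencils, so the answer is 20."

gold = reference_solution.split("####")[-1].strip()
numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
predicted = numbers[-1] if numbers else None

print("correct" if predicted == gold else "incorrect")  # correct
```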
Key attributes of GSM8K tasks include:
GSM8K evaluates model performance based on the following criteria:
Used in leaderboard: HuggingFace Open LLM
HellaSwag is an evaluation dataset designed to test grounded commonsense inference in large language models (LLMs). Introduced by Zellers et al. (2019), the dataset's name stands for Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations. HellaSwag was specifically created to challenge LLMs' ability to understand and infer physical situations.
HellaSwag consists of 70,000 multiple-choice questions, with contexts drawn from ActivityNet video captions and WikiHow articles describing events in the physical world. Each question provides an initial context followed by four possible endings, with only one being correct. The incorrect options, known as "adversarial endings," are designed to be misleadingly plausible, containing expected words and phrases but ultimately defying common sense.
Key attributes of HellaSwag tasks include:
HellaSwag employs adversarial filtering (AF) to create deceptive incorrect answers. This process involves generating plausible yet incorrect completions that challenge LLMs while remaining easily distinguishable for humans. The evaluation criteria for HellaSwag focus on:
Used in leaderboard: HuggingFace Open LLM
ARC, the AI2 Reasoning Challenge, assesses large language models (LLMs) on grade-school-level science questions that require reasoning rather than simple fact retrieval. The benchmark contains 7,787 natural-science multiple-choice questions drawn from standardized tests, partitioned into an Easy Set and a Challenge Set. The Challenge Set contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm, making it harder for models that rely on surface-level statistics.
The Open LLM Leaderboard evaluates models on the ARC Challenge Set in a few-shot setting (25-shot), so each question is presented along with a handful of solved examples.
Evaluation is based on accuracy: the share of questions for which the model selects the correct answer option.
Used in leaderboard: Big Code Models, EvalPlus
HumanEval is a benchmark dataset designed to measure the functional correctness of language models in synthesizing programs from docstrings. Created by OpenAI, the dataset consists of 164 hand-written programming problems that require generating standalone Python functions.
HumanEval tasks involve generating Python code that correctly implements the functionality described in a given docstring. Each problem includes a function signature, docstring, and multiple unit tests to verify correctness. The tasks are designed to assess various aspects of programming, including language comprehension, algorithms, and simple mathematics.
Key attributes of HumanEval tasks include:
HumanEval evaluates model performance based on the following criteria:
To evaluate functional correctness, models generate multiple samples for each problem. The primary metric, pass@k, is calculated by determining the fraction of problems for which at least one of the k samples passes all unit tests. This approach provides a robust measure of the model's ability to generate correct solutions.
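The unbiased pass@k estimator described in the HumanEval paper can be computed as follows, where n is the number of samples generated per problem and c is the number of samples that pass all unit tests:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for one problem."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 200 samples per problem, 20 of which pass all tests:
print(pass_at_k(200, 20, 1))   # 0.1
print(pass_at_k(200, 20, 10))  # ~0.66
```

The benchmark score is then the average of these per-problem estimates.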
Used in leaderboard: Big Code Models
MultiPL-E is a system designed to translate unit test-driven code generation benchmarks into multiple programming languages, creating a multilingual benchmark for evaluating code generation models. It addresses the limitation of existing benchmarks that primarily focus on Python by translating two popular benchmarks, HumanEval and MBPP, into 18 additional programming languages.
MultiPL-E tasks involve generating code that meets the specifications provided in prompts. These tasks are designed to test the functional correctness of code generated by language models across various programming languages. The benchmarks include:
Key attributes of MultiPL-E tasks include:
MultiPL-E evaluates model performance based on the following criteria:
Used in leaderboard: MTEB
The Massive Text Embedding Benchmark (MTEB) is designed to evaluate text embedding models across a wide range of tasks and datasets, covering multiple languages. This benchmark addresses the limitation of existing evaluation methods, which often focus on a single task, typically semantic textual similarity (STS). MTEB provides a comprehensive framework for assessing the performance of text embeddings in diverse real-world applications.
MTEB includes eight embedding task types, each encompassing multiple datasets, designed to evaluate the functional capabilities of text embeddings across different applications: bitext mining, classification, clustering, pair classification, reranking, retrieval, semantic textual similarity (STS), and summarization.
MTEB evaluates model performance based on task-specific metrics:
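To make the embedding workflow concrete, here is a minimal sketch of the core operation MTEB evaluates in a semantic textual similarity task: map texts to vectors and compare them with cosine similarity. The `embed` lookup is a toy stand-in for a real embedding model and the vectors are invented; MTEB itself aggregates task-specific metrics (such as correlation with human similarity judgments) over its datasets.

```python
# Sketch: cosine similarity between toy text embeddings.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in for an embedding model's encode step (e.g. a sentence-transformers model).
toy_vectors = {
    "A man is playing guitar.": np.array([0.9, 0.1, 0.0]),
    "Someone plays a guitar.":  np.array([0.8, 0.2, 0.1]),
    "The stock market fell.":   np.array([0.0, 0.1, 0.9]),
}
embed = toy_vectors.__getitem__

print(cosine(embed("A man is playing guitar."), embed("Someone plays a guitar.")))  # high (~0.98)
print(cosine(embed("A man is playing guitar."), embed("The stock market fell.")))   # low (~0.01)
```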
Used in leaderboard: CanAICode
The CanAiCode Benchmark is designed to evaluate the coding capabilities of AI models through a structured interview format. This benchmark uses interview questions created by humans, which are then answered by AI models. The evaluation process includes a sandbox environment for code validation and scripts to assess the impact of different prompting techniques and sampling parameters on coding performance.
Metrics used:
Evaluation process:
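The snippet below is a toy version of this kind of check: run model-generated code together with a test assertion in a subprocess and report pass or fail. It is only an illustration, not the CanAiCode harness, a plain subprocess is not a secure sandbox, and the candidate function and test are made up.

```python
# Sketch: execute AI-generated code against a test case with a timeout.
import subprocess, sys, tempfile, textwrap

candidate = textwrap.dedent("""
    def add(a, b):
        return a + b
""")
test = "assert add(2, 3) == 5"

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(candidate + "\n" + test + "\n")
    script_path = f.name

result = subprocess.run([sys.executable, script_path], capture_output=True, timeout=10)
print("PASS" if result.returncode == 0 else "FAIL")  # PASS
```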
Used in leaderboard: EvalPlus
The Mostly Basic Python Programming (MBPP) Benchmark is designed to evaluate the coding skills of AI models using basic Python programming problems. These problems are intended to be approachable for entry-level programmers and cover fundamental programming concepts and standard library functions.
The MBPP benchmark includes about 1,000 crowd-sourced Python programming problems. Each problem comes with a task description, a sample code solution, and three automated test cases. To ensure accuracy, a subset of these problems has been hand-verified.
The dataset is provided in .jsonl format, where each line is a JSON object representing a problem. The verified problems are stored in sanitized-mbpp.json. For evaluation purposes, the dataset is split into different sets:
Evaluation metrics used:
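For illustration, here is how a single MBPP-style .jsonl record can be read and its test cases executed against the provided reference solution. The record shown is invented (the field names follow the published dataset), and executing untrusted code like this should only be done in an isolated environment.

```python
# Sketch: load one MBPP-style problem and run its tests on the reference code.
import json

line = json.dumps({
    "task_id": 601,
    "text": "Write a function to find the minimum of two numbers.",
    "code": "def min_of_two(a, b):\n    return a if a < b else b",
    "test_list": ["assert min_of_two(3, 5) == 3",
                  "assert min_of_two(-1, 1) == -1"],
})

problem = json.loads(line)            # one line of the .jsonl file
namespace = {}
exec(problem["code"], namespace)      # define the reference solution
for test in problem["test_list"]:
    exec(test, namespace)             # raises AssertionError if a test fails
print("all tests passed")
```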
Used in leaderboard: Open ASR
The Word Error Rate (WER) is a widely used metric to evaluate the accuracy of speech recognition and machine translation systems. It measures the ratio of errors in a transcript to the total number of words spoken, providing a clear indication of the system's performance. A lower WER signifies better accuracy in recognizing speech.
WER calculates the error rate in a transcript by considering three types of errors: substitutions (a reference word is replaced with a different word), deletions (a reference word is missing from the transcript), and insertions (an extra word appears in the transcript).
It helps quantify the accuracy by comparing the recognized words against the reference (correct) words. Despite its simplicity, WER is effective in providing a comparative measure of ASR system accuracy.
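Formally, WER = (S + D + I) / N, where S, D, and I are the substitution, deletion, and insertion counts and N is the number of words in the reference. A minimal implementation computes these jointly via word-level edit distance; the sentences below are just an example.

```python
# Sketch: word error rate via word-level Levenshtein distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution (or match)
    return dp[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.33
```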
Used in leaderboard: Open ASR
The Real Time Factor (RTF) is a key metric used to evaluate the performance of speech recognition and machine translation systems in terms of processing speed. It measures the ratio of the time taken to process audio to the actual length of the audio, providing an indication of the system's efficiency. A lower RTF signifies better efficiency and faster processing.
RTF calculates the efficiency of a speech recognition system by comparing the processing time to the duration of the audio input. The formula is simply RTF = Processing Time / Audio Length. For example, a system that takes 12 seconds to transcribe 60 seconds of audio has an RTF of 12 / 60 = 0.2.
When using RTF to evaluate model performance, an RTF below 1 means the system processes audio faster than real time, which is what live applications require, while an RTF above 1 means the system falls behind the incoming audio.
RTF is a vital metric for evaluating the performance of Automatic Speech Recognition (ASR) systems, especially in real-time scenarios. It helps quantify the system's efficiency by comparing the processing time with the actual audio length. This is particularly important for applications that require immediate or near-immediate processing, such as live captioning or interactive voice response.
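A toy measurement makes the calculation concrete; the `time.sleep` call below stands in for an ASR model transcribing a 60-second clip, so the numbers are illustrative only.

```python
# Sketch: computing the real time factor for one transcription run.
import time

audio_seconds = 60.0                      # duration of the input audio
start = time.perf_counter()
time.sleep(0.5)                           # placeholder for the ASR model's work
processing_seconds = time.perf_counter() - start

rtf = processing_seconds / audio_seconds
print(f"RTF = {rtf:.3f} ({'faster' if rtf < 1 else 'slower'} than real time)")
```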
While LLM leaderboards can be useful for assessing and comparing models, it’s also important to be aware of their limitations.
Some LLM benchmarks use human evaluation. While human feedback is beneficial, it introduces the risk of biased assessments due to subjective preferences and individual perceptions. This bias can skew leaderboard rankings, particularly in close competitions where subtle nuances might influence the final scores. For example, recent studies found that some human evaluations of LLMs are biased according to length, meaning that evaluators preferred longer, and not necessarily better, LLM responses.
A significant challenge with LLM leaderboards is the potential for overfitting, where models are fine-tuned to excel on leaderboard benchmarks at the expense of general performance. This issue arises when developers prioritize leaderboard rankings over the model's ability to handle real-world tasks effectively.
Overfitting to specific tasks or datasets can mislead prospective users about a model's utility, suggesting high competence in controlled tests but failing to translate that success into broader applications. Leaderboards should continually update and broaden their benchmarks to discourage overfitting and encourage truly versatile models.
Data contamination occurs when training datasets include information from the test sets used in leaderboards. This issue can artificially inflate a model's performance, since the model may have effectively seen parts of the test data before evaluation.
To combat data contamination, leaderboards must enforce strict separations between training and testing datasets and perform rigorous checks to ensure integrity in the evaluation process. Ensuring clean data usage is crucial for maintaining the credibility of leaderboard standings.
Visit https://gptscript.ai to download GPTScript and start building today. As we expand on the capabilities with GPTScript, we are also expanding our list of tools. With these tools, you can create any application imaginable: check out tools.gptscript.ai to get started.