Which LLM is better? Chatbot Arena

Anson Park

7 min read

Jan 9, 2024

Chatbot Arena is a benchmarking platform for Large Language Models (LLMs) that assesses their capabilities in real-world scenarios through anonymous, head-to-head comparisons judged by human users. Here are some key aspects of Chatbot Arena:

  1. Purpose and Function: Chatbot Arena is designed to evaluate LLMs in a practical, user-focused manner. It addresses the difficulty of benchmarking LLM assistants: the problems they are asked to solve are open-ended, which makes it hard to program an automatic evaluator of response quality.

  2. Pairwise Comparison and Human Evaluation: Since automatic evaluation can be challenging, Chatbot Arena relies on human evaluations through pairwise comparisons. Users interact with two anonymous models and vote for which one they think is better.

  3. Elo Rating System: The platform adopts the Elo rating system, commonly used in chess and other competitive games, to rate LLMs. Elo estimates the relative skill levels of players (here, models) from the outcomes of their matches and can be used to predict future results, which makes it well suited to settings where many models face each other in pairwise battles. A minimal sketch of the update rule follows this list.

  4. Characteristics of a Good Benchmark System: For effective benchmarking, a system should possess scalability (to accommodate a large number of models), incrementality (ability to evaluate a new model with a minimal number of trials), and a unique order (establishing a clear ranking for all models).

  5. Data Collection Method: In Chatbot Arena, users chat with two anonymous models side-by-side and vote for the one they prefer. This crowdsourced data collection reflects how LLMs are actually used in the wild.

  6. Comparative Analysis: Chatbot Arena provides an overview of how different LLMs perform against each other in these head-to-head matchups. This information can be used to compare the effectiveness and user preference for various models.

  7. Significance: Chatbot Arena is significant as it offers a real-world, user-centric approach to evaluating LLMs. By focusing on user experience and preferences, it provides insights that are often missed in more traditional, automated benchmarking methods.
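
To make the Elo mechanism from point 3 concrete, here is a minimal sketch of the standard Elo update applied to a single battle. The K-factor of 32, the starting ratings, and the function names are illustrative choices, not necessarily the parameters Chatbot Arena itself uses.

    def expected_score(rating_a: float, rating_b: float) -> float:
        """Probability that model A beats model B under the Elo model."""
        return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

    def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32) -> tuple[float, float]:
        """Update both ratings after one battle.
        score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
        exp_a = expected_score(rating_a, rating_b)
        new_a = rating_a + k * (score_a - exp_a)
        new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
        return new_a, new_b

    # Example: model A (rated 1200) beats model B (rated 1250).
    print(update_elo(1200, 1250, score_a=1.0))  # A gains the points that B loses

Because the two expected scores always sum to one, each battle simply transfers rating points from the loser to the winner. This is what makes the system incremental: a new model needs only a modest number of battles to settle near its appropriate rating.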



Chatbot Arena thus offers a novel way to evaluate the effectiveness and user experience of LLMs, making it a valuable tool in the evolving landscape of AI and machine learning.


Chatbot Arena Leaderboard

https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard


On the leaderboard, the models are ranked in the order GPT-4, Claude, Mixtral, and Gemini Pro. It is particularly notable that, while the top ranks are dominated by proprietary models led by OpenAI's GPT-4, Mistral AI's open-source model Mixtral-8x7b-Instruct-v0.1 also achieves remarkable results.


Chatbot Arena

https://chat.lmsys.org/


When you enter Chatbot Arena, you can compare and converse with two chatbots based on large language models (LLMs). While writing this post, I started the conversation with the following question:

'How is the performance of large language models evaluated?'

As soon as I posed the question, two anonymous models participating in Chatbot Arena responded with their answers.

Response of Model A

The performance of large language models is typically evaluated using a variety of metrics and tasks that assess their ability to understand and generate language. Here are some common ways in which the performance of large language models is evaluated:

  1. Perplexity: Perplexity measures how well a language model predicts a given sequence of words. Lower perplexity values indicate that the model is better at predicting the next word in a sequence.

  2. Language Modeling Tasks: Language modeling tasks involve predicting the next word in a sequence of words. The model is given a prompt, and it generates text one word at a time. The accuracy of the generated text is evaluated using metrics such as perplexity, vocabulary size, and fluency.

  3. Text Classification: Text classification involves assigning a label to a piece of text based on its content. Large language models can be trained on text classification tasks to evaluate their ability to understand the meaning of text.

  4. Sentiment Analysis: Sentiment analysis involves determining the sentiment (positive, negative, or neutral) of a piece of text. This task can be used to evaluate the ability of a language model to understand the emotional tone of text.

  5. Named Entity Recognition (NER): NER involves identifying named entities (e.g., people, organizations, locations) in text. This task can be used to evaluate the ability of a language model to understand the structure and content of text.

  6. Question Answering: Question answering involves evaluating a model's ability to answer questions based on the content of a text passage. This task can be used to evaluate the model's ability to understand the meaning of text and generate relevant responses.

  7. Machine Translation: Machine translation involves translating text from one language to another. Large language models can be trained on machine translation tasks to evaluate their ability to understand and generate text in different languages.

  8. Dialogue Generation: Dialogue generation involves generating natural language responses to given prompts. This task can be used to evaluate the ability of a language model to generate coherent and contextually appropriate text.

  9. Summarization: Summarization involves generating a summary of a given text passage. This task can be used to evaluate the ability of a language model to understand the main ideas and concepts in a piece of text and generate a concise summary.

These tasks and metrics provide a comprehensive evaluation of the performance of large language models and help researchers and developers understand their strengths and limitations.
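
Stepping outside the transcript for a moment: the perplexity that Model A lists first is straightforward to illustrate. The sketch below computes perplexity from per-token probabilities that some model has assigned to a sentence; the numbers are invented purely for illustration.

    import math

    def perplexity(token_probs: list[float]) -> float:
        """Perplexity is the exponential of the average negative log-likelihood per token."""
        nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
        return math.exp(nll)

    # Toy per-token probabilities assigned to the same sentence by two models.
    confident_model = [0.5, 0.6, 0.4, 0.7]
    uncertain_model = [0.1, 0.05, 0.2, 0.1]

    print(perplexity(confident_model))  # ~1.9: lower perplexity, better next-token prediction
    print(perplexity(uncertain_model))  # ~10.0: higher perplexity, worse next-token prediction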

Response of Model B

Large language models are evaluated using a variety of metrics and benchmarks, both automatic and human-based, to assess their performance in various tasks. Some common evaluation methods include:

  1. Automatic metrics: These metrics are computed programmatically and can be divided into two categories:
    a. Task-specific metrics: These metrics are designed for specific tasks, such as accuracy, precision, recall, F1-score, and BLEU score. For example, accuracy is commonly used in classification tasks, while BLEU (Bilingual Evaluation Understudy) is used for machine translation tasks to measure the quality of the generated output.
    b. Task-agnostic metrics: These metrics are not specific to any particular task. Examples include perplexity (a measure of how well a language model predicts the next word in a sequence) and cross-entropy loss (a measure of the probability of the true label).

  2. Human evaluation: Large language models are also evaluated by human evaluators to assess aspects like fluency, coherence, and informativeness of the generated text. This is particularly important for tasks like natural language understanding, where the model's ability to generate human-like responses is crucial.

  3. Benchmark datasets: Large language models are often evaluated on benchmark datasets that are designed to test the model's performance on a wide range of tasks. Examples include GLUE (General Language Understanding Evaluation), SuperGLUE, and LAMBADA. These datasets provide a standardized way to compare the performance of different models across various tasks.

The choice of evaluation methods depends on the specific tasks and goals of the large language model. A combination of automatic metrics, human evaluation, and benchmark datasets can provide a comprehensive assessment of a model's performance.
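
One more aside before voting: the BLEU score that Model B mentions under task-specific metrics can be computed with off-the-shelf tooling. The sketch below uses NLTK's sentence-level BLEU; the reference and candidate sentences are made up, and smoothing is applied because short sentences often have no higher-order n-gram overlap.

    # Requires: pip install nltk
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = [["the", "cat", "sits", "on", "the", "mat"]]  # list of reference translations
    candidate = ["the", "cat", "is", "on", "the", "mat"]      # system output being scored

    smooth = SmoothingFunction().method1
    score = sentence_bleu(reference, candidate, smoothing_function=smooth)
    print(f"BLEU: {score:.3f}")  # closer to 1.0 means closer to the reference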

What do you think is the better response from the models? To participate in the evaluation, you can select and submit one of the following four options:

  • 👈🏻 A is better

  • 👉🏻 B is better

  • 🤝 Tie

  • 👎🏻 Both are bad


Result Analysis

Through this process, a large number of people can directly try the models and submit comparative evaluations. From this data, the Elo ratings are calculated and the rankings derived. The mechanism for calculating the Elo rating can be inspected in the following notebook:

https://colab.research.google.com/drive/1KdwokPjirkTmpO_P1WByFNFiqxWQquwH
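
As a rough illustration of what that computation looks like (a sketch, not the notebook's exact code), the snippet below replays the online Elo update over a list of battle records, reusing the update_elo helper from the earlier sketch. The battle data and the starting rating of 1,000 are illustrative.

    from collections import defaultdict

    # Each record: (model_a, model_b, winner), with winner in {"model_a", "model_b", "tie"}.
    battles = [
        ("gpt-4", "claude-2", "model_a"),
        ("mixtral-8x7b-instruct", "gemini-pro", "model_b"),
        ("gpt-4", "mixtral-8x7b-instruct", "tie"),
    ]

    ratings = defaultdict(lambda: 1000.0)  # every model starts at the same rating

    for model_a, model_b, winner in battles:
        score_a = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}[winner]
        ratings[model_a], ratings[model_b] = update_elo(
            ratings[model_a], ratings[model_b], score_a
        )

    # Rank models by their final rating.
    for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
        print(f"{model}: {rating:.1f}")

The notebook additionally bootstraps these estimates by resampling the battles many times, which attaches confidence intervals to each rating; that is what the "1000 Rounds of Random Sampling" plot refers to.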

The notebook produces, among other things, the following plots: the fraction of Model A wins for all non-tied A vs. B battles, the battle count for each combination of models (without ties), a bootstrap of the MLE Elo estimates (1000 rounds of random sampling), and the average win rate against all other models (assuming uniform sampling and no ties).


Evaluating AI as complex as LLMs has clear limitations when we rely solely on mechanical assessments against fixed benchmark datasets and scoring algorithms. It is therefore essential to collect human feedback, to understand how people actually perceive these models and which LLMs they judge to be better. From this perspective, Chatbot Arena is an important leaderboard that deserves our attention.


References:

  • https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard

  • https://chat.lmsys.org/

  • https://colab.research.google.com/drive/1KdwokPjirkTmpO_P1WByFNFiqxWQquwH


Written by Anson Park

CEO of DeepNatural. MSc in Computer Science from KAIST & TU Berlin. Specialized in Machine Learning and Natural Language Processing.