AGI-Eval



Collection time: 2025-05-30

In today’s rapidly developing AI era, evaluating large models objectively and fairly has become a focal point for the industry. AGI-Eval, jointly launched by Shanghai Jiao Tong University, Tongji University, East China Normal University, and DataWhale, is a large model evaluation community established to meet this demand.

  • Large Model Rankings: Capability score rankings of industry large language models based on a general evaluation scheme, covering both comprehensive evaluations and individual capability assessments. Transparent, authoritative data helps users understand each model’s strengths and weaknesses, and the rankings are updated regularly so users always have the latest information for finding the model solution that suits them best.
  • AGI-Eval Human-Machine Evaluation Competitions: Dive into the world of model evaluation, collaborate with large models, contribute to technological development, and help build human-machine collaborative evaluation schemes.
  • Evaluation Sets:
    • Public Academic: Publicly available academic evaluation sets from across the industry, which users can download.
    • Official Evaluation Sets: Officially built evaluation sets covering model evaluation across multiple domains.
    • User-Built Evaluation Sets: The platform lets users upload their own evaluation sets to help build an open-source community, combining automated and manual evaluation; it also offers private dataset hosting for leading academics.
  • Data Studio:
    • High User Activity: A platform with over 30,000 crowd-sourced users, enabling the collection of more high-quality, real-world data.
    • Diverse Data Types: Offers multi-dimensional, multi-domain professional data.
    • Diverse Data Collection: Methods such as single data points, expanded data, and Arena data meet different evaluation needs.
    • Complete Review Mechanism: Multiple review stages combining machine and human review ensure data quality.
  • Authority and Comprehensiveness: Jointly created by well-known universities and institutions, with authoritative evaluation standards and a comprehensive assessment scope.
  • Transparency: Evaluation results are open and transparent, helping users deeply understand model performance.
  • Flexibility: Supports user-built evaluation sets to meet the evaluation needs of different users.