Sources and Methodology

Where RankedAGI's model data comes from, how benchmarks are selected, and how composite scores should be interpreted.

Data collection

I collect benchmark results manually from primary public sources. That usually means official model-provider blogs, papers, technical reports, release posts, model cards, and benchmark leaderboards. Benchmark platforms include LiveBench, Arena, Aider, SWE-Bench, Terminal-Bench, MathArena, and similar public result pages.

RankedAGI does not own the underlying benchmark data. The site organizes public results for comparison, research, and discovery.

Benchmark inclusion criteria

Benchmarks are included when they help answer practical model-comparison questions and have publicly checkable results. I prioritize current evaluations for coding, reasoning, math, general preference, multimodal understanding, and agentic tasks.

Older benchmarks may be hidden or de-emphasized when they become saturated, superseded, hard to compare across releases, or less useful for ranking current frontier models.

Benchmark index

The public table columns are generated from benchmark metadata. Similar names can represent different evaluations or releases, so the subtitle and description matter. For example, MMLU and MMLU-Pro are separate benchmarks, not interchangeable aliases.

MMLU-Pro is tracked separately from standard MMLU when usable public scores are available. If a harder or pro benchmark variant is not shown for a model, it means RankedAGI does not currently have a usable public score recorded for that model on that benchmark.

Coding

Code RankedAGI

RankedAGI Coding Score

SWEBench Pro

Diverse Agentic Coding

SWEBench Multilingual

SWEBench Verified

Agentic Coding

Source

Terminal Bench 2.0

Agentic Terminal Coding

Source

Code Arena

Code Arena Elo Score

Source

Code Livebench

LiveBench Coding 26.01.08

Source

Svelte Bench

SvelteBench - Benchmark for Svelte

Source

Agents

Agentic RankedAGI

RankedAGI Agentic Score

Browse Comp

A benchmark for browsing agents

Source

OSWorld Verified

Vending Bench 2

Benchmark for measuring AI model performance on running a business over long time horizons. Models are tasked with running a simulated vending machine business over a year and scored on their bank account balance at the end.

Source

𝜏²-Bench Telecom

Agentic Tool Use

GDPval AA

Office Tasks (Artificial Analysis)

Safety

Cyber Gym

Reasoning

Reason RankedAGI

RankedAGI Reasoning Score

HLE

Multidisciplinary Reasoning (no tools)

HLE w/ Tools

Multidisciplinary Reasoning (with tools)

GPQA Diamond

Graduate-Level Google-Proof Q&A (PhD-Level Reasoning)

NYT Connections

NYT Connections Extended Version

Source

ARC AGI 2.0

Abstract Reasoning Puzzles (Public)

General

Text Arena

Chatbot Arena (LMSYS) Elo Score

Source

RAGI RankedAGI

Overall RankedAGI score

Math

AIME 2026

AIME 2026 Competition Math

Source

Imaging

MMMU

Multimodal Understanding: College-Level Visual Problem-Solving

MMMU Pro

Multimodal Understanding

MMMU Pro w/ Tools

Design

Composite score methodology

RankedAGI composite scores combine selected benchmarks that target a related capability area, such as coding or agentic work. Scores are normalized before mixing so benchmarks with different raw formats can contribute to one comparable percentage-style score.
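The exact normalization RankedAGI applies is not published here. As a rough illustration of how raw scores in different formats can be brought onto one scale, the sketch below uses simple clamped min-max scaling. The function name `normalize` and the bounds passed to it (0 to 100 for a percentage, 1000 to 1600 for an Elo-style rating) are assumptions made for this example, not values used by the site.

```python
def normalize(raw: float, low: float, high: float) -> float:
    """Map a raw benchmark value onto 0-1 using hypothetical bounds.

    `low` and `high` are illustrative reference points chosen per benchmark;
    values outside the range are clamped.
    """
    if high <= low:
        raise ValueError("high must be greater than low")
    scaled = (raw - low) / (high - low)
    return min(1.0, max(0.0, scaled))


# A percentage-style score and an Elo-style rating land on the same scale.
percent_style = normalize(72.5, 0, 100)
elo_style = normalize(1450, 1000, 1600)
print(percent_style, elo_style)
```

Once every source is on a 0-1 scale, scores from percentage benchmarks and rating-based leaderboards can be averaged with weights without one format dominating purely because of its units.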

The benchmark mix and weights are chosen to reflect practical model performance, source quality, recency, benchmark coverage, and how well the benchmark tends to translate to real-world results. A model that is very strong in coding but weaker in reasoning will score well in coding-focused composites, while broader composites balance that strength against reasoning, agentic, math, general, and multimodal evidence when those signals are available.

Where public benchmark coverage is sparse, RankedAGI may use estimated or synthetic evidence to reduce gaps between models. Those estimates are supporting signals, not replacements for real benchmark results, and are weighted with less confidence than direct public scores. Missing benchmark data should not be read as a zero score.

Limitations: benchmark availability differs by model, some scores come from provider-reported runs, and public leaderboards can change over time. Composite scores are directional summaries, not claims that one public formula captures every use case.

Published composite formula

Each benchmark source is normalized to a 0-1 score. For direct benchmark results, trust = 1.0. For simulated evidence, RankedAGI requires a confidence of at least 0.35 and uses trust = confidence * 0.75, so an estimate's trust never exceeds 0.75 and a direct result always outweighs an estimate of equal relevance and sampling reliability.
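The trust rule can be written as a small function. This is a minimal sketch of the rule as stated in the paragraph above; the function name `trust_for` and the choice to return `None` for excluded estimates are assumptions made for illustration.

```python
from typing import Optional

MIN_CONFIDENCE = 0.35      # threshold stated in the text
SIMULATED_DISCOUNT = 0.75  # multiplier stated in the text


def trust_for(is_direct: bool, confidence: Optional[float] = None) -> Optional[float]:
    """Return the trust value for one piece of evidence.

    Returns None when a simulated estimate falls below the confidence
    threshold and therefore should not contribute at all.
    """
    if is_direct:
        return 1.0
    if confidence is None or confidence < MIN_CONFIDENCE:
        return None
    return confidence * SIMULATED_DISCOUNT


print(trust_for(True))         # direct result: 1.0
print(trust_for(False, 0.5))   # accepted estimate: 0.375
print(trust_for(False, 0.2))   # below threshold: None
```

Even a simulated estimate with confidence 1.0 would receive trust 0.75, which is why a direct result with the same relevance always weighs more.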

signal_weight = (source_relevance / 50) ^ 1.25
effective_weight = signal_weight * trust * sampling_reliability

family_score = sum(source_score * effective_weight) / sum(effective_weight)
family_evidence = min(sum(effective_weight), strongest_source_weight * 1.2)

observed_score = sum(family_score * family_evidence)
evidence = sum(family_evidence)
final_score = (prior_score * prior_weight + observed_score) / (prior_weight + evidence)

The family cap prevents several similar benchmarks, such as multiple versions of the same benchmark family, from overwhelming the composite as if they were fully independent evidence.
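To make the formula concrete, here is a minimal Python sketch of it. It is an illustration under stated assumptions, not the site's actual code: the `Evidence` class, the family labels, and the prior values in the example are hypothetical, and `strongest_source_weight` is interpreted as the largest effective weight within a family.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Evidence:
    family: str                    # benchmark family label (hypothetical)
    source_score: float            # normalized 0-1 score
    source_relevance: float        # relevance value fed into signal_weight
    trust: float = 1.0             # 1.0 for direct results
    sampling_reliability: float = 1.0


def effective_weight(item: Evidence) -> float:
    signal_weight = (item.source_relevance / 50) ** 1.25
    return signal_weight * item.trust * item.sampling_reliability


def composite_score(items, prior_score: float, prior_weight: float) -> float:
    families = defaultdict(list)
    for item in items:
        families[item.family].append(item)

    observed_score = 0.0
    evidence = 0.0
    for members in families.values():
        weights = [effective_weight(m) for m in members]
        total = sum(weights)
        if total <= 0:
            continue  # nothing usable in this family
        family_score = sum(m.source_score * w for m, w in zip(members, weights)) / total
        # Family cap: assumed to be 1.2x the strongest single source in the family.
        family_evidence = min(total, max(weights) * 1.2)
        observed_score += family_score * family_evidence
        evidence += family_evidence

    denominator = prior_weight + evidence
    if denominator <= 0:
        raise ValueError("no prior weight and no usable evidence")
    return (prior_score * prior_weight + observed_score) / denominator


items = [
    Evidence("family_a", 0.6, 50),
    Evidence("family_a", 0.7, 50),
    Evidence("family_a", 0.8, 50),
    Evidence("family_b", 0.5, 50),
]
result = composite_score(items, prior_score=0.5, prior_weight=1.0)
print(round(result, 3))  # 0.575
```

In this example the three family_a sources would carry a combined weight of 3.0 if treated as independent, but the cap limits the family to 1.2, so the single family_b source still has a comparable say in the result.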

For the overall RankedAGI score, the current relative capability weights are coding 50, reasoning 50, agentic 50, and math 18. Coding, reasoning, and agentic performance therefore contribute as primary pillars, while math is a supporting signal in the overall score. Category scores such as RankedAGI Coding and RankedAGI Reasoning use their own benchmark-source lists before feeding into the overall score.
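As a rough worked example of what these weights imply, the sketch below assumes the relative capability weights are passed through the same signal_weight formula as benchmark sources, with equal trust and sampling reliability for every pillar. That is an assumption made for illustration; the text does not state how the capability weights enter the overall score.

```python
# Relative capability weights as listed in the text.
CAPABILITY_WEIGHTS = {"coding": 50, "reasoning": 50, "agentic": 50, "math": 18}

# Assumption: the same (relevance / 50) ^ 1.25 curve applies to these weights.
signal = {name: (weight / 50) ** 1.25 for name, weight in CAPABILITY_WEIGHTS.items()}
total = sum(signal.values())
shares = {name: value / total for name, value in signal.items()}

for name in CAPABILITY_WEIGHTS:
    print(f"{name}: signal_weight={signal[name]:.3f}, share={shares[name]:.1%}")
```

Under that assumption the three primary pillars each get a signal weight of 1.0, while math comes out at about 0.28, so the exponent of 1.25 pushes math's influence below the 18/50 = 0.36 ratio that a purely linear weighting would give.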

Simulated evidence quality controls

Simulated benchmark estimates are generated only from real benchmark overlap between models. RankedAGI compares models in normalized benchmark space, finds nearby models with shared real results, and gives closer, more consistent neighbors more influence. The estimate also carries a confidence value based on shared coverage, neighbor quality, and agreement between the available signals.
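The general idea of a neighbor-based estimate can be sketched as follows. This is a simplified illustration, not RankedAGI's actual estimator: the function name, the root-mean-square distance, the inverse-distance weighting, the `min_shared` threshold, and the `smoothing` constant are all assumptions, and the confidence value described above is not modeled here.

```python
from math import sqrt
from typing import Dict, List, Optional

Scores = Dict[str, float]  # benchmark name -> normalized real score


def estimate_score(target: Scores, others: List[Scores], benchmark: str,
                   min_shared: int = 2, smoothing: float = 0.05) -> Optional[float]:
    """Estimate a missing benchmark score from models with shared real results."""
    if benchmark in target:
        return target[benchmark]  # a real value always takes precedence
    weighted_sum = 0.0
    weight_total = 0.0
    for other in others:
        if benchmark not in other:
            continue  # neighbor has no real result to lend
        shared = [name for name in target if name in other]
        if len(shared) < min_shared:
            continue  # too little overlap to judge similarity
        distance = sqrt(sum((target[n] - other[n]) ** 2 for n in shared) / len(shared))
        weight = 1.0 / (distance + smoothing)  # closer neighbors count more
        weighted_sum += weight * other[benchmark]
        weight_total += weight
    if weight_total == 0.0:
        return None  # no usable neighbors, so no estimate
    return weighted_sum / weight_total


target = {"bench_a": 0.8, "bench_b": 0.6}
near = {"bench_a": 0.8, "bench_b": 0.6, "bench_c": 0.7}
far = {"bench_a": 0.2, "bench_b": 0.1, "bench_c": 0.2}
estimate = estimate_score(target, [near, far], "bench_c")
print(round(estimate, 2))
```

Because the near neighbor matches the target on both shared benchmarks, the estimate lands much closer to its 0.7 than to the distant model's 0.2. Only real scores appear in the inputs, which mirrors the rule that estimates are never built from other estimates.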

Low-confidence estimates are excluded from composite scoring. Estimates below 0.35 confidence do not contribute, and accepted estimates are still discounted through the 0.75 trust multiplier described above. This keeps simulated evidence useful for sparse models without letting it outrank direct benchmark evidence.

Real benchmark values always override estimates. Simulated values are never written into the main model data as real scores, and simulated values are not used as inputs to generate other simulated values.

Thinking model variants

For older model families with explicit thinking and non-thinking variants, RankedAGI records separate model entries when public benchmark data distinguishes those modes. If a benchmark source publishes a thinking-mode score, that score is recorded for the relevant thinking variant; if it publishes a non-reasoning or standard-mode score, that is recorded separately when the model naming makes the distinction clear.

RankedAGI records the public benchmark result as reported by the source. Reasoning traces are not scored separately unless the benchmark or source itself publishes a separate trace-level score.

Human preference benchmark uncertainty

Arena-style human preference benchmarks are useful signals, but they are not treated as a top-priority input in composite scores, and a weak Arena result does not act as a dominant penalty. Their contribution is weighted by practical relevance and by how well the benchmark appears to translate to real-world model quality.

RankedAGI checks public Arena-style leaderboards weekly and updates changed scores when the source updates. Confidence intervals and sample sizes are not independently recalculated unless the source publishes them in a way that can be represented consistently on the site.

Model update frequency

New public model data is generally updated within 24 hours of release, often within 5-6 hours when the release and benchmark sources are clear. Each model carries a last-updated value so freshness is visible on the site.

Model versioning

Models are listed under their public names, which can change over time. Duplicate-looking names can exist when a provider releases a new dated build, preview, thinking variant, or API-visible version with different benchmark behavior.

A suffix such as -old means the row represents an earlier model entry kept for comparison or historical continuity, not the preferred current listing. The older pattern was to move a replaced same-name model to an -old slug and keep the newer release at the main slug.

The newer standard is to add a date or version marker to the slug when it avoids ambiguity. Weight or size is added for open-model families that release multiple sizes under the same model name. If a model has only one relevant size and there is no naming ambiguity, the size is usually omitted from the slug.

Context window sourcing

Context-window values are provider- or developer-reported unless otherwise marked. They may represent API limits, product limits, or documented model-card limits depending on what the source publishes.

Custom comparison behavior

Users can compare models by filtering the model table and showing or hiding benchmark columns. These controls change the view in the browser; they do not change the underlying public data.

Corrections and revisions

When source data changes or errors are found, RankedAGI updates the current public data rather than preserving a separate public revision log for every score. Per-score source attribution is planned as a larger provenance/data-model project.

Model pages usually include one or more global source links, such as an official release blog, model card, Hugging Face page, or public announcement. For third-party benchmarks, data is collected from the official leaderboard or result page for that benchmark, and many of those benchmark pages are linked from the benchmark index above.

Data access FAQ

Is there a public API for RankedAGI model data?

No. RankedAGI does not currently provide a public API for programmatic access to model data. The public website is the supported interface for browsing, filtering, and comparing models.

What are the API rate limits?

There are no API rate limits because there is no public API. The site is static and crawlable through normal public pages, robots.txt, llms.txt, and the sitemap.

Can I export the model table as CSV or JSON?

Not currently. RankedAGI does not yet offer CSV or JSON export for the full comparison table. Users can filter models and show or hide benchmark columns in the browser, but downloadable exports are not part of the current public site.

Are entries like DeepSeek 2.5, DeepSeek 2.5-old, and DeepSeek 2.5-236B-old duplicates?

They are separate RankedAGI entries used to preserve release or size distinctions. A -old suffix marks an earlier entry kept for continuity, while a size marker such as 236B identifies a specific open-model weight when the same family has multiple sizes or naming would otherwise be ambiguous.

Do individual benchmark scores show provider versus independent source type?

Not yet at the score level. Current model pages usually include model-level source links, and benchmark-specific third-party scores are collected from official benchmark leaderboards. Per-score source attribution is planned so each benchmark value can point to the exact provider, independent lab, model card, or leaderboard source used for that value.

Planned data improvements

A price-performance view is in progress so users can compare capability against API pricing without calculating every ratio manually.

Per-score source attribution is also planned. The goal is to link each benchmark value for each model to the specific source used for that value, rather than relying only on model-level and benchmark-level source links.