Where RankedAGI's model data comes from, how benchmarks are selected, and how composite scores should be interpreted.
I collect benchmark results manually from primary public sources. That usually means official model-provider blogs, papers, technical reports, release posts, model cards, and benchmark leaderboards. Benchmark platforms include sources such as LiveBench, Arena, Aider, SWE-Bench, Terminal-Bench, MathArena, and similar public result pages.
RankedAGI does not own the underlying benchmark data. The site organizes public results for comparison, research, and discovery.
Benchmarks are included when they help answer practical model-comparison questions and have publicly checkable results. I prioritize current evaluations for coding, reasoning, math, general preference, multimodal understanding, and agentic tasks.
Older benchmarks may be hidden or de-emphasized when they become saturated, superseded, hard to compare across releases, or less useful for ranking current frontier models.
The public table columns are generated from benchmark metadata. Similar names can represent different evaluations or releases, so the subtitle and description matter. For example, MMLU and MMLU-Pro are separate benchmarks, not interchangeable aliases.
MMLU-Pro is tracked separately from standard MMLU when usable public scores are available. If a harder or pro benchmark variant is not shown for a model, it means RankedAGI does not currently have a usable public score recorded for that model on that benchmark.
RankedAGI Coding Score
Diverse Agentic Coding
Agentic Coding
Agentic Terminal Coding
Code Arena Elo Score
LiveBench Coding 26.01.08
SvelteBench - Benchmark for Svelte
RankedAGI Agentic Score
A benchmark for browsing agents
Benchmark for measuring AI model performance on running a business over long time horizons. Models are tasked with running a simulated vending machine business over a year and scored on their bank account balance at the end.
Agentic Tool Use
Office Tasks (Artificial Analysis)
RankedAGI Reasoning Score
Multidisciplinary Reasoning (no tools)
Multidisciplinary Reasoning (with tools)
Graduate-Level Google-Proof Q&A Score (Reasoning): PhD-level reasoning
NYT Connections Extended Version
Abstract Reasoning Puzzles (Public)
ChatArena (LMSYS) Elo Score
Overall RankedAGI score
AIME 2026 Competition Math
Multimodal Understanding: College-level visual problem-solving
Multimodal Understanding
RankedAGI composite scores combine selected benchmarks that target a related capability area, such as coding or agentic work. Scores are normalized before mixing so benchmarks with different raw formats can contribute to one comparable percentage-style score.
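As an illustration of what normalizing before mixing can look like, the sketch below maps a percentage score and an Elo-style rating onto a common 0-1 scale. The function names and the rating bounds (1000 and 1500) are hypothetical and chosen only for the example; RankedAGI's actual normalization rules are not spelled out on this page.

```python
def normalize_percentage(score: float) -> float:
    """Map a 0-100 percentage score to the 0-1 range, clamped."""
    return max(0.0, min(1.0, score / 100.0))


def normalize_rating(rating: float, low: float, high: float) -> float:
    """Min-max scale an Elo-style rating into 0-1, clamped to the bounds.

    `low` and `high` are assumed reference bounds, not RankedAGI values.
    """
    if high <= low:
        raise ValueError("high must be greater than low")
    return max(0.0, min(1.0, (rating - low) / (high - low)))


pct = normalize_percentage(72.5)          # percentage-style benchmark
elo = normalize_rating(1350, 1000, 1500)  # rating-style benchmark
print(pct, elo)
```

Once both values live on the same scale, they can be mixed with weights without the raw format of either benchmark dominating the result.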
The benchmark mix and weights are chosen to reflect practical model performance, source quality, recency, benchmark coverage, and how well the benchmark tends to translate to real-world results. A model that is very strong in coding but weaker in reasoning will score well in coding-focused composites, while broader composites balance that strength against reasoning, agentic, math, general, and multimodal evidence when those signals are available.
Where public benchmark coverage is sparse, RankedAGI may use estimated or synthetic evidence to reduce gaps between models. Those estimates are supporting signals, not replacements for real benchmark results, and are weighted with less confidence than direct public scores. Missing benchmark data should not be read as a zero score.
Limitations: benchmark availability differs by model, some scores come from provider-reported runs, and public leaderboards can change over time. Composite scores are directional summaries, not claims that one public formula captures every use case.
Each benchmark source is normalized to a 0-1 score. For direct benchmark results, trust = 1.0. For simulated evidence, RankedAGI requires confidence of at least 0.35 and uses trust = confidence * 0.75, so real evidence always carries more weight than an estimate.
signal_weight = (source_relevance / 50) ^ 1.25
effective_weight = signal_weight * trust * sampling_reliability
family_score = sum(source_score * effective_weight) / sum(effective_weight)
family_evidence = min(sum(effective_weight), strongest_source_weight * 1.2)
observed_score = sum(family_score * family_evidence)
evidence = sum(family_evidence)
final_score = (prior_score * prior_weight + observed_score) / (prior_weight + evidence)
The family cap prevents several similar benchmarks, such as multiple versions of the same benchmark family, from overwhelming the composite as if they were fully independent evidence.
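The formulas above can be traced end to end in a short Python sketch. It is not RankedAGI's implementation: the `Source` class, the function names, the example benchmarks, and the prior values (0.5 with weight 1.0) are hypothetical, and strongest_source_weight is assumed to mean the largest effective weight within a family.

```python
from dataclasses import dataclass


@dataclass
class Source:
    family: str                 # benchmark family (e.g. versions of one benchmark)
    score: float                # normalized 0-1 score
    relevance: float            # source_relevance
    trust: float                # 1.0 for direct results, lower for estimates
    sampling_reliability: float = 1.0


def effective_weight(src: Source) -> float:
    signal_weight = (src.relevance / 50) ** 1.25
    return signal_weight * src.trust * src.sampling_reliability


def composite(sources, prior_score: float, prior_weight: float) -> float:
    # Group sources by family so the cap applies per family.
    families = {}
    for src in sources:
        families.setdefault(src.family, []).append(src)

    observed_score = 0.0
    evidence = 0.0
    for members in families.values():
        weights = [effective_weight(s) for s in members]
        total = sum(weights)
        if total == 0:
            continue  # family carries no usable evidence
        family_score = sum(s.score * w for s, w in zip(members, weights)) / total
        # Assumption: strongest_source_weight is the family's largest weight.
        family_evidence = min(total, max(weights) * 1.2)
        observed_score += family_score * family_evidence
        evidence += family_evidence

    if prior_weight + evidence == 0:
        raise ValueError("no prior and no evidence to score from")
    return (prior_score * prior_weight + observed_score) / (prior_weight + evidence)


sources = [
    Source("bench_a", 0.8, relevance=50, trust=1.0),  # two versions of one family
    Source("bench_a", 0.6, relevance=50, trust=1.0),
    Source("bench_b", 0.9, relevance=50, trust=1.0),  # an independent family
]
result = composite(sources, prior_score=0.5, prior_weight=1.0)
print(round(result, 3))
```

In this example the two bench_a entries have a combined weight of 2.0 but count as only 1.2 units of evidence, which is the family cap at work.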
For the overall RankedAGI score, the current relative capability weights are coding 50, reasoning 50, agentic 50, and math 18. Coding, reasoning, and agentic performance therefore contribute as primary pillars, while math is a supporting signal in the overall score. Category scores such as RankedAGI Coding and RankedAGI Reasoning use their own benchmark-source lists before feeding into the overall score.
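To get a feel for how far apart these weights sit, the sketch below assumes that the capability weights play the role of source_relevance in the signal_weight formula, (source_relevance / 50) ^ 1.25. That mapping is an assumption made for illustration, and the calculation ignores trust, sampling reliability, and the prior.

```python
capability_weights = {"coding": 50, "reasoning": 50, "agentic": 50, "math": 18}

# Assumption: each capability weight is used as source_relevance.
signal_weights = {name: (w / 50) ** 1.25 for name, w in capability_weights.items()}

# Relative share of each capability if nothing else adjusted the weights.
total = sum(signal_weights.values())
shares = {name: sw / total for name, sw in signal_weights.items()}

for name in capability_weights:
    print(f"{name}: signal_weight={signal_weights[name]:.3f}, share={shares[name]:.1%}")
```

Under that assumption each primary pillar has a signal weight of 1.0 and math lands at roughly 0.28, so the exponent makes math count for somewhat less than the raw 18-to-50 ratio would suggest.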
Simulated benchmark estimates are generated only from real benchmark overlap between models. RankedAGI compares models in normalized benchmark space, finds nearby models with shared real results, and gives closer, more consistent neighbors more influence. The estimate also carries a confidence value based on shared coverage, neighbor quality, and agreement between the available signals.
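A minimal sketch of a neighbor-based estimate in this spirit follows. The similarity measure (an inverse of the mean absolute gap on shared benchmarks), the `min_shared` threshold, and all names and scores are hypothetical; the actual estimator and its confidence calculation are not specified here, and the confidence value is left out of the sketch.

```python
def similarity(a: dict, b: dict, min_shared: int = 2) -> float:
    """Similarity of two models over the benchmarks both have real scores for."""
    shared = set(a) & set(b)
    if len(shared) < min_shared:
        return 0.0  # too little overlap to trust this neighbor
    mean_gap = sum(abs(a[k] - b[k]) for k in shared) / len(shared)
    return 1.0 / (1.0 + 10.0 * mean_gap)  # closer neighbors get more influence


def estimate(target: dict, others: dict, benchmark: str):
    """Estimate the target's normalized score on `benchmark` from real neighbors."""
    weighted_sum = 0.0
    weight_total = 0.0
    for scores in others.values():
        if benchmark not in scores:
            continue  # only real results on this benchmark are usable
        sim = similarity(target, scores)
        weighted_sum += sim * scores[benchmark]
        weight_total += sim
    if weight_total == 0:
        return None  # no usable neighbors, so no estimate
    return weighted_sum / weight_total


target = {"a": 0.8, "b": 0.6}  # has no real score on "c"
others = {
    "m1": {"a": 0.8, "b": 0.6, "c": 0.7},  # identical on shared benchmarks
    "m2": {"a": 0.4, "b": 0.2, "c": 0.3},  # much weaker on shared benchmarks
    "m3": {"a": 0.8, "c": 0.9},            # only one shared benchmark, ignored
}
est = estimate(target, others, "c")
print(est)
```

The estimate is pulled toward m1's score because m1 matches the target on every shared benchmark, while m3 is ignored for lack of overlap.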
Low-confidence estimates are excluded from composite scoring. Estimates below 0.35 confidence do not contribute, and accepted estimates are still discounted through the trust cap above. This keeps simulated evidence useful for sparse models without letting it outrank direct benchmark evidence.
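The gating rule can be written as a small function. The 0.35 threshold and 0.75 multiplier come from the description on this page; the function and constant names are hypothetical.

```python
MIN_CONFIDENCE = 0.35  # estimates below this confidence are excluded
TRUST_SCALE = 0.75     # discount applied to accepted estimates


def trust_for(is_simulated: bool, confidence: float = 1.0):
    """Return the trust multiplier, or None if the evidence is excluded."""
    if not is_simulated:
        return 1.0  # direct benchmark results are fully trusted
    if confidence < MIN_CONFIDENCE:
        return None  # too uncertain to contribute at all
    return confidence * TRUST_SCALE


print(trust_for(False))      # direct result
print(trust_for(True, 0.2))  # rejected estimate
print(trust_for(True, 0.8))  # accepted but discounted estimate
```

Even an estimate with confidence 1.0 tops out at a trust of 0.75, so it can never weigh as much as a direct result with the same relevance.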
Real benchmark values always override estimates. Simulated values are never written into the main model data as real scores, and simulated values are not used as inputs to generate other simulated values.
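One way to picture the override rule is a merge step that builds a scoring view without touching the stored real data. The function name, the dictionary layout, and the benchmark names are hypothetical.

```python
def merge_scores(real: dict, simulated: dict) -> dict:
    """Build a scoring view where real values override simulated ones.

    Neither input is modified, so estimates never end up stored as real scores.
    """
    merged = {}
    for bench, value in simulated.items():
        merged[bench] = {"value": value, "simulated": True}
    for bench, value in real.items():
        merged[bench] = {"value": value, "simulated": False}  # real wins
    return merged


real = {"bench_a": 0.82}
simulated = {"bench_a": 0.75, "bench_b": 0.64}
view = merge_scores(real, simulated)
print(view)
```

Because every entry in the view keeps its `simulated` flag, a later estimation pass can filter on that flag and use only real values as inputs.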
For older model families with explicit thinking and non-thinking variants, RankedAGI records separate model entries when public benchmark data distinguishes those modes. If a benchmark source publishes a thinking-mode score, that score is recorded for the relevant thinking variant; if it publishes a non-reasoning or standard-mode score, that is recorded separately when the model naming makes the distinction clear.
RankedAGI records the public benchmark result as reported by the source. Reasoning traces are not scored separately unless the benchmark or source itself publishes a separate trace-level score.
Arena-style human preference benchmarks are useful signals, but they are not treated as a dominant penalty or top-priority input in composite scores. Their contribution is weighted by practical relevance and by how well the benchmark appears to translate to real-world model quality.
RankedAGI checks public Arena-style leaderboards weekly and updates changed scores when the source updates. Confidence intervals and sample sizes are not independently recalculated unless the source publishes them in a way that can be represented consistently on the site.
New public model data is generally updated within 24 hours of release, often within 5-6 hours when the release and benchmark sources are clear. Each model carries a last-updated value so freshness is visible on the site.
Models are listed under their public names, which can change over time. Duplicate-looking names can exist when a provider releases a new dated build, preview, thinking variant, or API-visible version with different benchmark behavior.
A suffix such as -old means the row represents an earlier model entry kept for comparison or historical continuity, not the preferred current listing. The older pattern was to move a replaced same-name model to an -old slug and keep the newer release at the main slug.
The newer standard is to add a date or version marker to the slug when it avoids ambiguity. Weight or size is added for open-model families that release multiple sizes under the same model name. If a model has only one relevant size and there is no naming ambiguity, the size is usually omitted from the slug.
Context-window values are provider- or developer-reported unless otherwise marked. They may represent API limits, product limits, or documented model-card limits depending on what the source publishes.
Users can compare models by filtering the model table and showing or hiding benchmark columns. These controls change the view in the browser; they do not change the underlying public data.
When source data changes or errors are found, RankedAGI updates the current public data rather than preserving a separate public revision log for every score. Per-score source attribution is planned as a larger provenance/data-model project.
Model pages usually include one or more global source links, such as an official release blog, model card, Hugging Face page, or public announcement. For third-party benchmarks, data is collected from the official leaderboard or result page for that benchmark, and many of those benchmark pages are linked from the benchmark index above.
No. RankedAGI does not currently provide a public API for programmatic access to model data. The public website is the supported interface for browsing, filtering, and comparing models.
There are no API rate limits because there is no public API. The site is static and crawlable through normal public pages, robots.txt, llms.txt, and the sitemap.
Not currently. RankedAGI does not yet offer CSV or JSON export for the full comparison table. Users can filter models and show or hide benchmark columns in the browser, but downloadable exports are not part of the current public site.
They are separate RankedAGI entries used to preserve release or size distinctions. A -old suffix marks an earlier entry kept for continuity, while a size marker such as 236B identifies a specific open-model weight when the same family has multiple sizes or naming would otherwise be ambiguous.
Not yet at the score level. Current model pages usually include model-level source links, and benchmark-specific third-party scores are collected from official benchmark leaderboards. Per-score source attribution is planned so each benchmark value can point to the exact provider, independent lab, model card, or leaderboard source used for that value.
A price-performance view is in progress so users can compare capability against API pricing without calculating every ratio manually.
Per-score source attribution is also planned. The goal is to link each benchmark value for each model to the specific source used for that value, rather than relying only on model-level and benchmark-level source links.