Research Arena is an automated evaluation protocol. Each episode selects two recent AI research papers from arXiv, runs five independent trials through three judge personas (Empiricist, Systems, and Skeptic), and computes a weighted verdict using a calibrated reducer. The full decision artefact is available for download below.
Three independent judge personas scored each paper on Novelty, Evidence, and Impact (0–10). Total is the normalised composite (0–1). Scores are shown as Paper A / Paper B and are taken from trial 1; the full trial data is available in the Proof-of-Decision JSON.
| Judge | Novelty (A / B) | Evidence (A / B) | Impact (A / B) | Total (A / B) | Confidence | Reliability |
|---|---|---|---|---|---|---|
| Empiricist · openai-mini | 8.00 / 8.00 | 8.00 / 8.00 | 9.00 / 7.00 | 0.83 / 0.77 | 0.90 | High |
| Systems · openai-4o | 8.00 / 7.00 | 7.50 / 8.00 | 8.00 / 7.00 | 0.78 / 0.73 | 0.90 | High |
| Skeptic · openai-o3-mini | 8.00 / 8.00 | 7.00 / 7.00 | 8.00 / 8.00 | 0.77 / 0.77 | 0.90 | High |

Empiricist · openai-mini

A: The integration of BEV representations with LLMs presents a significant advancement in autonomous driving technology, supported by strong experimental results.

B: The research presents a significant advancement in UAV obstacle avoidance with strong validation through simulations and real-world tests.

Systems · openai-4o

A: The integration of BEV with LLMs is a novel approach that shows strong evidence of improving autonomous driving systems.

B: Fly360 introduces a novel approach to UAV obstacle avoidance with strong validation and potential impact on UAV systems.

Skeptic · openai-o3-mini

A: BEVLM’s innovative combination of BEV and LLMs is promising, but the dramatic performance claims lack sufficient experimental detail to fully convince peer reviewers.

B: Compared to conventional forward-view counterparts, Fly360’s use of panoramic vision is notably novel and validated by both simulation and real-world tests, although its overall scalability and generalizability may still warrant further peer scrutiny.
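The page does not state how the normalised composite is computed. As a minimal sketch, the Total values for trial 1 are consistent with an unweighted mean of the three criterion scores divided by the maximum score of 10. That formula is an inference from the numbers shown, not a documented definition, and the names `SCORES` and `normalised_total` are illustrative only.

```python
# Trial 1 scores from the judge table, as (Paper A, Paper B) per criterion.
SCORES = {
    "Empiricist": {"novelty": (8.0, 8.0), "evidence": (8.0, 8.0), "impact": (9.0, 7.0)},
    "Systems": {"novelty": (8.0, 7.0), "evidence": (7.5, 8.0), "impact": (8.0, 7.0)},
    "Skeptic": {"novelty": (8.0, 8.0), "evidence": (7.0, 7.0), "impact": (8.0, 8.0)},
}

MAX_SCORE = 10.0  # upper end of the 0-10 scoring scale


def normalised_total(criteria, paper_index):
    """Assumed composite: unweighted mean of criterion scores, scaled to 0-1."""
    values = [pair[paper_index] for pair in criteria.values()]
    return sum(values) / (len(values) * MAX_SCORE)


# Per-judge totals as (Paper A, Paper B).
totals = {
    judge: (normalised_total(criteria, 0), normalised_total(criteria, 1))
    for judge, criteria in SCORES.items()
}

for judge, (total_a, total_b) in totals.items():
    print(f"{judge}: {total_a:.2f} / {total_b:.2f}")
```

Rounded to two decimals, this gives 0.83 / 0.77, 0.78 / 0.73, and 0.77 / 0.77, matching the Total column. If the real reducer weights the criteria unequally, the match on this one trial would be a coincidence, so the Proof-of-Decision JSON remains the authority.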
Five independent trials were run. Each trial ran the full judge panel independently.
| Trial | Winner | Margin | Confidence | Agreement |
|---|---|---|---|---|
| 1 | Paper A | 0.045 | 0.089 | ✓ Majority |
| 2 | Paper B | 0.020 | 0.040 | ✗ Minority |
| 3 | Paper A | 0.030 | 0.059 | ✓ Majority |
| 4 | Paper A | 0.022 | 0.044 | ✓ Majority |
| 5 | Paper A | 0.041 | 0.081 | ✓ Majority |
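The Agreement column can be reproduced by tallying the per-trial winners and marking each trial against the most frequent one. The sketch below uses the rows from the trial table. The variable names are illustrative, and the tie-breaking behaviour of `Counter.most_common` is an assumption that does not come into play with these five trials.

```python
from collections import Counter

# (trial, winner, margin) rows copied from the trial table.
TRIALS = [
    (1, "Paper A", 0.045),
    (2, "Paper B", 0.020),
    (3, "Paper A", 0.030),
    (4, "Paper A", 0.022),
    (5, "Paper A", 0.041),
]

# Count how many trials each paper won and pick the most frequent winner.
winner_counts = Counter(winner for _, winner, _ in TRIALS)
majority_winner, majority_votes = winner_counts.most_common(1)[0]
vote_share = majority_votes / len(TRIALS)

# A trial agrees with the majority when its winner is the majority winner.
agreement = {trial: winner == majority_winner for trial, winner, _ in TRIALS}

print(f"Majority winner: {majority_winner} ({majority_votes}/{len(TRIALS)} trials)")
for trial, agrees in agreement.items():
    print(f"Trial {trial}: {'Majority' if agrees else 'Minority'}")
```

With this data, Paper A wins four of the five trials and only trial 2 falls in the minority.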
This decision was computed using normalised scores, confidence weighting, and calibration-adjusted reliability across three independent judges. The reducer applies a weighted aggregation before producing a final margin and confidence estimate.
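The exact reducer is not specified on this page, so the following is only a sketch of one plausible weighted aggregation. It assumes each judge's weight is its confidence multiplied by a numeric reliability factor, and that the margin is the absolute difference between the weighted mean totals. The `RELIABILITY_FACTOR` mapping, the `JudgeVerdict` type, and `reduce_verdicts` are all hypothetical. The sketch does not attempt the final confidence estimate, and it is not expected to reproduce the margins reported in the trial table.

```python
from dataclasses import dataclass

# Hypothetical mapping from the Reliability label to a numeric factor.
RELIABILITY_FACTOR = {"High": 1.0, "Medium": 0.75, "Low": 0.5}


@dataclass
class JudgeVerdict:
    judge: str
    total_a: float      # normalised total for Paper A (0-1)
    total_b: float      # normalised total for Paper B (0-1)
    confidence: float   # judge confidence (0-1)
    reliability: str    # reliability label, e.g. "High"


def reduce_verdicts(verdicts):
    """Assumed reducer: confidence- and reliability-weighted mean of totals."""
    weights = [v.confidence * RELIABILITY_FACTOR[v.reliability] for v in verdicts]
    weight_sum = sum(weights)
    if weight_sum == 0:
        raise ValueError("at least one judge must carry non-zero weight")
    score_a = sum(w * v.total_a for w, v in zip(weights, verdicts)) / weight_sum
    score_b = sum(w * v.total_b for w, v in zip(weights, verdicts)) / weight_sum
    margin = abs(score_a - score_b)
    if score_a > score_b:
        winner = "Paper A"
    elif score_b > score_a:
        winner = "Paper B"
    else:
        winner = "Tie"
    return winner, margin


# Trial 1 totals, confidences, and reliability labels from the judge table.
verdicts = [
    JudgeVerdict("Empiricist", 0.83, 0.77, 0.90, "High"),
    JudgeVerdict("Systems", 0.78, 0.73, 0.90, "High"),
    JudgeVerdict("Skeptic", 0.77, 0.77, 0.90, "High"),
]

winner, margin = reduce_verdicts(verdicts)
print(f"{winner} wins with margin {margin:.3f}")
```

Because all three judges report the same confidence and reliability in trial 1, the weights cancel and this sketch reduces to a plain mean of the totals. The difference between that result and the reported trial 1 margin of 0.045 suggests the actual reducer applies further calibration that is only described in the Proof-of-Decision JSON.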
↓ Download Proof-of-Decision (JSON)