Research Arena is DefChat's automated research evaluation protocol. Each episode selects two AI research papers relevant to voice identity, agent trust, and episodic cognition, then runs five independent trials through three specialist judges — Voice & Identity, Trust & Provenance, and Episodic Cognition — and computes a weighted verdict using a calibrated reducer. The full decision artefact is available for download below.
Three independent judge personas scored each paper on Novelty, Evidence, and Impact (0–10). Total is the normalised composite (0–1). Scores are shown as Paper A / Paper B and are taken from trial 1; the full trial data is available in the Proof-of-Decision JSON.
| Judge | Novelty | Evidence | Impact | Total (A / B) | Confidence | Reliability |
|---|---|---|---|---|---|---|
| Voice_Identity (Voice & Identity · claude-haiku) | 2.00 / 1.00 | 4.00 / 2.00 | 1.00 / 0.00 | 0.23 / 0.10 | 0.95 | High |
| Trust_Provenance (Trust & Provenance · claude-sonnet) | 2.50 / 2.00 | 3.50 / 2.00 | 2.00 / 1.00 | 0.27 / 0.17 | 0.82 | High |
| Episodic_Cognition (Episodic Cognition · openai-4o) | 2.00 / 6.00 | 3.00 / 5.00 | 2.00 / 5.00 | 0.23 / 0.53 | 0.85 | High |

Voice_Identity

A: This paper analyzes AI policy consultation letters using standard topic modeling but has no connection to voice identity, speaker verification, synthetic voice detection, voice cloning, biometric rights, or consent architecture for voice data, the core domains of voice identity research.

B: This paper addresses behavioral cloning and multimodal policy learning in reinforcement learning, which is entirely outside the domain of voice identity, speaker verification, voice authentication, synthetic voice detection, voice cloning, biometric rights, or consent architecture for voice data.

Trust_Provenance

A: This is a policy discourse analysis with no mechanism, protocol, or framework for trust or provenance in AI systems, making it largely irrelevant to trustworthy AI infrastructure despite touching on AI governance themes.

B: This paper addresses imitation learning multimodality, which has no meaningful connection to trust infrastructure, provenance chains, agent accountability, or AI auditability, making it essentially irrelevant to the evaluation domain.

Episodic_Cognition

A: The paper offers a sociopolitical analysis rather than advancing memory or cognition in AI systems.

B: The paper provides an incremental improvement in understanding multimodal policy parameterizations but lacks strong empirical validation across diverse tasks.

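The Total column can be checked against the three criterion scores. The page does not state the formula behind the normalised composite, so the sketch below assumes an equal-weight mean of Novelty, Evidence, and Impact rescaled from 0–10 to 0–1. That assumption happens to reproduce all six trial 1 totals in the table. The names `normalised_total`, `judge_totals`, and `TRIAL_1_SCORES` are illustrative and are not taken from DefChat.

```python
# Assumption: Total = mean of the three 0-10 scores, rescaled to 0-1.
# The page does not state the formula; this one matches the trial 1 table.

def normalised_total(novelty: float, evidence: float, impact: float,
                     max_score: float = 10.0) -> float:
    """Equal-weight composite of the three criteria on a 0-1 scale."""
    return (novelty + evidence + impact) / (3 * max_score)


# Trial 1 scores as (novelty, evidence, impact) for Paper A and Paper B.
TRIAL_1_SCORES = {
    "Voice_Identity": {"A": (2.00, 4.00, 1.00), "B": (1.00, 2.00, 0.00)},
    "Trust_Provenance": {"A": (2.50, 3.50, 2.00), "B": (2.00, 2.00, 1.00)},
    "Episodic_Cognition": {"A": (2.00, 3.00, 2.00), "B": (6.00, 5.00, 5.00)},
}


def judge_totals(scores):
    """Return {judge: {paper: total rounded to two decimals}}."""
    return {
        judge: {paper: round(normalised_total(*triple), 2)
                for paper, triple in papers.items()}
        for judge, papers in scores.items()
    }


if __name__ == "__main__":
    for judge, totals in judge_totals(TRIAL_1_SCORES).items():
        print(f"{judge}: {totals['A']:.2f} / {totals['B']:.2f}")
```

If the real composite weights the criteria unequally, only `normalised_total` would need to change; the rest of the sketch just applies it per judge and per paper.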
Five independent evaluations (trials) were run, each with the full judge panel.
| Trial | Winner | Margin | Confidence | Agreement |
|---|---|---|---|---|
| 1 | Paper A | 0.018 | 0.028 | ✗ Minority |
| 2 | Paper B | 0.068 | 0.109 | ✓ Majority |
| 3 | Paper B | 0.128 | 0.205 | ✓ Majority |
| 4 | Paper A | 0.047 | 0.074 | ✗ Minority |
| 5 | Paper B | 0.029 | 0.047 | ✓ Majority |
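The Agreement column appears to mark whether a trial's winner matches the most frequent winner across all trials. The sketch below tallies the trial table under that assumption; the rule itself is not stated on the page, and the function names are illustrative. Tie handling is also unspecified, so the sketch simply takes the first most-common winner, which cannot tie with five trials and two papers.

```python
from collections import Counter

# (trial number, winner, margin) copied from the trial table.
TRIALS = [
    (1, "Paper A", 0.018),
    (2, "Paper B", 0.068),
    (3, "Paper B", 0.128),
    (4, "Paper A", 0.047),
    (5, "Paper B", 0.029),
]


def majority_winner(trials):
    """Return (winner, votes) for the paper that wins the most trials."""
    counts = Counter(winner for _, winner, _ in trials)
    return counts.most_common(1)[0]


def agreement_labels(trials):
    """Label each trial by whether its winner matches the majority winner."""
    winner, _ = majority_winner(trials)
    return {
        number: "Majority" if trial_winner == winner else "Minority"
        for number, trial_winner, _ in trials
    }


if __name__ == "__main__":
    winner, votes = majority_winner(TRIALS)
    print(f"{winner} wins {votes} of {len(TRIALS)} trials")
    for number, label in agreement_labels(TRIALS).items():
        print(f"trial {number}: {label}")
```

Under this reading, Paper B takes three of the five trials, and the labels for trials 1 to 5 match the Agreement column.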
This decision was computed using normalised scores, confidence weighting, and calibration-adjusted reliability across three independent judges. The reducer applies a weighted aggregation before producing a final margin and confidence estimate.
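One way to picture a weighted aggregation of this kind is a confidence-weighted mean of the judges' normalised totals. The sketch below is only an illustration under stated assumptions: the weight is taken to be confidence multiplied by a reliability factor, and the `RELIABILITY_FACTOR` mapping, the `JudgeResult` type, and `reduce_verdict` are all hypothetical. The actual weights, calibration adjustment, and confidence formula are not given here, so the confidence estimate is left out.

```python
from dataclasses import dataclass

# Hypothetical mapping from the Reliability label to a weight multiplier.
RELIABILITY_FACTOR = {"High": 1.0, "Medium": 0.75, "Low": 0.5}


@dataclass
class JudgeResult:
    judge: str
    total_a: float      # normalised total for Paper A (0-1)
    total_b: float      # normalised total for Paper B (0-1)
    confidence: float   # judge confidence (0-1)
    reliability: str    # "High", "Medium", or "Low"


def reduce_verdict(results):
    """Confidence-weighted mean per paper, then winner and margin.

    Assumption: weight = confidence * reliability factor. The real
    reducer's weighting and calibration are not described on the page.
    """
    weights = [r.confidence * RELIABILITY_FACTOR[r.reliability] for r in results]
    weight_sum = sum(weights)
    score_a = sum(w * r.total_a for w, r in zip(weights, results)) / weight_sum
    score_b = sum(w * r.total_b for w, r in zip(weights, results)) / weight_sum
    margin = abs(score_a - score_b)
    if score_a == score_b:
        return "Tie", margin
    return ("Paper A" if score_a > score_b else "Paper B"), margin


# Trial 1 totals, confidences, and reliability labels from the judge table.
TRIAL_1 = [
    JudgeResult("Voice_Identity", 0.23, 0.10, 0.95, "High"),
    JudgeResult("Trust_Provenance", 0.27, 0.17, 0.82, "High"),
    JudgeResult("Episodic_Cognition", 0.23, 0.53, 0.85, "High"),
]

if __name__ == "__main__":
    winner, margin = reduce_verdict(TRIAL_1)
    print(f"{winner} by {margin:.3f}")
```

Run on the trial 1 figures, this sketch favours Paper B, unlike the published trial 1 row, so the calibrated reducer evidently applies weights or adjustments beyond what this page describes. The Proof-of-Decision JSON is the place to check the real computation.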
↓ Download Proof-of-Decision (JSON)