Leaderboard
Full rankings across all datasets and models.
Survey Parity Score (SPS) measures how closely AI-generated survey responses match real human opinion distributions. 1.0 = perfect match.
The default view hides configurations with fewer than 3 runs on fewer than 2 datasets. Toggle “All variants” to see every run.
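The wording of the default-view rule leaves open how the two thresholds combine. As a minimal sketch, assuming a configuration is hidden only when both counts fall below their thresholds, the filter could look like this. `Config`, `shown_by_default`, and the field names are hypothetical and are not taken from the leaderboard's code.

```python
from dataclasses import dataclass


@dataclass
class Config:
    name: str
    runs: int      # total number of runs recorded for this configuration
    datasets: int  # number of distinct datasets it was evaluated on


def shown_by_default(cfg: Config, min_runs: int = 3, min_datasets: int = 2) -> bool:
    """Return True if the configuration would appear in the default view.

    Assumption: a configuration is hidden only when it has fewer than
    `min_runs` runs AND covers fewer than `min_datasets` datasets.
    """
    hidden = cfg.runs < min_runs and cfg.datasets < min_datasets
    return not hidden


if __name__ == "__main__":
    for cfg in (Config("sparse", 1, 1), Config("many-runs", 5, 1), Config("broad", 1, 3)):
        print(cfg.name, shown_by_default(cfg))
```

If the intended rule is instead "hidden when either threshold is missed", replacing `and` with `or` gives that stricter reading.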
| Rank | Model | Tag | Status | Dataset | Type | SPS | | p_dist | p_rank | p_refuse | n | $/100Q | 95% CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | SynthPanel (GPT-4o-mini) | conditioned | ✓ verified | globalopinionqa | product | 0.786 | — | 0.689 | 0.694 | 0.976 | 100 | — | [0.646, 0.730] |
| 2 | Gemini 2.5 Flash | | ⚠ flagged | globalopinionqa | raw | 0.770 | — | 0.687 | 0.645 | 0.980 | 100 | — | [0.614, 0.708] |
| 3 | Llama 3.3 70B | | ✓ verified | globalopinionqa | raw | 0.762 | — | 0.635 | 0.672 | 0.980 | 100 | — | [0.607, 0.695] |
| 4 | SynthPanel (Gemini Flash Lite) | conditioned | ⚠ flagged | globalopinionqa | product | 0.762 | — | 0.687 | 0.624 | 0.974 | 100 | — | [0.605, 0.701] |
| 5 | GPT-4o-mini | | ✓ verified | globalopinionqa | raw | 0.749 | — | 0.633 | 0.648 | 0.966 | 100 | — | [0.591, 0.683] |
| 6 | SynthPanel Ensemble (3-model) | ensemble | ⚠ flagged | globalopinionqa | product | 0.747 | — | 0.807 | 0.687 | 0.000 | 0 | — | [0.705, 0.789] |
| 7 | Claude Sonnet 4.6 | | ✓ verified | globalopinionqa | raw | 0.738 | — | 0.593 | 0.642 | 0.980 | 100 | $3.27 | [0.562, 0.667] |
| 8 | Claude Haiku 4.5 | | ⚠ flagged | globalopinionqa | raw | 0.726 | — | 0.601 | 0.598 | 0.980 | 100 | — | [0.540, 0.650] |
| 9 | SynthPanel (Haiku 4.5) | conditioned | ✓ verified | globalopinionqa | product | 0.725 | — | 0.666 | 0.628 | 0.881 | 100 | — | [0.594, 0.692] |
| 10 | Random Baseline | baseline | ⚠ flagged | globalopinionqa | baseline | 0.710 | — | 0.747 | 0.399 | 0.983 | 10 | — | [0.481, 0.680] |
| 11 | Majority Baseline | baseline | ✓ verified | globalopinionqa | baseline | 0.690 | — | 0.534 | 0.555 | 0.980 | 100 | — | [0.493, 0.592] |
| 1 | SynthPanel Ensemble (3-model) | ensemble | ✓ verified | opinionsqa | product | 0.835 | — | 0.833 | 0.837 | 0.000 | 0 | — | [0.827, 0.843] |
| 2 | Gemini 2.5 Flash | | ✓ verified | opinionsqa | raw | 0.829 | — | 0.738 | 0.761 | 0.990 | 684 | — | [0.736, 0.760] |
| 3 | SynthPanel (Sonnet 4) | conditioned | ✓ verified | opinionsqa | product | 0.829 | — | 0.726 | 0.793 | 0.968 | 684 | — | [0.749, 0.770] |
| 4 | SynthPanel (Haiku 4.5) | conditioned | ✓ verified | opinionsqa | product | 0.829 | — | 0.736 | 0.795 | 0.956 | 684 | — | [0.755, 0.777] |
| 5 | SynthPanel (GPT-4o-mini) | conditioned | ✓ verified | opinionsqa | product | 0.823 | — | 0.708 | 0.778 | 0.982 | 684 | — | [0.732, 0.753] |
| 6 | Llama 3.3 70B | | ✓ verified | opinionsqa | raw | 0.819 | — | 0.693 | 0.774 | 0.990 | 684 | — | [0.723, 0.743] |
| 7 | SynthPanel (Gemini Flash Lite) | conditioned | ✓ verified | opinionsqa | product | 0.816 | — | 0.749 | 0.766 | 0.933 | 684 | — | [0.745, 0.767] |
| 8 | Claude Haiku 4.5 | | ✓ verified | opinionsqa | raw | 0.815 | — | 0.690 | 0.767 | 0.990 | 684 | — | [0.716, 0.739] |
| 9 | GPT-4o | | ✓ verified | opinionsqa | raw | 0.813 | — | 0.726 | 0.720 | 0.993 | 200 | — | [0.698, 0.746] |
| 10 | GPT-4o-mini | | ✓ verified | opinionsqa | raw | 0.813 | — | 0.686 | 0.762 | 0.990 | 684 | — | [0.712, 0.735] |
| 11 | Claude Sonnet 4 | | ✓ verified | opinionsqa | raw | 0.782 | — | 0.648 | 0.710 | 0.990 | 684 | — | [0.663, 0.694] |
| 12 | Random Baseline | baseline | ✓ verified | opinionsqa | baseline | 0.763 | — | 0.806 | 0.493 | 0.990 | 684 | — | [0.638, 0.661] |
| 13 | Majority Baseline | baseline | ✓ verified | opinionsqa | baseline | 0.705 | — | 0.508 | 0.616 | 0.990 | 684 | — | [0.546, 0.577] |
| 14 | Gemini Flash Lite | | ✓ verified | opinionsqa | raw | 0.000 | — | 0.000 | 0.000 | 0.000 | 100 | — | [0.000, 0.000] |
| 1 | SynthPanel Ensemble (3-model) | ensemble | ✓ verified | subpop | product | 0.833 | — | 0.871 | 0.795 | 0.000 | 0 | — | [0.817, 0.848] |
| 2 | SynthPanel (Gemini Flash Lite) | conditioned | ✓ verified | subpop | product | 0.821 | — | 0.707 | 0.780 | 0.976 | 200 | — | [0.724, 0.763] |
| 3 | SynthPanel (Haiku 4.5) | conditioned | ✓ verified | subpop | product | 0.809 | — | 0.712 | 0.757 | 0.958 | 200 | — | [0.715, 0.755] |
| 4 | Llama 3.3 70B | | ✓ verified | subpop | raw | 0.796 | — | 0.655 | 0.756 | 0.976 | 200 | — | [0.683, 0.726] |
| 5 | SynthPanel (GPT-4o-mini) | conditioned | ✓ verified | subpop | product | 0.787 | — | 0.652 | 0.733 | 0.976 | 200 | — | [0.671, 0.713] |
| 6 | Gemini 2.5 Flash | | ✓ verified | subpop | raw | 0.783 | — | 0.669 | 0.698 | 0.980 | 100 | — | [0.643, 0.718] |
| 7 | GPT-4o-mini | | ⚠ flagged | subpop | raw | 0.770 | — | 0.628 | 0.702 | 0.980 | 100 | — | [0.628, 0.697] |
| 8 | Claude Haiku 4.5 | | ✓ verified | subpop | raw | 0.768 | — | 0.616 | 0.713 | 0.976 | 200 | — | [0.638, 0.690] |
| 9 | Random Baseline | baseline | ✓ verified | subpop | baseline | 0.757 | — | 0.816 | 0.481 | 0.976 | 200 | — | [0.627, 0.669] |
| 10 | Majority Baseline | baseline | ✓ verified | subpop | baseline | 0.673 | — | 0.467 | 0.576 | 0.976 | 200 | — | [0.494, 0.547] |
Sub-Metric Radar
Radar plot comparing the top 3 models on the SPS sub-metrics: distribution accuracy (p_dist), rank correlation (p_rank), and refusal match (p_refuse). A larger polygon is better.
Demographic Parity Heatmap
Models × demographic groups, colored by p_dist (distribution similarity; higher means a closer match to the conditioned subpopulation). Use the selector to drill into a specific attribute.
Coverage flag derived from n_questions:
high (≥100)
medium (50–99)
low (<50)
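The three coverage bands above map directly onto a small threshold function. This is a sketch only: the function name `coverage_flag` is hypothetical, while the cut-offs (100 and 50) and the labels come from the legend.

```python
def coverage_flag(n_questions: int) -> str:
    """Map a question count to the coverage band used in the heatmap legend."""
    if n_questions >= 100:
        return "high"
    if n_questions >= 50:
        return "medium"
    return "low"


if __name__ == "__main__":
    for n in (684, 100, 99, 50, 49, 10):
        print(n, coverage_flag(n))
```

Checking from the highest band downward keeps each boundary in exactly one band, so 100 is "high" and 50 is "medium".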
SPS by Model
Dot plot of Survey Parity Score per model; horizontal whiskers show 95% confidence intervals. Higher is better.
Per-Metric Breakdown
Grouped dot plot of SPS and its component metrics (p_dist, p_rank, and p_refuse) side by side for every model. For all metrics, higher is better.
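The page does not state how the component metrics combine into SPS. The leaderboard values are, however, consistent with an unweighted mean of the reported components, with the ensemble rows (p_refuse 0.000, n 0) matching a mean of p_dist and p_rank alone. The sketch below encodes that observation; it is an inference from the published numbers, not a documented formula, and `sps_from_components` is a hypothetical name.

```python
from typing import Optional


def sps_from_components(p_dist: float, p_rank: float,
                        p_refuse: Optional[float] = None) -> float:
    """Unweighted mean of the available components.

    Assumption: a missing refusal score is passed as None and left out
    of the average, which matches the ensemble rows in the leaderboard.
    """
    parts = [p for p in (p_dist, p_rank, p_refuse) if p is not None]
    return sum(parts) / len(parts)


if __name__ == "__main__":
    # SynthPanel (GPT-4o-mini) on globalopinionqa; the leaderboard lists SPS 0.786.
    print(round(sps_from_components(0.689, 0.694, 0.976), 3))
    # SynthPanel Ensemble (3-model) on globalopinionqa; the leaderboard lists SPS 0.747.
    print(round(sps_from_components(0.807, 0.687), 3))
```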
Confidence Intervals
Each row shows a model's point-estimate SPS (center dot) and its 95% confidence interval (whiskers).
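The page does not say how the 95% intervals are computed. One common option for a mean score is a percentile bootstrap over per-question scores, and the sketch below assumes that approach purely for illustration. `bootstrap_ci`, its parameters, and the idea of a per-question score list are all hypothetical.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for the mean of `scores`.

    Returns (point_estimate, lower_bound, upper_bound).
    Assumption: `scores` holds one score per question for a single model.
    """
    rng = random.Random(seed)  # fixed seed so repeated calls give the same interval
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(scores), lower, upper


if __name__ == "__main__":
    # Synthetic per-question scores, used only to exercise the function.
    demo_scores = [0.5 + 0.004 * i for i in range(100)]
    point, lower, upper = bootstrap_ci(demo_scores)
    print(f"{point:.3f} [{lower:.3f}, {upper:.3f}]")
```

With few questions the interval widens, which is one reason a row's whiskers should be read alongside its question count.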