Can AI Pass the Spanish Bar Exam?

Abstract: We tested 4 AI models on the official 2024 Spanish bar exam. 91 questions, 5 areas of law. GPT-5.4 scored 98.9%, Gemini 97.8%, Opus 95.6%, local Gemma 76.9%. Every frontier error was a knowledge gap, not a reasoning failure.

Can AI Pass the Spanish Bar Exam? A Multi-Model Evaluation on the 2024 Examen de Acceso a la Abogacia

Solarient — NOUMENTS Project — April 2026

We tested four AI models on the official 2024 Spanish bar exam — 91 questions, five areas of law, all in Spanish, no tools. GPT-5.4 scored 98.9% (one error). Gemini 3.1 Pro scored 97.8% (two errors). Claude Opus 4.6 scored 95.6%. A local 31B model barely passed at 76.9%. Two performance tiers among frontier models. One critical finding: the errors cluster in procedural-law disambiguation, and every frontier error is a knowledge gap, not a reasoning failure.

This post describes the methodology, the multi-model results, and the context — including why most published LLM-on-professional-exam evaluations are less rigorous than they appear, and what we did differently.

THE PROBLEM: CIRCULAR VALIDATION IN AGENT EVALUATION

In March 2026, we ran internal evaluations of our consultant agents — a legal advisor (lawer), a tax specialist (fiskaler), a physician (dokter), and others. We wrote the questions ourselves, defined the grading criteria ourselves, and scored the answers ourselves. Lawer scored 10 out of 10. So did most of the others.

This is circular validation: a student who writes and grades their own exam will always pass. The phenomenon is well documented in the broader LLM evaluation literature. Recent research on LLM-as-a-Judge frameworks has shown that models systematically favor their own outputs, a "self-enhancement bias", and work presented at NeurIPS 2024 reported a linear correlation between a model's ability to recognize its own outputs and the strength of its self-preference. When an AI system evaluates itself, you get Goodhart's Law in miniature: the evaluation metric stops measuring what it claims to measure.

We needed an external benchmark — questions we did not write, with answer keys we did not create, testing competence at a level defined by someone other than us.

THE INSTRUMENT: EXAMEN DE ACCESO A LA ABOGACIA 2024

The Examen de Acceso a la Abogacia is the official professional aptitude exam that every law graduate in Spain must pass to practice as an attorney. It is published annually by the Ministerio de Justicia and administered through the UNED's AvEX platform. Approximately 6,443 candidates sit each convocation, with historical pass rates around 77-80%.

We used the 2024 "Plantilla Definitiva de Respuestas" — the final official answer key.

Exam structure (149 scorable questions):

Materias Comunes: 50 questions (deontology, procedure, ethics)
Civil-Mercantil: 25 questions (contracts, property, commercial)
Penal: 24 questions* (penalties, procedure, juvenile)
Administrativo: 25 questions (admin procedure, judicial review)
Laboral: 25 questions (employment, social security)

*1 question annulled by the Ministry.

Each question: 4 options (a, b, c, d), one correct. Professional difficulty. Entirely in Spanish. Grounded in current statutory law.

This is a CIVIL-LAW exam — it tests statutory recall and rule application, not reasoning from precedent. Most published LLM legal benchmarks (GPT-4 on the US UBE, Harvey BigLaw Bench, LegalBench) operate within common-law traditions. Different cognitive demands. LEXam (ICLR 2026, 340 exams across English and German) does not include Spanish exams. We are not aware of any published LLM evaluation on the Spanish Abogacia exam.

CONTAMINATION PREVENTION

We partitioned all 149 questions into two pools using a deterministic hash: SHA-256 of the key "abogacia-2024-{section}-q{num}-seed42". The first 8 hex characters of the digest are read as an integer and divided by 0xFFFFFFFF, giving a value in [0, 1]. Questions with values < 0.60 enter the eval pool; the rest enter the learning pool.

Result: 91 eval (61.1%), 58 learning (38.9%). Deterministic and immutable.
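
As a rough illustration, the partition rule can be sketched in Python. The section slugs, the 1-based question numbering, and the UTF-8 encoding below are assumptions (the post gives only the key template), so this sketch is not expected to reproduce the exact 91/58 split.

```python
import hashlib

# ASSUMPTION: section slugs and per-section numbering are illustrative only.
# Penal is numbered 1..24 here; the annulled question's real number is unknown.
SECTION_SIZES = {
    "comunes": 50,
    "civil-mercantil": 25,
    "penal": 24,
    "administrativo": 25,
    "laboral": 25,
}


def hash_value(section: str, num: int) -> float:
    """Map a question key to a number in [0, 1] via SHA-256."""
    key = f"abogacia-2024-{section}-q{num}-seed42"
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    # First 8 hex characters = 32 bits, scaled by the largest 32-bit value.
    return int(digest[:8], 16) / 0xFFFFFFFF


def assign_pool(section: str, num: int, threshold: float = 0.60) -> str:
    """Return 'eval' for values below the threshold, else 'learning'."""
    return "eval" if hash_value(section, num) < threshold else "learning"


pools = {"eval": [], "learning": []}
for section, size in SECTION_SIZES.items():
    for num in range(1, size + 1):
        pools[assign_pool(section, num)].append((section, num))

print(len(pools["eval"]), "eval /", len(pools["learning"]), "learning")
```

Because the assignment depends only on the key string, re-running the script always yields the same pools, which is what makes the split immutable once the key template is fixed.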

The official PDF marks correct answers in red. We manually transcribed each eval-pool question into clean text files with no color marking, no visual cues.

MODELS UNDER EVALUATION

Four models, all tested from parametric knowledge — no tool access, no web search, no retrieval augmentation.

Model	Provider	Prompt	Local?
GPT-5.4	OpenAI	Neutral	No
Gemini 3.1 Pro	Google	Neutral	No
Claude Opus 4.6	Anthropic	Legal agent	No
Gemma 4 31B	Google	Neutral	Yes

Claude Opus 4.6 was tested with a specialized legal consultant system prompt — the agent AS DEPLOYED. The other three models received neutral exam-taking instructions — BASE MODEL CAPABILITY. Gemma 4 31B ran locally on Apple Silicon via Ollama, Q4-quantized at 17GB.

RESULTS

Model	Correct	Answered	Total	Accuracy
GPT-5.4	90	91	91	98.9%
Gemini 3.1 Pro	89	91	91	97.8%
Claude Opus 4.6	87	91	91	95.6%
Gemma 4 31B	70	89	91	76.9%
Human pass rate	—	—	149	~78%
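
The accuracy column divides correct answers by all 91 eval questions, not by the number answered. A minimal sketch of that arithmetic, using only the counts from the table (the function name is ours, for illustration):

```python
TOTAL_QUESTIONS = 91

# (correct, answered) per model, copied from the results table.
counts = {
    "GPT-5.4": (90, 91),
    "Gemini 3.1 Pro": (89, 91),
    "Claude Opus 4.6": (87, 91),
    "Gemma 4 31B": (70, 89),
}


def accuracy_pct(correct: int, denominator: int) -> float:
    """Percentage of correct answers over the chosen denominator."""
    return 100.0 * correct / denominator


for model, (correct, answered) in counts.items():
    on_total = accuracy_pct(correct, TOTAL_QUESTIONS)
    on_answered = accuracy_pct(correct, answered)
    print(f"{model}: {on_total:.1f}% of all questions, "
          f"{on_answered:.1f}% of answered questions")
```

The two denominators differ only for Gemma, which left two questions unanswered: 76.9% over all 91 questions versus 78.7% over the 89 it answered.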

TWO PERFORMANCE TIERS AMONG FRONTIER MODELS:

FRONTIER (~98%): GPT-5.4 and Gemini 3.1 Pro — near-ceiling, separated by one question
FRONTIER (~96%): Opus — strong but with more procedural-law blind spots
LOCAL (~77%): Gemma, numerically just under the ~78% human pass rate (which is a share of candidates passing, not an accuracy score)

The gap between GPT-5.4 (98.9%) and Gemini 3.1 Pro (97.8%) is a single question. Both are effectively at ceiling for parametric legal knowledge. Opus trails them by two to three questions: measurably behind the top two, but still far above any passing threshold.

BY SECTION:

Section	GPT-5.4	Gemini	Opus	Gemma
Comunes (32)	96.9%	100.0%	96.9%	87.5%
Civil-Mercantil (16)	100.0%	100.0%	93.8%	80.0%
Penal (14)	100.0%	100.0%	92.9%	78.6%
Administrativo (11)	100.0%	100.0%	100.0%	90.0%
Laboral (18)	100.0%	88.9%	94.4%	55.6%

Gemini 3.1 Pro achieves 100% in four of five sections (comunes, civil-mercantil, penal, and administrativo); its two errors are both in laboral. GPT-5.4 also achieves 100% in four sections, with its single error in comunes. Laboral is the weakest section for Gemini 3.1 Pro (88.9%) and for Gemma, which collapses to 55.6%. Opus's four errors are spread one each across comunes, civil-mercantil, penal, and laboral.

Notable: Gemini 3.1 Pro is the only model to score 100% on comunes — the broadest section covering deontology, procedure, and ethics. GPT-5.4's single error (Q17) falls precisely in this section.

ERROR ANALYSIS

GPT-5.4 (1 error)

One error: comunes Q17. The question asks how a "cuestion incidental de especial pronunciamiento" is resolved. GPT answered (c) "mediante auto." The correct answer is (a) "en la sentencia definitiva." This is a precise statutory disambiguation in art. 393 LEC — see Appendix A for full analysis.

GEMINI 3.1 PRO (2 errors)

Question	Subdomain	Error type
laboral Q12	procesal-laboral	Missed that oral reposicion is the only recourse against proof-admissibility rulings
laboral Q21	seguridad-social	Partially correct (no return obligation) but missed right to remaining unpaid benefits during appeal

Both errors are partial-knowledge: the model knew the relevant article but stopped one subsection short of the complete rule. See Appendix A.

Notable improvement over Gemini 2.5 Pro, which scored 94.5% with 5 errors. Gemini 3.1 Pro gained 3.3 percentage points, fixing its predecessor's errors on comunes Q17 (the diagnostic question), penal Q5 (juvenile appeal route), and laboral Q7 (monitorio reform). It now correctly distinguishes especial from previo pronunciamiento — a distinction that 2.5 Pro failed.

CLAUDE OPUS 4.6 (4 errors)

Question	Subdomain	Error type
comunes Q4	deontologia	Missed that non-practicing bar membership requires no criminal record for serious offenses
civil-mercantil Q14	procesal-civil	Believed oral sentences are possible in ordinary civil proceedings (they are never permitted)
penal Q1	penas-medidas	Inverted the favor rei rule — said ambiguous severity defaults to "menos grave" when it defaults to "leve"
laboral Q21	seguridad-social	Partially correct (no return obligation) but missed right to remaining unpaid benefits during appeal

All four are knowledge gaps. In penal Q1, Opus cited the correct article (art. 13.4 CP) but got the rule backwards — a training data issue, not a reasoning failure.

CROSS-MODEL ERROR PATTERNS

Laboral Q21 is the shared failure. Three of the four models (Opus, Gemini 3.1 Pro, and Gemma) chose the same wrong answer (b): that Ricardo keeps benefits already received but has no right to further payment. The correct answer (c) grants both protections: no return obligation AND right to remaining unpaid benefits accrued during appeal. The models all knew art. 295.2 LRJS (no return) but missed art. 295.3 (right to remaining benefits). Only GPT-5.4 held both subsections.

Comunes Q17 remains diagnostic but the pattern has shifted. In the Gemini 2.5 Pro evaluation, three of four models got Q17 wrong (GPT, Gemini, Gemma). With Gemini 3.1 Pro, only two of four get it wrong (GPT, Gemma). Both Opus and Gemini 3.1 Pro correctly distinguish especial pronunciamiento (sentencia) from previo pronunciamiento (auto) — evidence that the latest frontier models have internalized this fine statutory distinction.
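
The cross-model pattern can be checked mechanically. The sketch below transcribes, from the Appendix A tables, which models missed each of the six questions that produced a frontier error, then looks for questions missed by more than one frontier model. Variable names are illustrative.

```python
FRONTIER = {"GPT-5.4", "Gemini 3.1 Pro", "Opus"}

# Models that answered each question incorrectly (from Appendix A).
wrong_by_question = {
    "comunes Q17": {"GPT-5.4", "Gemma"},
    "comunes Q4": {"Opus", "Gemma"},
    "civil-mercantil Q14": {"Opus", "Gemma"},
    "penal Q1": {"Opus", "Gemma"},
    "laboral Q12": {"Gemini 3.1 Pro", "Gemma"},
    "laboral Q21": {"Opus", "Gemini 3.1 Pro", "Gemma"},
}

# Questions where at least two frontier models failed.
shared_frontier_failures = {
    question: sorted(models & FRONTIER)
    for question, models in wrong_by_question.items()
    if len(models & FRONTIER) >= 2
}

# Total frontier errors across the six questions.
frontier_error_count = sum(
    len(models & FRONTIER) for models in wrong_by_question.values()
)

print(shared_frontier_failures)
print("frontier errors:", frontier_error_count)
```

Only laboral Q21 is missed by more than one frontier model; the other five questions each trip exactly one of them, and the seven frontier errors match the 1 + 2 + 4 reported per model.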

LOCAL MODEL (Gemma 4 31B)

21 of 91 questions missed across all five sections: 19 wrong answers plus 2 left unanswered. Laboral alone accounts for 8 errors (55.6% accuracy). Factual confusion alongside procedural recall failures. The model has legal vocabulary but lacks precise statutory knowledge.

THE CALIBRATION FINDING

This was the central question — not "how well does the agent score?" but "how honest were our own evaluations?"

Metric	Value
Self-authored score (Mar 2026)	100% (n=10)
External score (Apr 2026)	95.6% (n=91)
Gap	4.4 pp
Threshold (calibrated)	<10 pp
Verdict	CALIBRATED

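The 4.4 pp gap is within our 10 pp threshold, which suggests our self-authored evaluations were not substantially inflating competence, although with only 10 self-authored questions the comparison is coarse. And we only know that because we checked.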
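
The verdict in the table reduces to a one-line comparison. A minimal sketch, assuming the gap is defined as self-authored score minus external score and that any gap under the threshold counts as calibrated (the function name and the "NOT CALIBRATED" label are ours):

```python
def calibration_verdict(self_pct: float, external_pct: float,
                        threshold_pp: float = 10.0) -> tuple[float, str]:
    """Return the gap in percentage points and a verdict label."""
    gap_pp = round(self_pct - external_pct, 1)
    # ASSUMPTION: a negative gap (external score higher) also counts as calibrated.
    label = "CALIBRATED" if gap_pp < threshold_pp else "NOT CALIBRATED"
    return gap_pp, label


gap, verdict = calibration_verdict(100.0, 95.6)
print(f"gap = {gap} pp, verdict = {verdict}")
```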

SCIENTIFIC CONTEXT

GPT-4 on the US Bar Exam: Katz et al. (2024) reported ~297 on the UBE, initially claiming ~90th percentile. Martinez (2024) argued this was inflated: against first-time takers, GPT-4 placed at approximately the 62nd percentile overall and the 42nd percentile on essays.

LEXam (ICLR 2026): 340 law exams across English and German. Found that LLMs "notably struggle with open questions that require structured, multi-step legal reasoning" while performing better on MC. Our MC-only evaluation captures this strongest dimension.

Harvey BigLaw Bench: GPT-5 scored 89.22%; Claude Opus 4.6 scored 90.2%. BigLaw Bench tests generation quality; our evaluation tests recognition accuracy.

MedQA: On USMLE-derived questions, o1-preview reached 96%, GPT-5 achieved 95.84%. Our top scores (98.9%, 97.8%) on Spanish legal MC are numerically higher, though the two benchmarks test different domains and are not directly comparable.

Human comparison: Under negative marking (+1 correct, -1/3 incorrect), the best model's net score would be 90 - 0.33 = 89.67 out of 91 — far above any passing threshold.
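
The same negative-marking arithmetic can be applied to all four models. This sketch uses the counts from the results table and assumes unanswered questions score zero, which the post does not state explicitly.

```python
from fractions import Fraction

PENALTY = Fraction(1, 3)  # points deducted per incorrect answer

# (correct, answered) per model, from the results table.
counts = {
    "GPT-5.4": (90, 91),
    "Gemini 3.1 Pro": (89, 91),
    "Claude Opus 4.6": (87, 91),
    "Gemma 4 31B": (70, 89),
}


def net_score(correct: int, answered: int) -> Fraction:
    """+1 per correct answer, -1/3 per wrong answer.

    ASSUMPTION: blank answers contribute 0.
    """
    wrong = answered - correct
    return correct - PENALTY * wrong


for model, (correct, answered) in counts.items():
    print(f"{model}: {float(net_score(correct, answered)):.2f} / 91")
```

Under these assumptions the net scores come out at 89.67, 88.33, 85.67, and 63.67 out of 91 respectively; the penalty barely moves the frontier models but costs Gemma more than six points.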

LIMITATIONS

MC only. The real exam includes a practical case component.
Single exam, single year. 91 questions is reasonable but not definitive.
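Possible contamination. The 2024 exam predates model training cutoffs, so the questions and the official answer key may have appeared in training data. The hash partition separates eval from learning questions but does not guard against this.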
Limited ablation. Four models but no sensitivity analysis across prompts or temperatures.
Specialized vs. neutral prompt is a confound between Opus and Gemini/GPT.
Gemma 4 31B ran locally (Q4-quantized, 17GB) — not representative of its full-precision capability.
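Small subdomain samples. Sections range from 11 to 32 eval questions, so a single error moves a section score by roughly 3 to 9 points.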
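
To give a feel for how much uncertainty 91 questions leave, one option is a Wilson score interval on each model's accuracy. This is our illustration, not an analysis the evaluation itself performed, and it treats questions as independent trials, which is an assumption.

```python
import math


def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Correct answers out of 91, from the results table.
for model, correct in [("GPT-5.4", 90), ("Gemini 3.1 Pro", 89),
                       ("Claude Opus 4.6", 87), ("Gemma 4 31B", 70)]:
    low, high = wilson_interval(correct, 91)
    print(f"{model}: {100 * low:.1f}% to {100 * high:.1f}%")
```

On this sketch the intervals for the three frontier models overlap (roughly 94 to 100% for GPT-5.4 and 89 to 98% for Opus), which is consistent with reading the tier boundaries as suggestive rather than settled.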

FUTURE WORK

Three lines of investigation follow from this evaluation.

Tool-augmented evaluation. This paper reports parametric scores only. A planned follow-up will re-run the same 91 questions with tool access enabled (web search, statutory database retrieval) and measure the delta per model and per question. The error analysis suggests that every frontier error is a knowledge gap that retrieval could resolve — but this remains a hypothesis until tested. The tool-augmented condition will quantify how much retrieval improves scores and whether it eliminates all frontier errors or only a subset.

Deeper failure analysis. The current error taxonomy classifies error types but does not fully explain their causes. Planned analyses include: (1) testing whether errors correlate with legislative recency (e.g., post-reform statutes where pre- and post-reform rules coexist in training data), (2) measuring whether providing the relevant statute text in context fixes the error (distinguishing retrieval-solvable from representation-level gaps), and (3) comparing error patterns across model generations — we already have Gemini 2.5 Pro and 3.1 Pro data on the same instrument, enabling within-family analysis.

Learning and re-evaluation. The hash partition reserves 58 questions in the learning pool — questions the agent may study with correct answers and their statutory sources. We plan to test whether targeted learning on this pool improves performance on the 91-question eval pool. The eval pool remains held out: the agent never sees its correct answers. This creates a pre/post design where learning progress can be measured on a fixed benchmark. Different learning methods (system-prompt injection, few-shot exemplars, retrieval-augmented context) can be compared on the same held-out set.

The multi-consultant campaign continues with the same protocol applied to a Spanish tax consultant (using Agente de Hacienda Publica exams), a real estate advisor (using CACOA Q&A on Andalusian urbanistic law), and a clinical reasoning agent (using MIR exam past papers). Re-evaluation is event-driven (triggered by agent configuration changes), not calendar-driven.

CONCLUSION

Four AI models took the official 2024 Spanish bar exam, all from parametric knowledge.

GPT-5.4 scored 98.9% (90/91) — one error. Gemini 3.1 Pro scored 97.8% (89/91) — two errors. Claude Opus 4.6 scored 95.6% (87/91). Gemma 4 31B running locally scored 76.9% (70/91), with catastrophic failure in labor law (55.6%).

Two frontier tiers:

TOP FRONTIER (~98%): GPT-5.4 and Gemini 3.1 Pro — effectively at ceiling, separated by one question
STRONG FRONTIER (~96%): Opus — excellent but with more procedural-law gaps
LOCAL (~77%): Gemma, numerically just under the ~78% human pass rate (which is a share of candidates passing, not an accuracy score)

The gap between the top two models and Opus (2.2 to 3.3 pp, or two to three questions) is small and not uniform across sections: Opus ties GPT-5.4 on comunes and outscores Gemini 3.1 Pro on laboral. The gap between frontier and local (roughly 19 to 22 pp) is structural. Gemini 3.1 Pro's improvement over 2.5 Pro (+3.3 pp, from 94.5% to 97.8%) points to rapid capability gain in legal knowledge: three errors fixed in one generation.

The shared failure mode across models is not reasoning but disambiguation: distinguishing near-identical legal categories (especial vs. previo pronunciamiento, art. 295.2 vs. 295.3, leve vs. menos grave). These are statutory facts, not logical inferences. Every frontier model error in this evaluation is a knowledge gap, not a reasoning failure.

The calibration gap between self-authored evaluation (100%) and the external benchmark (95.6%) was 4.4 pp, within our 10 pp threshold. Four models, two frontier tiers: convergent validation that the capability is real.

REFERENCES

Katz, D.M., Bommarito, M.J., Gao, S., Arredondo, P. "GPT-4 Passes the Bar Exam." Phil. Trans. R. Soc. A, 382(2270), 2024.
Martinez, E. "Re-evaluating GPT-4's Bar Exam Performance." Artificial Intelligence and Law, 33(3), Springer, 2024.
Guha, N. et al. "LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models." NeurIPS 2023.
Feger, M. et al. "LEXam: Benchmarking Legal Reasoning on 340 Law Exams." ICLR 2026.
Harvey AI. "Introducing BigLaw Bench." 2025.
Dong, Y. et al. "Benchmark Data Contamination of Large Language Models: A Survey." arXiv:2406.04244, 2024.
Li, Y. et al. "A Survey on Data Contamination for Large Language Models." arXiv:2502.14425, 2025.
Zheng, L. et al. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023.
Vals AI. MedQA Benchmark. 2025.
Ministerio de Justicia. "Plantilla Definitiva de Respuestas — Examen de Acceso a la Abogacia 2024."

APPENDIX A: COMPLETE ERROR CATALOG WITH ROOT CAUSE ANALYSIS

This appendix documents every frontier-model error (GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro), the exam question that produced it, and our diagnosis of why the model failed. Six questions produced errors across the three frontier models.

A.1. COMUNES Q17 — Especial vs. previo pronunciamiento (2 models wrong)

Question: "La resolucion de una cuestion incidental de especial pronunciamiento planteada en el proceso se realizara:" (a) Con la debida separacion, en la sentencia definitiva. [CORRECT] (b) Mediante providencia. (c) Mediante auto. (d) Mediante diligencia de ordenacion.

Model	Answer	Correct?
GPT-5.4	c	WRONG
Opus	a	correct
Gemini 3.1 Pro	a	correct
Gemma	c	WRONG

Root cause: Art. 393 LEC distinguishes two types of cuestiones incidentales. De ESPECIAL pronunciamiento are resolved in the sentencia definitiva ("con la debida separacion"). De PREVIO pronunciamiento are resolved by auto within 10 days. The question asks about especial, but the stronger training-data association is "cuestiones incidentales -> auto" (the previo case, which is more commonly discussed in legal education). GPT and Gemma pattern-matched to the dominant association without parsing the qualifier.

GPT's justification: "Las cuestiones incidentales de especial pronunciamiento se resuelven mediante auto." — confidently wrong, citing the correct procedural category but applying the wrong resolution type.

Generational shift: Gemini 2.5 Pro failed this question (chose c); Gemini 3.1 Pro answers it correctly. Both Opus and Gemini 3.1 hold the especial/previo distinction in parametric memory. This suggests the latest frontier models have internalized this fine statutory detail.

Significance: This question best separates the frontier models. GPT and Gemma fail it; Opus and Gemini 3.1 Pro get it right.

A.2. COMUNES Q4 — Non-practicing membership requirements (Opus wrong)

Question: Raul wants to join the Madrid Bar as non-practicing. What requirements apply? (a) Must lack criminal record for offenses carrying serious penalties. [CORRECT] (d) No additional requirements beyond the professional title. [Opus's answer]

Model	Answer	Correct?
GPT-5.4	a	correct
Gemini 3.1 Pro	a	correct
Opus	d	WRONG
Gemma	d	WRONG

Root cause: Opus assumed that non-practicing status has minimal requirements — a reasonable default but wrong. The Estatuto General de la Abogacia requires absence of serious criminal convictions even for non-practicing membership. This is a specific regulatory detail, not a general legal principle.

A.3. CIVIL-MERCANTIL Q14 — Oral sentences in civil proceedings (Opus wrong)

Question: A judge wants to issue an oral sentence in ordinary proceedings. What deadline applies? (d) Oral sentences are NEVER permitted in civil proceedings. [CORRECT] (c) Twenty days from the hearing. [Opus's answer]

Model	Answer	Correct?
GPT-5.4	d	correct
Gemini 3.1 Pro	d	correct
Opus	c	WRONG
Gemma	b	WRONG

Root cause: Opus applied the general sentencing deadline (20 days) where a specific rule governs. The LEC explicitly prohibits oral sentences in all civil proceedings. Opus knew the 20-day rule but failed to apply the categorical prohibition, so the general rule displaced the specific one in its answer.

A.4. PENAL Q1 — Favor rei and offense classification (Opus wrong)

Question: When a penalty can be classified as either "menos grave" or "leve" by its range, how is the offense classified? (a) Always as delito leve. [CORRECT] (b) Always as delito menos grave. [Opus's answer]

Model	Answer	Correct?
GPT-5.4	a	correct
Gemini 3.1 Pro	a	correct
Opus	b	WRONG
Gemma	c	WRONG

Root cause: Art. 13.4 CP establishes the favor rei principle — ambiguous severity defaults to the lesser classification (leve). Opus cited art. 13.4 correctly but inverted the rule, saying it defaults to "menos grave." This is a rule-inversion error: the model knows the article, knows the context, but stores the rule backwards. The 2015 CP reform changed this rule, and the model may have trained on both pre- and post-reform texts.

A.5. LABORAL Q12 — Proof admissibility challenge (Gemini wrong)

Question: How to challenge potentially rights-violating evidence during the hearing? (d) Raise it at proposition; judge decides on the spot; only oral reposicion available. [CORRECT] (b) Same, but no recourse at all. [Gemini's answer]

Model	Answer	Correct?
GPT-5.4	d	correct
Opus	d	correct
Gemini 3.1 Pro	b	WRONG
Gemma	b	WRONG

Root cause: The distinction between (b) and (d) is whether recurso de reposicion oral is available. Art. 90.2 LRJS grants it. Gemini and Gemma both selected the option that denies any recourse — they knew the mechanism (challenge at proposition, judge decides in the act) but missed the specific remedy. This is a partial-knowledge error: the model has the main procedure correct but omits the final detail.

Persistent error: Gemini 2.5 Pro made the same mistake on this question. The upgrade to 3.1 Pro did not fix this particular knowledge gap.

A.6. LABORAL Q21 — Benefits during appeal (Opus + Gemini wrong)

Question: Ricardo received disability benefits during appeal. The TSJ revokes the decision. What happens? (c) Keeps amounts received AND has right to remaining unpaid benefits accrued during appeal. [CORRECT] (b) Keeps amounts received but no right to further payment. [Opus's and Gemini's answer]

Model	Answer	Correct?
GPT-5.4	c	correct
Opus	b	WRONG
Gemini 3.1 Pro	b	WRONG
Gemma	b	WRONG

Root cause: Art. 295 LRJS grants two protections: subsection 2 — no obligation to return amounts already received, and subsection 3 — right to remaining unpaid benefits accrued through the date of firmeza. Three of four models knew subsection 2 but missed subsection 3. This is the most common error pattern in the evaluation: partial statute recall that captures the main rule but omits the extension.

This is the "hardest question" of this evaluation: three of the four models fail, all with the same wrong answer. Only GPT-5.4 held both subsections.

A.7. ERROR TAXONOMY SUMMARY

Error type	Count	Examples
Partial knowledge (incomplete statute)	4	Q21 (Opus, Gemini), Q12 (Gemini), Q4 (Opus)
Statutory disambiguation (near-identical categories)	2	Q17 (GPT), Q1 (Opus)
General-over-specific (default rule displacing exception)	1	Q14 (Opus)

Zero reasoning failures. Every frontier-model error traces to a specific statutory fact stored incorrectly, incompletely, or not at all. The models reason correctly from wrong premises.

The dominant error pattern is partial knowledge: the model knows the article, cites it correctly, applies the main rule accurately, but misses a subsection or exception that changes the answer. This is more subtle than pure hallucination and harder to detect — the reasoning chain looks sound until you check the statute.
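
For readers who want to re-derive the taxonomy counts, the sketch below encodes the seven frontier errors as records and tallies them by type and by model. The records are transcribed from the taxonomy table; the data structure itself is illustrative.

```python
from collections import Counter

# (question, model, error type) for every frontier-model error in A.7.
frontier_errors = [
    ("laboral Q21", "Opus", "partial knowledge"),
    ("laboral Q21", "Gemini 3.1 Pro", "partial knowledge"),
    ("laboral Q12", "Gemini 3.1 Pro", "partial knowledge"),
    ("comunes Q4", "Opus", "partial knowledge"),
    ("comunes Q17", "GPT-5.4", "statutory disambiguation"),
    ("penal Q1", "Opus", "statutory disambiguation"),
    ("civil-mercantil Q14", "Opus", "general-over-specific"),
]

errors_by_type = Counter(kind for _, _, kind in frontier_errors)
errors_by_model = Counter(model for _, model, _ in frontier_errors)
distinct_questions = {question for question, _, _ in frontier_errors}

print(errors_by_type.most_common())
print(errors_by_model.most_common())
print(len(distinct_questions), "distinct questions")
```

The tallies agree with the rest of the report: 4 + 2 + 1 errors by type, 4 / 2 / 1 errors for Opus, Gemini 3.1 Pro, and GPT-5.4, spread over six distinct questions.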