
What Psychology Benchmarks Show About AI's Jagged Clinical Profile

• By John Britton

Psychology benchmarks show a lopsided picture of AI capability. The same model can look excellent on clinical knowledge, theory-of-mind items, or structured psychiatry questions, then look much weaker on crisis judgment, multi-turn emotional support, high-ambiguity moral reasoning, or therapy-process fidelity.

The benchmark literature gives a snapshot of that uneven profile, mostly from models that are now behind the current frontier. Dell'Acqua, Mollick, and colleagues called this kind of uneven capability the "jagged technological frontier." Averaging a model's performance into a single "psychology score" hides those clinical weak spots.

I pulled together benchmark results across clinical psychology, psychiatry, counseling, theory of mind, emotion, moral reasoning, crisis response, and adjacent psychology exams. The result is a working review table that can be searched, grouped, plotted, and revised as newer papers appear.

There are two important caveats to keep in mind when interpreting these charts.

First, the dataset is a scoping review dataset, not a clean randomized benchmark tournament. Different papers used different prompts, judges, tasks, scoring scales, and model versions.

Second, psychology benchmark papers arrive later than model announcements. Labs publish general benchmark numbers quickly, including psychology slices of MMLU-Pro and similar exam benchmarks. Clinical psychology papers need datasets, judges, protocols, analysis, and review. Most of the models in this dataset were the frontier six to eighteen months ago. The current frontier (GPT-5.5-class systems, Claude Opus 4.7, Gemini 3.1 Pro) is not directly tested across these psychology families yet.

As a brief methodological note, I built the dataset with Codex in a local spreadsheet-and-repo workflow. I fed in tables from papers, screenshots, PDFs, arXiv pages, and the Artificial Analysis API, then used the local files to normalize model names, group benchmarks, and generate the charts. The dataset is a working review table rather than an official leaderboard.
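Concretely, the name-normalization step is just an alias lookup over the messy strings that papers use. Here is a minimal sketch in Python, assuming a hand-maintained alias map; the entries and the function name are illustrative, not the review's actual pipeline code.

```python
# Minimal sketch of model-name normalization. The alias map below is an
# illustrative assumption, not the review's actual lookup table.
ALIASES = {
    "gpt-4o-2024-05-13": "GPT-4o",
    "gpt-4o": "GPT-4o",
    "gemini-2.5-pro-preview": "Gemini 2.5 Pro",
    "claude-sonnet-4.5": "Claude Sonnet 4.5",
}

def normalize_model_name(raw: str) -> str:
    """Map a paper's model string onto one canonical display name."""
    key = raw.strip().lower().replace(" ", "-")
    return ALIASES.get(key, raw.strip())

print(normalize_model_name("GPT-4o 2024-05-13"))  # -> "GPT-4o"
```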

The Benchmark Map

Each benchmark is assigned to a capability family for analysis. Some benchmarks span more than one skill, so the family label reflects the dominant skill I used for grouping.

| Benchmark | Capability Family | What It Mostly Tests |
| --- | --- | --- |
| CBT-Bench | Therapy skill and response quality | CBT knowledge, CBT classification, and CBT-style response quality |
| CounselingBench | Therapy skill and response quality | Counseling competence and licensure-style counseling judgment |
| MindEval | Therapy skill and response quality | Multi-turn mental-health support quality (scored on a 1-6 rubric) |
| MentalChat16K | Therapy skill and response quality | Mental-health chat response quality and information selection |
| MENTAT | Structured clinical reasoning | Real-world psychiatry questions and clinical handling |
| PsychiatryBench | Structured clinical reasoning | Multi-task psychiatry performance |
| PsychBench | Structured clinical reasoning | Adult psychiatry composite performance |
| MentalBench | Differential diagnosis | DSM-style differential diagnosis and diagnostic discrimination |
| SIRI-2 | Crisis safety | Suicide intervention response judgment |
| SIRI-2 consistency | Crisis safety | Whether the model's reasoning stays stable across repeated crisis judgments |
| HEART | Multi-turn support | Emotional-support dialogue over multi-turn interactions |
| OMind-Chat | Therapy skill and response quality | Mental-health dialogue quality at turn and conversation level |
| EmoBench | Emotion and affect | Recognizing and applying emotional information |
| EmoBench-Reddit | Emotion and affect | Hierarchical emotion understanding from social media |
| EmoBench-M | Emotion and affect | Foundational emotion recognition across modalities |
| MoralChoice | Moral reasoning | Low-ambiguity and high-ambiguity moral choice |
| MoralBench | Moral reasoning | Moral foundations and moral judgment patterns |
| Theory of Mind / MoToMQA | Theory of mind | False belief, recursive mental states, and transfer |
| MMLU-Pro Psychology / PsychoBench | Knowledge and exam recall | Psychology exam-style knowledge |
| PATIENT-Ψ / PATIENT-Ψ-TRAINER | Therapy skill and response quality | CBT simulated-patient fidelity and formulation-practice feedback |
| MusPsy | Therapy skill and response quality | Multi-session counseling data quality |

These family labels are rough groupings, not clean separations. A therapy benchmark inevitably contains some reasoning, a crisis benchmark involves empathy, and a theory-of-mind task is not the same thing as doing therapy. The categories are useful for spotting patterns, but the boundaries are fuzzy.

Heatmap of psychology benchmark categories by model, ordered roughly by model size.
Category-by-model heatmap from the benchmark pool. Cells use direction-corrected headline scores on a 0-100 scale; scores based on a single benchmark are marked with °. A score of 100 does not mean perfect performance; it means the model hit the top of that specific benchmark's scoring rubric (for example, achieving the reported human baseline on CBT-Bench). Models are ordered by disclosed size when available. Blank cells mean that family has no benchmark result for that model.
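To make the cell values concrete, here is a minimal sketch of direction-corrected anchor scaling; the function name and the anchors in the example are illustrative assumptions, not the code behind the chart.

```python
def heatmap_cell(raw: float, lo: float, hi: float,
                 higher_is_better: bool = True) -> float:
    """Scale a headline score onto 0-100 between benchmark anchors.

    lo and hi are the benchmark's floor and ceiling (for example, a
    chance baseline and a reported human baseline). Hitting hi yields
    100, meaning "top of this benchmark's rubric," not perfection.
    """
    scaled = (raw - lo) / (hi - lo) * 100
    if not higher_is_better:    # direction correction for benchmarks
        scaled = 100 - scaled   # where lower raw scores are better
    return max(0.0, min(100.0, scaled))

# Illustrative anchors: a proportion-scaled benchmark anchored at 0 and 1.
print(heatmap_cell(0.74, lo=0.0, hi=1.0))  # -> 74.0
```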

When we look across the heatmap, the coverage is visibly uneven. Many models were tested on only a few families, and the high-coverage models move unevenly across clinical capabilities.

The Models and Their Scores

The table below shows models with at least four capability families, sorted by their Artificial Analysis (AA) intelligence index — a general-purpose ability score maintained independently of this review. Models near the top of the AA ranking were stronger general systems at the time of their release. The question is whether that general ability carries over to psychology.

The overall psych score is family-balanced. I average scores within each capability family, then average the families. For benchmarks with clear anchors, the score uses those anchors. CBT-Bench is scaled against the reported human baseline, HEART is scaled between Original Dataset Completions and Best Human, and MindEval scores on the 1-6 rubric are converted to percent of scale. The jaggedness score is the gap between the model's highest and lowest family score.
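As a concrete sketch of that aggregation (the data layout here is an assumption, and the review's actual code may differ):

```python
from collections import defaultdict

def psych_profile(scores: dict[tuple[str, str], float]) -> tuple[float, float]:
    """Family-balanced psych score and jaggedness.

    `scores` maps (family, benchmark) keys to 0-100 normalized scores.
    Benchmarks are averaged within each family first, then the family
    means are averaged, so heavily covered families don't dominate.
    Jaggedness is the gap between the best and worst family mean.
    """
    by_family = defaultdict(list)
    for (family, _benchmark), score in scores.items():
        by_family[family].append(score)
    family_means = [sum(v) / len(v) for v in by_family.values()]
    psych = sum(family_means) / len(family_means)
    jaggedness = max(family_means) - min(family_means)
    return psych, jaggedness

# MindEval's 1-6 rubric as percent of scale: (3.83 - 1) / (6 - 1) * 100 ≈ 56.6
```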

| Model | Release | AA Score | Benchmarks | Families | Psych Score | Jaggedness |
| --- | --- | --- | --- | --- | --- | --- |
| Claude Sonnet 4.5 | Sep 2025 | 37.1 | 4 | 4 | 74.7 | 46.4 |
| Gemini 2.5 Pro | Jun 2025 | 34.6 | 8 | 7 | 72.4 | 30.7 |
| Gemini 2.5 Flash | Apr 2025 | 17.8 | 5 | 4 | 60.3 | 44.3 |
| Gemini 2.0 Flash | Dec 2024 | 16.8 | 6 | 4 | 72.6 | 10.7 |
| GPT-4o | May 2024 | 14.5 | 11 | 7 | 74.1 | 36.8 |
| GPT-4 | Mar 2023 | 12.8 | 12 | 4 | 73.5 | 28.5 |
| Llama 3.1 8B | Jul 2024 | 11.8 | 7 | 4 | 56.3 | 31.4 |
| GPT-3.5 Turbo | Mar 2023 | 9.0 | 10 | 6 | 57.7 | 65.0 |
| Mistral 7B | Sep 2023 | 7.4 | 11 | 5 | 53.1 | 61.0 |

When we look at the highest-scoring models, the first striking pattern is how tightly compressed they are at the top. Claude Sonnet 4.5, Gemini 2.5 Pro, GPT-4o, GPT-4, and Gemini 2.0 Flash all land between 72 and 75 on the psych score. General intelligence clearly helps at the low end, where the weaker models score 15-20 points lower, but among the stronger systems, more general ability does not cleanly predict better clinical psychology performance. The correlation between AA score and psych score is r = .48 across 13 matched models: positive but moderate.

The second major takeaway is that higher general intelligence does not smooth out a model's clinical profile. The correlation between AA score and jaggedness is essentially zero (r = -.06). Claude Sonnet 4.5 has the highest AA score in this set and the second-highest jaggedness. Gemini 2.0 Flash, a smaller and less general model, has the smoothest profile. Smarter models are not more balanced. They just have different weak spots.

Two Case Studies

We can see what this jaggedness looks like in practice by zooming in on two specific models. Raw scores use each benchmark's native scale: values near 0-1 are proportions, values above 1 are percent or absolute scores, and negative values mean below-reference performance.
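That reading rule fits in a tiny helper; this is a display heuristic of my own for skimming the tables, not part of the review pipeline.

```python
def read_raw(raw: float) -> float:
    """Reading aid for native-scale raw scores in the tables below.

    Values in [0, 1] are proportions and read as percentages; values
    above 1 are already percents or absolute scores; negative values
    mean below-reference performance and pass through unchanged.
    """
    if 0.0 <= raw <= 1.0:
        return raw * 100
    return raw

print(read_raw(0.79))    # MENTAT proportion  -> 79.0
print(read_raw(94.36))   # PsychoBench score  -> 94.36
print(read_raw(-0.32))   # below reference    -> -0.32
```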

GPT-4o (The best-covered model)

GPT-4o has more benchmark data in this review than any other model, with 11 benchmarks across 7 capability families. That breadth exposes the uneven profile clearly.

| Benchmark | Family | Raw Score |
| --- | --- | --- |
| PsychoBench | Knowledge recall | 94.36 |
| CBT-Bench | Therapy protocol fidelity | 94.1 |
| MENTAT | Structured clinical reasoning | 0.79 |
| EmoBench-Reddit | Emotion and affect | 0.74 |
| CounselingBench | Therapy protocol fidelity | 62.5 |
| MentalBench | Differential diagnosis | 52.43 |
| HEART | Multi-turn consistency | 48.5 |
| SIRI-2 | Crisis safety | 45.55 |
| SIRI-2 consistency | Crisis safety | 0.831 |
| CBT-Bench Level III | Therapy protocol fidelity | -0.32 |

GPT-4o scores 94 on psychology exam recall and 94 on CBT knowledge questions. It handles structured psychiatry questions at 79% and hierarchical emotion understanding at 74%. These are genuinely strong results that highlight GPT-4o's capabilities as a general model.

But the profile drops when the task moves beyond knowledge into judgment. Differential diagnosis falls to 52, crisis safety to 46, and multi-turn emotional support to 49. The biggest drop is CBT-Bench Level III, which asks the model to actually generate therapy responses that match reference quality; there it scores below the reference baseline. The MENTAT paper noted that even strong models sometimes produced responses that were "clinically reasonable" in structure but missed the patient-specific nuance that a careful clinician would catch. On SIRI-2, the model often selected safe-sounding but generic responses to crisis scenarios rather than engaging with the specific risk factors presented.

The overall picture for this model remains quite positive. GPT-4o is one of the strongest models in this dataset, with a psych score of 74.1 and a jaggedness of 36.8 that reflects real variance across clinical families rather than catastrophic failure. While it knows clinical psychology well, it struggles more when it has to do clinical psychology by sustaining appropriate responses across turns, making safe crisis calls, and generating therapy-quality output rather than recalling the right answer. That gap between knowing and doing is the jagged frontier in one model.

Gemini 2.5 Pro (Widest family coverage)

Gemini 2.5 Pro ties GPT-4o for the widest family coverage in this review, at seven families. It was tested on newer benchmarks that other models in the dataset missed, including MindEval and EmoBench-Reddit.

| Benchmark | Family | Raw Score |
| --- | --- | --- |
| MMLU-Pro Psychology | Knowledge recall | 87.3 |
| PsychiatryBench | Structured clinical reasoning | 80.2 |
| EmoBench-Reddit | Emotion and affect | 0.81 |
| SIRI-2 consistency | Crisis safety | 0.891 |
| MentalBench | Differential diagnosis | 59.65 |
| MindEval | Therapy quality (1-6 scale) | 3.83 |
| SIRI-2 | Crisis safety | 40.37 |
| HEART | Multi-turn consistency | 39.5 |

Gemini 2.5 Pro scores well on knowledge and reasoning benchmarks, and it handles emotion understanding and crisis-reasoning consistency above 80%. But its MindEval therapy quality score is 3.83 out of 6, which lands in the "acceptable with notable limitations" range on MindEval's rubric, where scores of 5-6 are considered genuinely exceptional and even skilled human therapists rarely hit every anchor. Its HEART multi-turn support score (39.5) and SIRI-2 crisis judgment score (40.37) are both well below its knowledge-family scores.

The fact that a model can handle 87% of psychology exam questions but cannot yet sustain clinical-quality therapy dialogue is exactly what this jagged profile looks like in practice.

What Predicts the Psych Score

General intelligence helps, but it is not the strongest signal in this data.

Among the five disclosed-size open models in the review, log parameter size correlated with psych score at r = .77. That is directional but not statistically significant with only five data points. Among all 13 models matched to AA scores, general intelligence correlated with psych score at r = .48 (p = .09). Release date showed only a weak relationship (r = .30).
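These are plain Pearson correlations over small samples, so the exact values shift if a single model is added or dropped. A minimal sketch for reproducing them from the table (the variable names are placeholders):

```python
import math

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# e.g., pearson_r(aa_scores, psych_scores) over the 13 matched models, or
# pearson_r([math.log(p) for p in param_counts], psych_scores) over the
# five disclosed-size open models.
```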

The practical read is that bigger models tend to score higher, and more generally capable models tend to score higher, but neither factor cleanly separates the top group. The five models between 72 and 75 on the psych score span three years of releases (GPT-4 in March 2023 through Claude Sonnet 4.5 in September 2025) and a wide range of AA scores (12.8 to 37.1). Within that top cluster, which families a model was tested on, and which benchmarks existed when it was evaluated, matters as much as raw capability.

Observed psych benchmark score, Artificial Analysis score, jaggedness, and benchmark coverage by model.
Observed psych benchmark score (average across capability families) vs matched Artificial Analysis intelligence index. Jaggedness refers to the gap between a model's highest and lowest family score.

The Benchmark Window

This dataset captures AI capability during a specific, slightly older window in time. Most of the models tested here were near the frontier six to eighteen months ago. The current frontier (systems like GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro) has not been tested across these same psychology families yet.

That timing matters because psychology benchmark papers move slower than general-capability announcements. A lab can release MMLU-Pro numbers on launch day. A clinical psychology benchmark paper needs a custom dataset, domain-expert judges, scoring rubrics, analysis, and peer review. By the time the paper appears, the model it tested may already be two generations behind.

The heatmap shows the same pattern among disclosed-size models. Larger open models were generally better, especially on reasoning-heavy tasks. Size alone failed to flatten the profile, since large open models still had weak families and some small models did fine on narrow or scaffolded tasks.

For developers building local and edge mental-health tools, this pattern cuts both ways. Smaller models may work well enough for constrained screening support, psychoeducation drafts, summarization, or simple routing. The same evidence argues against using them for unsupervised crisis judgment, diagnosis, or ambiguous clinical decision-making.

Where Agentic Ability Could Change the Profile

Current frontier progress is partly about agentic ability. A more agentic model can keep a goal active, use tools, recover from errors, track state, and complete multi-step tasks. Those are not therapy skills by themselves, but they overlap with several weak families in the psychology benchmark pool.

General agentic benchmarks support a narrow transfer hypothesis. As models get better at multi-step planning, tool use, and long-horizon context tracking, those gains should help multi-turn support, information synthesis, crisis workflow adherence, structured care planning, and case-formulation maintenance. Those tasks require state tracking and policy-following over time.

Agentic gains may do less for moral ambiguity, emotional resonance, alliance, rupture repair, and clinical responsibility. A model can execute a plan more persistently and still execute the wrong plan. A tool-using clinical system can retrieve the wrong thing, overfit to a stale formulation, optimize engagement instead of health, or sound organized while making a poor clinical move. While execution improves one part of the model's profile, clinical validity still has to be measured directly.

What This Means

The benchmark literature caught AI psychology capability during a transition. The models tested here (mostly from the 2023-2025 frontier window) were already strong on knowledge and reasoning tasks. They were weaker and less stable on ambiguity, crisis judgment, multi-turn emotional support, and therapy-process tasks. Since then, frontier model capability has moved upward quickly, especially for reasoning and agentic execution.

When we step back from the individual benchmarks, three broad patterns emerge from the data.

First, the top of the psych-score ranking is surprisingly compressed, with five models from different labs clustering between 72 and 75. While general ability helped them reach that tier, it did not separate them from each other or make their profiles balanced. Instead, the remaining jaggedness comes from which clinical skills each model was exposed to, how benchmarks measured those skills, and which failure modes each architecture is prone to.

Second, some families may lag even as models get stronger. High-ambiguity moral reasoning, sustained emotional support, alliance, rupture repair, and real clinical responsibility may not improve just because models get better at coding, math, or terminal tasks.

Third, the existing benchmark suite may stop discriminating. If current frontier models saturate knowledge recall and many structured tasks, the field needs harder multi-turn, adversarial, longitudinal, and clinically grounded benchmarks.

It is clear that many current LLMs can easily answer standard psychology questions. The next question is whether frontier agentic systems can maintain a clinically appropriate stance across time, risk, ambiguity, and changing context.

What Needs Testing Next

Frontier models are probably approaching saturation on many of the psychology benchmark families in this dataset. Knowledge recall, structured clinical reasoning, theory-of-mind items, and some therapy-protocol tasks may soon stop separating models well. Several weaker families may also catch up as reasoning, tool use, and multi-turn state tracking improve.

Because models are improving so quickly, we will need to re-test for jaggedness rather than assume it remains. The next benchmark wave should make the tasks harder. It should test longitudinal conversations, adversarial turns, changing risk states, rupture repair, high-ambiguity moral judgment, protocol fidelity under stress, and clinical decisions where the model has to maintain a stance over time.

The older benchmark pool shows where the profile was uneven. The next question is whether current agentic frontier models have actually closed those gaps, or whether the gaps have moved to harder versions of the same clinical skills.



The views expressed here are my own and do not necessarily reflect the views of any current or future employer, training site, academic institution, or affiliated organization.