Can LLMs Understand Humanitarian Data? A 2026 Stress Test

What this stress test is and is not

This is a structured walkthrough of how leading large language models handle questions a humanitarian researcher would actually ask. The dataset universe is the same one our editorial team uses every week: UNHCR Refugee Data Finder, OCHA Humanitarian Data Exchange (HDX), IDMC GRID and GIDD, and ACLED. The model universe is the current generation of widely available frontier models, accessed through their public APIs in mid-2026 and used in two modes: bare-prompt (no documents attached) and retrieval-augmented (the relevant primary source attached as context).

We are not publishing a leaderboard. Model versions ship monthly and any ranking would be stale within a quarter. What we are publishing is the pattern of behaviour, which has been stable across the last three model generations and is the part useful for deciding when to trust an LLM with a humanitarian-data question.

The task categories

Five categories cover most analytic asks.

1. Fact recall. "How many Syrian refugees were registered in Türkiye at the end of 2024?"
2. Definitional precision. "What is the difference between an IDP, a refugee, and a stateless person under UNHCR's mandate?"
3. Aggregation and arithmetic. "Sum the IDMC GRID 2024 new-displacement figures for the Horn of Africa."
4. Causal and contextual reasoning. "Why did UNHCR's Venezuelan refugee figures fall in 2023 even though outflows continued?"
5. Source provenance. "Which primary source publishes monthly IDP flow estimates for Sudan, and how often is it updated?"

What the models actually do well

Models are reliably useful on three kinds of task, even without retrieval.

Definitional precision. Frontier LLMs explain refugee law definitions, the distinction between protracted refugee situations and new-displacement events, and the structure of UNHCR's population statistics with high accuracy. The training corpus includes the 1951 Convention, the Cartagena Declaration, IDMC methodology notes, and decades of academic literature. Outputs match primary sources almost verbatim.
Source navigation. Asked which dataset to consult for a given question, models point researchers to the right portal (HDX for sector datasets, ACLED for events, IDMC for stock-and-flow estimates, UNHCR Refugee Data Finder for refugee statistics) with rare errors.
Structured rewriting. Given a paragraph of raw situation-report text, models reliably extract entities, dates, and indicator values into a structured schema. Quality is high enough to use as a first pass before human review.
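
The "first pass before human review" step can be made stricter by validating each extracted record before a reviewer sees it. The following is a minimal sketch, not our production schema: the field names, the IndicatorRecord class, and the parse_record helper are all hypothetical, and it assumes the model was asked to return one JSON object per indicator with a verbatim supporting quote.

```python
import json
from dataclasses import dataclass
from datetime import date

# Hypothetical schema: these field names are illustrative assumptions.
REQUIRED_FIELDS = {"location", "indicator", "value", "as_of", "source_quote"}


@dataclass(frozen=True)
class IndicatorRecord:
    location: str
    indicator: str
    value: int
    as_of: date
    source_quote: str  # verbatim span of the report that supports the value


def parse_record(raw_json: str, report_text: str) -> IndicatorRecord:
    """Validate one model-produced record; raise ValueError if it fails."""
    data = json.loads(raw_json)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_FIELDS - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")

    value = data["value"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int) or value < 0:
        raise ValueError("value must be a non-negative integer")

    quote = data["source_quote"]
    # Reject records whose supporting quote is not literally in the report.
    if not isinstance(quote, str) or not quote or quote not in report_text:
        raise ValueError("source_quote does not appear verbatim in the report")

    return IndicatorRecord(
        location=str(data["location"]),
        indicator=str(data["indicator"]),
        value=value,
        as_of=date.fromisoformat(str(data["as_of"])),  # ValueError if not ISO
        source_quote=quote,
    )
```

The verbatim-quote check is deliberately blunt. It will reject some correct records where the model paraphrased the quote, which is the cheaper of the two possible errors when a human reviews the rejects anyway.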

What the models do badly

The gap between fluency and reliability is widest in two categories, and that is where the most damage gets done.

Fact recall on numbers. Asked for a specific country-year population figure, frontier models produce numbers that sound right and are wrong roughly half the time, with the error often within a plausible 10-30 percent band. The smaller and more recent the figure, the higher the error rate. This is the textbook hallucination failure mode: the words are confident, the digits are invented.
Aggregation and arithmetic over current data. Asked to sum or compare values not present in the prompt, models confabulate intermediate steps. Chain-of-thought outputs look like reasoning and are often arithmetic fiction. This is true even of models marketed as having tool use, when the tool calls fail silently.

The pattern is consistent across providers and model sizes. Bigger models hallucinate more fluently, not less often, on this specific class of question.
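
One way to sidestep arithmetic fiction is to keep the sum out of the model entirely and compute it from the downloaded file. The sketch below assumes a CSV export with one row per country; the column names (iso3, new_displacements) and the sum_for_countries helper are hypothetical, and the list of countries that make up a region such as the Horn of Africa is something the researcher has to supply and defend.

```python
import csv
import io


def sum_for_countries(
    csv_text: str,
    countries: list[str],
    value_column: str = "new_displacements",  # assumed column name
    key_column: str = "iso3",                 # assumed column name
) -> int:
    """Sum one column over an explicit list of countries.

    Fails loudly instead of guessing: a missing country, a duplicated
    country row, or a non-integer cell raises an exception.
    """
    rows: dict[str, dict[str, str]] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = row[key_column]
        if key in rows:
            raise ValueError(f"duplicate row for {key}")
        rows[key] = row

    missing = [c for c in countries if c not in rows]
    if missing:
        raise KeyError(f"countries not found in source: {missing}")

    # int() raises ValueError on blanks, footnote markers, or "1,200".
    return sum(int(rows[c][value_column]) for c in countries)
```

Raising on a missing or malformed cell is the point of the design: the failure a model hides behind fluent prose becomes an error a person has to resolve against the source.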

Retrieval changes the picture, but not by as much as you would hope

Attaching the relevant primary source as context (the RAG pattern) shifts model behaviour substantially.

Fact-recall errors fall by an order of magnitude when the source PDF or CSV is in the context window and the model is told to cite line numbers or page numbers.
Arithmetic errors still occur, particularly when the source contains nested tables, footnoted exclusions, or methodology caveats that change which cells should be summed.
Causal reasoning improves modestly. Models grounded in a methodology note will correctly explain Venezuela 2023 (the shift from UNHCR-mandated population counting to host-government registry counting drove the apparent fall) rather than inventing a political narrative.
The remaining failure mode is plausible misattribution: the model cites the document accurately but draws a conclusion the document does not support.
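
The "cite line numbers" instruction is only useful if the citations are checked. A rough sketch of that loop, under stated assumptions: the source is plain text or CSV, the model was shown it with bracketed line numbers, and it returned a line number alongside each figure. The helper names number_lines and citation_supported are hypothetical.

```python
import re


def number_lines(source_text: str) -> str:
    """Prefix each source line with [n] so the model can cite it."""
    return "\n".join(
        f"[{i}] {line}"
        for i, line in enumerate(source_text.splitlines(), start=1)
    )


def citation_supported(source_text: str, cited_line: int, claimed_value: int) -> bool:
    """Return True only if the cited line contains the claimed integer.

    Thousands separators written as commas are ignored, so "1,200"
    matches 1200. An out-of-range line number counts as unsupported.
    """
    lines = source_text.splitlines()
    if not 1 <= cited_line <= len(lines):
        return False
    # A run of digits and commas that starts with a digit.
    candidates = re.findall(r"\d[\d,]*", lines[cited_line - 1])
    return any(int(c.replace(",", "")) == claimed_value for c in candidates)
```

This catches invented digits and wrong line references. It does nothing about plausible misattribution, where the number is on the cited line but the conclusion drawn from it is not in the document; that still needs a reader.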

What this means for humanitarian researchers

The operational implications are concrete.

Never quote an LLM-produced number without retrieving it from the primary source in the same session. The failure mode is not occasional; it is the default.
Use LLMs for navigation, definitions, and structured extraction. These are the high-value, low-risk use cases.
Use RAG for synthesis, but design the retrieval set tightly. Stuffing context windows with twenty PDFs is worse than retrieving three relevant ones; the model will average across contradictions.
Treat chain-of-thought as a presentation device, not a verification step. The arithmetic in the reasoning trace is not audited by the model; it is generated by the same fluent confabulator that produced the answer.
Maintain a human-in-the-loop for anything that will be published. Our own editorial workflow attaches a primary-source citation to every numerical claim in an AI-assisted draft before it ships.
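
The citation rule in the last point lends itself to a pre-publication check. The sketch below is illustrative rather than a description of our tooling: it assumes a made-up inline marker of the form [src: ...] placed before the sentence's closing punctuation, and it flags any sentence that contains a digit but no such marker.

```python
import re

# Hypothetical citation marker, e.g. "[src: IDMC GRID 2024, p. 12]".
CITATION = re.compile(r"\[src:[^\]]+\]")
DIGIT = re.compile(r"\d")


def uncited_numeric_sentences(draft: str) -> list[str]:
    """Return sentences that state a number without a [src: ...] marker.

    Sentence splitting is naive: it breaks after ., ! or ? followed by
    whitespace, so abbreviations such as "approx. 100" split early.
    """
    sentences = re.split(r"(?<=[.!?])\s+", draft.strip())
    flagged = []
    for sentence in sentences:
        # Digits inside the marker itself (page numbers) do not count.
        body = CITATION.sub("", sentence)
        if DIGIT.search(body) and not CITATION.search(sentence):
            flagged.append(sentence)
    return flagged
```

A check like this only confirms that a citation is present. Whether the cited source actually contains the figure is a separate question for the reviewer.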

The honest take

The 2026 generation of LLMs is genuinely useful for humanitarian researchers, on a narrower set of tasks than the marketing suggests. They are good at language and bad at numbers, good at definitions and bad at current events, good at structure and bad at causation. Any workflow that respects those boundaries gets value; any workflow that does not produces confident misinformation at scale.

Sources and further reading

UNHCR Refugee Data Finder: https://www.unhcr.org/refugee-statistics/
OCHA Humanitarian Data Exchange: https://data.humdata.org/
IDMC Global Internal Displacement Database (GIDD): https://www.internal-displacement.org/database/
ACLED: https://acleddata.com/
Stanford HAI 2024 AI Index, hallucination benchmarks: https://aiindex.stanford.edu/
Anthropic and OpenAI model documentation on retrieval-augmented use: https://docs.anthropic.com/ and https://platform.openai.com/docs/
