How We Use AI to Synthesize UNHCR, OCHA, and ACLED Data — Without Losing the Human Layer
Why we are publishing this
The single most common question we get from researchers, journalists, and aid workers is some variant of "how do you actually use AI on this site?" This page is the answer. It describes the pipeline, the controls, the failure modes we have hit, and what stays out of the pipeline on purpose.
The data spine
Every published number on Humanity Centered Data traces back to a primary-source dataset. The spine has four pillars.
- UNHCR Refugee Data Finder and the operational portal for refugee, asylum-seeker, and statelessness statistics.
- OCHA Humanitarian Data Exchange (HDX) for sectoral indicators, response plans, and population-in-need figures.
- ACLED for geocoded conflict events.
- IDMC GRID and GIDD for internal displacement stock and flow.
Supporting sources (IOM DTM, ReliefWeb, JRC, FAO, WHO, World Bank) feed specific thematic pages. The sourceFreshness and sourceAttribution layers in the codebase enforce the rule that every chart and every claim links to its primary source.
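The attribution rule above can be pictured as a simple pre-publication check. This is a minimal sketch, not the site's actual sourceAttribution layer: the `Claim` structure, the `unattributed` helper, and the `unhcr-rdf` source ID are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    """A published statement or chart value (hypothetical structure)."""
    text: str
    source_id: Optional[str] = None   # hypothetical ID; None means unattributed
    source_url: Optional[str] = None  # link to the primary source


def unattributed(claims: list[Claim]) -> list[Claim]:
    """Return every claim that is missing a source ID or a source URL."""
    return [c for c in claims if not c.source_id or not c.source_url]


claims = [
    Claim("Refugee population, end of year", "unhcr-rdf",
          "https://www.unhcr.org/refugee-statistics/"),
    Claim("Conflict events recorded this quarter"),  # no source attached
]

missing = unattributed(claims)
if missing:
    print(f"{len(missing)} claim(s) lack a primary-source link")
```

In a setup like this, a non-empty `missing` list would stop the page from moving forward until each claim is linked or removed.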
What the AI pipeline does, and does not, do
The pipeline does five things.
1. Document ingestion. PDFs, CSVs, and API responses from the spine sources are converted to text and structured tables, with provenance metadata attached at the record level.
2. Retrieval. When a country page or thematic page is being assembled, a vector and lexical hybrid retriever pulls the relevant primary-source passages into the context window. The retrieval set is scoped tightly; we do not let the model average across the whole corpus.
3. Drafting. A frontier LLM produces narrative text grounded in the retrieved passages, with inline references to the source IDs.
4. Verification. A second pass checks every numerical claim against the retrieved source and flags any number that does not match. Unmatched numbers are stripped, not paraphrased.
5. Human review. Every page is reviewed by a human editor before publication. Drafts with unresolved verification flags are returned, not published.
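The verification step (step 4 above) can be sketched roughly as follows. This is an illustration under stated assumptions, not the production verifier: the function names, the sentence-level stripping, and the `rel_tol` parameter are hypothetical, and the numbers in the example are made up for the demonstration rather than taken from any dataset.

```python
import math
import re

# Matches integers, comma-grouped thousands, and decimals (e.g. 300, 1,250, 3.5).
# Assumption: every numeric literal, including a year, is treated as a claim.
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def extract_numbers(text: str) -> list[float]:
    """Pull numeric literals out of text, dropping thousands separators."""
    return [float(m.replace(",", "")) for m in NUMBER.findall(text)]


def verify_draft(draft: str, source_values: list[float],
                 rel_tol: float = 0.0) -> tuple[str, list[str]]:
    """Split a draft into sentences and strip any sentence that contains
    a number not found in the retrieved source values.

    Returns the cleaned draft and the list of stripped sentences (flags).
    With rel_tol=0.0 a number must match a source value exactly.
    """
    kept, flagged = [], []
    for sentence in re.split(r"(?<=[.!?])\s+", draft.strip()):
        numbers = extract_numbers(sentence)
        matched = all(
            any(math.isclose(n, v, rel_tol=rel_tol, abs_tol=0.0)
                for v in source_values)
            for n in numbers
        )
        (kept if matched else flagged).append(sentence)
    return " ".join(kept), flagged


# Illustrative values only, not real figures.
draft = ("The source table lists 1,250 households. "
         "A further 300 households were reported.")
clean, flags = verify_draft(draft, source_values=[1250.0])
print(clean)
print(flags)
```

Stripping whole sentences is one possible reading of "stripped, not paraphrased"; the point of the sketch is that an unmatched number is removed and flagged for the editor rather than silently reworded.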
The pipeline does not do five things.
- It does not generate quotes or first-person testimony. Per our content policy, the site never fabricates stories or anecdotes about named individuals.
- It does not produce predictions or forecasts that lack a published primary-source basis.
- It does not write recommendations for specific governments or agencies.
- It does not auto-publish. There is no scheduled job that pushes AI-drafted content to readers without human sign-off.
- It does not retain user prompts as training data for any third-party model.
Where the human layer lives
The phrase "human in the loop" is used loosely in the industry, so it is worth being specific about where the humans actually sit.
- Source selection is human-curated. We choose which datasets enter the spine.
- Retrieval scope is human-curated. Each page template defines which source set its retrievals are allowed to draw from.
- Verification thresholds are human-set. Numerical claims must match the primary source to defined tolerances; mismatches block publication.
- Editorial review is human-performed. Editors check for tone, accuracy, and the more subtle failure mode where a model cites a source accurately but draws a conclusion the source does not support.
- Reader-visible corrections are human-issued. When we find errors after publication, the correction goes on the page with a date.
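The human-curated retrieval scope described above could be enforced with an allowlist per page template. The sketch below is an assumption about how such a rule might look in code; the template names, source IDs, and the `scope_passages` function are hypothetical and are not taken from the site's codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    source_id: str
    text: str


# Hypothetical template names and source IDs, for illustration only.
# In this sketch the mapping is edited by people, never by the model.
TEMPLATE_SCOPES: dict[str, frozenset[str]] = {
    "country_page": frozenset({"unhcr", "ocha-hdx", "acled", "idmc"}),
    "conflict_page": frozenset({"acled"}),
}


def scope_passages(template: str, candidates: list[Passage]) -> list[Passage]:
    """Keep only passages whose source is allowed for this template.

    An unknown template raises KeyError instead of falling back to the
    whole corpus, so a missing scope definition fails loudly.
    """
    allowed = TEMPLATE_SCOPES[template]
    return [p for p in candidates if p.source_id in allowed]


candidates = [
    Passage("acled", "Geocoded conflict events ..."),
    Passage("unhcr", "Refugee and asylum-seeker statistics ..."),
]
in_scope = scope_passages("conflict_page", candidates)
print([p.source_id for p in in_scope])
```

Filtering before retrieval results reach the context window keeps the scoping decision with the people who define the templates.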
Failure modes we have hit
Publishing this honestly means saying what has gone wrong.
- Plausible misattribution. A draft once attributed a 2023 conflict-event total to ACLED that was actually a sum of two non-comparable categories. The verification pass missed it because the digits matched a real cell in the source; the editor caught it. We tightened the verifier to check column semantics, not just values.
- Stale-source confidence. Early drafts sometimes cited UNHCR figures that had been superseded by a newer mid-year release. We now require the freshness layer to reject any figure that a newer release has superseded for the same country-year.
- Cross-source averaging. When two sources disagreed (a common case), early drafts split the difference. We now require the draft to surface the disagreement explicitly and cite both sources, rather than producing a synthetic midpoint.
- Tone drift on protection-sensitive topics. We monitor for it; we re-edit when we catch it.
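Two of the fixes above, rejecting superseded releases and surfacing disagreement instead of averaging, can be illustrated together. This is a minimal sketch under stated assumptions: the `Figure` structure, both function names, the source IDs, and all values are hypothetical placeholders, not real data or the site's actual code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Figure:
    source_id: str
    country: str
    year: int
    value: float
    released: date


def latest_releases(figures: list[Figure]) -> list[Figure]:
    """Keep only the most recent release per (source, country, year)."""
    latest: dict[tuple[str, str, int], Figure] = {}
    for f in figures:
        key = (f.source_id, f.country, f.year)
        if key not in latest or f.released > latest[key].released:
            latest[key] = f
    return list(latest.values())


def describe(figures: list[Figure]) -> str:
    """Report one value when sources agree; otherwise cite each source.

    A midpoint is never computed: disagreement is stated explicitly.
    """
    if not figures:
        raise ValueError("no figures to describe")
    if len({f.value for f in figures}) == 1:
        return f"All sources report {figures[0].value:g}."
    parts = [f"{f.source_id} reports {f.value:g}"
             for f in sorted(figures, key=lambda f: f.source_id)]
    return "Sources disagree: " + "; ".join(parts) + "."


# Placeholder values for illustration only.
figures = [
    Figure("source-a", "XX", 2023, 100.0, date(2023, 6, 1)),
    Figure("source-a", "XX", 2023, 110.0, date(2024, 1, 15)),  # supersedes the first
    Figure("source-b", "XX", 2023, 120.0, date(2023, 12, 1)),
]
current = latest_releases(figures)
summary = describe(current)
print(summary)
```

Here the older source-a release is dropped before drafting, and the remaining two values are both cited with their sources.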
Why this matters for trust
The humanitarian data ecosystem runs on chains of citation that can be broken silently by AI-assisted publishing at scale. Our position is that AI assistance is fine; opacity is not. Every page tells readers what data it draws from. Every chart links to the source. Every number can be checked against its source in one click. None of that is incompatible with using AI to draft; all of it is incompatible with publishing AI drafts unread.
If a reader can trace any claim on the site back to a primary source in under sixty seconds, the pipeline is working. If they cannot, we want to know.
Sources and further reading
- UNHCR Refugee Data Finder: https://www.unhcr.org/refugee-statistics/
- OCHA HDX: https://data.humdata.org/
- ACLED: https://acleddata.com/
- IDMC GIDD: https://www.internal-displacement.org/database/
- IASC Operational Guidance on Data Responsibility: https://interagencystandingcommittee.org/
