Humanity Centered Data | UN Refugee & IDP Tracker

Synthetic Data for Humanitarian Research: When It Helps, When It Misleads

By the Humanity Centered Data Editorial Team | Published June 19, 2026 | 10 min read

Why synthetic data is suddenly everywhere

Humanitarian data is sensitive almost by definition. Sharing it externally — with researchers, donors, civic-tech partners — is constrained by protection obligations even when the analytical value would be high. Synthetic data, generated to match the statistical properties of a real dataset without containing any real records, has emerged as a partial solution. The UN Privacy-Preserving Techniques Handbook and the OECD work on privacy-enhancing technologies document both the promise and the limits.

Where synthetic data is genuinely useful

Three uses are well established. Development and testing: building dashboards, pipelines, and ML models against a realistic but non-sensitive dataset. External research collaboration: sharing a synthetic version of a sensitive dataset for exploratory analysis, with sensitive analyses run by the data custodian on the real data. Education and training: teaching humanitarian data skills without exposing real beneficiaries. In all three cases the synthetic data is a means to access without disclosure, not a substitute for real analysis.

Where it is misleading or unsafe

Two failure patterns recur. Statistical fidelity at the margin: synthesisers tuned to match marginal distributions can miss the joint distributions that drive the most policy-relevant findings, particularly for rare and small subgroups. A synthetic dataset that accurately reproduces the overall age distribution may still distort the distribution of older female-headed households, which are the actual policy target. Privacy leakage: poorly configured synthesisers can memorise and reproduce real records, especially outliers; this has been documented in published audits of commercial tools.
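The first failure pattern can be reproduced with a toy example. The sketch below is not any real synthesiser: it assumes an invented population of 1,000 households and a deliberately naive, hypothetical `independent_marginal_synth` function that shuffles each column separately. Every per-variable distribution is preserved exactly, yet the count of older female-headed households drifts away from the real 90, because under independence roughly 100 × 490 / 1000 = 49 would be expected.

```python
import random
from collections import Counter

# Hypothetical toy population of (age_band, sex_of_household_head) records.
# The numbers are invented for illustration only.
real = (
    [("under_60", "male")] * 500
    + [("under_60", "female")] * 400
    + [("60_plus", "female")] * 90
    + [("60_plus", "male")] * 10
)


def independent_marginal_synth(records, rng):
    """Naive synthesiser: shuffle each column on its own.

    Per-column counts are preserved exactly, but any relationship
    between the columns is destroyed.
    """
    ages = [age for age, _ in records]
    sexes = [sex for _, sex in records]
    rng.shuffle(ages)
    rng.shuffle(sexes)
    return list(zip(ages, sexes))


def marginal(records, column):
    """Counts of each value in one column."""
    return Counter(record[column] for record in records)


synthetic = independent_marginal_synth(real, random.Random(0))

target = ("60_plus", "female")
real_target = Counter(real)[target]
synth_target = Counter(synthetic)[target]

print("age marginals match:", marginal(real, 0) == marginal(synthetic, 0))
print("sex marginals match:", marginal(real, 1) == marginal(synthetic, 1))
print("older female-headed households, real vs synthetic:", real_target, synth_target)
```

Real synthesisers model some joint structure, so the distortion is usually subtler than this, but the lesson carries over: checking marginals alone would have passed this dataset.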

Differential privacy is the rigorous floor

The strongest privacy guarantees in 2026 come from differentially private synthesisers, which provide mathematically bounded leakage with respect to any single individual. The US Census Bureau's adoption of differential privacy for the 2020 census is the most-discussed operational deployment. For humanitarian datasets, DP synthesisers from OpenDP and similar projects are operationally usable, though they require a deliberate choice of privacy budget (usually written ε), and that choice has direct utility consequences: a smaller budget means stronger privacy but noisier, less faithful synthetic data.
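The trade-off between privacy budget and utility can be illustrated with the Laplace mechanism, a basic building block of differential privacy. This is a minimal sketch and not a DP synthesiser: the district counts, the function names, and the two epsilon values are assumptions chosen for illustration, and the plain floating-point sampling used here is not considered safe for production releases, where an audited library such as OpenDP is the better choice.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponentials with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_histogram(counts, epsilon, rng):
    """Noisy counts for disjoint categories via the Laplace mechanism.

    Assumes neighbouring datasets differ by adding or removing one record,
    so each record changes one count by 1 and the sensitivity is 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = 1.0 / epsilon
    return {key: value + laplace_noise(scale, rng) for key, value in counts.items()}


def mean_abs_error(counts, epsilon, trials, rng):
    """Average absolute error per released count over repeated releases."""
    total = 0.0
    for _ in range(trials):
        noisy = dp_histogram(counts, epsilon, rng)
        total += sum(abs(noisy[k] - counts[k]) for k in counts)
    return total / (trials * len(counts))


# Hypothetical household counts by district, invented for illustration.
true_counts = {"district_A": 120, "district_B": 45, "district_C": 8}

rng = random.Random(0)
error_strict = mean_abs_error(true_counts, epsilon=0.1, trials=2000, rng=rng)
error_loose = mean_abs_error(true_counts, epsilon=1.0, trials=2000, rng=rng)
print("mean absolute error at epsilon=0.1:", round(error_strict, 2))
print("mean absolute error at epsilon=1.0:", round(error_loose, 2))
```

The expected absolute error of Laplace noise equals its scale, 1/ε. At ε = 0.1 that is about 10 per count, which swamps a district with 8 households; at ε = 1.0 it is about 1. Small subgroups are the first to lose usable signal as the budget tightens.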

How to evaluate a synthetic dataset before relying on it

Five tests are minimum due diligence. Univariate fidelity: do per-variable distributions match? Bivariate and joint fidelity: do key correlations and joint distributions match? Subgroup fidelity: do the subgroups of policy interest reproduce in the synthetic version? Utility-on-task: do models trained on the synthetic data perform similarly when evaluated on real data? Membership-inference risk: can an attacker tell whether a specific individual was in the training set? The SDGym benchmarking suite and the NIST PETs Prize Challenge results document standard evaluation protocols.
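Three of these checks can be sketched in a few lines. The example below is an illustrative assumption, not the SDGym or NIST protocol: it uses total variation distance for univariate fidelity, a simple share comparison for subgroup fidelity, and an exact-match rate as a crude screen for memorised records. The records and function names are hypothetical. The exact-match screen is only meaningful when records carry enough attributes to be near-unique, and it is not a substitute for a proper membership-inference attack or for a utility-on-task evaluation.

```python
from collections import Counter


def total_variation(real_values, synth_values):
    """Total variation distance between two empirical categorical
    distributions: 0 means identical, 1 means no overlap."""
    p, q = Counter(real_values), Counter(synth_values)
    n_p, n_q = len(real_values), len(synth_values)
    return 0.5 * sum(abs(p[k] / n_p - q[k] / n_q) for k in set(p) | set(q))


def subgroup_share(records, predicate):
    """Fraction of records that belong to a subgroup of interest."""
    return sum(1 for r in records if predicate(r)) / len(records)


def exact_match_rate(real_records, synth_records):
    """Fraction of synthetic rows that are verbatim copies of a real row."""
    real_set = set(real_records)
    return sum(1 for r in synth_records if r in real_set) / len(synth_records)


# Hypothetical records: (age_band, sex_of_head, household_size, district).
real = [
    ("60_plus", "female", 7, "district_A"),
    ("under_60", "male", 3, "district_B"),
    ("under_60", "female", 4, "district_A"),
    ("under_60", "male", 5, "district_C"),
]
synth = [
    ("60_plus", "female", 7, "district_A"),  # verbatim copy of a real outlier
    ("under_60", "male", 4, "district_B"),
    ("under_60", "male", 3, "district_C"),
    ("under_60", "female", 4, "district_B"),
]


def is_older_female_headed(record):
    return record[0] == "60_plus" and record[1] == "female"


district_tvd = total_variation([r[3] for r in real], [r[3] for r in synth])
share_real = subgroup_share(real, is_older_female_headed)
share_synth = subgroup_share(synth, is_older_female_headed)
copied = exact_match_rate(real, synth)

print("district TVD:", district_tvd)
print("older female-headed share, real vs synthetic:", share_real, share_synth)
print("exact-match rate:", copied)
```

In this toy case the subgroup share is reproduced only because the single matching household was copied verbatim, which is exactly the situation where fidelity and privacy checks need to be read together.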

Governance, not just technique

Synthetic data does not exempt a humanitarian organisation from data protection obligations. Consent obtained for an original dataset does not automatically extend to derivative synthetic releases. The ICRC data protection handbook treats synthetic data as a privacy-enhancing technology subject to the same purpose-limitation and necessity tests as any other processing. Practical implication: document the synthesis pipeline, the privacy guarantees, the utility evaluation, and the access regime before sharing.
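One lightweight way to enforce that documentation step is to make it a precondition of release. The sketch below is a hypothetical record structure, not a format prescribed by the ICRC handbook or any other source; the class name, field names, and example values are assumptions made for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class SyntheticReleaseRecord:
    """Hypothetical pre-release checklist for a synthetic dataset."""

    synthesis_pipeline: str = ""
    privacy_guarantees: str = ""
    utility_evaluation: str = ""
    access_regime: str = ""

    def missing_items(self):
        """Names of checklist items that are still blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def ready_to_share(self):
        """True only when every item has been documented."""
        return not self.missing_items()


# Illustrative values only; a real record would point to actual documents.
record = SyntheticReleaseRecord(
    synthesis_pipeline="hypothetical: synthesiser version and config stored with the dataset",
    privacy_guarantees="hypothetical: differentially private, budget recorded",
)

print("missing:", record.missing_items())
print("ready to share:", record.ready_to_share())
```

A check like this does not assess whether the documentation is adequate; it only stops a release from going out with an item left blank.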
