Synthetic Data for Humanitarian Research: When It Helps, When It Misleads
Why synthetic data is suddenly everywhere
Humanitarian data is sensitive almost by definition. Sharing it externally (with researchers, donors, or civic-tech partners) is constrained by protection obligations even when the analytical value would be high. Synthetic data, generated to match the statistical properties of a real dataset and intended to contain no real records, has emerged as a partial solution. The UN Privacy-Preserving Techniques Handbook and the OECD work on privacy-enhancing technologies document both the promise and the limits.
Where synthetic data is genuinely useful
Three uses are well established. Development and testing: building dashboards, pipelines, and ML models against a realistic but non-sensitive dataset. External research collaboration: sharing a synthetic version of a sensitive dataset for exploratory analysis, while the data custodian runs the sensitive analyses on the real data. Education and training: teaching humanitarian data skills without exposing real beneficiaries. In all three cases the synthetic data is a means of access without disclosure, not a substitute for analysis of the real data.
Where it is misleading or unsafe
Two failure patterns recur. Marginal-only fidelity: synthesisers tuned to match marginal distributions can miss the joint distributions that drive the most policy-relevant findings, particularly for small subgroups. A synthetic dataset that accurately reproduces the overall age distribution may still distort the distribution of older female-headed households, who are the actual policy target. Privacy leakage: poorly configured synthesisers can memorise and reproduce real records, especially outliers; this has been documented in published audits of commercial tools.
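The first failure pattern can be reproduced in a few lines. The sketch below is an illustration only: the toy population, the category labels, and the deliberately naive `independent_marginal_synth` function are all made up for this example and do not represent any real synthesiser or dataset. It samples each column independently, which keeps the marginals but discards the link between age and head of household.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical toy population of (age_band, head_of_household) records.
# Older households are mostly female-headed in this made-up data.
real = (
    [("under_60", "male")] * 600
    + [("under_60", "female")] * 300
    + [("60_plus", "male")] * 20
    + [("60_plus", "female")] * 80
)

def independent_marginal_synth(records, n, rng):
    """Naive synthesiser: draw each column independently from its own marginal."""
    ages = [r[0] for r in records]
    heads = [r[1] for r in records]
    return [(rng.choice(ages), rng.choice(heads)) for _ in range(n)]

def share(records, predicate):
    """Fraction of records that satisfy the predicate."""
    return sum(1 for r in records if predicate(r)) / len(records)

def is_older(record):
    return record[0] == "60_plus"

def is_target(record):
    # Policy target in this example: older female-headed households.
    return record == ("60_plus", "female")

synth = independent_marginal_synth(real, 100_000, rng)

print(f"older share:  real={share(real, is_older):.3f} synth={share(synth, is_older):.3f}")
print(f"target share: real={share(real, is_target):.3f} synth={share(synth, is_target):.3f}")
```

In this toy population the target group is 8% of the real records. Independent sampling preserves the 10% share of older households but yields a target share near 0.10 × 0.38 ≈ 3.8%, so the subgroup of interest is under-represented by about half even though every single-variable check would pass.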
Differential privacy is the rigorous floor
The strongest privacy guarantees in 2026 come from differentially private synthesisers, which provide mathematically bounded leakage with respect to any single individual. The US Census Bureau's adoption of differential privacy for the 2020 census is the most-discussed operational deployment. For humanitarian datasets, DP synthesisers from OpenDP and similar projects are operationally usable, though they require a deliberate choice of privacy budget (epsilon) that has direct utility consequences: a smaller budget means stronger privacy but noisier synthetic data.
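To make the budget trade-off concrete, here is a minimal sketch of one of the simplest differentially private synthesisers: add Laplace noise to a histogram over a fixed set of cells, then resample from the noisy counts. This is not the OpenDP API or any production method. The function names, the toy data, and the epsilon values are assumptions chosen for illustration, and sampling noise with standard floating-point arithmetic is known to be unsafe for real releases, where a vetted library should be used instead.

```python
import random
from collections import Counter
from itertools import product

def laplace_noise(scale, rng):
    """Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def dp_histogram_synth(records, domain, epsilon, n_out, rng):
    """Noisy-histogram synthesiser (illustrative sketch only).

    Adding or removing one record changes exactly one cell count by 1, so the
    histogram has L1 sensitivity 1 and Laplace noise of scale 1/epsilon is used.
    Clamping and resampling are post-processing and do not weaken the guarantee.
    """
    counts = Counter(records)
    # Noise every cell of the fixed, data-independent domain, including empty ones.
    noisy = [max(0.0, counts[cell] + laplace_noise(1 / epsilon, rng)) for cell in domain]
    if sum(noisy) == 0:
        noisy = [1.0] * len(domain)  # fall back to uniform if all cells clamp to zero
    return rng.choices(domain, weights=noisy, k=n_out)

rng = random.Random(0)
domain = list(product(["under_60", "60_plus"], ["male", "female"]))

# Hypothetical toy population of (age_band, head_of_household) records.
real = (
    [("under_60", "male")] * 600
    + [("under_60", "female")] * 300
    + [("60_plus", "male")] * 20
    + [("60_plus", "female")] * 80
)
target = ("60_plus", "female")
real_share = real.count(target) / len(real)

for epsilon in (0.05, 0.5, 5.0):
    synth = dp_histogram_synth(real, domain, epsilon, 50_000, rng)
    synth_share = synth.count(target) / len(synth)
    print(f"epsilon={epsilon}: target share real={real_share:.3f} synth={synth_share:.3f}")
```

With a large epsilon the noise is small relative to the cell counts and the synthetic shares track the real ones closely. As epsilon shrinks, the noise scale 1/epsilon becomes comparable to the size of small cells, which is exactly where the subgroups of policy interest tend to sit.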
How to evaluate a synthetic dataset before relying on it
Five tests are the minimum due diligence. Univariate fidelity: do per-variable distributions match? Bivariate and joint fidelity: do key correlations and joint distributions match? Subgroup fidelity: do the subgroups of policy interest reproduce in the synthetic version? Utility-on-task: do models trained on the synthetic data perform comparably to models trained on real data when both are evaluated on real data? Membership-inference risk: can an attacker tell whether a specific individual was in the training set? The SDGym benchmarking suite and the NIST PETs Prize Challenge results document standard evaluation protocols.
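The first three tests can be sketched with nothing more than frequency tables. The example below is a simplified illustration for categorical data, using total variation distance as one possible fidelity measure; the helper names and the ten-record datasets are hypothetical, and this is not the SDGym protocol. Utility-on-task and membership-inference testing need model training and attack simulation, so they are left out here.

```python
from collections import Counter

def distribution(records, key=lambda r: r):
    """Relative frequency of key(record) over the dataset."""
    counts = Counter(key(r) for r in records)
    return {k: v / len(records) for k, v in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical)."""
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

# Hypothetical (age_band, head_of_household) records, built by hand so that
# the marginals agree while the joint distribution does not.
real = (
    [("under_60", "male")] * 5
    + [("under_60", "female")] * 3
    + [("60_plus", "female")] * 2
)
synth = (
    [("under_60", "male")] * 3
    + [("under_60", "female")] * 5
    + [("60_plus", "male")] * 2
)

def age(record):
    return record[0]

def head(record):
    return record[1]

# Test 1: univariate fidelity, one distance per variable.
tvd_age = total_variation(distribution(real, age), distribution(synth, age))
tvd_head = total_variation(distribution(real, head), distribution(synth, head))

# Test 2: joint fidelity over the full (age_band, head_of_household) cell.
tvd_joint = total_variation(distribution(real), distribution(synth))

# Test 3: subgroup fidelity for the policy target.
target = ("60_plus", "female")
real_target = distribution(real).get(target, 0.0)
synth_target = distribution(synth).get(target, 0.0)

print(f"univariate TVD: age={tvd_age:.2f} head={tvd_head:.2f}")
print(f"joint TVD: {tvd_joint:.2f}")
print(f"target subgroup share: real={real_target:.2f} synth={synth_target:.2f}")
```

In this hand-built example both univariate distances are zero, the joint distance is 0.4, and the target subgroup disappears from the synthetic data entirely, which is why passing the first test alone says little about fitness for policy analysis.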
Governance, not just technique
Synthetic data does not exempt a humanitarian organisation from data protection obligations. Consent obtained for an original dataset does not automatically extend to derivative synthetic releases. The ICRC data protection handbook treats synthetic data as a privacy-enhancing technology subject to the same purpose-limitation and necessity tests as any other processing. Practical implication: document the synthesis pipeline, the privacy guarantees, the utility evaluation, and the access regime before sharing.
Further reading and primary sources
- UN PETs Handbook: https://unstats.un.org/
- OECD PETs work: https://www.oecd.org/digital/
- OpenDP: https://opendp.org/
- SDGym: https://github.com/sdv-dev/SDGym
- US Census disclosure avoidance: https://www.census.gov/programs-surveys/decennial-census/decade/2020/planning-management/process/disclosure-avoidance.html
- ICRC data protection handbook: https://www.icrc.org/en/data-protection-humanitarian-action-handbook
