
    Synthetic Data for Humanitarian Research: When It Helps, When It Misleads

    By the Humanity Centered Data Editorial Team
    June 19, 2026
    10 min read

    Why synthetic data is suddenly everywhere

    Humanitarian data is sensitive almost by definition. Sharing it externally with researchers, donors, or civic-tech partners is constrained by protection obligations even when the analytical value would be high. Synthetic data, generated to match the statistical properties of a real dataset without containing any real records, has emerged as a partial solution. The UN Privacy-Preserving Techniques Handbook and the OECD work on privacy-enhancing technologies document both the promise and the limits.

    Where synthetic data is genuinely useful

    Three uses are well established. Development and testing: building dashboards, pipelines, and ML models against a realistic but non-sensitive dataset. External research collaboration: sharing a synthetic version of a sensitive dataset for exploratory analysis, with sensitive analyses run by the data custodian on the real data. Education and training: teaching humanitarian data skills without exposing real beneficiaries. In all three cases the synthetic data is a means to access without disclosure, not a substitute for real analysis.

    Where it is misleading or unsafe

    Two failure patterns recur. Statistical fidelity at the margin: synthesisers tuned to match marginal distributions can miss the joint distributions that drive the most policy-relevant findings, particularly for rare and small subgroups. A synthetic dataset that accurately reproduces the overall age distribution may distort the distribution of older female-headed households who are the actual policy target. Privacy leakage: poorly configured synthesisers can memorise and reproduce real records, especially outliers; this has been documented in published audits of commercial tools.
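
    The first failure pattern can be made concrete with a deliberately naive synthesiser that samples every column independently. This is a minimal sketch, not how any particular tool works: the toy records, the column layout, and the function names are all invented for illustration. The point is only that matching each marginal says nothing about the joint distribution.

```python
import random

def independent_marginal_synth(rows, n, rng):
    """Naive synthesiser: draw each column independently from its own marginal."""
    columns = list(zip(*rows))
    return [tuple(rng.choice(col) for col in columns) for _ in range(n)]

def share(rows, predicate):
    """Fraction of rows for which predicate(row) is true."""
    return sum(1 for row in rows if predicate(row)) / len(rows)

def is_older(row):
    return row[0] == "60+"

def is_older_female_headed(row):
    return row == ("60+", "F")

# Invented toy data: (age band of household head, sex of household head).
real = (
    [("60+", "F")] * 80 + [("60+", "M")] * 20
    + [("18-59", "F")] * 220 + [("18-59", "M")] * 680
)
synthetic = independent_marginal_synth(real, n=10_000, rng=random.Random(0))

real_marginal = share(real, is_older)                    # 100 / 1000
synth_marginal = share(synthetic, is_older)              # close to the real marginal
real_joint = share(real, is_older_female_headed)         # 80 / 1000
synth_joint = share(synthetic, is_older_female_headed)   # near 0.10 * 0.30, not 0.08

print(f"older heads:        real {real_marginal:.3f}  synthetic {synth_marginal:.3f}")
print(f"older female heads: real {real_joint:.3f}  synthetic {synth_joint:.3f}")
```

    In the toy data, older heads of household are mostly women, so the real joint share (8%) is far above what independence would imply (about 3%). The synthetic age distribution looks right while the policy-relevant subgroup is badly under-represented.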

    Differential privacy is the rigorous floor

    The strongest privacy guarantees in 2026 come from differentially private synthesisers, which provide mathematically bounded leakage with respect to any single individual. The US Census Bureau's adoption of differential privacy for the 2020 census is the most-discussed operational deployment. For humanitarian datasets, DP synthesisers from OpenDP and similar projects are operationally usable, though they require a deliberate choice of privacy budget that has direct utility consequences.
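
    The link between privacy budget and utility can be illustrated with the textbook Laplace mechanism applied to a single count. This is a sketch under stated assumptions: it is not a synthesiser and not the OpenDP API, the subgroup count of 37 is hypothetical, and naive floating-point noise sampling like this has known weaknesses, so a vetted library should be used for any real release.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: the difference of two i.i.d. exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_count(true_count, epsilon, rng):
    """Laplace mechanism for a counting query, whose sensitivity is 1."""
    return true_count + laplace_noise(1.0 / epsilon, rng)

def mean_abs_error(true_count, epsilon, trials, rng):
    """Average absolute error of the noisy count over repeated releases."""
    total = sum(
        abs(dp_count(true_count, epsilon, rng) - true_count)
        for _ in range(trials)
    )
    return total / trials

rng = random.Random(42)
true_count = 37  # hypothetical number of households in a small subgroup

# Smaller epsilon means stronger privacy and a noisier answer.
errors = {
    eps: mean_abs_error(true_count, eps, trials=5_000, rng=rng)
    for eps in (0.1, 1.0, 10.0)
}
for eps, err in errors.items():
    print(f"epsilon={eps:>4}: mean absolute error about {err:.2f}")
```

    The expected absolute error of this mechanism is 1/epsilon, so at epsilon = 0.1 a count of 37 is typically off by about 10. For small subgroups, that is the utility consequence the choice of budget has to weigh.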

    How to evaluate a synthetic dataset before relying on it

    Five tests are the minimum due diligence:

    • Univariate fidelity: do per-variable distributions match?
    • Bivariate and joint fidelity: do key correlations and joint distributions match?
    • Subgroup fidelity: do the subgroups of policy interest reproduce in the synthetic version?
    • Utility-on-task: do models trained on the synthetic data perform similarly when evaluated on real data?
    • Membership-inference risk: can an attacker tell whether a specific individual was in the training set?

    The SDGym benchmarking suite and the NIST PETs Prize Challenge results document standard evaluation protocols.
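
    Three of these checks can be sketched in a few lines of plain Python. The code below is an illustrative simplification and not the SDGym protocol: the function names and toy records are invented, and the leak check (unique real records reappearing verbatim) is only a crude proxy for a proper membership-inference attack. Utility-on-task is not covered here.

```python
from collections import Counter

def total_variation(real_values, synth_values):
    """Total variation distance between two categorical samples (0 = identical)."""
    p, q = Counter(real_values), Counter(synth_values)
    return 0.5 * sum(
        abs(p[k] / len(real_values) - q[k] / len(synth_values))
        for k in set(p) | set(q)
    )

def subgroup_share(rows, predicate):
    """Fraction of rows that belong to the subgroup of policy interest."""
    return sum(1 for row in rows if predicate(row)) / len(rows)

def leaked_unique_records(real_rows, synth_rows):
    """Real rows that occur exactly once and reappear verbatim in the synthetic data."""
    counts = Counter(real_rows)
    synth_set = set(synth_rows)
    return [row for row, c in counts.items() if c == 1 and row in synth_set]

def is_older_female_headed(row):
    return row[0] == "60+" and row[1] == "F"

# Invented toy records: (age band, sex of household head, household size).
real = [("18-59", "M", 4)] * 6 + [("18-59", "F", 3)] * 3 + [("60+", "F", 11)]
synth_a = [("18-59", "M", 4)] * 6 + [("18-59", "F", 3)] * 4  # drops the outlier
synth_b = list(real)                                          # copies it verbatim

for name, synth in (("A", synth_a), ("B", synth_b)):
    age_tv = total_variation([r[0] for r in real], [r[0] for r in synth])
    share = subgroup_share(synth, is_older_female_headed)
    leaks = leaked_unique_records(real, synth)
    print(f"{name}: age TV distance={age_tv:.2f}  subgroup share={share:.2f}  leaks={leaks}")
```

    The two toy candidates fail in opposite directions: A protects the outlier household but erases the subgroup entirely, while B preserves every distribution by reproducing a real, identifiable record. Running fidelity and leakage checks together is what exposes that trade-off.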

    Governance, not just technique

    Synthetic data does not exempt a humanitarian organisation from data protection obligations. Consent obtained for an original dataset does not automatically extend to derivative synthetic releases. The ICRC data protection handbook treats synthetic data as a privacy-enhancing technology subject to the same purpose-limitation and necessity tests as any other processing. Practical implication: document the synthesis pipeline, the privacy guarantees, the utility evaluation, and the access regime before sharing.
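
    One lightweight way to enforce that practical implication is a release record that blocks sharing until all four items are written down. The sketch below is purely illustrative: the class, its field names, and the example entries are hypothetical and do not come from the ICRC handbook or any standard.

```python
from dataclasses import dataclass, fields

@dataclass
class SyntheticReleaseRecord:
    """Hypothetical checklist of what to document before a synthetic release."""
    synthesis_pipeline: str = ""
    privacy_guarantees: str = ""
    utility_evaluation: str = ""
    access_regime: str = ""

def missing_documentation(record):
    """Return the names of the fields that are still empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]

# Example draft with two items still undocumented (entries are invented).
draft = SyntheticReleaseRecord(
    synthesis_pipeline="Synthesiser configuration archived with the data custodian",
    utility_evaluation="Univariate, joint, and subgroup fidelity report attached",
)

gaps = missing_documentation(draft)
ready_to_share = not gaps
print("ready to share:", ready_to_share, "| missing:", gaps)
```

    A record like this does not replace the purpose-limitation and necessity assessment itself; it only makes gaps in the documentation visible before the data leaves the custodian.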

    Further reading and primary sources

    • UN PETs Handbook: https://unstats.un.org/
    • OECD PETs work: https://www.oecd.org/digital/
    • OpenDP: https://opendp.org/
    • SDGym: https://github.com/sdv-dev/SDGym
    • US Census disclosure avoidance: https://www.census.gov/programs-surveys/decennial-census/decade/2020/planning-management/process/disclosure-avoidance.html
    • ICRC data protection handbook: https://www.icrc.org/en/data-protection-humanitarian-action-handbook