    The Bias Problem: Why AI Models Trained on Western Data Fail Displaced Populations

    June 18, 2026 · 10 min read

    The bias problem in humanitarian AI is structural, not incidental

    The framing matters. AI bias against displaced populations is not a story about a few bad models or a few careless engineers. It is a structural consequence of how training data is collected, how labels are produced, how benchmarks are chosen, and where the people who build the models live and work. Every one of those steps is concentrated in a small number of high-income countries. The downstream effect, predictable and well documented, is that systems consistently perform worse for the people humanitarian operations are trying to serve.

    Where the bias actually enters the pipeline

    Bias is not a single failure mode. It enters at five points, and the failures compound.

    1. Data collection geography. The corpora that train large language models, image classifiers, and speech systems are disproportionately English-language and North American/European in origin. Common Crawl, the backbone of most LLM pre-training, is roughly 45 percent English. Languages spoken by tens of millions of displaced people (Tigrinya, Rohingya, Sorani Kurdish, Somali, Pashto) are present in fractions of a percent.
    2. Label provenance. Annotation labour for benchmark datasets is concentrated in a small number of contractor hubs. Annotators apply their own cultural priors to ambiguous cases, and those priors propagate.
    3. Benchmark design. Performance is reported on benchmarks (ImageNet, GLUE, MMLU, HumanEval) that reflect what their creators considered important. Almost none include tasks defined by humanitarian operations.
    4. Demographic gaps in faces, voices, names. Face-recognition error rates remain measurably higher for darker-skinned subjects; speech recognition word error rates remain higher for African and South Asian accents; named-entity recognition is brittle on names transliterated from non-Latin scripts such as Arabic, Pashto, and Amharic.
    5. Deployment context drift. A model validated on a US dataset and deployed in a Cox's Bazar registration system is operating outside its training distribution. Performance will degrade; the question is by how much, and whether anyone is measuring.
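
    Deployment context drift can at least be measured. The sketch below is one illustrative way to do it, not a method named in the sources: it computes a population stability index (PSI) over a single categorical feature, here the language of incoming records, comparing the validation sample with what the deployed system actually sees. The function name, the sample data, and the language codes are all assumptions made for the example; a value above roughly 0.25 is a commonly quoted rule of thumb for a significant shift, not a standard.

```python
import math
from collections import Counter


def population_stability_index(reference, deployed, eps=1e-6):
    """PSI between two samples of one categorical feature.

    `reference` is the sample the model was validated on and `deployed`
    is what it sees in operation. `eps` stands in for categories that are
    absent from one side, so the logarithm stays defined.
    """
    if not reference or not deployed:
        raise ValueError("both samples must be non-empty")
    ref_counts = Counter(reference)
    dep_counts = Counter(deployed)
    psi = 0.0
    for category in set(ref_counts) | set(dep_counts):
        p = max(ref_counts[category] / len(reference), eps)
        q = max(dep_counts[category] / len(deployed), eps)
        psi += (q - p) * math.log(q / p)
    return psi


# Hypothetical samples: validation data is mostly English, while the
# deployed system mostly receives Rohingya ("rhg") records.
validation_languages = ["en"] * 90 + ["ar"] * 10
deployment_languages = ["rhg"] * 70 + ["en"] * 20 + ["ar"] * 10

drift = population_stability_index(validation_languages, deployment_languages)
```

    A large value does not say how much accuracy has been lost; it only flags that the validation results should no longer be trusted without re-evaluation on local data.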

    What this looks like in operations

    The patterns recur across the sector.

    • Biometric registration mismatches. Face- and fingerprint-matching systems used at registration desks produce higher false-non-match rates for women in headscarves, manual labourers with worn fingerprints, and children whose biometrics change rapidly. The cost is duplicate registrations, missed entitlements, and protection-sensitive errors.
    • LLM hallucinations on non-Western place names. Models will confidently invent governorate boundaries in Yemen, misattribute camps in South Sudan, or generate fictional UNHCR statistics for Eritrea. The fluency masks the unreliability.
    • Speech-to-text in protection interviews. Word error rates two to four times higher for non-Western accents make automated transcription of refugee status determination interviews unreliable for exactly the populations being interviewed.
    • Translation collapse on low-resource languages. Machine translation quality for languages that make up less than one percent of web content (most languages of displacement) ranges from useful-with-review to actively misleading.
    • Risk-scoring drift. Predictive models for vulnerability scoring, ported from one country to another without retraining, mis-rank households in ways that disadvantage the most marginalised groups.
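
    The speech-to-text comparison above rests on word error rate (WER): the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. A minimal sketch of computing it per accent or language group follows. The group labels and the `(group, reference, hypothesis)` tuple layout are assumptions for illustration, and real evaluations usually normalise case and punctuation before scoring.

```python
from collections import defaultdict


def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance counted in words (substitutions, insertions, deletions)."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def wer_by_group(samples):
    """Pooled WER per group from (group, reference, hypothesis) tuples.

    Errors and reference words are summed within each group before
    dividing, so long utterances count in proportion to their length.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, reference, hypothesis in samples:
        ref_words = reference.split()
        errors[group] += word_edit_distance(ref_words, hypothesis.split())
        words[group] += len(ref_words)
    return {group: errors[group] / words[group] for group in words if words[group]}
```

    Reporting the per-group figures side by side, rather than one pooled number, is what makes a two-to-four-times gap visible in the first place.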

    Why it persists

    The technical fixes (more diverse training data, localised fine-tuning, fairness benchmarks per deployment context) are known. Why have they not closed the gap?

    The honest answer is incentives. Model providers optimise for benchmarks that reward broad capability, not for narrow operational performance in humanitarian contexts. The market for a model that is two percent better at Eritrean Tigrinya transcription is small; the market for a model that is two percent better at English coding tasks is enormous. Until humanitarian buyers pool procurement and pay for context-specific performance, the supply will not appear.

    The governance gap matters too. The EU AI Act, the NIST AI Risk Management Framework, and the UK AISI evaluations focus on safety and high-stakes domestic deployments. There is no comparable regime that audits AI used in humanitarian operations for performance equity across the populations being served.

    What good practice looks like

    The agencies making progress share a short list of habits.

    • Disaggregated evaluation. Performance metrics are reported by language, gender, age, and (where ethical) population-of-concern group. Aggregate accuracy is treated as a starting point, not a result.
    • Local fine-tuning with consented data. Models are adapted with operationally collected data under data-protection guidance from UNHCR and ICRC, with retention and access controls written into the contract.
    • Human-in-the-loop on protection-sensitive outputs. Any decision affecting a refugee's status, location, or entitlements is reviewed by a qualified human before it takes effect.
    • Procurement clauses on performance equity. RFPs increasingly require vendors to report performance against operationally relevant demographic and linguistic slices, not just headline benchmarks.
    • Open documentation. Model cards and datasheets are published for systems used in operations, naming the populations the model was not validated against.
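
    Disaggregated evaluation, the first habit in the list above, is simple to implement once each evaluation record carries the slice attributes. The sketch below is a minimal illustration under assumed field names (`language`, `label`, `prediction`) and made-up records; it is not drawn from any agency's tooling.

```python
from collections import defaultdict


def accuracy(records):
    """Share of records whose prediction matches the label."""
    return sum(r["prediction"] == r["label"] for r in records) / len(records)


def disaggregated_accuracy(records, slice_key):
    """Accuracy per value of `slice_key` (for example language or gender)."""
    slices = defaultdict(list)
    for record in records:
        slices[record[slice_key]].append(record)
    return {value: accuracy(group) for value, group in slices.items()}


# Hypothetical evaluation set: six English records, all correct, and
# four Tigrinya ("ti") records, of which two are correct.
records = (
    [{"language": "en", "label": 1, "prediction": 1}] * 6
    + [{"language": "ti", "label": 1, "prediction": 1}] * 2
    + [{"language": "ti", "label": 1, "prediction": 0}] * 2
)

overall = accuracy(records)                                  # 0.8
by_language = disaggregated_accuracy(records, "language")    # {"en": 1.0, "ti": 0.5}
worst_slice = min(by_language, key=by_language.get)          # "ti"
```

    In this toy case the aggregate figure of 0.8 hides a slice that is right only half the time, which is why aggregate accuracy is treated as a starting point rather than a result.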

    The 2026 read

    The performance gap is closing in the high-resource part of the long tail (Arabic, Russian, Spanish, Swahili) and barely moving in the low-resource part, where most displaced people actually live. Until that asymmetry shifts, every humanitarian AI deployment needs to assume bias as the default and design around it, not retrofit fairness after launch.

    Sources and further reading

    • NIST Face Recognition Vendor Test demographic effects: https://pages.nist.gov/frvt/
    • Common Crawl statistics: https://commoncrawl.github.io/cc-crawl-statistics/
    • UNHCR Data Protection Guidance: https://www.unhcr.org/data-protection
    • IASC Operational Guidance on Data Responsibility: https://interagencystandingcommittee.org/
    • Gender Shades and follow-on work: https://gendershades.org/
    • ICRC Handbook on Data Protection in Humanitarian Action: https://www.icrc.org/en/data-protection-humanitarian-action-handbook