AI Bias Against Refugees: Why Western-Trained Models Fail Displaced Populations

The bias problem in humanitarian AI is structural, not incidental

The framing matters. AI bias against displaced populations is not a story about a few bad models or a few careless engineers. It is a structural consequence of how training data is collected, how labels are produced, how benchmarks are chosen, and where the people who build the models live and work. Every one of those steps is concentrated in a small number of high-income countries. The downstream effect, predictable and well documented, is that systems consistently perform worse for the people humanitarian operations are trying to serve.

Where the bias actually enters the pipeline

Bias is not a single failure mode. It enters at five points, and the failures compound.

1. Data collection geography. The corpora that train large language models, image classifiers, and speech systems are disproportionately English-language and North American/European in origin. Common Crawl, the backbone of most LLM pre-training, is roughly 45 percent English. Languages spoken by tens of millions of displaced people (Tigrinya, Rohingya, Sorani Kurdish, Somali, Pashto) are present in fractions of a percent.
2. Label provenance. Annotation labour for benchmark datasets is concentrated in a small number of contractor hubs. Annotators apply their own cultural priors to ambiguous cases, and those priors propagate.
3. Benchmark design. Performance is reported on benchmarks (ImageNet, GLUE, MMLU, HumanEval) that reflect what their creators considered important. Almost none include tasks defined by humanitarian operations.
4. Demographic gaps in faces, voices, names. Face-recognition error rates remain measurably higher for darker-skinned subjects; speech recognition word error rates remain higher for African and South Asian accents; named-entity recognition is brittle on transliterations of Arabic, Pashto, and Amharic names from non-Latin scripts.
5. Deployment context drift. A model validated on a US dataset and deployed in a Cox's Bazar registration system is operating outside its training distribution. Performance will degrade; the question is by how much, and whether anyone is measuring.

What this looks like in operations

The patterns recur across the sector.

Biometric registration mismatches. Face- and fingerprint-matching systems used at registration desks produce higher false-non-match rates for women in headscarves, manual labourers with worn fingerprints, and children whose biometrics change rapidly. The cost is duplicate registrations, missed entitlements, and protection-sensitive errors.
LLM hallucinations on non-Western place names. Models will confidently invent governorate boundaries in Yemen, misattribute camps in South Sudan, or generate fictional UNHCR statistics for Eritrea. The fluency masks the unreliability.
Speech-to-text in protection interviews. Word error rates two to four times higher for non-Western accents make automated transcription of refugee status determination interviews unreliable for exactly the populations being interviewed.
Translation collapse on low-resource languages. Machine translation quality for languages that account for less than one percent of online text (most languages of displacement) ranges from useful-with-review to actively misleading.
Risk-scoring drift. Predictive models for vulnerability scoring, ported from one country to another without retraining, mis-rank households in ways that disadvantage the most marginalised groups.
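
The word error rate comparison behind the speech-to-text pattern can be measured directly on a team's own transcripts. The sketch below is a minimal illustration, not any vendor's method: it computes word error rate (word-level edit distance divided by reference length) per speaker group. The function names and the (group, reference, hypothesis) sample layout are assumptions made for this example.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples):
    """samples: iterable of (group, reference, hypothesis) tuples.

    The layout is hypothetical; groups could be accents or languages.
    Errors and words are pooled per group before dividing, so long
    utterances count proportionally more than short ones.
    """
    errors, words = defaultdict(int), defaultdict(int)
    for group, reference, hypothesis in samples:
        e, n = word_errors(reference, hypothesis)
        errors[group] += e
        words[group] += n
    return {g: errors[g] / words[g] for g in words if words[g] > 0}
```

Run on a held-out set of human-checked transcripts, the ratio between the highest and lowest group values gives a concrete figure to compare against any headline accuracy a vendor quotes.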

Why it persists

The technical fixes (more diverse training data, localised fine-tuning, fairness benchmarks per deployment context) are known. Why have they not closed the gap?

The honest answer is incentives. Model providers optimise for benchmarks that reward broad capability, not for narrow operational performance in humanitarian contexts. The market for a model that is two percent better at Eritrean Tigrinya transcription is small; the market for a model that is two percent better at English coding tasks is enormous. Until humanitarian buyers pool procurement and pay for context-specific performance, the supply will not appear.

The governance gap matters too. The EU AI Act, the NIST AI Risk Management Framework, and the UK AISI evaluations focus on safety and high-stakes domestic deployments. There is no comparable regime that audits AI used in humanitarian operations for performance equity across the populations being served.

What good practice looks like

The agencies making progress share a short list of habits.

Disaggregated evaluation. Performance metrics are reported by language, gender, age, and (where ethical) population-of-concern group. Aggregate accuracy is treated as a starting point, not a result.
Local fine-tuning with consented data. Models are adapted with operationally-collected data under data-protection guidance from UNHCR and ICRC, with retention and access controls written into the contract.
Human-in-the-loop on protection-sensitive outputs. Any decision affecting a refugee's status, location, or entitlements is reviewed by a qualified human before it takes effect.
Procurement clauses on performance equity. RFPs increasingly require vendors to report performance against operationally-relevant demographic and linguistic slices, not just headline benchmarks.
Open documentation. Model cards and datasheets are published for systems used in operations, naming the populations the model was not validated against.
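
Disaggregated evaluation, the first habit above, is simple to sketch. The example below assumes each evaluation record is a dict with a boolean "correct" field plus whatever slice attributes the team has ethically collected; the field names and the equity_report function are hypothetical, chosen only for illustration. It reports overall accuracy alongside per-slice accuracy, sample counts, and the gap to the worst-performing slice.

```python
from collections import defaultdict


def equity_report(records, slice_key):
    """Summarise accuracy overall and per slice.

    records:   list of dicts, each with a boolean 'correct' field and
               a value under slice_key (hypothetical schema).
    slice_key: attribute to disaggregate by, e.g. 'language'.
    """
    if not records:
        raise ValueError("records must not be empty")

    totals, hits = defaultdict(int), defaultdict(int)
    for rec in records:
        group = rec[slice_key]
        totals[group] += 1
        hits[group] += int(rec["correct"])

    per_slice = {g: hits[g] / totals[g] for g in totals}
    overall = sum(hits.values()) / sum(totals.values())
    worst = min(per_slice, key=per_slice.get)
    return {
        "overall": overall,
        "per_slice": per_slice,
        "n": dict(totals),  # sample sizes, so small slices are visible
        "worst_slice": worst,
        "gap": overall - per_slice[worst],
    }


if __name__ == "__main__":
    # Toy data: 10 English records (9 correct), 5 Tigrinya (2 correct).
    data = ([{"language": "en", "correct": True}] * 9
            + [{"language": "en", "correct": False}]
            + [{"language": "ti", "correct": True}] * 2
            + [{"language": "ti", "correct": False}] * 3)
    print(equity_report(data, "language"))
```

The sample counts are returned alongside the accuracies because a slice with a handful of records can look better or worse than it is; a procurement clause on performance equity would plausibly need to specify minimum slice sizes as well as the metric.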

The 2026 read

The performance gap is closing in the high-resource part of the long tail (Arabic, Russian, Spanish, Swahili) and barely moving in the low-resource part where most displaced people actually live. Until that asymmetry shifts, every humanitarian AI deployment needs to assume bias as the default and design around it, not retrofit fairness after launch.

Sources and further reading

NIST Face Recognition Vendor Test demographic effects: https://pages.nist.gov/frvt/
Common Crawl statistics: https://commoncrawl.github.io/cc-crawl-statistics/
UNHCR Data Protection Guidance: https://www.unhcr.org/data-protection
IASC Operational Guidance on Data Responsibility: https://interagencystandingcommittee.org/
Gender Shades and follow-on work: https://gendershades.org/
ICRC Handbook on Data Protection in Humanitarian Action: https://www.icrc.org/en/data-protection-humanitarian-action-handbook
