Humanity Centered Data | UN Refugee & IDP Tracker

Training Data Sovereignty: Why the Global South Needs Its Own AI Datasets

By the Humanity Centered Data Editorial Team. Published June 19, 2026

The training-data geography problem

Frontier AI models are trained on data that is overwhelmingly North American and European in origin, written in a small number of languages, and reflective of the cultural assumptions of those regions. The consequences for users in the Global South are direct: weaker performance, more frequent factual errors about local contexts, and a slow but consistent re-centring of language and knowledge norms. The State of AI in Africa report and the Lacuna Fund annual reviews document the asymmetry quantitatively.

Sovereignty as a frame, not a slogan

Data sovereignty in this context means three concrete things. Authorship: the data is created, curated, and labelled by the communities whose languages and contexts it represents. Custody: storage, access controls, and licensing remain under those communities' governance. Benefit: economic and research value derived from the data accrues primarily to the communities that produced it. The CARE Principles for Indigenous Data Governance and the Te Mana Raraunga Māori Data Sovereignty Network provide the most developed governance frameworks.

Who is building the datasets

Several initiatives stand out in 2026. [Masakhane](https://www.masakhane.io/) has built foundational NLP resources for over 40 African languages through distributed community research. [AI4D Africa](https://africa.ai4d.ai/) funds language and applied AI projects across the continent. [Lacuna Fund](https://lacunafund.org/) finances labelled datasets in agriculture, health, language, and climate for the Global South. [Distributed AI Research Institute (DAIR)](https://www.dair-institute.org/) runs research grounded in affected-community priorities. [Common Voice](https://commonvoice.mozilla.org/) crowdsources speech data in over 100 languages.

What the datasets are good for

The most immediate humanitarian impact is in language coverage. Models fine-tuned on Masakhane resources outperform generic multilingual models on African-language tasks by margins that are operationally significant for translation, transcription, and chatbot deployment. In health and agriculture, Lacuna-funded datasets have enabled locally trained models for crop disease detection and clinical decision support that are not feasible with generic vision or language models alone.

What humanitarian organisations can do

Three practical contributions stand out. Contribute data with proper governance: humanitarian organisations sit on large multilingual document corpora that, with appropriate consent and redaction, could expand low-resource language coverage. Fund the labelling and curation: the binding constraint on Global South AI datasets is not raw data but funded labelling and stewardship. Procure with sovereignty in mind: prefer models and vendors that document training data provenance and that compensate the communities whose data they used.

What to watch in 2026 and beyond

Two trends are worth tracking. First, the African Union Continental AI Strategy and equivalent regional strategies in Latin America and Southeast Asia are increasingly treating training data as critical infrastructure. Second, the emergence of regional model-training initiatives, including state-backed and university-led efforts, is beginning to produce models that are not derivative of Western frontier labs. Whether these reach frontier capability is uncertain; whether they serve their primary populations well is the more important question.
