
    Training Data Sovereignty: Why the Global South Needs Its Own AI Datasets

    By the Humanity Centered Data Editorial Team
    June 19, 2026 · 11 min read

    The training-data geography problem

    Frontier AI models are trained on data that is overwhelmingly North American and European in origin, written in a small number of languages, and reflective of the cultural assumptions of those regions. The consequences for users in the Global South are direct: weaker performance, more frequent factual errors about local contexts, and a slow but consistent re-centring of language and knowledge norms. The State of AI in Africa report and the Lacuna Fund annual reviews document the asymmetry quantitatively.

    Sovereignty as a frame, not a slogan

    Data sovereignty in this context means three concrete things. Authorship: the data is created, curated, and labelled by the communities whose languages and contexts it represents. Custody: storage, access controls, and licensing remain under those communities' governance. Benefit: economic and research value derived from the data accrues primarily to the communities that produced it. The CARE Principles for Indigenous Data Governance and the Te Mana Raraunga Māori Data Sovereignty Network provide the most developed governance frameworks.
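
For teams that keep dataset metadata in code, the three criteria above can be sketched as a simple record with a gap check. This is an illustrative sketch only, not part of CARE or any published framework: the class name, the field names, and the 0.5 benefit threshold (one possible reading of "primarily") are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class DatasetGovernanceRecord:
    """Hypothetical metadata record; all field names are illustrative."""
    name: str
    authored_by_community: bool          # Authorship
    community_controls_access: bool      # Custody: storage and access
    community_controls_licensing: bool   # Custody: licensing
    community_benefit_share: float       # Benefit: fraction of value, 0.0 to 1.0


def sovereignty_gaps(record, min_benefit_share=0.5):
    """Return which of the three criteria the record fails to meet.

    The threshold is an assumption: "primarily" is read here as
    strictly more than half of the derived value.
    """
    gaps = []
    if not record.authored_by_community:
        gaps.append("authorship")
    if not (record.community_controls_access
            and record.community_controls_licensing):
        gaps.append("custody")
    if record.community_benefit_share <= min_benefit_share:
        gaps.append("benefit")
    return gaps


# Example: community-authored corpus whose licensing sits with a third party.
record = DatasetGovernanceRecord(
    name="example-speech-corpus",
    authored_by_community=True,
    community_controls_access=True,
    community_controls_licensing=False,
    community_benefit_share=0.7,
)
print(sovereignty_gaps(record))  # ['custody']
```

A check like this cannot replace community-led governance; at most it makes missing commitments visible early in a procurement or funding review.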

    Who is building the datasets

    Several initiatives stand out in 2026. [Masakhane](https://www.masakhane.io/) has built foundational NLP resources for over 40 African languages through distributed community research. [AI4D Africa](https://africa.ai4d.ai/) funds language and applied AI projects across the continent. [Lacuna Fund](https://lacunafund.org/) finances labelled datasets in agriculture, health, language, and climate for the Global South. [Distributed AI Research Institute (DAIR)](https://www.dair-institute.org/) runs research grounded in affected-community priorities. [Common Voice](https://commonvoice.mozilla.org/) crowdsources speech data in over 100 languages.

    What the datasets are good for

    The most immediate humanitarian impact is in language coverage. Models fine-tuned on Masakhane resources outperform generic multilingual models on African-language tasks by margins that are operationally significant for translation, transcription, and chatbot deployment. In health and agriculture, Lacuna-funded datasets have enabled locally trained models for crop disease detection and clinical decision support that are not feasible with generic vision or language models alone.

    What humanitarian organisations can do

    Three practical contributions stand out. Contribute data with proper governance: humanitarian organisations sit on large multilingual document corpora that, with appropriate consent and redaction, could expand low-resource language coverage. Fund the labelling and curation: the binding constraint on Global South AI datasets is not raw data but funded labelling and stewardship. Procure with sovereignty in mind: prefer models and vendors that document training data provenance and that compensate the communities whose data they used.

    What to watch in 2026 and beyond

    Two trends are worth tracking. The African Union Continental AI Strategy and equivalent regional strategies in Latin America and Southeast Asia are increasingly treating training data as critical infrastructure. The emergence of regional model-training initiatives, including state-backed and university-led efforts, is beginning to produce models that are not derivative of Western frontier labs. Whether these reach frontier capability is uncertain; whether they serve their primary populations well is the more important question.

    Further reading and primary sources

    • Masakhane: https://www.masakhane.io/
    • AI4D Africa: https://africa.ai4d.ai/
    • Lacuna Fund: https://lacunafund.org/
    • DAIR Institute: https://www.dair-institute.org/
    • Common Voice: https://commonvoice.mozilla.org/
    • CARE Principles: https://www.gida-global.org/care
    • Te Mana Raraunga: https://www.temanararaunga.maori.nz/