Bridging the Digital Divide:
Where Do We Stand?
Summary
As an intern with XRI Global, I collaborated with staff and fellow University of Arizona HLT students on a project aimed at supporting low-resource languages by improving their digital presence.
Our team inventoried existing datasets and language models from platforms like Hugging Face, GitHub, and Mozilla Common Voice. For some languages where models were not available, we trained new ones using open datasets.
My contributions included designing the underlying database schema, identifying data standards, streamlining data collection workflows, and cleaning and standardizing data.
The resulting data powers a user-friendly web interface that allows researchers and developers to easily locate and access resources—helping bridge the digital divide for underrepresented languages.
Content
I cleaned and processed large volumes of multilingual text, trained models, and built custom evaluation scripts using tools like Python, spaCy, scikit-learn, and PostgreSQL.
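As an illustration of the cleaning work, the sketch below shows the kind of normalization pass described here: Unicode normalization, control-character removal, whitespace collapsing, and duplicate removal with pandas. The file and column names are placeholders, not the project's actual data layout.

```python
# Simplified sketch of a text-cleaning pass; "raw_corpus.tsv" and the
# "text" column are hypothetical placeholders.
import unicodedata

import pandas as pd

def clean_text(s: str) -> str:
    """Normalize Unicode, replace control characters with spaces, collapse whitespace."""
    s = unicodedata.normalize("NFC", s)
    s = "".join(ch if unicodedata.category(ch) != "Cc" else " " for ch in s)
    return " ".join(s.split())

df = pd.read_csv("raw_corpus.tsv", sep="\t")          # hypothetical input file
df["text"] = df["text"].astype(str).map(clean_text)   # normalize every row
df = df[df["text"].str.len() > 0].drop_duplicates()   # drop empty and duplicate rows
df.to_csv("clean_corpus.tsv", sep="\t", index=False)
```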
I applied knowledge from courses including NLP, Corpus Linguistics, and Machine Learning, and developed a deeper understanding of real-world data challenges, annotation inconsistencies, and pipeline deployment.
This internship also sharpened my programming and problem-solving skills, and gave me practical experience with collaborative version control, agile workflows, and defining scope with stakeholders.
Process
Requirements Gathering & Data Dictionary
To document the existing database supporting Bridging the Digital Divide, I created a data dictionary that describes each table, its fields, types, and purpose. This helps ensure clarity and consistency for future users and developers.
Schema Overview
Table | Primary Key | Related Tables | Purpose |
---|---|---|---|
Languages | language_id | Datasets | Stores metadata for each language |
Datasets | dataset_id | Languages / Models | Contains dataset information per language |
Models | model_id | Datasets | Tracks trained ML models |
Contributors | contributor_id | Datasets / Models | Tracks contributors and their roles |
Languages Table
Field Name | Data Type | Description | Example / Notes |
---|---|---|---|
language_id | INT | Unique identifier for each language | 1, 2 |
language_name | VARCHAR | Official name of the language | "Hausa", "Navajo" |
iso_code | VARCHAR | ISO 639-3 language code | "hau", "nav" |
geo_center | VARCHAR | Geographic region or center | "West Africa" |
num_speakers | INT | Estimated number of speakers | 52,000,000 |
status | VARCHAR | Vitality status | "Vigorous", "Endangered" |
Related Tables: datasets.language_id → languages.language_id
Datasets Table
Field Name | Data Type | Description | Example / Notes |
---|---|---|---|
dataset_id | INT | Unique identifier for each dataset | 101, 102 |
language_id | INT | Reference to associated language | 1 |
dataset_name | VARCHAR | Name of the dataset | "Hausa Speech Corpus" |
data_type | VARCHAR | Type of data | "ASR", "TTS", "NMT" |
num_samples | INT | Number of data items | 10,000 |
license | VARCHAR | Licensing info | "CC-BY" |
source_url | VARCHAR | URL or reference to dataset | "https://..." |
Related Tables: datasets.language_id → languages.language_id; models.dataset_id → datasets.dataset_id
Models Table
Field Name | Data Type | Description | Example / Notes |
---|---|---|---|
model_id | INT | Unique identifier for each model | 201, 202 |
dataset_id | INT | Reference to dataset | 101 |
model_type | VARCHAR | Type of model | "ASR", "TTS", "MT" |
framework | VARCHAR | ML framework used | "Kaldi", "ESPnet" |
accuracy_metric | FLOAT | Performance metric | 0.87 |
last_trained | DATE | Date model was last trained | 2025-07-15 |
Related Tables: models.dataset_id → datasets.dataset_id
Contributors Table
Field Name | Data Type | Description | Example / Notes |
---|---|---|---|
contributor_id | INT | Unique identifier for each contributor | 1, 2 |
name | VARCHAR | Full name of contributor | "Jennifer Haliewicz" |
affiliation | VARCHAR | Organization or institution | "University of Arizona" |
role | VARCHAR | Role in project | "Data Curation", "Model Training" |
email | VARCHAR | Contact email | "jennifer@example.com" |
Related Tables: Linked to Datasets and Models through a project mapping table, if needed.
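To make the data dictionary concrete, here is a minimal sketch of how the four tables could be created in PostgreSQL from Python with psycopg2. Column names follow the dictionary above; the field lengths, constraints, connection string, and sample lookup are illustrative assumptions, not the production configuration.

```python
# Minimal sketch of the schema above, created from Python with psycopg2.
# VARCHAR lengths, constraints, and the connection string are assumptions.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS languages (
    language_id   SERIAL PRIMARY KEY,
    language_name VARCHAR(120) NOT NULL,
    iso_code      VARCHAR(3)   UNIQUE,          -- ISO 639-3
    geo_center    VARCHAR(120),
    num_speakers  INT,
    status        VARCHAR(40)
);

CREATE TABLE IF NOT EXISTS datasets (
    dataset_id   SERIAL PRIMARY KEY,
    language_id  INT REFERENCES languages(language_id),
    dataset_name VARCHAR(200) NOT NULL,
    data_type    VARCHAR(20),                   -- e.g. ASR, TTS, NMT
    num_samples  INT,
    license      VARCHAR(60),
    source_url   VARCHAR(500)
);

CREATE TABLE IF NOT EXISTS models (
    model_id        SERIAL PRIMARY KEY,
    dataset_id      INT REFERENCES datasets(dataset_id),
    model_type      VARCHAR(20),
    framework       VARCHAR(60),
    accuracy_metric FLOAT,
    last_trained    DATE
);

CREATE TABLE IF NOT EXISTS contributors (
    contributor_id SERIAL PRIMARY KEY,
    name           VARCHAR(120) NOT NULL,
    affiliation    VARCHAR(200),
    role           VARCHAR(80),
    email          VARCHAR(200)
);
"""

with psycopg2.connect("dbname=bdd user=postgres") as conn:   # hypothetical connection
    with conn.cursor() as cur:
        cur.execute(DDL)
        # Example lookup: all models available for a given ISO 639-3 code.
        cur.execute(
            """SELECT m.model_type, m.framework, d.dataset_name
               FROM models m
               JOIN datasets d  ON m.dataset_id  = d.dataset_id
               JOIN languages l ON d.language_id = l.language_id
               WHERE l.iso_code = %s""",
            ("hau",),
        )
        print(cur.fetchall())   # empty until the tables are populated
```

The foreign keys mirror the "Related Tables" notes above, so a researcher-facing query only needs a couple of joins to move from a language to its datasets and trained models.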
Week-by-Week Summary
Weeks 1–4: Planning & Requirements
- Defined project goals, data scope, and low-resource criteria
- Designed database schema and metadata structure
Weeks 5–8: Database Implementation
- Built and populated the foundational schema with ISO codes, Glottocodes, and region data (see the population sketch below)
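The population step looked roughly like the sketch below, under the same schema assumptions as the DDL sketch above. The language rows and connection string are examples only, and the sketch covers the ISO-code and region fields from the data dictionary.

```python
# Illustrative population step: insert a few language rows with ISO 639-3
# codes and region metadata. Rows and connection string are examples only.
import psycopg2

rows = [
    ("Hausa",  "hau", "West Africa",   52_000_000, "Vigorous"),
    ("Navajo", "nav", "North America",    170_000, "Endangered"),
]

with psycopg2.connect("dbname=bdd user=postgres") as conn:
    with conn.cursor() as cur:
        cur.executemany(
            """INSERT INTO languages
                   (language_name, iso_code, geo_center, num_speakers, status)
               VALUES (%s, %s, %s, %s, %s)
               ON CONFLICT (iso_code) DO NOTHING""",
            rows,
        )
```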
Weeks 9–12: Tool Development & Tag Standardization
- Developed Replit-based prototype of mapping tool
- Standardized language-pair tags and added NMT metrics (see the tag-normalization sketch below)
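Tag standardization was essentially a mapping problem: variant spellings of the same language pair needed to collapse to one canonical form. The snippet below is a hypothetical, simplified version; the ISO_MAP entries and the normalize_pair helper are illustrative, not the actual mapping used in the project.

```python
# Hypothetical normalization of language-pair tags: lowercase the codes and
# map two-letter variants to ISO 639-3, preserving translation direction.
ISO_MAP = {"en": "eng", "ha": "hau", "nv": "nav"}  # illustrative subset

def normalize_pair(tag: str) -> str:
    """'EN-ha' -> 'eng-hau'; already-canonical tags pass through unchanged."""
    src, tgt = (part.strip().lower() for part in tag.split("-"))
    return f"{ISO_MAP.get(src, src)}-{ISO_MAP.get(tgt, tgt)}"

assert normalize_pair("EN-ha") == "eng-hau"
assert normalize_pair("eng-hau") == "eng-hau"
```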
Weeks 13–16: Conference Prep
- Created visualizations and presentation for external audiences
- Refined the tool and expanded dataset coverage post-event
Weeks 17–20: Feedback & Metadata Expansion
- Integrated speaker population estimates
- Refined UI based on feedback from users and partners
Weeks 21–Present: Ongoing Maintenance
- Continued model and dataset entry
- Resolved duplicate tags and standardized metadata (see the de-duplication sketch below)
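Duplicate resolution followed a similar pattern: normalize the metadata fields first, then drop catalog rows that collide on the normalized key. A simplified pandas sketch, with assumed file and column names:

```python
# Illustrative de-duplication pass over a catalog export; the file name and
# the "tag" / "dataset_name" columns are assumptions.
import pandas as pd

catalog = pd.read_csv("catalog_export.csv")                 # hypothetical export
catalog["tag"] = catalog["tag"].str.strip().str.lower()     # normalize tag spelling
catalog["dataset_name"] = catalog["dataset_name"].str.strip()
deduped = catalog.drop_duplicates(subset=["tag", "dataset_name"], keep="first")
deduped.to_csv("catalog_deduped.csv", index=False)
```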
Tools & Technologies
- Languages: Python, SQL
- Libraries: spaCy, scikit-learn, pandas
- Tools: PostgreSQL/PostGIS, Replit, GitHub
Code: View Project on GitHub