Bridging the Digital Divide:
Where Do We Stand?
Summary
As an intern with XRI Global, I collaborated with staff and fellow University of Arizona HLT students on a project aimed at supporting low-resource languages by improving their digital presence.
Our team inventoried existing datasets and language models from platforms like Hugging Face, GitHub, and Mozilla Common Voice. For languages with no existing models, we trained new ones using open datasets.
My contributions included designing the underlying database schema, identifying data standards, streamlining data collection workflows, and cleaning and standardizing data.
The resulting data powers a user-friendly web interface that allows researchers and developers to easily locate and access resources—helping bridge the digital divide for underrepresented languages.
Content
I cleaned and processed large volumes of multilingual text, trained models, and built custom evaluation scripts using tools like Python, spaCy, scikit-learn, and PostgreSQL.
I applied knowledge from courses including NLP, Corpus Linguistics, and Machine Learning, and developed a deeper understanding of real-world data challenges, annotation inconsistencies, and pipeline deployment.
This internship also sharpened my programming and problem-solving skills, and gave me practical experience with collaborative version control, agile workflows, and defining scope with stakeholders.
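As a small illustration of the cleaning work described above, here is a minimal sketch of the kind of preprocessing script I wrote before loading text into the database; the file and column names are placeholders, not the project's actual ones.

```python
# Minimal cleaning sketch: Unicode-normalize, collapse whitespace, drop blanks
# and duplicates. File and column names are placeholders.
import unicodedata

import pandas as pd


def clean_text(text: str) -> str:
    """Normalize to NFC and collapse internal whitespace."""
    text = unicodedata.normalize("NFC", str(text))
    return " ".join(text.split())


df = pd.read_csv("raw_corpus.csv")               # expects a 'text' column
df["text"] = df["text"].map(clean_text)
df = df[df["text"].str.len() > 0].drop_duplicates(subset="text")
df.to_csv("clean_corpus.csv", index=False)
print(f"{len(df)} cleaned sentences written")
```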
Process
Requirements Gathering
Name & Classification
Pulled each language’s name, family, and sub-family from Glottolog to structure the hierarchy and enable filtering by genetic lineage.
Schema: language_name, language_family_id, language_subfamily_id, glottocode
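A minimal sketch of how these columns can be declared with psycopg2 against the project's PostgreSQL database; the column names come from the schema above, while the table layout and types are illustrative assumptions.

```python
# Classification columns from the schema above; table names and column types
# are assumptions made for illustration.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS language_families (
    id   SERIAL PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE IF NOT EXISTS language_subfamilies (
    id   SERIAL PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE IF NOT EXISTS languages (
    id                    SERIAL PRIMARY KEY,
    language_name         TEXT NOT NULL,
    language_family_id    INTEGER REFERENCES language_families(id),
    language_subfamily_id INTEGER REFERENCES language_subfamilies(id),
    glottocode            TEXT UNIQUE              -- e.g. 'stan1293'
);
"""

with psycopg2.connect("dbname=langdb") as conn:   # connection string is a placeholder
    with conn.cursor() as cur:
        cur.execute(DDL)
```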
Location
Retrieved each language’s geographic centroid using GeoNames and stored it as a PostGIS POINT.
Schema: geo_center (PostGIS geometry)
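Storing the centroid itself is a standard PostGIS call; the sketch below assumes WGS84 coordinates (SRID 4326), the hypothetical languages table from the previous section, and a geo_center column of type GEOMETRY(Point, 4326).

```python
# Store a language's centroid as a PostGIS POINT (longitude first, SRID 4326).
import psycopg2

lon, lat = 34.75, -6.37        # example centroid; real values came from GeoNames

with psycopg2.connect("dbname=langdb") as conn:
    with conn.cursor() as cur:
        cur.execute(
            """
            UPDATE languages
            SET geo_center = ST_SetSRID(ST_MakePoint(%s, %s), 4326)
            WHERE glottocode = %s
            """,
            (lon, lat, "swah1253"),               # Swahili, as an example
        )
```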
Global Region
Linked each language to a continent or subregion for aggregation and dashboarding.
Schema: continent_or_region_id
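Because the region is just a foreign key, the roll-ups that feed a dashboard stay simple; below is a sketch of the kind of aggregation query this enables, assuming a hypothetical regions lookup table.

```python
# Count languages per region via the continent_or_region_id foreign key;
# the 'regions' lookup table name is an assumption.
import psycopg2

QUERY = """
SELECT r.name AS region, COUNT(*) AS num_languages
FROM languages AS l
JOIN regions   AS r ON r.id = l.continent_or_region_id
GROUP BY r.name
ORDER BY num_languages DESC;
"""

with psycopg2.connect("dbname=langdb") as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for region, count in cur.fetchall():
            print(f"{region}: {count}")
```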
Number of Speakers
Collected estimated speaker populations to help prioritize support efforts.
Schema: num_speakers, pop_source
Language Codes
Standardized ISO 639-1, ISO 639-3, and Glottocode identifiers across the dataset.
Schema: iso_639_1, iso_639_3, glottocode
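Standardizing mostly meant catching malformed or inconsistently cased codes; here is a minimal validation sketch, assuming the usual shapes of the three identifier types (the exact rules used in the project may differ).

```python
# Normalize and validate language identifiers before insertion.
# ISO 639-1: two letters; ISO 639-3: three letters; Glottocode: four
# alphanumerics plus four digits (e.g. 'stan1293').
import re

PATTERNS = {
    "iso_639_1": re.compile(r"^[a-z]{2}$"),
    "iso_639_3": re.compile(r"^[a-z]{3}$"),
    "glottocode": re.compile(r"^[a-z0-9]{4}[0-9]{4}$"),
}


def normalize_code(field: str, value: str) -> str | None:
    """Lowercase and strip a code; return None if it does not match the expected shape."""
    code = value.strip().lower()
    return code if PATTERNS[field].match(code) else None


assert normalize_code("iso_639_3", " SWH ") == "swh"
assert normalize_code("glottocode", "swah1253") == "swah1253"
assert normalize_code("iso_639_1", "eng") is None   # three letters, so not ISO 639-1
```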
Models & Datasets
Cataloged available ASR, TTS, and NMT models and datasets from Hugging Face, Common Voice, and other sources.
Schema: nmt_datasets, nmt_pairs_source, asr_source, asr, tts, nmt
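Much of this cataloging can be seeded programmatically from the Hugging Face Hub; the sketch below assumes a recent huggingface_hub client and tag-based filtering, with Swahili as an arbitrary example, and candidates were still reviewed by hand before entry.

```python
# List candidate ASR models for one language from the Hugging Face Hub by tag.
# 'sw' (Swahili) is an arbitrary example; hits were reviewed before being entered.
from huggingface_hub import HfApi

api = HfApi()
candidates = api.list_models(
    filter=["automatic-speech-recognition", "sw"],   # pipeline tag + language tag
    limit=20,
)

for model in candidates:
    print(model.id)
```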
Week-by-Week Summary
Weeks 1–4: Planning & Requirements
- Defined project goals, data scope, and low-resource criteria
- Designed database schema and metadata structure
Weeks 5–8: Database Implementation
- Built and populated foundational schema with ISO codes, Glottocodes, and region data
Weeks 9–12: Tool Development & Tag Standardization
- Developed Replit-based prototype of mapping tool
- Standardized language-pair tags and added NMT metrics
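The language-pair cleanup in this phase largely came down to mapping the many spellings of the same pair ('en-sw', 'eng_swa', 'fr > eng') onto one canonical form; here is a minimal sketch with a small, hypothetical ISO 639-1 to ISO 639-3 lookup.

```python
# Normalize NMT language-pair tags to a canonical 'xxx-yyy' form.
# The two-letter-to-three-letter lookup is a tiny illustrative subset.
import re

ISO1_TO_ISO3 = {"en": "eng", "sw": "swa", "fr": "fra"}


def normalize_pair(tag: str) -> str:
    """Split on common separators and map 2-letter codes to 3-letter codes."""
    src, tgt = re.split(r"[-_>/ ]+", tag.strip().lower())
    return f"{ISO1_TO_ISO3.get(src, src)}-{ISO1_TO_ISO3.get(tgt, tgt)}"


assert normalize_pair("en-sw") == "eng-swa"
assert normalize_pair("ENG_SWA") == "eng-swa"
assert normalize_pair("fr > eng") == "fra-eng"
```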
Weeks 13–16: Conference Prep
- Created visualizations and a presentation for external audiences
- Refined the tool and expanded dataset coverage post-event
Weeks 17–20: Feedback & Metadata Expansion
- Integrated speaker population estimates
- Refined UI based on feedback from users and partners
Weeks 21–Present: Ongoing Maintenance
- Continued model and dataset entry
- Resolved duplicate tags and standardized metadata
Tools & Technologies
- Languages: Python, SQL
- Libraries: spaCy, scikit-learn, pandas
- Tools: PostgreSQL/PostGIS, Replit, GitHub
Code: View Project on GitHub