Summary

As an intern with XRI Global, I collaborated with staff and fellow University of Arizona HLT students on a project aimed at supporting low-resource languages by improving their digital presence.

Our team inventoried existing datasets and language models from platforms like Hugging Face, GitHub, and Mozilla Common Voice. Where no models existed for a language, we trained new ones on open datasets.

My contributions included designing the underlying database schema, identifying data standards, streamlining data collection workflows, and cleaning and standardizing data.

The resulting data powers a user-friendly web interface that allows researchers and developers to easily locate and access resources—helping bridge the digital divide for underrepresented languages.

Content

I cleaned and processed large volumes of multilingual text, trained models, and built custom evaluation scripts using tools like Python, spaCy, scikit-learn, and PostgreSQL.
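
As one illustration of the kind of cleaning step involved, here is a minimal text-normalization sketch (my own reconstruction, not the project's actual pipeline code): it normalizes Unicode, strips control characters, and collapses whitespace before text moves downstream.

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    """Normalize Unicode, strip control characters, and collapse whitespace."""
    # Compose accented characters consistently (NFC) so duplicates compare equal
    text = unicodedata.normalize("NFC", text)
    # Drop control characters (category Cc), keeping newlines and tabs
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Cc" or ch in "\n\t"
    )
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

In practice a function like this would run over each corpus file before tokenization with spaCy or ingestion into PostgreSQL.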

I applied knowledge from courses including NLP, Corpus Linguistics, and Machine Learning, and developed a deeper understanding of real-world data challenges, annotation inconsistencies, and pipeline deployment.

This internship also sharpened my programming and problem-solving skills, and gave me practical experience with collaborative version control, agile workflows, and defining scope with stakeholders.

Process

Requirements Gathering

Name & Classification

Pulled each language’s name, family, and sub-family from Glottolog to structure the hierarchy and enable filtering by genetic lineage.

Schema: language_name, language_family_id, language_subfamily_id, glottocode
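
A sketch of how a Glottolog classification chain might be mapped onto these fields (field values here are hypothetical, and treating the second level of the chain as the sub-family is a simplification; the real pipeline resolved names to `language_family_id` / `language_subfamily_id` foreign keys):

```python
def lineage_to_row(name: str, glottocode: str, lineage: list[str]) -> dict:
    """Map a Glottolog classification chain onto the schema fields.

    `lineage` runs from the top-level family downward, e.g.
    ["Indo-European", "Germanic", "Northwest Germanic", ...].
    """
    return {
        "language_name": name,
        # Resolved to language_family_id on insert
        "language_family": lineage[0] if lineage else None,
        # Resolved to language_subfamily_id on insert
        "language_subfamily": lineage[1] if len(lineage) > 1 else None,
        "glottocode": glottocode,
    }
```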

Location

Retrieved each language’s geographic centroid using GeoNames and stored it as a PostGIS POINT.

Schema: geo_center (PostGIS geometry)
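
A small sketch of the serialization step (my reconstruction): PostGIS expects longitude before latitude in WKT, so a GeoNames (lat, lon) centroid needs reordering before insertion, typically via `ST_GeomFromText(wkt, 4326)` to tag the WGS 84 SRID.

```python
def to_wkt_point(lat: float, lon: float) -> str:
    """Serialize a GeoNames centroid as WKT, longitude first as PostGIS expects.

    Insert with e.g. ST_GeomFromText(%s, 4326) to set the WGS 84 SRID.
    """
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError(f"coordinates out of range: {lat}, {lon}")
    return f"POINT({lon} {lat})"
```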

Global Region

Linked each language to a continent or subregion for aggregation and dashboarding.

Schema: continent_or_region_id

Number of Speakers

Collected estimated speaker populations to help prioritize support efforts.

Schema: num_speakers, pop_source

Language Codes

Standardized ISO 639-1, ISO 639-3, and Glottocode identifiers across the dataset.

Schema: iso_639_1, iso_639_3, glottocode

Models & Datasets

Cataloged available ASR, TTS, and NMT models from Hugging Face, Common Voice, and other sources.

Schema: nmt_datasets, nmt_pairs_source, asr_source, asr, tts, nmt
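
A sketch of how cataloged entries might be collapsed into the per-language row above (the entry format is hypothetical; field names follow the schema, and the schema has no `tts_source` column, so TTS contributes only a flag):

```python
def summarize_models(entries: list[dict]) -> dict:
    """Collapse cataloged model entries for one language into schema fields.

    Each entry is assumed to look like
    {"task": "asr" | "tts" | "nmt", "source": str, "datasets": [str, ...]}.
    """
    row = {
        "asr": False, "tts": False, "nmt": False,
        "asr_source": [], "nmt_pairs_source": [], "nmt_datasets": [],
    }
    for entry in entries:
        task = entry["task"]
        row[task] = True  # modality availability flag
        if task == "asr":
            row["asr_source"].append(entry["source"])
        elif task == "nmt":
            row["nmt_pairs_source"].append(entry["source"])
            row["nmt_datasets"].extend(entry.get("datasets", []))
    return row
```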

Week-by-Week Summary

Weeks 1–4: Planning & Requirements

  • Defined project goals, data scope, and low-resource criteria
  • Designed database schema and metadata structure

Weeks 5–8: Database Implementation

  • Built and populated foundational schema with ISO codes, Glottocodes, and region data

Weeks 9–12: Tool Development & Tag Standardization

  • Developed a Replit-based prototype of the mapping tool
  • Standardized language-pair tags and added NMT metrics

Weeks 13–16: Conference Prep

  • Created visualizations and presentation for external audiences
  • Refined the tool and expanded dataset coverage post-event

Weeks 17–20: Feedback & Metadata Expansion

  • Integrated speaker population estimates
  • Refined UI based on feedback from users and partners

Weeks 21–Present: Ongoing Maintenance

  • Continued model and dataset entry
  • Resolved duplicate tags and standardized metadata
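
The duplicate-tag work above can be sketched with a small canonicalizer (my reconstruction; separators and code forms are assumed from common Hugging Face conventions): variant language-pair tags like "deu-eng" and "eng→deu" collapse to one deduplication key, with translation direction assumed to be tracked in a separate field.

```python
import re

def canonical_pair(tag: str) -> str:
    """Normalize an NMT language-pair tag into a deduplication key.

    Unifies separators (-, _, >, →, space), lowercases, and sorts the two
    codes so reversed duplicates collapse. Sorting discards direction, so
    this is only safe when directionality is stored elsewhere.
    """
    parts = [p for p in re.split(r"[-_>→ ]+", tag.strip().lower()) if p]
    return "-".join(sorted(parts))
```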

Tools & Technologies

  • Languages: Python, SQL
  • Libraries: spaCy, scikit-learn, pandas
  • Tools: PostgreSQL/PostGIS, Replit, GitHub