Summary

As an intern with XRI Global, I collaborated with staff and fellow students in the University of Arizona's Human Language Technology (HLT) program on a project aimed at supporting low-resource languages by improving their digital presence.

Our team inventoried existing datasets and language models from platforms such as Hugging Face, GitHub, and Mozilla Common Voice. For languages where no models were available, we trained new ones on open datasets.
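
To give a sense of the inventory step, the sketch below queries the Hugging Face Hub for models and datasets tagged with a given language. It uses the huggingface_hub client; the `language` filter and the printed fields are illustrative assumptions, not our exact inventory script.

```python
# Sketch: list Hub models/datasets tagged for one language (e.g. Hausa).
# Assumes a recent huggingface_hub; the `language` filter and attribute
# names are illustrative, not the project's actual inventory script.
from huggingface_hub import HfApi

api = HfApi()

for model in api.list_models(language="ha", limit=20):
    print(model.id, model.pipeline_tag)

for dataset in api.list_datasets(language="ha", limit=20):
    print(dataset.id)
```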

My contributions included designing the underlying database schema, identifying data standards, streamlining data collection workflows, and cleaning and standardizing data.

The resulting data powers a user-friendly web interface that allows researchers and developers to easily locate and access resources—helping bridge the digital divide for underrepresented languages.

Content

I cleaned and processed large volumes of multilingual text, trained models, and built custom evaluation scripts using tools such as Python, spaCy, scikit-learn, and PostgreSQL.
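
A trimmed-down sketch of that kind of cleaning-and-evaluation script is below. The column names, cleaning rules, and toy classifier are placeholders for the real pipeline, not a reproduction of it.

```python
# Sketch: clean multilingual text, then evaluate a toy classifier.
# Column names, cleaning rules, and the model are placeholders.
import re

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

def clean_text(text: str) -> str:
    """Strip control characters and collapse whitespace."""
    text = re.sub(r"[\u0000-\u001f]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

df = pd.read_csv("samples.csv")  # hypothetical export with text/label columns
df["text"] = df["text"].astype(str).map(clean_text)
df = df.drop_duplicates(subset="text").dropna(subset=["label"])

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)
vectorizer = TfidfVectorizer()
classifier = LogisticRegression(max_iter=1000)
classifier.fit(vectorizer.fit_transform(X_train), y_train)
print(classification_report(y_test, classifier.predict(vectorizer.transform(X_test))))
```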

I applied knowledge from courses including NLP, Corpus Linguistics, and Machine Learning, and developed a deeper understanding of real-world data challenges, annotation inconsistencies, and pipeline deployment.

This internship also sharpened my programming and problem-solving skills, and gave me practical experience with collaborative version control, agile workflows, and defining scope with stakeholders.

Process

Requirements Gathering & Data Dictionary

To document the existing database supporting Bridging the Digital Divide, I created a data dictionary that describes each table along with its fields, data types, and purpose. This helps ensure clarity and consistency for future users and developers.
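
One way to keep a dictionary like this in sync with the live database is to generate the field listings straight from PostgreSQL's information_schema. A minimal sketch follows; the connection string is a placeholder, not our actual credentials.

```python
# Sketch: dump table/column/type listings from information_schema.
# The DSN is a placeholder; filter table_schema to match the real setup.
import psycopg2

conn = psycopg2.connect("dbname=bridging user=postgres")  # hypothetical DSN
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT table_name, column_name, data_type
        FROM information_schema.columns
        WHERE table_schema = 'public'
        ORDER BY table_name, ordinal_position
        """
    )
    for table, column, dtype in cur.fetchall():
        print(f"{table:<15} {column:<20} {dtype}")
```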

Schema Overview

Table        | Primary Key    | Related Tables    | Purpose
Languages    | language_id    | Datasets          | Stores metadata for each language
Datasets     | dataset_id     | Models            | Contains dataset information per language
Models       | model_id       | Datasets          | Tracks trained ML models
Contributors | contributor_id | Datasets / Models | Tracks contributors and their roles

Languages Table
Field Name    | Data Type | Description                         | Example / Notes
language_id   | INT       | Unique identifier for each language | 1, 2
language_name | VARCHAR   | Official name of the language       | "Hausa", "Navajo"
iso_code      | VARCHAR   | ISO 639-3 language code             | "hau", "nav"
geo_center    | VARCHAR   | Geographic region or center         | "West Africa"
num_speakers  | INT       | Estimated number of speakers        | 52,000,000
status        | VARCHAR   | Vitality status                     | "Vigorous", "Endangered"

Related Tables: datasets.language_id → languages.language_id

Datasets Table
Field Name   | Data Type | Description                        | Example / Notes
dataset_id   | INT       | Unique identifier for each dataset | 101, 102
language_id  | INT       | Reference to associated language   | 1
dataset_name | VARCHAR   | Name of the dataset                | "Hausa Speech Corpus"
data_type    | VARCHAR   | Type of data                       | "ASR", "TTS", "NMT"
num_samples  | INT       | Number of data items               | 10,000
license      | VARCHAR   | Licensing info                     | "CC-BY"
source_url   | VARCHAR   | URL or reference to dataset        | "https://..."

Related Tables: datasets.language_id → languages.language_id; models.dataset_id → datasets.dataset_id

Models Table
Field Name      | Data Type | Description                      | Example / Notes
model_id        | INT       | Unique identifier for each model | 201, 202
dataset_id      | INT       | Reference to dataset             | 101
model_type      | VARCHAR   | Type of model                    | "ASR", "TTS", "MT"
framework       | VARCHAR   | ML framework used                | "Kaldi", "ESPnet"
accuracy_metric | FLOAT     | Performance metric               | 0.87
last_trained    | DATE      | Date model was last trained      | 2025-07-15

Related Tables: models.dataset_id → datasets.dataset_id

Contributors Table
Field Name     | Data Type | Description                            | Example / Notes
contributor_id | INT       | Unique identifier for each contributor | 1, 2
name           | VARCHAR   | Full name of contributor               | "Jennifer Haliewicz"
affiliation    | VARCHAR   | Organization or institution            | "University of Arizona"
role           | VARCHAR   | Role in project                        | "Data Curation", "Model Training"
email          | VARCHAR   | Contact email                          | "jennifer@example.com"

Related Tables: linked to datasets and models via a project mapping table, if needed.
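
Taken together, the dictionary corresponds to roughly the following relational structure. The sketch below uses SQLite so it runs self-contained; the production database is PostgreSQL, and the types and constraints are simplified.

```python
# Sketch: the four tables above as DDL, using SQLite in place of
# PostgreSQL for a self-contained example; types are simplified.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE languages (
        language_id   INTEGER PRIMARY KEY,
        language_name TEXT NOT NULL,
        iso_code      TEXT UNIQUE,   -- ISO 639-3, e.g. 'hau'
        geo_center    TEXT,
        num_speakers  INTEGER,
        status        TEXT           -- e.g. 'Vigorous', 'Endangered'
    );

    CREATE TABLE datasets (
        dataset_id   INTEGER PRIMARY KEY,
        language_id  INTEGER REFERENCES languages(language_id),
        dataset_name TEXT NOT NULL,
        data_type    TEXT,           -- 'ASR', 'TTS', 'NMT'
        num_samples  INTEGER,
        license      TEXT,
        source_url   TEXT
    );

    CREATE TABLE models (
        model_id        INTEGER PRIMARY KEY,
        dataset_id      INTEGER REFERENCES datasets(dataset_id),
        model_type      TEXT,        -- 'ASR', 'TTS', 'MT'
        framework       TEXT,        -- 'Kaldi', 'ESPnet'
        accuracy_metric REAL,
        last_trained    TEXT         -- DATE in PostgreSQL
    );

    CREATE TABLE contributors (
        contributor_id INTEGER PRIMARY KEY,
        name           TEXT NOT NULL,
        affiliation    TEXT,
        role           TEXT,
        email          TEXT
    );
    """
)
```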

Week-by-Week Summary

Weeks 1–4: Planning & Requirements

  • Defined project goals, data scope, and low-resource criteria
  • Designed database schema and metadata structure

Weeks 5–8: Database Implementation

  • Built and populated the foundational schema with ISO codes, Glottocodes, and region data (a population sketch follows below)
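
A population pass over the published SIL ISO 639-3 code table (iso-639-3.tab) might look roughly like the sketch below; the file path, database file, and existing languages table are assumptions for illustration.

```python
# Sketch: seed the languages table from SIL's tab-delimited ISO 639-3 file.
# Path and database are placeholders; assumes the languages table exists
# with a UNIQUE iso_code column.
import sqlite3

import pandas as pd

codes = pd.read_csv("iso-639-3.tab", sep="\t")  # columns include Id, Ref_Name
conn = sqlite3.connect("bridging.db")           # hypothetical local copy
conn.executemany(
    "INSERT OR IGNORE INTO languages (iso_code, language_name) VALUES (?, ?)",
    codes[["Id", "Ref_Name"]].itertuples(index=False, name=None),
)
conn.commit()
```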

Weeks 9–12: Tool Development & Tag Standardization

  • Developed a Replit-based prototype of the mapping tool
  • Standardized language-pair tags and added NMT metrics (a normalization sketch follows below)
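
Standardizing the tags mostly meant mapping the many spellings of a language pair onto one canonical form. A sketch of that normalization is below; the alias map, separators, and canonical "xxx-yyy" format are illustrative choices, not the project's full rule set.

```python
# Sketch: normalize language-pair tags to a canonical ISO 639-3 'xxx-yyy'.
# The alias map and separator handling are illustrative, not exhaustive.
import re

ALIASES = {"en": "eng", "ha": "hau", "nv": "nav"}  # ISO 639-1 -> 639-3 (sample)

def normalize_pair(tag: str) -> str:
    """Map e.g. 'EN->ha' or 'eng_hau' to 'eng-hau'."""
    parts = re.split(r"\s*(?:->|[-_/])\s*", tag.strip().lower())
    if len(parts) != 2:
        raise ValueError(f"unrecognized pair tag: {tag!r}")
    source, target = (ALIASES.get(p, p) for p in parts)
    return f"{source}-{target}"

assert normalize_pair("EN->ha") == "eng-hau"
assert normalize_pair("eng_hau") == "eng-hau"
```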

Weeks 13–16: Conference Prep

  • Created visualizations and presentation for external audiences
  • Refined the tool and expanded dataset coverage post-event

Weeks 17–20: Feedback & Metadata Expansion

  • Integrated speaker population estimates
  • Refined UI based on feedback from users and partners

Weeks 21–Present: Ongoing Maintenance

  • Continued model and dataset entry
  • Resolved duplicate tags and standardized metadata

Tools & Technologies

  • Languages: Python, SQL
  • Libraries: spaCy, scikit-learn, pandas
  • Tools: PostgreSQL/PostGIS, Replit, GitHub