Summary

As an intern with XRI Global, I collaborated with staff and fellow University of Arizona HLT students on a project aimed at supporting low-resource languages by improving their digital presence.

Our team inventoried existing datasets and language models from platforms like Hugging Face, GitHub, and Mozilla Common Voice. Where no models existed for a language, we trained new ones on open datasets.

My contributions included designing the underlying database schema, identifying data standards, streamlining data collection workflows, and cleaning and standardizing data.

The resulting data powers a user-friendly web interface that allows researchers and developers to easily locate and access resources—helping bridge the digital divide for underrepresented languages.

Content

I cleaned and processed large volumes of multilingual text, trained models, and built custom evaluation scripts using tools like Python, spaCy, scikit-learn, and PostgreSQL.
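
As one illustration of the kind of cleaning step involved, here is a minimal text-normalization sketch (my own reconstruction, not the project's actual pipeline code): it normalizes Unicode, strips control characters, and collapses whitespace before text moves downstream.

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    """Normalize Unicode, strip control characters, and collapse whitespace."""
    # Compose accented characters consistently (NFC) so duplicates compare equal
    text = unicodedata.normalize("NFC", text)
    # Drop control characters (category Cc), keeping newlines and tabs
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Cc" or ch in "\n\t"
    )
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

In practice a function like this would run over each corpus file before tokenization with spaCy or ingestion into PostgreSQL.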

I applied knowledge from courses including NLP, Corpus Linguistics, and Machine Learning, and developed a deeper understanding of real-world data challenges, annotation inconsistencies, and pipeline deployment.

This internship also sharpened my programming and problem-solving skills, and gave me practical experience with collaborative version control, agile workflows, and defining scope with stakeholders.

Process

Requirements Gathering

Name & Classification

Pulled each language’s name, family, and sub-family from Glottolog to structure the hierarchy and enable filtering by genetic lineage.

Schema: language_name, language_family_id, language_subfamily_id, glottocode
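
A sketch of how a Glottolog classification chain might be mapped onto these fields (field values here are hypothetical, and treating the second level of the chain as the sub-family is a simplification; the real pipeline resolved names to `language_family_id` / `language_subfamily_id` foreign keys):

```python
def lineage_to_row(name: str, glottocode: str, lineage: list[str]) -> dict:
    """Map a Glottolog classification chain onto the schema fields.

    `lineage` runs from the top-level family downward, e.g.
    ["Indo-European", "Germanic", "Northwest Germanic", ...].
    """
    return {
        "language_name": name,
        # Resolved to language_family_id on insert
        "language_family": lineage[0] if lineage else None,
        # Resolved to language_subfamily_id on insert
        "language_subfamily": lineage[1] if len(lineage) > 1 else None,
        "glottocode": glottocode,
    }
```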

Location

Retrieved each language’s geographic centroid using GeoNames and stored it as a PostGIS POINT.

Schema: geo_center (PostGIS geometry)
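
A small sketch of the serialization step (my reconstruction): PostGIS expects longitude before latitude in WKT, so a GeoNames (lat, lon) centroid needs reordering before insertion, typically via `ST_GeomFromText(wkt, 4326)` to tag the WGS 84 SRID.

```python
def to_wkt_point(lat: float, lon: float) -> str:
    """Serialize a GeoNames centroid as WKT, longitude first as PostGIS expects.

    Insert with e.g. ST_GeomFromText(%s, 4326) to set the WGS 84 SRID.
    """
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError(f"coordinates out of range: {lat}, {lon}")
    return f"POINT({lon} {lat})"
```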

Global Region

Linked each language to a continent or subregion for aggregation and dashboarding.

Schema: continent_or_region_id

Number of Speakers

Collected estimated speaker populations to help prioritize support efforts.

Schema: num_speakers, pop_source

Language Codes

Standardized ISO 639-1, ISO 639-3, and Glottocode identifiers across the dataset.

Schema: iso_639_1, iso_639_3, glottocode

Models & Datasets

Cataloged available ASR, TTS, and NMT models from Hugging Face, Common Voice, and other sources.

Schema: nmt_datasets, nmt_pairs_source, asr_source, asr, tts, nmt
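
A sketch of how cataloged entries might be collapsed into the per-language row above (the entry format is hypothetical; field names follow the schema, and the schema has no `tts_source` column, so TTS contributes only a flag):

```python
def summarize_models(entries: list[dict]) -> dict:
    """Collapse cataloged model entries for one language into schema fields.

    Each entry is assumed to look like
    {"task": "asr" | "tts" | "nmt", "source": str, "datasets": [str, ...]}.
    """
    row = {
        "asr": False, "tts": False, "nmt": False,
        "asr_source": [], "nmt_pairs_source": [], "nmt_datasets": [],
    }
    for entry in entries:
        task = entry["task"]
        row[task] = True  # modality availability flag
        if task == "asr":
            row["asr_source"].append(entry["source"])
        elif task == "nmt":
            row["nmt_pairs_source"].append(entry["source"])
            row["nmt_datasets"].extend(entry.get("datasets", []))
    return row
```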

Week-by-Week Summary

Weeks 1–4: Planning & Requirements

  • Defined project goals, data scope, and low-resource criteria
  • Designed database schema and metadata structure

Weeks 5–8: Database Implementation

  • Built and populated foundational schema with ISO codes, Glottocodes, and region data

Weeks 9–12: Tool Development & Tag Standardization

  • Developed a Replit-based prototype of the mapping tool
  • Standardized language-pair tags and added NMT metrics

Weeks 13–16: Conference Prep

  • Created visualizations and presentation for external audiences
  • Refined the tool and expanded dataset coverage post-event

Weeks 17–20: Feedback & Metadata Expansion

  • Integrated speaker population estimates
  • Refined UI based on feedback from users and partners

Weeks 21–Present: Ongoing Maintenance

  • Continued model and dataset entry
  • Resolved duplicate tags and standardized metadata
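
The duplicate-tag work above can be sketched with a small canonicalizer (my reconstruction; separators and code forms are assumed from common Hugging Face conventions): variant language-pair tags like "deu-eng" and "eng→deu" collapse to one deduplication key, with translation direction assumed to be tracked in a separate field.

```python
import re

def canonical_pair(tag: str) -> str:
    """Normalize an NMT language-pair tag into a deduplication key.

    Unifies separators (-, _, >, →, space), lowercases, and sorts the two
    codes so reversed duplicates collapse. Sorting discards direction, so
    this is only safe when directionality is stored elsewhere.
    """
    parts = [p for p in re.split(r"[-_>→ ]+", tag.strip().lower()) if p]
    return "-".join(sorted(parts))
```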

Tools & Technologies

  • Languages: Python, SQL
  • Libraries: spaCy, scikit-learn, pandas
  • Tools: PostgreSQL/PostGIS, Replit, GitHub