New AI tool to revitalise endangered indigenous language

Jared Coleman, who recently earned his Ph.D. in computer science, and his supervisor, Bhaskar Krishnamachari, both love languages—human and computer.

Krishnamachari, who grew up in India, speaks Tamil, Hindi, and English, and learned French and Mandarin Chinese in college. Coleman, an English speaker, enjoyed learning Spanish in high school and picked up Portuguese from his now-wife and friends in college.

During the pandemic, Coleman started learning Owens Valley Paiute online. Coleman is a member of the Big Pine Paiute Tribe of Owens Valley—his father, David, grew up on the tribe’s reservation in Big Pine, CA, and Paiute is their ancestral language.

AI tools like ChatGPT perform well with languages like English because those languages are widely spoken and have abundant text available for training. However, Paiute is a “no-resource language,” meaning there are no Paiute sentences translated into English for training AI models.

In a new paper, “LLM-Assisted Rule-Based Machine Translation for Low/No-Resource Languages,” Coleman and Krishnamachari suggest a new way to help people learn no-resource languages. Their co-authors are Khalil Iskarous, a USC Dornsife linguistics professor, and Ruben Rosales, an independent researcher.

Their method combines traditional rule-based translation tools with large language models (LLMs). The LLM doesn’t translate Paiute directly but helps guide the rule-based system, which uses grammar and vocabulary rules for translation.

“The LLM acts as a smart helper, ensuring the rule-based system makes accurate translations,” said Coleman.

The translation tool simplifies complex sentences and uses English words as placeholders for unknown words. While this may lose some meaning, it still produces clear and grammatically correct translations.
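
To make the placeholder idea concrete, here is a minimal Python sketch, not the authors' implementation. It assumes a sentence has already been reduced to a simple subject, verb and object (the step the article attributes to the LLM, done by hand here). The lexicon tokens, the `SimpleSentence` type, and the subject-object-verb ordering rule are all hypothetical stand-ins chosen for illustration; none of them are real Paiute vocabulary or grammar.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon. The target-side tokens are invented stand-ins,
# NOT real Owens Valley Paiute words.
TOY_LEXICON = {
    "dog": "tok_dog",
    "water": "tok_water",
    "drink": "tok_drink",
}


@dataclass
class SimpleSentence:
    """A sentence already simplified to subject, verb and object.

    In the system the article describes, an LLM would help produce this
    simplified form; here it is supplied by hand.
    """
    subject: str
    verb: str
    obj: str


def lookup(word: str) -> str:
    """Return the toy target token, or the English word in brackets
    as a placeholder when the word is not in the lexicon."""
    return TOY_LEXICON.get(word.lower(), f"[{word}]")


def translate(sentence: SimpleSentence) -> str:
    """Apply one assumed grammar rule: emit subject, object, verb in that order."""
    ordered = [sentence.subject, sentence.obj, sentence.verb]
    return " ".join(lookup(word) for word in ordered)


if __name__ == "__main__":
    print(translate(SimpleSentence("dog", "drink", "water")))
    print(translate(SimpleSentence("dog", "drink", "coffee")))
```

In the second call, "coffee" is missing from the toy lexicon, so it passes through as the bracketed English placeholder while the rest of the sentence still follows the rule.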

Coleman explained that this method mimics how language learners naturally mix known and unknown words, making it useful in real-life situations.

“The tool can handle a lot of the translation on its own with some guidance,” added Krishnamachari.

Coleman also created and maintains Kubishi, a set of digital tools for language revitalization whose name means ‘brain’ in Paiute. It includes an online dictionary and a sentence builder enabled by their research.

Their paper, to be presented at NAACL’s AmericasNLP workshop, highlights how LLMs can help revitalize endangered languages.

Coleman credits his tribe for their long-standing efforts in language revitalization through classes, dictionaries, and recordings. “This research is just one part of a much larger effort,” he said.

The paper also suggests future work, including adding more complex sentences to test their method. This project is both a personal and academic achievement for Coleman, who will join Loyola Marymount University as an assistant professor of computer science this fall.

“My dad didn’t grow up speaking Paiute because boarding schools forbade it,” said Coleman. “I’m lucky my great-grandparents documented the language with linguists. Hearing their voices and understanding them is very personally satisfying.”

-EUREKALERT