NYT Reading: AI Fails a Vietnamese Test. Which "Low-Resource Languages" Are Being Overlooked in the Tech Era?

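Lelapa AI, a company in Johannesburg, South Africa, is pursuing research grounded in real social needs to support the development of AI technology for African languages. (The New York Times)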

When AI Fails the Language Test, Who Is Left Out of the Conversation?

AI没能通过测验 哪些语言遭忽略?

Stanford researchers gave a popular artificial intelligence chatbot a language test.

史丹福大学研究员针对热门人工智慧聊天机器人进行语言测试。

They asked the bot in Vietnamese to write a traditional poem in the form known as “song thất lục bát” that follows a pattern of lines made up of seven, seven, six, then eight words. When the bot spit out an answer, it wrote a poem but didn’t follow the format.

他们用越南语要求机器人写一首传统诗歌,以诗句依序为七字、七字、六字接着八字的「双七六八体」格式撰写。机器人吐出答案,写了一首诗,但没有遵循格式。

The team tried a different prompt, asking what the proper Vietnamese word was for a mother’s younger brother, and it responded with the words for a father’s younger and older siblings.

这个团队试了不同指令,询问称呼母亲的弟弟的适当越南语单字是什么,它却回答关于父亲手足的越南语单字。

While the use of AI has exploded in the West, much of the rest of the world has been left out of the conversation since most of the technology is trained in English. AI experts worry that the language gap could exacerbate technological inequities and that it could leave many regions and cultures behind.

尽管西方人工智慧使用量激增,世界其他许多地方却被排除在对话外,因为这项科技大部分以英语训练。人工智慧专家忧心,语言鸿沟可能加剧科技不平等,也可能将许多地区和文化抛在后头。

A delay of access to good technology of even a few years “can potentially lead to a few decades of economic delay,” said Sang Truong, a doctoral candidate at the Stanford Artificial Intelligence Laboratory at Stanford University on the team that built and tested a Vietnamese language model against others.

史丹福大学「史丹福人工智慧实验室」博士候选人张创,是负责打造并测试越南语模型团队的成员。他说,只是晚了短短几年才取得优良科技,「也可能导致经济延迟发展数十年」。

The tests his team ran found that AI tools across the board could get facts and diction wrong when working with Vietnamese, likely because it is a “low-resource” language by industry standards, which means that there aren’t sufficient data sets and content available online for the AI model to learn from.

他的团队进行测试发现,整体而言人工智慧工具在处理越南语时,可能发生事实和措辞上的错误,这可能是因为以行业标准而言,越南语是个「低资源语言」,意味着越南语在线上没有足够的资料集和内容让人工智慧模型学习。

Low-resource languages are spoken by tens and sometimes hundreds of millions of people around the world, but they yield less digital data because AI tech development and online engagement are centered in the United States and China.

低资源语言被世界各地上千万甚至上亿人使用,但它们产生的数位资料较少,因为人工智慧科技开发和线上参与集中在美国和中国。

An analysis of top websites by W3Techs, a tech survey company, found that English makes up more than 60% of the internet’s language data. While English is widely spoken globally, native English speakers make up only about 5% of the world’s population, according to Ethnologue, a research organization that collects language data. Mandarin and Spanish are other examples of languages with a significant online presence and reliable digital data sets.

科技调查公司W3Techs针对主要网站的一项分析发现,英语占网际网路语言资料的60%以上。收集语言资料的研究组织「民族语」指出,尽管英语在全球被广为使用,但英语母语者仅占世界人口的5%。中文和西班牙文是具有重大线上存在感和可信数位资料集语言的其他范例。

“Large companies like Google, Apple, OpenAI, for example, have not necessarily trained their models for tools that serve these markets,” Chinasa T. Okolo, a fellow at the Center for Technology Innovation at the Brookings Institution, said about communities with low-resource languages. “They don’t provide enough market value for them to do so.”

布鲁金斯研究院科技创新中心研究员奇娜萨.奥科洛提到使用低资源语言的社群时表示:「谷歌、苹果和OpenAI这类大型公司,未必会为了服务这些市场的工具来训练他们的模型。它们没有提供足够的市场价值让这些公司这么做」。

By Sara Ruberg / Translated by 罗方妤