Study reveals what people see when they read lips

Original link: https://news.ku.edu/news/article/study-reveals-what-people-really-see-when-they-read-lips

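A University of Kansas research team led by Professor Michael Vitevitch used network science to analyze why lip-reading errors occur. The study moves beyond the traditional focus on phonemes and looks instead at “visemes,” the visual characteristics of the mouth, jaw and lips. By mapping about 20,000 English words according to visual similarity, the team found that lip-reading errors are not random. Errors are more likely when words look alike, and people tend to mistake a word for a more commonly used one, so visually similar words form “clusters” that are hard to tell apart. The results suggest that people are less accurate at lip reading than they expect, and most errors are off by only one or two visual cues. The research has implications for both human and machine applications. By understanding these visual “landscapes,” the researchers hope to develop better training methods that help people improve their lip-reading skills. The work could also improve artificial intelligence and automatic transcription services, such as those used in video conferencing, by combining visual information from a speaker’s face with audio input for more accurate, more humanlike speech recognition.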


Original article

LAWRENCE — New research from the University of Kansas uses network science to determine why people make mistakes when lip reading.

Michael Vitevitch, professor of speech-language-hearing at KU, and his co-authors created a visual map of around 20,000 words in English, hoping to better grasp why some words are more difficult to lip-read than others.

The results appear in the Journal of the Acoustical Society of America. Findings could improve training for lip readers and boost the capacity for artificial intelligence to read lips and provide transcription and other digital services.

“What we looked at in this study is how people basically read lips, how accurate they are and, more specifically, what kinds of mistakes they make,” Vitevitch said. “A lot of previous work looked at how accurate people were and didn’t necessarily look at the characteristics of the errors themselves. There’s a lot to be learned from the mistakes you make, and that was the approach we took.”

While previous work on lip reading examined errors, much of that research was done by spoken-language researchers who focused on phonemes — the sounds in a language — and on how close participants were to the word as it sounds.

Vitevitch took a different approach. 

“We focused on the visual characteristics,” he said. “Instead of looking at how many sounds of the word people got, we looked at how many of the visual characteristics, which we call ‘visemes’ (the visual equivalent of a phoneme), they got. We focused on what you’re getting from the lips, jaw and mouth without using auditory sound. You’re just trying to get the information from what you’re seeing.”

“How does that sound look when it’s spoken? We don’t care what it sounds like; we care about how it looks when it’s spoken,” he said. “Sometimes words sound similar and look similar, such as ‘kit,’ ‘cat’ and ‘cut.’ Other times words don’t sound alike but still look similar like ‘vet,’ ‘fit’ and ‘fuzz.’ In both cases if you’re just looking at my face, you couldn’t tell one word from the other.”
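The collapse from sounds to visemes described here can be illustrated with a small Python sketch. The viseme classes, the ARPAbet-style phoneme labels and the pronunciations below are simplified assumptions made for illustration; they are not the classes used in the study.

```python
from collections import defaultdict

# Hypothetical, heavily simplified phoneme-to-viseme table (an assumption,
# not the study's actual viseme classes).
PHONEME_TO_VISEME = {
    "K": "velar", "G": "velar",
    "F": "labiodental", "V": "labiodental",
    "T": "alveolar", "D": "alveolar", "S": "alveolar", "Z": "alveolar",
    "P": "bilabial", "B": "bilabial", "M": "bilabial",
    "IH": "vowel", "AE": "vowel", "AH": "vowel", "EH": "vowel",
}

# Toy pronunciations for the words quoted in the article, plus "bat".
PRONUNCIATIONS = {
    "kit": ["K", "IH", "T"],
    "cat": ["K", "AE", "T"],
    "cut": ["K", "AH", "T"],
    "vet": ["V", "EH", "T"],
    "fit": ["F", "IH", "T"],
    "fuzz": ["F", "AH", "Z"],
    "bat": ["B", "AE", "T"],
}


def to_visemes(phonemes):
    """Map a phoneme sequence to its viseme sequence."""
    return tuple(PHONEME_TO_VISEME[p] for p in phonemes)


def group_by_viseme_sequence(pronunciations):
    """Group words whose viseme sequences are identical."""
    groups = defaultdict(list)
    for word, phonemes in pronunciations.items():
        groups[to_visemes(phonemes)].append(word)
    return dict(groups)


groups = group_by_viseme_sequence(PRONUNCIATIONS)
for visemes, words in groups.items():
    print(visemes, words)
```

With this toy table, “kit,” “cat” and “cut” fall into one group and “vet,” “fit” and “fuzz” into another, even though the words in the second group do not sound alike.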

Through analysis of the word map, researchers determined:

  • People are more likely to mistake a word for another word used more commonly.
  • When spoken, about a third of words in English look like at least one other word.
  • If a word has many visual look-alikes, it’s consistently harder to lip-read.
  • Lip-reading mistakes don’t happen randomly — they’re more likely when visually similar words occupy the same region in the visual network.
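
The findings above can be pictured with a toy network in Python. The sketch assumes that two words are linked when their viseme sequences are identical or differ by a single viseme. That linking rule, the one-letter viseme codes and the frequency counts are all hypothetical; the article does not say how the study built its network.

```python
from itertools import combinations

# Toy lexicon: word -> (viseme string, made-up frequency count).
# Codes (assumed): K = velar, F = labiodental, P = bilabial,
# T = alveolar, A = vowel.
LEXICON = {
    "kit": ("KAT", 10),
    "cat": ("KAT", 60),
    "cut": ("KAT", 90),
    "fit": ("FAT", 50),
    "vet": ("FAT", 8),
    "bat": ("PAT", 20),
    "mama": ("PAPA", 15),
}


def within_one_edit(a, b):
    """True if a and b are identical or one substitution, insertion,
    or deletion apart."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # at most one substitution
    return a[i:] == b[i + 1:]  # exactly one extra symbol in b


def build_network(lexicon):
    """Link every pair of words that look alike under the rule above."""
    neighbors = {word: set() for word in lexicon}
    for w1, w2 in combinations(lexicon, 2):
        if within_one_edit(lexicon[w1][0], lexicon[w2][0]):
            neighbors[w1].add(w2)
            neighbors[w2].add(w1)
    return neighbors


def likeliest_confusion(word, lexicon, neighbors):
    """Most frequent look-alike, or None if the word has none."""
    candidates = neighbors[word]
    if not candidates:
        return None
    return max(candidates, key=lambda w: lexicon[w][1])


network = build_network(LEXICON)
for word in LEXICON:
    print(word, len(network[word]), likeliest_confusion(word, LEXICON, network))
```

In this toy lexicon, “kit” has five look-alikes and its likeliest confusion is the more frequent “cut,” while “mama” has no look-alikes at all.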

“One surprise was that people aren’t that good at this,” Vitevitch said. “We think we are, but we’re really not. Most of the errors show that you’re one or two visual characteristics — one or two visemes — off. You’re getting a good amount of it, but perhaps not enough to get by.”
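
One way to put a number on “one or two visemes off” is an edit distance over viseme sequences. The Python sketch below uses the standard Levenshtein distance, which is an assumption about scoring and not necessarily the measure used in the study; the viseme labels and the responses are invented for illustration.

```python
from collections import Counter


def viseme_distance(target, response):
    """Levenshtein distance between two viseme sequences."""
    previous = list(range(len(response) + 1))
    for i, t in enumerate(target, start=1):
        current = [i]
        for j, r in enumerate(response, start=1):
            cost = 0 if t == r else 1
            current.append(min(
                previous[j] + 1,         # viseme missed by the viewer
                current[j - 1] + 1,      # extra viseme in the response
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


# Hypothetical target word and hypothetical viewer responses.
TARGET = ("velar", "vowel", "alveolar")
RESPONSES = [
    ("velar", "vowel", "alveolar"),        # every viseme recovered
    ("labiodental", "vowel", "alveolar"),  # first viseme wrong
    ("bilabial", "vowel", "bilabial"),     # first and last visemes wrong
    ("velar", "vowel"),                    # final viseme missed
]

# Count how many responses fall at each distance from the target.
tally = Counter(viseme_distance(TARGET, r) for r in RESPONSES)
for distance in sorted(tally):
    print(distance, tally[distance])
```

A distance of 0 means the response looks identical to the target on the lips, and each additional unit is one viseme added, dropped or swapped.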

The researchers’ visual map allowed them to understand how words are distributed throughout the landscape, according to Vitevitch. In the map, words were close when they looked similar and farther apart when the words appeared visually unalike.

“Certain areas become more compressed than you might expect,” he said. “The landscape stretches and compresses in ways we hadn’t anticipated. That stretching and compression has implications for how accurate you’re going to be when trying to lip-read. Does it give you more competitors than you would otherwise have? Or does it move things farther apart and make them more perceptually distinct?”

The KU researcher said his group hopes to move into lip-reading training.

“The idea is that if you track people’s errors over time, those errors should start shrinking toward the target word,” Vitevitch said. “Instead of being far away, people begin picking up the information they need and making more accurate guesses.”

An additional application of the research is in training automatic transcription.

“Systems such as Zoom already do a reasonable job transcribing speech,” Vitevitch said. “Could they do better if they used not only audio but also visual information from a speaker’s face? Computers are very good at finding patterns, and sometimes they’re the same patterns humans use. We may be able to train computers to do things in a more humanlike way.”
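
The question Vitevitch raises, whether transcription improves when audio and visual cues are combined, can be sketched as a simple late-fusion step in Python. This is only an illustration under stated assumptions: the candidate words, the probabilities and the weighting scheme are invented, and nothing here reflects how Zoom or any real transcription system works.

```python
import math


def fuse(audio_probs, visual_probs, audio_weight=0.5):
    """Weighted sum of log-probabilities from two sources.

    Assumes both dicts cover the same candidate words and that every
    probability is strictly positive.
    """
    return {
        word: audio_weight * math.log(audio_probs[word])
        + (1 - audio_weight) * math.log(visual_probs[word])
        for word in audio_probs
    }


# Hypothetical scores for one spoken word heard in noise.
# Audio alone slightly prefers "cat".
AUDIO = {"bat": 0.40, "cat": 0.45, "mat": 0.15}
# Vision cannot separate "bat" from "mat" but makes "cat" unlikely.
VISUAL = {"bat": 0.45, "cat": 0.10, "mat": 0.45}

scores = fuse(AUDIO, VISUAL)
audio_only_guess = max(AUDIO, key=AUDIO.get)
fused_guess = max(scores, key=scores.get)
print(audio_only_guess, fused_guess)
```

In this made-up case the audio-only guess is “cat,” while the fused guess is “bat,” because each source rules out a competitor that the other cannot.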

Vitevitch said his group will continue to follow up on this work in different ways.

“We’re continuing to explore how people do this, potentially moving toward machine-learning applications and finding ways to help people who need assistance understanding speech,” he said.

Vitevitch’s co-authors were KU graduate students Maia Flynn and Reid Kelly, along with Lorin Lachs of California State University, Fresno.
