Northeastern's AI Enhances Video Accessibility for the Blind
Researchers at Northeastern University are using AI to generate audio descriptions for user-generated videos, improving accessibility for blind and low-vision users.
SAN JOSE, Calif. — For individuals who are blind or have low vision, audio descriptions of movies and TV shows are crucial for understanding the visual content. While networks and streaming services often hire professionals to create these descriptions, the same level of accessibility is not available for the billions of user-generated videos on platforms like YouTube and TikTok.
Using AI vision language models (VLMs), researchers at Northeastern University are making audio descriptions available for user-generated videos through a platform called YouDescribe. This crowdsourced platform functions like a library, where blind and low-vision users can request descriptions for videos and later rate and contribute to them.
“It’s understandable that a 20-second video on TikTok of somebody dancing may not get a professional description,” says Lana Do, who recently completed her master’s in computer science at Northeastern’s Silicon Valley campus. “But blind and low-vision people might like to see that dancing video too.”
A 2020 video of the South Korean boy band BTS’s song “Dynamite” is at the top of YouDescribe’s wishlist, waiting to be described. Despite having 3,000 volunteer describers, the platform can only meet 7% of the requests, Do explains.
Do works in the lab of Ilmi Yoon, a teaching professor of computer science at Northeastern’s Silicon Valley campus. Yoon joined YouDescribe’s team in 2018 to develop the platform’s machine learning elements. This year, Do added new features to speed up YouDescribe’s human-in-the-loop workflow: new VLM technology produces higher-quality draft descriptions, a new infobot tool lets users ask for more information about a specific video frame, and a collaborative editing interface lets low-vision users correct mistakes in the descriptions.
The result is that video content descriptions are becoming more available and of higher quality. AI-generated drafts ease the burden on human describers, and users can easily engage in the process through ratings and comments.
“Users can provide feedback, such as mentioning a flapping sound they heard in a documentary set in a forest that wasn’t described,” Do says. “This helps improve the accuracy and relevance of the descriptions.”
Do and her colleagues recently presented a paper at the Symposium on Human-Computer Interaction in Amsterdam about the potential for AI to accelerate the development of audio descriptions. AI does a surprisingly good job at describing human expressions and movements, Yoon notes. For example, in one video, an AI agent describes the steps a chef takes while making cheese rolls.
However, there are some consistent weaknesses. AI isn’t as good at reading facial expressions in cartoons, and humans are generally better at picking out the most important details in a scene. “It’s very labor-intensive,” Yoon says. “Blind users don’t want to get distracted with too much verbal description. It’s an editorial art to verbalize the most important information in a concise way.”
Graduate students in Yoon’s lab compare AI-generated descriptions to those created by human describers, measuring the gaps to train the AI to perform better.
YouDescribe was launched in 2013 by the San Francisco-based Smith-Kettlewell Eye Research Institute to train sighted volunteers in creating audio descriptions. The platform focuses on YouTube and TikTok videos, offering tutorials for recording and timing narration to make user-generated content accessible.
As the use of AI in audio descriptions continues to evolve, platforms like YouDescribe are making significant strides in improving video accessibility for the blind and low-vision community.
Frequently Asked Questions
What is YouDescribe?
YouDescribe is a crowdsourced platform that provides audio descriptions for user-generated videos, making them accessible to blind and low-vision users.
How does AI help in creating audio descriptions?
AI vision language models (VLMs) generate initial drafts of audio descriptions, which are then refined by human describers and users.
What are the challenges in creating audio descriptions?
Challenges include accurately describing visual details, especially in complex scenes, and ensuring the descriptions are concise and helpful.
How can users contribute to YouDescribe?
Users can request descriptions for videos, rate existing descriptions, and even contribute their own descriptions or corrections.
What is the impact of YouDescribe on the blind and low-vision community?
YouDescribe significantly improves access to user-generated content, enhancing the entertainment and educational experiences of blind and low-vision users.