AI Language Models Boost Video Descriptions for the Blind
Researchers at Northeastern University are using AI vision-language models to provide audio descriptions for user-generated videos, making content more accessible to blind and low-vision users.
For people who are blind or have low vision, following the action in movies and TV shows depends heavily on audio descriptions. While networks and streaming services hire professionals to create these descriptions, the same cannot be said for the vast array of user-generated content on platforms like YouTube and TikTok.
Blind and low-vision users often request descriptions for these videos on a crowdsourced platform called YouDescribe, but only 7% of the requests are completed. To address this, researchers at Northeastern University are leveraging AI vision-language models (VLMs) to speed up the process.
Lana Do, a recent master's graduate in computer science from Northeastern's Silicon Valley campus, explains, 'It's understandable that a 20-second TikTok video of somebody dancing may not get a professional description, but blind and low-vision people might like to see that dancing video too.'
YouDescribe, launched in 2013 by the San Francisco-based Smith-Kettlewell Eye Research Institute, trains sighted volunteers to create audio descriptions. The platform offers tutorials for recording and timing narration, making user-generated video content more accessible.
In 2018, Ilmi Yoon, a teaching professor of computer science at Northeastern's Silicon Valley campus, joined YouDescribe's team to develop the platform's machine learning elements. This year, Do added new features to enhance the human-in-the-loop workflow: VLM technology now produces higher-quality draft descriptions, and an infobot tool allows users to request more information about specific video frames.
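To make the drafting step concrete, here is a minimal sketch of how a VLM can turn a video into timestamped draft descriptions for a volunteer to edit. The BLIP captioning model, the OpenCV frame sampling, and the five-second interval are illustrative assumptions, not details of YouDescribe's actual pipeline.

```python
# Minimal sketch of VLM-based drafting, not YouDescribe's actual system.
# An off-the-shelf captioning model (BLIP via Hugging Face) stands in for
# whatever VLM the researchers use; frames are sampled with OpenCV.
import cv2
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def draft_descriptions(video_path: str, every_n_seconds: float = 5.0):
    """Caption one frame every few seconds, yielding (timestamp, draft)
    pairs for a human describer to review and edit."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if metadata is missing
    step = max(1, int(fps * every_n_seconds))
    drafts, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            # OpenCV decodes frames as BGR; the model expects RGB.
            image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            inputs = processor(images=image, return_tensors="pt")
            out = model.generate(**inputs, max_new_tokens=30)
            caption = processor.decode(out[0], skip_special_tokens=True)
            drafts.append((index / fps, caption))
        index += 1
    cap.release()
    return drafts
```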
Blind and low-vision users can even correct mistakes in the descriptions, or flag gaps, through a collaborative editing interface. 'They could say that they were watching a documentary set in a forest and they heard a flapping sound that wasn't described, and they wondered what it was,' Do explains.
The integration of AI-generated drafts eases the burden on human describers, and users can easily engage in the process through ratings and comments. 'Blind users don't want to get distracted with too much verbal description. It's an editorial art to verbalize the most important information in a concise way,' Yoon notes.
Graduate students in Yoon's lab compare AI first drafts to those created by human describers to measure the gaps and train the AI to improve. While AI is surprisingly good at describing human expressions and movements, it struggles with reading facial expressions in cartoons and picking up on the most important details in a scene.
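The article doesn't specify how the lab scores those gaps; one common stand-in is sentence-embedding similarity between an AI draft and the corresponding human description, as in this hedged sketch (the model name and the one-to-one pairing of drafts to references are assumptions):

```python
# A hedged sketch of one way to quantify the gap between AI drafts and
# human-written descriptions: cosine similarity of sentence embeddings.
# The article does not say which metric Yoon's lab actually uses.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def draft_gap(ai_drafts: list[str], human_descriptions: list[str]) -> float:
    """Return the mean embedding similarity between paired AI and human
    descriptions; low-scoring scenes flag where the model missed the point."""
    ai_emb = model.encode(ai_drafts, convert_to_tensor=True)
    human_emb = model.encode(human_descriptions, convert_to_tensor=True)
    pairwise = util.cos_sim(ai_emb, human_emb).diagonal()
    return float(pairwise.mean())
```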
Despite these challenges, the collaboration between AI and human describers is making significant strides toward more accessible video content. 'Only 7% of requested videos on the wishlist have audio descriptions, but with the help of AI, we're seeing a marked improvement,' Do concludes.
Frequently Asked Questions
What is YouDescribe?
YouDescribe is a crowdsourced platform launched in 2013 by the Smith-Kettlewell Eye Research Institute. It trains sighted volunteers to create audio descriptions for user-generated videos, making content more accessible to blind and low-vision users.
How does AI assist in creating audio descriptions?
AI vision-language models (VLMs) generate initial drafts of audio descriptions for user-generated videos, which are then refined by human describers. This speeds up the process and makes more content accessible to blind and low-vision users.
What are the limitations of AI in creating audio descriptions?
AI is good at describing human expressions and movements but struggles with reading facial expressions in cartoons and picking up on the most important details in a scene. Human describers are still better at these tasks.
How can users contribute to YouDescribe?
Sighted users can volunteer to create audio descriptions for requested videos. They can also rate and provide feedback on existing descriptions to improve their quality.
What is the impact of AI on the accessibility of user-generated content?
AI significantly speeds up the process of creating audio descriptions for user-generated content, making more videos accessible to blind and low-vision users. This collaboration between AI and human describers is improving the availability and quality of audio descriptions.