IBM Granite Vision Leads the Way in Document Understanding
IBM's Granite Vision 3.3 2B model excels in document understanding, outperforming all other small models on the OCRBench leaderboard.
Last week, IBM made headlines with its advancements in speech recognition, and now it’s making waves in the world of vision. Each of Granite’s senses is proving its worth, and the latest achievement is no exception.
So much of the data we interact with at work is inherently visual. Employees around the world perform countless time-consuming tasks each day that could be automated by large AI models capable of interpreting the world to provide answers. From understanding information in charts and tables to parsing the contents of images in presentations and websites, or deciphering handwritten notes, the potential for automation is vast.
However, this requires a multimodal AI model that can understand both text and the layout of documents, enabling it to interpret complex forms, charts, tables, and invoices in a way that a text-only model cannot. IBM recently secured a top spot on the OCRBench leaderboard with its open-source Granite Vision 3.3 2B model. The multimodal model currently ranks second in the table, significantly outperforming any other small model under 7B parameters.
OCRBench is a comprehensive benchmark used by the AI industry to assess how effective vision and multimodal models are at tasks requiring the ability to read text in challenging scenarios. While machines reading printed or handwritten text is nothing new, building AI systems that can combine that visual capability with an understanding of what is displayed, and then generate useful answers, remains a cutting-edge field.
OCRBench evaluates a model's performance across five components: text recognition, key information extraction, handwritten mathematical expression recognition, question answering on scene text, and question answering on documents. The test consists of 1,000 question-and-answer pairs, with each ground-truth answer containing at least four symbols to minimize false positives.
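The scoring protocol described above can be sketched in a few lines. This is a simplified illustration, not the official OCRBench harness: it assumes the common OCR-benchmark convention of judging a prediction correct when the ground-truth answer appears as a case-insensitive substring of the model's output, which is why short answers (under four symbols) are excluded to avoid accidental matches.

```python
# Simplified sketch of OCRBench-style scoring (an assumption about the
# protocol, not the official evaluation code).

def is_correct(prediction: str, answer: str) -> bool:
    """Judge a prediction correct if the ground-truth answer appears
    as a case-insensitive substring of the model's output."""
    return answer.strip().lower() in prediction.strip().lower()

def score(examples: list[tuple[str, str]]) -> float:
    """Return the fraction of (prediction, answer) pairs judged correct."""
    if not examples:
        return 0.0
    return sum(is_correct(pred, ans) for pred, ans in examples) / len(examples)
```

Requiring answers of at least four symbols matters under this scheme: a one- or two-character answer would match by chance inside almost any output, inflating scores.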
When tested across these five components, Granite Vision 3.3 2B achieved the second-highest score overall, particularly excelling in recognizing mathematical expressions and answering questions on specific scenes. The model outperformed several industry leaders, including Google’s Gemini, OpenAI’s GPT-4V, and several models based on Meta’s Llama.
According to Eli Schwartz, a researcher from the IBM Granite Vision team, the high score is attributed to the training data. “We deliberately trained the model on a dataset of low-quality documents, which made it exceptionally resilient and accurate on the kind of real-world images found in the benchmark,” Schwartz said.
For this latest version of Granite Vision, the team made several tweaks to improve its performance. The goal was to create a compact model that would work effectively and dependably, making cutting-edge AI more accessible and cost-effective. Rogerio Feris, an IBM Researcher working on these vision models, noted that the team dropped in a new encoder and added more layers of document training.
The team focused on creating high-quality training data for tasks that IBM’s own use cases would benefit most from. “The biggest surprise was just how much powerful performance we could extract from such a small model,” Schwartz said. “We were also impressed that it continued to improve as we added more data, showing that we haven’t yet hit the ceiling for what these efficient models can do.”
The next big leap in AI will come when these models can act and reason without explicit instructions. With more high-quality data, more reinforcement learning, and some time, the team expects that models in the future will be able to power agentic workflows that can execute complex business tasks on their own.
IBM sees this model as a stepping stone to even greater advances in the future. The team’s focus on creating a compact, efficient model that delivers exceptional performance sets the stage for more transformative applications in the years to come.
Frequently Asked Questions
What is OCRBench and why is it important?
OCRBench is a benchmark used to assess the effectiveness of vision and multimodal AI models in tasks that require reading and interpreting text in challenging scenarios. It is important because it provides a standardized way to compare how well different models read and reason over text in images and documents.
What makes IBM's Granite Vision 3.3 2B model stand out?
IBM's Granite Vision 3.3 2B model stands out due to its high performance on the OCRBench leaderboard, particularly in recognizing mathematical expressions and answering questions on specific scenes. At only 2 billion parameters, it is also far more compact than the 7B-and-larger models it competes with.
How did the IBM team improve the Granite Vision model?
The IBM team improved the Granite Vision model by making several tweaks, including adding a new encoder, more layers of document training, and focusing on high-quality training data for specific tasks.
What are the future goals for AI models like Granite Vision?
The future goals for AI models like Granite Vision include developing models that can act and reason without explicit instructions, using more high-quality data, and more reinforcement learning to execute complex business tasks autonomously.
What is the significance of the training data used for Granite Vision?
The significance of the training data used for Granite Vision is that it was deliberately chosen to include low-quality documents, which made the model more resilient and accurate in real-world scenarios.