Baidu Unlimited-OCR: Breaking the Length Barrier in Document Reading

Baidu has open-sourced Unlimited-OCR, a document-parsing model that reports state-of-the-art (SOTA) results on OmniDocBench. Using a new attention mechanism called Reference Sliding Window Attention (R-SWA), it parses complex, multi-page documents in a single pass while keeping the decoder's KV cache at a constant size.

The Evolution of Document Intelligence

For years, digitizing documents has been a two-step process: run an optical character recognition (OCR) engine on individual page images, and then use heuristics or LLMs to reconstruct the layout, tables, and reading order. While Vision-Language Models (VLMs) have recently shown the ability to parse documents end-to-end, they have been severely limited by document length. Trying to process a 50-page PDF in a single pass quickly overwhelms GPU memory due to the quadratic scaling of the attention mechanism.

The KV Cache Bottleneck in Long-Document Processing

In a standard Transformer decoder, the model stores the Key-Value (KV) states of all previous tokens in memory (the KV cache) as it generates text, so it doesn't have to recompute them. When transcribing a long document, the output can easily span thousands of tokens. The KV cache therefore grows linearly with the output, and because each new token attends to every cached entry, the total attention work over a generation grows quadratically. Inference slows down token by token, and eventually the process can fail with an Out-of-Memory (OOM) error.
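To make the growth concrete, here is a minimal sketch that estimates cache size and total attention reads for a full-attention decoder. The layer count, head count, and head dimension are hypothetical values chosen for illustration; they are not Unlimited-OCR's actual configuration.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def total_cache_reads(num_generated):
    """Cache entries read over a whole generation with full attention.

    The t-th generated token attends to t entries (itself included),
    so the total is 1 + 2 + ... + n = n * (n + 1) / 2.
    """
    return num_generated * (num_generated + 1) // 2


# Hypothetical decoder configuration, chosen only for illustration.
CONFIG = dict(num_layers=24, num_kv_heads=8, head_dim=128, bytes_per_value=2)

for n in (1_000, 10_000, 100_000):
    mib = kv_cache_bytes(n, **CONFIG) / 2**20
    print(f"{n:>7} tokens: {mib:10.1f} MiB cache, {total_cache_reads(n):,} reads")
```

Doubling the number of tokens doubles the cache size but roughly quadruples the total number of cache reads, which is where the quadratic cost comes from.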

Memory Footprint Comparison (KV Cache)

1. Traditional Attention: O(N²), quadratic

The KV cache grows with every generated token, and the total attention work grows quadratically with the output length N. Long documents end in Out-of-Memory (OOM) crashes.

2. Reference Sliding Window (R-SWA): O(1), constant

The memory footprint stays flat as the output grows. The model keeps only the visual source and the immediate text window.
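The two cases can be written as a back-of-the-envelope formula. This is a sketch under assumptions the article does not state: $L$ decoder layers, $H$ key-value heads of dimension $d$, $b$ bytes per cached value, $V$ vision tokens, $N$ generated text tokens, and a text window of $W$ tokens. The factor 2 accounts for storing both keys and values.

```latex
\begin{align}
M_{\text{full}}(N)  &= 2\,L\,H\,d\,b\,(V + N) \\
M_{\text{R-SWA}}(N) &= 2\,L\,H\,d\,b\,\bigl(V + \min(N, W)\bigr)
\end{align}
```

Under these assumptions the first expression grows without bound in $N$, while the second stops growing once $N \ge W$. Both still depend on $V$, so the bound is constant with respect to output length for a given visual input.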

Enter Reference Sliding Window Attention (R-SWA)

Baidu's team designed R-SWA after the way people transcribe: we keep looking at the source document (the reference) but hold only a small window of recently written words in short-term memory. R-SWA splits attention into two parts, which together keep the cache bounded:

Full Vision Reference

The decoder maintains full, unrestricted attention over the entire visual features of the document pages, ensuring no loss of visual context.

Sliding Window Text Attention

Instead of attending to the entire generated history, the decoder only attends to a fixed window of the most recent text tokens (e.g., the last 128 tokens).

Constant KV Cache

By capping the text history, the KV cache stays the same size no matter how much text has already been generated. This gives O(1) memory with respect to output length and flat per-token latency during generation.
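The three pieces above can be sketched in a few lines. This is an illustration only, not Baidu's implementation: the function and class names are hypothetical, the sequence is assumed to be laid out as vision tokens followed by text tokens, and the window is assumed to include the current token.

```python
from collections import deque


def rswa_visible_positions(num_vision_tokens, text_index, window):
    """Positions that text token `text_index` (0-based) may attend to.

    Assumed layout: [vision tokens..., text tokens...].
    """
    vision = list(range(num_vision_tokens))            # full vision reference
    first_text = max(0, text_index - window + 1)       # start of the sliding window
    text = [num_vision_tokens + i for i in range(first_text, text_index + 1)]
    return vision + text


class RSWACache:
    """Keeps all vision entries and only the newest `window` text entries."""

    def __init__(self, vision_entries, window):
        self.vision = list(vision_entries)   # fixed for the whole generation
        self.text = deque(maxlen=window)     # oldest entry is evicted automatically

    def append(self, text_entry):
        self.text.append(text_entry)

    def __len__(self):
        return len(self.vision) + len(self.text)


# Usage: the cache stops growing once the window is full.
cache = RSWACache(vision_entries=["v0", "v1", "v2"], window=4)
for t in range(1000):
    cache.append(f"t{t}")
print(len(cache))  # 3 vision entries + 4 text entries = 7
```

A `deque` with `maxlen` is used here because it drops the oldest item on append, which mirrors the sliding window without any explicit eviction logic.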

Comparison: Traditional OCR vs. Unlimited-OCR

Feature	Traditional Page-by-Page OCR	Baidu Unlimited-OCR
Max Document Length	1 - 2 Pages (per pass)	Unlimited (tested up to 50+ pages)
Memory Complexity	Cache grows with output; O(N²) total attention work	O(1) in output length
Page Stitching Errors	High (broken tables & sentences)	None (single end-to-end pass)
Latency per Token	Increases as the output grows	Flat

Under the Hood: Architecture & Efficiency

Unlimited-OCR is built on the DeepSeek-OCR baseline and inherits its efficient vision-language foundation. Baidu's team replaced the standard self-attention layers in the decoder with the R-SWA module. The model uses a Mixture-of-Experts (MoE) design:

3 Billion Total Parameters: A highly capable base model trained on extensive multi-lingual document datasets.
500 Million Activated Parameters: During inference, only a fraction of the experts are activated, resulting in extremely high throughput and low power consumption.
State-of-the-Art Accuracy: Achieved a composite score of 93.23% on the OmniDocBench v1.5 benchmark, outperforming previous SOTA models.
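The gap between total and activated parameters comes from expert routing. The sketch below shows generic top-k routing with toy scalar "experts". It is an assumption for illustration; the article does not describe the router Unlimited-OCR actually uses, and all names here are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Pick the k highest-scoring experts and renormalise their weights."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def moe_forward(x, experts, router_scores, k):
    """Weighted sum of the outputs of the selected experts only."""
    weights = route_top_k(router_scores, k)
    # Experts that were not selected are never called, which is where the
    # compute saving of a sparse MoE layer comes from.
    return sum(w * experts[i](x) for i, w in weights.items())


# Toy experts: plain functions standing in for feed-forward blocks.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: -x, lambda x: x * x]
print(moe_forward(3.0, experts, router_scores=[0.1, 2.0, -1.0, 2.0], k=2))
```

Sparse routing reduces compute per token, not the size of the checkpoint: all experts typically still have to be loaded, and at 2 bytes per parameter 3 billion parameters come to roughly 6 GB of weights.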

Get Started with Unlimited-OCR

Baidu has released Unlimited-OCR under the permissive MIT license. You can download the model weights, run local inference, or integrate it into your document pipelines. It can run comfortably on a single consumer GPU.

Install & Run

git clone https://github.com/baidu/Unlimited-OCR.git

Frequently Asked Questions

What is the maximum page limit for Unlimited-OCR?

Theoretically, there is no page limit due to the O(1) constant memory complexity of the R-SWA mechanism. In practice, it has been successfully tested on documents up to 50+ pages in a single pass.

Can it run on consumer-grade hardware?

Yes! Because of the Mixture-of-Experts (MoE) architecture (only 500M active parameters) and the constant KV cache, it can run efficiently on commodity GPUs (such as an RTX 4090 or even smaller cards) without running out of memory.

Is it suitable for commercial applications?

Yes. Unlimited-OCR is released under the MIT license, which allows commercial use, modification, and distribution, provided the copyright and license notice are retained in copies.

Community Buzz

Sophia Vance (@sophia_v)

2026-06-25 10:30:00

Constant KV cache for long document generation is a game-changer. Finally, we can parse 50-page reports without OOM errors!

Hiroshi Tanaka (@hiro_t)

2026-06-25 12:15:00

Tested the Hugging Face space demo. The speed remains completely flat from page 1 to page 30. Outstanding work by Baidu.

Devin Carter (@devinc)

2026-06-25 14:00:00

Using MoE (3B total / 500M active) means we can run this on a single consumer GPU with great throughput. Can't wait to integrate it.