Baidu Unlimited-OCR: Breaking the Length Barrier in Document Reading
How Baidu's Reference Sliding Window Attention (R-SWA) achieves document parsing with constant O(1) memory, making page-by-page OCR obsolete.
Unlimited-OCR
Parses dozens of document pages in a single pass with constant O(1) memory complexity.
Parses dozens of pages in a single forward pass without chunking.
KV cache size remains flat, preventing GPU Out-Of-Memory errors.
Generation speed per token does not degrade as document length increases.
Baidu has officially open-sourced Unlimited-OCR, a breakthrough document-parsing model that achieves SOTA results on OmniDocBench. By utilizing a novel attention mechanism called Reference Sliding Window Attention (R-SWA), it parses complex, multi-page documents in a single pass with a completely constant memory footprint.
The Evolution of Document Intelligence
For years, digitizing documents has been a two-step process: run an optical character recognition (OCR) engine on individual page images, and then use heuristics or LLMs to reconstruct the layout, tables, and reading order. While Vision-Language Models (VLMs) have recently shown the ability to parse documents end-to-end, they have been severely limited by document length. Trying to process a 50-page PDF in a single pass quickly overwhelms GPU memory due to the quadratic scaling of the attention mechanism.
The KV Cache Bottleneck in Long-Document Processing
In a standard Transformer decoder, the model stores the Key-Value (KV) states of all previous tokens in memory (the KV cache) as it generates text, so it doesn't have to recompute them. When transcribing a long document, the output text can easily span thousands of tokens. The KV cache grows linearly with every generated token, and because each new token attends over the whole cache, the total attention work grows quadratically with output length. As a result, inference slows down steadily, and eventually the system crashes with an Out-of-Memory (OOM) error.
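To make the growth concrete, here is a rough back-of-the-envelope sketch in Python. The decoder shape used below (12 layers, 8 KV heads, head dimension 128, 2 bytes per value) is a made-up example for illustration and is not Unlimited-OCR's actual configuration.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    # Factor 2: one tensor for keys and one for values, in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


def cumulative_attention_reads(num_generated):
    """Total cached positions read while generating `num_generated` tokens,
    assuming step t attends over all t tokens produced so far."""
    return num_generated * (num_generated + 1) // 2


if __name__ == "__main__":
    # Hypothetical decoder shape, for illustration only.
    cfg = dict(num_layers=12, num_kv_heads=8, head_dim=128)
    for n in (1_000, 10_000, 100_000):
        mib = kv_cache_bytes(n, **cfg) / 2**20
        reads = cumulative_attention_reads(n)
        print(f"{n:>7} tokens: {mib:8.1f} MiB cache, {reads:,} reads")
```

The cache size is linear in the number of tokens, while the cumulative number of cached positions read during generation follows n(n+1)/2, which is where the quadratic cost comes from.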
Memory Footprint Comparison (KV Cache)
Standard full attention: memory demands grow steadily as the output grows, resulting in Out-of-Memory (OOM) crashes on long documents.
R-SWA: memory footprint remains flat. The model only remembers the visual source and the immediate text window.
Enter Reference Sliding Window Attention (R-SWA)
Inspired by the human cognitive process of transcription, where we look at the source document (the reference) but keep only a small window of recently written words in short-term memory, Baidu's team designed R-SWA. It splits the attention mechanism into two distinct parts, which together yield a constant-size KV cache:
Full Vision Reference
The decoder maintains full, unrestricted attention over the entire visual features of the document pages, ensuring no loss of visual context.
Sliding Window Text Attention
Instead of attending to the entire generated history, the decoder only attends to a fixed window of the most recent text tokens (e.g., the last 128 tokens).
Constant KV Cache
Because the text history is capped, the KV cache size stays constant as generation proceeds. This gives O(1) memory complexity with respect to output length and flat per-token latency during generation.
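The idea described above can be sketched in a few lines of Python. This is a minimal illustration of the attention pattern only, not Baidu's implementation: the function and class names are hypothetical, token positions are plain integers, and the cached key/value tensors are replaced by placeholders.

```python
from collections import deque


def rswa_visible_positions(num_vision_tokens, num_text_tokens, window):
    """Positions the next decoding step may attend to.

    Positions 0..num_vision_tokens-1 are vision (reference) tokens;
    generated text tokens follow them.
    """
    vision = list(range(num_vision_tokens))  # full reference, never truncated
    first_text = num_vision_tokens + max(0, num_text_tokens - window)
    last_text = num_vision_tokens + num_text_tokens
    return vision + list(range(first_text, last_text))


class RollingTextCache:
    """Keeps entries for the most recent `window` text tokens only."""

    def __init__(self, window):
        # A bounded deque evicts the oldest entry automatically.
        self.entries = deque(maxlen=window)

    def append(self, kv):
        self.entries.append(kv)

    def __len__(self):
        return len(self.entries)


def simulate(num_vision_tokens, steps, window):
    """Return the number of cached positions after each generation step."""
    cache = RollingTextCache(window)
    sizes = []
    for step in range(steps):
        cache.append(("k", "v", step))  # placeholder for real key/value tensors
        sizes.append(num_vision_tokens + len(cache))
    return sizes
```

With a window of 128, `simulate(100, 1000, 128)` climbs for the first 128 steps and then stays at 228 cached positions however many more tokens are generated, which is the flat memory curve the article describes.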
Comparison: Traditional OCR vs. Unlimited-OCR
| Feature | Traditional Page-by-Page OCR | Baidu Unlimited-OCR |
|---|---|---|
| Max Document Length | 1 - 2 Pages (per pass) | Unlimited (tested up to 50+ pages) |
| Memory Complexity | O(N²) Quadratic growth | O(1) Constant memory |
| Page Stitching Errors | High (broken tables & sentences) | Zero (seamless end-to-end parsing) |
| Latency per Token | Grows with output length | Flat |
Under the Hood: Architecture & Efficiency
Unlimited-OCR is built upon the DeepSeek-OCR baseline, inheriting its highly efficient vision-language foundation. However, Baidu's team replaced the standard self-attention layers in the decoder with the R-SWA module. The model features a Mixture-of-Experts (MoE) design:
- 3 Billion Total Parameters: A highly capable base model trained on extensive multi-lingual document datasets.
- 500 Million Activated Parameters: During inference, only a fraction of the experts are activated, resulting in extremely high throughput and low power consumption.
- State-of-the-Art Accuracy: Achieved a composite score of 93.23% on the OmniDocBench v1.5 benchmark, outperforming previous SOTA models.
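As a rough illustration of why only a fraction of the parameters is active per token, the sketch below shows top-k expert routing in plain Python. The article does not specify the router design or the number of experts, so softmax gating over the top-k experts (a common MoE choice) and the example values are assumptions, and the function names are hypothetical.

```python
import math


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    # Sort expert indices by score, highest first (ties keep index order).
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def active_fraction(total_params, active_params):
    """Share of the model's parameters used for a single token."""
    return active_params / total_params
```

Only the selected experts run for a given token, so compute tracks the activated parameters rather than the total. With the figures quoted above, `active_fraction(3e9, 5e8)` is about one sixth.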
Get Started with Unlimited-OCR
Baidu has released Unlimited-OCR under the permissive MIT license. You can download the model weights, run local inference, or integrate it into your document pipelines. It can run comfortably on a single consumer GPU.
Install & Run
git clone https://github.com/baidu/Unlimited-OCR.git
Frequently Asked Questions
What is the maximum page limit for Unlimited-OCR?
Can it run on consumer-grade hardware?
Is it suitable for commercial applications?
Community Buzz
Constant KV cache for long document generation is a game-changer. Finally, we can parse 50-page reports without OOM errors!
Tested the Hugging Face space demo. The speed remains completely flat from page 1 to page 30. Outstanding work by Baidu.
Using MoE (3B total / 500M active) means we can run this on a single consumer GPU with great throughput. Can't wait to integrate it.