Baidu Unlimited-OCR: Breaking the Length Barrier in Document Parsing

How Baidu's Reference Sliding Window Attention (R-SWA) achieves document parsing with constant O(1) memory, making page-by-page OCR obsolete.

Baidu AI Research • Open-source VLM

Unlimited-OCR

Parses dozens of document pages in a single pass with constant O(1) memory complexity.

[Diagram: 50+ pages → R-SWA → constant KV cache]

  • Context Horizon: Unlimited. Parses dozens of pages in a single forward pass without chunking.
  • Memory Complexity: O(1) constant. KV cache size remains flat, preventing GPU Out-Of-Memory errors.
  • Inference Latency: Flat. Generation speed per token does not degrade as document length increases.

Baidu has officially open-sourced Unlimited-OCR, a document-parsing model that reports state-of-the-art results on OmniDocBench. Using a novel attention mechanism called Reference Sliding Window Attention (R-SWA), it parses complex, multi-page documents in a single pass while keeping its memory footprint constant as the output grows.

The Evolution of Document Intelligence

For years, digitizing documents has been a two-step process: run an optical character recognition (OCR) engine on individual page images, then use heuristics or LLMs to reconstruct the layout, tables, and reading order. Vision-Language Models (VLMs) have recently shown the ability to parse documents end-to-end, but they have been severely limited by document length. Processing a 50-page PDF in a single pass quickly overwhelms GPU memory: the KV cache grows with every token, and the total cost of attention scales quadratically with sequence length.

The KV Cache Bottleneck in Long-Document Processing

In a standard Transformer decoder, as the model generates text, it stores the Key-Value (KV) states of all previous tokens in memory (the KV cache) so it doesn't have to recompute them. When transcribing a long document, the output text can easily span thousands of tokens. The KV cache therefore grows linearly with the output, and because every new token attends to the entire cache, the time per token keeps rising and the total attention compute grows quadratically. As a result, inference slows down steadily, and eventually the system can crash with an Out-of-Memory (OOM) error.
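To make the cache growth concrete, here is a minimal back-of-the-envelope sketch. The layer count, head count, head dimension, and 16-bit precision are hypothetical placeholders chosen for illustration; they are not the published Unlimited-OCR configuration.

```python
def kv_cache_bytes(num_tokens, num_layers=12, num_kv_heads=8,
                   head_dim=64, bytes_per_value=2):
    """Bytes needed to cache keys and values for `num_tokens` positions.

    All default sizes are hypothetical, not taken from the model.
    """
    # Two tensors (K and V) per layer, each holding
    # num_kv_heads * head_dim values for every cached token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


for n in (1_000, 10_000, 100_000):
    mib = kv_cache_bytes(n) / 2**20
    print(f"{n:>7} cached tokens -> {mib:.1f} MiB")
```

With full attention the cached token count is the whole history, so the number printed keeps rising for as long as the model keeps writing.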

Memory Footprint Comparison (KV Cache)

1. Traditional Attention: O(N) KV cache, O(N²) total attention compute

Memory demands grow with every generated token. Results in Out-of-Memory (OOM) crashes on long documents.

2. Reference Sliding Window (R-SWA): O(1) Constant

Memory footprint stays flat as the output grows. The model only remembers the visual source and the immediate text window.

Enter Reference Sliding Window Attention (R-SWA)

Baidu's team modeled R-SWA on the human cognitive process of transcription: we look at the source document (the reference) but keep only a small window of recently written words in short-term memory. R-SWA splits the attention mechanism into two distinct parts, a full vision reference and sliding-window text attention, which together yield a constant KV cache:

Full Vision Reference

The decoder maintains full, unrestricted attention over the entire visual features of the document pages, ensuring no loss of visual context.

Sliding Window Text Attention

Instead of attending to the entire generated history, the decoder only attends to a fixed window of the most recent text tokens (e.g., the last 128 tokens).

Constant KV Cache

By capping the text history, the KV cache stops growing once the window is full. Memory is therefore O(1) in the length of the generated text, and latency per token stays flat during generation.
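The three ideas above can be sketched as a toy cache. This is an illustration of the concept only, not Baidu's implementation: the class name `RSWACache`, the use of strings in place of per-layer key/value tensors, and the figure of 1,000 visual tokens are all assumptions. The window size of 128 follows the example given in the text.

```python
from collections import deque


class RSWACache:
    """Toy cache: a fixed reference block plus a sliding window of text tokens."""

    def __init__(self, reference_tokens, window_size=128):
        self.reference = tuple(reference_tokens)  # visual tokens, never evicted
        self.window = deque(maxlen=window_size)   # oldest text token is dropped when full

    def append(self, text_token):
        self.window.append(text_token)

    def visible(self):
        """Everything the next decoding step may attend to."""
        return list(self.reference) + list(self.window)

    def __len__(self):
        return len(self.reference) + len(self.window)


def simulate(num_generated, num_reference=1000, window_size=128):
    """Generate `num_generated` placeholder tokens and record the cache size."""
    cache = RSWACache([f"v{i}" for i in range(num_reference)], window_size)
    sizes = []
    for step in range(num_generated):
        cache.append(f"t{step}")
        sizes.append(len(cache))
    return cache, sizes


cache, sizes = simulate(num_generated=5000)
print("largest cache size:", max(sizes))
print("oldest text token still cached:", cache.window[0])
```

The cache grows for the first 128 steps and then stays at the size of the reference block plus the window, however many tokens are generated. Note that the reference block itself still scales with the number of pages; what stays constant is the cost of the growing output.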

Comparison: Traditional OCR vs. Unlimited-OCR

| Feature | Traditional Page-by-Page OCR | Baidu Unlimited-OCR |
| --- | --- | --- |
| Max Document Length | 1-2 pages (per pass) | Unlimited (tested up to 50+ pages) |
| Memory Complexity | Grows with output: O(N) KV cache, O(N²) attention compute | O(1) constant memory |
| Page Stitching Errors | High (broken tables & sentences) | None (seamless end-to-end parsing) |
| Latency per Token | Increases with output length | Flat |
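The latency row can be checked with a simple count of how many cached positions each decoding step attends to. This is a rough proxy for attention cost, not a benchmark; the function names and the figure of 1,000 visual tokens are hypothetical, and `window_size=None` stands for ordinary full attention.

```python
def attended_positions(step, num_reference, window_size=None):
    """Positions attended at a decoding step: reference tokens plus text history."""
    text = step if window_size is None else min(step, window_size)
    return num_reference + text


def total_work(num_generated, num_reference, window_size=None):
    """Sum of attended positions over all steps (a proxy for attention compute)."""
    return sum(attended_positions(t, num_reference, window_size)
               for t in range(1, num_generated + 1))


NUM_REFERENCE = 1_000  # hypothetical number of visual tokens

for n in (1_000, 10_000, 100_000):
    full = total_work(n, NUM_REFERENCE)
    windowed = total_work(n, NUM_REFERENCE, window_size=128)
    print(f"N={n:>7}: full / windowed work ratio = {full / windowed:.1f}")
```

Per-step work under full attention rises linearly with the step index, so the total is quadratic in the output length; with a window, per-step work is capped and the total is linear.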

Under the Hood: Architecture & Efficiency

Unlimited-OCR is built upon the DeepSeek-OCR baseline, inheriting its highly efficient vision-language foundation. Baidu's team replaced the standard self-attention layers in the decoder with the R-SWA module. The model features a Mixture-of-Experts (MoE) design:

  • 3 Billion Total Parameters: A highly capable base model trained on extensive multi-lingual document datasets.
  • 500 Million Activated Parameters: During inference, only a fraction of the experts (about one sixth of the parameters) is activated for each token, which lowers compute per token and improves throughput and power consumption.
  • State-of-the-Art Accuracy: Achieved a composite score of 93.23% on the OmniDocBench v1.5 benchmark, outperforming previous SOTA models.
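As a quick sanity check on the parameter counts above, the sketch below computes the active fraction and the memory needed just to hold the weights. The precisions listed are assumptions, since the article does not state the precision of the released weights, and the estimate ignores activations and the KV cache.

```python
TOTAL_PARAMS = 3_000_000_000   # total parameters, from the article
ACTIVE_PARAMS = 500_000_000    # activated parameters per token, from the article


def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"active fraction per token: {active_fraction:.1%}")

# Assumed storage precisions; all experts must be resident unless offloaded.
for name, nbytes in (("16-bit", 2), ("8-bit", 1)):
    gib = weight_memory_gib(TOTAL_PARAMS, nbytes)
    print(f"{name} weights: {gib:.2f} GiB for all {TOTAL_PARAMS:,} parameters")
```

The MoE design reduces the compute spent per token, but the memory for weights is still set by the total parameter count.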

Get Started with Unlimited-OCR

Baidu has released Unlimited-OCR under the permissive MIT license. You can download the model weights, run local inference, or integrate it into your document pipelines. It can run comfortably on a single consumer GPU.

Install & Run

git clone https://github.com/baidu/Unlimited-OCR.git

Frequently Asked Questions

What is the maximum page limit for Unlimited-OCR?
Theoretically, there is no page limit due to the O(1) constant memory complexity of the R-SWA mechanism. In practice, it has been successfully tested on documents up to 50+ pages in a single pass.
Can it run on consumer-grade hardware?
Yes. Because the Mixture-of-Experts (MoE) architecture activates only 500M parameters per token and the KV cache stays constant, it can run efficiently on commodity GPUs (such as an RTX 4090 or even smaller cards) without running out of memory. All 3B parameters still have to be loaded, which takes roughly 6 GB at 16-bit precision.
Is it suitable for commercial applications?
Yes. Unlimited-OCR is released under the MIT license, which allows commercial use, modification, and distribution, provided the copyright and license notice are kept with copies of the software.

Community Buzz

Sophia Vance (@sophia_v)
2026-06-25 10:30:00

Constant KV cache for long document generation is a game-changer. Finally, we can parse 50-page reports without OOM errors!

Hiroshi Tanaka (@hiro_t)
2026-06-25 12:15:00

Tested the Hugging Face space demo. The speed remains completely flat from page 1 to page 30. Outstanding work by Baidu.

Devin Carter (@devinc)
2026-06-25 14:00:00

Using MoE (3B total / 500M active) means we can run this on a single consumer GPU with great throughput. Can't wait to integrate it.