Engineering Explainer

PixelRAG: reading the web as pictures

Skip HTML parsing entirely — retrieve and read the web as screenshots, in pixel space, and beat text-based retrieval even on text-only questions.

When an AI system looks something up on the web to answer your question, there's a hidden, unglamorous step nobody talks about: turning web pages into text. PixelRAG's bet is that we can delete that step entirely — and come out ahead.

The hidden, lossy step

Retrieval-augmented generation, or RAG, grounds a language model in external knowledge: when a question comes in, the system retrieves relevant documents and feeds them to the model. The biggest, richest library is the web, so most systems retrieve from web pages. But web pages aren't text — they're HTML, CSS, layout, tables, infoboxes, and images, rendered together into what you actually see.

A traditional RAG system needs plain text, so before any clever AI happens there's an HTML-to-text parsing stage — and it is brittle and lossy. The paper notes a single extractor can discard over forty percent of a page's recoverable text. Tables get flattened into meaningless delimiter strings; images and charts vanish; and the choice of parser alone can swing answer accuracy by nearly ten percent. We've been treating this step as free when it absolutely isn't.
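
To make the loss concrete, here is a minimal sketch using only Python's standard library. The table and its values are invented for illustration, and the extractor is a deliberately naive stand-in, not any specific parser evaluated in the paper.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Naive extractor: keeps text nodes and throws away every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace between tags
            self.chunks.append(text)


def flatten(html):
    """Return the page text as one space-joined string."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Invented example table (hypothetical cities and numbers).
TABLE = """
<table>
  <tr><th>City</th><th>Population</th><th>Founded</th></tr>
  <tr><td>Avalon</td><td>12000</td><td>1901</td></tr>
  <tr><td>Brindle</td><td>8500</td><td>1950</td></tr>
</table>
"""

flat = flatten(TABLE)
print(flat)  # City Population Founded Avalon 12000 1901 Brindle 8500 1950
```

Every cell survives, but nothing in the output says that 8500 belongs to Brindle, or that it is a population rather than a year. The row and column structure that carried the meaning is gone.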

Read the web as you see it

So PixelRAG asks: the web was designed to be looked at — what if the AI looks at it too? Render each page as a screenshot, build the searchable index out of those images, retrieve the most relevant screenshots, and feed them straight to a vision-language model — a model that takes images as input and can read text inside them. No parsing, no text conversion. Everything happens, as the authors put it, in pixel space.

This is feasible now precisely because vision-language models got good at reading text rendered as pixels — OCR, tables, charts. Once a model can reliably read a screenshot, you no longer need to extract the text.

Making it scale

The engineering is most of the work. The authors build their index over the full English Wikipedia, seven million articles, which they present as the first screenshot-based retrieval system at that scale.

  • Render and tile. Pages are rendered offline (fetching decoupled from rendering for speed), stripped of navigation, and sliced into fixed-size tiles — about thirty million tiles for Wikipedia. The whole pipeline runs in roughly two days on one machine.
  • Index. Each tile is turned into a single embedding vector. The leading visual-document methods use hundreds of vectors per image to capture fine detail, but at thirty million tiles that would balloon the index to roughly 6.5 terabytes. A single 2048-dimensional vector per tile keeps it near 120 GB, manageable on one host.
  • Retrieve and read. A query is embedded, the most similar tiles are retrieved, and a vision-language reader answers directly from the pixels.
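
The three steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' code: the function names, the full-width tile layout, and the tile height are hypothetical, the 2 bytes per value assumes 16-bit floats (which happens to match the quoted index size), and the brute-force cosine search stands in for whatever nearest-neighbour index a real system would use.

```python
import math


def tile_boxes(page_w, page_h, tile_h):
    """Slice a rendered page into full-width tiles, top to bottom.

    Returns (left, top, right, bottom) boxes. The last tile is clipped
    to the page, so it may be shorter than tile_h.
    """
    return [
        (0, top, page_w, min(top + tile_h, page_h))
        for top in range(0, page_h, tile_h)
    ]


def index_bytes(n_tiles, dim, bytes_per_value=2):
    """Raw size of a one-vector-per-tile index (assumes 16-bit floats)."""
    return n_tiles * dim * bytes_per_value


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, tile_vecs, k):
    """Indices of the k tiles most similar to the query (brute force)."""
    scored = sorted(
        ((cosine(query_vec, vec), i) for i, vec in enumerate(tile_vecs)),
        reverse=True,
    )
    return [i for _, i in scored[:k]]


# Render and tile: a hypothetical 1000 x 2500 px page with 1000 px tiles.
print(tile_boxes(1000, 2500, 1000))

# Index: thirty million tiles, one 2048-dimensional vector each.
single = index_bytes(30_000_000, 2048)
print(f"{single / 1e9:.1f} GB")  # 122.9 GB

# Retrieve: toy 2-d vectors in place of real tile embeddings.
print(top_k([1.0, 0.0], [[0.0, 1.0], [0.9, 0.1], [1.0, 0.0]], k=2))
```

Under the 16-bit assumption the single-vector index comes to about 123 GB, in line with the figure of near 120 GB, and the quoted 6.5 terabytes for a multi-vector index would be roughly fifty times larger.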

The authors also fine-tune the visual embedder with no human labels: a language model invents questions from a tile, hard negatives are mined for each question, and false negatives, other tiles that happen to answer the question too, are filtered out.
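
The mining-and-filtering step might look something like the sketch below. The details here are assumptions: the paper's exact procedure is not described in this explainer, `mine_hard_negatives` and `answers_question` are hypothetical names, short text strings stand in for tile screenshots, and a substring check stands in for the language-model judge.

```python
def mine_hard_negatives(question, positive_id, ranked_ids, tiles,
                        answers_question, n_neg=2):
    """Pick the top-ranked tiles that are NOT valid answers.

    ranked_ids: tile ids ordered by retriever score, best first.
    answers_question: judge(question, tile) -> bool; in a real pipeline
    this would be a model call, here it is a toy function.
    """
    negatives = []
    for tile_id in ranked_ids:
        if tile_id == positive_id:
            continue  # the tile the question was written from
        if answers_question(question, tiles[tile_id]):
            continue  # false negative: it also answers the question
        negatives.append(tile_id)
        if len(negatives) == n_neg:
            break
    return negatives


# Invented toy data: strings stand in for tile images.
tiles = {
    "t1": "Mount Example is 4,321 m tall.",
    "t2": "Infobox: Mount Example, elevation 4,321 m",
    "t3": "Mount Example was first climbed in 1950.",
    "t4": "Mount Example lies in the Sample Range.",
}
question = "How tall is Mount Example?"


def toy_judge(question, tile_text):
    """Toy stand-in for a model judge: does the tile state the height?"""
    return "4,321" in tile_text


hard = mine_hard_negatives(
    question, "t1", ["t2", "t1", "t3", "t4"], tiles, toy_judge
)
print(hard)  # ['t3', 't4']
```

Tile t2 ranks highest but is dropped because it also contains the answer. Training against it as a negative would punish the embedder for a correct retrieval.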

The surprising result

PixelRAG beats both no-retrieval and text-based RAG across six benchmarks. The shock is that it wins even on the text-centric tasks — classic Wikipedia question-answering where the answer is plain prose. On SimpleQA it scores about seventy-nine percent against roughly seventy-two for the best text parser.

Why? Two failure modes of text. First, parser loss: flattening a table destroys the structure that made it meaningful, so the answer simply vanishes from the text being searched. Second, and subtler: collapsing a two-dimensional page into a one-dimensional token stream loses the visual cues that tell a reader where to look. The authors found a lovely example — Wikipedia's keyword-dense infobox, once flattened, overlaps with almost any question about the topic, so the text retriever keeps grabbing it instead of the answer-bearing paragraph. The visual model is immune: to it, an infobox looks like an infobox.
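
A toy illustration of the infobox effect follows. It is not the retriever used in the paper: the scoring is a bare count of shared terms, the stopword list is ad hoc, and the bridge, its infobox, and the paragraph are all invented.

```python
import re

# Small ad hoc stopword list, chosen only for this example.
STOPWORDS = {"the", "a", "an", "of", "in", "is", "was", "to",
             "why", "what", "when", "who", "did"}


def terms(text):
    """Lowercased content words of a text, as a set."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOPWORDS}


def overlap(question, passage):
    """Number of distinct question terms that appear in the passage."""
    return len(terms(question) & terms(passage))


# Invented page content: a flattened infobox and the paragraph that
# actually answers the question.
infobox = ("Fictional Bridge Location Exampleville Type suspension bridge "
           "Length 1200 m Opened 1931 Designer A. Sample")
paragraph = ("Engineers chose a suspension design because the river bed "
             "was too soft to support many piers.")
question = "Why is the Fictional Bridge in Exampleville a suspension bridge?"

print(overlap(question, infobox))    # 4
print(overlap(question, paragraph))  # 1
```

The flattened infobox matches four of the question's terms and the answer-bearing paragraph matches one, so a term-overlap ranker would surface the infobox even though it says nothing about why.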

There's a bonus, too. Because the input is now an image, you have a new dial: resolution. Render the screenshots smaller and you cut token cost by up to a factor of three while keeping accuracy.
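
A back-of-the-envelope sketch shows why shrinking the image saves tokens. It assumes a reader that spends one token per fixed-size square patch, which is a common design but not a universal one. The 28-pixel patch and the 1024-pixel tile are hypothetical values, and real models count image tokens in their own ways.

```python
import math


def visual_tokens(width, height, patch=28):
    """Tokens for an image, assuming one token per patch x patch square."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def scaled(width, height, scale):
    """Image size after scaling both sides by the same factor."""
    return round(width * scale), round(height * scale)


full = visual_tokens(1024, 1024)

# Token count follows area, so scaling each side by 1/sqrt(3)
# cuts the area, and roughly the token count, by a factor of three.
small = visual_tokens(*scaled(1024, 1024, 1 / math.sqrt(3)))

print(full, small, round(full / small, 2))  # 1369 484 2.83
```

Because cost follows pixel area, a roughly threefold saving needs each side to shrink only to about 58 percent of its original length. Whether the text stays legible at that size is the empirical question the paper's accuracy numbers address.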

The honest caveats

This is a preprint. The approach is compute-heavy: building the index took eight high-end GPUs and two terabytes of memory. The fine-tuned embedder, trained on Wikipedia, didn't transfer cleanly to news pages, so some of the benefit needs per-domain work. And the single-vector design the authors chose for scale trades away some of the fine detail that the more expensive multi-vector methods capture.

Why it matters

PixelRAG questions an assumption so deep we rarely notice it: that text is the right representation for web knowledge. We've spent years building elaborate machinery to convert the visual, structured, designed-for-humans web into lossy text, then feeding that to our models. As models learn to see, that detour may be unnecessary — and even counterproductive. Meet the web in its native form, as pixels, and you can be both more accurate and more efficient.