PageIndex is a vectorless, reasoning-based Retrieval-Augmented Generation (RAG) approach that retrieves answers from long documents without using embeddings, chunking, or a vector database.
Instead of relying on semantic similarity search, PageIndex builds a hierarchical Table of Contents (TOC) tree from a document and uses a Large Language Model (LLM) to reason over that structure. The model first identifies the most relevant section using the document’s hierarchy, then navigates to that section to generate a precise, cited answer.
Traditional RAG retrieves by similarity.
PageIndex retrieves by reasoning over structure.
This makes it particularly effective for structured, long-form content such as financial reports, legal contracts, regulatory filings, policy documents, and academic papers.
Most RAG systems rely on embeddings and vector databases: they split documents into chunks, convert the chunks into vectors, and retrieve answers by cosine similarity.
But similarity is not reasoning.
PageIndex is a vectorless RAG architecture that retrieves information by reasoning over document structure instead of performing semantic search. Rather than treating a document as a flat pile of text, it treats it as a structured hierarchy — like a textbook with a table of contents.
To understand how this works intuitively, let’s walk through an example using the classic film Sholay.
The Core Idea: Structure Before Search
If you feed the script or a detailed synopsis of Sholay into PageIndex, it does not split the document into arbitrary 500-word chunks.
Instead, it builds a structural tree of the story.
Think:
Document → Hierarchical Index → Reasoning-Based Retrieval → Answer
Instead of:
Document → Chunks → Embeddings → Vector DB → Similarity Search → Answer
Phase 1: Creating the Tree (Indexing Phase)
The first phase is structural indexing.
1️⃣ Structural Detection
The LLM reads the script and detects natural boundaries such as:
- Scene headings (“SCENE 1 — THE TRAIN ROBBERY”)
- Character introductions
- Act breaks
- Major narrative transitions
It doesn’t rely on fixed chunk sizes. It relies on narrative structure.
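As a toy illustration of structure-aware splitting, here is a minimal Python sketch that segments a script on scene headings with a regex. This is a stand-in, not PageIndex’s actual method: the real indexer uses an LLM to detect boundaries, and the heading pattern and sample script below are invented for the example.

```python
import re

# Toy stand-in for LLM-based boundary detection: split a script on
# scene headings like "SCENE 1 — THE TRAIN ROBBERY". This only
# illustrates splitting on narrative structure rather than on a
# fixed chunk size.
SCENE_RE = re.compile(r"^SCENE \d+ — .+$", re.MULTILINE)

def split_on_scenes(script: str) -> list[str]:
    """Return one segment per scene heading, keeping the heading."""
    starts = [m.start() for m in SCENE_RE.finditer(script)]
    if not starts:
        return [script]
    bounds = starts + [len(script)]
    return [script[a:b].strip() for a, b in zip(bounds, bounds[1:])]

script = (
    "SCENE 1 — THE TRAIN ROBBERY\nVeeru and Jai defend the train.\n"
    "SCENE 2 — THAKUR'S OFFER\nThakur recruits the pair.\n"
)
segments = split_on_scenes(script)
```

Each segment begins at a narrative boundary, however long or short the scene is; a 500-word chunker would cut straight through the middle of scenes.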
In a color-coded diagram of the resulting tree:
- 🎬 Dark root → the full document
- 🔵 Blue → major story segments
- 🔴 Red → the Gabbar-related arc
- 🟣 Purple → critical event nodes
- 🟠 Gold → specific factual events
2️⃣ Hierarchical Mapping
PageIndex builds a tree.
If the root node is:
Sholay
The first-level branches might look like:
- Prologue
- Recruitment of Veeru & Jai
- Life in Ramgarh
- Gabbar’s Reign of Terror
- The Final Showdown
Each of those branches can contain child nodes (subsections).
For example:
Gabbar’s Den
Summary:
“This section covers Gabbar Singh’s introduction, the ‘Kitne aadmi the’ dialogue, and the punishment of his henchmen.”
Each node contains:
- a title
- a nodeId
- a summary
- child nodes
The key here is the summary. The LLM writes a concise, semantic description of what happens in that section.
That summary becomes the retrieval signal later.
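A node with those four fields can be sketched as a small Python dataclass. This is a minimal illustration assuming exactly the fields listed above; the real project’s JSON schema may name or nest them differently, and the node IDs and summaries here are invented for the Sholay example.

```python
from dataclasses import dataclass, field

# Minimal sketch of a PageIndex-style node: title, nodeId, summary,
# and child nodes. Field names follow the list above, not the
# project's actual schema.
@dataclass
class Node:
    title: str
    node_id: str
    summary: str
    children: list["Node"] = field(default_factory=list)

root = Node(
    title="Sholay",
    node_id="root",
    summary="Full script of the film Sholay.",
    children=[
        Node(
            title="Gabbar's Reign of Terror",
            node_id="gabbar-reign",
            summary="Gabbar's raids on Ramgarh and scenes in his den.",
            children=[
                Node(
                    title="Gabbar's Den",
                    node_id="gabbar-den",
                    summary=("Covers Gabbar Singh's introduction, the 'Kitne "
                             "aadmi the' dialogue, and the punishment of his "
                             "henchmen."),
                )
            ],
        )
    ],
)
```

Note that the node stores a summary, not the section’s raw text; the raw text stays on disk until the query phase asks for it.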
Phase 2: The Query Phase
Now imagine the user asks:
Why did Thakur lose his arms?
Here is what does not happen:
- The full script is not sent immediately.
- No embeddings are generated.
- No vector similarity search is performed.
Instead, the LLM receives:
- The user’s question.
- The hierarchical map (the JSON tree).
- The summaries of each node.
Not the entire script.
Just the structure.
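What the model receives can be sketched as the tree rendered into a compact, indented outline of titles, node IDs, and summaries. The dict layout below is an assumption for illustration, not the project’s real wire format.

```python
# Render the TOC tree as the compact context the LLM sees at query
# time: titles, nodeIds, and summaries only -- never raw section text.
def render_outline(node: dict, depth: int = 0) -> list[str]:
    line = f"{'  ' * depth}- {node['title']} [{node['nodeId']}]: {node['summary']}"
    lines = [line]
    for child in node.get("children", []):
        lines.extend(render_outline(child, depth + 1))
    return lines

tree = {
    "title": "Sholay",
    "nodeId": "root",
    "summary": "Full script of the film.",
    "children": [
        {
            "title": "Gabbar's Reign of Terror",
            "nodeId": "gabbar-reign",
            "summary": "Gabbar's raids on Ramgarh and scenes in his den.",
        }
    ],
}
prompt_context = "\n".join(render_outline(tree))
```

The resulting outline is a few hundred tokens even for a long document, which is why the full script never needs to enter the prompt at this stage.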
How the LLM Finds the Answer (Reasoning, Not Math)
Step A: The Structural Search
The LLM reads the tree.
It sees nodes like:
- “The Massacre of Thakur’s Family”
- “Gabbar’s Revenge”
- “Life in Ramgarh”
Based on the summaries, it reasons:
The answer likely exists in sections involving Gabbar and Thakur’s injury.
This is logical reasoning, not vector similarity.
Step B: The Deep Dive
PageIndex then retrieves only the raw text corresponding to those specific nodes.
Instead of scanning 50 pages, it retrieves perhaps 2–3 focused sections.
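The deep dive itself is a plain lookup once navigation has picked the node IDs. In this sketch, `node_texts` is a hypothetical store mapping nodeId to original section text, built during indexing; the IDs and texts are invented for the example.

```python
# Hypothetical store built at indexing time: nodeId -> raw section text.
node_texts = {
    "massacre-thakur-family": "Gabbar attacks Thakur's family and severs his arms.",
    "life-in-ramgarh": "Daily life in the village of Ramgarh.",
    "final-showdown": "The villagers confront Gabbar's gang.",
}

def fetch_sections(node_ids: list[str], store: dict) -> list[tuple[str, str]]:
    """Return (nodeId, text) pairs for the chosen nodes only."""
    return [(nid, store[nid]) for nid in node_ids if nid in store]

# Only the sections navigation selected reach the answering prompt.
selected = fetch_sections(["massacre-thakur-family"], node_texts)
```

Keeping the nodeId alongside the text is what later lets the final answer carry a citation back to a specific section.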
Step C: The Final Answer
Now the LLM reads that small, highly relevant snippet and responds:
Thakur lost his arms because Gabbar Singh cut them off as revenge for Thakur arresting him years earlier.
It can also cite:
(nodeId: massacre-thakur-family)
The retrieval is explainable.
Why This Is Better Than Traditional RAG (For Structured Documents)
In a traditional vector RAG system:
If you search for “Thakur’s arms,” the system might retrieve:
- A fight scene where Jai and Veeru are using their arms
- Dialogue containing similar vocabulary
- Irrelevant mentions of “hands” or “injury”
This happens because vector search retrieves by semantic closeness, not narrative relevance.
It performs what can be described as “vibe matching.”
PageIndex avoids this problem because the summary of the massacre scene explicitly says:
“This section describes how Gabbar attacks Thakur’s family and severs his arms.”
The LLM does not guess.
It navigates.
Why PageIndex Works
PageIndex works because it separates two cognitive tasks:
- Navigation — Determine where the answer should exist.
- Extraction — Read only that section and generate the answer.
This mirrors how humans read:
When you want to know why something happened in a novel, you don’t skim every page randomly.
You go to the chapter where the relevant event occurred.
PageIndex forces the LLM to behave the same way.
When This Approach Shines
This architecture is particularly effective for:
- Financial reports
- Legal documents
- Policy papers
- Regulatory filings
- Academic research
- Long narrative content
Anywhere structure matters more than surface similarity.
The Bigger Insight
Traditional RAG assumes:
Relevance = semantic similarity.
PageIndex assumes:
Relevance = structural reasoning.
That difference seems small, but in long, hierarchical documents, it is profound.
Instead of building a better search engine, PageIndex builds a navigational map — and lets the LLM think before it reads.

