Code Index Three Tiers: When Heavy Indexing Is Actually Worth It

Code is not a document. Stuffing code into a vector store works up to a point, but questions like "who calls this function?" or "which tests break if I change this line?" depend on structure (call relationships, references) that vector retrieval cannot recover. For that reason, combo agent splits "getting code into the knowledge base" into three indexing tiers: the more initialization time you invest, the deeper the questions you can answer.

One Table to Understand All Three Tiers

Tier	What You Need to Do	When It's Ready	Questions It Can Answer	Who It's For
Zero index	Just drop the repo and search	Instant	"Does this repo have authentication code?" "Which files mention this API?"	Ad-hoc debugging, newly inherited code, small repos you won't maintain long-term
Light index	Wait for initial sync (minutes, depends on repo size)	Seconds after any file change	"Where is this function defined?" "Who calls it?" "What are all its imports?"	Daily code review, new-hire onboarding, the vast majority of business repos
Heavy index	Set up a compilable build environment first (10 minutes to several hours, depending on the project)	Available after full analysis completes	"Which downstreams are affected if I change this line?" "What does this inheritance hierarchy look like?" "All cross-compilation-unit references"	Legacy C/C++ projects, security audits, cross-module large refactors

Why Not "One Tier for Everything"

Many products claim that one tier handles everything. In practice, such products fall short in one of two ways:

Not deep enough: only literal grep, can't answer call relationships
Too slow: every repo needs "half-compilation," initialization takes tens of minutes, and you have to wait again next time

Our judgment: index depth and initialization cost are a trade-off — picking the right tier per scenario is the engineering approach.

Zero Index: Search Right Away

No index is built. Queries rely on syntax-level matching, with the LLM reading the relevant files directly.

Good for:

"What does this repo mainly do? Give me an overview"
"Search for code that handles timeouts"
"I remember there was a file called payment_*.py, help me find it"

Not good for:

"Who calls this function" — zero index has no call graph, it can only tell you where the name literally appears
Cross-language calls, cross-file inheritance inference

Practical advice: If you just want AI to "take a look at this repo," zero index is enough. Its biggest advantage is no waiting.
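
To make the "no call graph, only literal matches" point concrete, here is a minimal sketch of index-free search: a file-name glob and a literal text scan over the working tree. The function names `find_files` and `grep_literal` are hypothetical and are not part of the platform; this only illustrates the kind of question zero index can answer.

```python
import fnmatch
import os
from pathlib import Path


def find_files(root: str, pattern: str) -> list[str]:
    """Return repo-relative paths whose file name matches a glob pattern."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if fnmatch.fnmatch(name, pattern):
                hits.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(hits)


def grep_literal(root: str, needle: str) -> list[tuple[str, int]]:
    """Return (path, line_number) pairs where the literal text appears."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((str(path.relative_to(root)), lineno))
    return hits
```

A scan like this can report that the text `timeout` appears in a file, but it has no way to tell a definition from a call site or a comment.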

Light Index: The Daily Workhorse

The platform uses local tools (tree-sitter-based syntax analysis) to build a lightweight index, extracting the locations of functions, classes, variables, and imports for each language, stored in the search engine.
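
The extraction step can be sketched roughly as follows. This example deliberately swaps in Python's standard-library `ast` module for tree-sitter so that it runs without extra dependencies, and it handles Python source only; the `Symbol` record and `extract_symbols` function are hypothetical names, not the platform's schema.

```python
import ast
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    kind: str  # "function", "class", or "import"
    name: str
    line: int


def extract_symbols(source: str) -> list[Symbol]:
    """Walk the syntax tree and record where definitions and imports occur."""
    symbols = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            symbols.append(Symbol("function", node.name, node.lineno))
        elif isinstance(node, ast.ClassDef):
            symbols.append(Symbol("class", node.name, node.lineno))
        elif isinstance(node, ast.Import):
            for alias in node.names:
                symbols.append(Symbol("import", alias.name, node.lineno))
        elif isinstance(node, ast.ImportFrom):
            # Relative imports without a module name are recorded as "."
            symbols.append(Symbol("import", node.module or ".", node.lineno))
    return sorted(symbols, key=lambda s: (s.line, s.kind, s.name))
```

Records like these are small and purely syntactic, which is why they can be produced per file without building the project.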

Key Upgrade (M4 / R8+)

Change one file and the new content is searchable within seconds; this previously took minutes
The old "batch wait / batch rebuild" mechanism has been removed, so you can write code and query it immediately
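
One common way to get per-file updates is to re-index only files whose content changed. The sketch below assumes content hashing as the change detector; the source does not say how the platform detects changes, and `IncrementalIndex` is a hypothetical name used for illustration only.

```python
import hashlib


class IncrementalIndex:
    """Track content hashes so only changed files are re-indexed."""

    def __init__(self) -> None:
        self._hashes: dict[str, str] = {}

    def sync(self, files: dict[str, str]) -> list[str]:
        """Take {path: content} and return the paths that need re-indexing."""
        changed = []
        for path, content in files.items():
            digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
            if self._hashes.get(path) != digest:
                self._hashes[path] = digest
                changed.append(path)
        # Forget files that no longer exist in the repo snapshot.
        for path in set(self._hashes) - set(files):
            del self._hashes[path]
        return sorted(changed)
```

With this shape, the cost of a sync is proportional to the number of edited files rather than the size of the repo.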

Good for

"Where is processOrder defined and who calls it?"
"What does this class inherit from? What are its subclasses?"
"How many places in the whole project use redis.Pipeline?"
"List all external modules imported by this file"

Not good for

Questions requiring type inference (C++ template instantiation, Python duck-typing runtime types)
Cross-compilation-unit, cross-link-boundary references
Impact questions on ASIL-D modules, such as "which downstreams are affected if I change this line?"
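
The boundary between the two lists above comes from matching by name rather than by type. The following sketch (again using Python's standard-library `ast` as a stand-in for tree-sitter, with the hypothetical helper `find_callers`) finds syntactic call sites of a function name. It answers "who calls `processOrder`?" but cannot tell whether `svc.processOrder(...)` refers to the same function or to an unrelated method that shares the name.

```python
import ast


def find_callers(source: str, target: str) -> list[tuple[str, int]]:
    """Return (enclosing_function, line) for each syntactic call to `target`."""
    hits = set()
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for node in ast.walk(func):
            if not isinstance(node, ast.Call):
                continue
            callee = node.func
            if isinstance(callee, ast.Name):
                name = callee.id            # plain call: processOrder(x)
            elif isinstance(callee, ast.Attribute):
                name = callee.attr          # attribute call: svc.processOrder(x)
            else:
                name = None
            if name == target:
                hits.add((func.name, node.lineno))
    return sorted(hits)
```

Resolving which definition an attribute call actually reaches requires type information, which is what the heavy tier adds.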

Heavy Index: Audits / Refactors / Legacy C/C++

Heavy index actually "half-compiles" the project to build a fully typed code graph (based on the SCIP semantic graph). It is the most expensive tier and the most accurate.

Right Scenarios

Scenario	Why Heavy Index Is Required
Legacy C/C++ projects	Macros, templates, include graphs are complex; without type info, you can't really audit
Security audits	Track user input flow to sensitive functions (taint analysis)
Cross-module large refactors	Changing a core interface requires seeing all downstreams first
Code compliance review	MISRA / ASPICE / ISO 26262 require "complete call chain traceability"
ASIL-D safety analysis	SOTIF / FMEA need line-level impact propagation

Prerequisites

Your project can build in our environment (CMake / Bazel / Maven, etc.)
You can accept a one-time initialization that takes anywhere from 10 minutes to several hours
Project language is in the SCIP support matrix (Java / Python / Go / Rust / TypeScript; C/C++ partially supported)

Real-World Example

On automotive ECU code, heavy index can answer: "I changed can_send_frame — which CAN buses are ultimately affected, and which ASIL-level requirements do they correspond to?" — light and zero index cannot answer this.
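
Once a complete call graph exists, the "what is affected downstream" part of that question reduces to reverse reachability. Below is a minimal sketch over a plain dictionary; apart from `can_send_frame`, every function name in the usage example is made up, and the mapping from affected functions to CAN buses or ASIL requirements is out of scope here.

```python
from collections import deque


def impacted_by(call_graph: dict[str, set[str]], changed: str) -> set[str]:
    """Return every function that directly or transitively calls `changed`.

    `call_graph` maps each caller to the set of functions it calls.
    """
    # Invert caller -> callees into callee -> callers.
    callers: dict[str, set[str]] = {}
    for caller, callees in call_graph.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    seen: set[str] = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    seen.discard(changed)  # a recursive cycle should not list the change itself
    return seen
```

For example, with the hypothetical graph `{"body_ctrl_task": {"door_lock_update"}, "door_lock_update": {"can_send_frame"}, "diag_task": {"can_send_frame"}}`, changing `can_send_frame` impacts `door_lock_update`, `diag_task`, and `body_ctrl_task`. The traversal is only as trustworthy as the graph, which is why an incomplete, name-based graph is not enough for this kind of analysis.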

Scenario-to-Tier Recommendations

Scenario	Recommended Tier	Notes
Daily PR / MR code review	Light	Second-level response
New hire onboarding	Light	Call relationships are sufficient
Quick review of an outsourced commit	Zero	Not worth waiting
Browsing an open-source repo	Zero	Instant
ASPICE compliance audit / BP assessment	Heavy	Needs complete call chains
Cross-module large refactor	Heavy	Must see all downstreams
Security audit / taint analysis	Heavy	Cross-compilation-unit
SOTIF / FMEA impact analysis	Heavy	Line-level precision
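
The table above can be read as a simple decision rule. The sketch below is one possible encoding, not a platform API: `recommend_tier` and its parameters are hypothetical, and it treats "the project builds in the indexing environment" as a hard prerequisite for heavy index, as stated in the Prerequisites section.

```python
def recommend_tier(
    needs_symbol_lookup: bool,
    needs_full_impact_analysis: bool,
    builds_in_environment: bool = False,
) -> str:
    """Pick the cheapest tier that can answer the questions being asked."""
    if needs_full_impact_analysis:
        # Audits, large refactors, taint analysis: complete call chains needed.
        if not builds_in_environment:
            raise ValueError("heavy index requires a project that builds")
        return "heavy"
    if needs_symbol_lookup:
        # Definitions, callers, imports: daily review and onboarding.
        return "light"
    # Overviews and literal searches: no waiting.
    return "zero"
```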

Ragbase Code Index Full Stack (L1–L6)

The sales-facing "zero / light / heavy" tiers are product packaging. The real technical layers are six deep:

Layer	Index Type	Source	Trigger
L1	Text chunks + embeddings	Any parser	Always
L2	BM25 full-text inverted index	Any parser	Always
L3	AST chunks	tree-sitter	`parser_id=code-aware`
L4	R8 keyword fields (functions / classes / imports / fqn)	tree-sitter AST derived	L3 running
L5	`audit_findings`	ast-grep YAML rules	ast-grep binary available
L6	SCIP semantic graph (`index.scip`)	`scip-java` / `scip-python` / tree-sitter	`index_tier=heavy`

Sales Tier	Actual Layers
Zero index	Most of L1/L2/L3 is not written; queries fall back to LLM / grep
Light index	L1+L2 always on; `code-aware` adds L3+L4+L5
Heavy index	Light index plus L6 SCIP full graph for supported languages
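
Read together, the two tables imply a small piece of selection logic. The sketch below encodes it under stated assumptions: only `parser_id=code-aware` and `index_tier=heavy` appear in the tables, so the `"zero"` and `"light"` values for `index_tier`, the function name `active_layers`, and the two boolean parameters are hypothetical. It also simplifies zero index to "no layers written", whereas the table says most (not all) are skipped.

```python
def active_layers(
    index_tier: str,
    parser_id: str,
    ast_grep_available: bool,
    scip_supported: bool,
) -> list[str]:
    """Return the index layers written for a given configuration."""
    if index_tier == "zero":
        return []  # simplification: queries fall back to LLM / grep
    layers = ["L1", "L2"]  # embeddings and BM25 are always on
    if parser_id == "code-aware":
        layers += ["L3", "L4"]  # AST chunks and the keyword fields derived from them
        if ast_grep_available:
            layers.append("L5")  # audit_findings need the ast-grep binary
    if index_tier == "heavy" and scip_supported:
        layers.append("L6")  # SCIP semantic graph
    return layers
```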

Relationship to L1+L2 Dual-Layer Code Review

Our AI Code Review system also uses L1 / L2 — but those are review tiers, not index tiers:

Code review L1: Objective static scanning (cppcheck / Checkstyle / ruff, etc.)
Code review L2: Semantic review (LLM + KB + LTM)
Index L1–L6: How code enters the knowledge base

Don't confuse them. L1/L2 code review can run at any index tier — but L2 review of ASIL-D modules only makes sense paired with heavy index.

FAQ

Q: Can I go straight from zero to heavy index? A: Yes. Index tier is a project-level setting that can be switched online. Switching to heavy triggers a full half-compilation pass.

Q: My repo is very large (1M+ lines). Can heavy index handle it? A: Yes, but initial setup may take several hours. We recommend starting with your core modules (100–500K lines) on heavy, with the rest on light.

Q: Does private deployment support heavy index? A: Yes. SCIP indexing is fully local — no data leaves the customer's network.

Q: How does incremental re-indexing work? A: Light index is truly incremental (a file change is searchable within seconds). Heavy index updates in batches (per commit batch, or via a scheduled full rebuild).

Q: How does this compare to Sourcegraph? A: Sourcegraph has excellent code indexing and we're aligned with it on capabilities (we use the same SCIP toolchain in some scenarios). But combo agent is a full Agent platform combining LLM retrieval + team memory + business rules — not just "code search."

Full capability details: Code Index Three Tiers (docs/concepts/code-index).