Code indexing (three tiers)

In one sentence

Code is not prose. You can shove it into a vector database, but questions like “who calls this function?” or “which tests break if I edit this line?” are beyond raw vector search.

The platform separates three tiers of code indexing — the setup cost you accept sets the ceiling on future answers.

Indexing ≠ retrieval tools. This page covers how code lands in the knowledge base.
After it is indexed, use unified_search and code_search to query it.
The tier determines which retrieval modes work: Zero allows text/AST grep, Light adds semantic search, and Heavy unlocks precise navigate across files.

Ragbase stack (L1–L6)

Customer-facing Zero / Light / Heavy packaging maps onto the L1–L6 stack below.
Use this table for engineering alignment, proposals, and acceptance — it applies to every parser, not only code-aware.

Tier	Index type	Source	When it activates	What you get
L1	Text chunks + embeddings	Any parser	Always	Semantic search / similarity recall
L2	BM25 lexical index (`content_ltks`, etc.)	Any parser	Always	Keyword search
L3	AST chunking	tree-sitter	`parser_id=code-aware` and language marked L3 ✓ in matrix	Function / class / method granular chunks with intact semantics
L4	R8 keyword fields (`function_decls`, `class_decls`, `import_decls`, `unresolved_refs`, `fqn`, `outline_path`, …)	tree-sitter AST	`code-aware` and L3 succeeded	“What symbols live in this snippet?” — filter & facet ready
L5	`audit_findings`	ast-grep YAML	`code-aware` and ast-grep binary available	Static rules (NPE risk, hard-coded secrets, dangerous APIs, …)
L6	SCIP graph (`index.scip`)	`scip-java`, `scip-python`, tree-sitter emit	`code-aware` and `parser_config.index_tier=heavy` and language L6 ✓	Precise go-to-def / find-refs / blast radius

Rough mapping to Zero / Light / Heavy

Zero: uploads allowed, but most files skip L1/L2/L3 materialisation — LLM + grep cover gaps.
Light: L1+L2 always on; code-aware adds L3+L4+L5 (heavy not required).
Heavy: builds on Light and, for supported languages, emits the L6 SCIP graph (slowest, strongest).
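
The activation conditions in the L1–L6 table can be read as a small decision function. The sketch below is illustrative only: `activated_layers` and its parameter names are hypothetical, not a platform API, and it models only the table's conditions for Light and Heavy, since the Zero mapping above is explicitly rough.

```python
def activated_layers(
    parser_id: str,
    index_tier: str,
    lang_l3: bool,
    lang_l6: bool,
    ast_grep_available: bool = True,
) -> list[str]:
    """Return the L1-L6 layers that would activate under the table's conditions.

    Hypothetical helper: all names are illustrative, not a platform API.
    """
    layers = ["L1", "L2"]  # text chunks + embeddings and BM25 are always on
    if parser_id != "code-aware":
        return layers
    if lang_l3:
        layers += ["L3", "L4"]  # L4 requires that L3 succeeded
    if ast_grep_available:
        layers.append("L5")  # audits need code-aware plus the ast-grep binary
    if index_tier == "heavy" and lang_l6:
        layers.append("L6")  # SCIP graph: heavy tier and an L6-capable language
    return layers


print(activated_layers("code-aware", "light", lang_l3=True, lang_l6=True))
# ['L1', 'L2', 'L3', 'L4', 'L5']
```

Under the same reading, a language with L5 and L6 but no L3 (Scala in the matrix further down) would get `['L1', 'L2', 'L5', 'L6']` on the heavy tier.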

Three tiers at a glance

Tier	Effort	Time to usable	Good questions	Audience
Zero	Drop a repo and search	Immediate	“Anything about auth?” “Which files mention API X?”	Ad-hoc audits, short-lived repos
Light	Wait for first sync (minutes typical)	Seconds after a file change	“Where is `processOrder` defined?” “Who calls it?” “What are the imports?”	Day-to-day reviews, onboarding
Heavy	Provide a buildable project (10 min–hours)	After full analysis completes	“Which downstreams break if I change this line?” “What is the inheritance lattice?” “All references across TUs”	Legacy C/C++, security, big refactors

Zero tier — search immediately

No durable index — smart textual tools + the model reading files on demand.

Works well for

“What does this repo do overall?”
“Find timeout handling snippets”
“There was a payment_*.py, locate it”

Weak spots

True call graph questions — you only see textual mentions of a name
Cross-language or cross-file semantic reasoning

Practical tip: choose Zero when you only need a quick tour — zero wait is the win.
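
To make "smart textual tools" concrete, here is a minimal sketch of what index-free search amounts to, using only the Python standard library. It is not the platform's implementation (the retrieval table on this page names ripgrep for text mode); `find_files` and `grep` are hypothetical helpers. Note that `grep` reports every textual mention of a name, including comments, which is why call-graph questions are a weak spot at this tier.

```python
import re
from pathlib import Path


def find_files(root: Path, pattern: str) -> list[Path]:
    """Locate files by glob, e.g. 'payment_*.py', anywhere under root."""
    return sorted(p for p in root.rglob(pattern) if p.is_file())


def grep(root: Path, regex: str, glob: str = "*.py") -> list[tuple[Path, int, str]]:
    """Return (file, line number, stripped line) for every textual match."""
    compiled = re.compile(regex)
    hits = []
    for path in find_files(root, glob):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if compiled.search(line):
                hits.append((path, lineno, line.strip()))
    return hits
```

For example, `find_files(repo, "payment_*.py")` answers the "locate it" question, and `grep(repo, r"timeout")` collects timeout-handling snippets for the model to read.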

Light tier — daily driver

Local syntactic tooling (tree-sitter-class parsers) builds compact per-language symbol indexes and stores them for search.
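
As a rough illustration of what a per-file symbol index contains, the sketch below fills three of the L4 field names from the stack table (`function_decls`, `class_decls`, `import_decls`) for a Python file. It substitutes Python's standard `ast` module for tree-sitter so that it runs without extra dependencies; the function name `extract_symbols` and the shape of the result are assumptions, not the platform's schema.

```python
import ast


def extract_symbols(source: str) -> dict[str, list[str]]:
    """Collect declared functions, classes, and imported modules from Python source."""
    fields: dict[str, list[str]] = {
        "function_decls": [],
        "class_decls": [],
        "import_decls": [],
    }
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            fields["function_decls"].append(node.name)
        elif isinstance(node, ast.ClassDef):
            fields["class_decls"].append(node.name)
        elif isinstance(node, ast.Import):
            fields["import_decls"].extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # Relative imports such as "from . import x" have no module name.
            fields["import_decls"].append(node.module or ".")
    return fields


sample = """
import redis
from orders import models

class OrderService:
    def processOrder(self, order):
        return models.save(order)
"""
print(extract_symbols(sample))
```

Because only syntax is inspected, the index knows that `processOrder` is declared here but not which concrete type a given call site resolves to; that gap is what the Heavy tier addresses.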

What changed in M4/R8

Per-file increments land in seconds, not minutes-long batch queues
Removed the old bulk “wait for rebuild” gate — commit, then query

Great for

“Where is processOrder defined and invoked?”
“What is the superclass / subclasses?”
“How many redis.Pipeline usages exist?”

Still weak when

Answers need full type inference (C++ templates, dynamic Python)
References cross link units and no extra graph has been built

Heavy tier — audits, refactors, mature C/C++

We run a semi-compilation pass to assemble a typed graph. Highest cost, highest precision.

Ideal for

Legacy C/C++ with macros/includes too tangled for syntax-only views
Security work (taint / dataflow style questions)
Cross-module refactors needing downstream inventory
Compliance programs (MISRA, ASPICE traceability) demanding call-chain evidence

Requirements

Project must build in our runners (CMake / Bazel / Maven, …)
Accept a one-time investment of tens of minutes to hours

Example: on ECU firmware, answer how editing can_send_frame touches specific CAN buses tied to ASIL-tagged requirements — not something Light/Zero can guarantee.
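
A "which downstreams break" question boils down to walking a reference graph backwards from the edited symbol. The sketch below shows that traversal on a toy call graph. The graph shape, the function `blast_radius`, and every symbol name except `can_send_frame` are hypothetical; in the Heavy tier the edges would come from the SCIP graph rather than a hand-written dict.

```python
from collections import deque


def blast_radius(calls: dict[str, set[str]], changed: str) -> set[str]:
    """Return every symbol that directly or transitively calls `changed`.

    `calls` maps each caller to the set of symbols it calls.
    """
    # Invert the edges so we can walk from a callee to its callers.
    callers: dict[str, set[str]] = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    affected: set[str] = set()
    queue = deque([changed])
    while queue:
        symbol = queue.popleft()
        for caller in callers.get(symbol, ()):
            if caller not in affected:  # also guards against cycles
                affected.add(caller)
                queue.append(caller)
    return affected


toy_graph = {
    "brake_task": {"send_brake_status"},
    "send_brake_status": {"can_send_frame"},
    "diag_task": {"can_send_frame"},
    "ui_task": {"render"},
}
print(sorted(blast_radius(toy_graph, "can_send_frame")))
# ['brake_task', 'diag_task', 'send_brake_status']
```

The traversal itself is cheap; the expensive part is obtaining edges that are correct across macros, includes, and translation units, which is what the build requirement pays for.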

Decision checklist

Need call relationships? No → Zero. Yes → at least Light.
Living codebase that changes often? Yes → Light (incremental keeps up). One-off snapshot → Zero can suffice.
Old C/C++, security, or formal compliance? Yes → Heavy; otherwise Light is usually enough.
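
One possible reading of the checklist as code, with the strongest requirement winning. The function `choose_tier` and its parameter names are hypothetical, and the checklist's softer wording ("can suffice", "usually enough") is flattened into hard rules here.

```python
def choose_tier(
    needs_call_relationships: bool,
    changes_often: bool,
    legacy_cpp_security_or_compliance: bool,
) -> str:
    """Pick the lowest tier that satisfies all three checklist questions."""
    if legacy_cpp_security_or_compliance:
        return "Heavy"  # old C/C++, security, or formal compliance
    if needs_call_relationships or changes_often:
        return "Light"  # call relationships or a living codebase
    return "Zero"  # one-off snapshot with no call-graph questions


print(choose_tier(False, False, False))  # Zero
print(choose_tier(True, False, False))   # Light
print(choose_tier(True, True, True))     # Heavy
```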

One KB, many repos?

The platform is multi-repo by default — one knowledge base can hold several repositories separated by top-level folders:

my-code-kb/
├── repo-vehicle-bsw/
│   ├── src/...
├── repo-mes-platform/
│   └── apps/
└── repo-shared-libs/

Top-level directory names become repo_kwd. Each repo maintains its own:

Symbol index bucket (symbol_index_<repo>)
Repository map (repo_map_<repo>)
Audit hits / reference maps

Why multi-repo matters

Capability	Detail
True per-repo incremental	Editing repo A touches only that file’s shards; B/C stay cold
Cross-repo lookup	Imports from repo B to symbols in repo A resolve in one query
Targeted heavy rebuild	Re-index SCIP for a single repo without taking others offline
Ambiguity resolution	Duplicate `UserService` names get `repo_kwd` tags automatically

Single-repo mode

Place a single top-level folder in the KB:

my-code-kb/
└── my-only-repo/

You can even flatten files at the KB root — the platform assigns a default repo_kwd. Single-repo is just a degenerate case.
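
The path-to-repo rule can be sketched as follows. This is an illustration of the layout convention, not platform code: `repo_kwd_for` and `per_repo_names` are hypothetical helpers, and the value `"default"` is a placeholder because this page does not state what the default repo_kwd is actually called.

```python
from pathlib import PurePosixPath

# Assumption: placeholder for the platform's default repo_kwd (real value not documented here).
DEFAULT_REPO = "default"


def repo_kwd_for(kb_relative_path: str) -> str:
    """Top-level directory name, or the default for files sitting at the KB root."""
    parts = PurePosixPath(kb_relative_path).parts
    return parts[0] if len(parts) > 1 else DEFAULT_REPO


def per_repo_names(repo: str) -> dict[str, str]:
    """Per-repo bucket names following the symbol_index_<repo> / repo_map_<repo> pattern."""
    return {"symbol_index": f"symbol_index_{repo}", "repo_map": f"repo_map_{repo}"}


print(repo_kwd_for("repo-vehicle-bsw/src/can.c"))  # repo-vehicle-bsw
print(repo_kwd_for("main.py"))                     # default
print(per_repo_names("repo-shared-libs"))
```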

Layout guidance

Scenario	Layout
One service / product	`kb/<repo>/…` or flat under `kb/`
Many microservices	One top-level folder per git repository
Monorepo	Treat the monorepo as one repo; rely on `outline_path` to separate projects
Parallel branches	`repo-main`, `repo-release-2026q2`, …

Do not flatten unrelated repos into one undifferentiated pile — repository isolation and cross-ref quality both suffer.

Supported languages

The matrix below is the language × (L3–L6) cheat sheet.
The L1–L6 stack behaves the same for every parser; this section focuses on code-aware and its bundled language packs.

Applies when parser_id = code-aware. PDFs / Markdown still use other parsers (L1/L2 or char fallbacks — see last table row).

Where to change parser & index tier (UI)

Without APIs: Combospace → Knowledge bases → open any KB → Configuration in the left rail.

Configuration tab — chunk method (code-aware open) and index tier (demo KB robot-code)

Figure note: UI copy shifts between releases. Capture via Hikari/scripts/capture-kb-configuration-screenshot.py (direct URL under /combospace/combospace/knowledge/dataset/configuration). Increase DOC_CAPTURE_NAV_TIMEOUT_MS, DOC_CAPTURE_SELECTOR_TIMEOUT_MS, or DOC_CAPTURE_SETTLE_MS on cold bundles. Override KB_CAPTURE_PATH_PREFIX if your shell path differs.

Below is what each language unlocks for L3/L4/L5/L6 — aligns with the L1–L6 table.

Legend for the four layers (the L3–L6 columns of the language matrix):

Layer	Meaning	Used by
L3 AST chunks	tree-sitter slicing with `outline_path`	Light & Heavy
L4 R8 fields	Structured keyword facets (`function_decls`, references, …)	Light & Heavy
L5 ast-grep audits	YAML rule hits (`unified_search code_search mode=audit`)	Light & Heavy
L6 SCIP	Compiler-grade graph for cross-TU navigation	Heavy only

Language matrix

Language	Extensions	L3	L4	L5 audit	L6 SCIP
Python	`.py` `.pyi`	✓	✓	✓	✓ `scip-python` (no build file)
Java	`.java`	✓	✓	✓	✓ `scip-java`, needs `pom.xml` / `build.gradle*`
Kotlin	`.kt` `.kts`	✓	✓	✓	✓ path ① build → `scip-java`; path ② loose emit (no build)
Scala	`.scala`	✗	✗	✓	✓ `scip-java`, build required
Go	`.go`	✓	✓	✓	✗ (indexer not wired)
TypeScript / TSX	`.ts` `.tsx`	✓	✓	✓	✗
JavaScript / JSX	`.js` `.jsx` `.mjs` `.cjs`	✓	✓	✓	✗
C / C++	`.c` `.cpp` `.cc` `.cxx` `.h` `.hpp`	✓	✓	✓	⚠ aarch64 linux lacks `scip-clang`; `cpp_stub` fail-soft
Rust	`.rs`	✓	✓	✓	✗
Ruby / PHP / Swift / Dart / Lua / C# / Elixir	…	✗	✗	✓ (ast-grep)	✗
Other text (`md`, `txt`, Dockerfiles, …)	—	char fallback	✗	✗	✗

How to read it:

Nine L3-ready languages auto-enter Light tier with symbol payloads out of the box.
L5 audit spans more languages than L3 — you can still run unified_search code_search mode=audit on Scala / Ruby / Swift / etc. even without AST chunks.
L6 SCIP today covers Python / Java / Kotlin / Scala / C·C++; other L3 languages lean on L4 faceting, which covers roughly 90% of scenarios. Need more? Email info@nox-lumen.com.

⚠ C/C++ caveat: production ARM64 hosts lack upstream scip-clang binaries, so L6 uses cpp_stub fail-soft — symbol panes still work, yet cross-TU navigation degrades. x86_64 builds are unaffected.
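
For scripting against the matrix (for example, to predict what a mixed-language repo will get before uploading), the rows can be transcribed into a lookup table. This is a sketch: `MATRIX`, `EXTENSIONS`, and `layers_for` are hypothetical names. The Ruby / PHP / Swift / … row is left out because the matrix elides its extensions, and the "Other text" row is represented by a `None` result.

```python
from pathlib import PurePosixPath
from typing import Optional

# (L3, L4, L5, L6) per language, transcribed from the matrix above.
MATRIX = {
    "python": (True, True, True, "yes"),
    "java": (True, True, True, "yes"),
    "kotlin": (True, True, True, "yes"),
    "scala": (False, False, True, "yes"),
    "go": (True, True, True, "no"),
    "typescript": (True, True, True, "no"),
    "javascript": (True, True, True, "no"),
    "c/c++": (True, True, True, "yes"),  # becomes "degraded" on aarch64 Linux
    "rust": (True, True, True, "no"),
}

EXTENSIONS = {
    ".py": "python", ".pyi": "python",
    ".java": "java",
    ".kt": "kotlin", ".kts": "kotlin",
    ".scala": "scala",
    ".go": "go",
    ".ts": "typescript", ".tsx": "typescript",
    ".js": "javascript", ".jsx": "javascript",
    ".mjs": "javascript", ".cjs": "javascript",
    ".c": "c/c++", ".cpp": "c/c++", ".cc": "c/c++",
    ".cxx": "c/c++", ".h": "c/c++", ".hpp": "c/c++",
    ".rs": "rust",
}


def layers_for(path: str, aarch64_linux: bool = False) -> Optional[dict]:
    """Return layer support for a file, or None if its extension is not listed."""
    language = EXTENSIONS.get(PurePosixPath(path).suffix.lower())
    if language is None:
        return None
    l3, l4, l5, l6 = MATRIX[language]
    if language == "c/c++" and aarch64_linux:
        l6 = "degraded"  # cpp_stub fail-soft: symbols work, cross-TU navigation degrades
    return {"language": language, "L3": l3, "L4": l4, "L5": l5, "L6": l6}


print(layers_for("src/orders/service.py"))
print(layers_for("bsw/can/can_if.c", aarch64_linux=True))
```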

Unsupported language?

Need PHP / Ruby / Swift / … on Light tier? Extend via parser-sdk — usually less than a day:

Confirm tree-sitter-language-pack already ships the grammar.

Fork code-aware-parser, append a LanguageConfig, e.g.

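# LanguageConfig comes from the forked code-aware-parser.
_RUBY = LanguageConfig(
    language="ruby",
    file_extensions=(".rb",),  # one-element tuple: keep the trailing comma
    atomic_nodes=frozenset({"method", "class", "module"}),
    scope_nodes=frozenset({"class", "module", "method"}),
    # Ruby has no import statement; require / require_relative parse as call nodes.
    import_nodes=frozenset({"call"}),
)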

Drop matching queries under queries/ruby/.
Publish with ragbase-cli skill push.

Full walkthrough: parser-sdk · custom code chunker (includes downloadable reference).

Once published, the language rides the same Zero/Light/Heavy modes as first-party packs (L3/L4 present, L5 audits on, L6 when matrix says so).
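
Before publishing, it can help to sanity-check a new config locally. The sketch below is assumption-heavy: the `LanguageConfig` dataclass is a stand-in that only mirrors the five fields used in the Ruby example (the real class in code-aware-parser may differ), and `check_config` is a hypothetical helper, not part of parser-sdk. It catches one genuine Python pitfall: writing `(".rb")` without the trailing comma yields a string, not a tuple.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageConfig:
    """Stand-in with the fields from the Ruby example; not the real class."""
    language: str
    file_extensions: tuple
    atomic_nodes: frozenset
    scope_nodes: frozenset
    import_nodes: frozenset


def check_config(cfg: LanguageConfig) -> list[str]:
    """Return human-readable problems; an empty list means nothing obvious is wrong."""
    problems = []
    if not cfg.language:
        problems.append("language is empty")
    if not isinstance(cfg.file_extensions, tuple):
        problems.append("file_extensions must be a tuple (missing trailing comma?)")
    else:
        if not cfg.file_extensions:
            problems.append("file_extensions is empty")
        for ext in cfg.file_extensions:
            if not ext.startswith("."):
                problems.append(f"extension {ext!r} should start with '.'")
    if not cfg.atomic_nodes:
        problems.append("atomic_nodes is empty")
    return problems


ruby = LanguageConfig(
    language="ruby",
    file_extensions=(".rb",),
    atomic_nodes=frozenset({"method", "class", "module"}),
    scope_nodes=frozenset({"class", "module", "method"}),
    import_nodes=frozenset({"call"}),
)
print(check_config(ruby))  # []
```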

Cross-industry vignettes

Automotive (auto)

Light — everyday ECU reviews, onboarding drills
Heavy — ASPICE evidence, SOTIF analysis, ECU interface refactors

Smart manufacturing (mfg)

Light — MES / SCADA code upkeep
Zero — vendor sample repos you only glance at once

Tier vs retrieval `mode`

unified_search and code_search can only use the modes that the KB's tier has actually indexed:

Tool + mode	Zero	Light	Heavy
`code_search(scope="kb", mode="text")` (ripgrep)	✅	✅	✅
`code_search(scope="kb", mode="ast")`	✅	✅	✅
`unified_search.code_search(mode="text")`	✅	✅	✅ (trigram assist)
`unified_search.code_search(mode="ast")`	✅	✅	✅
`unified_search.code_search(mode="semantic")`	❌	✅	✅
`unified_search.code_search(mode="navigate")`	❌	⚠ ES field heuristics	✅ SCIP precision
`unified_search.code_search(mode="hybrid")`	❌	✅	✅

How to read it:

Pick a mode, verify the KB tier, upgrade in Configuration when needed.
Upgrades are non-destructive — flip the switch and backfill.
Precise navigate needs Heavy and languages with L6 ✓ (matrix above).
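
The "pick a mode, verify the tier" step can be expressed as a lookup. The sketch below transcribes the `unified_search.code_search` rows of the table; `MIN_TIER`, `available_modes`, and `navigate_is_precise` are hypothetical names, not part of either tool.

```python
TIERS = ("zero", "light", "heavy")

# Minimum tier per unified_search.code_search mode, transcribed from the table above.
MIN_TIER = {
    "text": "zero",
    "ast": "zero",
    "semantic": "light",
    "hybrid": "light",
    "navigate": "light",  # heuristic on Light; SCIP-precise only on Heavy
}


def available_modes(tier: str) -> list[str]:
    """Modes usable at the given tier, in alphabetical order."""
    rank = TIERS.index(tier)
    return sorted(m for m, t in MIN_TIER.items() if TIERS.index(t) <= rank)


def navigate_is_precise(tier: str, language_has_l6: bool) -> bool:
    """Precise navigate needs Heavy and a language with L6 in the matrix."""
    return tier == "heavy" and language_has_l6


print(available_modes("zero"))   # ['ast', 'text']
print(available_modes("light"))  # ['ast', 'hybrid', 'navigate', 'semantic', 'text']
print(navigate_is_precise("heavy", language_has_l6=False))  # False
```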

Relationship to the code-review SKILL

Indexing is the substrate; the code-review SKILL is the product — it auto-selects retrieval modes based on the active tier.

PR/MR/CR reviews: Light is enough.
Full-branch / repo sweeps: Heavy yields depth.
Disposable repos: Zero for instant runs.

Parser SKILL hub — chunk quality before indexing
unified_search tool
code_search tool
Code-review SKILL
