Code indexing (three tiers)
In one sentence
Code is not prose. You can shove it into a vector database, but questions like “who calls this function?” or “which tests break if I edit this line?” are beyond raw vector search.
The platform separates three tiers of code indexing — the setup cost you accept sets the ceiling on future answers.
Indexing ≠ retrieval tools. This page covers how code lands in the knowledge base.
After it is indexed, use unified_search and code_search to query it.
The tier determines which retrieval mode works: Zero allows text / ast grep, Light enables semantic, Heavy unlocks precise navigate across files.
Ragbase stack (L1–L6)
Customer-facing Zero / Light / Heavy packaging maps onto the L1–L6 stack below.
Use this table for engineering alignment, proposals, and acceptance; it applies to every parser, not only code-aware.
| Tier | Index type | Source | When it activates | What you get |
|---|---|---|---|---|
| L1 | Text chunks + embeddings | Any parser | Always | Semantic search / similarity recall |
| L2 | BM25 lexical index (content_ltks, etc.) | Any parser | Always | Keyword search |
| L3 | AST chunking | tree-sitter | parser_id=code-aware and language marked L3 ✓ in matrix | Function / class / method granular chunks with intact semantics |
| L4 | R8 keyword fields (function_decls, class_decls, import_decls, unresolved_refs, fqn, outline_path, …) | tree-sitter AST | code-aware and L3 succeeded | “What symbols live in this snippet?” — filter & facet ready |
| L5 | audit_findings | ast-grep YAML | code-aware and ast-grep binary available | Static rules (NPE risk, hard-coded secrets, dangerous APIs, …) |
| L6 | SCIP graph (index.scip) | scip-java, scip-python, tree-sitter emit | code-aware and parser_config.index_tier=heavy and language L6 ✓ | Precise go-to-def / find-refs / blast radius |
Rough mapping to Zero / Light / Heavy
- Zero: uploads allowed, but most files skip L1/L2/L3 materialisation — LLM + grep cover gaps.
- Light: L1+L2 always on; code-aware adds L3+L4+L5 (heavy not required).
- Heavy: builds on Light and, for supported languages, emits the L6 SCIP graph (slowest, strongest).
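The activation rules in the L1–L6 table can be read as a small decision function. The sketch below is illustrative only: the function name, its parameters, and the "light" value for index_tier are assumptions, and it treats L3 as succeeding whenever the language is marked L3 ✓.

```python
from typing import List


def active_layers(
    parser_id: str,
    index_tier: str,
    lang_has_l3: bool,
    lang_has_l6: bool,
    ast_grep_available: bool,
) -> List[str]:
    """Return the L1-L6 layers the stack table says would activate.

    Hypothetical helper, not a platform API.
    """
    layers = ["L1", "L2"]  # text chunks + BM25 are listed as always on
    if parser_id != "code-aware":
        return layers  # L3-L6 all require the code-aware parser
    if lang_has_l3:
        # L4 is listed as "code-aware and L3 succeeded"; this sketch
        # assumes L3 succeeds for every L3-ready language.
        layers += ["L3", "L4"]
    if ast_grep_available:
        layers.append("L5")
    if index_tier == "heavy" and lang_has_l6:
        layers.append("L6")
    return layers


# A Scala file on a heavy KB: no AST chunks, but audits and SCIP apply.
print(active_layers("code-aware", "heavy", False, True, True))
```

Scala is a useful edge case here because the language matrix marks it L3 ✗ yet L5 and L6 ✓, so the layers are not strictly cumulative.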
Three tiers at a glance
| Tier | Effort | Time to usable | Good questions | Audience |
|---|---|---|---|---|
| Zero | Drop a repo and search | Immediate | “Anything about auth?” “Which files mention API X?” | Ad-hoc audits, short-lived repos |
| Light | Wait for first sync (minutes typical) | Seconds after a file change | “Where is processOrder defined?” “Who calls it?” “What are the imports?” | Day-to-day reviews, onboarding |
| Heavy | Provide a buildable project (10 min–hours) | After full analysis completes | “Which downstreams break if I change this line?” “What is the inheritance lattice?” “All references across TUs” | Legacy C/C++, security, big refactors |
Zero tier — search immediately
No durable index — smart textual tools + the model reading files on demand.
Works well for
- “What does this repo do overall?”
- “Find timeout handling snippets”
- “There was a payment_*.py, locate it”
Weak spots
- True call graph questions — you only see textual mentions of a name
- Cross-language or cross-file semantic reasoning
Practical tip: choose Zero when you only need a quick tour — zero wait is the win.
Light tier — daily driver
Local syntactic tooling (tree-sitter class stack) builds compact symbol indexes per language and stores them for search.
What changed in M4/R8
- Per-file increments land in seconds, not minutes-long batch queues
- Removed the old bulk “wait for rebuild” gate — commit, then query
Great for
- “Where is processOrder defined and invoked?”
- “What is the superclass / subclasses?”
- “How many redis.Pipeline usages exist?”
Still weak when
- Answers need full type inference (C++ templates, dynamic Python)
- References cross link units without extra graph work
Heavy tier — audits, refactors, mature C/C++
We run a semi-compilation pass to assemble a typed graph. Highest cost, highest precision.
Ideal for
- Legacy C/C++ with macros/includes too tangled for syntax-only views
- Security work (taint / dataflow style questions)
- Cross-module refactors needing downstream inventory
- Compliance programs (MISRA, ASPICE traceability) demanding call-chain evidence
Requirements
- Project must build in our runners (CMake / Bazel / Maven, …)
- Accept a one-time tens-of-minutes → hours investment
Example: on ECU firmware, answer how editing can_send_frame touches specific CAN buses tied to ASIL-tagged requirements — not something Light/Zero can guarantee.
Decision checklist
- Need call relationships? No → Zero. Yes → at least Light.
- Living codebase that changes often? Yes → Light (incremental keeps up). One-off snapshot → Zero can suffice.
- Old C/C++, security, or formal compliance? Yes → Heavy; otherwise Light is usually enough.
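The checklist can be condensed into a few lines of code. This is a minimal sketch, assuming the three questions are evaluated with the Heavy criteria taking precedence; the function and parameter names are made up for illustration.

```python
def choose_tier(
    needs_call_relationships: bool,
    changes_often: bool,
    legacy_cpp_security_or_compliance: bool,
) -> str:
    """Suggest a tier from the three checklist questions (hypothetical helper)."""
    # Old C/C++, security, or formal compliance work points to Heavy.
    if legacy_cpp_security_or_compliance:
        return "heavy"
    # Call relationships need at least Light; a frequently changing
    # codebase also favours Light because incremental indexing keeps up.
    if needs_call_relationships or changes_often:
        return "light"
    # One-off snapshot with no call-graph questions.
    return "zero"


print(choose_tier(needs_call_relationships=True,
                  changes_often=True,
                  legacy_cpp_security_or_compliance=False))
```

The result is a starting point rather than a rule: the checklist itself says Light is "usually" enough outside the Heavy cases.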
One KB, many repos?
The platform is multi-repo by default — one knowledge base can hold several repositories separated by top-level folders:
Top-level directory names become repo_kwd. Each repo maintains its own:
- Symbol index bucket (symbol_index_<repo>)
- Repository map (repo_map_<repo>)
- Audit hits / reference maps
Why multi-repo matters
| Capability | Detail |
|---|---|
| True per-repo incremental | Editing repo A touches only that file’s shards; B/C stay cold |
| Cross-repo lookup | Imports from repo B to symbols in repo A resolve in one query |
| Targeted heavy rebuild | Re-index SCIP for a single repo without taking others offline |
| Ambiguity resolution | Duplicate UserService names get repo_kwd tags automatically |
Single-repo mode
Drop one top-level folder:
You can even flatten files at the KB root — the platform assigns a default repo_kwd. Single-repo is just a degenerate case.
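To make the folder-to-repo_kwd rule concrete, here is a minimal sketch of how a path relative to the KB root could map to a repo_kwd and to the per-repo names mentioned above. The helper names and the "default" placeholder are assumptions; the actual default repo_kwd the platform assigns is not stated on this page.

```python
from pathlib import PurePosixPath

# Hypothetical placeholder: the real default repo_kwd is not documented here.
DEFAULT_REPO_KWD = "default"


def repo_kwd_for(path_in_kb: str) -> str:
    """Derive a repo_kwd from a file path relative to the KB root."""
    parts = PurePosixPath(path_in_kb.lstrip("/")).parts
    # A file sitting directly at the KB root has no top-level folder,
    # so it falls back to the default repo_kwd.
    if len(parts) < 2:
        return DEFAULT_REPO_KWD
    return parts[0]


def per_repo_names(repo: str) -> dict:
    """Build the per-repo bucket names following the symbol_index_<repo> pattern."""
    return {
        "symbol_index": f"symbol_index_{repo}",
        "repo_map": f"repo_map_{repo}",
    }


# Example paths are invented for illustration.
for path in ("billing/src/app.py", "gateway/main.go", "README.md"):
    kwd = repo_kwd_for(path)
    print(path, "->", kwd, per_repo_names(kwd))
```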
Layout guidance
| Scenario | Layout |
|---|---|
| One service / product | kb/<repo>/… or flat under kb/ |
| Many microservices | One top-level folder per git repository |
| Monorepo | Treat the monorepo as one repo; rely on outline_path to separate projects |
| Parallel branches | repo-main, repo-release-2026q2, … |
Do not flatten unrelated repos into one undifferentiated pile — repository isolation and cross-ref quality both suffer.
Supported languages
The matrix below is the language × (L3–L6) cheat sheet.
L1–L6 behaves the same for every parser; this section specialises in code-aware + bundled language packs.
Applies when parser_id = code-aware. PDFs / Markdown still use other parsers (L1/L2 or char fallbacks — see last table row).
Where to change parser & index tier (UI)
Without APIs: Combospace → Knowledge bases → open any KB → Configuration in the left rail.

Figure note: UI copy shifts between releases. Capture via Hikari/scripts/capture-kb-configuration-screenshot.py (direct URL under /combospace/combospace/knowledge/dataset/configuration). Increase DOC_CAPTURE_NAV_TIMEOUT_MS, DOC_CAPTURE_SELECTOR_TIMEOUT_MS, or DOC_CAPTURE_SETTLE_MS on cold bundles. Override KB_CAPTURE_PATH_PREFIX if your shell path differs.
Below is what each language unlocks for L3/L4/L5/L6 — aligns with the L1–L6 table.
The four row classes in the legend:
| Layer | Meaning | Used by |
|---|---|---|
| L3 AST chunks | tree-sitter slicing with outline_path | Light & Heavy |
| L4 R8 fields | Structured keyword facets (function_decls, references, …) | Light |
| L5 ast-grep audits | YAML rule hits (unified_search code_search mode=audit) | Light & Heavy |
| L6 SCIP | Compiler-grade graph for cross-TU navigation | Heavy only |
Language matrix
| Language | Extensions | L3 | L4 | L5 audit | L6 SCIP |
|---|---|---|---|---|---|
| Python | .py .pyi | ✓ | ✓ | ✓ | ✓ scip-python (no build file) |
| Java | .java | ✓ | ✓ | ✓ | ✓ scip-java, needs pom.xml / build.gradle* |
| Kotlin | .kt .kts | ✓ | ✓ | ✓ | ✓ path ① build → scip-java; path ② loose emit (no build) |
| Scala | .scala | ✗ | ✗ | ✓ | ✓ scip-java, build required |
| Go | .go | ✓ | ✓ | ✓ | ✗ (indexer not wired) |
| TypeScript / TSX | .ts .tsx | ✓ | ✓ | ✓ | ✗ |
| JavaScript / JSX | .js .jsx .mjs .cjs | ✓ | ✓ | ✓ | ✗ |
| C / C++ | .c .cpp .cc .cxx .h .hpp | ✓ | ✓ | ✓ | ⚠ aarch64 linux lacks scip-clang; cpp_stub fail-soft |
| Rust | .rs | ✓ | ✓ | ✓ | ✗ |
| Ruby / PHP / Swift / Dart / Lua / C# / Elixir | … | ✗ | ✗ | ✓ (ast-grep) | ✗ |
| Other text (md, txt, Dockerfiles, …) | — | char fallback | ✗ | ✗ | ✗ |
How to read it:
- Nine L3-ready languages auto-enter Light tier with symbol payloads out of the box.
- L5 audit spans more languages than L3: you can still run unified_search code_search mode=audit on Scala / Ruby / Swift / etc. even without AST chunks.
- L6 SCIP today covers Python / Java / Kotlin / Scala / C·C++; other L3 languages lean on L4 faceting (~90% of scenarios). Need more? Mail info@nox-lumen.com.
⚠ C/C++ caveat: production ARM64 hosts lack upstream scip-clang binaries, so L6 uses cpp_stub fail-soft; symbol panes still work, yet cross-TU navigation degrades. x86_64 builds are unaffected.
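For scripted checks, the matrix rows that list explicit extensions can be transcribed into a lookup. The sketch below is an assumption-laden illustration: the function name, the host_arch parameter, and the "yes" / "no" / "cpp_stub" labels are invented here, and the rows whose extensions are elided in the matrix are deliberately left out.

```python
import os
from typing import Optional

# (extensions, L3, L4, L5, L6) copied from the matrix rows above.
# "arch" marks the C/C++ row, whose L6 result depends on the host.
_ROWS = [
    ((".py", ".pyi"), True, True, True, "yes"),
    ((".java",), True, True, True, "yes"),
    ((".kt", ".kts"), True, True, True, "yes"),
    ((".scala",), False, False, True, "yes"),
    ((".go",), True, True, True, "no"),
    ((".ts", ".tsx"), True, True, True, "no"),
    ((".js", ".jsx", ".mjs", ".cjs"), True, True, True, "no"),
    ((".c", ".cpp", ".cc", ".cxx", ".h", ".hpp"), True, True, True, "arch"),
    ((".rs",), True, True, True, "no"),
]


def layers_for(filename: str, host_arch: str = "x86_64") -> Optional[dict]:
    """Look up L3-L6 availability by file extension (hypothetical helper).

    Returns None for extensions this sketch does not cover.
    """
    ext = os.path.splitext(filename)[1]
    for exts, l3, l4, l5, l6 in _ROWS:
        if ext in exts:
            if l6 == "arch":
                # ARM64 hosts fall back to cpp_stub; x86_64 is unaffected.
                l6 = "cpp_stub" if host_arch in ("aarch64", "arm64") else "yes"
            return {"L3": l3, "L4": l4, "L5": l5, "L6": l6}
    return None


print(layers_for("engine.cpp", host_arch="aarch64"))
print(layers_for("Build.scala"))
```

Returning None for unlisted extensions keeps the sketch from guessing about languages whose extensions the matrix does not spell out.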
Unsupported language?
Need PHP / Ruby / Swift / … on Light tier? Extend via parser-sdk — usually less than a day:
- Confirm tree-sitter-language-pack already ships the grammar.
- Fork code-aware-parser and append a LanguageConfig for the new language.
- Drop matching queries under the language's query folder, e.g. queries/ruby/.
- Run ragbase-cli skill push.
Full walkthrough: parser-sdk · custom code chunker (includes downloadable reference).
Once published, the language rides the same Zero/Light/Heavy modes as first-party packs (L3/L4 present, L5 audits on, L6 when matrix says so).
Cross-industry vignettes
Automotive (auto)
- Light — everyday ECU reviews, onboarding drills
- Heavy — ASPICE evidence, SOTIF analysis, ECU interface refactors
Smart manufacturing (mfg)
- Light — MES / SCADA code upkeep
- Zero — vendor sample repos you only glance at once
Tier vs retrieval mode
unified_search and code_search only unlock modes the KB actually indexed:
| Tool + mode | Zero | Light | Heavy |
|---|---|---|---|
| code_search(scope="kb", mode="text") (ripgrep) | ✅ | ✅ | ✅ |
| code_search(scope="kb", mode="ast") | ✅ | ✅ | ✅ |
| unified_search.code_search(mode="text") | ✅ | ✅ | ✅ (trigram assist) |
| unified_search.code_search(mode="ast") | ✅ | ✅ | ✅ |
| unified_search.code_search(mode="semantic") | ❌ | ✅ | ✅ |
| unified_search.code_search(mode="navigate") | ❌ | ⚠ ES field heuristics | ✅ SCIP precision |
| unified_search.code_search(mode="hybrid") | ❌ | ✅ | ✅ |
How to read it:
- Pick a mode, verify the KB tier, upgrade in Configuration when needed.
- Upgrades are non-destructive: flip the switch and backfill.
- Precise navigate needs Heavy and languages with L6 ✓ (matrix above).
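A client can guard its calls with the same tier-versus-mode table before issuing a search. The following is a rough sketch, not platform code: the dictionary, the status labels, and both function names are assumptions that simply transcribe the unified_search.code_search rows.

```python
# Status per (mode, tier), transcribed from the table above.
# "heuristic" = ES field heuristics on Light; "precise" = SCIP on Heavy.
MODE_SUPPORT = {
    "text": {"zero": "yes", "light": "yes", "heavy": "yes"},
    "ast": {"zero": "yes", "light": "yes", "heavy": "yes"},
    "semantic": {"zero": "no", "light": "yes", "heavy": "yes"},
    "navigate": {"zero": "no", "light": "heuristic", "heavy": "precise"},
    "hybrid": {"zero": "no", "light": "yes", "heavy": "yes"},
}


def mode_status(tier: str, mode: str) -> str:
    """Return the support status of a retrieval mode on a given tier."""
    return MODE_SUPPORT[mode][tier]


def usable_modes(tier: str) -> list:
    """List every mode that returns results on the tier, precise or not."""
    return [m for m, by_tier in MODE_SUPPORT.items() if by_tier[tier] != "no"]


print(usable_modes("zero"))
print(mode_status("light", "navigate"))
```

Keeping "heuristic" distinct from "precise" lets a caller decide whether a Light-tier navigate answer is good enough or whether the KB should be upgraded to Heavy first.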
Relationship to the code-review SKILL
Indexing is the substrate; the code-review SKILL is the product — it auto-selects retrieval modes based on the active tier.
- PR/MR/CR reviews: Light is enough.
- Full-branch / repo sweeps: Heavy yields depth.
- Disposable repos: Zero for instant runs.
Related reading
- Parser SKILL hub: chunk quality before indexing
- unified_search tool
- code_search tool
- Code-review SKILL