Nox-Lumen Mfg

Code indexing (three tiers)

In one sentence

Code is not prose. You can shove it into a vector database, but questions like “who calls this function?” or “which tests break if I edit this line?” are beyond raw vector search.

The platform offers three tiers of code indexing; the setup cost you accept sets the ceiling on the questions it can answer later.

Indexing ≠ retrieval tools. This page covers how code lands in the knowledge base.
Once it is indexed, query it with unified_search and code_search.
The tier determines which retrieval modes work: Zero allows text/AST grep, Light adds semantic search, and Heavy unlocks precise navigate across files.


Ragbase stack (L1–L6)

Customer-facing Zero / Light / Heavy packaging maps onto the L1–L6 stack below.
Use this table for engineering alignment, proposals, and acceptance — it applies to every parser, not only code-aware.

Tier | Index type | Source | When it activates | What you get
L1 | Text chunks + embeddings | Any parser | Always | Semantic search / similarity recall
L2 | BM25 lexical index (content_ltks, etc.) | Any parser | Always | Keyword search
L3 | AST chunking | tree-sitter | parser_id=code-aware and language marked L3 ✓ in matrix | Function / class / method granular chunks with intact semantics
L4 | R8 keyword fields (function_decls, class_decls, import_decls, unresolved_refs, fqn, outline_path, …) | tree-sitter AST | code-aware and L3 succeeded | “What symbols live in this snippet?” (filter & facet ready)
L5 | audit_findings | ast-grep YAML | code-aware and ast-grep binary available | Static rules (NPE risk, hard-coded secrets, dangerous APIs, …)
L6 | SCIP graph (index.scip) | scip-java, scip-python, tree-sitter emit | code-aware and parser_config.index_tier=heavy and language L6 ✓ | Precise go-to-def / find-refs / blast radius

Rough mapping to Zero / Light / Heavy

  • Zero: uploads allowed, but most files skip L1/L2/L3 materialisation — LLM + grep cover gaps.
  • Light: L1+L2 always on; code-aware adds L3+L4+L5 (heavy not required).
  • Heavy: builds on Light and, for supported languages, emits the L6 SCIP graph (slowest, strongest).
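
The activation conditions in the L1–L6 table can be read as a small decision function. The sketch below is illustrative only: the KbFile class and its field names are hypothetical, it assumes L3 chunking succeeds whenever the language is marked L3 ✓, and it does not model the Zero tier (where most files skip materialisation).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KbFile:
    """Hypothetical description of one file as an indexer might see it."""
    parser_id: str            # e.g. "code-aware" or any other parser
    index_tier: str           # "light" or "heavy" (parser_config.index_tier)
    lang_has_l3: bool         # language marked L3 ✓ in the matrix
    lang_has_l6: bool         # language marked L6 ✓ in the matrix
    ast_grep_available: bool  # ast-grep binary present on the host


def active_layers(f: KbFile) -> list[str]:
    """Return the L1-L6 layers whose activation conditions hold for this file."""
    layers = ["L1", "L2"]  # always on, for any parser
    code_aware = f.parser_id == "code-aware"
    if code_aware and f.lang_has_l3:
        # L4 needs L3 to have succeeded; assumed true here for simplicity.
        layers += ["L3", "L4"]
    if code_aware and f.ast_grep_available:
        layers.append("L5")
    if code_aware and f.index_tier == "heavy" and f.lang_has_l6:
        layers.append("L6")
    return layers


print(active_layers(KbFile("code-aware", "light", True, True, True)))
# ['L1', 'L2', 'L3', 'L4', 'L5']
```

Switching index_tier to "heavy" in the same call would append L6, which is the only difference between Light and Heavy in this reading.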

Three tiers at a glance

Tier | Effort | Time to usable | Good questions | Audience
Zero | Drop a repo and search | Immediate | “Anything about auth?” “Which files mention API X?” | Ad-hoc audits, short-lived repos
Light | Wait for first sync (minutes typical) | Seconds after a file change | “Where is processOrder defined?” “Who calls it?” “What are the imports?” | Day-to-day reviews, onboarding
Heavy | Provide a buildable project (10 min–hours) | After full analysis completes | “Which downstreams break if I change this line?” “What is the inheritance lattice?” “All references across TUs” | Legacy C/C++, security, big refactors

Zero tier — search immediately

No durable index — smart textual tools + the model reading files on demand.

Works well for

  • “What does this repo do overall?”
  • “Find timeout handling snippets”
  • “There was a payment_*.py, locate it”

Weak spots

  • True call graph questions — you only see textual mentions of a name
  • Cross-language or cross-file semantic reasoning

Practical tip: choose Zero when you only need a quick tour — zero wait is the win.
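
To see why textual search cannot answer call-graph questions, consider a minimal sketch of what a grep-style lookup returns. This is plain Python with the standard re module, not the platform's tooling; the sample source and the snake_case name process_order are made up for illustration.

```python
import re

SOURCE = '''\
def process_order(order):
    return order.total

# process_order is deprecated, see billing
def checkout(cart):
    return process_order(cart.order)
'''


def text_mentions(source: str, name: str) -> list[int]:
    """Return 1-based line numbers where `name` appears as a whole word."""
    pattern = re.compile(rf"\b{re.escape(name)}\b")
    return [i for i, line in enumerate(source.splitlines(), start=1)
            if pattern.search(line)]


print(text_mentions(SOURCE, "process_order"))  # [1, 4, 6]
```

The definition (line 1), a comment (line 4), and the real call (line 6) all come back as equal hits; telling them apart is left to the reader or the model.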


Light tier — daily driver

Local syntactic tooling (a tree-sitter-class stack) builds compact per-language symbol indexes and stores them for search.


What changed in M4/R8

  • Per-file increments land in seconds, not minutes-long batch queues
  • Removed the old bulk “wait for rebuild” gate — commit, then query

Great for

  • “Where is processOrder defined and invoked?”
  • “What is the superclass / subclasses?”
  • “How many redis.Pipeline usages exist?”

Still weak when

  • Answers need full type inference (C++ templates, dynamic Python)
  • References cross link units without extra graph work
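
A syntax-level index separates definitions from call sites, which text search cannot do. The sketch below uses Python's built-in ast module as a stand-in for tree-sitter, purely to illustrate the idea; it is not the platform's indexer, and the sample source is made up.

```python
import ast

SOURCE = '''\
def process_order(order):
    return order.total

# process_order is deprecated, see billing
def checkout(cart):
    return process_order(cart.order)
'''


def definitions_and_calls(source: str, name: str) -> dict[str, list[int]]:
    """Split occurrences of `name` into definition lines and call-site lines."""
    found: dict[str, list[int]] = {"defined": [], "called": []}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            found["defined"].append(node.lineno)
        elif (isinstance(node, ast.Call)
              and isinstance(node.func, ast.Name)
              and node.func.id == name):
            # Only bare-name calls are matched; obj.process_order() would need
            # type information to resolve, which syntax alone does not give.
            found["called"].append(node.lineno)
    return found


print(definitions_and_calls(SOURCE, "process_order"))
# {'defined': [1], 'called': [6]}
```

The comment on line 4 is no longer reported, but method calls through an object are still invisible, which mirrors the type-inference limits listed above.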

Heavy tier — audits, refactors, mature C/C++

We run a semi-compilation pass to assemble a typed graph. Highest cost, highest precision.

Ideal for

  • Legacy C/C++ with macros/includes too tangled for syntax-only views
  • Security work (taint / dataflow style questions)
  • Cross-module refactors needing downstream inventory
  • Compliance programs (MISRA, ASPICE traceability) demanding call-chain evidence

Requirements

  • Project must build in our runners (CMake / Bazel / Maven, …)
  • Accept a one-time investment of tens of minutes to hours

Example: on ECU firmware, answer how editing can_send_frame touches specific CAN buses tied to ASIL-tagged requirements — not something Light/Zero can guarantee.
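
A "which downstreams break" question is, at its core, a transitive walk over reverse references. The sketch below shows that walk on a hand-written caller map; the edges and all symbol names other than can_send_frame are hypothetical, and a real Heavy-tier graph would come from the SCIP index rather than a dictionary.

```python
from collections import deque

# Hypothetical reverse-reference edges: callee -> direct callers.
CALLERS = {
    "can_send_frame": ["diag_reply", "nm_heartbeat"],
    "diag_reply": ["uds_dispatch"],
    "nm_heartbeat": ["scheduler_tick"],
    "uds_dispatch": [],
    "scheduler_tick": [],
}


def blast_radius(callers: dict[str, list[str]], changed: str) -> list[str]:
    """Return every symbol that transitively calls `changed`, in BFS order."""
    seen = {changed}
    queue = deque([changed])
    affected = []
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller not in seen:  # guard against cycles and duplicates
                seen.add(caller)
                affected.append(caller)
                queue.append(caller)
    return affected


print(blast_radius(CALLERS, "can_send_frame"))
# ['diag_reply', 'nm_heartbeat', 'uds_dispatch', 'scheduler_tick']
```

The traversal itself is simple; the hard part, and the reason Heavy exists, is producing edges that are correct across translation units.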


Decision checklist

  1. Need call relationships? No → Zero. Yes → at least Light.
  2. Living codebase that changes often? Yes → Light (incremental keeps up). One-off snapshot → Zero can suffice.
  3. Old C/C++, security, or formal compliance? Yes → Heavy; otherwise Light is usually enough.
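
The checklist can be condensed into a small helper. This is one possible reading of the three questions, with question 3 taking precedence; the function and parameter names are made up for illustration.

```python
def choose_tier(needs_call_relationships: bool,
                changes_often: bool,
                legacy_c_security_or_compliance: bool) -> str:
    """Map the three checklist answers to a suggested tier name."""
    if legacy_c_security_or_compliance:
        return "heavy"   # question 3: old C/C++, security, formal compliance
    if needs_call_relationships or changes_often:
        return "light"   # questions 1 and 2: call relationships or a living codebase
    return "zero"        # one-off snapshot with no call-graph needs


print(choose_tier(False, False, False))  # zero
print(choose_tier(True, True, False))    # light
print(choose_tier(True, False, True))    # heavy
```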

One KB, many repos?

The platform is multi-repo by default — one knowledge base can hold several repositories separated by top-level folders:

my-code-kb/
├── repo-vehicle-bsw/
│   ├── src/...
├── repo-mes-platform/
│   └── apps/
└── repo-shared-libs/

Top-level directory names become repo_kwd. Each repo maintains its own:

  • Symbol index bucket (symbol_index_<repo>)
  • Repository map (repo_map_<repo>)
  • Audit hits / reference maps

Why multi-repo matters

Capability | Detail
True per-repo incremental | Editing repo A touches only that file’s shards; B/C stay cold
Cross-repo lookup | Imports from repo B to symbols in repo A resolve in one query
Targeted heavy rebuild | Re-index SCIP for a single repo without taking others offline
Ambiguity resolution | Duplicate UserService names get repo_kwd tags automatically

Single-repo mode

Use a single top-level folder:

my-code-kb/
└── my-only-repo/

You can even flatten files at the KB root — the platform assigns a default repo_kwd. Single-repo is just a degenerate case.
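
The folder-to-repo mapping described in this section can be sketched as follows. Paths are taken relative to the KB root; the value of DEFAULT_REPO_KWD is a placeholder, since the platform's actual default name is not stated here, and both function names are hypothetical.

```python
from pathlib import PurePosixPath

# Placeholder: the real default repo_kwd is not documented on this page.
DEFAULT_REPO_KWD = "default"


def repo_kwd_for(path_in_kb: str) -> str:
    """Derive repo_kwd from a path relative to the KB root."""
    parts = PurePosixPath(path_in_kb).parts
    # A file directly at the KB root has no top-level folder to name it.
    return parts[0] if len(parts) > 1 else DEFAULT_REPO_KWD


def index_names(repo_kwd: str) -> dict[str, str]:
    """Build the per-repo index names following the symbol_index_<repo> pattern."""
    return {
        "symbol_index": f"symbol_index_{repo_kwd}",
        "repo_map": f"repo_map_{repo_kwd}",
    }


print(repo_kwd_for("repo-vehicle-bsw/src/can.c"))  # repo-vehicle-bsw
print(repo_kwd_for("main.py"))                     # default
print(index_names("repo-shared-libs")["symbol_index"])  # symbol_index_repo-shared-libs
```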

Layout guidance

Scenario | Layout
One service / product | kb/<repo>/… or flat under kb/
Many microservices | One top-level folder per git repository
Monorepo | Treat the monorepo as one repo; rely on outline_path to separate projects
Parallel branches | repo-main, repo-release-2026q2, …

Do not flatten unrelated repos into one undifferentiated pile — repository isolation and cross-ref quality both suffer.


Supported languages

The matrix below is the language × (L3–L6) cheat sheet.
The L1–L6 stack applies to every parser; this section focuses on code-aware and its bundled language packs.

Applies when parser_id = code-aware. PDFs / Markdown still use other parsers (L1/L2 or char fallbacks — see last table row).

Where to change parser & index tier (UI)

Without APIs: Combospace → Knowledge bases → open any KB → Configuration in the left rail.


Configuration tab — chunk method (code-aware open) and index tier (demo KB robot-code)

Figure note: UI copy shifts between releases. Capture via Hikari/scripts/capture-kb-configuration-screenshot.py (direct URL under /combospace/combospace/knowledge/dataset/configuration). Increase DOC_CAPTURE_NAV_TIMEOUT_MS, DOC_CAPTURE_SELECTOR_TIMEOUT_MS, or DOC_CAPTURE_SETTLE_MS on cold bundles. Override KB_CAPTURE_PATH_PREFIX if your shell path differs.

Below is what each language unlocks for L3/L4/L5/L6 — aligns with the L1–L6 table.

The four row classes in the legend:

Layer | Meaning | Used by
L3 AST chunks | tree-sitter slicing with outline_path | Light & Heavy
L4 R8 fields | Structured keyword facets (function_decls, references, …) | Light
L5 ast-grep audits | YAML rule hits (unified_search code_search mode=audit) | Light & Heavy
L6 SCIP | Compiler-grade graph for cross-TU navigation | Heavy only

Language matrix

Language | Extensions | L3 | L4 | L5 audit | L6 SCIP
Python | .py .pyi |  |  |  | scip-python (no build file)
Java | .java |  |  |  | scip-java, needs pom.xml / build.gradle*
Kotlin | .kt .kts |  |  |  | ✓ path ① build → scip-java; path ② loose emit (no build)
Scala | .scala |  |  |  | scip-java, build required
Go | .go |  |  |  | ✗ (indexer not wired)
TypeScript / TSX | .ts .tsx |  |  |  |
JavaScript / JSX | .js .jsx .mjs .cjs |  |  |  |
C / C++ | .c .cpp .cc .cxx .h .hpp |  |  |  | ⚠ aarch64 linux lacks scip-clang; cpp_stub fail-soft
Rust | .rs |  |  |  |
Ruby / PHP / Swift / Dart / Lua / C# / Elixir |  |  |  | ✓ (ast-grep) |
Other text (md, txt, Dockerfiles, …) |  | char fallback |  |  |

How to read it:

  1. Nine L3-ready languages auto-enter Light tier with symbol payloads out of the box.
  2. L5 audit spans more languages than L3 — you can still run unified_search code_search mode=audit on Scala / Ruby / Swift / etc. even without AST chunks.
  3. L6 SCIP today covers Python / Java / Kotlin / Scala / C·C++; other L3 languages lean on L4 faceting, which covers roughly 90% of scenarios. Need more? Mail info@nox-lumen.com.

C/C++ caveat: production ARM64 hosts lack upstream scip-clang binaries, so L6 uses cpp_stub fail-soft — symbol panes still work, yet cross-TU navigation degrades. x86_64 builds are unaffected.
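
The L6 column and the C/C++ caveat can be summarised as a lookup with one architecture-dependent fallback. The sketch below is a simplification under stated assumptions: the table and function names are hypothetical, build-file requirements and Kotlin's loose-emit path are not modelled, and the machine string is assumed to follow platform.machine() conventions ("aarch64" on ARM64 Linux).

```python
from typing import Optional

# L6 indexer per language, as read from the matrix (hypothetical table name).
L6_INDEXERS = {
    "python": "scip-python",
    "java": "scip-java",
    "kotlin": "scip-java",
    "scala": "scip-java",
    "c": "scip-clang",
    "cpp": "scip-clang",
}


def l6_indexer(language: str, machine: str) -> Optional[str]:
    """Return the indexer expected to produce L6 data, or None if L6 is unavailable."""
    indexer = L6_INDEXERS.get(language)
    if indexer == "scip-clang" and machine == "aarch64":
        # No upstream scip-clang binary on ARM64 Linux: fail soft to the stub.
        return "cpp_stub"
    return indexer


print(l6_indexer("python", "x86_64"))  # scip-python
print(l6_indexer("cpp", "aarch64"))    # cpp_stub
print(l6_indexer("go", "x86_64"))      # None
```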

Unsupported language?

Need PHP / Ruby / Swift / … on Light tier? Extend via parser-sdk — usually less than a day:

  1. Confirm tree-sitter-language-pack already ships the grammar.

  2. Fork code-aware-parser, append a LanguageConfig, e.g.

    _RUBY = LanguageConfig(
        language="ruby",
        file_extensions=(".rb",),
        atomic_nodes=frozenset({"method", "class", "module"}),
        scope_nodes=frozenset({"class", "module", "method"}),
        import_nodes=frozenset({"call"}),
    )
  3. Drop matching queries under queries/ruby/.

  4. ragbase-cli skill push.

Full walkthrough: parser-sdk · custom code chunker (includes downloadable reference).

Once published, the language rides the same Zero/Light/Heavy modes as first-party packs (L3/L4 present, L5 audits on, L6 when matrix says so).


Cross-industry vignettes

Automotive (auto)

  • Light — everyday ECU reviews, onboarding drills
  • Heavy — ASPICE evidence, SOTIF analysis, ECU interface refactors

Smart manufacturing (mfg)

  • Light — MES / SCADA code upkeep
  • Zero — vendor sample repos you only glance at once

Tier vs retrieval mode

unified_search and code_search only unlock modes the KB actually indexed:

Tool + mode | Zero | Light | Heavy
code_search(scope="kb", mode="text") (ripgrep)
code_search(scope="kb", mode="ast")
unified_search.code_search(mode="text")✅ (trigram assist)
unified_search.code_search(mode="ast")
unified_search.code_search(mode="semantic")
unified_search.code_search(mode="navigate") |  | ⚠ ES field heuristics | ✅ SCIP precision
unified_search.code_search(mode="hybrid")

How to read it:

  • Pick a mode, verify the KB tier, upgrade in Configuration when needed.
  • Upgrades are non-destructive — flip the switch and backfill.
  • Precise navigate needs Heavy and languages with L6 ✓ (matrix above).
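
The navigate rules above can be sketched as a small selector. This is an interpretation, not the platform's dispatch logic: the function name and return labels are hypothetical, and the fallback to field heuristics on Heavy for languages without L6 is an assumption based on the note that other L3 languages lean on L4 faceting.

```python
from typing import Optional


def navigate_backend(tier: str, language_has_l6: bool) -> Optional[str]:
    """Pick what would answer mode="navigate" for a given tier and language."""
    if tier == "heavy" and language_has_l6:
        return "scip"                 # precise, compiler-grade references
    if tier in {"light", "heavy"}:
        return "es-field-heuristics"  # approximate, based on indexed keyword fields
    return None                       # Zero: no index to navigate


print(navigate_backend("heavy", True))   # scip
print(navigate_backend("light", True))   # es-field-heuristics
print(navigate_backend("zero", False))   # None
```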

Relationship to the code-review SKILL

Indexing is the substrate; the code-review SKILL is the product — it auto-selects retrieval modes based on the active tier.

  • PR/MR/CR reviews: Light is enough.
  • Full-branch / repo sweeps: Heavy yields depth.
  • Disposable repos: Zero for instant runs.
