Nox-Lumen Mfg

Case study: Full-branch heavy index vs zero-index

Background

A customer piloted Kotlin review on Gitee repo uu5208/simpler-robot, branch v4-dev, running two paths in parallel:

  • A. Heavy-index deep review: create KB (code-aware, index_tier=heavy) → import → parse → audit batch → sample semantic / navigate checks → HTML report
  • B. Zero-index quick pass: no KB; read sources with gitee_get_tree + gitee_get_file_contents → review per code-review dimensions → JSON / MD / HTML

These are the two canonical shapes for code review on the platform—useful for architecture / procurement decisions.


Side-by-side

Dimension | A. Heavy-index deep | B. Zero-index quick
Coverage | Whole simbot-cores/ subtree (97 files / 529 KB) | Four core .kt files (~31 KB)
Setup | KB create + import + parse (minutes to low tens of minutes) | Credentials + tree fetch (seconds)
Retrieval | audit / hybrid / navigate / semantic | LLM reads files only
Typical questions | “What high findings exist repo-wide?” “Who calls this symbol?” | “What’s wrong in these four files?”
Reuse | KB persists; each later review reuses the index (incremental sync) | Re-fetch every run
Artifacts | HTML dashboard + audit hit list + key call chains | JSON + Markdown + HTML
Observed findings (B run) | n/a (A still parsing) | 17 (fatal:0 / severe:3 / general:8 / minor:6)
Top risks (B) | n/a | ① coroutine scope leak risk; ② missing safety checks in event dispatcher; ③ interface dependency direction mismatch

A goes deeper but costs setup time; B is near-instant but inspects only the files you tell the LLM to open. Pick based on product needs; see “When to pick A vs B”.


When to pick A vs B

Need | Path
Pre-release whole-repo inventory | A
Onboarding / compliance health check for a new repo | A
Long-lived “review → feedback → LTM” loop | A
Spot-check a vendor drop or a handful of files | B
Huge repo or constrained network where KB cost is too high | B
No PR/MR but need a fast HTML report | B (lowest cost)

A. Heavy-index — prompt template

Scenario C (full-branch). Let it run end-to-end—heavy indexing may take 10–15 minutes; have the agent poll without pausing for “continue?” prompts.

Run a full-repo code review (self-hosted KB → import → parse → L2 → HTML).

Repo: https://gitee.com/<your-org>/<your-repo>.git (branch=<your-branch>)
Token: <REDACTED — inject via manage_credential, never put in plaintext prompts>

Pipeline (do not pause mid-flight—finish all steps):

1. manage_credential store gitee token (if unset)

2. kb_create:
   - name: "<repo>-review-<timestamp>"
   - parser_id: "code-aware"
   - parser_config: {"index_tier": "heavy"}
   Capture kb_id—use downstream immediately.

3. import_from_gitee dryrun → execute back-to-back:
   - url, kb_id, mode="dryrun"
   - values={"ref":"<branch>","sub_path":"<sub-path>"}
   - file_globs=["*.kt"], limit=50
   After dry-run counts + dryrun_id, rerun same args with mode="execute", parse_intent="auto".

4. Wait for parsing (~10–15 min is normal—do not abort):
   - Poll kb_doc_status every 30s (no tight loops)
   - Done when kb_doc_list shows docs[].status all done
   - Hierarchy check: locations should include "<sub-path>/..." prefixes (field `location`, not just `name`)

5. code-review L2 (scenario C, skip gitee_get_tree—use KB only):
   a. unified_search code_search mode=audit, kb_ids=[new KB], top_k=50
      aggregate high/medium by rule_id
   b. mode=hybrid top_k=10 sample queries:
      - hard-coded password token secret credential
      - SQL concatenation injection
      - random crypto weak cipher
      - leaked resources leak close cleanup
      - thread safety concurrency data race
   c. Navigate key audit symbols → mode=navigate query="refs:<symbol>"
   d. Skip KB requirement lookup (none bound) and skip LTM

6. HTML report write_file → outputs/code_review_visual_report.html:
   - Overview: file counts, langs, kb_id + name
   - Audit table: rule_id × severity × count
   - Top 10 risks: file:line / snippet / fix / confidence
   - Navigate call-graph excerpts

Guardrails:
- Do not repeatedly kb_parse queued docs—it duplicates tasks
- Any subjective judgement must reconcile with code_search; audit findings are ground truth
- No interim “progress reports”—finish steps 1–6 then summarize
- Never mutate upstream repo
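
Step 4 of the template asks for fixed-interval polling rather than a tight loop. A minimal Python sketch of that wait is shown below. It assumes a hypothetical `fetch_statuses` callable that returns the current `docs[].status` values (for example by wrapping kb_doc_list); the 20-minute timeout is also an assumption, chosen to sit above the 10–15 minute range quoted above.

```python
import time
from typing import Callable, Iterable


def wait_until_parsed(
    fetch_statuses: Callable[[], Iterable[str]],
    interval_s: float = 30.0,
    timeout_s: float = 20 * 60,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until every document reports "done"; return the number of polls.

    fetch_statuses is a hypothetical callable supplied by the caller; it is
    not a platform API. sleep is injectable so the loop can be tested.
    """
    waited = 0.0
    polls = 0
    while True:
        statuses = list(fetch_statuses())
        polls += 1
        # An empty list means nothing has been imported yet, so keep waiting.
        if statuses and all(s == "done" for s in statuses):
            return polls
        if waited >= timeout_s:
            raise TimeoutError(
                f"parsing not finished after {waited:.0f}s: {statuses}"
            )
        sleep(interval_s)  # fixed interval, never a tight loop
        waited += interval_s
```

The loop only reads status. It never re-submits documents, which matches the guardrail against calling kb_parse again on queued docs.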

B. Zero-index — prompt template

Use this when the request is “I only need these files; skip KB uploads.” No kb_create / kb_upload / kb_parse / import_from_gitee; only gitee_get_*.

code-review scenario C full-branch review without KB ingestion.

Repo: https://gitee.com/<your-org>/<your-repo>
Branch: <branch>
Scope: *.kt under <sub-path>/<...>
Token: <REDACTED — manage_credential scm/gitee/tenant>

Steps:

1. gitee_get_tree(owner="<owner>", repo="<repo>", ref="<branch>", recursive=True)
   filter target .kt files, max 10

2. gitee_get_file_contents(...) per file — full text

3. Walk code-review Step-3 dimensions per file:
   - code_standard naming/comments/SRP
   - security auth/secrets swallowed exceptions
   - performance blocking IO / N+1 / concurrency hazards
   - design_consistency contracts / dependency direction

4. Skip Step 2 (KB req search) & Step 4 (LTM) — no KB bound

5. Emit JSON + Markdown + HTML under outputs/; every finding needs reasoning + confidence

Rules:
- Never call kb_create / kb_upload / kb_parse / import_from_gitee — gitee reads only
- Paste raw errors verbatim
- Finish with severity histogram + titles for top 3 findings
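
Step 1 of the template narrows the tree to at most 10 Kotlin files under the sub-path. The sketch below shows one way to do that filtering in Python. It assumes the tree result has already been reduced to a list of file path strings; the actual response shape of gitee_get_tree is not documented here, and the sample paths are placeholders.

```python
from typing import Iterable, List


def select_kotlin_files(
    paths: Iterable[str], sub_path: str, limit: int = 10
) -> List[str]:
    """Return up to `limit` .kt paths under sub_path, in sorted order."""
    # Trailing slash prevents "core" from also matching "corex/...".
    prefix = sub_path.strip("/") + "/"
    matches = sorted(
        p for p in paths if p.startswith(prefix) and p.endswith(".kt")
    )
    return matches[:limit]


# Placeholder paths, not taken from the reviewed repository.
sample = ["core/src/B.kt", "core/src/A.kt", "core/README.md", "other/C.kt"]
print(select_kotlin_files(sample, "core"))
```

Sorting before truncating keeps the selection stable between runs, which matters because path B re-fetches the sources every time.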

Ground truth (complete B-path run)

Item | Value
Files reviewed | 4
Total findings | 17
Severity mix | fatal:0 · severe:3 · general:8 · minor:6
Top 3 | ① coroutine scope leak; ② dispatcher safety gaps; ③ interface dependency inversion
Artifacts | JSON / Markdown / HTML
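
The severity mix can be recomputed from the JSON artifact as a sanity check. The Python sketch below assumes each finding is a dict with a `severity` field; the artifact's real schema is not shown on this page. The findings are placeholders that only mirror the reported counts.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping

SEVERITY_ORDER = ["fatal", "severe", "general", "minor"]


def severity_histogram(findings: Iterable[Mapping[str, str]]) -> Dict[str, int]:
    """Count findings per severity, keeping levels with zero hits."""
    counts = Counter(f["severity"] for f in findings)
    return {level: counts.get(level, 0) for level in SEVERITY_ORDER}


def format_histogram(hist: Mapping[str, int]) -> str:
    """Render the histogram in the 'fatal:0 / severe:3 / ...' style."""
    return " / ".join(f"{level}:{n}" for level, n in hist.items())


# Placeholder findings that reproduce the B-run counts (3 + 8 + 6 = 17).
findings = (
    [{"severity": "severe"}] * 3
    + [{"severity": "general"}] * 8
    + [{"severity": "minor"}] * 6
)
hist = severity_histogram(findings)
print(sum(hist.values()), format_histogram(hist))
```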

Path A imported 97 files / 529 KB; deep audit/navigate outputs land once parsing completes—the upside is index once, reuse forever.
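
Once Path A's audit hits are available, step 5a of the heavy-index template asks for high/medium hits aggregated by rule_id, and step 6 reports them as rule_id × severity × count. A minimal Python sketch of that aggregation follows. The hit shape (dicts with `rule_id` and `severity`) and the rule names are assumptions for illustration, not the platform's actual output.

```python
from collections import Counter
from typing import Iterable, List, Mapping, Sequence, Tuple


def audit_table(
    hits: Iterable[Mapping[str, str]],
    keep: Sequence[str] = ("high", "medium"),
) -> List[Tuple[str, str, int]]:
    """Aggregate audit hits into (rule_id, severity, count) rows."""
    counts = Counter(
        (h["rule_id"], h["severity"]) for h in hits if h["severity"] in keep
    )
    rows = [(rule, sev, n) for (rule, sev), n in counts.items()]
    # Most frequent first; ties broken by rule_id, then severity.
    rows.sort(key=lambda r: (-r[2], r[0], r[1]))
    return rows


# Hypothetical hits; rule ids are made up for the example.
hits = [
    {"rule_id": "rule-b", "severity": "high"},
    {"rule_id": "rule-a", "severity": "medium"},
    {"rule_id": "rule-b", "severity": "high"},
    {"rule_id": "rule-c", "severity": "low"},
]
for row in audit_table(hits):
    print(row)
```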


Security reminder

Never paste PATs, passwords, or API keys into prompts.

  1. Store secrets with the credential tool: manage_credential store …
  2. Later prompts reference purpose scopes only (scm/gitee/tenant); runtime injects secrets
  3. Tokens stay out of LLM context and session logs

The Token: lines in the templates are placeholders—replace with “already stored via manage_credential” in real use.
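
One way to enforce this before a template is sent is to rewrite every Token: line mechanically. The Python sketch below is a hypothetical helper, not part of the platform. It assumes the template keeps the token on a single line beginning with `Token:` and uses the scm/gitee/tenant scope from the templates as its default.

```python
import re

# Matches a whole "Token: ..." line anywhere in a multi-line prompt.
TOKEN_LINE = re.compile(r"^Token:.*$", re.MULTILINE)


def reference_stored_credential(prompt: str, scope: str = "scm/gitee/tenant") -> str:
    """Replace each Token: line with a reference to a stored credential scope."""
    replacement = f"Token: already stored via manage_credential ({scope})"
    # A function replacement avoids any backslash interpretation in `scope`.
    return TOKEN_LINE.sub(lambda _match: replacement, prompt)


prompt = "Repo: <repo-url>\nToken: abc123-placeholder\nBranch: <branch>"
print(reference_stored_credential(prompt))
```

This only covers tokens on the Token: line. Secrets pasted elsewhere in a prompt would need a separate check.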

