Parsing (Parser SKILL)

TL;DR

Every file that lands in the knowledge base is first read, chunked, and enriched by a parser. Previously this logic was hard-coded for roughly 14 formats in an if-elif chain, so adding a new format meant patching core code.

Now parsers are pluggable SKILLs. We ship several built-ins; customers can ship their own on equal footing.
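
The move from an if-elif chain to pluggable parsers can be pictured as a registry keyed by format. The following is a minimal sketch of that idea, not the platform's actual code; `PARSERS`, `register_parser`, and `parse` are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical registry: maps a format name to a parser callable.
Parser = Callable[[str], List[str]]
PARSERS: Dict[str, Parser] = {}


def register_parser(fmt: str) -> Callable[[Parser], Parser]:
    """Decorator that adds a parser to the registry under `fmt`."""
    def decorator(func: Parser) -> Parser:
        PARSERS[fmt] = func
        return func
    return decorator


@register_parser("markdown")
def parse_markdown(text: str) -> List[str]:
    # Built-in style parser: one chunk per blank-line-separated paragraph.
    return [p.strip() for p in text.split("\n\n") if p.strip()]


@register_parser("workorder")
def parse_workorder(text: str) -> List[str]:
    # Customer-style parser: one chunk per non-empty line.
    return [line.strip() for line in text.splitlines() if line.strip()]


def parse(fmt: str, text: str) -> List[str]:
    """Dispatch through the registry instead of an if-elif chain."""
    try:
        return PARSERS[fmt](text)
    except KeyError:
        raise ValueError(f"no parser registered for format {fmt!r}") from None
```

In this sketch a customer parser registers exactly the way a built-in does, which is one way to read "on equal footing": the dispatcher never needs to know which is which.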

Problems this solves

Pain you might have hit	What changes
Patent drawings / CAD / engineering screenshots arrive as blobs with nothing to search	Build a parser SKILL for your drawing layout; extract part numbers / labels into structured chunks
Proprietary contracts or work orders split poorly — clauses cut mid-thought	Build an index-template SKILL anchored on headings so chunks follow your doc structure
A PDF mixes prose, schematics, and tables; flattening everything to text loses the figures	VLM prompt packs are SKILLs, so you can swap prompts per scenario (circuit, flowchart, industrial)
Lark docs / SCM repos / webhook payloads cannot land in the knowledge base	Build a Source SKILL that syncs those systems into ingestion
You want another chunking policy without touching core	Fork the nearest built-in and adjust hooks / YAML
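
As an illustration of the heading-anchored chunking mentioned in the table above, here is a small sketch that starts a new chunk at every numbered clause heading. The heading pattern and the `chunk_by_heading` name are assumptions made for this example; a real index-template SKILL would anchor on whatever your documents actually use.

```python
import re
from typing import Dict, List

# Assumption: clause headings look like "1. Scope" or "2.3 Payment terms".
HEADING = re.compile(r"^\d+(?:\.\d+)*\.?\s+\S")


def chunk_by_heading(text: str) -> List[Dict[str, str]]:
    """Start a new chunk at every heading so no clause is cut mid-thought."""
    chunks: List[Dict[str, str]] = []
    heading, body = "", []

    def flush() -> None:
        # Emit the chunk collected so far, skipping a completely empty one.
        if heading or any(line.strip() for line in body):
            chunks.append({"heading": heading, "body": "\n".join(body).strip()})

    for line in text.splitlines():
        if HEADING.match(line):
            flush()
            heading, body = line.strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Text before the first heading becomes a chunk with an empty heading. The pattern is deliberately naive: a body line that happens to start with a number and a space would be treated as a heading, so a production template would need a stricter anchor.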

Three registries: SKILLs attach at startup

At startup the platform scans `builtin/` plus installed SKILL manifests and registers parsers, sources, and VLM prompt packs, one registry for each. Adding a parser SKILL package requires no core edits and no service restart.
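
A manifest scan of this kind might look like the sketch below. Everything about the manifest is assumed for illustration: the `skill.json` file name, its `name` and `kind` fields, the three kind labels, and the rule that later roots override earlier ones. The real manifest format is not described on this page.

```python
import json
from pathlib import Path
from typing import Dict

# Hypothetical manifest layout: <root>/<skill>/skill.json with "name" and "kind".
KINDS = ("parser", "source", "vlm_pack")


def scan_skills(*roots: Path) -> Dict[str, Dict[str, dict]]:
    """Build one registry per kind from every manifest found under `roots`."""
    registries: Dict[str, Dict[str, dict]] = {kind: {} for kind in KINDS}
    for root in roots:
        for manifest_path in sorted(root.glob("*/skill.json")):
            manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
            kind = manifest.get("kind")
            if kind not in registries:
                continue  # skip manifests of an unknown kind
            # Design choice of this sketch: a later root replaces an
            # earlier entry that has the same name.
            registries[kind][manifest["name"]] = manifest
    return registries
```

Calling `scan_skills(Path("builtin"), Path("installed"))` would return three dictionaries, one each for parsers, sources, and VLM prompt packs.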

How customers extend (three difficulty tiers)

Who	What	Outcome
Engineers comfortable coding	Ship a CLI + manifest using our SDK	Standalone SKILL for marketplace / private tenant
Engineers scripting	Fork nearest SKILL; tweak two hooks	Modified SKILL drop-in
Ops configuring YAML only	Fork built-in; override `values.yaml`	Config overlay

Heavy users rarely start from zero — our scaffold handles CLI boilerplate so you focus on reading files and emitting chunks.
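
For the YAML-only tier, overriding `values.yaml` can be thought of as a deep merge of customer values over the built-in defaults. The sketch below uses plain dictionaries in place of parsed YAML; the merge semantics and every key name (`chunk`, `max_tokens`, `overlap`, `ocr`) are assumptions for illustration, not the platform's documented schema.

```python
from copy import deepcopy
from typing import Any, Dict


def overlay(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return `base` with `override` merged on top; nested dicts merge key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = overlay(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical defaults from a built-in SKILL and a customer overlay.
builtin_values = {"chunk": {"max_tokens": 512, "overlap": 64}, "ocr": False}
customer_values = {"chunk": {"max_tokens": 256}, "ocr": True}
effective = overlay(builtin_values, customer_values)
```

Here `effective` keeps the built-in `overlap` while taking the customer's `max_tokens` and `ocr`, and the built-in defaults themselves are left unmodified.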

Industry examples

Patent / IP (xIPnex)

Disclosure SKILL: pulls disclosure sections / draft claims / drawing captions with semantic chunks
Office action SKILL: links examiner remarks to cited spec paragraphs
Drawing OCR prompt pack: tailored VLM prompt for annotation-heavy figures

Automotive OEM / Tier1 (auto)

DOORS SKILL: ReqIF ingest chunked by module; trace attributes retained
AUTOSAR ARXML SKILL: chunk by SWC / port / interface for searchable units
MISRA report SKILL: map violations → line ↔ rule ↔ severity for review citations

Smart manufacturing / plant (mfg)

CAD SKILL (DWG/STEP): extract part IDs / mates / critical dims into structured search
Work-order SKILL: chunk by operation; lift cycle time / equipment / material fields
Quotation SKILL: structured rows for item / unit price / qty / total for downstream pricing bots

Code as documents: the `code-aware` parser

Code should not be split blindly like plain text, because it has structure. The built-in `code-aware` parser (M5) chunks along semantic units and emits the following fields:

Fields emitted	Why it matters
`outline_path` (e.g. `pkg.module.Class.method`)	Retrieval shows exactly which class/method owns the snippet
`function_decls` / `class_decls` / `import_decls`	Definitions / hierarchy / imports become first-class queries
`references` (resolved vs unresolved)	Answers "who calls this symbol?": references inside the translation unit (TU) are resolved; unresolved ones wait for heavy-tier SCIP indexing
`siblings_signatures`	Gives adjacent methods so assistants stay stylistically aligned
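
To make fields such as `outline_path` and `siblings_signatures` concrete, the sketch below chunks Python source with the standard-library `ast` module. It only approximates the idea: the built-in parser's implementation is not shown on this page, and the `outline_chunks` and `signature` helpers are hypothetical. Only plain `def` functions are handled here.

```python
import ast
from typing import Dict, List


def signature(node: ast.FunctionDef) -> str:
    """Render a short signature from the positional argument names."""
    args = ", ".join(a.arg for a in node.args.args)
    return f"{node.name}({args})"


def outline_chunks(source: str, module: str) -> List[Dict[str, object]]:
    """Emit one chunk per function or method, tagged with its outline path."""
    chunks: List[Dict[str, object]] = []

    def visit(body: List[ast.stmt], prefix: str) -> None:
        funcs = [n for n in body if isinstance(n, ast.FunctionDef)]
        for node in body:
            if isinstance(node, ast.ClassDef):
                visit(node.body, f"{prefix}.{node.name}")
            elif isinstance(node, ast.FunctionDef):
                chunks.append({
                    "outline_path": f"{prefix}.{node.name}",
                    # Signatures of the other functions at the same level.
                    "siblings_signatures": [signature(f) for f in funcs if f is not node],
                    "text": ast.get_source_segment(source, node),
                })

    visit(ast.parse(source).body, module)
    return chunks
```

Given a class `Greeter` with methods `hello` and `bye` in module `pkg.module`, the chunk for `hello` would carry the outline path `pkg.module.Greeter.hello` and list `bye(self)` as a sibling.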

Nine languages ship with L3 (AST chunks) and L4 (keyword fields): Python, Java, Kotlin, Go, TypeScript, JavaScript, C, C++, and Rust. L5 ast-grep audit coverage is broader (Scala / Ruby / PHP / Swift / Dart / Lua / C# / Elixir, …). For the full language × capability matrix, see Code indexing · supported languages.

Selecting `code-aware` in the console

Open any knowledge base → Configuration tab in the sidebar → Chunk method dropdown → choose code-aware, then pick Code index tier (light vs heavy).

Figure: knowledge base Configuration tab with the chunk-method dropdown expanded, showing parser options including code-aware.

The field mapping (`parser_config.index_tier`, etc.) is documented in Code indexing · where to configure.
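
As a rough picture of what the console selection could translate to, the sketch below builds a configuration payload. Only `parser_config.index_tier` and the tier names light and heavy come from this page; the `chunk_method` key, the overall payload shape, and the `code_aware_config` helper are hypothetical, so check the Code indexing page for the real mapping.

```python
from typing import Any, Dict

# Tier names come from the console ("light" vs "heavy"); the payload shape is assumed.
ALLOWED_TIERS = ("light", "heavy")


def code_aware_config(index_tier: str = "light") -> Dict[str, Any]:
    """Build a hypothetical knowledge-base config that selects code-aware."""
    if index_tier not in ALLOWED_TIERS:
        raise ValueError(f"index_tier must be one of {ALLOWED_TIERS}, got {index_tier!r}")
    return {
        "chunk_method": "code-aware",  # hypothetical key for the Chunk method dropdown
        "parser_config": {"index_tier": index_tier},
    }
```

Validating the tier up front keeps a typo from silently falling back to some default tier.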

Need another language on L3/L4? With parser-sdk it can land in hours. For full cross-compilation SCIP coverage, email info@nox-lumen.com.

How this ties to the three tiers

Chunk quality directly controls whether light-tier queries resolve references instantly — treat code-aware as the quality gate before indexing.

When to fork a code parser

Scenario	Reason
House DSL (G-code, ChemDraw scripting, insurer rule DSL)	No grammar in stock tree-sitter pack
Private schema (IDL / protobuf dialect / ARXML subset)	Built-ins cannot infer your schema
Mixed literate notebooks	You need prose + code boundaries your way

Fork `code-aware`, then swap in your grammar and the two hooks (`extract_symbols`, `extract_references`). Persistence and field plumbing stay platform-managed.
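
The two hooks could look roughly like this for a toy rule DSL. The hook names come from this page, but their signatures, the return shapes, and the DSL itself (`rule <name>:` defines a rule, `call <name>` references one) are invented for illustration. A real fork would plug in a proper grammar rather than regular expressions.

```python
import re
from typing import Dict, List

# Toy rule DSL, invented for illustration:
# "rule <name>:" defines a rule, "call <name>" references one.
RULE_DEF = re.compile(r"^rule\s+(\w+)\s*:", re.MULTILINE)
RULE_CALL = re.compile(r"\bcall\s+(\w+)")


def extract_symbols(source: str) -> List[str]:
    """Hook 1: names defined in this file."""
    return RULE_DEF.findall(source)


def extract_references(source: str) -> Dict[str, List[str]]:
    """Hook 2: referenced names, split by whether this file defines them."""
    defined = set(extract_symbols(source))
    refs: Dict[str, List[str]] = {"resolved": [], "unresolved": []}
    for name in RULE_CALL.findall(source):
        bucket = "resolved" if name in defined else "unresolved"
        if name not in refs[bucket]:  # keep each name once, in first-seen order
            refs[bucket].append(name)
    return refs
```

References to rules defined in the same file land in `resolved`; anything else lands in `unresolved`, mirroring the resolved versus unresolved split of the `references` field.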

Stable built-in parsers (high level)

Class	Built-in SKILL	Files or structure handled
General docs	`manual` / `naive` / `paper`	PDF / DOC / PPT / Markdown
Regulations	`laws`	Clause / chapter aware
Tables	`table`	Excel / CSV / HTML tables
Resumes	`resume`	Education / jobs / skills blocks
Images	`picture` + VLM prompt packs	Raster / screenshots / diagrams
Code	`code-aware`	Python / Java / C++ / Kotlin / Go … with AST-aware chunks

Where to start

Big picture: read What is a SKILL?
Hands-on: Your first custom SKILL
Reference source: combo-skills on GitHub

SKILL system overview
Code indexing — retrieval consumes parser output
Custom SKILL tutorial
