Parsing (Parser SKILL)
TL;DR
Every file landing in the knowledge base is first read, chunked, and enriched by a parser.
Previously this was hard-coded as an if-elif chain covering ~14 formats; adding a new format meant patching core code.
Now parsers are pluggable SKILLs. We ship several built-ins; customers can ship their own on equal footing.
Problems this solves
| Pain you might have hit | What changes |
|---|---|
| Patent drawings / CAD / engineering screenshots arrive as blobs with nothing to search | Build a parser SKILL for your drawing layout; extract part numbers / labels into structured chunks |
| Proprietary contracts or work orders split poorly — clauses cut mid-thought | Build an index-template SKILL anchored on headings so chunks follow your doc structure |
| A PDF mixes prose, schematics, and tables; flattening everything to text loses the figures | VLM prompt packs are SKILLs; swap prompts per scenario (circuit, flowchart, industrial) |
| Lark docs / SCM repos / webhook payloads cannot land in the KB | Build a Source SKILL that syncs those systems into ingestion |
| You want another chunking policy without touching core | Fork the nearest built-in and adjust hooks / YAML |
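To make the heading-anchored row above concrete, here is a minimal sketch of chunking that breaks only at clause headings, so no clause is cut mid-thought. The heading pattern, the `Chunk` type, and `chunk_by_headings` are all hypothetical names for illustration; they are not the SDK's API.

```python
import re
from dataclasses import dataclass

# Hypothetical heading pattern: numbered clauses such as "1." or "2.3.",
# or "Article 4". A real template would be tuned to your documents.
HEADING_RE = re.compile(r"^(?:\d+(?:\.\d+)*\.|Article\s+\d+)\s+\S")


@dataclass
class Chunk:
    heading: str
    body: str


def chunk_by_headings(text: str) -> list[Chunk]:
    """Split text into one chunk per heading; text before the first heading is the preamble."""
    chunks: list[Chunk] = []
    heading, lines = "(preamble)", []
    for raw in text.splitlines():
        line = raw.strip()
        if HEADING_RE.match(line):
            # Close the previous chunk before starting a new one.
            if lines or heading != "(preamble)":
                chunks.append(Chunk(heading, "\n".join(lines).strip()))
            heading, lines = line, []
        else:
            lines.append(raw)
    if lines or heading != "(preamble)":
        chunks.append(Chunk(heading, "\n".join(lines).strip()))
    return chunks
```

Calling `chunk_by_headings` on a contract whose clauses start with "1. Scope" and "2. Payment" yields one chunk per clause, each carrying its heading for retrieval.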
Three registries: SKILLs attach at startup
At startup the platform scans builtin/ plus installed SKILL manifests and registers parsers, sources, and VLM packs. Adding a parser SKILL package requires no core edits and no service restart.
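The scan-and-register step can be pictured roughly as follows. This is a sketch under assumptions: the `manifest.json` file name, the `kind` / `name` keys, and both function names are hypothetical, chosen only to show how three registries could be filled from manifests without touching core code.

```python
import json
from pathlib import Path
from typing import Iterable

# The three registry kinds named in the text; the manifest keys are assumed.
KINDS = ("parser", "source", "vlm_pack")


def load_manifests(*roots: Path) -> list[dict]:
    """Read every <root>/<skill>/manifest.json (assumed directory layout)."""
    manifests = []
    for root in roots:
        for path in sorted(root.glob("*/manifest.json")):
            manifests.append(json.loads(path.read_text(encoding="utf-8")))
    return manifests


def build_registries(manifests: Iterable[dict]) -> dict[str, dict[str, dict]]:
    """Group manifests into one registry per kind, rejecting unknown kinds and duplicates."""
    registries: dict[str, dict[str, dict]] = {kind: {} for kind in KINDS}
    for manifest in manifests:
        kind, name = manifest["kind"], manifest["name"]
        if kind not in registries:
            raise ValueError(f"unknown SKILL kind: {kind!r}")
        if name in registries[kind]:
            raise ValueError(f"duplicate {kind} SKILL: {name!r}")
        registries[kind][name] = manifest
    return registries
```

In this sketch, built-in and customer SKILLs go through the same `build_registries` call, which is one way to read "on equal footing".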
How customers extend (three difficulty tiers)
| Who | What | Outcome |
|---|---|---|
| Engineers comfortable coding | Ship a CLI + manifest using our SDK | Standalone SKILL for marketplace / private tenant |
| Engineers scripting | Fork nearest SKILL; tweak two hooks | Modified SKILL drop-in |
| Ops configuring YAML only | Fork built-in; override values.yaml | Config overlay |
Heavy users rarely start from zero — our scaffold handles CLI boilerplate so you focus on reading files and emitting chunks.
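For the YAML-only tier, a config overlay amounts to merging your overrides onto the built-in defaults. The sketch below assumes a deep merge and uses plain dicts in place of parsed values.yaml content; the keys (`chunk`, `max_tokens`, `overlap`, `ocr`) are invented for illustration, and the platform's actual merge rules may differ.

```python
def overlay(base: dict, override: dict) -> dict:
    """Return base with override applied; nested dicts merge key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so sibling keys in the base survive the override.
            merged[key] = overlay(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical defaults from a built-in SKILL and a tenant override.
defaults = {"chunk": {"max_tokens": 512, "overlap": 64}, "ocr": False}
tenant = {"chunk": {"max_tokens": 256}}
effective = overlay(defaults, tenant)
```

Only the overridden value changes; `overlap` and `ocr` keep their defaults, and `defaults` itself is left unmodified.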
Industry examples
Patent / IP (xIPnex)
- Disclosure SKILL: pulls disclosure sections / draft claims / drawing captions into semantic chunks
- Office action SKILL: links examiner remarks to cited spec paragraphs
- Drawing OCR prompt pack: tailored VLM prompt for annotation-heavy figures
Automotive OEM / Tier1 (auto)
- DOORS SKILL: ReqIF ingest chunked by module; trace attributes retained
- AUTOSAR ARXML SKILL: chunk by SWC / port / interface for searchable units
- MISRA report SKILL: map violations → line ↔ rule ↔ severity for review citations
Smart manufacturing / plant (mfg)
- CAD SKILL (DWG/STEP): extract part IDs / mates / critical dims into structured search
- Work-order SKILL: chunk by operation; lift cycle time / equipment / material fields
- Quotation SKILL: structured rows for item / unit price / qty / total for downstream pricing bots
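As one illustration of what "structured rows" could mean for the Quotation SKILL, the sketch below turns a CSV quotation into typed rows with a computed total. The column names and the `quotation_rows` function are assumptions, not the shipped SKILL's schema.

```python
import csv
import io
from decimal import Decimal


def quotation_rows(csv_text: str) -> list[dict]:
    """Parse item / unit_price / qty columns and derive a total per row."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        unit_price = Decimal(record["unit_price"])  # Decimal avoids float rounding on prices
        qty = int(record["qty"])
        rows.append({
            "item": record["item"].strip(),
            "unit_price": unit_price,
            "qty": qty,
            "total": unit_price * qty,
        })
    return rows
```

A downstream pricing bot can then filter or sum on `total` directly instead of re-parsing free text.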
Code is "documents": code-aware parser
Uploading code is not "split text blindly" — code has structure. The built-in code-aware parser (M5) chunks along semantic units:
| Fields emitted | Why it matters |
|---|---|
| outline_path (e.g. pkg.module.Class.method) | Retrieval shows exactly which class/method owns the snippet |
| function_decls / class_decls / import_decls | Definitions / hierarchy / imports become first-class queries |
| references (resolved vs unresolved) | "Who calls this symbol?" Resolved inside the TU; unresolved waits on heavy-tier SCIP |
| siblings_signatures | Gives adjacent methods so assistants stay stylistically aligned |
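To show what chunking along semantic units with an outline_path looks like, here is a small sketch using Python's standard `ast` module. It is not the built-in parser: the `outline` function and the `kind` / `text` keys are hypothetical, and it only descends through classes and functions (definitions nested inside `if` blocks are skipped).

```python
import ast


def outline(source: str, module: str) -> list[dict]:
    """Emit one chunk per class/function, tagged with its dotted outline path."""
    tree = ast.parse(source)
    chunks: list[dict] = []

    def visit(node: ast.AST, prefix: str) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                path = f"{prefix}.{child.name}"
                chunks.append({
                    "outline_path": path,
                    "kind": "class" if isinstance(child, ast.ClassDef) else "function",
                    # Exact source text of the definition, so the chunk never cuts mid-body.
                    "text": ast.get_source_segment(source, child),
                })
                visit(child, path)  # methods and nested definitions

    visit(tree, module)
    return chunks
```

Each chunk boundary coincides with a definition boundary, which is what lets retrieval report the owning class or method rather than an arbitrary text window.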
L3 (AST chunks) and L4 (keyword fields) ship for Python / Java / Kotlin / Go / TypeScript / JavaScript / C / C++ / Rust, with broader L5 ast-grep audit coverage (Scala / Ruby / PHP / Swift / Dart / Lua / C# / Elixir, …). Full language × capability matrix: Code indexing · supported languages.
Selecting code-aware in the console
Open any knowledge base → Configuration tab in the sidebar → Chunk method dropdown → choose code-aware, then pick Code index tier (light vs heavy).

Field mapping (parser_config.index_tier, etc.) lives in Code indexing · where to configure.
Need another language on L3/L4? With parser-sdk, a new language can land in hours; for full cross-compilation SCIP coverage, email info@nox-lumen.com.
How this ties to the three tiers
Chunk quality directly controls whether light-tier queries resolve references instantly; treat code-aware as the quality gate before indexing.
When you fork a code parser
| Scenario | Reason |
|---|---|
| House DSL (G-code, ChemDraw scripting, insurer rule DSL) | No grammar in stock tree-sitter pack |
| Private schema (IDL / protobuf dialect / ARXML subset) | Built-ins cannot infer your schema |
| Mixed literate notebooks | You need prose + code boundaries your way |
Fork code-aware, then swap in your grammar and the two hooks (extract_symbols, extract_references); persistence and field plumbing stay platform-managed.
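The two hooks might look like the following for a house DSL. This sketch assumes a toy G-code dialect in which a line starting with `O<number>` defines a program and `M98 P<number>` calls one; the hook signatures and return shapes are guesses for illustration, so check the SDK for the real contract.

```python
import re

# Toy dialect (assumption): "O1000" at line start defines a program,
# "M98 P1000" calls it as a subprogram.
DEF_RE = re.compile(r"^O(\d+)", re.MULTILINE)
CALL_RE = re.compile(r"\bM98\s+P(\d+)")


def extract_symbols(source: str) -> list[str]:
    """Names defined in this file, in order of appearance."""
    return [f"O{number}" for number in DEF_RE.findall(source)]


def extract_references(source: str, known: set[str]) -> dict[str, list[str]]:
    """Split call targets into those defined locally and those left unresolved."""
    refs: dict[str, list[str]] = {"resolved": [], "unresolved": []}
    for number in CALL_RE.findall(source):
        name = f"O{number}"
        refs["resolved" if name in known else "unresolved"].append(name)
    return refs
```

The resolved / unresolved split mirrors the references field described earlier: calls to programs defined in the same file resolve immediately, while the rest are left for a later, heavier pass.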
Stable built-in parsers (high level)
| Class | Built-in SKILL | Files / behavior |
|---|---|---|
| General docs | manual / naive / paper | PDF / DOC / PPT / Markdown |
| Regulations | laws | Clause / chapter aware |
| Tables | table | Excel / CSV / HTML tables |
| Resumes | resume | Education / jobs / skills blocks |
| Images | picture + VLM prompt packs | Raster / screenshots / diagrams |
| Code | code-aware | Python / Java / C++ / Kotlin / Go … with AST-aware chunks |
Where to start
- Big picture: read What is a SKILL?
- Hands-on: Your first custom SKILL
- Reference source: combo-skills on GitHub
Related
- SKILL system overview
- Code indexing — retrieval consumes parser output
- Custom SKILL tutorial