Nox-Lumen Mfg

Parsing (Parser SKILL)

TL;DR

Every file landing in the knowledge base is first read, chunked, and enriched by a parser. Previously this was hard-coded for ~14 formats with if-elif; adding a new format meant patching core code.

Now parsers are pluggable SKILLs. We ship several built-ins; customers can ship their own on equal footing.
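
As a rough illustration of the move from an if-elif chain to pluggable registration, the sketch below keeps a registry keyed by file extension. Every name in it (`PARSERS`, `register_parser`, `parse_file`) is hypothetical and stands in for whatever the platform SDK actually exposes.

```python
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical registry: file extension -> parser callable.
Parser = Callable[[str], List[str]]
PARSERS: Dict[str, Parser] = {}


def register_parser(*extensions: str) -> Callable[[Parser], Parser]:
    """Decorator that registers a parser for one or more extensions."""
    def decorator(func: Parser) -> Parser:
        for ext in extensions:
            PARSERS[ext.lower()] = func
        return func
    return decorator


@register_parser(".md", ".txt")
def paragraph_parser(text: str) -> List[str]:
    # Split on blank lines; a real parser would also enrich each chunk.
    return [p.strip() for p in text.split("\n\n") if p.strip()]


@register_parser(".csv")
def row_parser(text: str) -> List[str]:
    # One chunk per non-empty row.
    return [line for line in text.splitlines() if line.strip()]


def parse_file(filename: str, text: str) -> List[str]:
    """Look the parser up in the registry instead of walking an if-elif chain."""
    ext = Path(filename).suffix.lower()
    try:
        parser = PARSERS[ext]
    except KeyError:
        raise ValueError(f"no parser registered for {ext!r}") from None
    return parser(text)
```

Adding a format in this sketch means decorating one more function; `parse_file` itself never changes.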


Problems this solves

| Pain you might have hit | What changes |
| --- | --- |
| Patent drawings / CAD / engineering screenshots arrive as blobs with nothing to search | Build a parser SKILL for your drawing layout; extract part numbers / labels into structured chunks |
| Proprietary contracts or work orders split poorly, with clauses cut mid-thought | Build an index-template SKILL anchored on headings so chunks follow your doc structure |
| PDF mixes prose, schematics, and tables; flattening everything to text loses the figures | VLM prompt packs are SKILLs; swap prompts per scenario (circuit, flowchart, industrial) |
| Lark docs / SCM repos / webhook payloads cannot land in KB | Build a Source SKILL that syncs those systems into ingestion |
| You want another chunking policy without touching core | Fork the nearest built-in and adjust hooks / YAML |
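
To make the heading-anchored chunking idea concrete, here is a minimal sketch. It assumes clause headings look like `1. Scope` or `2.3 Payment terms`; the regex and the `chunk_by_headings` name are illustrative, not part of any shipped index-template SKILL.

```python
import re
from typing import Dict, List, Tuple

# Assumed heading shape: a number containing at least one dot, then a title,
# e.g. "1. Scope" or "2.3 Payment terms". Plain numbers ("30 days") do not match.
HEADING_RE = re.compile(r"^(\d+\.(?:\d+\.?)*)\s+(\S.*)$")


def chunk_by_headings(text: str) -> List[Dict[str, str]]:
    """Start a new chunk at every heading so a clause is never cut mid-thought."""
    sections: List[Tuple[str, List[str]]] = [("(preamble)", [])]
    for line in text.splitlines():
        match = HEADING_RE.match(line.strip())
        if match:
            number = match.group(1).rstrip(".")
            sections.append((f"{number} {match.group(2)}", []))
        else:
            sections[-1][1].append(line)
    chunks = []
    for heading, lines in sections:
        body = "\n".join(lines).strip()
        if body:  # headings with no body text are dropped in this sketch
            chunks.append({"heading": heading, "body": body})
    return chunks
```

Each chunk carries its heading, so retrieval can cite the clause rather than an arbitrary character window.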

Three registries: SKILLs attach at startup


At startup the platform scans builtin/ plus installed SKILL manifests and registers parsers, sources, and VLM packs — no core edits and no service restart to add a parser SKILL package.
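
The scan-and-register step could look roughly like the sketch below. The manifest file name (`skill.json`), its `name` and `kind` keys, and the three kind labels are assumptions made for illustration; the real manifest format is not described on this page.

```python
import json
from pathlib import Path
from typing import Dict

# Assumed kind labels, one per registry named in the text.
KINDS = ("parser", "source", "vlm_pack")


def scan_skills(*roots: Path) -> Dict[str, Dict[str, dict]]:
    """Walk each root, read every manifest, and file it under its kind."""
    registries: Dict[str, Dict[str, dict]] = {kind: {} for kind in KINDS}
    for root in roots:
        # Hypothetical layout: <root>/<skill-name>/skill.json
        for manifest_path in sorted(root.glob("*/skill.json")):
            manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
            kind = manifest.get("kind")
            if kind not in registries:
                continue  # unknown kind: skip rather than abort the scan
            # In this sketch the last manifest seen for a name wins.
            registries[kind][manifest["name"]] = manifest
    return registries
```

Calling `scan_skills(Path("builtin"), Path("installed"))` would treat built-in and customer SKILLs identically, which is the "equal footing" property the registries are meant to give.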


How customers extend (three difficulty tiers)

| Who | What | Outcome |
| --- | --- | --- |
| Engineers comfortable coding | Ship a CLI + manifest using our SDK | Standalone SKILL for marketplace / private tenant |
| Engineers scripting | Fork nearest SKILL; tweak two hooks | Modified SKILL drop-in |
| Ops configuring YAML only | Fork built-in; override values.yaml | Config overlay |

Heavy users rarely start from zero — our scaffold handles CLI boilerplate so you focus on reading files and emitting chunks.
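
For the YAML-only tier, a config overlay amounts to a deep merge of customer values over built-in defaults. The sketch below shows that merge on plain dicts (as they would look after loading YAML); the keys `chunk.max_tokens`, `chunk.overlap`, and `ocr.enabled` are invented for the example.

```python
from copy import deepcopy
from typing import Any, Dict


def overlay(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return base with override applied: nested dicts merge, other values replace."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = overlay(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical defaults shipped with a built-in, and a customer's values.yaml
# after loading. Dicts keep the sketch free of third-party YAML dependencies.
builtin_defaults = {"chunk": {"max_tokens": 512, "overlap": 64}, "ocr": {"enabled": False}}
customer_values = {"chunk": {"max_tokens": 256}, "ocr": {"enabled": True}}

effective = overlay(builtin_defaults, customer_values)
```

Only the overridden keys change; `chunk.overlap` keeps its default, and the built-in defaults are left unmodified.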


Industry examples

Patent / IP (xIPnex)

  • Disclosure SKILL: extracts disclosure sections / draft claims / drawing captions as semantic chunks
  • Office action SKILL: links examiner remarks to cited spec paragraphs
  • Drawing OCR prompt pack: tailored VLM prompt for annotation-heavy figures

Automotive OEM / Tier1 (auto)

  • DOORS SKILL: ReqIF ingest chunked by module; trace attributes retained
  • AUTOSAR ARXML SKILL: chunk by SWC / port / interface for searchable units
  • MISRA report SKILL: map violations → line ↔ rule ↔ severity for review citations

Smart manufacturing / plant (mfg)

  • CAD SKILL (DWG/STEP): extract part IDs / mates / critical dims into structured search
  • Work-order SKILL: chunk by operation; lift cycle time / equipment / material fields
  • Quotation SKILL: structured rows for item / unit price / qty / total for downstream pricing bots
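
As one example of what "structured rows" can mean for a quotation, the sketch below turns CSV text into typed rows. The column names (`item`, `unit_price`, `qty`) are assumptions, and the total is computed here rather than read from the file.

```python
import csv
import io
from decimal import Decimal
from typing import Dict, List


def parse_quotation(csv_text: str) -> List[Dict[str, object]]:
    """Turn quotation CSV text into one structured row per line item."""
    rows: List[Dict[str, object]] = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        # Decimal avoids float rounding surprises in prices.
        unit_price = Decimal(record["unit_price"])
        qty = int(record["qty"])
        rows.append({
            "item": record["item"].strip(),
            "unit_price": unit_price,
            "qty": qty,
            "total": unit_price * qty,
        })
    return rows
```

A downstream pricing bot can then filter or sum on typed fields instead of re-parsing free text.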

Code is "documents": code-aware parser

Uploading code is not "split text blindly" — code has structure. The built-in code-aware parser (M5) chunks along semantic units:

| Fields emitted | Why it matters |
| --- | --- |
| outline_path (e.g. pkg.module.Class.method) | Retrieval shows exactly which class/method owns the snippet |
| function_decls / class_decls / import_decls | Definitions / hierarchy / imports become first-class queries |
| references (resolved vs unresolved) | "Who calls this symbol?" Resolved inside the TU; unresolved waits on heavy-tier SCIP |
| siblings_signatures | Gives adjacent methods so assistants stay stylistically aligned |

Nine languages ship with L3 (AST chunks) + L4 (keyword fields): Python / Java / Kotlin / Go / TypeScript / JavaScript / C / C++ / Rust. Broader L5 ast-grep audit coverage adds Scala / Ruby / PHP / Swift / Dart / Lua / C# / Elixir, …. Full language × capability matrix: Code indexing · supported languages.
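
To show what fields such as outline_path, function_decls, class_decls, and import_decls capture, here is a simplified sketch for Python sources only. It uses the standard-library `ast` module as a stand-in; it is not the built-in parser's implementation, and the `outline_fields` name is hypothetical.

```python
import ast
from typing import Dict, List


def outline_fields(source: str, module: str) -> Dict[str, List[str]]:
    """Collect dotted outline paths for classes and functions, plus imports."""
    fields: Dict[str, List[str]] = {
        "function_decls": [],
        "class_decls": [],
        "import_decls": [],
    }

    def visit(node: ast.AST, prefix: str) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                path = f"{prefix}.{child.name}"  # outline_path of the class
                fields["class_decls"].append(path)
                visit(child, path)
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                path = f"{prefix}.{child.name}"  # outline_path of the function
                fields["function_decls"].append(path)
                visit(child, path)
            elif isinstance(child, ast.Import):
                fields["import_decls"].extend(a.name for a in child.names)
            elif isinstance(child, ast.ImportFrom):
                base = "." * child.level + (child.module or "")
                sep = "." if child.module else ""
                fields["import_decls"].extend(f"{base}{sep}{a.name}" for a in child.names)
            else:
                visit(child, prefix)  # descend into if/try/with blocks, etc.

    visit(ast.parse(source), module)
    return fields
```

Each dotted path doubles as the outline_path for its definition, which is what lets retrieval report the owning class or method of a snippet.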

Selecting code-aware in the console

Open any knowledge base → Configuration tab in the sidebar → Chunk method dropdown → choose code-aware, then pick Code index tier (light vs heavy).

[Screenshot: Knowledge base · Configuration tab, with the chunk-method dropdown expanded to show parser options including code-aware]

Field mapping (parser_config.index_tier, etc.) lives in Code indexing · where to configure.

Need another language on L3/L4? With parser-sdk, support can land in hours. For full cross-compilation SCIP coverage, email info@nox-lumen.com.

How this ties to the three tiers


Chunk quality directly controls whether light-tier queries resolve references instantly — treat code-aware as the quality gate before indexing.

When you fork a code parser

| Scenario | Reason |
| --- | --- |
| House DSL (G-code, ChemDraw scripting, insurer rule DSL) | No grammar in stock tree-sitter pack |
| Private schema (IDL / protobuf dialect / ARXML subset) | Built-ins cannot infer your schema |
| Mixed literate notebooks | You need prose + code boundaries your way |

Fork code-aware, swap grammars + two hooks (extract_symbols, extract_references) — persistence and field plumbing stay platform-managed.
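
The two hooks might be shaped roughly as follows for a toy house DSL. The DSL syntax (`rule <name>` defines, `apply <name>` references), the hook signatures, and the return shapes are all invented for this sketch; a real fork would use the signatures the SDK defines and a proper grammar rather than regexes.

```python
import re
from typing import Dict, List

# Toy DSL, invented for illustration:
#   rule <name>     defines a symbol
#   apply <name>    references a symbol
RULE_RE = re.compile(r"^[ \t]*rule[ \t]+(\w+)", re.MULTILINE)
APPLY_RE = re.compile(r"^[ \t]*apply[ \t]+(\w+)", re.MULTILINE)


def extract_symbols(source: str) -> List[str]:
    """Hook 1: names defined in this file, in source order."""
    return RULE_RE.findall(source)


def extract_references(source: str) -> Dict[str, List[str]]:
    """Hook 2: split references into those defined in this file and the rest."""
    defined = set(extract_symbols(source))
    refs: Dict[str, List[str]] = {"resolved": [], "unresolved": []}
    for name in APPLY_RE.findall(source):
        refs["resolved" if name in defined else "unresolved"].append(name)
    return refs
```

References that cannot be resolved within the file are only reported here; resolving them across files is left to a heavier index, in line with the resolved vs unresolved split described for the references field.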


Stable built-in parsers (high level)

| Class | Built-in SKILL | Files |
| --- | --- | --- |
| General docs | manual / naive / paper | PDF / DOC / PPT / Markdown |
| Regulations | laws | Clause / chapter aware |
| Tables | table | Excel / CSV / HTML tables |
| Resumes | resume | Education / jobs / skills blocks |
| Images | picture + VLM prompt packs | Raster / screenshots / diagrams |
| Code | code-aware | Python / Java / C++ / Kotlin / Go … with AST-aware chunks |

Where to start

  1. Big picture: read What is a SKILL?
  2. Hands-on: Your first custom SKILL
  3. Reference source: combo-skills on GitHub
