parser-sdk
One-liner
ragbase-parser-sdk ships a Parser SKILL in hours—CLI scaffolding, contract validation, local dry-runs, packaging. You focus on chunking semantics, nothing else.
Install
Packages are not on public PyPI. Email info@nox-lumen.com explaining your parser scenario (formats / data domain). We reply with private PyPI creds + templates.
Then:
Minimal end-to-end
Your code surface: implement parse
make_cli wraps your subclass with CLI entrypoints (parse / partial-parse / info) honoring the contract the platform expects.
Incremental parse (avoid re-chunking the whole repo)
Add a decorator:
Runtime invokes partial-parse, touching only deltas.
manifest.json — advertise capabilities
Startup routes files by capabilities.extensions to your parser without platform code changes.
Contract outputs
| File | Payload |
|---|---|
chunks.jsonl | One JSON chunk per line → ES ingestion |
result.json | Metadata / stats / errors |
Key Chunk fields (subset):
| Field | Role |
|---|---|
content | Required chunk text |
metadata.tags | Business filters for retrieval |
metadata.outline_path | Structural path (e.g., claims.1.dependent) |
metadata.source_ref | Provenance offsets / lines |
Code-oriented parsers may also populate function_decls / class_decls / references / imports (M5). See Parser skills · Code-aware parsers.
Debug playbook
- Fast loop: edit →
parse-test→ inspectchunks.jsonl - Contract drift:
skill validatemirrors the A-layer protocol - Tracing:
ragbase-cli parse-test --debugkeeps intermediates
Real-world patterns
General doc examples
| File type | Parser idea |
|---|---|
| G-code (manufacturing) | Chunk per process segment toggling G91 ↔ G90; metadata captures cycle time estimates, tool#, spindle RPM |
| AUTOSAR ARXML subset (auto) | Chunk per SWC / port / interface; keep trace attrs for reqs |
| Corporate Word templates | Chunk by Heading styles H1/H2; outline_path anchors citations |
Code: bespoke slicer
Built-in code-aware parsers default to function/class atomic nodes. Customize when you need:
- Semantic segments: docblock + signature + body = chunk
- Commit hunks: use
git blameto bundle same-commit edits - Call graph locality: tightly coupled symbols share a chunk
- Business namespaces: carve monorepos by folder / package prefix
parser-sdk builds a sibling SKILL that competes peacefully with stock code-aware.
Fully worked sample that chunks by git hunk while honoring chunk contracts:
hunk_aware_code_parser.py (save-as)
Follow the header comment (~5 steps) to skill push.
Core excerpt:
Declare extensions + precedence in manifest:
Relationship to built-in
code-aware— Each KB selects one active code parser via KB settings. Multiple SKILL parsers can coexist; e.g., default repos usecode-aware, PR-review KB swaps tohunk-aware-code-parser.
Do we lose code fields? No—populate
outline_path,function_decls,class_decls,references,importsexactly as before—three-tier code index still applies—you only redefine chunk boundaries.
For broader customization (languages, forks of built-ins) see Parser skills · Code is also a doc.
Upgrade notes (0.x → 1.0)
Namespaces moved from ragbase_parser_sdk to ragbase.parser_sdk: