Implement phase 3 standard critical imports

2026-03-14 02:03:37 +01:00
parent 5c4d540246
commit 6870aa2aef
7 changed files with 465 additions and 45 deletions

@@ -31,14 +31,33 @@ The current implementation supports:
- manifest-driven source selection
- `standard` critical tables with columns `A-E`
- XML-based extraction using `pdftohtml -xml`
- geometry-based parsing for `Slash.pdf`
- geometry-based parsing across the currently enabled phase-3 tables:
  - `arcane-aether`
  - `arcane-nether`
  - `ballistic-shrapnel`
  - `brawling`
  - `cold`
  - `electricity`
  - `grapple`
  - `heat`
  - `impact`
  - `krush`
  - `ma-strikes`
  - `ma-sweeps`
  - `puncture`
  - `slash`
  - `subdual`
  - `tiny`
  - `unbalance`
- row-boundary repair for trailing affix leakage
- footer/page-number filtering during body parsing
- transactional loading into SQLite
The current implementation does not yet support:
- variant-column critical tables
- grouped variant tables
- `Mana.pdf`, whose current XML layout and affix notation still need a dedicated parser pass
- OCR/image-based PDFs such as `Void.pdf`
- normalized `critical_branch` population
- normalized `critical_effect` population
@@ -183,10 +202,9 @@ The parser was hardened in two ways:
The importer now explicitly rejects cells that still look structurally wrong after repair:
- a cell may not begin with affix-like lines before prose
- a cell may not contain prose after affix lines
- prose and affix segments may not alternate more than once inside a cell
This hardening step is important because it closed a class of row-boundary bugs that simple row/cell counts could not detect.
This keeps the phase-2.1 safety goal in place while allowing broader standard-table layouts that render a single affix block either before or after the prose block.
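
The relaxed rule can be illustrated with a minimal Python sketch. It is not the importer's actual code: the `AFFIX_LINE` pattern and both function names are assumptions made for illustration, and the real affix detection is not shown in this document. The sketch collapses a cell's lines into runs of prose or affix lines and accepts the cell when there is at most one transition between the two kinds.

```python
import re
from itertools import groupby

# Hypothetical affix heuristic: whitespace-separated tokens such as "+5H" or "-10".
# The importer's real affix detection is not described here.
AFFIX_LINE = re.compile(r"^[+-]\d+[A-Za-z]*(?:\s+[+-]\d+[A-Za-z]*)*$")


def segment_kinds(lines):
    """Collapse a cell's non-blank lines into consecutive 'affix'/'prose' runs."""
    kinds = (
        "affix" if AFFIX_LINE.match(line.strip()) else "prose"
        for line in lines
        if line.strip()
    )
    return [kind for kind, _ in groupby(kinds)]


def is_structurally_valid(lines):
    """Allow at most one prose/affix transition, i.e. at most two runs.

    Cells that are empty, all prose, or all affix also pass this check.
    """
    return len(segment_kinds(lines)) <= 2
```

Under these assumptions, `["Blow to arm.", "+5H"]` and `["+5H", "Blow to arm."]` both pass, while `["Blow.", "+5H", "More prose."]` is rejected because prose resumes after the affix block.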
## Planned Future Phases
@@ -194,9 +212,34 @@ The current architecture is intended to support additional phases:
### Phase 3: Broader Table Coverage
- add more `standard` critical PDFs
- expand the manifest
- verify parser stability across more source layouts
Phase 3 expands the manifest and validates the shared `standard` parser across a broader set of `A-E` tables.
The currently enabled phase-3 table set is:
- `arcane-aether`
- `arcane-nether`
- `ballistic-shrapnel`
- `brawling`
- `cold`
- `electricity`
- `grapple`
- `heat`
- `impact`
- `krush`
- `ma-strikes`
- `ma-sweeps`
- `puncture`
- `slash`
- `subdual`
- `tiny`
- `unbalance`
Current phase-3 notes:
- header detection now tolerates minor `top` misalignment across the `A-E` header glyphs
- footer page numbers are filtered out before body parsing
- validation allows a single contiguous affix block either before or after prose
- `Mana.pdf` is intentionally left out for now because its row-anchor geometry and notation still need dedicated handling
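
As a rough illustration of the first two notes, the sketch below works on the `<text top=… left=…>` elements that `pdftohtml -xml` emits. The function names, the `top_tolerance` and `margin` thresholds, and the sample page are all assumptions made for this example rather than values taken from the importer.

```python
import xml.etree.ElementTree as ET

HEADER_LABELS = ["A", "B", "C", "D", "E"]

# Invented sample page: header glyphs whose `top` values differ by a few units,
# one body fragment, and a footer page number.
SAMPLE_PAGE = """<page number="1" top="0" left="0" height="1188" width="918">
<text top="120" left="100" width="10" height="14" font="0"><b>A</b></text>
<text top="121" left="250" width="10" height="14" font="0"><b>B</b></text>
<text top="119" left="400" width="10" height="14" font="0"><b>C</b></text>
<text top="120" left="550" width="10" height="14" font="0"><b>D</b></text>
<text top="122" left="700" width="10" height="14" font="0"><b>E</b></text>
<text top="300" left="100" width="90" height="12" font="1">A glancing blow.</text>
<text top="1150" left="450" width="12" height="12" font="1">12</text>
</page>"""


def text_of(element):
    """Return the element's text, including text nested in <b>/<i> children."""
    return "".join(element.itertext()).strip()


def find_header_row(page, top_tolerance=3):
    """Return the five A-E header elements ordered left to right, or None.

    top_tolerance is a hypothetical threshold in pdftohtml coordinate units.
    """
    candidates = [t for t in page.iter("text") if text_of(t) in HEADER_LABELS]
    for anchor in candidates:
        if text_of(anchor) != "A":
            continue
        anchor_top = int(anchor.get("top"))
        row = [
            t for t in candidates
            if abs(int(t.get("top")) - anchor_top) <= top_tolerance
        ]
        row.sort(key=lambda t: int(t.get("left")))
        if [text_of(t) for t in row] == HEADER_LABELS:
            return row
    return None


def is_footer_page_number(element, page_height, margin=60):
    """Treat digit-only text within `margin` units of the page bottom as a footer."""
    return text_of(element).isdigit() and int(element.get("top")) >= page_height - margin
```

With the sample page, `find_header_row(ET.fromstring(SAMPLE_PAGE))` finds all five headers even though their `top` values range from 119 to 122, whereas a tolerance of `0` finds no complete row.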
### Phase 4: Variant and Grouped Tables
@@ -289,6 +332,11 @@ Each entry declares:
The manifest is intentionally the control point for enabling importer support one table at a time.
For the currently enabled phase-3 entries:
- `family` is `standard`
- `extractionMethod` is `xml`
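
A minimal sketch of how such entries might be selected follows. Only `family` and `extractionMethod` come from this document; the JSON encoding, the `slug` and `enabled` fields, and the function name are hypothetical and stand in for whatever the real manifest declares.

```python
import json

# Hypothetical manifest fragment. `slug` and `enabled` are invented for this sketch.
MANIFEST_JSON = """
[
  {"slug": "slash", "family": "standard", "extractionMethod": "xml", "enabled": true},
  {"slug": "cold", "family": "standard", "extractionMethod": "xml", "enabled": true},
  {"slug": "not-yet-enabled", "family": "standard", "extractionMethod": "xml", "enabled": false}
]
"""


def enabled_standard_xml_entries(entries):
    """Keep only enabled entries that use the shared standard/XML path."""
    return [
        entry for entry in entries
        if entry.get("enabled")
        and entry.get("family") == "standard"
        and entry.get("extractionMethod") == "xml"
    ]
```

Keeping the selection in one place matches the idea of the manifest as the single control point: enabling another table is then a data change, not a code change.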
## Artifact Layout
Artifacts are written under: