Implement phase 3 standard critical imports
@@ -31,14 +31,33 @@ The current implementation supports:
- manifest-driven source selection
- `standard` critical tables with columns `A-E`
- XML-based extraction using `pdftohtml -xml`
- geometry-based parsing for `Slash.pdf`
- geometry-based parsing across the currently enabled phase-3 tables:
  - `arcane-aether`
  - `arcane-nether`
  - `ballistic-shrapnel`
  - `brawling`
  - `cold`
  - `electricity`
  - `grapple`
  - `heat`
  - `impact`
  - `krush`
  - `ma-strikes`
  - `ma-sweeps`
  - `puncture`
  - `slash`
  - `subdual`
  - `tiny`
  - `unbalance`
- row-boundary repair for trailing affix leakage
- footer/page-number filtering during body parsing
- transactional loading into SQLite

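As an illustration of the "transactional loading into SQLite" item above, here is a minimal sketch using Python's standard `sqlite3` module. The `critical_cell` table, its columns, and the helper names are hypothetical and are not taken from the importer; only the all-or-nothing behaviour is the point.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical staging table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE critical_cell ("
        " table_slug TEXT NOT NULL, roll_band TEXT NOT NULL,"
        " column_key TEXT NOT NULL, text TEXT NOT NULL,"
        " UNIQUE (table_slug, roll_band, column_key))"
    )
    return conn


def load_rows(conn, table_slug, rows):
    """Replace all rows for one table, or leave the database untouched."""
    # `with conn:` commits if the block succeeds and rolls back if it raises,
    # so a failed insert also undoes the preceding DELETE.
    with conn:
        conn.execute(
            "DELETE FROM critical_cell WHERE table_slug = ?", (table_slug,)
        )
        conn.executemany(
            "INSERT INTO critical_cell (table_slug, roll_band, column_key, text)"
            " VALUES (?, ?, ?, ?)",
            [(table_slug, band, col, text) for band, col, text in rows],
        )
```

With this shape, a table whose parse output violates a constraint leaves the previously loaded rows for that table in place rather than a half-written import.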
The current implementation does not yet support:

- variant-column critical tables
- grouped variant tables
- `Mana.pdf`, whose current XML layout and affix notation still need a dedicated parser pass
- OCR/image-based PDFs such as `Void.pdf`
- normalized `critical_branch` population
- normalized `critical_effect` population
||||
@@ -183,10 +202,9 @@ The parser was hardened in two ways:

The importer now explicitly rejects cells that still look structurally wrong after repair:

- a cell may not begin with affix-like lines before prose
- a cell may not contain prose after affix lines
- prose and affix segments may not alternate more than once inside a cell

This hardening step is important because it closed a class of row-boundary bugs that simple row/cell counts could not detect.
This keeps the phase-2.1 safety goal in place while allowing broader standard-table layouts that render a single affix block either before or after the prose block.

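The "at most one alternation" rule can be sketched as follows. This is not the importer's code: the `AFFIX_RE` pattern is a hypothetical stand-in for whatever the real parser treats as affix-like, and the function names are invented for illustration.

```python
import re
from itertools import groupby

# Hypothetical heuristic: treat short tokens such as "+5H" or "-10" as affix-like.
AFFIX_RE = re.compile(r"^[+-]\d+\w*$")


def is_affix_like(line):
    return bool(AFFIX_RE.match(line.strip()))


def segment_kinds(lines, classify=is_affix_like):
    """Collapse consecutive lines of the same kind into one segment label."""
    kinds = (
        "affix" if classify(line) else "prose"
        for line in lines
        if line.strip()  # ignore blank lines
    )
    return [kind for kind, _ in groupby(kinds)]


def cell_is_valid(lines, classify=is_affix_like):
    """Accept at most one prose/affix alternation, i.e. two segments in total."""
    return len(segment_kinds(lines, classify)) <= 2
```

Under this sketch, a cell with prose then affixes (or affixes then prose) passes, while prose, affix, prose is rejected as a likely row-boundary leak.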
## Planned Future Phases

@@ -194,9 +212,34 @@ The current architecture is intended to support additional phases:

### Phase 3: Broader Table Coverage

- add more `standard` critical PDFs
- expand the manifest
- verify parser stability across more source layouts
Phase 3 expands the manifest and validates the shared `standard` parser across a broader set of `A-E` tables.

The currently enabled phase-3 table set is:

- `arcane-aether`
- `arcane-nether`
- `ballistic-shrapnel`
- `brawling`
- `cold`
- `electricity`
- `grapple`
- `heat`
- `impact`
- `krush`
- `ma-strikes`
- `ma-sweeps`
- `puncture`
- `slash`
- `subdual`
- `tiny`
- `unbalance`

Current phase-3 notes:

- header detection now tolerates minor `top` misalignment across the `A-E` header glyphs
- footer page numbers are filtered out before body parsing
- validation allows a single contiguous affix block either before or after prose
- `Mana.pdf` is intentionally left out for now because its row-anchor geometry and notation still need dedicated handling

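The first two notes can be illustrated with a rough sketch over a single `<page>` element as emitted by `pdftohtml -xml`, where each `<text>` node carries `top` and `left` attributes. The function names, the 3-unit tolerance, and the 90% footer threshold are assumptions chosen for the example, not values from the importer.

```python
import xml.etree.ElementTree as ET

HEADER_LABELS = ("A", "B", "C", "D", "E")


def find_header(page_xml, top_tolerance=3):
    """Return {label: left} for an A-E header row, or None if none is found."""
    page = ET.fromstring(page_xml)
    candidates = sorted(
        (int(node.get("top")), int(node.get("left")), "".join(node.itertext()).strip())
        for node in page.iter("text")
    )
    candidates = [c for c in candidates if c[2] in HEADER_LABELS]
    for anchor_top, _, _ in candidates:
        # Collect labels whose `top` lies within the tolerance of this anchor.
        row = {
            label: left
            for top, left, label in candidates
            if abs(top - anchor_top) <= top_tolerance
        }
        if len(row) == len(HEADER_LABELS):
            return row
    return None


def body_fragments(page_xml, footer_fraction=0.9):
    """Return fragment texts, dropping digit-only fragments near the page bottom."""
    page = ET.fromstring(page_xml)
    footer_top = int(page.get("height")) * footer_fraction
    kept = []
    for node in page.iter("text"):
        text = "".join(node.itertext()).strip()
        if text.isdigit() and int(node.get("top")) >= footer_top:
            continue  # assumed to be a footer page number
        kept.append(text)
    return kept
```

A strict equality check on `top` would miss a header whose glyphs sit a unit or two apart, which is the kind of layout the tolerance is meant to absorb.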
### Phase 4: Variant and Grouped Tables

@@ -289,6 +332,11 @@ Each entry declares:

The manifest is intentionally the control point for enabling importer support one table at a time.

For the currently enabled phase-3 entries:

- `family` is `standard`
- `extractionMethod` is `xml`

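To make the "control point" idea concrete, the sketch below selects importable tables from a JSON manifest. Only `family` and `extractionMethod` come from the text above; the `tables` wrapper and the `slug` and `enabled` fields are hypothetical, and the real manifest may be shaped differently.

```python
import json

# Combinations this sketch treats as supported, mirroring the phase-3 entries.
SUPPORTED = {("standard", "xml")}


def importable_entries(manifest_text):
    """Return slugs of enabled entries, failing loudly on unsupported ones."""
    manifest = json.loads(manifest_text)
    selected = []
    for entry in manifest["tables"]:
        if not entry.get("enabled", False):
            continue  # disabled tables are skipped, not treated as errors
        key = (entry["family"], entry["extractionMethod"])
        if key not in SUPPORTED:
            raise ValueError(f"{entry['slug']}: unsupported combination {key}")
        selected.append(entry["slug"])
    return selected
```

Rejecting an enabled but unsupported entry, instead of silently skipping it, keeps a manifest typo from quietly dropping a table from the import.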
## Artifact Layout

Artifacts are written under: