Implement phase 4 critical table imports
@@ -30,8 +30,10 @@ The current implementation supports:
 - explicit CLI commands for reset, extraction, and import
 - manifest-driven source selection
 - `standard` critical tables with columns `A-E`
+- `variant_column` critical tables with non-severity columns
+- `grouped_variant` critical tables with a group axis plus variant columns
 - XML-based extraction using `pdftohtml -xml`
-- geometry-based parsing across the currently enabled phase-3 tables:
+- geometry-based parsing across the currently enabled table set:
 - `arcane-aether`
 - `arcane-nether`
 - `ballistic-shrapnel`
@@ -42,22 +44,24 @@ The current implementation supports:
 - `heat`
 - `impact`
 - `krush`
+- `large_creature_magic`
+- `large_creature_weapon`
 - `ma-strikes`
 - `ma-sweeps`
 - `mana`
 - `puncture`
 - `slash`
 - `subdual`
+- `super_large_creature_weapon`
 - `tiny`
 - `unbalance`
 - row-boundary repair for trailing affix leakage
+- split row-label reconstruction for tables that render labels such as `99-` / `100` as two fragments
 - footer/page-number filtering during body parsing
 - transactional loading into SQLite
 
 The current implementation does not yet support:
 
-- variant-column critical tables
-- grouped variant tables
 - OCR/image-based PDFs such as `Void.pdf`
 - normalized `critical_branch` population
 - normalized `critical_effect` population
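The split row-label reconstruction mentioned in this hunk (labels such as `99-` / `100` arriving as two fragments) can be illustrated with a minimal sketch. It is not the importer's actual code: the function name `merge_split_labels` and the fragment shapes are assumptions, and it works on plain strings only, ignoring the geometry a real parser would likely also check.

```python
import re

# Assumed fragment shapes: "99-" opens a range, "100" closes it.
_OPEN_RANGE = re.compile(r"^\d+-$")
_NUMBER = re.compile(r"^\d+$")


def merge_split_labels(fragments):
    """Join adjacent fragments like "99-", "100" into "99-100".

    Fragments that do not form such a pair are passed through unchanged.
    """
    merged = []
    i = 0
    while i < len(fragments):
        current = fragments[i].strip()
        nxt = fragments[i + 1].strip() if i + 1 < len(fragments) else None
        if nxt is not None and _OPEN_RANGE.match(current) and _NUMBER.match(nxt):
            merged.append(current + nxt)
            i += 2  # consume both halves of the split label
        else:
            merged.append(current)
            i += 1
    return merged
```

A complete label such as `01-05` passes through untouched, while `151-` followed by `175` becomes `151-175`.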
@@ -246,9 +250,28 @@ Current phase-3 notes:
 
 ### Phase 4: Variant and Grouped Tables
 
-- support `variant_column` tables such as `Large Creature - Weapon.pdf`
-- support `grouped_variant` tables such as `Large Creature - Magic.pdf`
-- add parser strategies for additional table families
+Phase 4 extended the importer beyond `A-E` tables.
+
+The currently enabled phase-4 table set is:
+
+- `large_creature_weapon`
+- `family`: `variant_column`
+- columns: `NORMAL`, `MAGIC`, `MITHRIL`, `HOLY_ARMS`, `SLAYING`
+- `super_large_creature_weapon`
+- `family`: `variant_column`
+- columns: `NORMAL`, `MAGIC`, `MITHRIL`, `HOLY_ARMS`, `SLAYING`
+- `large_creature_magic`
+- `family`: `grouped_variant`
+- groups: `large`, `super_large`
+- columns: `NORMAL`, `SLAYING`
+
+Phase-4 notes:
+
+- grouped results now populate `critical_group` during SQLite load
+- parser dispatch is family-based instead of standard-table only
+- left-margin row labels can be reconstructed from split fragments such as `151-` / `175`
+- the grouped magic PDF is imported once as `large_creature_magic`
+- `sources/Large Creature - Magic.pdf` and `sources/Super Large Creature - Magic.pdf` are duplicate files
 
 ### Phase 5: Conditional Branch Extraction
 
@@ -335,10 +358,12 @@ Each entry declares:
 
 The manifest is intentionally the control point for enabling importer support one table at a time.
 
-For the currently enabled phase-3 entries:
+For the currently enabled entries:
 
-- `family` is `standard`
-- `extractionMethod` is `xml`
+- standard tables use `family: standard`
+- creature weapon tables use `family: variant_column`
+- grouped creature magic uses `family: grouped_variant`
+- all enabled entries currently use `extractionMethod: xml`
 
 ## Artifact Layout
 
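The family-based dispatch described in the phase-4 notes could look roughly like the sketch below. Only the `family` and `extractionMethod` field names and the three family values come from the diff. The parser stubs, the `slug`, `groups`, and `columns` keys, and the `column_keys` helper are hypothetical and exist purely for illustration.

```python
from typing import Callable, Dict, List


# Hypothetical stubs: real parsers would consume the extracted XML geometry.
def parse_standard(entry: dict) -> List[str]:
    return ["A", "B", "C", "D", "E"]


def parse_variant_column(entry: dict) -> List[str]:
    return list(entry["columns"])


def parse_grouped_variant(entry: dict) -> List[str]:
    # One key per (group, column) pair.
    return [f"{group}:{column}" for group in entry["groups"] for column in entry["columns"]]


PARSERS: Dict[str, Callable[[dict], List[str]]] = {
    "standard": parse_standard,
    "variant_column": parse_variant_column,
    "grouped_variant": parse_grouped_variant,
}


def column_keys(entry: dict) -> List[str]:
    """Pick a parser by the manifest entry's family and run it."""
    family = entry["family"]
    try:
        parser = PARSERS[family]
    except KeyError:
        raise ValueError(f"unsupported table family: {family!r}") from None
    return parser(entry)


# Assumed manifest entry shape for the grouped magic table.
MAGIC_ENTRY = {
    "slug": "large_creature_magic",
    "family": "grouped_variant",
    "extractionMethod": "xml",
    "groups": ["large", "super_large"],
    "columns": ["NORMAL", "SLAYING"],
}
```

Keeping the lookup table separate from the parsers means enabling a new family is one new function plus one dictionary entry, which fits the manifest's role as the control point for enabling tables one at a time.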
@@ -19,11 +19,12 @@ The PDFs are not one uniform table shape. I found three families:
 - Example: `Large Creature - Magic.pdf` has:
 - group: `large`, `super_large`
 - column: `normal`, `slaying`
+- In the current importer manifest, the grouped magic PDF is loaded once as `large_creature_magic` because the `Large Creature - Magic.pdf` and `Super Large Creature - Magic.pdf` source files are duplicates.
 - row: roll band
 
 There are also extraction constraints:
 
-- Most PDFs are text extractable with `pdftotext -layout`.
+- Most PDFs are text extractable with `pdftohtml -xml`.
 - `Void.pdf` appears image-based and will need OCR or manual transcription.
 - A single cell can contain:
 - base description text
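The switch from `pdftotext -layout` to `pdftohtml -xml` matters because the XML output keeps per-fragment coordinates. The sketch below shows one way geometry-based column assignment might work. It assumes the `<page>` / `<text>` layout with integer `top` and `left` attributes that `pdftohtml -xml` typically emits. The helper names, the column boundaries, and the sample text are invented for illustration and are not taken from the importer.

```python
import xml.etree.ElementTree as ET
from bisect import bisect_right


def text_fragments(xml_source: str):
    """Yield (page, top, left, text) for each non-empty <text> element."""
    root = ET.fromstring(xml_source)
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for node in page.iter("text"):
            # itertext() also collects text nested in inline tags such as <b>.
            content = "".join(node.itertext()).strip()
            if content:
                yield page_no, int(node.get("top")), int(node.get("left")), content


def assign_column(left: int, boundaries):
    """Index of the column containing `left`, or -1 for the left margin.

    `boundaries` must be the sorted left edges of the columns.
    """
    return bisect_right(boundaries, left) - 1


# Made-up sample in the assumed pdftohtml XML shape.
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="120" left="40" width="30" height="12" font="0">01-05</text>
<text top="120" left="110" width="90" height="12" font="0"><b>Glancing</b> blow.</text>
<text top="120" left="300" width="90" height="12" font="0">Foe is stunned.</text>
</page>
</pdf2xml>"""
```

With hypothetical column edges `[100, 290]`, the fragment at `left=40` lands in the left margin (a row label), while the fragments at `left=110` and `left=300` fall into the first and second columns.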
@@ -282,4 +283,3 @@ Recommended import flow:
 6. Route image PDFs like `Void.pdf` through OCR before the same parser.
 
 The important design decision is: never throw away the original text. The prose is too irregular to rely on normalized fields alone.
-