Implement phase 4 critical table imports

2026-03-14 03:27:14 +01:00
parent a391a1421a
commit b2f61c3d73
17 changed files with 1280 additions and 474 deletions
--- a/docs/critical_import_tool.md
+++ b/docs/critical_import_tool.md
@@ -30,8 +30,10 @@ The current implementation supports:
 - explicit CLI commands for reset, extraction, and import
 - manifest-driven source selection
 - `standard` critical tables with columns `A-E`
+- `variant_column` critical tables with non-severity columns
+- `grouped_variant` critical tables with a group axis plus variant columns
 - XML-based extraction using `pdftohtml -xml`
-- geometry-based parsing across the currently enabled phase-3 tables:
+- geometry-based parsing across the currently enabled table set:
  - `arcane-aether`
  - `arcane-nether`
  - `ballistic-shrapnel`
@@ -42,22 +44,24 @@ The current implementation supports:
  - `heat`
  - `impact`
  - `krush`
+  - `large_creature_magic`
+  - `large_creature_weapon`
  - `ma-strikes`
  - `ma-sweeps`
  - `mana`
  - `puncture`
  - `slash`
  - `subdual`
+  - `super_large_creature_weapon`
  - `tiny`
  - `unbalance`
 - row-boundary repair for trailing affix leakage
+- split row-label reconstruction for tables that render labels such as `99-` / `100` as two fragments
 - footer/page-number filtering during body parsing
 - transactional loading into SQLite

 The current implementation does not yet support:

-- variant-column critical tables
-- grouped variant tables
 - OCR/image-based PDFs such as `Void.pdf`
 - normalized `critical_branch` population
 - normalized `critical_effect` population
@@ -246,9 +250,28 @@ Current phase-3 notes:

 ### Phase 4: Variant and Grouped Tables

-- support `variant_column` tables such as `Large Creature - Weapon.pdf`
-- support `grouped_variant` tables such as `Large Creature - Magic.pdf`
-- add parser strategies for additional table families
+Phase 4 extended the importer beyond `A-E` tables.
+
+The currently enabled phase-4 table set is:
+
+- `large_creature_weapon`
+  - `family`: `variant_column`
+  - columns: `NORMAL`, `MAGIC`, `MITHRIL`, `HOLY_ARMS`, `SLAYING`
+- `super_large_creature_weapon`
+  - `family`: `variant_column`
+  - columns: `NORMAL`, `MAGIC`, `MITHRIL`, `HOLY_ARMS`, `SLAYING`
+- `large_creature_magic`
+  - `family`: `grouped_variant`
+  - groups: `large`, `super_large`
+  - columns: `NORMAL`, `SLAYING`
+
+Phase-4 notes:
+
+- grouped results now populate `critical_group` during SQLite load
+- parser dispatch is family-based instead of standard-table only
+- left-margin row labels can be reconstructed from split fragments such as `151-` / `175`
+- the grouped magic PDF is imported once as `large_creature_magic`
+  - `sources/Large Creature - Magic.pdf` and `sources/Super Large Creature - Magic.pdf` are duplicate files

 ### Phase 5: Conditional Branch Extraction

@@ -335,10 +358,12 @@ Each entry declares:

 The manifest is intentionally the control point for enabling importer support one table at a time.

-For the currently enabled phase-3 entries:
+For the currently enabled entries:

-- `family` is `standard`
-- `extractionMethod` is `xml`
+- standard tables use `family: standard`
+- creature weapon tables use `family: variant_column`
+- grouped creature magic uses `family: grouped_variant`
+- all enabled entries currently use `extractionMethod: xml`

 ## Artifact Layout

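The split row-label reconstruction described in this file's changes (labels such as `99-` / `100` or `151-` / `175` rendered as two fragments) can be illustrated with a minimal sketch. It is not the importer's code: the function name `merge_split_labels` is hypothetical, and it assumes the left-margin fragments have already been collected in reading order, so it ignores the geometry checks the real parser presumably applies.

```python
import re

# Assumed fragment shapes: an open range such as "99-" followed by a bare number such as "100".
_OPEN_RANGE = re.compile(r"^\d+-$")
_BARE_NUMBER = re.compile(r"^\d+$")


def merge_split_labels(fragments):
    """Join an open-range fragment with the bare number that follows it."""
    merged = []
    i = 0
    while i < len(fragments):
        current = fragments[i].strip()
        nxt = fragments[i + 1].strip() if i + 1 < len(fragments) else None
        if nxt is not None and _OPEN_RANGE.match(current) and _BARE_NUMBER.match(nxt):
            # "99-" + "100" becomes the single row label "99-100".
            merged.append(current + nxt)
            i += 2
        else:
            # Already-complete labels (for example "01-05") pass through unchanged.
            merged.append(current)
            i += 1
    return merged
```

A call such as `merge_split_labels(["01-05", "99-", "100"])` leaves the complete label alone and joins the last two fragments into one. A trailing open range with no following number is kept as-is rather than guessed at.
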
--- a/docs/critical_tables_db_model.md
+++ b/docs/critical_tables_db_model.md
@@ -19,11 +19,12 @@ The PDFs are not one uniform table shape. I found three families:
   - Example: `Large Creature - Magic.pdf` has:
     - group: `large`, `super_large`
     - column: `normal`, `slaying`
+   - In the current importer manifest, the grouped magic PDF is loaded once as `large_creature_magic` because the `Large Creature - Magic.pdf` and `Super Large Creature - Magic.pdf` source files are duplicates.
     - row: roll band

 There are also extraction constraints:

-- Most PDFs are text extractable with `pdftotext -layout`.
+- Most PDFs are text extractable with `pdftohtml -xml`.
 - `Void.pdf` appears image-based and will need OCR or manual transcription.
 - A single cell can contain:
  - base description text
@@ -282,4 +283,3 @@ Recommended import flow:
 6. Route image PDFs like `Void.pdf` through OCR before the same parser.

 The important design decision is: never throw away the original text. The prose is too irregular to rely on normalized fields alone.
-
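
Both documents now name `pdftohtml -xml` as the extraction method, and the importer is described as geometry-based. As a rough sketch of what that input looks like to a parser, the snippet below reads positioned text fragments from the XML. It assumes the usual `pdftohtml -xml` layout of `<page>` elements containing `<text top=... left=... width=... height=...>` children. The `Fragment` type and `read_fragments` function are hypothetical names, not part of the importer.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    page: int
    top: int
    left: int
    width: int
    height: int
    text: str


def read_fragments(xml_text):
    """Collect non-empty text fragments with their page and box coordinates."""
    root = ET.fromstring(xml_text)
    fragments = []
    for page in root.iter("page"):
        page_no = int(page.get("number", "0"))
        for el in page.iter("text"):
            # itertext() also picks up text nested in inline markup such as <b> or <i>.
            text = "".join(el.itertext()).strip()
            if not text:
                continue
            fragments.append(
                Fragment(
                    page=page_no,
                    top=int(el.get("top")),
                    left=int(el.get("left")),
                    width=int(el.get("width")),
                    height=int(el.get("height")),
                    text=text,
                )
            )
    # Sort into reading order: page first, then top-to-bottom, then left-to-right.
    fragments.sort(key=lambda f: (f.page, f.top, f.left))
    return fragments
```

Sorting by `(page, top, left)` is only a starting point. Assigning fragments to rows and columns would still need tolerance bands on `top` and `left`, which this sketch leaves out.
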