Implement phase 4 critical table imports

2026-03-14 03:27:14 +01:00
parent a391a1421a
commit b2f61c3d73
17 changed files with 1280 additions and 474 deletions
--- a/docs/critical_tables_db_model.md
+++ b/docs/critical_tables_db_model.md
@@ -19,11 +19,12 @@ The PDFs are not one uniform table shape. I found three families:
   - Example: `Large Creature - Magic.pdf` has:
     - group: `large`, `super_large`
     - column: `normal`, `slaying`
+   - In the current importer manifest, the grouped magic PDF is loaded once as `large_creature_magic` because the `Large Creature - Magic.pdf` and `Super Large Creature - Magic.pdf` source files are duplicates.
     - row: roll band

 There are also extraction constraints:

-- Most PDFs are text extractable with `pdftotext -layout`.
+- Most PDFs are text extractable with `pdftohtml -xml`.
 - `Void.pdf` appears image-based and will need OCR or manual transcription.
 - A single cell can contain:
  - base description text
@@ -282,4 +283,3 @@ Recommended import flow:
 6. Route image PDFs like `Void.pdf` through OCR before the same parser.

 The important design decision is: never throw away the original text. The prose is too irregular to rely on normalized fields alone.
-
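
The doc change above switches text extraction to `pdftohtml -xml`. As a rough illustration of why positional XML may suit these tables, here is a minimal sketch that groups the `<text>` fragments of that output into rows by their `top` coordinate while keeping each fragment's original text. The function name `rows_from_pdf2xml` and the `tolerance` parameter are hypothetical and not taken from the importer in this commit; the sketch only assumes the usual `pdftohtml -xml` layout of `<page>` elements containing `<text top="…" left="…">` children.

```python
import xml.etree.ElementTree as ET


def rows_from_pdf2xml(xml_text, tolerance=2):
    """Group <text> fragments of each page into rows of cell strings.

    Hypothetical helper: fragments whose `top` is within `tolerance`
    units of the first fragment in a row are treated as the same row.
    """
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.iter("page"):
        # Collect (top, left, text); itertext() keeps text nested in <b>/<i>.
        frags = sorted(
            (int(t.get("top")), int(t.get("left")), "".join(t.itertext()).strip())
            for t in page.iter("text")
        )
        rows = []  # each entry: (top of first fragment, [(left, text), ...])
        for top, left, text in frags:
            if not text:
                continue
            if rows and abs(top - rows[-1][0]) <= tolerance:
                rows[-1][1].append((left, text))
            else:
                rows.append((top, [(left, text)]))
        # Order cells left to right; the original text is kept unchanged.
        pages.append([[text for _, text in sorted(cells)] for _, cells in rows])
    return pages


if __name__ == "__main__":
    sample = (
        '<pdf2xml><page number="1">'
        '<text top="10" left="50" font="0">01-05</text>'
        '<text top="11" left="200" font="0"><b>Glancing blow.</b></text>'
        '<text top="30" left="50" font="0">06-10</text>'
        "</page></pdf2xml>"
    )
    for row in rows_from_pdf2xml(sample)[0]:
        print(row)
```

Real critical-table cells span several lines, so an importer would still need to merge rows into roll bands and columns; this only shows the coordinate-based grouping step.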