Implement phase 4 critical table imports
@@ -19,11 +19,12 @@ The PDFs are not one uniform table shape. I found three families:
- Example: `Large Creature - Magic.pdf` has:
  - group: `large`, `super_large`
  - column: `normal`, `slaying`
  - In the current importer manifest, the grouped magic PDF is loaded once as `large_creature_magic` because the `Large Creature - Magic.pdf` and `Super Large Creature - Magic.pdf` source files are duplicates.
  - row: roll band
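The grouped shape in the example above can be pictured as a nested mapping from group to column to roll band. This is a minimal sketch, not the importer's actual model: `GroupedTable`, `add`, and `lookup` are hypothetical names, and the roll bands and cell text are placeholders. Only the group keys, the column keys, and the `large_creature_magic` table key come from the notes.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class GroupedTable:
    """Hypothetical container: group -> column -> list of (low, high, text)."""

    key: str
    cells: dict[str, dict[str, list[tuple[int, int, str]]]] = field(default_factory=dict)

    def add(self, group: str, column: str, low: int, high: int, text: str) -> None:
        # Create the group and column buckets on first use, then append the band.
        self.cells.setdefault(group, {}).setdefault(column, []).append((low, high, text))

    def lookup(self, group: str, column: str, roll: int) -> str | None:
        # Return the text of the first band containing the roll, if any.
        for low, high, text in self.cells.get(group, {}).get(column, []):
            if low <= roll <= high:
                return text
        return None


table = GroupedTable(key="large_creature_magic")
# Placeholder bands and text, not taken from the PDF.
table.add("large", "normal", 1, 50, "placeholder result A")
table.add("large", "slaying", 1, 50, "placeholder result B")
table.add("super_large", "normal", 51, 100, "placeholder result C")
```

Keeping the group as the outer key lets one loaded table serve both `large` and `super_large` lookups, which matches loading the duplicated PDF only once.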
There are also extraction constraints:
- Most PDFs are text extractable with `pdftotext -layout`.
- Most PDFs are text extractable with `pdftohtml -xml`.
- `Void.pdf` appears image-based and will need OCR or manual transcription.
- A single cell can contain:
  - base description text
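The two extraction commands named above could be wrapped as follows. This is a sketch under assumptions: the helper names are hypothetical, the Poppler tools are assumed to be on `PATH`, and the importer's real invocation is not shown in this diff.

```python
import subprocess
from pathlib import Path


def pdftotext_layout_cmd(pdf: Path) -> list[str]:
    # "-" as the output argument asks pdftotext to write to stdout.
    return ["pdftotext", "-layout", str(pdf), "-"]


def pdftohtml_xml_cmd(pdf: Path, out_stem: Path) -> list[str]:
    # With -xml, pdftohtml typically writes its output next to the given stem.
    return ["pdftohtml", "-xml", str(pdf), str(out_stem)]


def extract_layout_text(pdf: Path) -> str:
    # Runs pdftotext and returns the layout-preserving text; raises
    # CalledProcessError if the tool exits with a non-zero status.
    result = subprocess.run(
        pdftotext_layout_cmd(pdf),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Building the argument list separately from running it keeps the command easy to inspect and test without the tools installed.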
@@ -282,4 +283,3 @@ Recommended import flow:
6. Route image PDFs like `Void.pdf` through OCR before the same parser.
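One way to decide which PDFs take the OCR route in step 6 is to check how much text the normal extraction produced. This is a hedged sketch: `needs_ocr`, `route`, and the `min_chars` threshold are assumptions for illustration, not part of the described flow.

```python
def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Guess that a PDF is image-based when extraction yields almost no text."""
    # Drop all whitespace (including form feeds) before counting characters.
    visible = "".join(extracted_text.split())
    return len(visible) < min_chars


def route(extracted_text: str) -> str:
    # Both routes are meant to feed the same parser afterwards.
    return "ocr" if needs_ocr(extracted_text) else "text"
```

The threshold is arbitrary; a file such as `Void.pdf` could also simply be listed by name in the manifest if the heuristic proves unreliable.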
The important design decision is: never throw away the original text. The prose is too irregular to rely on normalized fields alone.
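A record that follows this decision keeps the verbatim cell text next to whatever fields the parser manages to normalize. In this sketch, `CellRecord`, `parse_cell`, the `hits` field, and the `+<n>H` pattern are all illustrative assumptions and are not taken from the PDFs or the importer.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed notation for illustration only: "+5H" meaning five extra hits.
_HITS = re.compile(r"\+(\d+)\s*H\b")


@dataclass(frozen=True)
class CellRecord:
    raw_text: str                # original cell text, stored verbatim
    hits: Optional[int] = None   # hypothetical normalized field


def parse_cell(raw_text: str) -> CellRecord:
    # Normalize what matches; the raw text is kept regardless of the outcome.
    match = _HITS.search(raw_text)
    return CellRecord(raw_text=raw_text, hits=int(match.group(1)) if match else None)
```

When the pattern does not match, the normalized field stays empty but nothing is lost, so irregular prose can be re-parsed later with better rules.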