Update phase 7 OCR planning docs
@@ -25,7 +25,7 @@ The PDFs are not one uniform table shape. I found three families:
 
 There are also extraction constraints:
 - Most PDFs are text extractable with `pdftohtml -xml`.
-- `Void.pdf` appears image-based and will need OCR or manual transcription.
+- `Void.pdf` appears image-based and will need OCR bootstrap, with the existing curation flow handling cleanup.
 - A single cell can contain:
   - base description text
   - symbolic affixes such as `+5H - 2S - 3B`
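The symbolic affixes mentioned in the hunk above (tokens like `+5H - 2S - 3B`) follow a simple sign/amount/code shape that a small regex pass could normalize. A minimal sketch: the function name `parse_affixes`, the `AFFIX_CODES` mapping, and the meanings assigned to `H`/`S`/`B` are all illustrative assumptions, not taken from the repo.

```python
import re

# Hypothetical code-to-field mapping; the real codes and their meanings
# live in the planning docs, not here.
AFFIX_CODES = {"H": "hits", "S": "stun_rounds", "B": "bleed_per_round"}

# Matches one affix token: a sign, a number, and a single letter code.
AFFIX_RE = re.compile(r"([+-])\s*(\d+)\s*([A-Z])")

def parse_affixes(text):
    """Parse symbolic affixes like '+5H - 2S - 3B' into normalized fields,
    keeping the raw text alongside them (never discard the original)."""
    effects = {}
    for sign, amount, code in AFFIX_RE.findall(text):
        field = AFFIX_CODES.get(code, code)  # unknown codes pass through
        value = int(amount) * (1 if sign == "+" else -1)
        effects[field] = effects.get(field, 0) + value
    return {"raw": text, "effects": effects}
```

Whether `- 2S` really means a negative quantity or is just a separator is a guess here; the parser keeps `raw` precisely so a wrong normalization can be revisited later.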
@@ -279,7 +279,7 @@ Current import flow:
 4. Parse symbolic affixes for both the base result and any branch affix payloads into `critical_effect`.
 5. Return the base result plus ordered branches and parsed affix effects through the web critical lookup.
 6. Gradually enrich prose-derived effects such as death, blindness, paralysis, limb loss, initiative changes, and item breakage.
-7. Route image PDFs like `Void.pdf` through OCR before the same parser.
+7. Route image PDFs like `Void.pdf` through OCR bootstrap before the same downstream parser and curation flow.
 
 The important design decision is: never throw away the original text. The prose is too irregular to rely on normalized fields alone.
 
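The "never throw away the original text" rule from the second hunk can be expressed directly in the record shape: raw cell text is a required field, and parsed effects are additive enrichment layered on top. A minimal sketch, assuming a hypothetical `CriticalEntry` type (the real schema is not shown in this diff):

```python
from dataclasses import dataclass, field

@dataclass
class CriticalEntry:
    # The raw cell text is always retained; normalized effects are
    # best-effort enrichment (step 6), never a replacement for it.
    raw_text: str
    effects: dict = field(default_factory=dict)

    def enrich(self, **parsed):
        """Layer parsed effects on top without touching raw_text."""
        self.effects.update(parsed)
        return self
```

This keeps the irregular prose recoverable even when a later, better parser wants to re-derive the normalized fields from scratch.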