Update phase 7 OCR planning docs

2026-03-18 02:24:31 +01:00
parent 7e5a6516a4
commit 768fffcf7d
2 changed files with 27 additions and 5 deletions

@@ -303,11 +303,33 @@ Phase-6 notes:
 - the web critical lookup now returns and renders parsed affix effects alongside the raw affix text
 - prose-derived effects remain future work
-### Phase 7: OCR and Manual Fallback
+### Phase 7: OCR Bootstrap for Curation
 - support image-based PDFs such as `Void.pdf`
-- route image-based sources through OCR or curated manual input
-- keep the same post-extraction parsing contract where possible
+- bootstrap scanned tables through OCR while keeping the existing curation flow as the fallback
+- keep the same downstream parsing and load contract where practical
+#### Validation summary:
+- `Void.pdf` is image-only; text extraction does not produce usable content
+- OCR on rendered page images does recover the title, `A-E` header row, all 19 expected roll bands, the `Key:` footer, and most prose
+- OCR remains weakest on symbol-heavy affix notation and occasional glyph confusion such as `C` -> `Cc`
+- because of that, phase 7 should be implemented as OCR bootstrap for curation, not as a separate manual-transcription feature
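A check of this kind can be sketched in a few lines. The following is a minimal, language-neutral illustration in Python (the import tool itself is C#), assuming the OCR engine emits Tesseract-style TSV with `level`, `left`, `top`, `width`, `height`, `conf`, and `text` columns. The function names `parse_tsv` and `missing_roll_bands` are hypothetical and do not exist in the repository.

```python
import csv
import io

# The 19 roll bands expected for a standard table, as listed in the plan.
EXPECTED_BANDS = [
    "01-05", "06-10", "11-15", "16-20", "21-35", "36-45", "46-50",
    "51-55", "56-60", "61-65", "66", "67-70", "71-75", "76-80",
    "81-85", "86-90", "91-95", "96-99", "100",
]


def parse_tsv(tsv_text):
    """Return word-level fragments (text plus pixel box) from Tesseract-style TSV."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    fragments = []
    for row in reader:
        text = (row.get("text") or "").strip()
        # Level 5 rows are individual words; other levels describe layout blocks.
        if row.get("level") != "5" or not text:
            continue
        fragments.append({
            "text": text,
            "left": int(row["left"]),
            "top": int(row["top"]),
            "width": int(row["width"]),
            "height": int(row["height"]),
            "conf": float(row["conf"]),
        })
    return fragments


def missing_roll_bands(fragments):
    """List expected roll bands that no OCR word matched exactly."""
    seen = {fragment["text"] for fragment in fragments}
    return [band for band in EXPECTED_BANDS if band not in seen]
```

An empty result from `missing_roll_bands` would correspond to the "all 19 expected roll bands" observation; exact string matching is a simplification and would miss labels that OCR reads with a different dash glyph.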
+#### Final implementation plan:
+1. Add `void` to the manifest as `family: standard`, `extractionMethod: ocr`, and extend `src/RolemasterDb.ImportTool/CriticalImportManifestEntry.cs` with an optional `AxisTemplateSlug`. For Void, that value should point at a built-in standard template derived from `mana`.
+2. Introduce a canonical extracted-source model, for example `ExtractedCriticalSource`, containing page geometries, positioned text fragments, extractor metadata, and coordinate/render profile metadata. Refactor `src/RolemasterDb.ImportTool/CriticalImportCommandRunner.cs` so extraction dispatches by `ExtractionMethod` instead of always calling `pdftohtml`.
+3. Move the current XML path behind a dedicated XML extractor implementation rather than letting the command runner own XML extraction directly. Existing XML-backed tables should remain behaviorally unchanged.
+4. Implement an OCR extractor for scanned PDFs like `Void.pdf`. It should render page PNGs with Poppler, run OCR with Tesseract TSV output, parse the TSV into canonical fragments, and persist the raw OCR diagnostics as artifacts.
+5. Add explicit external tool discovery/configuration for the Poppler and Tesseract executables instead of assuming bare command names on `PATH` are always safe. The OCR path depends on deterministic rasterization and OCR invocation.
+6. Add a built-in standard-table axis template based on `Mana` rather than trying to rediscover `Void` structure from noisy OCR. The template should hard-code columns `A-E` and these 19 roll bands: `01-05`, `06-10`, `11-15`, `16-20`, `21-35`, `36-45`, `46-50`, `51-55`, `56-60`, `61-65`, `66`, `67-70`, `71-75`, `76-80`, `81-85`, `86-90`, `91-95`, `96-99`, `100`.
+7. Build a `StandardOcrBootstrapper` for template-driven standard tables. OCR should be used only to find anchors such as the `A-E` header row, the left-column roll labels, and the footer/key boundary, then interpolate the full 95-cell grid from the template and assign OCR fragments into those cells.
+8. Refactor parsing so the standard parser can operate on canonical fragments plus a supplied grid, not just raw XML text. The OCR path should reuse the existing cell-to-result parsing, branch splitting, affix parsing, validation, and load behavior after grid assignment.
+9. Fix the coordinate-space seam explicitly. The current image pipeline assumes XML-space coordinates with a fixed render scale; OCR fragments come from rendered-page pixel coordinates. Phase 7 should carry extractor-provided coordinate metadata so source bounds, page geometry, and crop artifacts remain correct for both XML and OCR tables.
+10. Keep validation strict on structure and permissive on OCR text quality. The import should fail only if `Void` cannot be turned into a complete, croppable 95-cell standard table with valid source bounds. OCR misreads inside cells should become warnings, while the raw OCR text is still loaded into SQLite with `IsCurated = false` so the existing curation UI can refine it.
+11. Extend artifacts in `src/RolemasterDb.ImportTool/ImportArtifactPaths.cs` and `src/RolemasterDb.ImportTool/ImportArtifactWriter.cs` so OCR imports persist both the raw OCR payload and the normalized fragments. Keep `fragments.json` as the canonical debug view and add an OCR-specific artifact such as `source.ocr.tsv`.
+12. Add tests that do not depend on live OCR. The current manifest test in `src/RolemasterDb.ImportTool.Tests/StandardCriticalTableParserIntegrationTests.cs` assumes every enabled table is XML; that will need to become extraction-method-aware. Add checked-in OCR fixtures for `Void`, then cover anchor detection, template interpolation, 95-cell assignment, representative `Void` cells, and a full load path.
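The template-driven grid in steps 6 and 7 (5 columns by 19 roll bands, 95 cells) can be illustrated with a short sketch. It is written in Python for brevity and is not the planned `StandardOcrBootstrapper`; every function name here is hypothetical. It assumes the anchors have already been detected as pixel coordinates, that the first roll label was found, that columns `A-E` are equally wide, and that missing row tops can be filled by linear interpolation between known neighbours. A real table may violate any of these assumptions.

```python
from bisect import bisect_right

COLUMNS = ["A", "B", "C", "D", "E"]
ROLL_BANDS = [
    "01-05", "06-10", "11-15", "16-20", "21-35", "36-45", "46-50",
    "51-55", "56-60", "61-65", "66", "67-70", "71-75", "76-80",
    "81-85", "86-90", "91-95", "96-99", "100",
]


def interpolate_row_tops(found_tops, footer_top):
    """Fill in tops for bands OCR missed, interpolating between known anchors."""
    known = [(i, found_tops[b]) for i, b in enumerate(ROLL_BANDS) if b in found_tops]
    if not known or known[0][0] != 0:
        raise ValueError("this sketch requires the first roll band anchor")
    known.append((len(ROLL_BANDS), footer_top))  # footer closes the last row
    tops = []
    for (i0, y0), (i1, y1) in zip(known, known[1:]):
        step = (y1 - y0) / (i1 - i0)
        tops.extend(y0 + step * k for k in range(i1 - i0))
    return tops


def build_grid(found_tops, grid_left, grid_right, footer_top):
    """Return (row_edges, col_edges) describing the full 19 x 5 grid."""
    row_edges = interpolate_row_tops(found_tops, footer_top) + [footer_top]
    col_width = (grid_right - grid_left) / len(COLUMNS)
    col_edges = [grid_left + col_width * i for i in range(len(COLUMNS) + 1)]
    return row_edges, col_edges


def assign_fragments(fragments, row_edges, col_edges):
    """Place each fragment in the cell that contains its centre point."""
    cells = {(band, col): [] for band in ROLL_BANDS for col in COLUMNS}
    for frag in fragments:
        cx = frag["left"] + frag["width"] / 2
        cy = frag["top"] + frag["height"] / 2
        row = bisect_right(row_edges, cy) - 1
        col = bisect_right(col_edges, cx) - 1
        if 0 <= row < len(ROLL_BANDS) and 0 <= col < len(COLUMNS):
            cells[(ROLL_BANDS[row], COLUMNS[col])].append(frag["text"])
    return cells
```

Because the cell dictionary is created from the template rather than from OCR output, it always has 95 keys; fragments outside the grid (title, header row, `Key:` footer) are simply not assigned. That matches the intent of being strict on structure while tolerating noisy text inside cells.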
 ## Current CLI

@@ -25,7 +25,7 @@ The PDFs are not one uniform table shape. I found three families:
 There are also extraction constraints:
 - Most PDFs are text extractable with `pdftohtml -xml`.
-- `Void.pdf` appears image-based and will need OCR or manual transcription.
+- `Void.pdf` appears image-based and will need OCR bootstrap, with the existing curation flow handling cleanup.
 - A single cell can contain:
   - base description text
   - symbolic affixes such as `+5H - 2S - 3B`
@@ -279,7 +279,7 @@ Current import flow:
 4. Parse symbolic affixes for both the base result and any branch affix payloads into `critical_effect`.
 5. Return the base result plus ordered branches and parsed affix effects through the web critical lookup.
 6. Gradually enrich prose-derived effects such as death, blindness, paralysis, limb loss, initiative changes, and item breakage.
-7. Route image PDFs like `Void.pdf` through OCR before the same parser.
+7. Route image PDFs like `Void.pdf` through OCR bootstrap before the same downstream parser and curation flow.
 The important design decision is: never throw away the original text. The prose is too irregular to rely on normalized fields alone.