Update phase 7 OCR planning docs
@@ -303,11 +303,33 @@ Phase-6 notes:
 - the web critical lookup now returns and renders parsed affix effects alongside the raw affix text
 - prose-derived effects remain future work
 
-### Phase 7: OCR and Manual Fallback
+### Phase 7: OCR Bootstrap for Curation
 
 - support image-based PDFs such as `Void.pdf`
-- route image-based sources through OCR or curated manual input
-- keep the same post-extraction parsing contract where possible
+- bootstrap scanned tables through OCR while keeping the existing curation flow as the fallback
+- keep the same downstream parsing and load contract where practical
 
+#### Validation summary:
+
+- `Void.pdf` is image-only; text extraction does not produce usable content
+- OCR on rendered page images does recover the title, `A-E` header row, all 19 expected roll bands, the `Key:` footer, and most prose
+- OCR remains weakest on symbol-heavy affix notation and occasional glyph confusion such as `C` -> `Cc`
+- because of that, phase 7 should be implemented as OCR bootstrap for curation, not as a separate manual-transcription feature
+
+#### Final implementation plan:
+
+1. Add `void` to the manifest as `family: standard`, `extractionMethod: ocr`, and extend `src/RolemasterDb.ImportTool/CriticalImportManifestEntry.cs` with an optional `AxisTemplateSlug`. For Void, that value should point at a built-in standard template derived from `mana`.
+2. Introduce a canonical extracted-source model, for example `ExtractedCriticalSource`, containing page geometries, positioned text fragments, extractor metadata, and coordinate/render profile metadata. Refactor `src/RolemasterDb.ImportTool/CriticalImportCommandRunner.cs` so extraction dispatches by `ExtractionMethod` instead of always calling `pdftohtml`.
+3. Move the current XML path behind a dedicated XML extractor implementation rather than letting the command runner own XML extraction directly. Existing XML-backed tables should remain behaviorally unchanged.
+4. Implement an OCR extractor for scanned PDFs like `Void.pdf`. It should render page PNGs with Poppler, run OCR with Tesseract TSV output, parse the TSV into canonical fragments, and persist the raw OCR diagnostics as artifacts.
+5. Add explicit external tool discovery/configuration for the Poppler and Tesseract executables instead of assuming bare command names on `PATH` are always safe. The OCR path depends on deterministic rasterization and OCR invocation.
+6. Add a built-in standard-table axis template based on `Mana` rather than trying to rediscover `Void` structure from noisy OCR. The template should hard-code columns `A-E` and these 19 roll bands: `01-05`, `06-10`, `11-15`, `16-20`, `21-35`, `36-45`, `46-50`, `51-55`, `56-60`, `61-65`, `66`, `67-70`, `71-75`, `76-80`, `81-85`, `86-90`, `91-95`, `96-99`, `100`.
+7. Build a `StandardOcrBootstrapper` for template-driven standard tables. OCR should be used only to find anchors such as the `A-E` header row, the left-column roll labels, and the footer/key boundary, then interpolate the full 95-cell grid from the template and assign OCR fragments into those cells.
+8. Refactor parsing so the standard parser can operate on canonical fragments plus a supplied grid, not just raw XML text. The OCR path should reuse the existing cell-to-result parsing, branch splitting, affix parsing, validation, and load behavior after grid assignment.
+9. Fix the coordinate-space seam explicitly. The current image pipeline assumes XML-space coordinates with a fixed render scale; OCR fragments come from rendered-page pixel coordinates. Phase 7 should carry extractor-provided coordinate metadata so source bounds, page geometry, and crop artifacts remain correct for both XML and OCR tables.
+10. Keep validation strict on structure and permissive on OCR text quality. The import should fail only if `Void` cannot be turned into a complete, cropable 95-cell standard table with valid source bounds. OCR misreads inside cells should become warnings, while the raw OCR text is still loaded into SQLite with `IsCurated = false` so the existing curation UI can refine it.
+11. Extend artifacts in `src/RolemasterDb.ImportTool/ImportArtifactPaths.cs` and `src/RolemasterDb.ImportTool/ImportArtifactWriter.cs` so OCR imports persist both the raw OCR payload and the normalized fragments. Keep `fragments.json` as the canonical debug view and add an OCR-specific artifact such as `source.ocr.tsv`.
+12. Add tests that do not depend on live OCR. The current manifest test in `src/RolemasterDb.ImportTool.Tests/StandardCriticalTableParserIntegrationTests.cs` assumes every enabled table is XML; that will need to become extraction-method-aware. Add checked-in OCR fixtures for `Void`, then cover anchor detection, template interpolation, 95-cell assignment, representative `Void` cells, and a full load path.
+
 ## Current CLI
 
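Plan items 6 and 7 (a hard-coded axis template plus anchor-driven grid interpolation) could be sketched roughly as below. This is an illustration in Python, not the C# the import tool is written in. The names `Cell`, `fill_missing`, and `build_grid` are hypothetical. Linear interpolation of missing anchors is an assumption; the plan only says the grid is interpolated from the template.

```python
from dataclasses import dataclass
from typing import List, Optional

# Axis template taken from plan item 6: columns A-E and 19 roll bands.
COLUMNS = ["A", "B", "C", "D", "E"]
ROLL_BANDS = [
    "01-05", "06-10", "11-15", "16-20", "21-35", "36-45", "46-50",
    "51-55", "56-60", "61-65", "66", "67-70", "71-75", "76-80",
    "81-85", "86-90", "91-95", "96-99", "100",
]


@dataclass(frozen=True)
class Cell:
    """One grid cell in rendered-page pixel coordinates (hypothetical type)."""
    roll_band: str
    column: str
    left: float
    top: float
    right: float
    bottom: float


def fill_missing(values: List[Optional[float]]) -> List[float]:
    """Fill anchors OCR did not find (None) by linear interpolation.

    Assumption: anchors are roughly evenly spaced between known neighbours.
    At either end, the two nearest known anchors are extrapolated.
    """
    known = [(i, v) for i, v in enumerate(values) if v is not None]
    if len(known) < 2:
        raise ValueError("need at least two detected anchors")
    filled = []
    for i, value in enumerate(values):
        if value is not None:
            filled.append(float(value))
            continue
        before = [k for k in known if k[0] < i]
        after = [k for k in known if k[0] > i]
        if before and after:
            (i0, v0), (i1, v1) = before[-1], after[0]
        elif before:
            (i0, v0), (i1, v1) = before[-2], before[-1]
        else:
            (i0, v0), (i1, v1) = after[0], after[1]
        filled.append(v0 + (v1 - v0) * (i - i0) / (i1 - i0))
    return filled


def build_grid(
    column_lefts: List[Optional[float]],
    table_right: float,
    row_tops: List[Optional[float]],
    footer_top: float,
) -> List[Cell]:
    """Combine the template with OCR anchors into a 19 x 5 = 95-cell grid."""
    if len(column_lefts) != len(COLUMNS) or len(row_tops) != len(ROLL_BANDS):
        raise ValueError("anchor lists must match the template axes")
    x_edges = fill_missing(column_lefts) + [float(table_right)]
    y_edges = fill_missing(row_tops) + [float(footer_top)]
    return [
        Cell(band, col, x_edges[c], y_edges[r], x_edges[c + 1], y_edges[r + 1])
        for r, band in enumerate(ROLL_BANDS)
        for c, col in enumerate(COLUMNS)
    ]
```

Because the axes come from the template, a missed roll label only costs positional accuracy for that row; the grid still has all 95 cells, which is what the strict structural validation in item 10 checks for.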
@@ -25,7 +25,7 @@ The PDFs are not one uniform table shape. I found three families:
 There are also extraction constraints:
 
 - Most PDFs are text extractable with `pdftohtml -xml`.
-- `Void.pdf` appears image-based and will need OCR or manual transcription.
+- `Void.pdf` appears image-based and will need OCR bootstrap, with the existing curation flow handling cleanup.
 - A single cell can contain:
 - base description text
 - symbolic affixes such as `+5H - 2S - 3B`
@@ -279,7 +279,7 @@ Current import flow:
 4. Parse symbolic affixes for both the base result and any branch affix payloads into `critical_effect`.
 5. Return the base result plus ordered branches and parsed affix effects through the web critical lookup.
 6. Gradually enrich prose-derived effects such as death, blindness, paralysis, limb loss, initiative changes, and item breakage.
-7. Route image PDFs like `Void.pdf` through OCR before the same parser.
+7. Route image PDFs like `Void.pdf` through OCR bootstrap before the same downstream parser and curation flow.
 
 The important design decision is: never throw away the original text. The prose is too irregular to rely on normalized fields alone.
 
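For the OCR route in flow step 7, the plan calls for parsing Tesseract TSV output into canonical fragments. A minimal sketch of that step follows, in Python for illustration only. `Fragment` and `parse_tesseract_tsv` are hypothetical names, and grouping words into one fragment per OCR line is an assumption about what "canonical fragments" means. The column names are the ones Tesseract writes in its TSV header.

```python
import csv
import io
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Fragment:
    """One OCR text line with its pixel bounding box (hypothetical type)."""
    page: int
    left: int
    top: int
    right: int
    bottom: int
    text: str
    min_conf: float  # lowest word confidence on the line


def parse_tesseract_tsv(tsv_text: str) -> List[Fragment]:
    """Group word-level TSV rows (level 5) into line-level fragments."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    lines: Dict[Tuple[int, int, int, int], Fragment] = {}
    for row in reader:
        text = (row.get("text") or "").strip()
        # Levels 1-4 are page/block/paragraph/line containers with no text.
        if row["level"] != "5" or not text:
            continue
        key = (
            int(row["page_num"]), int(row["block_num"]),
            int(row["par_num"]), int(row["line_num"]),
        )
        left, top = int(row["left"]), int(row["top"])
        right, bottom = left + int(row["width"]), top + int(row["height"])
        conf = float(row["conf"])
        frag = lines.get(key)
        if frag is None:
            lines[key] = Fragment(key[0], left, top, right, bottom, text, conf)
        else:
            # Grow the line's box to cover the new word and append its text.
            frag.left = min(frag.left, left)
            frag.top = min(frag.top, top)
            frag.right = max(frag.right, right)
            frag.bottom = max(frag.bottom, bottom)
            frag.text += " " + text
            frag.min_conf = min(frag.min_conf, conf)
    # Dict insertion order keeps fragments in Tesseract's reading order.
    return list(lines.values())
```

Keeping the lowest word confidence on each fragment would let low-quality lines surface as warnings rather than failures, in line with the "permissive on OCR text quality" rule, while the raw text is still kept for curation.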