Add import tool docs

2026-03-14 01:30:10 +01:00
parent 719355da90
commit be5c0a9b54
2 changed files with 504 additions and 0 deletions

# Critical Import Tool
## Purpose
The critical import tool exists to migrate Rolemaster critical-table source PDFs into the SQLite database used by the web app.
The tool is intentionally separate from the web application startup path. Critical data needs to be re-imported repeatedly while the extraction and parsing logic evolves, so the import workflow must be:
- explicit
- repeatable
- debuggable
- able to rebuild importer-managed data without resetting the entire application
The tool currently lives in `src/RolemasterDb.ImportTool` and operates against the same SQLite schema used by the web app.
## Goals
The importer is designed around the following requirements:
- reset and reload critical data without touching unrelated tables
- preserve source fidelity while still producing structured lookup data
- make parsing failures visible before bad data reaches SQLite
- keep intermediate artifacts on disk for inspection
- support iterative parser development one table at a time
## Current Scope
The current implementation supports:
- explicit CLI commands for reset, extraction, and import
- manifest-driven source selection
- `standard` critical tables with columns `A-E`
- XML-based extraction using `pdftohtml -xml`
- geometry-based parsing for `Slash.pdf`
- transactional loading into SQLite
The current implementation does not yet support:
- variant-column critical tables
- grouped variant tables
- OCR/image-based PDFs such as `Void.pdf`
- normalized `critical_branch` population
- normalized `critical_effect` population
- automatic confidence scoring beyond validation errors
## High-Level Architecture
The importer workflow is:
1. Resolve a table entry from the manifest.
2. Extract the source PDF into an artifact format.
3. Parse the extracted artifact into an in-memory table model.
4. Write debug artifacts to disk.
5. Validate the parsed result.
6. If validation succeeds, load the parsed data into SQLite in a transaction.
The importer uses the same EF Core context and domain model as the web app, but it owns the critical-data population flow.
## Implementation Phases
## Phase 1: Initial Importer and Text Extraction
Phase 1 established the first end-to-end workflow:
- a dedicated console project
- `CommandLineParser` based verbs
- a table manifest
- transactional reset/load commands
- a first parser for `Slash.pdf`
### Phase 1 command surface
Phase 1 introduced these verbs:
- `reset criticals`
- `extract <table>`
- `load <table>`
- `import <table>`
### Phase 1 extraction approach
The initial version used `pdftotext -layout` to create a flattened text artifact. The parser then tried to reconstruct:
- column boundaries from the `A-E` header line
- roll-band rows from labels such as `71-75`
- cell contents by slicing monospaced text blocks
### Phase 1 outcome
Phase 1 proved that the import loop and database load path worked, but it also exposed a critical reliability problem: flattened text was not a safe source format for these PDFs.
### Phase 1 failure mode
The first serious regression was seen in `Slash.pdf`:
- lookup target: `slash`, severity `A`, roll `72`
- expected band: `71-75`
- broken result from the text-based parser: content from `76-80` mixed with stray characters from severity `B`
That failure showed the core problem with `pdftotext -layout`: it discards the original page geometry and forces the importer to guess row and column structure from a lossy text layout.
Because of that, phase 1 is important historically, but it is not the recommended foundation for further parser development.
## Phase 2: XML Geometry-Based Parsing
Phase 2 replaced the flattened-text pipeline with a geometry-aware pipeline based on `pdftohtml -xml`.
### Why Phase 2 was necessary
The PDFs are still text-based, but the text needs to be parsed with positional information intact. The XML output produced by `pdftohtml` preserves:
- page number
- `top`
- `left`
- `width`
- `height`
- text content
That positional data makes it possible to assign fragments to rows and columns based on geometry instead of guessing from flattened text lines.
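The fragment model can be sketched as follows. The actual importer is C# (`XmlTextFragment.cs`); this Python sketch is only an illustration of the data shape, assuming the attribute names `pdftohtml -xml` emits on each `<text>` element (`top`, `left`, `width`, `height`):

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class TextFragment:
    page: int
    top: int
    left: int
    width: int
    height: int
    text: str

def load_fragments(xml_source: str) -> list[TextFragment]:
    """Collect every <text> element from pdftohtml -xml output,
    keeping the positional attributes the geometry parser relies on."""
    root = ET.fromstring(xml_source)
    fragments = []
    for page in root.iter("page"):
        page_number = int(page.get("number"))
        for node in page.iter("text"):
            fragments.append(TextFragment(
                page=page_number,
                top=int(node.get("top")),
                left=int(node.get("left")),
                width=int(node.get("width")),
                height=int(node.get("height")),
                text="".join(node.itertext()),  # flattens <b>/<i> children
            ))
    return fragments
```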
### Phase 2 extraction format
The importer now extracts to XML instead of plain text:
- extraction tool: `pdftohtml -xml -i -noframes`
- artifact file: `source.xml`
### Phase 2 parser model
The parser now works in these stages:
1. Load all `<text>` fragments from the XML.
2. Detect the standard `A-E` header row.
3. Detect roll-band labels on the left margin.
4. Build row bands from the vertical positions of those roll labels.
5. Build column boundaries from the horizontal centers of the `A-E` header fragments.
6. Assign each text fragment to a row by `top`.
7. Assign each text fragment to a column by horizontal position.
8. Reconstruct each cell from ordered fragments.
9. Split cell content into description lines and affix-like lines.
10. Validate the result before touching SQLite.
### Phase 2 reliability improvement
This phase fixed the original `Slash / A / 72` corruption. The same lookup now resolves to:
- band `71-75`
- description `Blow falls on lower leg. Slash tendons. Poor sucker.`
The important change is not only that the current output is correct, but that the importer now fails fast on structural ambiguity instead of silently loading corrupted rows.
## Planned Future Phases
The current architecture is intended to support additional phases:
### Phase 3: Broader Table Coverage
- add more `standard` critical PDFs
- expand the manifest
- verify parser stability across more source layouts
### Phase 4: Variant and Grouped Tables
- support `variant_column` tables such as `Large Creature - Weapon.pdf`
- support `grouped_variant` tables such as `Large Creature - Magic.pdf`
- add parser strategies for additional table families
### Phase 5: Conditional Branch Extraction
- split branch-heavy cells into `critical_branch`
- preserve the base cell text and branch text separately
- support branch conditions such as `with helmet` and `w/o leg greaves`
### Phase 6: Effect Normalization
- parse symbolic affix lines into normalized effects
- populate `critical_effect`
- gradually enrich prose-derived effects over time
### Phase 7: OCR and Manual Fallback
- support image-based PDFs such as `Void.pdf`
- route image-based sources through OCR or curated manual input
- keep the same post-extraction parsing contract where possible
## Current CLI
The tool uses `CommandLineParser` and currently exposes these verbs:
### `reset criticals`
Deletes importer-managed critical data from SQLite.
Use this when:
- you want to clear imported critical data
- you want to rerun a fresh import
- you need to verify the rebuild path from an empty critical-table state
Example:
```powershell
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- reset criticals
```
### `extract <table>`
Resolves a table from the manifest and writes the extraction artifact to disk.
Example:
```powershell
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- extract slash
```
### `load <table>`
Reads the extraction artifact, parses it, writes debug artifacts, validates the result, and loads SQLite if validation succeeds.
Example:
```powershell
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- load slash
```
### `import <table>`
Runs extraction followed by load.
Example:
```powershell
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- import slash
```
## Manifest
The importer manifest is stored at:
- `sources/critical-import-manifest.json`
Each entry declares:
- `slug`
- `displayName`
- `family`
- `extractionMethod`
- `pdfPath`
- `enabled`
The manifest is intentionally the control point for enabling importer support one table at a time.
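A manifest entry might look like the following sketch. The field names come from the list above; the wrapper shape and the example values (the `extractionMethod` string, the PDF path) are assumptions for illustration:

```json
{
  "tables": [
    {
      "slug": "slash",
      "displayName": "Slash",
      "family": "standard",
      "extractionMethod": "pdftohtml-xml",
      "pdfPath": "sources/pdfs/Slash.pdf",
      "enabled": true
    }
  ]
}
```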
## Artifact Layout
Artifacts are written under:
- `artifacts/import/critical/<slug>/`
The current artifact set is:
### `source.xml`
The raw XML extraction output from `pdftohtml`.
Use this when:
- checking whether text is present in the PDF
- inspecting original `top` and `left` coordinates
- diagnosing row/column misassignment
### `fragments.json`
A normalized list of parsed text fragments with page and position metadata.
Use this when:
- comparing raw XML to the importer's internal fragment model
- confirming that specific fragments were loaded correctly
- debugging Unicode or whitespace normalization issues
### `parsed-cells.json`
The reconstructed cells after geometry-based row/column assignment.
Use this when:
- validating a specific row and column
- checking whether a fragment was assigned to the correct cell
- confirming description and affix splitting
### `validation-report.json`
The validation result for the parsed table.
This includes:
- overall validity
- validation errors
- row count
- cell count
Use this when:
- a `load` command fails
- a parser change introduces ambiguity
- you need to confirm that the importer refused to write SQLite data
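As an illustration only, a failing report might look roughly like this; every property name here is hypothetical, since only the four categories above are documented:

```json
{
  "isValid": false,
  "errors": [
    "Row 71-75: column D produced no content"
  ],
  "rowCount": 21,
  "cellCount": 104
}
```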
## Standard Table Parsing Strategy
The current `standard` parser is designed for tables shaped like `Slash.pdf`:
- columns: `A-E`
- rows: roll bands such as `01-05`, `71-75`, `100`
- cell contents: prose, symbolic affixes, and sometimes conditional branch lines
### Header Detection
The parser searches the XML fragments for a row containing exactly:
- `A`
- `B`
- `C`
- `D`
- `E`
Those positions define the standard-table column anchors.
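Header detection can be sketched as below. This is a language-agnostic illustration (the real parser is `StandardCriticalTableParser.cs` in C#); fragments are represented as plain dicts, and the line-grouping tolerance is an assumed parameter:

```python
def find_header_anchors(fragments, tolerance=3):
    """Group fragments into visual lines by similar 'top', then look for
    a line containing exactly the labels A..E; return {label: center_x}."""
    labels = {"A", "B", "C", "D", "E"}
    lines = {}
    for f in fragments:
        # fragments within `tolerance` vertical units share a line bucket
        lines.setdefault(round(f["top"] / tolerance), []).append(f)
    for frags in lines.values():
        found = {f["text"].strip(): f["left"] + f["width"] / 2
                 for f in frags if f["text"].strip() in labels}
        if set(found) == labels:
            return found  # horizontal center of each column label
    return None
```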
### Row Detection
The parser searches the left margin below the header for roll-band labels, for example:
- `01-05`
- `66`
- `251+`
Those vertical positions define the row anchors.
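Roll-band label detection can be sketched with a pattern over the label shapes shown above (`01-05`, `66`, `251+`). The margin cutoff and the exact regex are illustrative assumptions, not the C# implementation:

```python
import re

# matches "66", "01-05", "100", "251+" but not prose fragments
ROLL_BAND = re.compile(r"^\d{2,3}(?:-\d{2,3}|\+)?$")

def find_roll_band_anchors(fragments, header_top, left_margin_max):
    """Collect roll-band labels below the header on the left margin,
    returning (label, top) pairs in vertical order."""
    anchors = [(f["text"].strip(), f["top"])
               for f in fragments
               if f["top"] > header_top
               and f["left"] < left_margin_max
               and ROLL_BAND.match(f["text"].strip())]
    return sorted(anchors, key=lambda a: a[1])
```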
### Row Bands
The parser derives each row's vertical range from the midpoint between adjacent roll-band anchors.
That prevents one row from drifting into the next when text wraps over multiple visual lines.
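The midpoint rule can be sketched as follows; this is an illustrative Python version of the idea, with the handling of the first band's upper edge and the final band's lower edge chosen by assumption:

```python
def build_row_bands(roll_anchors, page_bottom):
    """Turn sorted (label, top) anchors into (label, y_min, y_max) bands,
    cutting each boundary at the midpoint between adjacent anchors so
    wrapped text stays with its own row."""
    bands = []
    for i, (label, top) in enumerate(roll_anchors):
        y_min = bands[-1][2] if bands else top  # previous cut, or own top
        if i + 1 < len(roll_anchors):
            y_max = (top + roll_anchors[i + 1][1]) / 2
        else:
            y_max = page_bottom  # last band extends to the page bottom
        bands.append((label, y_min, y_max))
    return bands
```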
### Column Assignment
Each text fragment is assigned to the nearest column band based on horizontal center position.
This is the core reliability improvement over the phase-1 text slicing approach.
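Nearest-center assignment reduces to a one-liner in sketch form; `column_centers` is assumed to be the `{label: center_x}` map produced by header detection:

```python
def assign_column(fragment, column_centers):
    """Assign a fragment to the severity column whose header center is
    closest to the fragment's own horizontal center."""
    center = fragment["left"] + fragment["width"] / 2
    return min(column_centers,
               key=lambda label: abs(column_centers[label] - center))
```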
### Line Reconstruction
Fragments inside a cell are grouped into lines by close `top` values and then ordered by `left`.
This produces a stable line list even when PDF text is broken into multiple fragments.
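A sketch of the grouping step, with the `top` tolerance again an assumed parameter rather than the C# parser's actual value:

```python
def group_into_lines(cell_fragments, tolerance=3):
    """Group a cell's fragments into visual lines by close 'top' values,
    then order fragments within each line by 'left'."""
    lines = []
    for f in sorted(cell_fragments, key=lambda f: (f["top"], f["left"])):
        if lines and abs(f["top"] - lines[-1][0]["top"]) <= tolerance:
            lines[-1].append(f)  # same visual line as the previous fragment
        else:
            lines.append([f])    # start a new visual line
    return [" ".join(f["text"] for f in sorted(line, key=lambda f: f["left"]))
            for line in lines]
```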
### Description vs Affix Splitting
The parser classifies lines as:
- description-like prose
- affix-like notation
Affix-like lines include:
- `+...`
- symbolic lines using the critical glyphs
- branch-like affix lines such as `with leg greaves: +2H - ...`
The current implementation stores:
- `RawCellText`
- `DescriptionText`
- `RawAffixText`
It does not yet normalize branches or effects into separate tables.
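A heuristic classifier along these lines is enough for the split described above. The glyph set here is a placeholder (the real critical glyphs are not reproduced in this document), and the branch-prefix pattern is an assumption based on the `with ...:` / `w/o ...:` examples:

```python
import re

GLYPHS = set("\u2020\u2021\u2022")  # placeholder; the real glyph set differs

def is_affix_line(line: str) -> bool:
    """Heuristic: affix-like lines start with '+', use the symbolic
    glyphs, or look like 'condition: effect' branch lines."""
    stripped = line.strip()
    if stripped.startswith("+"):
        return True
    if any(ch in GLYPHS for ch in stripped):
        return True
    return bool(re.match(r"^(with|w/o)\b[^:]*:", stripped, re.IGNORECASE))
```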
## Validation Rules
The current validation pass is intentionally strict.
At minimum, a valid `standard` table must satisfy:
- a detectable `A-E` header row exists
- roll-band labels are found
- each detected row produces content for all five columns
- total parsed cell count matches `row_count * 5`
If validation fails:
- artifacts are still written
- SQLite load is aborted
- the command returns an error
This design is deliberate. It is safer to reject ambiguous extraction than to load a nearly-correct but wrong lookup table.
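The four checks above can be sketched as a pure function over the parsed structures; here `cells` is assumed to be keyed by `(row_label, column)`, which is an illustrative choice, not the importer's actual model:

```python
def validate_standard_table(header_found, row_labels, cells):
    """Apply the minimum structural checks for a 'standard' table and
    return (is_valid, errors) without touching the database."""
    errors = []
    if not header_found:
        errors.append("No A-E header row detected")
    if not row_labels:
        errors.append("No roll-band labels detected")
    for label in row_labels:
        columns = {col for (row, col) in cells if row == label}
        missing = {"A", "B", "C", "D", "E"} - columns
        if missing:
            errors.append(f"Row {label}: missing columns {sorted(missing)}")
    if len(cells) != len(row_labels) * 5:
        errors.append(f"Cell count {len(cells)} != {len(row_labels)} rows * 5")
    return (not errors, errors)
```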
## Database Load Behavior
The loader is transactional.
The current load path:
1. ensures the SQLite database exists
2. deletes the existing subtree for the targeted critical table
3. inserts:
- `critical_table`
- `critical_column`
- `critical_roll_band`
- `critical_result`
4. commits only after the full table is saved
This means importer iterations can target one table without resetting unrelated database content.
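The actual loader goes through EF Core in C#; the transactional delete-then-insert pattern it follows can be illustrated with raw SQLite. The column names and the cascade assumption below are hypothetical:

```python
import sqlite3

def load_table(db_path, slug, parsed):
    """Replace one critical table's subtree inside a single transaction:
    either the whole table lands, or nothing changes."""
    con = sqlite3.connect(db_path, isolation_level=None)  # manual transactions
    try:
        con.execute("BEGIN")
        # Delete only the targeted table's subtree.
        con.execute("DELETE FROM critical_table WHERE slug = ?", (slug,))
        # (child rows assumed cleared via ON DELETE CASCADE or explicit deletes)
        cur = con.execute(
            "INSERT INTO critical_table (slug, display_name) VALUES (?, ?)",
            (slug, parsed["displayName"]))
        table_id = cur.lastrowid
        for label in parsed["bands"]:
            con.execute(
                "INSERT INTO critical_roll_band (table_id, label) VALUES (?, ?)",
                (table_id, label))
        con.execute("COMMIT")  # nothing is visible until every insert succeeded
    except Exception:
        con.execute("ROLLBACK")
        raise
    finally:
        con.close()
```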
## Interaction With Web App Startup
The web application no longer auto-seeds critical starter data on startup.
Startup still ensures the database exists and seeds attack starter data, but critical-table population is now owned by the importer.
This separation is important because:
- importer iterations are frequent
- parser logic is still evolving
- startup should not silently repopulate critical data behind the tool's back
## Current Code Map
Important files in the current implementation:
- `src/RolemasterDb.ImportTool/Program.cs`
- CLI entry point
- `src/RolemasterDb.ImportTool/CriticalImportCommandRunner.cs`
- command orchestration
- `src/RolemasterDb.ImportTool/CriticalImportLoader.cs`
- transactional SQLite load/reset behavior
- `src/RolemasterDb.ImportTool/CriticalImportManifestLoader.cs`
- manifest loading
- `src/RolemasterDb.ImportTool/PdfXmlExtractor.cs`
- XML extraction via `pdftohtml`
- `src/RolemasterDb.ImportTool/ImportArtifactWriter.cs`
- artifact output
- `src/RolemasterDb.ImportTool/Parsing/StandardCriticalTableParser.cs`
- standard table geometry parser
- `src/RolemasterDb.ImportTool/Parsing/XmlTextFragment.cs`
- positioned text fragment model
- `src/RolemasterDb.ImportTool/Parsing/ParsedCriticalCellArtifact.cs`
- debug cell artifact model
- `src/RolemasterDb.ImportTool/Parsing/ImportValidationReport.cs`
- validation output model
## Adding a New Table
The recommended process for onboarding a new table is:
1. Add a manifest entry.
2. Run `extract <slug>`.
3. Inspect `source.xml`.
4. Run `load <slug>`.
5. Inspect `validation-report.json` and `parsed-cells.json`.
6. If validation succeeds, spot-check SQLite output.
7. If validation fails, adjust the parser or add a family-specific parser strategy before retrying.
## Debugging Guidance
If a table imports incorrectly, inspect artifacts in this order:
1. `validation-report.json`
2. `parsed-cells.json`
3. `fragments.json`
4. `source.xml`
That order usually answers the key questions fastest:
- did validation fail
- which row/column is wrong
- were fragments assigned incorrectly
- or was the extraction itself already malformed
## Reliability Position
The current importer should be understood as:
- reliable enough for geometry-based `standard` table iteration
- much safer than the old flattened-text approach
- still evolving toward broader family coverage and deeper normalization
The key design rule going forward is:
- do not silently load ambiguous data
The importer should always prefer:
- preserving source fidelity
- writing review artifacts
- failing validation
over:
- guessing
- auto-correcting without evidence
- loading nearly-correct but structurally wrong critical results