diff --git a/.gitignore b/.gitignore
index 0808c4a..f0203f0 100644
--- a/.gitignore
+++ b/.gitignore
@@ -274,6 +274,10 @@ ServiceFabricBackup/
 *.ldf
 *.ndf
+# SQLite transient sidecars
+*.db-shm
+*.db-wal
+
 # Business Intelligence projects
 *.rdl.data
 *.bim.layout
diff --git a/docs/critical_import_tool.md b/docs/critical_import_tool.md
new file mode 100644
index 0000000..5db628f
--- /dev/null
+++ b/docs/critical_import_tool.md
@@ -0,0 +1,500 @@
# Critical Import Tool

## Purpose

The critical import tool migrates Rolemaster critical-table source PDFs into the SQLite database used by the web app.

The tool is intentionally separate from the web application startup path. Critical data needs to be re-imported repeatedly while the extraction and parsing logic evolves, so the import workflow must be:

- explicit
- repeatable
- debuggable
- able to rebuild importer-managed data without resetting the entire application

The tool currently lives in `src/RolemasterDb.ImportTool` and operates against the same SQLite schema used by the web app.
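
The last requirement, rebuilding importer-managed data without resetting everything else, amounts to "delete one table's subtree, reload it, commit once". A minimal Python/`sqlite3` sketch of that contract follows; the single-table schema and column names here are invented for illustration, and the real tool is a .NET console app using EF Core:

```python
# Sketch of the "targeted rebuild" contract: resetting and reloading one
# critical table's rows must leave unrelated rows untouched, and running
# the rebuild twice must end in the same state.
# NOTE: schema and column names are illustrative, not the app's real schema.
import sqlite3

def rebuild_table(conn, slug, results):
    """Delete one table's subtree and reload it inside a single transaction."""
    with conn:  # commits on success, rolls back on any error
        conn.execute("DELETE FROM critical_result WHERE table_slug = ?", (slug,))
        conn.executemany(
            "INSERT INTO critical_result (table_slug, band, severity, text) "
            "VALUES (?, ?, ?, ?)",
            [(slug, band, sev, text) for band, sev, text in results],
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE critical_result (table_slug TEXT, band TEXT, severity TEXT, text TEXT)"
)
conn.execute("INSERT INTO critical_result VALUES ('krush', '01-05', 'A', '...')")  # unrelated row

rows = [("71-75", "A", "Blow falls on lower leg.")]
rebuild_table(conn, "slash", rows)
rebuild_table(conn, "slash", rows)  # repeatable: same end state, no duplicates

print(conn.execute(
    "SELECT table_slug, COUNT(*) FROM critical_result "
    "GROUP BY table_slug ORDER BY table_slug"
).fetchall())  # -> [('krush', 1), ('slash', 1)]
```

The point of the sketch is the shape of the guarantee, not the schema: the unrelated `krush` row survives both rebuilds, and the rebuild is idempotent.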

## Goals

The importer is designed around the following requirements:

- reset and reload critical data without touching unrelated tables
- preserve source fidelity while still producing structured lookup data
- make parsing failures visible before bad data reaches SQLite
- keep intermediate artifacts on disk for inspection
- support iterative parser development one table at a time

## Current Scope

The current implementation supports:

- explicit CLI commands for reset, extraction, and import
- manifest-driven source selection
- `standard` critical tables with columns `A-E`
- XML-based extraction using `pdftohtml -xml`
- geometry-based parsing for `Slash.pdf`
- transactional loading into SQLite

The current implementation does not yet support:

- variant-column critical tables
- grouped variant tables
- OCR/image-based PDFs such as `Void.pdf`
- normalized `critical_branch` population
- normalized `critical_effect` population
- automatic confidence scoring beyond validation errors

## High-Level Architecture

The importer workflow is:

1. Resolve a table entry from the manifest.
2. Extract the source PDF into an artifact format.
3. Parse the extracted artifact into an in-memory table model.
4. Write debug artifacts to disk.
5. Validate the parsed result.
6. If validation succeeds, load the parsed data into SQLite in a transaction.

The importer uses the same EF Core context and domain model as the web app, but it owns the critical-data population flow.

## Implementation Phases

### Phase 1: Initial Importer and Text Extraction

Phase 1 established the first end-to-end workflow:

- a dedicated console project
- `CommandLineParser`-based verbs
- a table manifest
- transactional reset/load commands
- a first parser for `Slash.pdf`

#### Phase 1 command surface

Phase 1 introduced these verbs:

- `reset criticals`
- `extract <slug>`
- `load <slug>`
- `import <slug>`

#### Phase 1 extraction approach

The initial version used `pdftotext -layout` to create a flattened text artifact. The parser then tried to reconstruct:

- column boundaries from the `A-E` header line
- roll-band rows from labels such as `71-75`
- cell contents by slicing monospaced text blocks

#### Phase 1 outcome

Phase 1 proved that the import loop and database load path worked, but it also exposed a critical reliability problem: flattened text was not a safe source format for these PDFs.

#### Phase 1 failure mode

The first serious regression was seen in `Slash.pdf`:

- lookup target: `slash`, severity `A`, roll `72`
- expected band: `71-75`
- broken result from the text-based parser: content from `76-80` mixed with stray characters from severity `B`

That failure showed the core problem with `pdftotext -layout`: it discards the original page geometry and forces the importer to guess row and column structure from a lossy text layout.

Because of that, Phase 1 is historically important, but it is not the recommended foundation for further parser development.

### Phase 2: XML Geometry-Based Parsing

Phase 2 replaced the flattened-text pipeline with a geometry-aware pipeline based on `pdftohtml -xml`.

#### Why Phase 2 was necessary

The PDFs are still text-based, but the text needs to be parsed with positional information intact. The XML output produced by `pdftohtml` preserves:

- page number
- `top`
- `left`
- `width`
- `height`
- text content

That positional data makes it possible to assign fragments to rows and columns based on geometry instead of guessing from flattened text lines.

#### Phase 2 extraction format

The importer now extracts to XML instead of plain text:

- extraction tool: `pdftohtml -xml -i -noframes`
- artifact file: `source.xml`

#### Phase 2 parser model

The parser now works in these stages:

1. Load all `<text>` fragments from the XML.
2. Detect the standard `A-E` header row.
3. Detect roll-band labels on the left margin.
4. Build row bands from the vertical positions of those roll labels.
5. Build column boundaries from the horizontal centers of the `A-E` header fragments.
6. Assign each text fragment to a row by `top`.
7. Assign each text fragment to a column by horizontal position.
8. Reconstruct each cell from ordered fragments.
9. Split cell content into description lines and affix-like lines.
10. Validate the result before touching SQLite.

#### Phase 2 reliability improvement

This phase fixed the original `Slash / A / 72` corruption. The same lookup now resolves to:

- band `71-75`
- description `Blow falls on lower leg. Slash tendons. Poor sucker.`

The important change is not only that the current output is correct, but that the importer now fails fast on structural ambiguity instead of silently loading corrupted rows.

## Planned Future Phases

The current architecture is intended to support additional phases:

### Phase 3: Broader Table Coverage

- add more `standard` critical PDFs
- expand the manifest
- verify parser stability across more source layouts

### Phase 4: Variant and Grouped Tables

- support `variant_column` tables such as `Large Creature - Weapon.pdf`
- support `grouped_variant` tables such as `Large Creature - Magic.pdf`
- add parser strategies for additional table families

### Phase 5: Conditional Branch Extraction

- split branch-heavy cells into `critical_branch`
- preserve the base cell text and branch text separately
- support branch conditions such as `with helmet` and `w/o leg greaves`

### Phase 6: Effect Normalization

- parse symbolic affix lines into normalized effects
- populate `critical_effect`
- gradually enrich prose-derived effects over time

### Phase 7: OCR and Manual Fallback

- support image-based PDFs such as `Void.pdf`
- route image-based sources through OCR or curated manual input
- keep the same post-extraction parsing contract where
possible

## Current CLI

The tool uses `CommandLineParser` and currently exposes these verbs:

### `reset criticals`

Deletes importer-managed critical data from SQLite.

Use this when:

- you want to clear imported critical data
- you want to rerun a fresh import
- you need to verify the rebuild path from an empty critical-table state

Example:

```powershell
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- reset criticals
```

### `extract <slug>`

Resolves a table from the manifest and writes the extraction artifact to disk.

Example:

```powershell
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- extract slash
```

### `load <slug>`

Reads the extraction artifact, parses it, writes debug artifacts, validates the result, and loads SQLite if validation succeeds.

Example:

```powershell
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- load slash
```

### `import <slug>`

Runs extraction followed by load.

Example:

```powershell
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- import slash
```

## Manifest

The importer manifest is stored at:

- `sources/critical-import-manifest.json`

Each entry declares:

- `slug`
- `displayName`
- `family`
- `extractionMethod`
- `pdfPath`
- `enabled`

The manifest is intentionally the control point for enabling importer support one table at a time.

## Artifact Layout

Artifacts are written under:

- `artifacts/import/critical/<slug>/`

The current artifact set is:

### `source.xml`

The raw XML extraction output from `pdftohtml`.

Use this when:

- checking whether text is present in the PDF
- inspecting original `top` and `left` coordinates
- diagnosing row/column misassignment

### `fragments.json`

A normalized list of parsed text fragments with page and position metadata.

Use this when:

- comparing raw XML to the importer’s internal fragment model
- confirming that specific fragments were loaded correctly
- debugging Unicode or whitespace normalization issues

### `parsed-cells.json`

The reconstructed cells after geometry-based row/column assignment.

Use this when:

- validating a specific row and column
- checking whether a fragment was assigned to the correct cell
- confirming description and affix splitting

### `validation-report.json`

The validation result for the parsed table.

This includes:

- overall validity
- validation errors
- row count
- cell count

Use this when:

- a `load` command fails
- a parser change introduces ambiguity
- you need to confirm that the importer refused to write SQLite data

## Standard Table Parsing Strategy

The current `standard` parser is designed for tables shaped like `Slash.pdf`:

- columns: `A-E`
- rows: roll bands such as `01-05`, `71-75`, `100`
- cell contents: prose, symbolic affixes, and sometimes conditional branch lines

### Header Detection

The parser searches the XML fragments for a row containing exactly:

- `A`
- `B`
- `C`
- `D`
- `E`

Those positions define the standard-table column anchors.

### Row Detection

The parser searches the left margin below the header for roll-band labels, for example:

- `01-05`
- `66`
- `251+`

Those vertical positions define the row anchors.

### Row Bands

The parser derives each row’s vertical range from the midpoint between adjacent roll-band anchors.

That prevents one row from drifting into the next when text wraps over multiple visual lines.

### Column Assignment

Each text fragment is assigned to the nearest column band based on horizontal center position.

This is the core reliability improvement over the phase-1 text slicing approach.

### Line Reconstruction

Fragments inside a cell are grouped into lines by close `top` values and then ordered by `left`.

This produces a stable line list even when PDF text is broken into multiple fragments.

### Description vs Affix Splitting

The parser classifies lines as:

- description-like prose
- affix-like notation

Affix-like lines include:

- `+...`
- symbolic lines using the critical glyphs
- branch-like affix lines such as `with leg greaves: +2H - ...`

The current implementation stores:

- `RawCellText`
- `DescriptionText`
- `RawAffixText`

It does not yet normalize branches or effects into separate tables.
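
The row-band and column-assignment rules described above can be sketched compactly. This is an illustrative Python sketch, not the parser's actual C# code; the anchor positions and fragment geometry below are invented example values, and only the `top`/`left`/`width` attribute names mirror the `pdftohtml` XML output:

```python
# Sketch of the geometry rules described above:
# - a fragment's row is chosen by comparing its `top` against the midpoints
#   between adjacent roll-band label anchors
# - its column is the A-E header fragment whose horizontal center is nearest
# NOTE: all positions here are invented example values.

def row_index(top, row_anchors):
    """row_anchors: sorted `top` values of the roll-band labels."""
    for i in range(len(row_anchors) - 1):
        if top < (row_anchors[i] + row_anchors[i + 1]) / 2:
            return i
    return len(row_anchors) - 1

def column_index(center, column_centers):
    """column_centers: horizontal centers of the A-E header fragments."""
    return min(range(len(column_centers)),
               key=lambda i: abs(column_centers[i] - center))

row_anchors = [600, 640, 680]               # e.g. labels '66', '71-75', '76-80'
column_centers = [100, 200, 300, 400, 500]  # headers A, B, C, D, E

# A wrapped fragment that starts slightly below its row label still lands in
# the '71-75' row because its top (652) is above the 71-75/76-80 midpoint (660).
left, width, top = 85, 30, 652
print(row_index(top, row_anchors), column_index(left + width / 2, column_centers))  # -> 1 0
```

The midpoint rule is what keeps multi-line cell text from drifting into the next band, which was exactly the failure mode of the flattened-text approach.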

## Validation Rules

The current validation pass is intentionally strict.

At minimum, a valid `standard` table must satisfy:

- a detectable `A-E` header row exists
- roll-band labels are found
- each detected row produces content for all five columns
- total parsed cell count matches `row_count * 5`

If validation fails:

- artifacts are still written
- SQLite load is aborted
- the command returns an error

This design is deliberate. It is safer to reject ambiguous extraction than to load a nearly-correct but wrong lookup table.

## Database Load Behavior

The loader is transactional.

The current load path:

1. ensures the SQLite database exists
2. deletes the existing subtree for the targeted critical table
3. inserts:
   - `critical_table`
   - `critical_column`
   - `critical_roll_band`
   - `critical_result`
4. commits only after the full table is saved

This means importer iterations can target one table without resetting unrelated database content.

## Interaction With Web App Startup

The web application no longer auto-seeds critical starter data on startup.

Startup still ensures the database exists and seeds attack starter data, but critical-table population is now owned by the importer.

This separation is important because:

- importer iterations are frequent
- parser logic is still evolving
- startup should not silently repopulate critical data behind the tool’s back

## Current Code Map

Important files in the current implementation:

- `src/RolemasterDb.ImportTool/Program.cs`
  - CLI entry point
- `src/RolemasterDb.ImportTool/CriticalImportCommandRunner.cs`
  - command orchestration
- `src/RolemasterDb.ImportTool/CriticalImportLoader.cs`
  - transactional SQLite load/reset behavior
- `src/RolemasterDb.ImportTool/CriticalImportManifestLoader.cs`
  - manifest loading
- `src/RolemasterDb.ImportTool/PdfXmlExtractor.cs`
  - XML extraction via `pdftohtml`
- `src/RolemasterDb.ImportTool/ImportArtifactWriter.cs`
  - artifact output
- `src/RolemasterDb.ImportTool/Parsing/StandardCriticalTableParser.cs`
  - standard table geometry parser
- `src/RolemasterDb.ImportTool/Parsing/XmlTextFragment.cs`
  - positioned text fragment model
- `src/RolemasterDb.ImportTool/Parsing/ParsedCriticalCellArtifact.cs`
  - debug cell artifact model
- `src/RolemasterDb.ImportTool/Parsing/ImportValidationReport.cs`
  - validation output model

## Adding a New Table

The recommended process for onboarding a new table is:

1. Add a manifest entry.
2. Run `extract <slug>`.
3. Inspect `source.xml`.
4. Run `load <slug>`.
5. Inspect `validation-report.json` and `parsed-cells.json`.
6. If validation succeeds, spot-check SQLite output.
7. If validation fails, adjust the parser or add a family-specific parser strategy before retrying.

## Debugging Guidance

If a table imports incorrectly, inspect artifacts in this order:

1. `validation-report.json`
2. `parsed-cells.json`
3. `fragments.json`
4. `source.xml`

That order usually answers the key questions fastest:

- did validation fail?
- which row/column is wrong?
- were fragments assigned incorrectly?
- was the extraction itself already malformed?

## Reliability Position

The current importer should be understood as:

- reliable enough for geometry-based `standard` table iteration
- much safer than the old flattened-text approach
- still evolving toward broader family coverage and deeper normalization

The key design rule going forward is:

- do not silently load ambiguous data

The importer should always prefer:

- preserving source fidelity
- writing review artifacts
- failing validation

over:

- guessing
- auto-correcting without evidence
- loading nearly-correct but structurally wrong critical results