Add import tool docs

2026-03-14 01:30:10 +01:00
parent 719355da90
commit be5c0a9b54
2 changed files with 504 additions and 0 deletions

# Critical Import Tool
## Purpose
The critical import tool exists to migrate Rolemaster critical-table source PDFs into the SQLite database used by the web app.
The tool is intentionally separate from the web application startup path. Critical data needs to be re-imported repeatedly while the extraction and parsing logic evolves, so the import workflow must be:
- explicit
- repeatable
- debuggable
- able to rebuild importer-managed data without resetting the entire application
The tool currently lives in `src/RolemasterDb.ImportTool` and operates against the same SQLite schema used by the web app.
## Goals
The importer is designed around the following requirements:
- reset and reload critical data without touching unrelated tables
- preserve source fidelity while still producing structured lookup data
- make parsing failures visible before bad data reaches SQLite
- keep intermediate artifacts on disk for inspection
- support iterative parser development one table at a time
## Current Scope
The current implementation supports:
- explicit CLI commands for reset, extraction, and import
- manifest-driven source selection
- `standard` critical tables with columns `A-E`
- XML-based extraction using `pdftohtml -xml`
- geometry-based parsing for `Slash.pdf`
- transactional loading into SQLite
The current implementation does not yet support:
- variant-column critical tables
- grouped variant tables
- OCR/image-based PDFs such as `Void.pdf`
- normalized `critical_branch` population
- normalized `critical_effect` population
- automatic confidence scoring beyond validation errors
## High-Level Architecture
The importer workflow is:
1. Resolve a table entry from the manifest.
2. Extract the source PDF into an artifact format.
3. Parse the extracted artifact into an in-memory table model.
4. Write debug artifacts to disk.
5. Validate the parsed result.
6. If validation succeeds, load the parsed data into SQLite in a transaction.
The importer uses the same EF Core context and domain model as the web app, but it owns the critical-data population flow.
## Implementation Phases
## Phase 1: Initial Importer and Text Extraction
Phase 1 established the first end-to-end workflow:
- a dedicated console project
- `CommandLineParser` based verbs
- a table manifest
- transactional reset/load commands
- a first parser for `Slash.pdf`
### Phase 1 command surface
Phase 1 introduced these verbs:
- `reset criticals`
- `extract <table>`
- `load <table>`
- `import <table>`
### Phase 1 extraction approach
The initial version used `pdftotext -layout` to create a flattened text artifact. The parser then tried to reconstruct:
- column boundaries from the `A-E` header line
- roll-band rows from labels such as `71-75`
- cell contents by slicing monospaced text blocks
### Phase 1 outcome
Phase 1 proved that the import loop and database load path worked, but it also exposed a critical reliability problem: flattened text was not a safe source format for these PDFs.
### Phase 1 failure mode
The first serious regression was seen in `Slash.pdf`:
- lookup target: `slash`, severity `A`, roll `72`
- expected band: `71-75`
- broken result from the text-based parser: content from `76-80` mixed with stray characters from severity `B`
That failure showed the core problem with `pdftotext -layout`: it discards the original page geometry and forces the importer to guess row and column structure from a lossy text layout.
Because of that, phase 1 is important historically, but it is not the recommended foundation for further parser development.
## Phase 2: XML Geometry-Based Parsing
Phase 2 replaced the flattened-text pipeline with a geometry-aware pipeline based on `pdftohtml -xml`.
### Why Phase 2 was necessary
The PDFs are still text-based, but the text needs to be parsed with positional information intact. The XML output produced by `pdftohtml` preserves:
- page number
- `top`
- `left`
- `width`
- `height`
- text content
That positional data makes it possible to assign fragments to rows and columns based on geometry instead of guessing from flattened text lines.
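The fragment model can be sketched as follows. The actual importer is C# (`XmlTextFragment.cs`); this Python sketch is only an illustration of the data shape, assuming the attribute names `pdftohtml -xml` emits on each `<text>` element (`top`, `left`, `width`, `height`):

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class TextFragment:
    page: int
    top: int
    left: int
    width: int
    height: int
    text: str

def load_fragments(xml_source: str) -> list[TextFragment]:
    """Collect every <text> element from pdftohtml -xml output,
    keeping the positional attributes the geometry parser relies on."""
    root = ET.fromstring(xml_source)
    fragments = []
    for page in root.iter("page"):
        page_number = int(page.get("number"))
        for node in page.iter("text"):
            fragments.append(TextFragment(
                page=page_number,
                top=int(node.get("top")),
                left=int(node.get("left")),
                width=int(node.get("width")),
                height=int(node.get("height")),
                text="".join(node.itertext()),  # flattens <b>/<i> children
            ))
    return fragments
```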
### Phase 2 extraction format
The importer now extracts to XML instead of plain text:
- extraction tool: `pdftohtml -xml -i -noframes`
- artifact file: `source.xml`
### Phase 2 parser model
The parser now works in these stages:
1. Load all `<text>` fragments from the XML.
2. Detect the standard `A-E` header row.
3. Detect roll-band labels on the left margin.
4. Build row bands from the vertical positions of those roll labels.
5. Build column boundaries from the horizontal centers of the `A-E` header fragments.
6. Assign each text fragment to a row by `top`.
7. Assign each text fragment to a column by horizontal position.
8. Reconstruct each cell from ordered fragments.
9. Split cell content into description lines and affix-like lines.
10. Validate the result before touching SQLite.
### Phase 2 reliability improvement
This phase fixed the original `Slash / A / 72` corruption. The same lookup now resolves to:
- band `71-75`
- description `Blow falls on lower leg. Slash tendons. Poor sucker.`
The important change is not only that the current output is correct, but that the importer now fails fast on structural ambiguity instead of silently loading corrupted rows.
## Planned Future Phases
The current architecture is intended to support additional phases:
### Phase 3: Broader Table Coverage
- add more `standard` critical PDFs
- expand the manifest
- verify parser stability across more source layouts
### Phase 4: Variant and Grouped Tables
- support `variant_column` tables such as `Large Creature - Weapon.pdf`
- support `grouped_variant` tables such as `Large Creature - Magic.pdf`
- add parser strategies for additional table families
### Phase 5: Conditional Branch Extraction
- split branch-heavy cells into `critical_branch`
- preserve the base cell text and branch text separately
- support branch conditions such as `with helmet` and `w/o leg greaves`
### Phase 6: Effect Normalization
- parse symbolic affix lines into normalized effects
- populate `critical_effect`
- gradually enrich prose-derived effects over time
### Phase 7: OCR and Manual Fallback
- support image-based PDFs such as `Void.pdf`
- route image-based sources through OCR or curated manual input
- keep the same post-extraction parsing contract where possible
## Current CLI
The tool uses `CommandLineParser` and currently exposes these verbs:
### `reset criticals`
Deletes importer-managed critical data from SQLite.
Use this when:
- you want to clear imported critical data
- you want to rerun a fresh import
- you need to verify the rebuild path from an empty critical-table state
Example:
```powershell
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- reset criticals
```
### `extract <table>`
Resolves a table from the manifest and writes the extraction artifact to disk.
Example:
```powershell
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- extract slash
```
### `load <table>`
Reads the extraction artifact, parses it, writes debug artifacts, validates the result, and loads SQLite if validation succeeds.
Example:
```powershell
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- load slash
```
### `import <table>`
Runs extraction followed by load.
Example:
```powershell
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- import slash
```
## Manifest
The importer manifest is stored at:
- `sources/critical-import-manifest.json`
Each entry declares:
- `slug`
- `displayName`
- `family`
- `extractionMethod`
- `pdfPath`
- `enabled`
The manifest is intentionally the control point for enabling importer support one table at a time.
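A manifest entry might look like the following sketch. The field names come from the list above; the wrapper shape and the example values (the `extractionMethod` string, the PDF path) are assumptions for illustration:

```json
{
  "tables": [
    {
      "slug": "slash",
      "displayName": "Slash",
      "family": "standard",
      "extractionMethod": "pdftohtml-xml",
      "pdfPath": "sources/pdfs/Slash.pdf",
      "enabled": true
    }
  ]
}
```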
## Artifact Layout
Artifacts are written under:
- `artifacts/import/critical/<slug>/`
The current artifact set is:
### `source.xml`
The raw XML extraction output from `pdftohtml`.
Use this when:
- checking whether text is present in the PDF
- inspecting original `top` and `left` coordinates
- diagnosing row/column misassignment
### `fragments.json`
A normalized list of parsed text fragments with page and position metadata.
Use this when:
- comparing raw XML to the importer's internal fragment model
- confirming that specific fragments were loaded correctly
- debugging Unicode or whitespace normalization issues
### `parsed-cells.json`
The reconstructed cells after geometry-based row/column assignment.
Use this when:
- validating a specific row and column
- checking whether a fragment was assigned to the correct cell
- confirming description and affix splitting
### `validation-report.json`
The validation result for the parsed table.
This includes:
- overall validity
- validation errors
- row count
- cell count
Use this when:
- a `load` command fails
- a parser change introduces ambiguity
- you need to confirm that the importer refused to write SQLite data
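As an illustration only, a failing report might look roughly like this; every property name here is hypothetical, since only the four categories above are documented:

```json
{
  "isValid": false,
  "errors": [
    "Row 71-75: column D produced no content"
  ],
  "rowCount": 21,
  "cellCount": 104
}
```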
## Standard Table Parsing Strategy
The current `standard` parser is designed for tables shaped like `Slash.pdf`:
- columns: `A-E`
- rows: roll bands such as `01-05`, `71-75`, `100`
- cell contents: prose, symbolic affixes, and sometimes conditional branch lines
### Header Detection
The parser searches the XML fragments for a row containing exactly:
- `A`
- `B`
- `C`
- `D`
- `E`
Those positions define the standard-table column anchors.
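Header detection can be sketched as below. This is a language-agnostic illustration (the real parser is `StandardCriticalTableParser.cs` in C#); fragments are represented as plain dicts, and the line-grouping tolerance is an assumed parameter:

```python
def find_header_anchors(fragments, tolerance=3):
    """Group fragments into visual lines by similar 'top', then look for
    a line containing exactly the labels A..E; return {label: center_x}."""
    labels = {"A", "B", "C", "D", "E"}
    lines = {}
    for f in fragments:
        # fragments within `tolerance` vertical units share a line bucket
        lines.setdefault(round(f["top"] / tolerance), []).append(f)
    for frags in lines.values():
        found = {f["text"].strip(): f["left"] + f["width"] / 2
                 for f in frags if f["text"].strip() in labels}
        if set(found) == labels:
            return found  # horizontal center of each column label
    return None
```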
### Row Detection
The parser searches the left margin below the header for roll-band labels, for example:
- `01-05`
- `66`
- `251+`
Those vertical positions define the row anchors.
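Roll-band label detection can be sketched with a pattern over the label shapes shown above (`01-05`, `66`, `251+`). The margin cutoff and the exact regex are illustrative assumptions, not the C# implementation:

```python
import re

# matches "66", "01-05", "100", "251+" but not prose fragments
ROLL_BAND = re.compile(r"^\d{2,3}(?:-\d{2,3}|\+)?$")

def find_roll_band_anchors(fragments, header_top, left_margin_max):
    """Collect roll-band labels below the header on the left margin,
    returning (label, top) pairs in vertical order."""
    anchors = [(f["text"].strip(), f["top"])
               for f in fragments
               if f["top"] > header_top
               and f["left"] < left_margin_max
               and ROLL_BAND.match(f["text"].strip())]
    return sorted(anchors, key=lambda a: a[1])
```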
### Row Bands
The parser derives each row's vertical range from the midpoint between adjacent roll-band anchors.
That prevents one row from drifting into the next when text wraps over multiple visual lines.
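The midpoint rule can be sketched as follows; this is an illustrative Python version of the idea, with the handling of the first band's upper edge and the final band's lower edge chosen by assumption:

```python
def build_row_bands(roll_anchors, page_bottom):
    """Turn sorted (label, top) anchors into (label, y_min, y_max) bands,
    cutting each boundary at the midpoint between adjacent anchors so
    wrapped text stays with its own row."""
    bands = []
    for i, (label, top) in enumerate(roll_anchors):
        y_min = bands[-1][2] if bands else top  # previous cut, or own top
        if i + 1 < len(roll_anchors):
            y_max = (top + roll_anchors[i + 1][1]) / 2
        else:
            y_max = page_bottom  # last band extends to the page bottom
        bands.append((label, y_min, y_max))
    return bands
```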
### Column Assignment
Each text fragment is assigned to the nearest column band based on horizontal center position.
This is the core reliability improvement over the phase-1 text slicing approach.
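Nearest-center assignment reduces to a one-liner in sketch form; `column_centers` is assumed to be the `{label: center_x}` map produced by header detection:

```python
def assign_column(fragment, column_centers):
    """Assign a fragment to the severity column whose header center is
    closest to the fragment's own horizontal center."""
    center = fragment["left"] + fragment["width"] / 2
    return min(column_centers,
               key=lambda label: abs(column_centers[label] - center))
```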
### Line Reconstruction
Fragments inside a cell are grouped into lines by close `top` values and then ordered by `left`.
This produces a stable line list even when PDF text is broken into multiple fragments.
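A sketch of the grouping step, with the `top` tolerance again an assumed parameter rather than the C# parser's actual value:

```python
def group_into_lines(cell_fragments, tolerance=3):
    """Group a cell's fragments into visual lines by close 'top' values,
    then order fragments within each line by 'left'."""
    lines = []
    for f in sorted(cell_fragments, key=lambda f: (f["top"], f["left"])):
        if lines and abs(f["top"] - lines[-1][0]["top"]) <= tolerance:
            lines[-1].append(f)  # same visual line as the previous fragment
        else:
            lines.append([f])    # start a new visual line
    return [" ".join(f["text"] for f in sorted(line, key=lambda f: f["left"]))
            for line in lines]
```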
### Description vs Affix Splitting
The parser classifies lines as:
- description-like prose
- affix-like notation
Affix-like lines include:
- `+...`
- symbolic lines using the critical glyphs
- branch-like affix lines such as `with leg greaves: +2H - ...`
The current implementation stores:
- `RawCellText`
- `DescriptionText`
- `RawAffixText`
It does not yet normalize branches or effects into separate tables.
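A heuristic classifier along these lines is enough for the split described above. The glyph set here is a placeholder (the real critical glyphs are not reproduced in this document), and the branch-prefix pattern is an assumption based on the `with ...:` / `w/o ...:` examples:

```python
import re

GLYPHS = set("\u2020\u2021\u2022")  # placeholder; the real glyph set differs

def is_affix_line(line: str) -> bool:
    """Heuristic: affix-like lines start with '+', use the symbolic
    glyphs, or look like 'condition: effect' branch lines."""
    stripped = line.strip()
    if stripped.startswith("+"):
        return True
    if any(ch in GLYPHS for ch in stripped):
        return True
    return bool(re.match(r"^(with|w/o)\b[^:]*:", stripped, re.IGNORECASE))
```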
## Validation Rules
The current validation pass is intentionally strict.
At minimum, a valid `standard` table must satisfy:
- a detectable `A-E` header row exists
- roll-band labels are found
- each detected row produces content for all five columns
- total parsed cell count matches `row_count * 5`
If validation fails:
- artifacts are still written
- SQLite load is aborted
- the command returns an error
This design is deliberate. It is safer to reject ambiguous extraction than to load a nearly-correct but wrong lookup table.
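The four checks above can be sketched as a pure function over the parsed structures; here `cells` is assumed to be keyed by `(row_label, column)`, which is an illustrative choice, not the importer's actual model:

```python
def validate_standard_table(header_found, row_labels, cells):
    """Apply the minimum structural checks for a 'standard' table and
    return (is_valid, errors) without touching the database."""
    errors = []
    if not header_found:
        errors.append("No A-E header row detected")
    if not row_labels:
        errors.append("No roll-band labels detected")
    for label in row_labels:
        columns = {col for (row, col) in cells if row == label}
        missing = {"A", "B", "C", "D", "E"} - columns
        if missing:
            errors.append(f"Row {label}: missing columns {sorted(missing)}")
    if len(cells) != len(row_labels) * 5:
        errors.append(f"Cell count {len(cells)} != {len(row_labels)} rows * 5")
    return (not errors, errors)
```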
## Database Load Behavior
The loader is transactional.
The current load path:
1. ensures the SQLite database exists
2. deletes the existing subtree for the targeted critical table
3. inserts:
- `critical_table`
- `critical_column`
- `critical_roll_band`
- `critical_result`
4. commits only after the full table is saved
This means importer iterations can target one table without resetting unrelated database content.
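The actual loader goes through EF Core in C#; the transactional delete-then-insert pattern it follows can be illustrated with raw SQLite. The column names and the cascade assumption below are hypothetical:

```python
import sqlite3

def load_table(db_path, slug, parsed):
    """Replace one critical table's subtree inside a single transaction:
    either the whole table lands, or nothing changes."""
    con = sqlite3.connect(db_path, isolation_level=None)  # manual transactions
    try:
        con.execute("BEGIN")
        # Delete only the targeted table's subtree.
        con.execute("DELETE FROM critical_table WHERE slug = ?", (slug,))
        # (child rows assumed cleared via ON DELETE CASCADE or explicit deletes)
        cur = con.execute(
            "INSERT INTO critical_table (slug, display_name) VALUES (?, ?)",
            (slug, parsed["displayName"]))
        table_id = cur.lastrowid
        for label in parsed["bands"]:
            con.execute(
                "INSERT INTO critical_roll_band (table_id, label) VALUES (?, ?)",
                (table_id, label))
        con.execute("COMMIT")  # nothing is visible until every insert succeeded
    except Exception:
        con.execute("ROLLBACK")
        raise
    finally:
        con.close()
```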
## Interaction With Web App Startup
The web application no longer auto-seeds critical starter data on startup.
Startup still ensures the database exists and seeds attack starter data, but critical-table population is now owned by the importer.
This separation is important because:
- importer iterations are frequent
- parser logic is still evolving
- startup should not silently repopulate critical data behind the tool's back
## Current Code Map
Important files in the current implementation:
- `src/RolemasterDb.ImportTool/Program.cs`
- CLI entry point
- `src/RolemasterDb.ImportTool/CriticalImportCommandRunner.cs`
- command orchestration
- `src/RolemasterDb.ImportTool/CriticalImportLoader.cs`
- transactional SQLite load/reset behavior
- `src/RolemasterDb.ImportTool/CriticalImportManifestLoader.cs`
- manifest loading
- `src/RolemasterDb.ImportTool/PdfXmlExtractor.cs`
- XML extraction via `pdftohtml`
- `src/RolemasterDb.ImportTool/ImportArtifactWriter.cs`
- artifact output
- `src/RolemasterDb.ImportTool/Parsing/StandardCriticalTableParser.cs`
- standard table geometry parser
- `src/RolemasterDb.ImportTool/Parsing/XmlTextFragment.cs`
- positioned text fragment model
- `src/RolemasterDb.ImportTool/Parsing/ParsedCriticalCellArtifact.cs`
- debug cell artifact model
- `src/RolemasterDb.ImportTool/Parsing/ImportValidationReport.cs`
- validation output model
## Adding a New Table
The recommended process for onboarding a new table is:
1. Add a manifest entry.
2. Run `extract <slug>`.
3. Inspect `source.xml`.
4. Run `load <slug>`.
5. Inspect `validation-report.json` and `parsed-cells.json`.
6. If validation succeeds, spot-check SQLite output.
7. If validation fails, adjust the parser or add a family-specific parser strategy before retrying.
## Debugging Guidance
If a table imports incorrectly, inspect artifacts in this order:
1. `validation-report.json`
2. `parsed-cells.json`
3. `fragments.json`
4. `source.xml`
That order usually answers the key questions fastest:
- did validation fail
- which row/column is wrong
- were fragments assigned incorrectly
- or was the extraction itself already malformed
## Reliability Position
The current importer should be understood as:
- reliable enough for geometry-based `standard` table iteration
- much safer than the old flattened-text approach
- still evolving toward broader family coverage and deeper normalization
The key design rule going forward is:
- do not silently load ambiguous data
The importer should always prefer:
- preserving source fidelity
- writing review artifacts
- failing validation
over:
- guessing
- auto-correcting without evidence
- loading nearly-correct but structurally wrong critical results