# Critical Import Tool

## Purpose

The critical import tool exists to migrate Rolemaster critical-table source PDFs into the SQLite database used by the web app.

The tool is intentionally separate from the web application startup path. Critical data needs to be re-imported repeatedly while the extraction and parsing logic evolves, so the import workflow must be:

- explicit
- repeatable
- debuggable
- able to rebuild importer-managed data without resetting the entire application

The tool currently lives in `src/RolemasterDb.ImportTool` and operates against the same SQLite schema used by the web app.

## Goals

The importer is designed around the following requirements:

- reset and reload critical data without touching unrelated tables
- preserve source fidelity while still producing structured lookup data
- make parsing failures visible before bad data reaches SQLite
- keep intermediate artifacts on disk for inspection
- support iterative parser development one table at a time

## Current Scope

The current implementation supports:

- explicit CLI commands for reset, extraction, and import
- manifest-driven source selection
- `standard` critical tables with columns `A-E`
- XML-based extraction using `pdftohtml -xml`
- geometry-based parsing across the currently enabled phase-3 tables:
  - `arcane-aether`
  - `arcane-nether`
  - `ballistic-shrapnel`
  - `brawling`
  - `cold`
  - `electricity`
  - `grapple`
  - `heat`
  - `impact`
  - `krush`
  - `ma-strikes`
  - `ma-sweeps`
  - `puncture`
  - `slash`
  - `subdual`
  - `tiny`
  - `unbalance`
- row-boundary repair for trailing affix leakage
- footer/page-number filtering during body parsing
- transactional loading into SQLite

The current implementation does not yet support:

- variant-column critical tables
- grouped variant tables
- `Mana.pdf`, whose current XML layout and affix notation still need a dedicated parser pass
- OCR/image-based PDFs such as `Void.pdf`
- normalized `critical_branch` population
- normalized `critical_effect` population
- automatic confidence scoring beyond validation errors

## High-Level Architecture

The importer workflow is:

1. Resolve a table entry from the manifest.
2. Extract the source PDF into an artifact format.
3. Parse the extracted artifact into an in-memory table model.
4. Write debug artifacts to disk.
5. Validate the parsed result.
6. If validation succeeds, load the parsed data into SQLite in a transaction.

The importer uses the same EF Core context and domain model as the web app, but it owns the critical-data population flow.

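The control flow of the six workflow steps can be sketched as follows. This is an illustration in Python, not the tool's C# code; `run_import` and every callable passed to it are hypothetical stand-ins for the real components.

```python
def run_import(slug, resolve, extract, parse, write_artifacts, validate, load):
    """Run the six workflow steps in order.

    Every callable is a hypothetical stand-in; `validate` is assumed to
    return a list of error strings (empty when the table is valid).
    """
    entry = resolve(slug)            # 1. manifest lookup
    artifact = extract(entry)        # 2. PDF -> extraction artifact
    table = parse(artifact)          # 3. artifact -> in-memory table model
    write_artifacts(entry, table)    # 4. debug artifacts, written even if validation fails
    errors = validate(table)         # 5. validation gates the load
    if errors:
        return False
    load(table)                      # 6. transactional SQLite load
    return True
```

The point of the ordering is that debug artifacts exist on disk before the validation gate, so a rejected import can still be inspected.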
## Implementation Phases

## Phase 1: Initial Importer and Text Extraction

Phase 1 established the first end-to-end workflow:

- a dedicated console project
- `CommandLineParser`-based verbs
- a table manifest
- transactional reset/load commands
- a first parser for `Slash.pdf`

### Phase 1 command surface

Phase 1 introduced these verbs:

- `reset criticals`
- `extract <table>`
- `load <table>`
- `import <table>`

### Phase 1 extraction approach

The initial version used `pdftotext -layout` to create a flattened text artifact. The parser then tried to reconstruct:

- column boundaries from the `A-E` header line
- roll-band rows from labels such as `71-75`
- cell contents by slicing monospaced text blocks

### Phase 1 outcome

Phase 1 proved that the import loop and database load path worked, but it also exposed a critical reliability problem: flattened text was not a safe source format for these PDFs.

### Phase 1 failure mode

The first serious regression was seen in `Slash.pdf`:

- lookup target: `slash`, severity `A`, roll `72`
- expected band: `71-75`
- broken result from the text-based parser: content from `76-80` mixed with stray characters from severity `B`

That failure showed the core problem with `pdftotext -layout`: it discards the original page geometry and forces the importer to guess row and column structure from a lossy text layout.

Because of that, phase 1 is important historically, but it is not the recommended foundation for further parser development.

## Phase 2: XML Geometry-Based Parsing

Phase 2 replaced the flattened-text pipeline with a geometry-aware pipeline based on `pdftohtml -xml`.

### Why Phase 2 was necessary

The PDFs are still text-based, but the text needs to be parsed with positional information intact. The XML output produced by `pdftohtml` preserves:

- page number
- `top`
- `left`
- `width`
- `height`
- text content

That positional data makes it possible to assign fragments to rows and columns based on geometry instead of guessing from flattened text lines.

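As a rough illustration of what that positional data looks like once loaded, the sketch below reads `<text>` elements into a small fragment record. It is written in Python for brevity (the importer itself is a .NET project), and the `Fragment` type, the `load_fragments` helper, and the coordinates in `SAMPLE` are all made up for the example.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """Hypothetical positioned-text record; not the importer's actual model."""
    page: int
    top: int
    left: int
    width: int
    height: int
    text: str


def load_fragments(xml_text: str) -> list[Fragment]:
    """Collect every non-empty <text> element together with its page number."""
    root = ET.fromstring(xml_text)
    fragments = []
    for page in root.iter("page"):
        page_number = int(page.get("number"))
        for node in page.iter("text"):
            # itertext() flattens inline markup such as <b> or <i>.
            text = "".join(node.itertext()).strip()
            if not text:
                continue
            fragments.append(Fragment(
                page=page_number,
                top=int(node.get("top")),
                left=int(node.get("left")),
                width=int(node.get("width")),
                height=int(node.get("height")),
                text=text,
            ))
    return fragments


# Illustrative input only: the coordinates are invented.
SAMPLE = """<pdf2xml>
  <page number="1" position="absolute" top="0" left="0" height="1188" width="918">
    <text top="120" left="95" width="40" height="12" font="0">71-75</text>
    <text top="120" left="160" width="180" height="12" font="0"><b>Blow falls</b> on lower leg.</text>
  </page>
</pdf2xml>"""

fragments = load_fragments(SAMPLE)
```

Every later parsing stage can then work purely on `top`, `left`, `width`, and `height` without touching the XML again.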
### Phase 2 extraction format

The importer now extracts to XML instead of plain text:

- extraction tool: `pdftohtml -xml -i -noframes`
- artifact file: `source.xml`

### Phase 2 parser model

The parser now works in these stages:

1. Load all `<text>` fragments from the XML.
2. Detect the standard `A-E` header row.
3. Detect roll-band labels on the left margin.
4. Build row bands from the vertical positions of those roll labels.
5. Build column boundaries from the horizontal centers of the `A-E` header fragments.
6. Assign each text fragment to a row by `top`.
7. Assign each text fragment to a column by horizontal position.
8. Reconstruct each cell from ordered fragments.
9. Split cell content into description lines and affix-like lines.
10. Validate the result before touching SQLite.

### Phase 2 reliability improvement

This phase fixed the original `Slash / A / 72` corruption. The same lookup now resolves to:

- band `71-75`
- description `Blow falls on lower leg. Slash tendons. Poor sucker.`

The important change is not only that the current output is correct, but that the importer now fails fast on structural ambiguity instead of silently loading corrupted rows.

## Phase 2.1: Boundary Hardening After Manual Validation

After phase 2, a manual validation pass compared:

- the rendered `Slash.pdf`
- the extracted `source.xml`
- the imported SQLite rows

That review found a remaining defect around the `51-55` / `56-60` boundary:

- `51-55` lost several affix lines
- `56-60` gained leading affix lines from the previous row

The root cause was the original row segmentation rule:

- rows were assigned strictly by the midpoint between adjacent roll-label `top` values

That rule was too naive for rows whose affix block sits visually near the next row label.

### Phase 2.1 fix

The parser was hardened in two ways:

1. Leading affix leakage repair
   - after the initial row assignment, if a cell in the next row starts with affix-like lines and then continues with prose, those leading affix lines are moved back to the previous row
2. Better affix classification
   - generic digit-starting lines are no longer assumed to be affixes
   - this prevents prose such as `25% chance your weapon is stuck...` from being misclassified

### Phase 2.1 validation rules

The importer now explicitly rejects cells that still look structurally wrong after repair:

- prose and affix segments may not alternate more than once inside a cell

This keeps the phase-2.1 safety goal in place while allowing broader standard-table layouts that render a single affix block either before or after the prose block.

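One way to read the alternation rule is "at most one switch between prose and affix lines per cell". The Python sketch below encodes that reading; the function names and the `"prose"` / `"affix"` labels are hypothetical, and the real check lives in the C# parser.

```python
def count_transitions(kinds: list[str]) -> int:
    """Count how often consecutive lines switch between 'prose' and 'affix'."""
    return sum(1 for prev, cur in zip(kinds, kinds[1:]) if prev != cur)


def is_structurally_valid(kinds: list[str]) -> bool:
    """Accept a cell whose classified lines switch kind at most once.

    Assumption: 'may not alternate more than once' means at most one
    prose<->affix transition, so one affix block before OR after prose passes.
    """
    return count_transitions(kinds) <= 1
```

Under this reading `["prose", "prose", "affix"]` and `["affix", "prose"]` both pass, while `["prose", "affix", "prose"]` is rejected because the affix block is sandwiched between two prose segments.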
## Planned Future Phases

The current architecture is intended to support additional phases:

### Phase 3: Broader Table Coverage

Phase 3 expands the manifest and validates the shared `standard` parser across a broader set of `A-E` tables.

The currently enabled phase-3 table set is:

- `arcane-aether`
- `arcane-nether`
- `ballistic-shrapnel`
- `brawling`
- `cold`
- `electricity`
- `grapple`
- `heat`
- `impact`
- `krush`
- `ma-strikes`
- `ma-sweeps`
- `puncture`
- `slash`
- `subdual`
- `tiny`
- `unbalance`

Current phase-3 notes:

- header detection now tolerates minor `top` misalignment across the `A-E` header glyphs
- footer page numbers are filtered out before body parsing
- validation allows a single contiguous affix block either before or after prose
- `Mana.pdf` is intentionally left out for now because its row-anchor geometry and notation still need dedicated handling

### Phase 4: Variant and Grouped Tables

- support `variant_column` tables such as `Large Creature - Weapon.pdf`
- support `grouped_variant` tables such as `Large Creature - Magic.pdf`
- add parser strategies for additional table families

### Phase 5: Conditional Branch Extraction

- split branch-heavy cells into `critical_branch`
- preserve the base cell text and branch text separately
- support branch conditions such as `with helmet` and `w/o leg greaves`

### Phase 6: Effect Normalization

- parse symbolic affix lines into normalized effects
- populate `critical_effect`
- gradually enrich prose-derived effects over time

### Phase 7: OCR and Manual Fallback

- support image-based PDFs such as `Void.pdf`
- route image-based sources through OCR or curated manual input
- keep the same post-extraction parsing contract where possible

## Current CLI

The tool uses `CommandLineParser` and currently exposes these verbs:

### `reset criticals`

Deletes importer-managed critical data from SQLite.

Use this when:

- you want to clear imported critical data
- you want to rerun a fresh import
- you need to verify the rebuild path from an empty critical-table state

Example:

```powershell
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- reset criticals
```

### `extract <table>`

Resolves a table from the manifest and writes the extraction artifact to disk.

Example:

```powershell
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- extract slash
```

### `load <table>`

Reads the extraction artifact, parses it, writes debug artifacts, validates the result, and loads SQLite if validation succeeds.

Example:

```powershell
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- load slash
```

### `import <table>`

Runs extraction followed by load.

Example:

```powershell
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- import slash
```

## Manifest

The importer manifest is stored at:

- `sources/critical-import-manifest.json`

Each entry declares:

- `slug`
- `displayName`
- `family`
- `extractionMethod`
- `pdfPath`
- `enabled`

The manifest is intentionally the control point for enabling importer support one table at a time.

For the currently enabled phase-3 entries:

- `family` is `standard`
- `extractionMethod` is `xml`

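To make the entry shape concrete, here is a small Python sketch that parses a manifest-like document and resolves one slug. Only the six field names come from this document; the top-level `tables` key, the `displayName` and `pdfPath` values, the disabled `mana` entry, and the `resolve_entry` helper are assumptions made for illustration.

```python
import json

# Illustrative manifest content; values and the "tables" wrapper are assumed.
MANIFEST_JSON = """
{
  "tables": [
    {"slug": "slash", "displayName": "Slash", "family": "standard",
     "extractionMethod": "xml", "pdfPath": "sources/Slash.pdf", "enabled": true},
    {"slug": "mana", "displayName": "Mana", "family": "standard",
     "extractionMethod": "xml", "pdfPath": "sources/Mana.pdf", "enabled": false}
  ]
}
"""


def resolve_entry(manifest: dict, slug: str) -> dict:
    """Return the enabled entry for `slug`, or raise if it is missing or disabled."""
    for entry in manifest["tables"]:
        if entry["slug"] == slug:
            if not entry["enabled"]:
                raise ValueError(f"table '{slug}' is disabled in the manifest")
            return entry
    raise KeyError(f"no manifest entry for '{slug}'")


manifest = json.loads(MANIFEST_JSON)
slash_entry = resolve_entry(manifest, "slash")
```

Refusing disabled entries at resolution time is one way to keep the manifest as the single switch for which tables the importer will touch.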
## Artifact Layout

Artifacts are written under:

- `artifacts/import/critical/<slug>/`

The current artifact set is:

### `source.xml`

The raw XML extraction output from `pdftohtml`.

Use this when:

- checking whether text is present in the PDF
- inspecting original `top` and `left` coordinates
- diagnosing row/column misassignment

### `fragments.json`

A normalized list of parsed text fragments with page and position metadata.

Use this when:

- comparing raw XML to the importer’s internal fragment model
- confirming that specific fragments were loaded correctly
- debugging Unicode or whitespace normalization issues

### `parsed-cells.json`

The reconstructed cells after geometry-based row/column assignment.

Use this when:

- validating a specific row and column
- checking whether a fragment was assigned to the correct cell
- confirming description and affix splitting

### `validation-report.json`

The validation result for the parsed table.

This includes:

- overall validity
- validation errors
- row count
- cell count

Use this when:

- a `load` command fails
- a parser change introduces ambiguity
- you need to confirm that the importer refused to write SQLite data

## Standard Table Parsing Strategy

The current `standard` parser is designed for tables shaped like `Slash.pdf`:

- columns: `A-E`
- rows: roll bands such as `01-05`, `71-75`, `100`
- cell contents: prose, symbolic affixes, and sometimes conditional branch lines

### Header Detection

The parser searches the XML fragments for a row containing exactly:

- `A`
- `B`
- `C`
- `D`
- `E`

Those positions define the standard-table column anchors.

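A minimal Python sketch of this search, assuming fragments are plain dictionaries with `text`, `top`, `left`, and `width` keys. The `find_header` name and the 3-unit `top` tolerance are illustrative choices, not values taken from the importer.

```python
HEADER_LETTERS = ["A", "B", "C", "D", "E"]


def find_header(fragments, tolerance=3):
    """Return {letter: horizontal center} for the first row holding A-E exactly once.

    Fragments whose `top` differs by at most `tolerance` are treated as one row.
    Returns None when no such row exists.
    """
    letters = [f for f in fragments if f["text"] in HEADER_LETTERS]
    for anchor in letters:
        row = [f for f in letters if abs(f["top"] - anchor["top"]) <= tolerance]
        if sorted(f["text"] for f in row) == HEADER_LETTERS:
            return {f["text"]: f["left"] + f["width"] / 2 for f in row}
    return None
```

Requiring all five letters on one row keeps a stray single-letter fragment elsewhere on the page from being mistaken for a column anchor.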
### Row Detection

The parser searches the left margin below the header for roll-band labels, for example:

- `01-05`
- `66`
- `251+`

Those vertical positions define the row anchors.

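The three label shapes above (range, single value, open-ended) can be recognized with one regular expression. The Python sketch below is an assumption about how such labels might be parsed; `ROLL_BAND` and `parse_roll_band` are hypothetical names.

```python
import re

# Matches "01-05", "66", and "251+"; anything else is rejected.
ROLL_BAND = re.compile(r"^(\d{1,3})(?:-(\d{1,3})|(\+))?$")


def parse_roll_band(label: str):
    """Return (low, high) for a roll-band label, with high=None for 'N+'.

    Returns None when the text is not a roll-band label.
    """
    match = ROLL_BAND.match(label.strip())
    if match is None:
        return None
    low = int(match.group(1))
    if match.group(3):              # open-ended band such as "251+"
        return (low, None)
    high = int(match.group(2)) if match.group(2) else low
    if high < low:                  # reversed ranges are not valid bands
        return None
    return (low, high)
```

A bare number inside a prose fragment would also match this pattern, which is why the search is restricted to the left margin rather than applied to every fragment.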
### Row Bands

The parser derives each row’s vertical range from the midpoints between adjacent roll-band anchors.

That prevents one row from drifting into the next when text wraps over multiple visual lines.

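The midpoint rule amounts to a few lines of arithmetic. In this Python sketch, `build_row_bands` is a hypothetical helper, and the `table_top` / `table_bottom` limits for the first and last rows are assumptions, since the document does not say how the outer edges are chosen.

```python
def build_row_bands(anchors, table_top, table_bottom):
    """Turn (label, top) anchors into (label, start, end) vertical ranges.

    Interior boundaries sit halfway between adjacent anchors; the outer
    boundaries come from the caller-supplied table limits.
    """
    ordered = sorted(anchors, key=lambda anchor: anchor[1])
    bands = []
    for index, (label, top) in enumerate(ordered):
        if index == 0:
            start = table_top
        else:
            start = (ordered[index - 1][1] + top) / 2
        if index == len(ordered) - 1:
            end = table_bottom
        else:
            end = (top + ordered[index + 1][1]) / 2
        bands.append((label, start, end))
    return bands
```

For anchors at `top` 100, 140, and 200 the interior boundaries fall at 120 and 170, so a fragment at `top` 125 belongs to the second row even though it sits closer to the page position of the first label than to the third.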
### Column Assignment

Each text fragment is assigned to the nearest column band based on horizontal center position.

This is the core reliability improvement over the phase-1 text slicing approach.

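A nearest-center assignment can be sketched in Python as below. `assign_column` is a hypothetical name, and the sketch uses plain distance to each header center, which is equivalent to placing column boundaries halfway between neighbouring centers; the real parser may draw its bands differently.

```python
def assign_column(fragment_left, fragment_width, column_centers):
    """Pick the column whose header center is closest to the fragment's center.

    `column_centers` maps a column letter to the horizontal center of its header.
    """
    center = fragment_left + fragment_width / 2
    return min(column_centers, key=lambda column: abs(column_centers[column] - center))
```

With centers at 150, 300, 450, 600, and 750, a fragment at `left=240` with `width=100` has center 290 and lands in column `B`.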
### Line Reconstruction

Fragments inside a cell are grouped into lines by close `top` values and then ordered by `left`.

This produces a stable line list even when PDF text is broken into multiple fragments.

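The grouping step might look like the following Python sketch. `reconstruct_lines`, the `(top, left, text)` tuple layout, the 2-unit tolerance, and the single-space join are assumptions for illustration.

```python
def reconstruct_lines(fragments, tolerance=2):
    """Group (top, left, text) fragments into text lines.

    Fragments whose `top` is within `tolerance` of the line's first fragment
    share a line; within a line, fragments are ordered by `left`.
    """
    lines = []  # each item: [line_top, [(left, text), ...]]
    for top, left, text in sorted(fragments):
        if lines and abs(top - lines[-1][0]) <= tolerance:
            lines[-1][1].append((left, text))
        else:
            lines.append([top, [(left, text)]])
    return [" ".join(text for _, text in sorted(parts)) for _, parts in lines]
```

Sorting by `left` inside each line matters because fragments on one visual line do not always share an identical `top`, so extraction order alone cannot be trusted.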
### Boundary Repair

After the initial midpoint-based row assignment, the parser performs a repair step across adjacent rows in the same column.

If the next row begins with affix-like lines and then continues with prose, those leading affix lines are treated as leaked trailing affixes from the previous row and moved back.

This repair exists because some tables place affix lines close enough to the next row label that midpoint-only segmentation is not reliable.

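The repair described above can be sketched in Python as a function over two adjacent cells in one column. `repair_boundary` is a hypothetical helper, the affix test is passed in as a callable, and the sample lines in the usage note are invented.

```python
def repair_boundary(prev_lines, next_lines, is_affix):
    """Move leading affix-like lines of the next cell back to the previous cell.

    The move only happens when those leading lines are followed by prose;
    a cell that starts with prose, or contains nothing but affix lines,
    is returned unchanged.
    """
    lead = 0
    while lead < len(next_lines) and is_affix(next_lines[lead]):
        lead += 1
    if lead == 0 or lead == len(next_lines):
        return prev_lines, next_lines
    return prev_lines + next_lines[:lead], next_lines[lead:]
```

For example, with `is_affix = lambda line: line.startswith("+")`, the cells `["Strike to arm."]` and `["+3H", "+2H", "Blow to leg."]` become `["Strike to arm.", "+3H", "+2H"]` and `["Blow to leg."]`.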
### Description vs Affix Splitting

The parser classifies lines as:

- description-like prose
- affix-like notation

Affix-like lines include:

- `+...`
- symbolic lines using the critical glyphs
- branch-like affix lines such as `with leg greaves: +2H - ...`

Affix-like classification is intentionally conservative. Numeric prose lines such as `25% chance...` are not treated as affixes unless they match a known affix-like notation pattern.

The current implementation stores:

- `RawCellText`
- `DescriptionText`
- `RawAffixText`

It does not yet normalize branches or effects into separate tables.

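A conservative classifier along these lines could be sketched in Python as follows. The two patterns cover only the `+...` and branch-like shapes named above; the critical glyph set is not listed in this document, so it is deliberately left out. `is_affix_like`, `split_cell`, and the way the three stored fields are joined are all assumptions.

```python
import re

# Hypothetical patterns; symbolic glyph lines are omitted because the
# glyph set is not specified here.
AFFIX_PATTERNS = [
    re.compile(r"^\+"),                        # "+..." lines
    re.compile(r"^(with|w/o)\b[^:]*:\s*\+"),   # e.g. "with leg greaves: +2H - ..."
]


def is_affix_like(line: str) -> bool:
    """True only when the line matches a known affix-like notation pattern."""
    stripped = line.strip()
    return any(pattern.search(stripped) for pattern in AFFIX_PATTERNS)


def split_cell(lines):
    """Split reconstructed cell lines into the three stored text fields."""
    description = [line for line in lines if not is_affix_like(line)]
    affixes = [line for line in lines if is_affix_like(line)]
    return {
        "RawCellText": "\n".join(lines),
        "DescriptionText": " ".join(description),
        "RawAffixText": "\n".join(affixes),
    }
```

Because the default answer is "prose", a digit-leading line such as `25% chance your weapon is stuck...` stays in the description unless a pattern is added that explicitly claims it.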
## Validation Rules

The current validation pass is intentionally strict.

At minimum, a valid `standard` table must satisfy:

- a detectable `A-E` header row exists
- roll-band labels are found
- each detected row produces content for all five columns
- total parsed cell count matches `row_count * 5`
- prose and affix segments alternate at most once inside a cell, so a cell holds a single contiguous affix block either before or after the prose

If validation fails:

- artifacts are still written
- SQLite load is aborted
- the command returns an error

This design is deliberate. It is safer to reject ambiguous extraction than to load a nearly correct but wrong lookup table.

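The row and cell-count checks can be sketched in Python as below. The report carries the four items listed for `validation-report.json` (validity, errors, row count, cell count), but the key names, the error wording, and the `validate_table` helper are assumptions; header and alternation checks are left out to keep the sketch short.

```python
COLUMNS = ("A", "B", "C", "D", "E")


def validate_table(row_labels, cells):
    """Check row/column completeness for a parsed standard table.

    `cells` maps (row_label, column_letter) to the cell text.
    """
    errors = []
    if not row_labels:
        errors.append("no roll-band labels found")
    for label in row_labels:
        for column in COLUMNS:
            if not cells.get((label, column), "").strip():
                errors.append(f"row {label} column {column} has no content")
    expected = len(row_labels) * len(COLUMNS)
    if len(cells) != expected:
        errors.append(f"expected {expected} cells, found {len(cells)}")
    return {
        "is_valid": not errors,
        "errors": errors,
        "row_count": len(row_labels),
        "cell_count": len(cells),
    }
```

Collecting every error instead of stopping at the first one makes the written report more useful when a parser change breaks several rows at once.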
## Database Load Behavior

The loader is transactional.

The current load path:

1. ensures the SQLite database exists
2. deletes the existing subtree for the targeted critical table
3. inserts:
   - `critical_table`
   - `critical_column`
   - `critical_roll_band`
   - `critical_result`
4. commits only after the full table is saved

This means importer iterations can target one table without resetting unrelated database content.

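The delete-then-insert-then-commit pattern can be illustrated with Python's built-in `sqlite3` module. The real loader goes through EF Core; the schema below is a deliberately reduced, hypothetical one (two of the four tables, invented column names) that exists only to show the transactional behavior.

```python
import sqlite3


def create_schema(conn):
    """Create a reduced, hypothetical schema for the demonstration."""
    conn.executescript("""
        CREATE TABLE critical_table (slug TEXT PRIMARY KEY);
        CREATE TABLE critical_result (
            table_slug   TEXT NOT NULL,
            column_label TEXT NOT NULL,
            roll_band    TEXT NOT NULL,
            raw_text     TEXT NOT NULL
        );
    """)


def load_table(conn, slug, results):
    """Replace one table's rows atomically.

    `results` is a list of (column_label, roll_band, raw_text) tuples.
    """
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute("DELETE FROM critical_result WHERE table_slug = ?", (slug,))
        conn.execute("DELETE FROM critical_table WHERE slug = ?", (slug,))
        conn.execute("INSERT INTO critical_table (slug) VALUES (?)", (slug,))
        conn.executemany(
            "INSERT INTO critical_result (table_slug, column_label, roll_band, raw_text) "
            "VALUES (?, ?, ?, ?)",
            [(slug, column, band, text) for column, band, text in results],
        )
```

If any insert fails, the earlier deletes are rolled back as well, so a failed reload leaves the previously imported rows for that table in place, and rows belonging to other slugs are never touched.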
## Interaction With Web App Startup

The web application no longer auto-seeds critical starter data on startup.

Startup still ensures the database exists and seeds attack starter data, but critical-table population is now owned by the importer.

This separation is important because:

- importer iterations are frequent
- parser logic is still evolving
- startup should not silently repopulate critical data behind the tool’s back

## Current Code Map

Important files in the current implementation:

- `src/RolemasterDb.ImportTool/Program.cs`
  - CLI entry point
- `src/RolemasterDb.ImportTool/CriticalImportCommandRunner.cs`
  - command orchestration
- `src/RolemasterDb.ImportTool/CriticalImportLoader.cs`
  - transactional SQLite load/reset behavior
- `src/RolemasterDb.ImportTool/CriticalImportManifestLoader.cs`
  - manifest loading
- `src/RolemasterDb.ImportTool/PdfXmlExtractor.cs`
  - XML extraction via `pdftohtml`
- `src/RolemasterDb.ImportTool/ImportArtifactWriter.cs`
  - artifact output
- `src/RolemasterDb.ImportTool/Parsing/StandardCriticalTableParser.cs`
  - standard table geometry parser
- `src/RolemasterDb.ImportTool/Parsing/XmlTextFragment.cs`
  - positioned text fragment model
- `src/RolemasterDb.ImportTool/Parsing/ParsedCriticalCellArtifact.cs`
  - debug cell artifact model
- `src/RolemasterDb.ImportTool/Parsing/ImportValidationReport.cs`
  - validation output model

## Adding a New Table

The recommended process for onboarding a new table is:

1. Add a manifest entry.
2. Run `extract <slug>`.
3. Inspect `source.xml`.
4. Run `load <slug>`.
5. Inspect `validation-report.json` and `parsed-cells.json`.
6. If validation succeeds, spot-check SQLite output.
7. If validation fails, adjust the parser or add a family-specific parser strategy before retrying.

## Debugging Guidance

If a table imports incorrectly, inspect artifacts in this order:

1. `validation-report.json`
2. `parsed-cells.json`
3. `fragments.json`
4. `source.xml`

That order usually answers the key questions fastest:

- did validation fail
- which row/column is wrong
- were fragments assigned incorrectly
- or was the extraction itself already malformed

## Reliability Position

The current importer should be understood as:

- reliable enough for geometry-based `standard` table iteration
- much safer than the old flattened-text approach
- still evolving toward broader family coverage and deeper normalization

The key design rule going forward is:

- do not silently load ambiguous data

The importer should always prefer:

- preserving source fidelity
- writing review artifacts
- failing validation

over:

- guessing
- auto-correcting without evidence
- loading nearly-correct but structurally wrong critical results