Critical Import Tool
Purpose
The critical import tool exists to migrate Rolemaster critical-table source PDFs into the SQLite database used by the web app.
The tool is intentionally separate from the web application startup path. Critical data needs to be re-imported repeatedly while the extraction and parsing logic evolves, so the import workflow must be:
- explicit
- repeatable
- debuggable
- able to rebuild importer-managed data without resetting the entire application
The tool currently lives in src/RolemasterDb.ImportTool and operates against the same SQLite schema used by the web app.
Goals
The importer is designed around the following requirements:
- reset and reload critical data without touching unrelated tables
- preserve source fidelity while still producing structured lookup data
- make parsing failures visible before bad data reaches SQLite
- keep intermediate artifacts on disk for inspection
- support iterative parser development one table at a time
Current Scope
The current implementation supports:
- explicit CLI commands for reset, extraction, and import
- manifest-driven source selection
- standard critical tables with columns A-E
- variant_column critical tables with non-severity columns
- grouped_variant critical tables with a group axis plus variant columns
- XML-based extraction using pdftohtml -xml
- XML-aligned page rendering and per-cell PNG crops using pdftoppm -png -r 432
- geometry-based parsing across the currently enabled table set:
  - arcane-aether
  - arcane-nether
  - ballistic-shrapnel
  - brawling
  - cold
  - electricity
  - grapple
  - heat
  - impact
  - krush
  - large_creature_magic
  - large_creature_weapon
  - ma-strikes
  - ma-sweeps
  - mana
  - puncture
  - slash
  - subdual
  - super_large_creature_weapon
  - tiny
  - unbalance
- row-boundary repair for trailing affix leakage
- split row-label reconstruction for tables that render labels such as 99-/100 as two fragments
- conditional branch extraction into critical_branch
- footer/page-number filtering during body parsing
- transactional loading into SQLite
- importer-managed source provenance for each parsed result:
  - source page number
  - source crop bounds
  - deterministic crop-image path
- non-destructive merge loading that preserves curated rows
- conditional branch display through the web critical lookup
The current implementation does not yet support:
- OCR/image-based PDFs such as Void.pdf
- automatic confidence scoring beyond validation errors
High-Level Architecture
The importer workflow is:
- Resolve a table entry from the manifest.
- Extract the source PDF into an artifact format.
- Parse the extracted artifact into an in-memory table model.
- Write debug artifacts to disk.
- Render page and cell reference PNGs.
- Validate the parsed result.
- If validation succeeds, merge the parsed data into SQLite in a transaction.
The importer uses the same EF Core context and domain model as the web app, but it owns the critical-data population flow.
Implementation Phases
Phase 1: Initial Importer and Text Extraction
Phase 1 established the first end-to-end workflow:
- a dedicated console project
- CommandLineParser-based verbs
- a table manifest
- transactional reset/load commands
- a first parser for Slash.pdf
Phase 1 command surface
Phase 1 introduced these verbs:
- reset criticals
- extract <table>
- load <table>
- import <table>
Phase 1 extraction approach
The initial version used pdftotext -layout to create a flattened text artifact. The parser then tried to reconstruct:
- column boundaries from the A-E header line
- roll-band rows from labels such as 71-75
- cell contents by slicing monospaced text blocks
Phase 1 outcome
Phase 1 proved that the import loop and database load path worked, but it also exposed a critical reliability problem: flattened text was not a safe source format for these PDFs.
Phase 1 failure mode
The first serious regression was seen in Slash.pdf:
- lookup target: slash, severity A, roll 72
- expected band: 71-75
- broken result from the text-based parser: content from 76-80 mixed with stray characters from severity B
That failure showed the core problem with pdftotext -layout: it discards the original page geometry and forces the importer to guess row and column structure from a lossy text layout.
Because of that, phase 1 is important historically, but it is not the recommended foundation for further parser development.
Phase 2: XML Geometry-Based Parsing
Phase 2 replaced the flattened-text pipeline with a geometry-aware pipeline based on pdftohtml -xml.
Why Phase 2 was necessary
The PDFs are still text-based, but the text needs to be parsed with positional information intact. The XML output produced by pdftohtml preserves:
- page number
- top
- left
- width
- height
- text content
That positional data makes it possible to assign fragments to rows and columns based on geometry instead of guessing from flattened text lines.
Phase 2 extraction format
The importer now extracts to XML instead of plain text:
- extraction tool: pdftohtml -xml -i -noframes
- artifact file: source.xml
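As a rough illustration of what the geometry-aware artifact offers, the sketch below reads positioned fragments out of a pdftohtml-style XML string. It is written in Python for brevity and is not the importer's C# code; the Fragment type, the load_fragments name, and the sample document are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """One positioned text fragment (hypothetical model for this sketch)."""
    page: int
    top: int
    left: int
    width: int
    height: int
    text: str


def load_fragments(xml_text: str) -> list:
    """Collect every non-empty <text> element together with its page number."""
    root = ET.fromstring(xml_text)
    fragments = []
    for page in root.iter("page"):
        page_number = int(page.get("number"))
        for node in page.iter("text"):
            # itertext() flattens inline markup such as <b> or <i>.
            text = "".join(node.itertext()).strip()
            if not text:
                continue
            fragments.append(Fragment(
                page=page_number,
                top=int(node.get("top")),
                left=int(node.get("left")),
                width=int(node.get("width")),
                height=int(node.get("height")),
                text=text,
            ))
    return fragments


# Hand-written sample in the shape of pdftohtml -xml output (illustrative only).
SAMPLE = """<pdf2xml>
  <page number="1" top="0" left="0" height="1188" width="918">
    <text top="100" left="120" width="10" height="12" font="0"><b>A</b></text>
    <text top="140" left="40" width="30" height="12" font="0">71-75</text>
  </page>
</pdf2xml>"""

fragments = load_fragments(SAMPLE)
```

Every later parsing stage can then work from top, left, width, and height instead of guessing structure from whitespace.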
Phase 2 parser model
The parser now works in these stages:
- Load all <text> fragments from the XML.
- Detect the standard A-E header row.
- Detect roll-band labels on the left margin.
- Build row bands from the vertical positions of those roll labels.
- Build column boundaries from the horizontal centers of the A-E header fragments.
- Assign each text fragment to a row by top.
- Assign each text fragment to a column by horizontal position.
- Reconstruct each cell from ordered fragments.
- Split cell content into description lines and affix-like lines.
- Validate the result before touching SQLite.
Phase 2 reliability improvement
This phase fixed the original Slash / A / 72 corruption. The same lookup now resolves to:
- band 71-75
- description Blow falls on lower leg. Slash tendons. Poor sucker.
The important change is not only that the current output is correct, but that the importer now fails fast on structural ambiguity instead of silently loading corrupted rows.
Phase 2.1: Boundary Hardening After Manual Validation
After phase 2, a manual validation pass compared:
- the rendered Slash.pdf
- the extracted source.xml
- the imported SQLite rows
That review found a remaining defect around the 51-55 / 56-60 boundary:
- 51-55 lost several affix lines
- 56-60 gained leading affix lines from the previous row
The root cause was the original row segmentation rule:
- rows were assigned strictly by the midpoint between adjacent roll-label top values
That rule was too naive for rows whose affix block sits visually near the next row label.
Phase 2.1 fix
The parser was hardened in two ways:
- Leading affix leakage repair
  - after the initial row assignment, if a cell in the next row starts with affix-like lines and then continues with prose, those leading affix lines are moved back to the previous row
- Better affix classification
  - generic digit-starting lines are no longer assumed to be affixes
  - this prevents prose such as 25% chance your weapon is stuck... from being misclassified
Phase 2.1 validation rules
The importer now explicitly rejects cells that still look structurally wrong after repair:
- prose and affix segments may not alternate more than once inside a cell
This keeps the phase-2.1 safety goal in place while allowing broader standard-table layouts that render a single affix block either before or after the prose block.
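The alternation rule can be pictured as counting how often a cell's lines switch between prose and affix. The sketch below is a Python illustration under that reading, not the importer's code; it assumes each line has already been classified as "prose" or "affix", and both function names are hypothetical.

```python
def count_transitions(kinds: list) -> int:
    """Count how often adjacent lines switch between 'prose' and 'affix'."""
    return sum(1 for a, b in zip(kinds, kinds[1:]) if a != b)


def is_structurally_valid(kinds: list) -> bool:
    """Accept at most one switch: a single affix block before or after the prose."""
    return count_transitions(kinds) <= 1
```

Under this reading, prose followed by affixes passes, affixes followed by prose passes, and prose / affix / prose is rejected because it switches twice.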
Phase 3: Broader Table Coverage
Phase 3 expands the manifest and validates the shared standard parser across a broader set of A-E tables.
The currently enabled phase-3 table set is:
- arcane-aether
- arcane-nether
- ballistic-shrapnel
- brawling
- cold
- electricity
- grapple
- heat
- impact
- krush
- ma-strikes
- ma-sweeps
- mana
- puncture
- slash
- subdual
- tiny
- unbalance
Current phase-3 notes:
- header detection now tolerates minor top misalignment across the A-E header glyphs
- first-row body parsing can now begin slightly above the first roll-band label when the PDF places prose between the header row and the label, which prevents clipped 01-05 cells in tables such as Mana.pdf
- row boundaries can snap to the last affix-to-prose transition between adjacent roll labels when midpoint slicing would leak into the next row
- affix symbols are learned from the footer legend before body parsing, so symbol-only affix fragments are classified correctly
- cross-column text fragments can now be split at geometry-aligned whitespace boundaries before column assignment, while affix fragments still split on hard internal spacing
- footer page numbers are filtered out before body parsing
- validation allows a single contiguous affix block either before or after prose
Phase 4: Variant and Grouped Tables
Phase 4 extended the importer beyond A-E tables.
The currently enabled phase-4 table set is:
- large_creature_weapon
  - family: variant_column
  - columns: NORMAL, MAGIC, MITHRIL, HOLY_ARMS, SLAYING
- super_large_creature_weapon
  - family: variant_column
  - columns: NORMAL, MAGIC, MITHRIL, HOLY_ARMS, SLAYING
- large_creature_magic
  - family: grouped_variant
  - groups: large, super_large
  - columns: NORMAL, SLAYING
Phase-4 notes:
- grouped results now populate critical_group during SQLite load
- parser dispatch is family-based instead of standard-table only
- left-margin row labels can be reconstructed from split fragments such as 151-/175
- the grouped magic PDF is imported once as large_creature_magic
  - sources/Large Creature - Magic.pdf and sources/Super Large Creature - Magic.pdf are duplicate files
Phase 5: Conditional Branch Extraction
Phase 5 is complete.
Phase-5 notes:
- branch-heavy cells are split into base result content plus ordered critical_branch rows
- branch parsing is shared across standard, variant_column, and grouped_variant table families
- branch conditions are preserved as display text and normalized into condition keys such as with_leg_greaves
- branch payloads can contain prose, affix notation, or both
- the importer now upgrades older SQLite files to add the CriticalBranches table before load
- the web critical lookup now returns and renders conditional branches alongside the base result
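To make the display-text versus condition-key distinction concrete, here is a small Python sketch. The exact normalization rules used by the importer are not documented here, so the lowercase-and-underscore rule and the split on the first colon are assumptions, as are the function names.

```python
import re


def condition_key(display_text: str) -> str:
    """Lowercase the text and collapse non-alphanumeric runs into underscores."""
    lowered = display_text.strip().lower()
    return re.sub(r"[^a-z0-9]+", "_", lowered).strip("_")


def split_branch(line: str) -> tuple:
    """Split a branch line into (condition display text, payload) at the first colon."""
    condition, _, payload = line.partition(":")
    return condition.strip(), payload.strip()


condition, payload = split_branch("with leg greaves: +2H")
key = condition_key(condition)
```

The display text stays available for rendering, while the key gives the database a stable value to match on.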
Phase 6: Effect Normalization
Phase 6 is complete for symbol-driven affixes.
Phase-6 notes:
- footer legends are parsed into table-specific affix metadata before effect normalization
- symbolic affix lines are normalized into critical_effect rows for both base results and conditional branches
- the normalized pass currently covers direct hits, must-parry, no-parry, stun, bleed, foe penalties, attacker bonuses, and Mana power-point modifiers
- result and branch parsed_json payloads now store the normalized symbol effects
- the web critical lookup now returns and renders parsed affix effects alongside the raw affix text
- prose-derived effects remain future work
Phase 7: OCR Bootstrap for Curation
- support image-based PDFs such as Void.pdf
- bootstrap scanned tables through OCR while keeping the existing curation flow as the fallback
- keep the same downstream parsing and load contract where practical
Validation summary:
- Void.pdf is image-only; text extraction does not produce usable content
- OCR on rendered page images does recover the title, A-E header row, all 19 expected roll bands, the Key: footer, and most prose
- OCR remains weakest on symbol-heavy affix notation and occasional glyph confusion such as C -> Cc
- because of that, phase 7 should be implemented as OCR bootstrap for curation, not as a separate manual-transcription feature
Final implementation plan:
- Add void to the manifest as family: standard, extractionMethod: ocr, and extend src/RolemasterDb.ImportTool/CriticalImportManifestEntry.cs with an optional AxisTemplateSlug. For Void, that value should point at a built-in standard template derived from mana.
- Introduce a canonical extracted-source model, for example ExtractedCriticalSource, containing page geometries, positioned text fragments, extractor metadata, and coordinate/render profile metadata. Refactor src/RolemasterDb.ImportTool/CriticalImportCommandRunner.cs so extraction dispatches by ExtractionMethod instead of always calling pdftohtml.
- Move the current XML path behind a dedicated XML extractor implementation rather than letting the command runner own XML extraction directly. Existing XML-backed tables should remain behaviorally unchanged.
- Implement an OCR extractor for scanned PDFs like Void.pdf. It should render page PNGs with Poppler, run OCR with Tesseract TSV output, parse the TSV into canonical fragments, and persist the raw OCR diagnostics as artifacts.
- Add explicit external tool discovery/configuration for the Poppler and Tesseract executables instead of assuming bare command names on PATH are always safe. The OCR path depends on deterministic rasterization and OCR invocation.
- Add a built-in standard-table axis template based on Mana rather than trying to rediscover Void structure from noisy OCR. The template should hard-code columns A-E and these 19 roll bands: 01-05, 06-10, 11-15, 16-20, 21-35, 36-45, 46-50, 51-55, 56-60, 61-65, 66, 67-70, 71-75, 76-80, 81-85, 86-90, 91-95, 96-99, 100.
- Build a StandardOcrBootstrapper for template-driven standard tables. OCR should be used only to find anchors such as the A-E header row, the left-column roll labels, and the footer/key boundary, then interpolate the full 95-cell grid from the template and assign OCR fragments into those cells.
- Refactor parsing so the standard parser can operate on canonical fragments plus a supplied grid, not just raw XML text. The OCR path should reuse the existing cell-to-result parsing, branch splitting, affix parsing, validation, and load behavior after grid assignment.
- Fix the coordinate-space seam explicitly. The current image pipeline assumes XML-space coordinates with a fixed render scale; OCR fragments come from rendered-page pixel coordinates. Phase 7 should carry extractor-provided coordinate metadata so source bounds, page geometry, and crop artifacts remain correct for both XML and OCR tables.
- Keep validation strict on structure and permissive on OCR text quality. The import should fail only if Void cannot be turned into a complete, croppable 95-cell standard table with valid source bounds. OCR misreads inside cells should become warnings, while the raw OCR text is still loaded into SQLite with IsCurated = false so the existing curation UI can refine it.
- Extend artifacts in src/RolemasterDb.ImportTool/ImportArtifactPaths.cs and src/RolemasterDb.ImportTool/ImportArtifactWriter.cs so OCR imports persist both the raw OCR payload and the normalized fragments. Keep fragments.json as the canonical debug view and add an OCR-specific artifact such as source.ocr.tsv.
- Add tests that do not depend on live OCR. The current manifest test in src/RolemasterDb.ImportTool.Tests/StandardCriticalTableParserIntegrationTests.cs assumes every enabled table is XML; that will need to become extraction-method-aware. Add checked-in OCR fixtures for Void, then cover anchor detection, template interpolation, 95-cell assignment, representative Void cells, and a full load path.
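The template idea in the plan can be sketched as follows. This is an illustrative Python sketch, not a design for StandardOcrBootstrapper: the function names are hypothetical, and the linear interpolation assumes evenly spaced rows, which real tables are unlikely to satisfy exactly. It shows how 19 roll bands and 5 columns yield the 95-cell grid, and how row anchors missed by OCR could be filled in from the ones that were found.

```python
COLUMNS = ["A", "B", "C", "D", "E"]
ROLL_BANDS = [
    "01-05", "06-10", "11-15", "16-20", "21-35", "36-45", "46-50",
    "51-55", "56-60", "61-65", "66", "67-70", "71-75", "76-80",
    "81-85", "86-90", "91-95", "96-99", "100",
]


def template_cells() -> list:
    """Enumerate every (roll band, column) pair the template expects."""
    return [(band, column) for band in ROLL_BANDS for column in COLUMNS]


def interpolate_row_tops(found: dict, count: int) -> list:
    """Fill missing row tops linearly from the nearest detected anchors.

    `found` maps a row index to the top coordinate OCR detected for it.
    Assumes evenly spaced rows between anchors (a simplification).
    """
    known = sorted(found.items())
    if len(known) < 2:
        raise ValueError("need at least two detected anchors")
    tops = []
    for index in range(count):
        if index in found:
            tops.append(float(found[index]))
            continue
        lower = [k for k in known if k[0] < index]
        upper = [k for k in known if k[0] > index]
        if lower and upper:
            (i0, t0), (i1, t1) = lower[-1], upper[0]
        elif lower:
            (i0, t0), (i1, t1) = lower[-2], lower[-1]  # extrapolate downward
        else:
            (i0, t0), (i1, t1) = upper[0], upper[1]    # extrapolate upward
        tops.append(t0 + (t1 - t0) * (index - i0) / (i1 - i0))
    return tops
```

A production template would more plausibly store relative row heights taken from Mana than assume even spacing.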
Current CLI
The tool uses CommandLineParser and currently exposes these verbs:
reset criticals
Deletes importer-managed critical data from SQLite.
Use this when:
- you want to clear imported critical data
- you want to rerun a fresh import
- you need to verify the rebuild path from an empty critical-table state
Example:
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- reset criticals
extract <table>
Resolves a table from the manifest and writes the extraction artifact to disk.
Example:
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- extract slash
load <table>
Reads the extraction artifact, parses it, writes debug artifacts, validates the result, and loads SQLite if validation succeeds.
Example:
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- load slash
import <table>
Runs extraction followed by load.
Example:
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- import slash
reimport-images <table>
Reuses source.xml, regenerates page PNGs and cell PNGs, rewrites the JSON artifacts, and refreshes only source-image metadata in SQLite.
Use this when:
- crop resolution or render settings changed
- you want better source images without reloading result text
- you want to keep curated and uncurated content untouched while refreshing artifacts
Example:
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- reimport-images slash
Manifest
The importer manifest is stored at:
sources/critical-import-manifest.json
Each entry declares:
- slug
- displayName
- family
- extractionMethod
- pdfPath
- enabled
The manifest is intentionally the control point for enabling importer support one table at a time.
For the currently enabled entries:
- standard tables use family: standard
- creature weapon tables use family: variant_column
- grouped creature magic uses family: grouped_variant
- all enabled entries currently use extractionMethod: xml
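For orientation, the sketch below shows what resolving a table from such a manifest could look like. It is a Python illustration only: the field names come from the list above, but the top-level JSON shape (a plain array), the sample values, the disabled void entry, and the rule that disabled entries are rejected are all assumptions.

```python
import json

# Hypothetical manifest content; only the field names are taken from this document.
SAMPLE_MANIFEST = """
[
  {"slug": "slash", "displayName": "Slash", "family": "standard",
   "extractionMethod": "xml", "pdfPath": "sources/Slash.pdf", "enabled": true},
  {"slug": "void", "displayName": "Void", "family": "standard",
   "extractionMethod": "ocr", "pdfPath": "sources/Void.pdf", "enabled": false}
]
"""


def resolve_entry(manifest_json: str, slug: str) -> dict:
    """Return the enabled manifest entry for a slug, or raise if it is unusable."""
    for entry in json.loads(manifest_json):
        if entry["slug"] == slug:
            if not entry["enabled"]:
                raise ValueError(f"table '{slug}' is present but not enabled")
            return entry
    raise KeyError(slug)


entry = resolve_entry(SAMPLE_MANIFEST, "slash")
```

Keeping the enabled flag in the manifest means a table can be listed before its parser support is ready.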
Artifact Layout
Artifacts are written under:
artifacts/import/critical/<slug>/
The current artifact set is:
source.xml
The raw XML extraction output from pdftohtml.
Use this when:
- checking whether text is present in the PDF
- inspecting original top and left coordinates
- diagnosing row/column misassignment
fragments.json
A normalized list of parsed text fragments with page and position metadata.
Use this when:
- comparing raw XML to the importer’s internal fragment model
- confirming that specific fragments were loaded correctly
- debugging Unicode or whitespace normalization issues
parsed-cells.json
The reconstructed cells after geometry-based row/column assignment.
Use this when:
- validating a specific row and column
- checking whether a fragment was assigned to the correct cell
- confirming description and affix splitting
- confirming page and crop provenance for a specific result
Each parsed cell now includes:
- sourceBounds
  - XML-aligned page number and bounding rectangle for the final repaired cell content
- sourceImagePath
  - importer-managed relative PNG path when image generation succeeded
- sourceImageCrop
  - the final crop rectangle written to disk
pages/page-001.png
Rendered PDF page images at 432 DPI, using a central render scale factor of 4 over the XML coordinate space emitted by pdftohtml -xml.
Use this when:
- visually checking page-level alignment
- comparing XML coordinates against the rendered source page
- confirming crop placement without re-running the importer
cells/<group>__<column>__<roll-band>.png
One deterministic PNG crop per parsed critical result.
Use this when:
- curating a result in the web editor
- verifying the importer matched the intended source cell
- debugging crop padding or page-boundary issues
validation-report.json
The validation result for the parsed table.
This includes:
- overall validity
- validation errors
- validation warnings
- row count
- cell count
Use this when:
- a load command fails
- a parser change introduces ambiguity
- you need to confirm that the importer refused to write SQLite data
Standard Table Parsing Strategy
The current standard parser is designed for tables shaped like Slash.pdf:
- columns: A-E
- rows: roll bands such as 01-05, 71-75, 100
- cell contents: prose, symbolic affixes, and sometimes conditional branch lines
Header Detection
The parser searches the XML fragments for a row containing exactly:
- A
- B
- C
- D
- E
Those positions define the standard-table column anchors.
Row Detection
The parser searches the left margin below the header for roll-band labels, for example:
- 01-05
- 66
- 251+
Those vertical positions define the row anchors.
Row Bands
The parser derives each row’s vertical range from the midpoint between adjacent roll-band anchors.
That prevents one row from drifting into the next when text wraps over multiple visual lines.
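A minimal Python sketch of midpoint banding, assuming the anchors arrive as (label, top) pairs and that the body's upper and lower limits are known. The function name and the body_top and body_bottom parameters are hypothetical.

```python
def build_row_bands(anchors: list, body_top: float, body_bottom: float) -> list:
    """Turn (label, top) anchors into (label, start, end) vertical bands.

    Each boundary is the midpoint between adjacent anchor tops; the first and
    last bands extend to the supplied body limits.
    """
    ordered = sorted(anchors, key=lambda anchor: anchor[1])
    bands = []
    for index, (label, top) in enumerate(ordered):
        if index == 0:
            start = body_top
        else:
            start = (ordered[index - 1][1] + top) / 2
        if index == len(ordered) - 1:
            end = body_bottom
        else:
            end = (top + ordered[index + 1][1]) / 2
        bands.append((label, start, end))
    return bands


bands = build_row_bands(
    [("01-05", 100), ("06-10", 140), ("11-15", 200)],
    body_top=90,
    body_bottom=260,
)
```

Because each band ends exactly where the next begins, every fragment between the body limits falls into exactly one row.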
Column Assignment
Each text fragment is assigned to the nearest column band based on horizontal center position.
This is the core reliability improvement over the phase-1 text slicing approach.
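In code, nearest-column assignment reduces to comparing a fragment's horizontal center with each column anchor. The Python sketch below assumes the anchors are available as a mapping from column key to center x; the function name and sample coordinates are illustrative.

```python
def assign_column(left: float, width: float, column_centers: dict) -> str:
    """Return the column whose center is closest to the fragment's center."""
    center = left + width / 2
    return min(column_centers, key=lambda key: abs(column_centers[key] - center))


# Hypothetical header anchor positions for a standard A-E table.
CENTERS = {"A": 150.0, "B": 300.0, "C": 450.0, "D": 600.0, "E": 750.0}
column = assign_column(left=280, width=60, column_centers=CENTERS)
```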
Line Reconstruction
Fragments inside a cell are grouped into lines by close top values and then ordered by left.
This produces a stable line list even when PDF text is broken into multiple fragments.
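One way to picture this grouping is the Python sketch below. It assumes fragments are (top, left, text) tuples and uses a tolerance of 3 units for "close" top values; that threshold and the function name are assumptions, not values taken from the importer.

```python
def reconstruct_lines(fragments: list, tolerance: int = 3) -> list:
    """Group (top, left, text) fragments into lines and join them left to right."""
    lines = []  # each entry: (top of the line's first fragment, [(left, text), ...])
    for top, left, text in sorted(fragments):
        if lines and abs(top - lines[-1][0]) <= tolerance:
            lines[-1][1].append((left, text))
        else:
            lines.append((top, [(left, text)]))
    return [" ".join(text for _, text in sorted(parts)) for _, parts in lines]


lines = reconstruct_lines([
    (100, 50, "lower leg."),
    (99, 10, "Blow falls on"),
    (112, 10, "Slash tendons."),
])
```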
Boundary Repair
After the initial midpoint-based row assignment, the parser performs a repair step across adjacent rows in the same column.
If the next row begins with affix-like lines and then continues with prose, those leading affix lines are treated as leaked trailing affixes from the previous row and moved back.
This repair exists because some tables place affix lines close enough to the next row label that midpoint-only segmentation is not reliable.
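The repair step can be sketched in Python as below, assuming each cell is a list of (kind, text) pairs where kind is "prose" or "affix". The function name is hypothetical and the real parser may apply additional conditions.

```python
def repair_leading_affixes(previous: list, current: list) -> tuple:
    """Move affix lines that lead the current cell back to the previous cell.

    The move happens only when the leading affix lines are followed by prose;
    a cell made entirely of affix lines is left untouched.
    """
    lead = 0
    while lead < len(current) and current[lead][0] == "affix":
        lead += 1
    if lead == 0 or lead == len(current):
        return previous, current
    return previous + current[:lead], current[lead:]


prev_cell = [("prose", "Strike to thigh.")]
next_cell = [("affix", "+5H"), ("prose", "Blow to calf.")]
fixed_prev, fixed_next = repair_leading_affixes(prev_cell, next_cell)
```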
Description vs Affix Splitting
The parser classifies lines as:
- description-like prose
- affix-like notation
Affix-like lines include:
- +...
- symbolic lines using the critical glyphs
- branch-like affix lines such as with leg greaves: +2H - ...
Affix-like classification is intentionally conservative. Numeric prose lines such as 25% chance... are not treated as affixes unless they match a known affix-like notation pattern.
The current implementation stores:
- base RawCellText
- base DescriptionText
- base RawAffixText
- normalized base affix effects in critical_effect
- parsed conditional branches with condition text, branch prose, branch affix text, and normalized branch affix effects
- parsed conditional branches in debug artifacts and persisted SQLite rows
Validation Rules
The current validation pass is intentionally strict.
At minimum, a valid standard table must satisfy:
- a detectable A-E header row exists
- roll-band labels are found
- each detected row produces content for all five columns
- total parsed cell count matches row_count * 5
- prose and affix segments do not alternate more than once inside a cell, so a cell holds at most one contiguous affix block, either before or after the prose
If validation fails:
- artifacts are still written
- SQLite load is aborted
- the command returns an error
If validation succeeds with warnings:
- artifacts still record the warnings
- SQLite load continues
- the CLI prints each warning before reporting the successful load
This design is deliberate. It is safer to reject ambiguous extraction than to load a nearly-correct but wrong lookup table.
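The gating behavior can be illustrated with a small Python sketch. It covers only the column-completeness and cell-count checks; the ValidationReport shape mirrors the fields listed for validation-report.json, but the class, the function name, and the error strings are assumptions.

```python
from dataclasses import dataclass, field

COLUMNS = ("A", "B", "C", "D", "E")


@dataclass
class ValidationReport:
    errors: list = field(default_factory=list)
    warnings: list = field(default_factory=list)
    row_count: int = 0
    cell_count: int = 0

    @property
    def is_valid(self) -> bool:
        # Warnings never block a load; any error does.
        return not self.errors


def validate_table(rows: dict) -> ValidationReport:
    """Check that every detected row has content in all five columns."""
    report = ValidationReport(row_count=len(rows))
    if not rows:
        report.errors.append("no roll-band labels found")
    for label, cells in rows.items():
        for column in COLUMNS:
            if cells.get(column, "").strip():
                report.cell_count += 1
            else:
                report.errors.append(f"row {label}: column {column} is empty")
    if report.cell_count != report.row_count * len(COLUMNS):
        report.errors.append("cell count does not match row_count * 5")
    return report
```

A caller would write the report to disk first and only then decide, from is_valid, whether to open the SQLite transaction.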
Database Load Behavior
The loader is transactional.
The current load path:
- ensures the SQLite database exists
- upgrades older SQLite files to the current importer-owned critical schema where needed
- reconciles the targeted table, axes, and existing results by logical identity
- inserts newly discovered rows
- updates uncurated rows in place
- preserves curated rows and their edited child rows
- refreshes importer-managed source provenance and crop-image metadata
- deletes unmatched rows only when they are still uncurated
- commits only after the full merge is saved
Result identity is keyed by:
- table slug
- optional group key
- column key
- roll-band label
This means importer iterations can target one table without resetting unrelated database content, while still protecting manually curated rows from later parser changes.
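The merge rules above can be modeled with plain dictionaries keyed by the identity tuple. The Python sketch below is a simplified illustration, not the EF Core loader: the row fields (text, source_image_path, is_curated) and the function name are assumptions, and child rows and transactions are left out.

```python
def merge_results(existing: dict, parsed: dict) -> dict:
    """Merge parsed rows into existing rows without overwriting curated content.

    Keys are (table_slug, group_key or None, column_key, roll_band_label).
    """
    merged = {}
    for key, row in existing.items():
        incoming = parsed.get(key)
        if row["is_curated"]:
            kept = dict(row)  # curated text is preserved as-is
            if incoming is not None:
                # Only importer-managed provenance is refreshed.
                kept["source_image_path"] = incoming["source_image_path"]
            merged[key] = kept
        elif incoming is not None:
            merged[key] = {**incoming, "is_curated": False}  # update in place
        # Unmatched uncurated rows are dropped by not copying them over.
    for key, incoming in parsed.items():
        if key not in existing:
            merged[key] = {**incoming, "is_curated": False}  # newly discovered
    return merged
```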
Image Toolchain
The importer now uses two Poppler tools:
- pdftohtml -xml -i -noframes
  - extracts geometry-aware XML text
- pdftoppm -png -r 432
  - renders page PNGs and per-cell crop PNGs
The importer keeps a central render scale factor of 4. The XML still defines bounds in its original coordinate space, but rendered PNGs and stored crop metadata now use the scaled coordinate space and a 432 DPI render setting. In practice:
- XML coordinates are multiplied by 4 before crop extraction
- page and crop metadata stored with each result reflect the scaled PNG coordinate space
- crop alignment remains deterministic without changing the parsing pipeline
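As a worked example of the scaling, the Python sketch below converts an XML-space rectangle into the PNG coordinate space. The factor of 4 and the 432 DPI setting come from this document (432 / 4 = 108, so the XML space behaves like a 108 DPI render); the function name and sample rectangle are illustrative.

```python
RENDER_SCALE = 4   # central render scale factor described above
RENDER_DPI = 432   # pdftoppm -r value described above


def scale_bounds(left: int, top: int, width: int, height: int,
                 scale: int = RENDER_SCALE) -> tuple:
    """Convert an XML-space rectangle to the scaled PNG coordinate space."""
    return (left * scale, top * scale, width * scale, height * scale)


crop = scale_bounds(left=120, top=100, width=30, height=12)
```

Keeping the factor in one constant means a change in render resolution only requires rerunning reimport-images rather than touching the parser.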
Interaction With Web App Startup
The web application no longer auto-seeds critical starter data on startup.
Startup still ensures the database exists and seeds attack starter data, but critical-table population is now owned by the importer.
This separation is important because:
- importer iterations are frequent
- parser logic is still evolving
- startup should not silently repopulate critical data behind the tool’s back
Current Code Map
Important files in the current implementation:
- src/RolemasterDb.ImportTool/Program.cs
  - CLI entry point
- src/RolemasterDb.ImportTool/CriticalImportCommandRunner.cs
  - command orchestration
- src/RolemasterDb.ImportTool/CriticalImportLoader.cs
  - transactional SQLite load/reset behavior
- src/RolemasterDb.ImportTool/Parsing/CriticalCellTextParser.cs
  - shared base-vs-branch parsing for cell content and affix extraction
- src/RolemasterDb.ImportTool/Parsing/AffixEffectParser.cs
  - footer-legend-aware symbol effect normalization
- src/RolemasterDb.ImportTool/Parsing/AffixLegend.cs
  - parsed footer legend model used for affix classification and effect mapping
- src/RolemasterDb.ImportTool/CriticalImportManifestLoader.cs
  - manifest loading
- src/RolemasterDb.ImportTool/PdfXmlExtractor.cs
  - XML extraction via pdftohtml
- src/RolemasterDb.ImportTool/ImportArtifactWriter.cs
  - artifact output
- src/RolemasterDb.ImportTool/Parsing/StandardCriticalTableParser.cs
  - standard table geometry parser
- src/RolemasterDb.ImportTool/Parsing/XmlTextFragment.cs
  - positioned text fragment model
- src/RolemasterDb.ImportTool/Parsing/ParsedCriticalCellArtifact.cs
  - debug cell artifact model
- src/RolemasterDb.ImportTool/Parsing/ParsedCriticalBranch.cs
  - parsed branch artifact model with normalized effects
- src/RolemasterDb.ImportTool/Parsing/ParsedCriticalEffect.cs
  - parsed effect artifact model
- src/RolemasterDb.ImportTool/Parsing/ImportValidationReport.cs
  - validation output model
- src/RolemasterDb.App/Data/RolemasterDbSchemaUpgrader.cs
  - SQLite upgrade hook for branch/effect-table rollout
- src/RolemasterDb.App/Components/Shared/CriticalLookupResultCard.razor
  - web rendering of base results, conditional branches, and parsed affix effects
Adding a New Table
The recommended process for onboarding a new table is:
- Add a manifest entry.
- Run extract <slug>.
- Inspect source.xml.
- Run load <slug>.
- Inspect validation-report.json and parsed-cells.json.
- If validation succeeds, spot-check SQLite output.
- If validation fails, adjust the parser or add a family-specific parser strategy before retrying.
Debugging Guidance
If a table imports incorrectly, inspect artifacts in this order:
- validation-report.json
- parsed-cells.json
- fragments.json
- source.xml
That order usually answers the key questions fastest:
- did validation fail
- which row/column is wrong
- were fragments assigned incorrectly
- or was the extraction itself already malformed
Reliability Position
The current importer should be understood as:
- reliable enough for geometry-based standard table iteration
- much safer than the old flattened-text approach
- still evolving toward broader family coverage and deeper normalization
The key design rule going forward is:
- do not silently load ambiguous data
The importer should always prefer:
- preserving source fidelity
- writing review artifacts
- failing validation
over:
- guessing
- auto-correcting without evidence
- loading nearly-correct but structurally wrong critical results