Critical Import Tool

Purpose

The critical import tool exists to migrate Rolemaster critical-table source PDFs into the SQLite database used by the web app.

The tool is intentionally separate from the web application startup path. Critical data needs to be re-imported repeatedly while the extraction and parsing logic evolves, so the import workflow must be:

  • explicit
  • repeatable
  • debuggable
  • able to rebuild importer-managed data without resetting the entire application

The tool currently lives in src/RolemasterDb.ImportTool and operates against the same SQLite schema used by the web app.

Goals

The importer is designed around the following requirements:

  • reset and reload critical data without touching unrelated tables
  • preserve source fidelity while still producing structured lookup data
  • make parsing failures visible before bad data reaches SQLite
  • keep intermediate artifacts on disk for inspection
  • support iterative parser development one table at a time

Current Scope

The current implementation supports:

  • explicit CLI commands for reset, extraction, and import
  • manifest-driven source selection
  • standard critical tables with columns A-E
  • variant_column critical tables with non-severity columns
  • grouped_variant critical tables with a group axis plus variant columns
  • XML-based extraction using pdftohtml -xml
  • XML-aligned page rendering and per-cell PNG crops using pdftoppm -png -r 432
  • geometry-based parsing across the currently enabled table set:
    • arcane-aether
    • arcane-nether
    • ballistic-shrapnel
    • brawling
    • cold
    • electricity
    • grapple
    • heat
    • impact
    • krush
    • large_creature_magic
    • large_creature_weapon
    • ma-strikes
    • ma-sweeps
    • mana
    • puncture
    • slash
    • subdual
    • super_large_creature_weapon
    • tiny
    • unbalance
  • row-boundary repair for trailing affix leakage
  • split row-label reconstruction for tables that render labels such as 99- / 100 as two fragments
  • conditional branch extraction into critical_branch
  • footer/page-number filtering during body parsing
  • transactional loading into SQLite
  • importer-managed source provenance for each parsed result:
    • source page number
    • source crop bounds
    • deterministic crop-image path
  • non-destructive merge loading that preserves curated rows
  • conditional branch display through the web critical lookup

The current implementation does not yet support:

  • OCR/image-based PDFs such as Void.pdf
  • automatic confidence scoring beyond validation errors

High-Level Architecture

The importer workflow is:

  1. Resolve a table entry from the manifest.
  2. Extract the source PDF into an artifact format.
  3. Parse the extracted artifact into an in-memory table model.
  4. Write debug artifacts to disk.
  5. Render page and cell reference PNGs.
  6. Validate the parsed result.
  7. If validation succeeds, merge the parsed data into SQLite in a transaction.

The importer uses the same EF Core context and domain model as the web app, but it owns the critical-data population flow.

Implementation Phases

Phase 1: Initial Importer and Text Extraction

Phase 1 established the first end-to-end workflow:

  • a dedicated console project
  • CommandLineParser based verbs
  • a table manifest
  • transactional reset/load commands
  • a first parser for Slash.pdf

Phase 1 command surface

Phase 1 introduced these verbs:

  • reset criticals
  • extract <table>
  • load <table>
  • import <table>

Phase 1 extraction approach

The initial version used pdftotext -layout to create a flattened text artifact. The parser then tried to reconstruct:

  • column boundaries from the A-E header line
  • roll-band rows from labels such as 71-75
  • cell contents by slicing monospaced text blocks

Phase 1 outcome

Phase 1 proved that the import loop and database load path worked, but it also exposed a critical reliability problem: flattened text was not a safe source format for these PDFs.

Phase 1 failure mode

The first serious regression was seen in Slash.pdf:

  • lookup target: slash, severity A, roll 72
  • expected band: 71-75
  • broken result from the text-based parser: content from 76-80 mixed with stray characters from severity B

That failure showed the core problem with pdftotext -layout: it discards the original page geometry and forces the importer to guess row and column structure from a lossy text layout.

Because of that, phase 1 is important historically, but it is not the recommended foundation for further parser development.

Phase 2: XML Geometry-Based Parsing

Phase 2 replaced the flattened-text pipeline with a geometry-aware pipeline based on pdftohtml -xml.

Why Phase 2 was necessary

The PDFs are still text-based, but the text needs to be parsed with positional information intact. The XML output produced by pdftohtml preserves:

  • page number
  • top
  • left
  • width
  • height
  • text content

That positional data makes it possible to assign fragments to rows and columns based on geometry instead of guessing from flattened text lines.

Phase 2 extraction format

The importer now extracts to XML instead of plain text:

  • extraction tool: pdftohtml -xml -i -noframes
  • artifact file: source.xml
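
For reference, the artifact is a flat stream of positioned text elements. A trimmed, illustrative fragment (coordinates, font ids, and content are made up; the element shape matches pdftohtml -xml output):

<page number="1" width="918" height="1188">
  <text top="154" left="60" width="42" height="16" font="1">71-75</text>
  <text top="150" left="130" width="214" height="16" font="0">Blow falls on lower leg.</text>
</page>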

Phase 2 parser model

The parser now works in these stages:

  1. Load all <text> fragments from the XML.
  2. Detect the standard A-E header row.
  3. Detect roll-band labels on the left margin.
  4. Build row bands from the vertical positions of those roll labels.
  5. Build column boundaries from the horizontal centers of the A-E header fragments.
  6. Assign each text fragment to a row by top.
  7. Assign each text fragment to a column by horizontal position.
  8. Reconstruct each cell from ordered fragments.
  9. Split cell content into description lines and affix-like lines.
  10. Validate the result before touching SQLite.
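
Each <text> element becomes one positioned fragment during stage 1. A minimal sketch of that model in C# (property names are assumptions; the real model lives in Parsing/XmlTextFragment.cs):

// One <text> element from source.xml, kept in the XML's native coordinate space.
public sealed record XmlTextFragment(
    int Page,
    int Top,
    int Left,
    int Width,
    int Height,
    string Text)
{
    // Horizontal center, used for nearest-column assignment in stage 7.
    public double CenterX => Left + Width / 2.0;
}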

Phase 2 reliability improvement

This phase fixed the original Slash / A / 72 corruption. The same lookup now resolves to:

  • band 71-75
  • description Blow falls on lower leg. Slash tendons. Poor sucker.

The important change is not only that the current output is correct, but that the importer now fails fast on structural ambiguity instead of silently loading corrupted rows.

Phase 2.1: Boundary Hardening After Manual Validation

After phase 2, a manual validation pass compared:

  • the rendered Slash.pdf
  • the extracted source.xml
  • the imported SQLite rows

That review found a remaining defect around the 51-55 / 56-60 boundary:

  • 51-55 lost several affix lines
  • 56-60 gained leading affix lines from the previous row

The root cause was the original row segmentation rule:

  • rows were assigned strictly by the midpoint between adjacent roll-label top values

That rule was too naive for rows whose affix block sits visually near the next row label.

Phase 2.1 fix

The parser was hardened in two ways:

  1. Leading affix leakage repair
    • after the initial row assignment, if a cell in the next row starts with affix-like lines and then continues with prose, those leading affix lines are moved back to the previous row
  2. Better affix classification
    • generic digit-starting lines are no longer assumed to be affixes
    • this prevents prose such as 25% chance your weapon is stuck... from being misclassified

Phase 2.1 validation rules

The importer now explicitly rejects cells that still look structurally wrong after repair:

  • prose and affix segments may not alternate more than once inside a cell

This keeps the phase-2.1 safety goal in place while allowing broader standard-table layouts that render a single affix block either before or after the prose block.

Phase 3: Broader Table Coverage

Phase 3 expands the manifest and validates the shared standard parser across a broader set of A-E tables.

The currently enabled phase-3 table set is:

  • arcane-aether
  • arcane-nether
  • ballistic-shrapnel
  • brawling
  • cold
  • electricity
  • grapple
  • heat
  • impact
  • krush
  • ma-strikes
  • ma-sweeps
  • mana
  • puncture
  • slash
  • subdual
  • tiny
  • unbalance

Current phase-3 notes:

  • header detection now tolerates minor top misalignment across the A-E header glyphs
  • first-row body parsing can now begin slightly above the first roll-band label when the PDF places prose between the header row and the label, which prevents clipped 01-05 cells such as those in Mana.pdf
  • row boundaries can snap to the last affix-to-prose transition between adjacent roll labels when midpoint slicing would leak into the next row
  • affix symbols are learned from the footer legend before body parsing, so symbol-only affix fragments are classified correctly
  • cross-column text fragments can now be split at geometry-aligned whitespace boundaries before column assignment, while affix fragments still split on hard internal spacing
  • footer page numbers are filtered out before body parsing
  • validation allows a single contiguous affix block either before or after prose

Phase 4: Variant and Grouped Tables

Phase 4 extended the importer beyond A-E tables.

The currently enabled phase-4 table set is:

  • large_creature_weapon
    • family: variant_column
    • columns: NORMAL, MAGIC, MITHRIL, HOLY_ARMS, SLAYING
  • super_large_creature_weapon
    • family: variant_column
    • columns: NORMAL, MAGIC, MITHRIL, HOLY_ARMS, SLAYING
  • large_creature_magic
    • family: grouped_variant
    • groups: large, super_large
    • columns: NORMAL, SLAYING

Phase-4 notes:

  • grouped results now populate critical_group during SQLite load
  • parser dispatch is family-based instead of standard-table only
  • left-margin row labels can be reconstructed from split fragments such as 151- / 175
  • the grouped magic PDF is imported once as large_creature_magic
    • sources/Large Creature - Magic.pdf and sources/Super Large Creature - Magic.pdf are duplicate files

Phase 5: Conditional Branch Extraction

Phase 5 is complete.

Phase-5 notes:

  • branch-heavy cells are split into base result content plus ordered critical_branch rows
  • branch parsing is shared across standard, variant_column, and grouped_variant table families
  • branch conditions are preserved as display text and normalized into condition keys such as with_leg_greaves
  • branch payloads can contain prose, affix notation, or both
  • the importer now upgrades older SQLite files to add the CriticalBranches table before load
  • the web critical lookup now returns and renders conditional branches alongside the base result
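
The condition-key normalization is essentially a slug transform. A minimal sketch consistent with the with_leg_greaves example above (the importer's actual rules may differ):

using System.Text.RegularExpressions;

// "with leg greaves" -> "with_leg_greaves": lower-case the display text,
// then collapse every run of non-alphanumeric characters to one underscore.
static string ToConditionKey(string displayText) =>
    Regex.Replace(displayText.Trim().ToLowerInvariant(), "[^a-z0-9]+", "_").Trim('_');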

Phase 6: Effect Normalization

Phase 6 is complete for symbol-driven affixes.

Phase-6 notes:

  • footer legends are parsed into table-specific affix metadata before effect normalization
  • symbolic affix lines are normalized into critical_effect rows for both base results and conditional branches
  • the normalized pass currently covers direct hits, must-parry, no-parry, stun, bleed, foe penalties, attacker bonuses, and Mana power-point modifiers
  • result and branch parsed_json payloads now store the normalized symbol effects
  • the web critical lookup now returns and renders parsed affix effects alongside the raw affix text
  • prose-derived effects remain future work

Phase 7: OCR and Manual Fallback

Phase 7 is planned but not yet started. Its scope:

  • support image-based PDFs such as Void.pdf
  • route image-based sources through OCR or curated manual input
  • keep the same post-extraction parsing contract where possible

Current CLI

The tool uses CommandLineParser and currently exposes these verbs:

reset criticals

Deletes importer-managed critical data from SQLite.

Use this when:

  • you want to clear imported critical data
  • you want to rerun a fresh import
  • you need to verify the rebuild path from an empty critical-table state

Example:

dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- reset criticals

extract <table>

Resolves a table from the manifest and writes the extraction artifact to disk.

Example:

dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- extract slash

load <table>

Reads the extraction artifact, parses it, writes debug artifacts, validates the result, and loads SQLite if validation succeeds.

Example:

dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- load slash

import <table>

Runs extraction followed by load.

Example:

dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- import slash

reimport-images <table>

Reuses source.xml, regenerates page PNGs and cell PNGs, rewrites the JSON artifacts, and refreshes only source-image metadata in SQLite.

Use this when:

  • crop resolution or render settings changed
  • you want better source images without reloading result text
  • you want to keep curated and uncurated content untouched while refreshing artifacts

Example:

dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- reimport-images slash

Manifest

The importer manifest is stored at:

  • sources/critical-import-manifest.json

Each entry declares:

  • slug
  • displayName
  • family
  • extractionMethod
  • pdfPath
  • enabled
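
A minimal entry might look like this (values are illustrative, not copied from the real manifest):

{
  "slug": "slash",
  "displayName": "Slash",
  "family": "standard",
  "extractionMethod": "xml",
  "pdfPath": "sources/Slash.pdf",
  "enabled": true
}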

The manifest is intentionally the control point for enabling importer support one table at a time.

For the currently enabled entries:

  • standard tables use family: standard
  • creature weapon tables use family: variant_column
  • grouped creature magic uses family: grouped_variant
  • all enabled entries currently use extractionMethod: xml

Artifact Layout

Artifacts are written under:

  • artifacts/import/critical/<slug>/

The current artifact set is:

source.xml

The raw XML extraction output from pdftohtml.

Use this when:

  • checking whether text is present in the PDF
  • inspecting original top and left coordinates
  • diagnosing row/column misassignment

fragments.json

A normalized list of parsed text fragments with page and position metadata.

Use this when:

  • comparing raw XML to the importer's internal fragment model
  • confirming that specific fragments were loaded correctly
  • debugging Unicode or whitespace normalization issues

parsed-cells.json

The reconstructed cells after geometry-based row/column assignment.

Use this when:

  • validating a specific row and column
  • checking whether a fragment was assigned to the correct cell
  • confirming description and affix splitting
  • confirming page and crop provenance for a specific result

Each parsed cell now includes:

  • sourceBounds
    • XML-aligned page number and bounding rectangle for the final repaired cell content
  • sourceImagePath
    • importer-managed relative PNG path when image generation succeeded
  • sourceImageCrop
    • the final crop rectangle written to disk
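
Illustratively, the provenance portion of one parsed cell has roughly this shape (the field names follow this document; the values and exact nesting are assumptions, with the crop rectangle shown at the 4x render scale described under Image Toolchain):

{
  "sourceBounds": { "page": 1, "top": 612, "left": 248, "width": 180, "height": 96 },
  "sourceImagePath": "cells/<group>__<column>__<roll-band>.png",
  "sourceImageCrop": { "x": 992, "y": 2448, "width": 720, "height": 384 }
}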

pages/page-001.png

Rendered PDF page images at 432 DPI, using a central render scale factor of 4 over the XML coordinate space emitted by pdftohtml -xml.

Use this when:

  • visually checking page-level alignment
  • comparing XML coordinates against the rendered source page
  • confirming crop placement without re-running the importer

cells/<group>__<column>__<roll-band>.png

One deterministic PNG crop per parsed critical result.

Use this when:

  • curating a result in the web editor
  • verifying the importer matched the intended source cell
  • debugging crop padding or page-boundary issues

validation-report.json

The validation result for the parsed table.

This includes:

  • overall validity
  • validation errors
  • validation warnings
  • row count
  • cell count

Use this when:

  • a load command fails
  • a parser change introduces ambiguity
  • you need to confirm that the importer refused to write SQLite data

Standard Table Parsing Strategy

The current standard parser is designed for tables shaped like Slash.pdf:

  • columns: A-E
  • rows: roll bands such as 01-05, 71-75, 100
  • cell contents: prose, symbolic affixes, and sometimes conditional branch lines

Header Detection

The parser searches the XML fragments for a row containing exactly:

  • A
  • B
  • C
  • D
  • E

Those positions define the standard-table column anchors.

Row Detection

The parser searches the left margin below the header for roll-band labels, for example:

  • 01-05
  • 66
  • 251+

Those vertical positions define the row anchors.

Row Bands

The parser derives each row's vertical range from the midpoint between adjacent roll-band anchors.

That prevents one row from drifting into the next when text wraps over multiple visual lines.

Column Assignment

Each text fragment is assigned to the nearest column band based on horizontal center position.

This is the core reliability improvement over the phase-1 text slicing approach.
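
Both geometric steps reduce to a few comparisons. A minimal sketch building on the XmlTextFragment model above (RowBand, headerBottom, tableBottom, and the anchor tuples are hypothetical names):

using System;
using System.Collections.Generic;
using System.Linq;

public sealed record RowBand(string Label, double Top, double Bottom);

static List<RowBand> BuildRowBands(
    IReadOnlyList<XmlTextFragment> rollLabels, double headerBottom, double tableBottom)
{
    var bands = new List<RowBand>();
    for (var i = 0; i < rollLabels.Count; i++)
    {
        // Each band runs from the midpoint above its label to the midpoint
        // below it; the first and last bands extend to the table edges.
        var top = i == 0
            ? headerBottom
            : (rollLabels[i - 1].Top + rollLabels[i].Top) / 2.0;
        var bottom = i == rollLabels.Count - 1
            ? tableBottom
            : (rollLabels[i].Top + rollLabels[i + 1].Top) / 2.0;
        bands.Add(new RowBand(rollLabels[i].Text, top, bottom));
    }
    return bands;
}

// A fragment belongs to whichever A-E header anchor has the nearest
// horizontal center.
static string ColumnFor(
    XmlTextFragment fragment, IReadOnlyList<(string Key, double CenterX)> anchors) =>
    anchors.OrderBy(a => Math.Abs(a.CenterX - fragment.CenterX)).First().Key;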

Line Reconstruction

Fragments inside a cell are grouped into lines by close top values and then ordered by left.

This produces a stable line list even when PDF text is broken into multiple fragments.

Boundary Repair

After the initial midpoint-based row assignment, the parser performs a repair step across adjacent rows in the same column.

If the next row begins with affix-like lines and then continues with prose, those leading affix lines are treated as leaked trailing affixes from the previous row and moved back.

This repair exists because some tables place affix lines close enough to the next row label that midpoint-only segmentation is not reliable.
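
A minimal sketch of the repair (names are assumptions; IsAffixLike stands for the conservative classifier described in the next section):

using System.Collections.Generic;
using System.Linq;

// If the next row's cell opens with affix-like lines and then switches to
// prose, those opening lines are treated as the previous row's trailing
// affix block and moved back. A cell that is entirely affixes is left alone.
static void RepairLeadingAffixLeakage(List<string> previousCell, List<string> nextCell)
{
    var leaked = nextCell.TakeWhile(IsAffixLike).ToList();
    if (leaked.Count == 0 || leaked.Count == nextCell.Count)
        return; // nothing leaked, or no prose follows the affix lines

    previousCell.AddRange(leaked);
    nextCell.RemoveRange(0, leaked.Count);
}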

Description vs Affix Splitting

The parser classifies lines as:

  • description-like prose
  • affix-like notation

Affix-like lines include:

  • +...
  • symbolic lines using the critical glyphs
  • branch-like affix lines such as with leg greaves: +2H - ...

Affix-like classification is intentionally conservative. Numeric prose lines such as 25% chance... are not treated as affixes unless they match a known affix-like notation pattern.
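
A sketch of that conservative stance (the glyph set shown empty here is a placeholder; per the phase-3 notes, the real set is learned from each table's footer legend):

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static readonly HashSet<char> AffixGlyphs = new(); // learned from the footer legend

static bool IsAffixLike(string line)
{
    var trimmed = line.Trim();

    // Bonus/penalty notation such as "+2H" or "-25".
    if (Regex.IsMatch(trimmed, @"^[+\-]\d"))
        return true;

    // Lines built from the table's affix glyphs, optionally with digits and
    // whitespace, e.g. a stun symbol followed by a round count.
    if (trimmed.Any(AffixGlyphs.Contains) &&
        trimmed.All(ch => AffixGlyphs.Contains(ch) || char.IsDigit(ch) || char.IsWhiteSpace(ch)))
        return true;

    // Deliberately NOT affix-like: prose that merely starts with a number,
    // such as "25% chance your weapon is stuck...". Branch-like lines are
    // handled by branch parsing and omitted from this sketch.
    return false;
}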

The current implementation stores:

  • base RawCellText
  • base DescriptionText
  • base RawAffixText
  • normalized base affix effects in critical_effect
  • parsed conditional branches with condition text, branch prose, branch affix text, and normalized branch affix effects
  • parsed conditional branches in debug artifacts and persisted SQLite rows

Validation Rules

The current validation pass is intentionally strict.

At minimum, a valid standard table must satisfy:

  • a detectable A-E header row exists
  • roll-band labels are found
  • each detected row produces content for all five columns
  • total parsed cell count matches row_count * 5
  • no cell begins with affix-like lines before prose
  • no cell contains prose after affix lines
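
As a sketch, the structural checks could be expressed like this (ParsedTable and its members are hypothetical names):

using System.Collections.Generic;

// Each failed check becomes a validation error; any error aborts the
// SQLite load while the debug artifacts are still written.
static void ValidateStandardTable(ParsedTable table, List<string> errors)
{
    if (!table.HasHeaderRow)
        errors.Add("No A-E header row detected.");

    if (table.RowBands.Count == 0)
        errors.Add("No roll-band labels found.");

    // Every detected row must produce one cell per severity column.
    var expected = table.RowBands.Count * 5;
    if (table.Cells.Count != expected)
        errors.Add($"Expected {expected} cells but parsed {table.Cells.Count}.");
}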

If validation fails:

  • artifacts are still written
  • SQLite load is aborted
  • the command returns an error

If validation succeeds with warnings:

  • artifacts still record the warnings
  • SQLite load continues
  • the CLI prints each warning before reporting the successful load

This design is deliberate. It is safer to reject ambiguous extraction than to load a nearly-correct but wrong lookup table.

Database Load Behavior

The loader is transactional.

The current load path:

  1. ensures the SQLite database exists
  2. upgrades older SQLite files to the current importer-owned critical schema where needed
  3. reconciles the targeted table, axes, and existing results by logical identity
  4. inserts newly discovered rows
  5. updates uncurated rows in place
  6. preserves curated rows and their edited child rows
  7. refreshes importer-managed source provenance and crop-image metadata
  8. deletes unmatched rows only when they are still uncurated
  9. commits only after the full merge is saved

Result identity is keyed by:

  • table slug
  • optional group key
  • column key
  • roll-band label
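
As a sketch, that identity is a four-part key (type and property names are assumptions):

// GroupKey is null for standard tables, which have no group axis.
public sealed record CriticalResultIdentity(
    string TableSlug,
    string? GroupKey,
    string ColumnKey,
    string RollBandLabel);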

This means importer iterations can target one table without resetting unrelated database content, while still protecting manually curated rows from later parser changes.

Image Toolchain

The importer now uses two Poppler tools:

  • pdftohtml -xml -i -noframes
    • extracts geometry-aware XML text
  • pdftoppm -png -r 432
    • renders page PNGs and per-cell crop PNGs
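
Both steps can be reproduced by hand when debugging. Illustrative invocations (the input path is made up, and output naming follows the tools' defaults rather than the importer's artifact layout):

pdftohtml -xml -i -noframes "sources/Slash.pdf" source
pdftoppm -png -r 432 "sources/Slash.pdf" page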

The importer keeps a central render scale factor of 4. The XML still defines bounds in its original coordinate space, but rendered PNGs and stored crop metadata now use the scaled coordinate space and a 432 DPI render setting. In practice:

  • XML coordinates are multiplied by 4 before crop extraction
  • page and crop metadata stored with each result reflect the scaled PNG coordinate space
  • crop alignment remains deterministic without changing the parsing pipeline
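
As a sketch, the crop math is a single multiplication per coordinate (names are assumptions):

// XML-space bounds become PNG-space crop rectangles; 432 DPI rendering over
// pdftohtml's coordinate space corresponds to the central factor of 4.
const int RenderScaleFactor = 4;

static (int X, int Y, int Width, int Height) ToPngCrop(
    int left, int top, int width, int height) =>
    (left * RenderScaleFactor, top * RenderScaleFactor,
     width * RenderScaleFactor, height * RenderScaleFactor);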

Interaction With Web App Startup

The web application no longer auto-seeds critical starter data on startup.

Startup still ensures the database exists and seeds attack starter data, but critical-table population is now owned by the importer.

This separation is important because:

  • importer iterations are frequent
  • parser logic is still evolving
  • startup should not silently repopulate critical data behind the tool's back

Current Code Map

Important files in the current implementation:

  • src/RolemasterDb.ImportTool/Program.cs
    • CLI entry point
  • src/RolemasterDb.ImportTool/CriticalImportCommandRunner.cs
    • command orchestration
  • src/RolemasterDb.ImportTool/CriticalImportLoader.cs
    • transactional SQLite load/reset behavior
  • src/RolemasterDb.ImportTool/Parsing/CriticalCellTextParser.cs
    • shared base-vs-branch parsing for cell content and affix extraction
  • src/RolemasterDb.ImportTool/Parsing/AffixEffectParser.cs
    • footer-legend-aware symbol effect normalization
  • src/RolemasterDb.ImportTool/Parsing/AffixLegend.cs
    • parsed footer legend model used for affix classification and effect mapping
  • src/RolemasterDb.ImportTool/CriticalImportManifestLoader.cs
    • manifest loading
  • src/RolemasterDb.ImportTool/PdfXmlExtractor.cs
    • XML extraction via pdftohtml
  • src/RolemasterDb.ImportTool/ImportArtifactWriter.cs
    • artifact output
  • src/RolemasterDb.ImportTool/Parsing/StandardCriticalTableParser.cs
    • standard table geometry parser
  • src/RolemasterDb.ImportTool/Parsing/XmlTextFragment.cs
    • positioned text fragment model
  • src/RolemasterDb.ImportTool/Parsing/ParsedCriticalCellArtifact.cs
    • debug cell artifact model
  • src/RolemasterDb.ImportTool/Parsing/ParsedCriticalBranch.cs
    • parsed branch artifact model with normalized effects
  • src/RolemasterDb.ImportTool/Parsing/ParsedCriticalEffect.cs
    • parsed effect artifact model
  • src/RolemasterDb.ImportTool/Parsing/ImportValidationReport.cs
    • validation output model
  • src/RolemasterDb.App/Data/RolemasterDbSchemaUpgrader.cs
    • SQLite upgrade hook for branch/effect-table rollout
  • src/RolemasterDb.App/Components/Shared/CriticalLookupResultCard.razor
    • web rendering of base results, conditional branches, and parsed affix effects

Adding a New Table

The recommended process for onboarding a new table is:

  1. Add a manifest entry.
  2. Run extract <slug>.
  3. Inspect source.xml.
  4. Run load <slug>.
  5. Inspect validation-report.json and parsed-cells.json.
  6. If validation succeeds, spot-check SQLite output.
  7. If validation fails, adjust the parser or add a family-specific parser strategy before retrying.

Debugging Guidance

If a table imports incorrectly, inspect artifacts in this order:

  1. validation-report.json
  2. parsed-cells.json
  3. fragments.json
  4. source.xml

That order usually answers the key questions fastest:

  • did validation fail
  • which row/column is wrong
  • were fragments assigned incorrectly
  • or was the extraction itself already malformed

Reliability Position

The current importer should be understood as:

  • reliable enough for geometry-based standard table iteration
  • much safer than the old flattened-text approach
  • still evolving toward broader family coverage and deeper normalization

The key design rule going forward is:

  • do not silently load ambiguous data

The importer should always prefer:

  • preserving source fidelity
  • writing review artifacts
  • failing validation

over:

  • guessing
  • auto-correcting without evidence
  • loading nearly-correct but structurally wrong critical results