RolemasterDB/docs/critical_import_tool.md

# Critical Import Tool

## Purpose

The critical import tool exists to migrate Rolemaster critical-table source PDFs into the SQLite database used by the web app.

The tool is intentionally separate from the web application startup path. Critical data needs to be re-imported repeatedly while the extraction and parsing logic evolves, so the import workflow must be:

- explicit
- repeatable
- debuggable
- able to rebuild importer-managed data without resetting the entire application

The tool currently lives in `src/RolemasterDb.ImportTool` and operates against the same SQLite schema used by the web app.

## Goals

The importer is designed around the following requirements:

- reset and reload critical data without touching unrelated tables
- preserve source fidelity while still producing structured lookup data
- make parsing failures visible before bad data reaches SQLite
- keep intermediate artifacts on disk for inspection
- support iterative parser development one table at a time

## Current Scope

The current implementation supports:

- explicit CLI commands for reset, extraction, and import
- manifest-driven source selection
- `standard` critical tables with columns `A-E`
- XML-based extraction using `pdftohtml -xml`
- geometry-based parsing across the currently enabled phase-3 tables:
  - `arcane-aether`
  - `arcane-nether`
  - `ballistic-shrapnel`
  - `brawling`
  - `cold`
  - `electricity`
  - `grapple`
  - `heat`
  - `impact`
  - `krush`
  - `ma-strikes`
  - `ma-sweeps`
  - `puncture`
  - `slash`
  - `subdual`
  - `tiny`
  - `unbalance`
- row-boundary repair for trailing affix leakage
- footer/page-number filtering during body parsing
- transactional loading into SQLite

The current implementation does not yet support:

- variant-column critical tables
- grouped variant tables
- `Mana.pdf`, whose current XML layout and affix notation still need a dedicated parser pass
- OCR/image-based PDFs such as `Void.pdf`
- normalized `critical_branch` population
- normalized `critical_effect` population
- automatic confidence scoring beyond validation errors

## High-Level Architecture

The importer workflow is:

1. Resolve a table entry from the manifest.
2. Extract the source PDF into an artifact format.
3. Parse the extracted artifact into an in-memory table model.
4. Write debug artifacts to disk.
5. Validate the parsed result.
6. If validation succeeds, load the parsed data into SQLite in a transaction.

The importer uses the same EF Core context and domain model as the web app, but it owns the critical-data population flow.
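
The ordering of those steps matters more than any single step: artifacts are written before validation, and SQLite is only touched after validation passes. The following Python sketch illustrates that control flow only. The real tool is a .NET console project, and every function name here (`run_load`, `make_step`, the `is_valid` key) is a hypothetical stand-in.

```python
def run_load(slug, resolve, extract, parse, write_artifacts, validate, load):
    """Run the documented steps in order and return a process-style exit code."""
    entry = resolve(slug)             # 1. manifest lookup
    source = extract(entry)           # 2. PDF -> extraction artifact
    table = parse(source)             # 3. artifact -> in-memory table model
    write_artifacts(slug, table)      # 4. written even if validation fails later
    report = validate(table)          # 5. validation gate
    if not report["is_valid"]:
        return 1                      # SQLite is never touched
    load(slug, table)                 # 6. transactional load
    return 0


# Stub steps that only record the order in which they were called.
calls = []


def make_step(name, result=None):
    def step(*args):
        calls.append(name)
        return result
    return step


exit_code = run_load(
    "slash",
    resolve=make_step("resolve", {"slug": "slash"}),
    extract=make_step("extract", "<pdf2xml/>"),
    parse=make_step("parse", {"rows": []}),
    write_artifacts=make_step("write_artifacts"),
    validate=make_step("validate", {"is_valid": False}),
    load=make_step("load"),
)
print(exit_code, calls)
```

With a failing validation stub, the run stops after `validate` and the `load` step is never called.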

## Implementation Phases

## Phase 1: Initial Importer and Text Extraction

Phase 1 established the first end-to-end workflow:

- a dedicated console project
- `CommandLineParser`-based verbs
- a table manifest
- transactional reset/load commands
- a first parser for `Slash.pdf`

### Phase 1 command surface

Phase 1 introduced these verbs:

- `reset criticals`
- `extract <table>`
- `load <table>`
- `import <table>`

### Phase 1 extraction approach

The initial version used `pdftotext -layout` to create a flattened text artifact. The parser then tried to reconstruct:

- column boundaries from the `A-E` header line
- roll-band rows from labels such as `71-75`
- cell contents by slicing monospaced text blocks

### Phase 1 outcome

Phase 1 proved that the import loop and database load path worked, but it also exposed a critical reliability problem: flattened text was not a safe source format for these PDFs.

### Phase 1 failure mode

The first serious regression was seen in `Slash.pdf`:

- lookup target: `slash`, severity `A`, roll `72`
- expected band: `71-75`
- broken result from the text-based parser: content from `76-80` mixed with stray characters from severity `B`

That failure showed the core problem with `pdftotext -layout`: it discards the original page geometry and forces the importer to guess row and column structure from a lossy text layout.

Because of that, phase 1 is important historically, but it is not the recommended foundation for further parser development.

## Phase 2: XML Geometry-Based Parsing

Phase 2 replaced the flattened-text pipeline with a geometry-aware pipeline based on `pdftohtml -xml`.

### Why Phase 2 was necessary

The PDFs are still text-based, but the text needs to be parsed with positional information intact. The XML output produced by `pdftohtml` preserves:

- page number
- `top`
- `left`
- `width`
- `height`
- text content

That positional data makes it possible to assign fragments to rows and columns based on geometry instead of guessing from flattened text lines.
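
As a rough illustration of how that output can be consumed, the Python sketch below reads `<text>` elements from a small hand-written sample in the `pdftohtml -xml` shape. The sample coordinates and the `Fragment` type are made up for illustration; the importer's own fragment model is the .NET type in `XmlTextFragment.cs`.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hand-written sample in the shape produced by `pdftohtml -xml`.
# Coordinates and text placement are illustrative, not taken from a real PDF.
SAMPLE_XML = """<pdf2xml>
  <page number="1" top="0" left="0" height="1188" width="918">
    <text top="120" left="95" width="10" height="12" font="0"><b>A</b></text>
    <text top="240" left="30" width="32" height="12" font="1">71-75</text>
    <text top="240" left="70" width="140" height="12" font="1">Blow falls on lower leg.</text>
  </page>
</pdf2xml>"""


@dataclass(frozen=True)
class Fragment:
    page: int
    top: int
    left: int
    width: int
    height: int
    text: str


def load_fragments(xml_text):
    """Collect every <text> element together with its page number and geometry."""
    root = ET.fromstring(xml_text)
    fragments = []
    for page in root.iter("page"):
        page_number = int(page.get("number"))
        for node in page.iter("text"):
            # itertext() also picks up text nested in inline tags such as <b>.
            text = "".join(node.itertext()).strip()
            if not text:
                continue
            fragments.append(Fragment(
                page=page_number,
                top=int(node.get("top")),
                left=int(node.get("left")),
                width=int(node.get("width")),
                height=int(node.get("height")),
                text=text,
            ))
    return fragments


fragments = load_fragments(SAMPLE_XML)
for fragment in fragments:
    print(fragment)
```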

### Phase 2 extraction format

The importer now extracts to XML instead of plain text:

- extraction tool: `pdftohtml -xml -i -noframes`
- artifact file: `source.xml`

### Phase 2 parser model

The parser now works in these stages:

1. Load all `<text>` fragments from the XML.
2. Detect the standard `A-E` header row.
3. Detect roll-band labels on the left margin.
4. Build row bands from the vertical positions of those roll labels.
5. Build column boundaries from the horizontal centers of the `A-E` header fragments.
6. Assign each text fragment to a row by `top`.
7. Assign each text fragment to a column by horizontal position.
8. Reconstruct each cell from ordered fragments.
9. Split cell content into description lines and affix-like lines.
10. Validate the result before touching SQLite.

### Phase 2 reliability improvement

This phase fixed the original `Slash / A / 72` corruption. The same lookup now resolves to:

- band `71-75`
- description `Blow falls on lower leg. Slash tendons. Poor sucker.`

The important change is not only that the current output is correct, but that the importer now fails fast on structural ambiguity instead of silently loading corrupted rows.

## Phase 2.1: Boundary Hardening After Manual Validation

After phase 2, a manual validation pass compared:

- the rendered `Slash.pdf`
- the extracted `source.xml`
- the imported SQLite rows

That review found a remaining defect around the `51-55` / `56-60` boundary:

- `51-55` lost several affix lines
- `56-60` gained leading affix lines from the previous row

The root cause was the original row segmentation rule:

- rows were assigned strictly by the midpoint between adjacent roll-label `top` values

That rule was too naive for rows whose affix block sits visually near the next row label.

### Phase 2.1 fix

The parser was hardened in two ways:

1. Leading affix leakage repair
   - after the initial row assignment, if a cell in the next row starts with affix-like lines and then continues with prose, those leading affix lines are moved back to the previous row
2. Better affix classification
   - generic digit-starting lines are no longer assumed to be affixes
   - this prevents prose such as `25% chance your weapon is stuck...` from being misclassified

### Phase 2.1 validation rules

The importer now explicitly rejects cells that still look structurally wrong after repair:

- prose and affix segments may not alternate more than once inside a cell

This keeps the phase-2.1 safety goal in place while allowing broader standard-table layouts that render a single affix block either before or after the prose block.

## Planned Future Phases

The current architecture is intended to support the following phases. Phase 3 is already under way; phases 4 through 7 are planned.

### Phase 3: Broader Table Coverage

Phase 3 expands the manifest and validates the shared `standard` parser across a broader set of `A-E` tables.

The currently enabled phase-3 table set is:

- `arcane-aether`
- `arcane-nether`
- `ballistic-shrapnel`
- `brawling`
- `cold`
- `electricity`
- `grapple`
- `heat`
- `impact`
- `krush`
- `ma-strikes`
- `ma-sweeps`
- `puncture`
- `slash`
- `subdual`
- `tiny`
- `unbalance`

Current phase-3 notes:

- header detection now tolerates minor `top` misalignment across the `A-E` header glyphs
- footer page numbers are filtered out before body parsing
- validation allows a single contiguous affix block either before or after prose
- `Mana.pdf` is intentionally left out for now because its row-anchor geometry and notation still need dedicated handling

### Phase 4: Variant and Grouped Tables

- support `variant_column` tables such as `Large Creature - Weapon.pdf`
- support `grouped_variant` tables such as `Large Creature - Magic.pdf`
- add parser strategies for additional table families

### Phase 5: Conditional Branch Extraction

- split branch-heavy cells into `critical_branch`
- preserve the base cell text and branch text separately
- support branch conditions such as `with helmet` and `w/o leg greaves`

### Phase 6: Effect Normalization

- parse symbolic affix lines into normalized effects
- populate `critical_effect`
- gradually enrich prose-derived effects over time

### Phase 7: OCR and Manual Fallback

- support image-based PDFs such as `Void.pdf`
- route image-based sources through OCR or curated manual input
- keep the same post-extraction parsing contract where possible

## Current CLI

The tool uses `CommandLineParser` and currently exposes these verbs:

### `reset criticals`

Deletes importer-managed critical data from SQLite.

Use this when:

- you want to clear imported critical data
- you want to rerun a fresh import
- you need to verify the rebuild path from an empty critical-table state

Example:

```powershell
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- reset criticals
```

### `extract <table>`

Resolves a table from the manifest and writes the extraction artifact to disk.

Example:

```powershell
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- extract slash
```

### `load <table>`

Reads the extraction artifact, parses it, writes debug artifacts, validates the result, and loads SQLite if validation succeeds.

Example:

```powershell
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- load slash
```

### `import <table>`

Runs extraction followed by load.

Example:

```powershell
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- import slash
```

## Manifest

The importer manifest is stored at:

- `sources/critical-import-manifest.json`

Each entry declares:

- `slug`
- `displayName`
- `family`
- `extractionMethod`
- `pdfPath`
- `enabled`

The manifest is intentionally the control point for enabling importer support one table at a time.

For the currently enabled phase-3 entries:

- `family` is `standard`
- `extractionMethod` is `xml`
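
To make the entry shape concrete, here is a hedged Python sketch of resolving a table from a manifest. The field names come from the list above. The top-level `tables` array, the `pdfPath` values, the `large-creature-weapon` slug, and the presence of disabled entries are assumptions made for illustration, not a description of the actual file.

```python
import json

# Hypothetical manifest content; only the field names are taken from this document.
MANIFEST_JSON = """
{
  "tables": [
    {"slug": "slash", "displayName": "Slash", "family": "standard",
     "extractionMethod": "xml", "pdfPath": "sources/Slash.pdf", "enabled": true},
    {"slug": "large-creature-weapon", "displayName": "Large Creature - Weapon",
     "family": "variant_column", "extractionMethod": "xml",
     "pdfPath": "sources/Large Creature - Weapon.pdf", "enabled": false}
  ]
}
"""


def resolve_table(manifest, slug):
    """Return the enabled manifest entry for a slug, or raise a clear error."""
    for entry in manifest["tables"]:
        if entry["slug"] == slug:
            if not entry["enabled"]:
                raise ValueError(f"table '{slug}' is disabled in the manifest")
            return entry
    raise KeyError(f"table '{slug}' is not in the manifest")


manifest = json.loads(MANIFEST_JSON)
entry = resolve_table(manifest, "slash")
print(entry["family"], entry["extractionMethod"])
```

Refusing disabled entries up front keeps the manifest as the single switch for which tables the importer will touch.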

## Artifact Layout

Artifacts are written under:

- `artifacts/import/critical/<slug>/`

The current artifact set is:

### `source.xml`

The raw XML extraction output from `pdftohtml`.

Use this when:

- checking whether text is present in the PDF
- inspecting original `top` and `left` coordinates
- diagnosing row/column misassignment

### `fragments.json`

A normalized list of parsed text fragments with page and position metadata.

Use this when:

- comparing raw XML to the importer’s internal fragment model
- confirming that specific fragments were loaded correctly
- debugging Unicode or whitespace normalization issues

### `parsed-cells.json`

The reconstructed cells after geometry-based row/column assignment.

Use this when:

- validating a specific row and column
- checking whether a fragment was assigned to the correct cell
- confirming description and affix splitting

### `validation-report.json`

The validation result for the parsed table.

This includes:

- overall validity
- validation errors
- row count
- cell count

Use this when:

- a `load` command fails
- a parser change introduces ambiguity
- you need to confirm that the importer refused to write SQLite data

## Standard Table Parsing Strategy

The current `standard` parser is designed for tables shaped like `Slash.pdf`:

- columns: `A-E`
- rows: roll bands such as `01-05`, `71-75`, `100`
- cell contents: prose, symbolic affixes, and sometimes conditional branch lines

### Header Detection

The parser searches the XML fragments for a header row made up of exactly these five single-letter fragments, tolerating minor `top` misalignment between them:

- `A`
- `B`
- `C`
- `D`
- `E`

Those positions define the standard-table column anchors.
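
One way to express that search is sketched below in Python. This is an illustration under assumptions, not the logic in `StandardCriticalTableParser.cs`: the `Fragment` type, the `find_header` name, and the 3-unit `top` tolerance are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    top: int
    left: int
    width: int
    text: str


HEADER_LETTERS = ("A", "B", "C", "D", "E")


def find_header(fragments, top_tolerance=3):
    """Return {letter: horizontal center} for the first A-E header row, or None."""
    candidates = sorted(
        (f for f in fragments if f.text in HEADER_LETTERS),
        key=lambda f: (f.top, f.left),
    )
    for anchor in candidates:
        # Gather single-letter fragments whose top is close to the anchor's top.
        row = [f for f in candidates if abs(f.top - anchor.top) <= top_tolerance]
        if sorted(f.text for f in row) == list(HEADER_LETTERS):
            return {f.text: f.left + f.width / 2 for f in row}
    return None


sample = [
    Fragment(400, 300, 10, "A"),   # stray single letter elsewhere on the page
    Fragment(120, 100, 10, "A"),
    Fragment(121, 250, 10, "B"),
    Fragment(119, 400, 10, "C"),
    Fragment(120, 550, 10, "D"),
    Fragment(122, 700, 10, "E"),
    Fragment(150, 30, 32, "01-05"),
]
header = find_header(sample)
print(header)
```

Requiring all five letters in one narrow vertical band is what keeps a stray single-letter fragment from being mistaken for a column anchor.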

### Row Detection

The parser searches the left margin below the header for roll-band labels, for example:

- `01-05`
- `66`
- `251+`

Those vertical positions define the row anchors.
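
A minimal Python sketch of label matching follows. The regular expression, the tuple layout, and the `left_margin_limit` parameter are assumptions chosen to cover the label shapes listed above; the importer's actual pattern may differ.

```python
import re

# Matches "01-05", "66", and "251+"; rejects prose such as "25% chance ...".
ROLL_LABEL = re.compile(r"^(\d{1,3})(?:-(\d{1,3})|(\+))?$")


def parse_roll_label(text):
    """Return (min_roll, max_roll), with max_roll None for open-ended labels."""
    match = ROLL_LABEL.match(text.strip())
    if match is None:
        return None
    low = int(match.group(1))
    if match.group(2) is not None:
        return (low, int(match.group(2)))
    if match.group(3) is not None:
        return (low, None)
    return (low, low)


def find_row_anchors(fragments, header_top, left_margin_limit):
    """fragments: (top, left, text) tuples. Returns anchors ordered by top."""
    anchors = []
    for top, left, text in fragments:
        # Only consider fragments below the header and inside the left margin.
        if top <= header_top or left >= left_margin_limit:
            continue
        band = parse_roll_label(text)
        if band is not None:
            anchors.append((top, text, band))
    return sorted(anchors, key=lambda anchor: anchor[0])


sample = [
    (900, 30, "251+"),
    (150, 30, "01-05"),
    (300, 30, "66"),
    (150, 80, "25% chance your weapon is stuck"),
    (100, 30, "Key"),
    (310, 400, "66"),   # same text, but not in the left margin
]
anchors = find_row_anchors(sample, header_top=120, left_margin_limit=60)
print(anchors)
```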

### Row Bands

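The parser places each row boundary at the midpoint between adjacent roll-band anchors, so a row’s vertical range runs from the midpoint above its label to the midpoint below it.

That keeps most wrapped text inside its own row when a cell spans multiple visual lines. The cases it misses, where an affix block sits close to the next row label, are handled by the boundary repair step described under Boundary Repair.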
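
The midpoint rule is simple enough to sketch in a few lines of Python. The function names and the sample `top` values are hypothetical.

```python
from bisect import bisect_right


def build_row_boundaries(anchor_tops):
    """Midpoints between adjacent anchors; N anchors give N-1 inner boundaries."""
    tops = sorted(anchor_tops)
    return [(a + b) / 2 for a, b in zip(tops, tops[1:])]


def assign_row(top, boundaries):
    """Index of the row band that contains `top`."""
    return bisect_right(boundaries, top)


boundaries = build_row_boundaries([150, 210, 270])
print(boundaries)
print(assign_row(150, boundaries), assign_row(185, boundaries), assign_row(262, boundaries))
```

Note that a fragment at `top` 185 lands in the second row even if it is visually the last line of the first row. That is the kind of leakage the boundary repair step exists to correct.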

### Column Assignment

Each text fragment is assigned to the nearest column band based on horizontal center position.

This is the core reliability improvement over the phase-1 text slicing approach.
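
A hedged Python sketch of nearest-center assignment, assuming the column anchors are the horizontal centers of the header letters. The `assign_column` name and the coordinates are illustrative.

```python
def assign_column(left, width, column_centers):
    """Return the column key whose center is nearest to the fragment's center."""
    center = left + width / 2
    return min(column_centers, key=lambda key: abs(column_centers[key] - center))


# Hypothetical header centers for columns A-E.
column_centers = {"A": 105, "B": 255, "C": 405, "D": 555, "E": 705}

print(assign_column(60, 150, column_centers))    # fragment centered at 135
print(assign_column(190, 100, column_centers))   # fragment centered at 240
print(assign_column(640, 140, column_centers))   # fragment centered at 710
```

Using the fragment's center rather than its `left` edge makes the assignment less sensitive to how far a line is indented inside its column. Left-margin fragments such as roll labels would need to be excluded before this step.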

### Line Reconstruction

Fragments inside a cell are grouped into lines by close `top` values and then ordered by `left`.

This produces a stable line list even when PDF text is broken into multiple fragments.
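
The grouping can be sketched as follows in Python. The 2-unit `top` tolerance and the single-space joiner are assumptions; the affix line `+2H` is a made-up sample.

```python
def reconstruct_lines(fragments, top_tolerance=2):
    """fragments: (top, left, text) tuples that belong to one cell."""
    lines = []  # each entry: [line_top, [(left, text), ...]]
    for top, left, text in sorted(fragments):
        if lines and abs(top - lines[-1][0]) <= top_tolerance:
            lines[-1][1].append((left, text))    # same visual line
        else:
            lines.append([top, [(left, text)]])  # start a new line
    # Order fragments inside each line from left to right.
    return [" ".join(text for _, text in sorted(parts)) for _, parts in lines]


cell_fragments = [
    (241, 150, "lower leg."),
    (240, 70, "Blow falls on"),
    (254, 70, "Slash tendons."),
    (268, 70, "+2H"),
]
lines = reconstruct_lines(cell_fragments)
print(lines)
```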

### Boundary Repair

After the initial midpoint-based row assignment, the parser performs a repair step across adjacent rows in the same column.

If the next row begins with affix-like lines and then continues with prose, those leading affix lines are treated as leaked trailing affixes from the previous row and moved back.

This repair exists because some tables place affix lines close enough to the next row label that midpoint-only segmentation is not reliable.
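
The repair rule can be sketched in Python as shown here. This is an illustration of the rule as described, not the importer's code: the `repair_leading_affixes` name is hypothetical, the sample cell text is invented, and the `startswith("+")` classifier is a deliberately crude placeholder. The sketch also does not try to recognize layouts that legitimately place the affix block before the prose.

```python
def repair_leading_affixes(column_cells, is_affix):
    """column_cells: one list of lines per row, for a single column, in row order."""
    repaired = [list(lines) for lines in column_cells]
    for index in range(1, len(repaired)):
        lines = repaired[index]
        lead = 0
        while lead < len(lines) and is_affix(lines[lead]):
            lead += 1
        # Move lines only when a leading affix run is followed by prose.
        if 0 < lead < len(lines):
            repaired[index - 1].extend(lines[:lead])
            repaired[index] = lines[lead:]
    return repaired


def starts_with_plus(line):
    """Placeholder classifier for this example only."""
    return line.startswith("+")


column_a = [
    ["Strike to thigh.", "+3H"],
    ["+2H", "Blow to shin.", "+5H"],   # "+2H" leaked in from the row above
]
repaired = repair_leading_affixes(column_a, starts_with_plus)
print(repaired)
```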

### Description vs Affix Splitting

The parser classifies lines as:

- description-like prose
- affix-like notation

Affix-like lines include:

- `+...`
- symbolic lines using the critical glyphs
- branch-like affix lines such as `with leg greaves: +2H - ...`

Affix-like classification is intentionally conservative. Numeric prose lines such as `25% chance...` are not treated as affixes unless they match a known affix-like notation pattern.

The current implementation stores:

- `RawCellText`
- `DescriptionText`
- `RawAffixText`

It does not yet normalize branches or effects into separate tables.
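
A conservative classifier along those lines might look like the Python sketch below. The glyph set is a placeholder (the real critical glyphs depend on the source PDFs), the branch-prefix pattern is an assumption based on the `with ...:` and `w/o ...:` examples in this document, and the joiners used by `split_cell` are guesses rather than the importer's storage format.

```python
import re

# Placeholder glyphs (summation and product signs); not the importer's real set.
CRITICAL_GLYPHS = {"\u2211", "\u220f"}

# Branch-like prefix such as "with leg greaves:" or "w/o helmet:".
BRANCH_PREFIX = re.compile(r"^(?:with|w/o)\s+[^:]+:\s*")


def is_affix_like(line):
    """True only for lines that match a known affix-like notation pattern."""
    text = line.strip()
    if not text:
        return False
    if text.startswith("+"):
        return True
    if any(ch in CRITICAL_GLYPHS for ch in text):
        return True
    match = BRANCH_PREFIX.match(text)
    if match:
        # A branch line counts as an affix only if its payload is affix-like.
        return is_affix_like(text[match.end():])
    return False


def split_cell(lines):
    """Return (description_text, raw_affix_text) for one cell's lines."""
    description = " ".join(l for l in lines if not is_affix_like(l))
    affixes = "\n".join(l for l in lines if is_affix_like(l))
    return description, affixes


checks = {
    "plus": is_affix_like("+2H"),
    "glyph": is_affix_like("3\u2211"),
    "branch": is_affix_like("with leg greaves: +2H"),
    "numeric_prose": is_affix_like("25% chance your weapon is stuck."),
    "prose_branch": is_affix_like("with helmet: knocked down"),
}
print(checks)
```

Digit-leading lines fall through to `False` unless another rule matches, which mirrors the conservative stance described above.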

## Validation Rules

The current validation pass is intentionally strict.

At minimum, a valid `standard` table must satisfy:

- a detectable `A-E` header row exists
- roll-band labels are found
- each detected row produces content for all five columns
- total parsed cell count matches `row_count * 5`
- prose and affix segments alternate at most once inside a cell, so a cell holds a single contiguous affix block either before or after the prose

If validation fails:

- artifacts are still written
- SQLite load is aborted
- the command returns an error

This design is deliberate: rejecting an ambiguous extraction is safer than loading a nearly correct but wrong lookup table.
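
The structural checks can be sketched in Python as follows. The cell-level check implements the phase-2.1 rule that prose and affix segments may not alternate more than once inside a cell. The data layout, function names, and report keys are hypothetical and do not describe the real `validation-report.json` schema.

```python
COLUMNS = ("A", "B", "C", "D", "E")


def count_switches(kinds):
    """Number of prose/affix transitions in a sequence of per-line kinds."""
    return sum(1 for a, b in zip(kinds, kinds[1:]) if a != b)


def validate_table(rows):
    """rows: {roll_label: {column: [(kind, text), ...]}}, kind is 'prose' or 'affix'."""
    errors = []
    if not rows:
        errors.append("no roll-band rows found")
    cell_count = 0
    for label, cells in rows.items():
        for column in COLUMNS:
            lines = cells.get(column, [])
            if not lines:
                errors.append(f"{label}/{column}: empty cell")
                continue
            cell_count += 1
            if count_switches([kind for kind, _ in lines]) > 1:
                errors.append(f"{label}/{column}: prose and affix alternate more than once")
    expected = len(rows) * len(COLUMNS)
    if cell_count != expected:
        errors.append(f"cell count {cell_count} != {expected}")
    return {"is_valid": not errors, "errors": errors,
            "row_count": len(rows), "cell_count": cell_count}


ok_cell = [("prose", "Blow to arm."), ("affix", "+2H")]
good = {"51-55": {column: ok_cell for column in COLUMNS}}
bad = {
    "51-55": {column: ok_cell for column in COLUMNS},
    "56-60": {
        "A": ok_cell,
        "B": ok_cell,
        "C": [("prose", "x"), ("affix", "+1H"), ("prose", "y")],
        "D": ok_cell,
        # column E is missing entirely
    },
}
good_report = validate_table(good)
bad_report = validate_table(bad)
print(good_report["is_valid"], bad_report["errors"])
```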

## Database Load Behavior

The loader is transactional.

The current load path:

1. ensures the SQLite database exists
2. deletes the existing subtree for the targeted critical table
3. inserts:
   - `critical_table`
   - `critical_column`
   - `critical_roll_band`
   - `critical_result`
4. commits only after the full table is saved

This means importer iterations can target one table without resetting unrelated database content.
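
The delete-then-insert pattern inside one transaction can be illustrated with Python's built-in `sqlite3` module. The real loader uses EF Core; the schema below is a simplified stand-in in which only the table names follow this document. The columns are assumptions, `critical_column` and `critical_roll_band` are left out for brevity, and the sample descriptions are invented.

```python
import sqlite3

# Simplified stand-in schema; column names are assumptions.
SCHEMA = """
CREATE TABLE critical_table (
    id INTEGER PRIMARY KEY,
    slug TEXT NOT NULL UNIQUE
);
CREATE TABLE critical_result (
    id INTEGER PRIMARY KEY,
    table_id INTEGER NOT NULL,
    roll_band TEXT NOT NULL,
    column_key TEXT NOT NULL,
    description TEXT NOT NULL
);
"""


def load_table(conn, slug, results):
    """Replace one table's rows atomically.

    results: iterable of (roll_band, column_key, description) tuples.
    """
    with conn:  # commits on success, rolls back if an exception escapes
        row = conn.execute(
            "SELECT id FROM critical_table WHERE slug = ?", (slug,)
        ).fetchone()
        if row is not None:
            # Delete the existing subtree for this table only.
            conn.execute("DELETE FROM critical_result WHERE table_id = ?", (row[0],))
            conn.execute("DELETE FROM critical_table WHERE id = ?", (row[0],))
        cursor = conn.execute("INSERT INTO critical_table (slug) VALUES (?)", (slug,))
        conn.executemany(
            "INSERT INTO critical_result (table_id, roll_band, column_key, description)"
            " VALUES (?, ?, ?, ?)",
            [(cursor.lastrowid, band, column, text) for band, column, text in results],
        )


def count_results(conn, slug):
    return conn.execute(
        "SELECT COUNT(*) FROM critical_result r"
        " JOIN critical_table t ON t.id = r.table_id WHERE t.slug = ?",
        (slug,),
    ).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

load_table(conn, "krush", [("01-05", "A", "sample text")])
load_table(conn, "slash", [("71-75", "A", "first import"), ("71-75", "B", "first import")])

# Re-importing slash replaces its rows and leaves krush alone.
load_table(conn, "slash", [("71-75", "A", "second import")])

# A failing load (NULL description violates NOT NULL) rolls back completely.
try:
    load_table(conn, "slash", [("71-75", "A", "third import"), ("71-75", "B", None)])
except sqlite3.IntegrityError:
    pass

print(count_results(conn, "krush"), count_results(conn, "slash"))
```

After the failed third load, the `slash` rows from the second import are still in place, which is the property that makes repeated importer iterations safe.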

## Interaction With Web App Startup

The web application no longer auto-seeds critical starter data on startup.

Startup still ensures the database exists and seeds attack starter data, but critical-table population is now owned by the importer.

This separation is important because:

- importer iterations are frequent
- parser logic is still evolving
- startup should not silently repopulate critical data behind the tool’s back

## Current Code Map

Important files in the current implementation:

- `src/RolemasterDb.ImportTool/Program.cs`
  - CLI entry point
- `src/RolemasterDb.ImportTool/CriticalImportCommandRunner.cs`
  - command orchestration
- `src/RolemasterDb.ImportTool/CriticalImportLoader.cs`
  - transactional SQLite load/reset behavior
- `src/RolemasterDb.ImportTool/CriticalImportManifestLoader.cs`
  - manifest loading
- `src/RolemasterDb.ImportTool/PdfXmlExtractor.cs`
  - XML extraction via `pdftohtml`
- `src/RolemasterDb.ImportTool/ImportArtifactWriter.cs`
  - artifact output
- `src/RolemasterDb.ImportTool/Parsing/StandardCriticalTableParser.cs`
  - standard table geometry parser
- `src/RolemasterDb.ImportTool/Parsing/XmlTextFragment.cs`
  - positioned text fragment model
- `src/RolemasterDb.ImportTool/Parsing/ParsedCriticalCellArtifact.cs`
  - debug cell artifact model
- `src/RolemasterDb.ImportTool/Parsing/ImportValidationReport.cs`
  - validation output model

## Adding a New Table

The recommended process for onboarding a new table is:

1. Add a manifest entry.
2. Run `extract <slug>`.
3. Inspect `source.xml`.
4. Run `load <slug>`.
5. Inspect `validation-report.json` and `parsed-cells.json`.
6. If validation succeeds, spot-check SQLite output.
7. If validation fails, adjust the parser or add a family-specific parser strategy before retrying.

## Debugging Guidance

If a table imports incorrectly, inspect artifacts in this order:

1. `validation-report.json`
2. `parsed-cells.json`
3. `fragments.json`
4. `source.xml`

That order usually answers the key questions fastest:

- did validation fail?
- which row/column is wrong?
- were fragments assigned to the wrong cell?
- was the extraction itself already malformed?

## Reliability Position

The current importer should be understood as:

- reliable enough for geometry-based `standard` table iteration
- much safer than the old flattened-text approach
- still evolving toward broader family coverage and deeper normalization

The key design rule going forward is:

- do not silently load ambiguous data

The importer should always prefer:

- preserving source fidelity
- writing review artifacts
- failing validation

over:

- guessing
- auto-correcting without evidence
- loading nearly correct but structurally wrong critical results