# Critical Import Tool

## Purpose

The critical import tool exists to migrate Rolemaster critical-table source PDFs into the SQLite database used by the web app. The tool is intentionally separate from the web application startup path. Critical data needs to be re-imported repeatedly while the extraction and parsing logic evolves, so the import workflow must be:

- explicit
- repeatable
- debuggable
- able to rebuild importer-managed data without resetting the entire application

The tool currently lives in `src/RolemasterDb.ImportTool` and operates against the same SQLite schema used by the web app.

## Goals

The importer is designed around the following requirements:

- reset and reload critical data without touching unrelated tables
- preserve source fidelity while still producing structured lookup data
- make parsing failures visible before bad data reaches SQLite
- keep intermediate artifacts on disk for inspection
- support iterative parser development one table at a time

## Current Scope

The current implementation supports:

- explicit CLI commands for reset, extraction, and import
- manifest-driven source selection
- `standard` critical tables with columns `A-E`
- XML-based extraction using `pdftohtml -xml`
- geometry-based parsing for `Slash.pdf`
- transactional loading into SQLite

The current implementation does not yet support:

- variant-column critical tables
- grouped variant tables
- OCR/image-based PDFs such as `Void.pdf`
- normalized `critical_branch` population
- normalized `critical_effect` population
- automatic confidence scoring beyond validation errors

## High-Level Architecture

The importer workflow is:

1. Resolve a table entry from the manifest.
2. Extract the source PDF into an artifact format.
3. Parse the extracted artifact into an in-memory table model.
4. Write debug artifacts to disk.
5. Validate the parsed result.
6. If validation succeeds, load the parsed data into SQLite in a transaction.
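The workflow above can be sketched as a single fail-closed pipeline. This is an illustrative Python sketch with hypothetical names (`ManifestEntry`, `run_import`, and the step callables are inventions for this sketch); the real tool is a C# console app.

```python
from dataclasses import dataclass, field


@dataclass
class ManifestEntry:
    # Hypothetical stand-in for a manifest entry; the real manifest
    # also carries displayName, family, extractionMethod, and pdfPath.
    slug: str
    pdf_path: str
    enabled: bool = True


@dataclass
class ImportResult:
    loaded: bool
    errors: list = field(default_factory=list)


def run_import(entry, extract, parse, validate, load, write_artifacts):
    """Mirror the six-step workflow: extract, parse, write debug
    artifacts, validate, and only load SQLite when validation passes."""
    artifact = extract(entry.pdf_path)       # step 2: PDF -> artifact
    table = parse(artifact)                  # step 3: artifact -> model
    write_artifacts(entry.slug, table)       # step 4: always written
    errors = validate(table)                 # step 5
    if errors:                               # step 6: fail closed
        return ImportResult(loaded=False, errors=errors)
    load(table)                              # transactional in the real tool
    return ImportResult(loaded=True)
```

Note that artifacts are written *before* validation, so a failed run still leaves inspection material on disk.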
The importer uses the same EF Core context and domain model as the web app, but it owns the critical-data population flow.

## Implementation Phases

## Phase 1: Initial Importer and Text Extraction

Phase 1 established the first end-to-end workflow:

- a dedicated console project
- `CommandLineParser`-based verbs
- a table manifest
- transactional reset/load commands
- a first parser for `Slash.pdf`

### Phase 1 command surface

Phase 1 introduced these verbs:

- `reset criticals`
- `extract <slug>`
- `load <slug>`
- `import <slug>`

### Phase 1 extraction approach

The initial version used `pdftotext -layout` to create a flattened text artifact. The parser then tried to reconstruct:

- column boundaries from the `A-E` header line
- roll-band rows from labels such as `71-75`
- cell contents by slicing monospaced text blocks

### Phase 1 outcome

Phase 1 proved that the import loop and database load path worked, but it also exposed a critical reliability problem: flattened text was not a safe source format for these PDFs.

### Phase 1 failure mode

The first serious regression was seen in `Slash.pdf`:

- lookup target: `slash`, severity `A`, roll `72`
- expected band: `71-75`
- broken result from the text-based parser: content from `76-80` mixed with stray characters from severity `B`

That failure showed the core problem with `pdftotext -layout`: it discards the original page geometry and forces the importer to guess row and column structure from a lossy text layout. Because of that, phase 1 is important historically, but it is not the recommended foundation for further parser development.

## Phase 2: XML Geometry-Based Parsing

Phase 2 replaced the flattened-text pipeline with a geometry-aware pipeline based on `pdftohtml -xml`.

### Why Phase 2 was necessary

The PDFs are still text-based, but the text needs to be parsed with positional information intact. The XML output produced by `pdftohtml` preserves:

- page number
- `top`
- `left`
- `width`
- `height`
- text content

That positional data makes it possible to assign fragments to rows and columns based on geometry instead of guessing from flattened text lines.

### Phase 2 extraction format

The importer now extracts to XML instead of plain text:

- extraction tool: `pdftohtml -xml -i -noframes`
- artifact file: `source.xml`

### Phase 2 parser model

The parser now works in these stages:

1. Load all `<text>` fragments from the XML.
2. Detect the standard `A-E` header row.
3. Detect roll-band labels on the left margin.
4. Build row bands from the vertical positions of those roll labels.
5. Build column boundaries from the horizontal centers of the `A-E` header fragments.
6. Assign each text fragment to a row by `top`.
7. Assign each text fragment to a column by horizontal position.
8. Reconstruct each cell from ordered fragments.
9. Split cell content into description lines and affix-like lines.
10. Validate the result before touching SQLite.

### Phase 2 reliability improvement

This phase fixed the original `Slash / A / 72` corruption. The same lookup now resolves to:

- band `71-75`
- description `Blow falls on lower leg. Slash tendons. Poor sucker.`

The important change is not only that the current output is correct, but that the importer now fails fast on structural ambiguity instead of silently loading corrupted rows.

## Planned Future Phases

The current architecture is intended to support additional phases:

### Phase 3: Broader Table Coverage

- add more `standard` critical PDFs
- expand the manifest
- verify parser stability across more source layouts

### Phase 4: Variant and Grouped Tables

- support `variant_column` tables such as `Large Creature - Weapon.pdf`
- support `grouped_variant` tables such as `Large Creature - Magic.pdf`
- add parser strategies for additional table families

### Phase 5: Conditional Branch Extraction

- split branch-heavy cells into `critical_branch`
- preserve the base cell text and branch text separately
- support branch conditions such as `with helmet` and `w/o leg greaves`

### Phase 6: Effect Normalization

- parse symbolic affix lines into normalized effects
- populate `critical_effect`
- gradually enrich prose-derived effects over time

### Phase 7: OCR and Manual Fallback

- support image-based PDFs such as `Void.pdf`
- route image-based sources through OCR or curated manual input
- keep the same post-extraction parsing contract where possible

## Current CLI

The tool uses `CommandLineParser` and currently exposes these verbs:

### `reset criticals`
Deletes importer-managed critical data from SQLite.

Use this when:

- you want to clear imported critical data
- you want to rerun a fresh import
- you need to verify the rebuild path from an empty critical-table state

Example:

```powershell
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- reset criticals
```

### `extract <slug>`

Resolves a table from the manifest and writes the extraction artifact to disk.

Example:

```powershell
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- extract slash
```

### `load <slug>`

Reads the extraction artifact, parses it, writes debug artifacts, validates the result, and loads SQLite if validation succeeds.

Example:

```powershell
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- load slash
```

### `import <slug>`

Runs extraction followed by load.

Example:

```powershell
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- import slash
```

## Manifest

The importer manifest is stored at:

- `sources/critical-import-manifest.json`

Each entry declares:

- `slug`
- `displayName`
- `family`
- `extractionMethod`
- `pdfPath`
- `enabled`

The manifest is intentionally the control point for enabling importer support one table at a time.

## Artifact Layout

Artifacts are written under:

- `artifacts/import/critical/<slug>/`

The current artifact set is:

### `source.xml`

The raw XML extraction output from `pdftohtml`.

Use this when:

- checking whether text is present in the PDF
- inspecting original `top` and `left` coordinates
- diagnosing row/column misassignment

### `fragments.json`

A normalized list of parsed text fragments with page and position metadata.

Use this when:

- comparing raw XML to the importer’s internal fragment model
- confirming that specific fragments were loaded correctly
- debugging Unicode or whitespace normalization issues

### `parsed-cells.json`

The reconstructed cells after geometry-based row/column assignment.

Use this when:

- validating a specific row and column
- checking whether a fragment was assigned to the correct cell
- confirming description and affix splitting

### `validation-report.json`

The validation result for the parsed table.
This includes:

- overall validity
- validation errors
- row count
- cell count

Use this when:

- a `load` command fails
- a parser change introduces ambiguity
- you need to confirm that the importer refused to write SQLite data

## Standard Table Parsing Strategy

The current `standard` parser is designed for tables shaped like `Slash.pdf`:

- columns: `A-E`
- rows: roll bands such as `01-05`, `71-75`, `100`
- cell contents: prose, symbolic affixes, and sometimes conditional branch lines

### Header Detection

The parser searches the XML fragments for a row containing exactly:

- `A`
- `B`
- `C`
- `D`
- `E`

Those positions define the standard-table column anchors.

### Row Detection

The parser searches the left margin below the header for roll-band labels, for example:

- `01-05`
- `66`
- `251+`

Those vertical positions define the row anchors.

### Row Bands

The parser derives each row’s vertical range from the midpoint between adjacent roll-band anchors. That prevents one row from drifting into the next when text wraps over multiple visual lines.

### Column Assignment

Each text fragment is assigned to the nearest column band based on horizontal center position. This is the core reliability improvement over the phase-1 text slicing approach.

### Line Reconstruction

Fragments inside a cell are grouped into lines by close `top` values and then ordered by `left`. This produces a stable line list even when PDF text is broken into multiple fragments.

### Description vs Affix Splitting

The parser classifies lines as:

- description-like prose
- affix-like notation

Affix-like lines include:

- `+...`
- symbolic lines using the critical glyphs
- branch-like affix lines such as `with leg greaves: +2H - ...`

The current implementation stores:

- `RawCellText`
- `DescriptionText`
- `RawAffixText`

It does not yet normalize branches or effects into separate tables.

## Validation Rules

The current validation pass is intentionally strict.
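The strictness amounts to a fail-closed check: every rule must pass before any database write. The following is an illustrative Python sketch of those checks with hypothetical names (`validate_standard_table`, the `cells` mapping); the real tool implements this in C#.

```python
def validate_standard_table(header_found, roll_bands, cells):
    """Collect validation errors for a parsed `standard` table.

    `cells` is a mapping of (roll_band, column) -> cell text.
    An empty error list is the only state that permits a SQLite load.
    """
    errors = []
    columns = ("A", "B", "C", "D", "E")

    # a detectable A-E header row must exist
    if not header_found:
        errors.append("no A-E header row detected")

    # roll-band labels must be found
    if not roll_bands:
        errors.append("no roll-band labels found")

    # each detected row must produce content for all five columns
    for band in roll_bands:
        for col in columns:
            if not cells.get((band, col)):
                errors.append(f"missing content for {band}/{col}")

    # total parsed cell count must match row_count * 5
    if len(cells) != len(roll_bands) * len(columns):
        errors.append("cell count does not match row_count * 5")

    return errors
```

A caller would write artifacts first, then load SQLite only when the returned list is empty.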
At minimum, a valid `standard` table must satisfy:

- a detectable `A-E` header row exists
- roll-band labels are found
- each detected row produces content for all five columns
- total parsed cell count matches `row_count * 5`

If validation fails:

- artifacts are still written
- the SQLite load is aborted
- the command returns an error

This design is deliberate. It is safer to reject an ambiguous extraction than to load a nearly-correct but wrong lookup table.

## Database Load Behavior

The loader is transactional. The current load path:

1. ensures the SQLite database exists
2. deletes the existing subtree for the targeted critical table
3. inserts:
   - `critical_table`
   - `critical_column`
   - `critical_roll_band`
   - `critical_result`
4. commits only after the full table is saved

This means importer iterations can target one table without resetting unrelated database content.

## Interaction With Web App Startup

The web application no longer auto-seeds critical starter data on startup. Startup still ensures the database exists and seeds attack starter data, but critical-table population is now owned by the importer.
This separation is important because:

- importer iterations are frequent
- parser logic is still evolving
- startup should not silently repopulate critical data behind the tool’s back

## Current Code Map

Important files in the current implementation:

- `src/RolemasterDb.ImportTool/Program.cs` - CLI entry point
- `src/RolemasterDb.ImportTool/CriticalImportCommandRunner.cs` - command orchestration
- `src/RolemasterDb.ImportTool/CriticalImportLoader.cs` - transactional SQLite load/reset behavior
- `src/RolemasterDb.ImportTool/CriticalImportManifestLoader.cs` - manifest loading
- `src/RolemasterDb.ImportTool/PdfXmlExtractor.cs` - XML extraction via `pdftohtml`
- `src/RolemasterDb.ImportTool/ImportArtifactWriter.cs` - artifact output
- `src/RolemasterDb.ImportTool/Parsing/StandardCriticalTableParser.cs` - standard table geometry parser
- `src/RolemasterDb.ImportTool/Parsing/XmlTextFragment.cs` - positioned text fragment model
- `src/RolemasterDb.ImportTool/Parsing/ParsedCriticalCellArtifact.cs` - debug cell artifact model
- `src/RolemasterDb.ImportTool/Parsing/ImportValidationReport.cs` - validation output model

## Adding a New Table

The recommended process for onboarding a new table is:

1. Add a manifest entry.
2. Run `extract <slug>`.
3. Inspect `source.xml`.
4. Run `load <slug>`.
5. Inspect `validation-report.json` and `parsed-cells.json`.
6. If validation succeeds, spot-check the SQLite output.
7. If validation fails, adjust the parser or add a family-specific parser strategy before retrying.

## Debugging Guidance

If a table imports incorrectly, inspect artifacts in this order:

1. `validation-report.json`
2. `parsed-cells.json`
3. `fragments.json`
4. `source.xml`

That order usually answers the key questions fastest:

- did validation fail?
- which row/column is wrong?
- were fragments assigned incorrectly?
- or was the extraction itself already malformed?

## Reliability Position

The current importer should be understood as:

- reliable enough for geometry-based `standard` table iteration
- much safer than the old flattened-text approach
- still evolving toward broader family coverage and deeper normalization

The key design rule going forward is:

- do not silently load ambiguous data

The importer should always prefer:

- preserving source fidelity
- writing review artifacts
- failing validation

over:

- guessing
- auto-correcting without evidence
- loading nearly-correct but structurally wrong critical results
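As one illustration of that rule, a geometry assignment step can refuse to guess when a fragment sits almost exactly between two column anchors, surfacing the ambiguity as a validation error instead of a silently wrong cell. This is an illustrative Python sketch with hypothetical names (`assign_column`, `ambiguity_ratio`), not the tool’s C# code.

```python
def assign_column(center_x, column_centers, ambiguity_ratio=0.8):
    """Assign a fragment's horizontal center to the nearest column anchor.

    Returns the column key, or None when the nearest and second-nearest
    anchors are almost equally close -- the caller treats None as a
    validation failure rather than loading a guessed cell.
    """
    # rank anchors by distance to the fragment's horizontal center
    ranked = sorted(column_centers.items(), key=lambda kv: abs(kv[1] - center_x))
    (best_col, best_x), (_, next_x) = ranked[0], ranked[1]
    best_d = abs(best_x - center_x)
    next_d = abs(next_x - center_x)
    if next_d > 0 and best_d / next_d > ambiguity_ratio:
        return None  # ambiguous: fail validation rather than guess
    return best_col
```

The threshold value here is arbitrary; the point is the shape of the contract: an assignment routine that can say "I don't know" feeds directly into the fail-closed validation pass.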