# Critical Import Tool

## Purpose

The critical import tool exists to migrate Rolemaster critical-table source PDFs into the SQLite database used by the web app. The tool is intentionally separate from the web application startup path. Critical data needs to be re-imported repeatedly while the extraction and parsing logic evolves, so the import workflow must be:

- explicit
- repeatable
- debuggable
- able to rebuild importer-managed data without resetting the entire application

The tool currently lives in `src/RolemasterDb.ImportTool` and operates against the same SQLite schema used by the web app.

## Goals

The importer is designed around the following requirements:

- reset and reload critical data without touching unrelated tables
- preserve source fidelity while still producing structured lookup data
- make parsing failures visible before bad data reaches SQLite
- keep intermediate artifacts on disk for inspection
- support iterative parser development one table at a time

## Current Scope

The current implementation supports:

- explicit CLI commands for reset, extraction, and import
- manifest-driven source selection
- `standard` critical tables with columns `A-E`
- XML-based extraction using `pdftohtml -xml`
- geometry-based parsing for `Slash.pdf`
- transactional loading into SQLite

The current implementation does not yet support:

- variant-column critical tables
- grouped variant tables
- OCR/image-based PDFs such as `Void.pdf`
- normalized `critical_branch` population
- normalized `critical_effect` population
- automatic confidence scoring beyond validation errors

## High-Level Architecture

The importer workflow is:

1. Resolve a table entry from the manifest.
2. Extract the source PDF into an artifact format.
3. Parse the extracted artifact into an in-memory table model.
4. Write debug artifacts to disk.
5. Validate the parsed result.
6. If validation succeeds, load the parsed data into SQLite in a transaction.
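The workflow above can be sketched as a single fail-closed pipeline. This is an illustrative Python sketch with hypothetical names (`ManifestEntry`, `run_import`, and the step callables are inventions for this sketch); the real tool is a C# console app.

```python
from dataclasses import dataclass, field


@dataclass
class ManifestEntry:
    # Hypothetical stand-in for a manifest entry; the real manifest
    # also carries displayName, family, extractionMethod, and pdfPath.
    slug: str
    pdf_path: str
    enabled: bool = True


@dataclass
class ImportResult:
    loaded: bool
    errors: list = field(default_factory=list)


def run_import(entry, extract, parse, validate, load, write_artifacts):
    """Mirror the six-step workflow: extract, parse, write debug
    artifacts, validate, and only load SQLite when validation passes."""
    artifact = extract(entry.pdf_path)       # step 2: PDF -> artifact
    table = parse(artifact)                  # step 3: artifact -> model
    write_artifacts(entry.slug, table)       # step 4: always written
    errors = validate(table)                 # step 5
    if errors:                               # step 6: fail closed
        return ImportResult(loaded=False, errors=errors)
    load(table)                              # transactional in the real tool
    return ImportResult(loaded=True)
```

Note that artifacts are written *before* validation, so a failed run still leaves inspection material on disk.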
The importer uses the same EF Core context and domain model as the web app, but it owns the critical-data population flow.

## Implementation Phases

## Phase 1: Initial Importer and Text Extraction

Phase 1 established the first end-to-end workflow:

- a dedicated console project
- `CommandLineParser`-based verbs
- a table manifest
- transactional reset/load commands
- a first parser for `Slash.pdf`

### Phase 1 command surface

Phase 1 introduced these verbs:

- `reset criticals`
- `extract <slug>`
- `load <slug>`
- `import <slug>`

### Phase 1 extraction approach

The initial version used `pdftotext -layout` to create a flattened text artifact. The parser then tried to reconstruct:

- column boundaries from the `A-E` header line
- roll-band rows from labels such as `71-75`
- cell contents by slicing monospaced text blocks

### Phase 1 outcome

Phase 1 proved that the import loop and database load path worked, but it also exposed a critical reliability problem: flattened text was not a safe source format for these PDFs.

### Phase 1 failure mode

The first serious regression was seen in `Slash.pdf`:

- lookup target: `slash`, severity `A`, roll `72`
- expected band: `71-75`
- broken result from the text-based parser: content from `76-80` mixed with stray characters from severity `B`

That failure showed the core problem with `pdftotext -layout`: it discards the original page geometry and forces the importer to guess row and column structure from a lossy text layout. Because of that, phase 1 is important historically, but it is not the recommended foundation for further parser development.

## Phase 2: XML Geometry-Based Parsing

Phase 2 replaced the flattened-text pipeline with a geometry-aware pipeline based on `pdftohtml -xml`.

### Why Phase 2 was necessary

The PDFs are still text-based, but the text needs to be parsed with positional information intact. The XML output produced by `pdftohtml` preserves:

- page number
- `top`
- `left`
- `width`
- `height`
- text content

That positional data makes it possible to assign fragments to rows and columns based on geometry instead of guessing from flattened text lines.

### Phase 2 extraction format

The importer now extracts to XML instead of plain text:

- extraction tool: `pdftohtml -xml -i -noframes`
- artifact file: `source.xml`

### Phase 2 parser model

The parser now works in these stages:

1. Load all `<text>` fragments from the XML.
2. Detect the standard `A-E` header row.
3. Detect roll-band labels on the left margin.
4. Build row bands from the vertical positions of those roll labels.
5. Build column boundaries from the horizontal centers of the `A-E` header fragments.
6. Assign each text fragment to a row by `top`.
7. Assign each text fragment to a column by horizontal position.
8. Reconstruct each cell from ordered fragments.
9. Split cell content into description lines and affix-like lines.
10. Validate the result before touching SQLite.

### Phase 2 reliability improvement

This phase fixed the original `Slash / A / 72` corruption. The same lookup now resolves to:

- band `71-75`
- description `Blow falls on lower leg. Slash tendons. Poor sucker.`

The important change is not only that the current output is correct, but that the importer now fails fast on structural ambiguity instead of silently loading corrupted rows.

## Planned Future Phases

The current architecture is intended to support additional phases:

### Phase 3: Broader Table Coverage

- add more `standard` critical PDFs
- expand the manifest
- verify parser stability across more source layouts

### Phase 4: Variant and Grouped Tables

- support `variant_column` tables such as `Large Creature - Weapon.pdf`
- support `grouped_variant` tables such as `Large Creature - Magic.pdf`
- add parser strategies for additional table families

### Phase 5: Conditional Branch Extraction

- split branch-heavy cells into `critical_branch`
- preserve the base cell text and branch text separately
- support branch conditions such as `with helmet` and `w/o leg greaves`

### Phase 6: Effect Normalization

- parse symbolic affix lines into normalized effects
- populate `critical_effect`
- gradually enrich prose-derived effects over time

### Phase 7: OCR and Manual Fallback

- support image-based PDFs such as `Void.pdf`
- route image-based sources through OCR or curated manual input
- keep the same post-extraction parsing contract where possible

## Current CLI

The tool uses `CommandLineParser` and currently exposes these verbs:

### `reset criticals`
Deletes importer-managed critical data from SQLite.

Use this when:

- you want to clear imported critical data
- you want to rerun a fresh import
- you need to verify the rebuild path from an empty critical-table state

Example:

```powershell
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- reset criticals
```

### `extract <slug>`

Resolves a table from the manifest and writes the extraction artifact to disk.

Example:

```powershell
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- extract slash
```

### `load <slug>`

Reads the extraction artifact, parses it, writes debug artifacts, validates the result, and loads SQLite if validation succeeds.

Example:

```powershell
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- load slash
```

### `import <slug>`

Runs extraction followed by load.

Example:

```powershell
dotnet run --project .\src\RolemasterDb.ImportTool\RolemasterDb.ImportTool.csproj -- import slash
```

## Manifest

The importer manifest is stored at:

- `sources/critical-import-manifest.json`

Each entry declares:

- `slug`
- `displayName`
- `family`
- `extractionMethod`
- `pdfPath`
- `enabled`

The manifest is intentionally the control point for enabling importer support one table at a time.

## Artifact Layout

Artifacts are written under:

- `artifacts/import/critical/<slug>/`

The current artifact set is:

### `source.xml`

The raw XML extraction output from `pdftohtml`.

Use this when:

- checking whether text is present in the PDF
- inspecting original `top` and `left` coordinates
- diagnosing row/column misassignment

### `fragments.json`

A normalized list of parsed text fragments with page and position metadata.

Use this when:

- comparing raw XML to the importer’s internal fragment model
- confirming that specific fragments were loaded correctly
- debugging Unicode or whitespace normalization issues

### `parsed-cells.json`

The reconstructed cells after geometry-based row/column assignment.

Use this when:

- validating a specific row and column
- checking whether a fragment was assigned to the correct cell
- confirming description and affix splitting

### `validation-report.json`

The validation result for the parsed table.
This includes:

- overall validity
- validation errors
- row count
- cell count

Use this when:

- a `load` command fails
- a parser change introduces ambiguity
- you need to confirm that the importer refused to write SQLite data

## Standard Table Parsing Strategy

The current `standard` parser is designed for tables shaped like `Slash.pdf`:

- columns: `A-E`
- rows: roll bands such as `01-05`, `71-75`, `100`
- cell contents: prose, symbolic affixes, and sometimes conditional branch lines

### Header Detection

The parser searches the XML fragments for a row containing exactly:

- `A`
- `B`
- `C`
- `D`
- `E`

Those positions define the standard-table column anchors.

### Row Detection

The parser searches the left margin below the header for roll-band labels, for example:

- `01-05`
- `66`
- `251+`

Those vertical positions define the row anchors.

### Row Bands

The parser derives each row’s vertical range from the midpoint between adjacent roll-band anchors. That prevents one row from drifting into the next when text wraps over multiple visual lines.

### Column Assignment

Each text fragment is assigned to the nearest column band based on horizontal center position. This is the core reliability improvement over the phase-1 text slicing approach.

### Line Reconstruction

Fragments inside a cell are grouped into lines by close `top` values and then ordered by `left`. This produces a stable line list even when PDF text is broken into multiple fragments.

### Description vs Affix Splitting

The parser classifies lines as:

- description-like prose
- affix-like notation

Affix-like lines include:

- `+...`
- symbolic lines using the critical glyphs
- branch-like affix lines such as `with leg greaves: +2H - ...`

The current implementation stores:

- `RawCellText`
- `DescriptionText`
- `RawAffixText`

It does not yet normalize branches or effects into separate tables.

## Validation Rules

The current validation pass is intentionally strict.
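The strictness amounts to a fail-closed check: every rule must pass before any database write. The following is an illustrative Python sketch of those checks with hypothetical names (`validate_standard_table`, the `cells` mapping); the real tool implements this in C#.

```python
def validate_standard_table(header_found, roll_bands, cells):
    """Collect validation errors for a parsed `standard` table.

    `cells` is a mapping of (roll_band, column) -> cell text.
    An empty error list is the only state that permits a SQLite load.
    """
    errors = []
    columns = ("A", "B", "C", "D", "E")

    # a detectable A-E header row must exist
    if not header_found:
        errors.append("no A-E header row detected")

    # roll-band labels must be found
    if not roll_bands:
        errors.append("no roll-band labels found")

    # each detected row must produce content for all five columns
    for band in roll_bands:
        for col in columns:
            if not cells.get((band, col)):
                errors.append(f"missing content for {band}/{col}")

    # total parsed cell count must match row_count * 5
    if len(cells) != len(roll_bands) * len(columns):
        errors.append("cell count does not match row_count * 5")

    return errors
```

A caller would write artifacts first, then load SQLite only when the returned list is empty.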
At minimum, a valid `standard` table must satisfy:

- a detectable `A-E` header row exists
- roll-band labels are found
- each detected row produces content for all five columns
- total parsed cell count matches `row_count * 5`

If validation fails:

- artifacts are still written
- the SQLite load is aborted
- the command returns an error

This design is deliberate. It is safer to reject an ambiguous extraction than to load a nearly-correct but wrong lookup table.

## Database Load Behavior

The loader is transactional. The current load path:

1. ensures the SQLite database exists
2. deletes the existing subtree for the targeted critical table
3. inserts:
   - `critical_table`
   - `critical_column`
   - `critical_roll_band`
   - `critical_result`
4. commits only after the full table is saved

This means importer iterations can target one table without resetting unrelated database content.

## Interaction With Web App Startup

The web application no longer auto-seeds critical starter data on startup. Startup still ensures the database exists and seeds attack starter data, but critical-table population is now owned by the importer.
This separation is important because:

- importer iterations are frequent
- parser logic is still evolving
- startup should not silently repopulate critical data behind the tool’s back

## Current Code Map

Important files in the current implementation:

- `src/RolemasterDb.ImportTool/Program.cs` - CLI entry point
- `src/RolemasterDb.ImportTool/CriticalImportCommandRunner.cs` - command orchestration
- `src/RolemasterDb.ImportTool/CriticalImportLoader.cs` - transactional SQLite load/reset behavior
- `src/RolemasterDb.ImportTool/CriticalImportManifestLoader.cs` - manifest loading
- `src/RolemasterDb.ImportTool/PdfXmlExtractor.cs` - XML extraction via `pdftohtml`
- `src/RolemasterDb.ImportTool/ImportArtifactWriter.cs` - artifact output
- `src/RolemasterDb.ImportTool/Parsing/StandardCriticalTableParser.cs` - standard table geometry parser
- `src/RolemasterDb.ImportTool/Parsing/XmlTextFragment.cs` - positioned text fragment model
- `src/RolemasterDb.ImportTool/Parsing/ParsedCriticalCellArtifact.cs` - debug cell artifact model
- `src/RolemasterDb.ImportTool/Parsing/ImportValidationReport.cs` - validation output model

## Adding a New Table

The recommended process for onboarding a new table is:

1. Add a manifest entry.
2. Run `extract <slug>`.
3. Inspect `source.xml`.
4. Run `load <slug>`.
5. Inspect `validation-report.json` and `parsed-cells.json`.
6. If validation succeeds, spot-check the SQLite output.
7. If validation fails, adjust the parser or add a family-specific parser strategy before retrying.

## Debugging Guidance

If a table imports incorrectly, inspect artifacts in this order:

1. `validation-report.json`
2. `parsed-cells.json`
3. `fragments.json`
4. `source.xml`

That order usually answers the key questions fastest:

- did validation fail?
- which row/column is wrong?
- were fragments assigned incorrectly?
- or was the extraction itself already malformed?

## Reliability Position

The current importer should be understood as:

- reliable enough for geometry-based `standard` table iteration
- much safer than the old flattened-text approach
- still evolving toward broader family coverage and deeper normalization

The key design rule going forward is:

- do not silently load ambiguous data

The importer should always prefer:

- preserving source fidelity
- writing review artifacts
- failing validation

over:

- guessing
- auto-correcting without evidence
- loading nearly-correct but structurally wrong critical results
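As one illustration of that rule, a geometry assignment step can refuse to guess when a fragment sits almost exactly between two column anchors, surfacing the ambiguity as a validation error instead of a silently wrong cell. This is an illustrative Python sketch with hypothetical names (`assign_column`, `ambiguity_ratio`), not the tool’s C# code.

```python
def assign_column(center_x, column_centers, ambiguity_ratio=0.8):
    """Assign a fragment's horizontal center to the nearest column anchor.

    Returns the column key, or None when the nearest and second-nearest
    anchors are almost equally close -- the caller treats None as a
    validation failure rather than loading a guessed cell.
    """
    # rank anchors by distance to the fragment's horizontal center
    ranked = sorted(column_centers.items(), key=lambda kv: abs(kv[1] - center_x))
    (best_col, best_x), (_, next_x) = ranked[0], ranked[1]
    best_d = abs(best_x - center_x)
    next_d = abs(next_x - center_x)
    if next_d > 0 and best_d / next_d > ambiguity_ratio:
        return None  # ambiguous: fail validation rather than guess
    return best_col
```

The threshold value here is arbitrary; the point is the shape of the contract: an assignment routine that can say "I don't know" feeds directly into the fail-closed validation pass.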