xTr1m/RolemasterDB

Frank Tovar 768fffcf7d Update phase 7 OCR planning docs

2026-03-18 02:24:31 +01:00

Critical Tables DB Model

What the PDFs look like

The PDFs are not one uniform table shape. I found three families:

Standard tables
- Columns are severity-like keys such as A through E.
- Rows are roll bands such as 01-05, 66, 96-99, or 100.
- Examples: Slash.pdf, Puncture.pdf, Arcane Aether.pdf.
Variant-column tables
- Columns are not severity letters; they are variant keys such as normal, magic, mithril, holy arms, slaying.
- Rows are still roll bands.
- Example: Large Creature - Weapon.pdf.
Grouped variant tables
- There is an extra grouping axis above the column axis.
- Example: Large Creature - Magic.pdf has:
  - group: large, super_large
  - column: normal, slaying
  - row: roll band
- In the current importer manifest, the grouped magic PDF is loaded once as large_creature_magic because the Large Creature - Magic.pdf and Super Large Creature - Magic.pdf source files are duplicates.

There are also extraction constraints:

Most PDFs are text extractable with pdftohtml -xml.
Void.pdf appears to be image-based and will need an OCR bootstrap pass; the existing curation flow then handles cleanup.
A single cell can contain:
- base description text
- symbolic affixes such as +5H - 2S - 3B
- conditional branches such as with helmet, w/o leg greaves, if foe has shield

Because of that, the safest model is hybrid:

relational tables for lookup axes and indexed effects
raw text storage for fidelity
structured JSON for irregular branches that are hard to normalize perfectly on first pass

Recommended logical model

1. `critical_table`

One record per PDF/table. This record is the primary "critical type" used for lookup.

Examples:

slash
puncture
arcane_aether
large_creature_weapon
large_creature_magic

2. `critical_group`

Optional extra axis for tables that need more than type + column + roll.

Examples:

large
super_large

Most tables will have no group rows.

3. `critical_column`

Generalized "severity/column" axis.

Examples:

A, B, C, D, E
normal, magic, mithril, holy_arms, slaying

Do not hardcode this as a single severity enum. Treat it as a table-defined dimension.

4. `critical_roll_band`

Stores row bands and supports exact row lookup by roll.

Examples:

01-05
66
96-99
251+

Recommended fields:

min_roll
max_roll nullable for open-ended rows like 251+
display label
sort order
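
As a minimal sketch of how those fields could be derived and used, the following parses a display label into `min_roll` / `max_roll` and finds the band for a numeric roll. The names `RollBand`, `parse_roll_band`, and `find_band` are hypothetical and not taken from the importer, and the sketch assumes labels use an ASCII hyphen.

```python
import re
from typing import NamedTuple, Optional, Sequence


class RollBand(NamedTuple):
    label: str               # display label, e.g. "01-05"
    min_roll: int
    max_roll: Optional[int]  # None marks an open-ended band such as "251+"


def parse_roll_band(label: str) -> RollBand:
    """Parse labels such as '01-05', '66', '100', or '251+'."""
    text = label.strip()
    ranged = re.fullmatch(r"(\d+)\s*-\s*(\d+)", text)
    if ranged:
        low, high = int(ranged.group(1)), int(ranged.group(2))
        if low > high:
            raise ValueError(f"band bounds are reversed: {label!r}")
        return RollBand(text, low, high)
    open_ended = re.fullmatch(r"(\d+)\+", text)
    if open_ended:
        return RollBand(text, int(open_ended.group(1)), None)
    if re.fullmatch(r"\d+", text):
        value = int(text)
        return RollBand(text, value, value)  # single-value row such as "66"
    raise ValueError(f"unrecognized roll band label: {label!r}")


def find_band(bands: Sequence[RollBand], roll: int) -> Optional[RollBand]:
    """Return the first band containing the roll, or None if no band matches."""
    for band in bands:
        if roll >= band.min_roll and (band.max_roll is None or roll <= band.max_roll):
            return band
    return None
```

Storing the bounds as integers keeps the lookup a plain range comparison, while the label is kept only for display.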

5. `critical_result`

One record per lookup cell:

table
optional group
column
roll band

This stores:

is_curated
raw_cell_text
description_text
raw_affix_text
parsed_json
parse_status
source_page_number
source_image_path
source_image_crop

is_curated is an explicit workflow flag. Once a result is curated in the web editor, later importer runs must preserve curator-owned content instead of replacing the row wholesale.

The source-image fields keep importer provenance separate from the editor snapshot stored in parsed_json:

source_page_number points to the rendered PDF page used for review
source_image_path stores the importer-managed relative PNG path for the cell crop
source_image_crop stores the crop geometry that produced the PNG and can be used for debugging alignment problems
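
The re-import rule can be sketched roughly as follows. This is an illustration under stated assumptions, not the importer's actual code: the `CriticalResult` dataclass and `merge_import` function are hypothetical, `parsed_json` is omitted for brevity, and the sketch assumes that only the three source-image fields are importer-owned once a row is curated.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class CriticalResult:
    # Field names follow the list above; the class itself is hypothetical.
    is_curated: bool
    raw_cell_text: str
    description_text: str
    raw_affix_text: Optional[str]
    parse_status: str
    source_page_number: Optional[int]
    source_image_path: Optional[str]
    source_image_crop: Optional[dict]


def merge_import(existing: Optional[CriticalResult],
                 incoming: CriticalResult) -> CriticalResult:
    """Decide what to store when the importer re-processes a cell."""
    if existing is None or not existing.is_curated:
        # Nothing curated yet: the fresh import replaces the row wholesale.
        return incoming
    # Curated row: keep curator-owned content, refresh provenance only.
    return replace(
        existing,
        source_page_number=incoming.source_page_number,
        source_image_path=incoming.source_image_path,
        source_image_crop=incoming.source_image_crop,
    )
```

Keeping the decision in one function makes the ownership boundary between importer and curator explicit and easy to test.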

6. `critical_branch`

Optional conditional branches inside a result cell.

Examples:

with helmet
without helmet
with leg greaves
if foe has shield

Each branch can carry:

condition_text
optional structured condition_json
branch description text
branch raw affix text
parsed JSON

Current implementation note:

critical_branch is now populated by the importer and returned by the web critical lookup
condition keys are normalized for lookup/API use, while the original condition text remains available for display
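
One possible way to derive a lookup key from the display text is sketched below. The function name and the exact rules (lowercasing, expanding "w/o", collapsing punctuation to underscores) are assumptions for illustration; the importer's real normalization may differ.

```python
import re


def normalize_condition_key(condition_text: str) -> str:
    """Turn display text such as 'w/o leg greaves' into a stable key."""
    text = condition_text.strip().lower()
    # Expand the common abbreviation before stripping punctuation.
    text = re.sub(r"\bw/o\b", "without", text)
    # Collapse every run of non-alphanumeric characters into one underscore.
    text = re.sub(r"[^a-z0-9]+", "_", text)
    return text.strip("_")
```

The original `condition_text` is stored unchanged next to the key, so the display string never has to be reconstructed from the normalized form.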

7. `critical_effect`

Normalized machine-readable effects parsed from the symbol line and, over time, from prose.

Recommended canonical effect_code values:

direct_hits
must_parry_rounds
no_parry_rounds
stunned_rounds
bleed_per_round
foe_penalty
attacker_bonus_next_round
power_point_modifier
initiative_gain
initiative_loss
drop_item
item_breakage_check
limb_useless
knockdown
prone
coma
paralyzed
blind
deaf
mute
dies_in_rounds
instant_death
armor_destroyed
weapon_stuck

Each effect should point to either:

the base critical_result, or
a critical_branch

This lets you keep the raw text but still filter/query on effects.
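
The "either the base result or a branch" rule can be enforced at the database level. The sketch below uses an in-memory SQLite database purely for illustration; the project's actual database engine, the foreign-key column names, and the `value_integer` column are assumptions, and the helper functions are hypothetical.

```python
import sqlite3

# REFERENCES clauses document intent here; SQLite enforces them only
# when PRAGMA foreign_keys = ON is set on the connection.
SCHEMA = """
CREATE TABLE critical_result (id INTEGER PRIMARY KEY);
CREATE TABLE critical_branch (
    id INTEGER PRIMARY KEY,
    critical_result_id INTEGER NOT NULL REFERENCES critical_result(id)
);
CREATE TABLE critical_effect (
    id INTEGER PRIMARY KEY,
    critical_result_id INTEGER REFERENCES critical_result(id),
    critical_branch_id INTEGER REFERENCES critical_branch(id),
    effect_code TEXT NOT NULL,
    value_integer INTEGER,
    -- exactly one of the two parent columns must be set
    CHECK ((critical_result_id IS NULL) <> (critical_branch_id IS NULL))
);
"""


def open_demo_db() -> sqlite3.Connection:
    """Create the sketch schema with one result (id 1) and one branch (id 10)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO critical_result (id) VALUES (1)")
    conn.execute("INSERT INTO critical_branch (id, critical_result_id) VALUES (10, 1)")
    return conn


def add_effect(conn, effect_code, result_id=None, branch_id=None, value=None):
    """Insert an effect; raises sqlite3.IntegrityError if the parent rule is violated."""
    conn.execute(
        "INSERT INTO critical_effect "
        "(critical_result_id, critical_branch_id, effect_code, value_integer) "
        "VALUES (?, ?, ?, ?)",
        (result_id, branch_id, effect_code, value),
    )
```

With this constraint, an effect with no parent or with both parents is rejected at insert time, so queries such as "all results with `bleed_per_round`" never have to guess which parent applies.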

Current implementation note:

symbol-driven affixes are now normalized for both base results and conditional branch affixes
value_expression is used when the affix contains a formula instead of a flat integer, which is currently needed for Mana power-point adjustments such as +(2d10-18)P
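
A rough sketch of the flat-integer versus formula split is shown below. It only tokenizes the affix line; mapping each symbol to an `effect_code` is left out because the symbol table is not specified here. The names `AffixToken` and `tokenize_affixes` are hypothetical, and the sketch assumes affixes are separated by a hyphen surrounded by whitespace, as in `+5H - 2S - 3B`.

```python
import re
from typing import List, NamedTuple, Optional


class AffixToken(NamedTuple):
    raw: str
    symbol: Optional[str]            # trailing letters, e.g. "H" or "P"
    value: Optional[int]             # set for flat integers
    value_expression: Optional[str]  # set for formulas such as "2d10-18"


_FLAT = re.compile(r"([+-]?)(\d+)([A-Za-z]+)")
_FORMULA = re.compile(r"([+-]?)\(([^()]+)\)([A-Za-z]+)")


def tokenize_affixes(raw_affix_text: str) -> List[AffixToken]:
    """Split an affix line into tokens without interpreting the symbols."""
    tokens = []
    # A hyphen inside a formula has no surrounding spaces, so it is not split.
    for part in re.split(r"\s+-\s+", raw_affix_text.strip()):
        if not part:
            continue
        flat = _FLAT.fullmatch(part)
        formula = _FORMULA.fullmatch(part)
        if flat:
            sign, digits, symbol = flat.groups()
            value = -int(digits) if sign == "-" else int(digits)
            tokens.append(AffixToken(part, symbol, value, None))
        elif formula:
            sign, expr, symbol = formula.groups()
            expression = f"-({expr})" if sign == "-" else expr
            tokens.append(AffixToken(part, symbol, None, expression))
        else:
            # Unrecognized pieces are kept verbatim for manual curation.
            tokens.append(AffixToken(part, None, None, None))
    return tokens
```

Tokens that match neither pattern are passed through with only `raw` set, which fits the rule of never discarding the original text.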

Why this works for your lookup

Your lookup target is mostly:

critical type
severity (column)
roll

That maps cleanly to:

critical_table.slug
critical_column.column_key
numeric roll matched against critical_roll_band

For the outlier tables, add an optional group_key.

That means the API can still stay simple:

{
  "critical_type": "slash",
  "column": "C",
  "roll": 38,
  "group": null
}

or:

{
  "critical_type": "large_creature_magic",
  "group": "super_large",
  "column": "slaying",
  "roll": 88
}
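
To illustrate how such a request could resolve to a single cell, here is a sketch using an in-memory SQLite database. The schema, foreign-key column names, and helper functions are assumptions made for the example, and the grouped row uses placeholder data (band `86-90`, text `placeholder`) rather than real table content.

```python
import sqlite3

SCHEMA = """
CREATE TABLE critical_table (id INTEGER PRIMARY KEY, slug TEXT UNIQUE NOT NULL);
CREATE TABLE critical_group (id INTEGER PRIMARY KEY,
    critical_table_id INTEGER NOT NULL, group_key TEXT NOT NULL);
CREATE TABLE critical_column (id INTEGER PRIMARY KEY,
    critical_table_id INTEGER NOT NULL, column_key TEXT NOT NULL);
CREATE TABLE critical_roll_band (id INTEGER PRIMARY KEY,
    critical_table_id INTEGER NOT NULL, label TEXT NOT NULL,
    min_roll INTEGER NOT NULL, max_roll INTEGER);
CREATE TABLE critical_result (id INTEGER PRIMARY KEY,
    critical_table_id INTEGER NOT NULL, critical_group_id INTEGER,
    critical_column_id INTEGER NOT NULL, critical_roll_band_id INTEGER NOT NULL,
    description_text TEXT);
"""

# "IS" is SQLite's null-safe comparison, so a NULL group parameter
# matches only results that have no group.
LOOKUP_SQL = """
SELECT r.id, b.label, r.description_text
FROM critical_result r
JOIN critical_table t ON t.id = r.critical_table_id
JOIN critical_column c ON c.id = r.critical_column_id
JOIN critical_roll_band b ON b.id = r.critical_roll_band_id
LEFT JOIN critical_group g ON g.id = r.critical_group_id
WHERE t.slug = :critical_type
  AND c.column_key = :column_key
  AND b.min_roll <= :roll
  AND (b.max_roll IS NULL OR b.max_roll >= :roll)
  AND g.group_key IS :group_key
"""


def open_demo_db() -> sqlite3.Connection:
    """Seed one ungrouped cell and one grouped placeholder cell."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO critical_table VALUES (1, 'slash')")
    conn.execute("INSERT INTO critical_column VALUES (1, 1, 'B')")
    conn.execute("INSERT INTO critical_roll_band VALUES (1, 1, '36-45', 36, 45)")
    conn.execute("INSERT INTO critical_result VALUES (1, 1, NULL, 1, 1, 'Strike foe in shin.')")
    # Placeholder rows for the grouped case; not real table content.
    conn.execute("INSERT INTO critical_table VALUES (2, 'large_creature_magic')")
    conn.execute("INSERT INTO critical_group VALUES (1, 2, 'super_large')")
    conn.execute("INSERT INTO critical_column VALUES (2, 2, 'slaying')")
    conn.execute("INSERT INTO critical_roll_band VALUES (2, 2, '86-90', 86, 90)")
    conn.execute("INSERT INTO critical_result VALUES (2, 2, 1, 2, 2, 'placeholder')")
    return conn


def lookup(conn, critical_type, column, roll, group=None):
    """Return (result id, band label, description) or None if no cell matches."""
    params = {"critical_type": critical_type, "column_key": column,
              "roll": roll, "group_key": group}
    return conn.execute(LOOKUP_SQL, params).fetchone()
```

For example, `lookup(open_demo_db(), "slash", "B", 38)` resolves to the seeded `36-45` cell, while a grouped table queried without its group finds nothing.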

Example return object

This is close to the current lookup shape, while still leaving room for future critical_effect normalization:

{
  "critical_type": "slash",
  "table_name": "Slash Critical Strike Table",
  "group": null,
  "column": "B",
  "column_label": "B",
  "column_role": "severity",
  "roll": 38,
  "roll_band": "36-45",
  "roll_band_min": 36,
  "roll_band_max": 45,
  "description": "Strike foe in shin.",
  "raw_affix_text": null,
  "branches": [
    {
      "branch_kind": "conditional",
      "condition_key": "with_leg_greaves",
      "condition_text": "with leg greaves",
      "description": "",
      "raw_affix_text": "+2H - must_parry",
      "sort_order": 1
    },
    {
      "branch_kind": "conditional",
      "condition_key": "without_leg_greaves",
      "condition_text": "w/o leg greaves",
      "description": "You slash open foe's shin.",
      "raw_affix_text": "+2H - bleed",
      "sort_order": 2
    }
  ],
  "raw_cell_text": "Original full cell text as extracted from the PDF",
  "source": {
    "pdf": "Slash.pdf",
    "extraction_method": "xml"
  }
}

Ingestion notes

Current import flow:

Create critical_table, critical_group, critical_column, and critical_roll_band from each PDF's visible axes.
Store each base cell in critical_result with base raw/description/affix text.
Split explicit conditional branches into critical_branch.
Parse symbolic affixes for both the base result and any branch affix payloads into critical_effect.
Return the base result plus ordered branches and parsed affix effects through the web critical lookup.
Gradually enrich prose-derived effects such as death, blindness, paralysis, limb loss, initiative changes, and item breakage.
Route image PDFs like Void.pdf through OCR bootstrap before the same downstream parser and curation flow.

The important design decision is: never throw away the original text. The prose is too irregular to rely on normalized fields alone.

Manual curation workflow

Because the import path depends on OCR, PDF XML extraction, and heuristics, the web app now treats manual repair as a first-class capability instead of an out-of-band database operation.

Current curation flow:

Browse a table on the /tables page.
Hover a populated cell to identify editable entries.
Open the popup editor for that cell.
Edit the entire critical_result graph:
- base raw cell text
- curated prose / description
- raw affix text
- curated state
- parse status
- parsed JSON
- nested critical_branch rows
- nested critical_effect rows for both the base result and branches
Save the result back through the API.

The corresponding API endpoints are:

GET /api/tables/critical/{slug}/cells/{resultId}
GET /api/tables/critical/{slug}/cells/{resultId}/source-image
PUT /api/tables/critical/{slug}/cells/{resultId}

The save operation replaces the stored branches and effects for that cell with the submitted payload and updates the explicit curated flag. Importer-managed source provenance can still be refreshed on later imports without overwriting curated content.
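
The replace-not-merge behavior of the save can be sketched as below. This is a simplified in-memory illustration, not the API's implementation: the dictionary shape, the `branches` / `effects` keys, and the `apply_save` function are assumptions, and the sketch assumes the payload always carries an explicit `is_curated` value.

```python
import copy

# Fields the importer owns; an editor save never overwrites them in this sketch.
IMPORTER_OWNED = ("source_page_number", "source_image_path", "source_image_crop")


def apply_save(stored: dict, payload: dict) -> dict:
    """Return the new stored state for a cell after an editor save."""
    updated = copy.deepcopy(stored)
    for key, value in payload.items():
        if key not in IMPORTER_OWNED:
            updated[key] = copy.deepcopy(value)
    # Branches and effects are replaced wholesale, not merged, so anything
    # missing from the payload is removed from the stored cell.
    updated["branches"] = copy.deepcopy(payload.get("branches", []))
    updated["effects"] = copy.deepcopy(payload.get("effects", []))
    updated["is_curated"] = bool(payload["is_curated"])
    return updated
```

Replacing the nested rows wholesale keeps the editor payload the single source of truth for a curated cell, at the cost of requiring the client to send the full graph on every save.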
