Phase 2.1 import

2026-03-14 01:44:30 +01:00
parent be5c0a9b54
commit 5c4d540246
4 changed files with 151 additions and 22 deletions
--- a/docs/critical_import_tool.md
+++ b/docs/critical_import_tool.md
@@ -32,6 +32,7 @@ The current implementation supports:
 - `standard` critical tables with columns `A-E`
 - XML-based extraction using `pdftohtml -xml`
 - geometry-based parsing for `Slash.pdf`
+- row-boundary repair for trailing affix leakage
 - transactional loading into SQLite

 The current implementation does not yet support:
@@ -149,6 +150,44 @@ This phase fixed the original `Slash / A / 72` corruption. The same lookup now r

 The important change is not only that the current output is correct, but that the importer now fails fast on structural ambiguity instead of silently loading corrupted rows.

+## Phase 2.1: Boundary Hardening After Manual Validation
+
+After phase 2, a manual validation pass compared:
+
+- the rendered `Slash.pdf`
+- the extracted `source.xml`
+- the imported SQLite rows
+
+That review found a remaining defect around the `51-55` / `56-60` boundary:
+
+- `51-55` lost several affix lines
+- `56-60` gained leading affix lines from the previous row
+
+The root cause was the original row segmentation rule:
+
+- each line was assigned to a row strictly by which side of the midpoint between adjacent roll-label `top` values it fell on
+
+That rule breaks down for rows whose trailing affix block sits closer to the next row's label than to its own: those lines fall past the midpoint and land in the next row.
+
+### Phase 2.1 fix
+
+The parser was hardened in two ways:
+
+1. Leading affix leakage repair
+   - after the initial row assignment, if a cell in the next row starts with affix-like lines and then continues with prose, those leading affix lines are moved back to the previous row
+2. Better affix classification
+   - lines that merely start with a digit are no longer assumed to be affixes
+   - this prevents prose such as `25% chance your weapon is stuck...` from being misclassified
+
+### Phase 2.1 validation rules
+
+The importer now explicitly rejects cells that still look structurally wrong after repair:
+
+- a cell may not begin with affix-like lines before prose
+- a cell may not contain prose after affix-like lines
+
+This hardening closes a class of row-boundary bugs that row and cell counts alone could not detect: leaked affix lines change which cell the text lands in, not how many cells are produced.
+
 ## Planned Future Phases

 The current architecture is intended to support additional phases:
@@ -353,6 +392,14 @@ Fragments inside a cell are grouped into lines by close `top` values and then or

 This produces a stable line list even when PDF text is broken into multiple fragments.

+### Boundary Repair
+
+After the initial midpoint-based row assignment, the parser performs a repair step across adjacent rows in the same column.
+
+If the next row begins with affix-like lines and then continues with prose, those leading affix lines are treated as leaked trailing affixes from the previous row and moved back.
+
+This repair exists because some tables place affix lines close enough to the next row label that midpoint-only segmentation is not reliable.
+
 ### Description vs Affix Splitting

 The parser classifies lines as:
@@ -366,6 +413,8 @@ Affix-like lines include:
 - symbolic lines using the critical glyphs
 - branch-like affix lines such as `with leg greaves: +2H - ...`

+Affix-like classification is intentionally conservative. Numeric prose lines such as `25% chance...` are not treated as affixes unless they match a known affix-like notation pattern.
+
 The current implementation stores:

 - `RawCellText`
@@ -384,6 +433,8 @@ At minimum, a valid `standard` table must satisfy:
 - roll-band labels are found
 - each detected row produces content for all five columns
 - total parsed cell count matches `row_count * 5`
+- no cell begins with affix-like lines before prose
+- no cell contains prose after affix-like lines

 If validation fails: