Phase 2.1 import

2026-03-14 01:44:30 +01:00
parent be5c0a9b54
commit 5c4d540246
4 changed files with 151 additions and 22 deletions

@@ -32,6 +32,7 @@ The current implementation supports:
- `standard` critical tables with columns `A-E`
- XML-based extraction using `pdftohtml -xml`
- geometry-based parsing for `Slash.pdf`
- row-boundary repair for trailing affix leakage
- transactional loading into SQLite
The current implementation does not yet support:
@@ -149,6 +150,44 @@ This phase fixed the original `Slash / A / 72` corruption. The same lookup now r
The important change is not only that the current output is correct, but that the importer now fails fast on structural ambiguity instead of silently loading corrupted rows.
## Phase 2.1: Boundary Hardening After Manual Validation
After phase 2, a manual validation pass compared:
- the rendered `Slash.pdf`
- the extracted `source.xml`
- the imported SQLite rows
That review found a remaining defect around the `51-55` / `56-60` boundary:
- `51-55` lost several affix lines
- `56-60` gained leading affix lines from the previous row
The root cause was the original row segmentation rule:
- rows were assigned strictly by the midpoint between adjacent roll-label `top` values
That rule breaks down for rows whose affix block sits visually near the next row label: any affix line drawn past the midpoint is assigned to the next row.
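To make the failure mode concrete, here is a minimal sketch of midpoint-only segmentation. The function names and the coordinates are hypothetical and are not taken from the importer; only the idea of cutting halfway between adjacent roll-label `top` values comes from the description above.

```python
from bisect import bisect_right

def midpoint_boundaries(label_tops):
    """Return the cut points halfway between adjacent roll-label tops."""
    tops = sorted(label_tops)
    return [(a + b) / 2 for a, b in zip(tops, tops[1:])]

def assign_row(fragment_top, boundaries):
    """Return the row index a fragment falls into under midpoint-only segmentation."""
    return bisect_right(boundaries, fragment_top)

# Hypothetical layout: two roll labels at top=100 and top=200.
boundaries = midpoint_boundaries([100, 200])

# A prose line just below the first label stays in row 0.
prose_row = assign_row(110, boundaries)

# A trailing affix of row 0 drawn at top=160 lies past the midpoint (150),
# so it is assigned to row 1 even though it belongs to row 0.
leaked_row = assign_row(160, boundaries)
```

In this sketch the second fragment lands in the wrong row purely because of its vertical position, which is the kind of leakage seen at the `51-55` / `56-60` boundary.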
### Phase 2.1 fix
The parser was hardened in two ways:
1. Leading affix leakage repair
- after the initial row assignment, if a cell in the next row starts with affix-like lines and then continues with prose, those leading affix lines are moved back to the previous row
2. Better affix classification
- generic digit-starting lines are no longer assumed to be affixes
- this prevents prose such as `25% chance your weapon is stuck...` from being misclassified
### Phase 2.1 validation rules
The importer now explicitly rejects cells that still look structurally wrong after repair:
- a cell may not begin with affix-like lines before prose
- a cell may not contain prose after affix lines
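The two rules above could be checked roughly as follows. This is a sketch under assumptions: `validate_cell` and the `is_affix` predicate are hypothetical names, and an all-affix cell is assumed to be acceptable because it contains no prose.

```python
from typing import Callable, List

def validate_cell(lines: List[str], is_affix: Callable[[str], bool]) -> List[str]:
    """Return a list of structural errors for one cell's ordered lines."""
    errors = []
    flags = [is_affix(line) for line in lines]

    # Rule 1: the cell starts with affix-like lines but also contains prose.
    if flags and flags[0] and not all(flags):
        errors.append("cell begins with affix-like lines before prose")

    # Rule 2: the cell starts with prose, but prose reappears after the
    # first affix-like line.
    first_affix = flags.index(True) if True in flags else None
    if first_affix is not None and first_affix > 0 and not all(flags[first_affix:]):
        errors.append("cell contains prose after affix lines")

    return errors

# Hypothetical predicate for illustration only: a line starting with "+".
starts_with_plus = lambda line: line.startswith("+")

ok = validate_cell(["Blow to arm.", "+5H"], starts_with_plus)
leading = validate_cell(["+5H", "Blow to arm."], starts_with_plus)
trailing = validate_cell(["Blow.", "+5H", "More prose."], starts_with_plus)
```

Returning a list of messages instead of raising on the first problem is a design choice for the sketch; a fail-fast importer could just as well raise as soon as the list is non-empty.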
This hardening step matters because it closes a class of row-boundary bugs that simple row and cell counts cannot detect.
## Planned Future Phases
The current architecture is intended to support additional phases:
@@ -353,6 +392,14 @@ Fragments inside a cell are grouped into lines by close `top` values and then or
This produces a stable line list even when PDF text is broken into multiple fragments.
### Boundary Repair
After the initial midpoint-based row assignment, the parser performs a repair step across adjacent rows in the same column.
If the next row begins with affix-like lines and then continues with prose, those leading affix lines are treated as leaked trailing affixes from the previous row and moved back.
This repair exists because some tables place affix lines close enough to the next row label that midpoint-only segmentation is not reliable.
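The repair step can be sketched as below. It assumes a column is represented as a list of cells, each cell being an ordered list of line strings, and that an `is_affix` predicate is supplied by the caller; both the representation and the name `repair_leading_affixes` are illustrative assumptions, not the importer's actual API.

```python
from typing import Callable, List

def repair_leading_affixes(
    column: List[List[str]], is_affix: Callable[[str], bool]
) -> List[List[str]]:
    """Move leaked leading affix lines back to the previous row's cell."""
    repaired = [list(cell) for cell in column]  # do not mutate the input
    for prev, nxt in zip(repaired, repaired[1:]):
        # Count the run of affix-like lines at the start of the next cell.
        lead = 0
        while lead < len(nxt) and is_affix(nxt[lead]):
            lead += 1
        # Repair only when the affix run is followed by prose. An all-affix
        # cell is left alone so validation can inspect it as-is.
        if 0 < lead < len(nxt):
            prev.extend(nxt[:lead])
            del nxt[:lead]
    return repaired

# Hypothetical predicate and data for illustration only.
starts_with_plus = lambda line: line.startswith("+")
column = [
    ["Blow to arm.", "+5H"],
    ["+2H", "Strike to leg.", "+3H"],  # "+2H" leaked from the row above
]
fixed = repair_leading_affixes(column, starts_with_plus)
```

Because the loop walks adjacent pairs from top to bottom, each cell has its own leading affixes removed before it is considered as the previous row for the cell below it.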
### Description vs Affix Splitting
The parser classifies lines as:
@@ -366,6 +413,8 @@ Affix-like lines include:
- symbolic lines using the critical glyphs
- branch-like affix lines such as `with leg greaves: +2H - ...`
Affix-like classification is intentionally conservative. Numeric prose lines such as `25% chance...` are treated as affixes only when they match a known affix notation pattern.
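A conservative classifier along these lines might look like the sketch below. The regular expressions are assumptions made for illustration: the source does not show the importer's real pattern set, and symbolic glyph lines are left out because the glyph alphabet is not given here.

```python
import re

# Assumed notation patterns (hypothetical, not the importer's actual list).
AFFIX_PATTERNS = [
    # Signed numeric notation such as "+5H".
    re.compile(r"^[+-]\d+H?\b"),
    # Branch-like lines such as "with leg greaves: +2H - ...".
    re.compile(r"^(with|without)\b[^:]*:\s*[+-]\d"),
]

def is_affix_like(line: str) -> bool:
    """Return True only when the line matches an explicit affix pattern."""
    text = line.strip()
    return any(pattern.match(text) for pattern in AFFIX_PATTERNS)

plain_affix = is_affix_like("+5H")
branch_affix = is_affix_like("with leg greaves: +2H - ...")
numeric_prose = is_affix_like("25% chance your weapon is stuck...")
```

The key property is that a leading digit alone is never sufficient: a line has to match one of the explicit patterns, so numeric prose falls through to the prose category by default.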
The current implementation stores:
- `RawCellText`
@@ -384,6 +433,8 @@ At minimum, a valid `standard` table must satisfy:
- roll-band labels are found
- each detected row produces content for all five columns
- total parsed cell count matches `row_count * 5`
- no cell begins with affix-like lines before prose
- no cell contains prose after affix lines
If validation fails: