Fix critical importer row and column boundary parsing

2026-03-14 14:34:27 +01:00
parent eb7de020b1
commit 28587fc6df
7 changed files with 302 additions and 32 deletions

@@ -238,9 +238,10 @@ The currently enabled phase-3 table set is:
Current phase-3 notes:
- header detection now tolerates minor `top` misalignment across the `A-E` header glyphs
- first-row body parsing can now begin slightly above the first roll-band label when the PDF places prose between the header row and the label, which prevents clipped `01-05` cells in files such as `Mana.pdf`
- row boundaries can snap to the last affix-to-prose transition between adjacent roll labels when midpoint slicing would leak into the next row
- affix symbols are learned from the footer legend before body parsing, so symbol-only affix fragments are classified correctly
- affix fragments that cross a column boundary in the XML can be split on hard internal spacing before column assignment, which is required for `Mana.pdf`
- cross-column text fragments can now be split at geometry-aligned whitespace boundaries before column assignment, while affix fragments still split on hard internal spacing
- footer page numbers are filtered out before body parsing
- validation allows a single contiguous affix block either before or after prose
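
As an illustration of the cross-column splitting note above, here is a minimal sketch of cutting a text fragment at whitespace that lines up with a column boundary. It is not the importer's actual code: the `Fragment` type, the `split_at_column_boundaries` function, the `tolerance` parameter, and the uniform-character-width estimate are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical text fragment with its horizontal extent in PDF units."""
    text: str
    left: float
    width: float


def split_at_column_boundaries(fragment, boundaries, tolerance=6.0):
    """Split a fragment at whitespace whose estimated x position is near a boundary.

    Assumes every character has the same width, which is only an approximation.
    """
    if not fragment.text:
        return [fragment]
    char_width = fragment.width / len(fragment.text)

    # For each boundary, pick the closest whitespace character within tolerance.
    cuts = []
    for boundary in sorted(boundaries):
        best = None
        for index, char in enumerate(fragment.text):
            if not char.isspace():
                continue
            x = fragment.left + (index + 0.5) * char_width
            distance = abs(x - boundary)
            if distance <= tolerance and (best is None or distance < best[0]):
                best = (distance, index)
        if best is not None and best[1] not in cuts:
            cuts.append(best[1])

    # Cut the text at the chosen indices and trim surrounding whitespace.
    pieces = []
    start = 0
    for cut in sorted(cuts) + [len(fragment.text)]:
        chunk = fragment.text[start:cut]
        stripped = chunk.strip()
        if stripped:
            offset = start + (len(chunk) - len(chunk.lstrip()))
            pieces.append(
                Fragment(stripped, fragment.left + offset * char_width,
                         len(stripped) * char_width)
            )
        start = cut + 1
    return pieces
```

With a fragment such as `Fragment("Lose 5 PP  Stunned", 100.0, 180.0)` and a boundary at `200.0`, the sketch yields two pieces, one per column. If no whitespace falls within `tolerance` of a boundary, the fragment is returned unsplit, which leaves room for a separate rule such as the hard-internal-spacing split used for affix fragments.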