Fix critical importer row and column boundary parsing
@@ -238,9 +238,10 @@ The currently enabled phase-3 table set is:
Current phase-3 notes:
- header detection now tolerates minor `top` misalignment across the `A-E` header glyphs
- first-row body parsing can now begin slightly above the first roll-band label when the PDF places prose between the header row and the label, which prevents clipped `01-05` cells such as those in `Mana.pdf`
- row boundaries can snap to the last affix-to-prose transition between adjacent roll labels when midpoint slicing would leak into the next row
- affix symbols are learned from the footer legend before body parsing, so symbol-only affix fragments are classified correctly
- affix fragments that cross a column boundary in the XML can be split on hard internal spacing before column assignment, which is required for `Mana.pdf`
- cross-column text fragments can now be split at geometry-aligned whitespace boundaries before column assignment, while affix fragments still split on hard internal spacing
- footer page numbers are filtered out before body parsing
- validation allows a single contiguous affix block either before or after prose
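
The note about splitting cross-column affix fragments on hard internal spacing can be illustrated with a minimal sketch. It is not the importer's actual code: the `Fragment` type, the function names, the uniform per-character width estimate, and the reading of "hard internal spacing" as a run of two or more spaces are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical text fragment as it might come out of the PDF XML."""
    text: str
    left: float   # x position of the fragment's left edge
    width: float  # rendered width of the whole fragment


# Assumption: "hard internal spacing" means a run of two or more spaces.
HARD_GAP = re.compile(r" {2,}")


def split_on_hard_spacing(frag):
    """Split a fragment at hard gaps, estimating each piece's x-extent
    from a uniform per-character width."""
    if not frag.text:
        return [frag]
    char_w = frag.width / len(frag.text)
    pieces = []
    start = 0
    for match in HARD_GAP.finditer(frag.text):
        if match.start() > start:
            pieces.append(Fragment(
                frag.text[start:match.start()],
                frag.left + start * char_w,
                (match.start() - start) * char_w,
            ))
        start = match.end()
    if start < len(frag.text):
        pieces.append(Fragment(
            frag.text[start:],
            frag.left + start * char_w,
            (len(frag.text) - start) * char_w,
        ))
    # A fragment made only of spaces yields no pieces; return it unchanged.
    return pieces or [frag]


def assign_column(frag, column_lefts):
    """Index of the last column whose left edge is at or before the fragment's
    left edge. column_lefts must be sorted in ascending order."""
    index = 0
    for i, col_left in enumerate(column_lefts):
        if frag.left >= col_left:
            index = i
    return index
```

With made-up geometry, a fragment such as `Fragment("+2 Mana    +1 Will", left=100.0, width=180.0)` splits into two pieces that land in adjacent columns when the column left edges are `[100.0, 200.0, 300.0]`. Splitting before column assignment matters because the unsplit fragment would be assigned wholesale to the column containing its left edge.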