Phase 2.1 import
@@ -14,4 +14,5 @@ Also see the other related technical documentation in the docs folder.
 - When asked to begin working on a task, create a detailed implementation plan first, present the plan to the user, and ask for approval before beginning with the actual implementation.
 - When a task is finished, perform a code review to evaluate if the change is clean and maintainable with high software engineering standards. Iterate on the code and repeat the review process until satisfied.
 - After the implementation is finished, verify all changed files, and run `python D:\Code\crlf.py $file1 $file2 ...` only for files you recognize, in order to normalize all line endings of all touched files to CRLF.
+- If there's documentation present, always keep it updated.
 - At the end perform a git commit with a one-liner summary.
@@ -32,6 +32,7 @@ The current implementation supports:
 - `standard` critical tables with columns `A-E`
 - XML-based extraction using `pdftohtml -xml`
 - geometry-based parsing for `Slash.pdf`
+- row-boundary repair for trailing affix leakage
 - transactional loading into SQLite

 The current implementation does not yet support:
@@ -149,6 +150,44 @@ This phase fixed the original `Slash / A / 72` corruption. The same lookup now r

 The important change is not only that the current output is correct, but that the importer now fails fast on structural ambiguity instead of silently loading corrupted rows.

+## Phase 2.1: Boundary Hardening After Manual Validation
+
+After phase 2, a manual validation pass compared:
+
+- the rendered `Slash.pdf`
+- the extracted `source.xml`
+- the imported SQLite rows
+
+That review found a remaining defect around the `51-55` / `56-60` boundary:
+
+- `51-55` lost several affix lines
+- `56-60` gained leading affix lines from the previous row
+
+The root cause was the original row segmentation rule:
+
+- rows were assigned strictly by the midpoint between adjacent roll-label `top` values
+
+That rule was too naive for rows whose affix block sits visually near the next row label.
+
+### Phase 2.1 fix
+
+The parser was hardened in two ways:
+
+1. Leading affix leakage repair
+   - after the initial row assignment, if a cell in the next row starts with affix-like lines and then continues with prose, those leading affix lines are moved back to the previous row
+2. Better affix classification
+   - generic digit-starting lines are no longer assumed to be affixes
+   - this prevents prose such as `25% chance your weapon is stuck...` from being misclassified
+
+### Phase 2.1 validation rules
+
+The importer now explicitly rejects cells that still look structurally wrong after repair:
+
+- a cell may not begin with affix-like lines before prose
+- a cell may not contain prose after affix lines
+
+This hardening step is important because it closed a class of row-boundary bugs that simple row/cell counts could not detect.
+
 ## Planned Future Phases

 The current architecture is intended to support additional phases:
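To make the root cause named in the Phase 2.1 notes concrete, here is a minimal sketch of midpoint-only row assignment. It is an illustration under assumptions, not the importer's code: the class name `MidpointSegmentationSketch`, the method `AssignRow`, and the `top` values in the note below are all hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class MidpointSegmentationSketch
{
    // Hypothetical helper: picks the owning row for a fragment using only the
    // midpoints between adjacent row-label tops. rowTops must be sorted ascending.
    public static int AssignRow(IReadOnlyList<int> rowTops, int fragmentTop)
    {
        for (var i = 0; i < rowTops.Count - 1; i++)
        {
            // Everything above the midpoint between label i and label i + 1 goes to row i.
            var midpoint = (rowTops[i] + rowTops[i + 1]) / 2.0;
            if (fragmentTop < midpoint)
            {
                return i;
            }
        }

        // Fragments below the last midpoint belong to the last row.
        return rowTops.Count - 1;
    }
}
```

With assumed label tops of 100 and 200, the midpoint is 150. A trailing affix line of the first row that is rendered at `top` 160 is therefore assigned to the second row, which is the kind of leakage the repair step is meant to undo.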
@@ -353,6 +392,14 @@ Fragments inside a cell are grouped into lines by close `top` values and then or

 This produces a stable line list even when PDF text is broken into multiple fragments.

+### Boundary Repair
+
+After the initial midpoint-based row assignment, the parser performs a repair step across adjacent rows in the same column.
+
+If the next row begins with affix-like lines and then continues with prose, those leading affix lines are treated as leaked trailing affixes from the previous row and moved back.
+
+This repair exists because some tables place affix lines close enough to the next row label that midpoint-only segmentation is not reliable.
+
 ### Description vs Affix Splitting

 The parser classifies lines as:
@@ -366,6 +413,8 @@ Affix-like lines include:
 - symbolic lines using the critical glyphs
 - branch-like affix lines such as `with leg greaves: +2H - ...`

+Affix-like classification is intentionally conservative. Numeric prose lines such as `25% chance...` are not treated as affixes unless they match a known affix-like notation pattern.
+
 The current implementation stores:

 - `RawCellText`
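The conservative numeric rule can be tried in isolation. The sketch below reuses the pattern of `NumericAffixLineRegex` from the parser change in this commit; the wrapper class `NumericAffixRegexSketch` and the sample strings in the note are hypothetical, and the parser's other affix checks (glyph prefixes, ` - `, `(+`, `(-`) are deliberately left out.

```csharp
using System.Text.RegularExpressions;

public static class NumericAffixRegexSketch
{
    // Same pattern as NumericAffixLineRegex in the parser diff: leading digits must be
    // followed directly by H, one of the critical glyphs, or (optional spaces and) a dash.
    private static readonly Regex NumericAffixLine =
        new(@"^\d+(?:H|∑|∏|π|∫|\s*[–-])", RegexOptions.Compiled);

    // Hypothetical wrapper that exposes only the numeric part of the classification.
    public static bool LooksLikeNumericAffix(string line) => NumericAffixLine.IsMatch(line);
}
```

Under this pattern a line such as `3H` or `10 - ...` counts as numeric affix notation, while `25% chance your weapon is stuck...` does not, because the digits are followed by `%` rather than by one of the accepted markers.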
@@ -384,6 +433,8 @@ At minimum, a valid `standard` table must satisfy:
 - roll-band labels are found
 - each detected row produces content for all five columns
 - total parsed cell count matches `row_count * 5`
+- no cell begins with affix-like lines before prose
+- no cell contains prose after affix lines

 If validation fails:

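One way to read the two new structural rules together is that a valid cell is a run of prose lines followed by a run of affix lines, with either run allowed to be empty. The sketch below expresses that reading as a single predicate. It is an assumption-laden illustration, not the importer's validation code: `CellShapeSketch`, `IsProseThenAffixes`, and the caller-supplied `isAffixLike` classifier are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CellShapeSketch
{
    // True when the lines form "zero or more prose lines, then zero or more affix lines".
    // Skips the leading prose run, then requires every remaining line to be affix-like.
    public static bool IsProseThenAffixes(IEnumerable<string> lines, Func<string, bool> isAffixLike) =>
        lines.SkipWhile(line => !isAffixLike(line)).All(isAffixLike);
}
```

With a toy classifier that treats lines starting with `+` as affixes, `["Foe is stunned.", "+2H"]` passes, while `["+2H", "Foe is stunned."]` fails. The parser change in this commit reports the two conditions as separate validation errors rather than as one boolean.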
Binary file not shown.
@@ -8,6 +8,7 @@ public sealed class StandardCriticalTableParser
 {
     private const int HeaderToBodyMinimumGap = 20;
     private const int TopGroupingTolerance = 2;
+    private static readonly Regex NumericAffixLineRegex = new(@"^\d+(?:H|∑|∏|π|∫|\s*[–-])", RegexOptions.Compiled);

     public StandardCriticalTableParseResult Parse(CriticalImportManifestEntry entry, string xmlContent)
     {
@@ -49,8 +50,7 @@ public sealed class StandardCriticalTableParser
             .Select(anchor => CreateRollBand(anchor.Label, anchor.SortOrder))
             .ToList();

-        var parsedCells = new List<ParsedCriticalCellArtifact>();
-        var parsedResults = new List<ParsedCriticalResult>();
+        var cellEntries = new List<CellEntry>();

         for (var rowIndex = 0; rowIndex < rowAnchors.Count; rowIndex++)
         {
@@ -80,30 +80,65 @@ public sealed class StandardCriticalTableParser
                     continue;
                 }

-                var lines = BuildLines(cellFragments);
-                var rawAffixLines = lines.Where(IsAffixLikeLine).ToList();
-                var descriptionLines = lines.Where(line => !IsAffixLikeLine(line)).ToList();
-                var rawCellText = string.Join(Environment.NewLine, lines);
-                var descriptionText = CollapseWhitespace(string.Join(' ', descriptionLines));
-                var rawAffixText = rawAffixLines.Count == 0 ? null : string.Join(Environment.NewLine, rawAffixLines);
-
-                parsedCells.Add(new ParsedCriticalCellArtifact(
+                cellEntries.Add(new CellEntry(
                     rowAnchors[rowIndex].Label,
+                    rowIndex,
                     columnAnchor.Key,
-                    lines,
-                    rawCellText,
-                    descriptionText,
-                    rawAffixText));
-
-                parsedResults.Add(new ParsedCriticalResult(
-                    columnAnchor.Key,
-                    rowAnchors[rowIndex].Label,
-                    rawCellText,
-                    descriptionText,
-                    rawAffixText));
+                    BuildLines(cellFragments).ToList()));
             }
         }

+        RepairLeadingAffixLeakage(cellEntries);
+
+        var parsedCells = new List<ParsedCriticalCellArtifact>();
+        var parsedResults = new List<ParsedCriticalResult>();
+
+        foreach (var cellEntry in cellEntries.OrderBy(item => item.RowIndex).ThenBy(item => item.ColumnKey))
+        {
+            var firstProseIndex = cellEntry.Lines.FindIndex(line => !IsAffixLikeLine(line));
+            var firstAffixIndex = cellEntry.Lines.FindIndex(IsAffixLikeLine);
+
+            if (firstProseIndex > 0)
+            {
+                validationErrors.Add(
+                    $"Cell '{cellEntry.RollBandLabel}/{cellEntry.ColumnKey}' begins with affix-like lines before prose.");
+            }
+
+            if (firstAffixIndex >= 0)
+            {
+                var proseAfterAffix = cellEntry.Lines
+                    .Skip(firstAffixIndex + 1)
+                    .Any(line => !IsAffixLikeLine(line));
+
+                if (proseAfterAffix)
+                {
+                    validationErrors.Add(
+                        $"Cell '{cellEntry.RollBandLabel}/{cellEntry.ColumnKey}' contains prose after affix lines.");
+                }
+            }
+
+            var rawAffixLines = cellEntry.Lines.Where(IsAffixLikeLine).ToList();
+            var descriptionLines = cellEntry.Lines.Where(line => !IsAffixLikeLine(line)).ToList();
+            var rawCellText = string.Join(Environment.NewLine, cellEntry.Lines);
+            var descriptionText = CollapseWhitespace(string.Join(' ', descriptionLines));
+            var rawAffixText = rawAffixLines.Count == 0 ? null : string.Join(Environment.NewLine, rawAffixLines);
+
+            parsedCells.Add(new ParsedCriticalCellArtifact(
+                cellEntry.RollBandLabel,
+                cellEntry.ColumnKey,
+                cellEntry.Lines,
+                rawCellText,
+                descriptionText,
+                rawAffixText));
+
+            parsedResults.Add(new ParsedCriticalResult(
+                cellEntry.ColumnKey,
+                cellEntry.RollBandLabel,
+                rawCellText,
+                descriptionText,
+                rawAffixText));
+        }
+
         if (columnCenters.Count != 5)
         {
             validationErrors.Add($"Expected 5 standard-table columns but found {columnCenters.Count}.");
@@ -276,12 +311,46 @@ public sealed class StandardCriticalTableParser
             value.StartsWith("\u220F", StringComparison.Ordinal) ||
             value.StartsWith("\u03C0", StringComparison.Ordinal) ||
             value.StartsWith("\u222B", StringComparison.Ordinal) ||
-            char.IsDigit(value[0]) ||
+            NumericAffixLineRegex.IsMatch(value) ||
             value.Contains(" - ", StringComparison.Ordinal) ||
             value.Contains("(-", StringComparison.Ordinal) ||
             value.Contains("(+", StringComparison.Ordinal);
     }

+    private static void RepairLeadingAffixLeakage(List<CellEntry> cellEntries)
+    {
+        var maxRowIndex = cellEntries.Count == 0 ? -1 : cellEntries.Max(item => item.RowIndex);
+        var columnKeys = cellEntries.Select(item => item.ColumnKey).Distinct(StringComparer.OrdinalIgnoreCase).ToList();
+
+        for (var rowIndex = 0; rowIndex < maxRowIndex; rowIndex++)
+        {
+            foreach (var columnKey in columnKeys)
+            {
+                var current = cellEntries.SingleOrDefault(item => item.RowIndex == rowIndex && item.ColumnKey == columnKey);
+                var next = cellEntries.SingleOrDefault(item => item.RowIndex == rowIndex + 1 && item.ColumnKey == columnKey);
+
+                if (current is null || next is null)
+                {
+                    continue;
+                }
+
+                var leadingAffixCount = 0;
+                while (leadingAffixCount < next.Lines.Count && IsAffixLikeLine(next.Lines[leadingAffixCount]))
+                {
+                    leadingAffixCount++;
+                }
+
+                if (leadingAffixCount == 0 || leadingAffixCount == next.Lines.Count)
+                {
+                    continue;
+                }
+
+                current.Lines.AddRange(next.Lines.Take(leadingAffixCount));
+                next.Lines.RemoveRange(0, leadingAffixCount);
+            }
+        }
+    }
+
     private static string CollapseWhitespace(string value) =>
         Regex.Replace(value.Trim(), @"\s+", " ");

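The effect of `RepairLeadingAffixLeakage` on a single pair of vertically adjacent cells can be sketched in isolation. The version below is a reduced illustration under assumptions, not a replacement for the method in the diff: `LeakageRepairSketch` and `Repair` are hypothetical names, the classifier is passed in by the caller instead of using `IsAffixLikeLine`, and the sample lines in the note are invented.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LeakageRepairSketch
{
    // Moves the leading affix-like lines of `next` to the end of `current`.
    // Mirrors the guard in the diff: nothing moves when `next` has no leading
    // affix lines, or when `next` consists only of affix lines.
    public static void Repair(List<string> current, List<string> next, Func<string, bool> isAffixLike)
    {
        var leadingAffixCount = next.TakeWhile(isAffixLike).Count();
        if (leadingAffixCount == 0 || leadingAffixCount == next.Count)
        {
            return;
        }

        // Copy first, then remove, so the moved lines keep their original order.
        current.AddRange(next.Take(leadingAffixCount));
        next.RemoveRange(0, leadingAffixCount);
    }
}
```

Given a previous cell `["Prose A."]` and a next cell `["+2H", "+10", "Prose B.", "+3H"]`, with lines starting with `+` treated as affixes, the repair leaves `["Prose A.", "+2H", "+10"]` and `["Prose B.", "+3H"]`. A next cell that contains only affix lines is left alone, since there is no prose to show where its own content starts.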
@@ -295,4 +364,12 @@ public sealed class StandardCriticalTableParser
     private sealed record ColumnAnchor(string Key, double CenterX);

     private sealed record RowAnchor(string Label, int Top, int SortOrder);
+
+    private sealed class CellEntry(string rollBandLabel, int rowIndex, string columnKey, List<string> lines)
+    {
+        public string RollBandLabel { get; } = rollBandLabel;
+        public int RowIndex { get; } = rowIndex;
+        public string ColumnKey { get; } = columnKey;
+        public List<string> Lines { get; } = lines;
+    }
 }