On Sat, Jun 12, 2021 at 04:23:23PM -0400, Liam R. E. Quin scripsit:
On Sat, 2021-06-12 at 15:38 -0400, Graydon wrote:
This test is meant to test only that no words have been lost or re-ordered; that the transformation is semantically correct is out of scope for it.
Some random witterings...
So, i'd probably consider (1) make a sequence of words from document A
Now, if you really hate your CPU :) you could transform A.seq into a regular expression, w0.*w1.*w2... and match it against the extracted string value of A.
I would have to hate my CPU intensely; some of the real documents run to a thousand or more pages in PDF.
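For illustration only, here is a minimal Python sketch of the w0.*w1.*w2... idea, next to a single left-to-right scan that should give the same answer without regex backtracking. The function names (words, sequence_regex, in_order_regex, in_order_scan) are made up for this sketch, and it assumes that words are whitespace-delimited tokens and that the string values have already been extracted from the documents. Nothing here comes from the actual test under discussion.

```python
import re


def words(text):
    """Split a string value into whitespace-delimited words (an assumed tokenisation)."""
    return text.split()


def sequence_regex(seq):
    """Build the pattern w0.*w1.*w2... with regex metacharacters escaped."""
    return re.compile(".*".join(re.escape(w) for w in seq), re.DOTALL)


def in_order_regex(seq, text):
    """True if every word of seq occurs in text, in order (regex version)."""
    return sequence_regex(seq).search(text) is not None


def in_order_scan(seq, text):
    """Same check as in_order_regex, done as one forward pass with str.find.

    Each word is searched for starting just past the previous match, so
    the position in text only ever moves forward.
    """
    pos = 0
    for w in seq:
        found = text.find(w, pos)
        if found < 0:
            return False  # word missing, or present only before pos (re-ordered)
        pos = found + len(w)
    return True


if __name__ == "__main__":
    a_seq = words("the quick brown fox")
    print(in_order_scan(a_seq, "the very quick brown red fox"))
    print(in_order_scan(a_seq, "the brown quick fox"))
```

Both versions match substrings rather than whole tokens (so "cat" would be found inside "concatenate"), and both tolerate extra words in the text being checked; a stricter test would compare the two word sequences directly.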
[snip]
Doug Lenat, i think, has written a book about parsing algorithms, as has Anne Brüggemann-Klein; Michael Sperberg-McQueen gave a paper at Balisage (or at Extreme Markup) about applications to schema validation. Anne's abstraction, whose name i can't remember (sorry), is the most promising, since your problem can be recast as matching XML Schema grammars against input documents with the unique particle attribution restriction lifted; RelaxNG does this with a hedge automaton, and that's another approach.
I think this will be helpful in the longer term, since more general solutions and solutions for whether the transformation is semantically conformant will be wanted.
(Also likely another few buckets of water will be required. :)
Thanks!
Graydon