differencing the string value of documents

List overview All Threads
Download

newer

older

http:sent-request with username,...

Permissions

Graydon Saunders

10 Jun 2021 10 Jun '21

9:06 p.m.

Hello --

So I have XML documents A and B.

For this purpose, the markup (which is known to be different) is not relevant; the thing I need to do is prove that all of the text in the string value of A is present in B in the same order in which it exists in A. Document B can have additional text and still be correct, so it is not sufficient to check if string(A) = string(B). (Ideally, B's extra text can be extracted and then compared against the sources of extra text in the transformation between A and B so as to confirm that it's at least all plausible new text.)

I've been poking at this for a while and am getting that "reinventing the wheel by gnawing on anvils" feeling.

Does a facility to do this exist in BaseX? I've had a look at the documentation and I observe some mention of diff.xqm but that appears to be a node-based diff.

thanks! Graydon

Attachments:

attachment.html (text/html — 1.6 KB)

Show replies by date

Martin Honnen

11 Jun 11 Jun

2:34 a.m.

Am 11.06.2021 um 03:06 schrieb Graydon Saunders:

...

So I have XML documents A and B.

For this purpose, the markup (which is known to be different) is not relevant; the thing I need to do is prove that all of the text in the string value of A is present in B in the same order in which it exists in A. Document B can have additional text and still be correct, so it is not sufficient to check if string(A) = string(B). (Ideally, B's extra text can be extracted and then compared against the sources of extra text in the transformation between A and B so as to confirm that it's at least all plausible new text.)

At least on the string level even XPath 1 comparisons with substring-before/after should suffice e.g.

if (string(A) = string(B))

then "complete match"

else

let $prefix := substring-before(string(B), string(A)),

$suffix := substring-after(string(B), string(A))

return if ($suffix != '' or $prefix != '')

then "partial match"

else "no match"

I don't know what the "plausible new text" requirement is about or how to express it.

Martin Honnen

3:15 a.m.

Am 11.06.2021 um 08:34 schrieb Martin Honnen:

...

Am 11.06.2021 um 03:06 schrieb Graydon Saunders:

...
So I have XML documents A and B.

For this purpose, the markup (which is known to be different) is not relevant; the thing I need to do is prove that all of the text in the string value of A is present in B in the same order in which it exists in A. Document B can have additional text and still be correct, so it is not sufficient to check if string(A) = string(B). (Ideally, B's extra text can be extracted and then compared against the sources of extra text in the transformation between A and B so as to confirm that it's at least all plausible new text.)

At least on the string level even XPath 1 comparisons with substring-before/after should suffice e.g.

if (string(A) = string(B))

then "complete match"

else

Of course contains(string(B), string(A)) would alone suffice to check for a partial match.

...

let $prefix := substring-before(string(B), string(A)),

$suffix := substring-after(string(B), string(A))

return if ($suffix != '' or $prefix != '')

then "partial match"

else "no match"

I don't know what the "plausible new text" requirement is about or how to express it.

Graydon

12 Jun 12 Jun

12:09 p.m.

On Fri, Jun 11, 2021 at 09:15:02AM +0200, Martin Honnen scripsit:

...

Of course contains(string(B), string(A)) would alone suffice to check for a partial match.

I think I managed to express the problem badly.

Document A is the original; it has some quantity of text. Document B is produced from Document A via a transformation process that is known to add text by doing things like getting display text for an internal link from the reference target, inserting boilerplate legal disclaimers, inserting standard text ("Chapter 1", etc.), and so on.

The goal is to perform a test to ensure that B still has all of A's text, in the same order, after this transformation has taken place.

So no single string compare will work; it will certainly fail, and because there are multiple points of insertion in the document (which can be of arbitrary length), there's some question of appropriate scale of testing. (Hashing per notional sentences and comparing the hash values finds differences but doesn't provide a test for order, for example.)

I was hoping that there was a known algorithm for this kind of testing; I have trouble believing I'm the first person who has wanted to do it.

It looks like ratcheting through B's text with A's text, one word at a time, gives useful results. It doesn't allow for locating the error beyond "something, somewhere, is a problem" but that can be handled through diff tools and the full text of each document.

Thanks! Graydon

-- Graydon Saunders | graydonish@gmail.com Þæs oferéode, ðisses swá mæg. -- Deor ("That passed, so may this.")

Christian Grün

1:11 p.m.

Hi Graydon,

Could you share exemplary and minimized input documents with us?

Can the structure of the documents (hierarchy of nodes, element names, etc.) be completely ignored?

Best, Christian

On Sat, Jun 12, 2021 at 6:09 PM Graydon graydonish@gmail.com wrote:

...

On Fri, Jun 11, 2021 at 09:15:02AM +0200, Martin Honnen scripsit:

...
Of course contains(string(B), string(A)) would alone suffice to check for a partial match.

I think I managed to express the problem badly.

Document A is the original; it has some quantity of text. Document B is produced from Document A via a transformation process that is known to add text by doing things like getting display text for an internal link from the reference target, inserting boilerplate legal disclaimers, inserting standard text ("Chapter 1", etc.), and so on.

The goal is to perform a test to ensure that B still has all of A's text, in the same order, after this transformation has taken place.

So no single string compare will work; it will certainly fail, and because there are multiple points of insertion in the document (which can be of arbitrary length), there's some question of appropriate scale of testing. (Hashing per notional sentences and comparing the hash values finds differences but doesn't provide a test for order, for example.)

I was hoping that there was a known algorithm for this kind of testing; I have trouble believing I'm the first person who has wanted to do it.

It looks like ratcheting through B's text with A's text, one word at a time, gives useful results. It doesn't allow for locating the error beyond "something, somewhere, is a problem" but that can be handled through diff tools and the full text of each document.

Thanks! Graydon

-- Graydon Saunders | graydonish@gmail.com Þæs oferéode, ðisses swá mæg. -- Deor ("That passed, so may this.")

Graydon

3:38 p.m.

Hello!

On Sat, Jun 12, 2021 at 07:11:00PM +0200, Christian Grün scripsit:

...

Could you share exemplary and minimized input documents with us?

I have created some text documents and attached them. (The "remember to throw away some metadata markup, etc." step on the way to getting text to compare from the before and after vocabularies is believed to work reliably.)

...

Can the structure of the documents (hierarchy of nodes, element names, etc.) be completely ignored?

It can! This test is meant to test only that no words have been lost or re-ordered; that the transformation is semantically correct is out of scope for it.

Thank you! Graydon

Liam R. E. Quin

4:23 p.m.

On Sat, 2021-06-12 at 15:38 -0400, Graydon wrote:

...

This test is meant to test only that no words have been lost or re-ordered; that the transformation is semantically correct is out of scope for it.

Somerandomwitterings...

So, i'd probably consider (1) make a sequence of words from document A

Now, if you really hate your CPU :) you could transform A.seq into a regular expression, w0.*w1.*w2... and match it against the extracted string value of A.

Starting with ^.*?w0 might reduce the run-time in practice, but the others all need arbitrary backtracking in case the transformation introducedone or more words that occur at that point in the document, so you have w0 w1 w2 w3 w1 w2 w3 w4 to match against w0 w1 w2 w3 w3

This could also be written with a recursive function and a helper; the helper would find the longest match at the current position, and if that's empty the function returns "nope" and you have to back-track.

Doug Lenat i think has written a book around parsing algorithms, as has Anne Brüggemann-Klein; Michael Sperberg-McQueen gave a paper at Balisage about applications to Schema Validation (or at Extreme Markup). Anne's abstraction, whose namei can't remember (sorry), is most promising since your problem can be recast as equivalent to matching XML Schema grammars to input documents, with the unique particle attribution restriction lifted; RelaxNG does this with a hedge automaton and that's another approach.

-- Liam Quin, https://www.delightfulcomputing.com/ Available for XML/Document/Information Architecture/XSLT/ XSL/XQuery/Web/Text Processing/A11Y training, work & consulting. Barefoot Web-slave, antique illustrations: http://www.fromoldbooks.org

Graydon

5:55 p.m.

On Sat, Jun 12, 2021 at 04:23:23PM -0400, Liam R. E. Quin scripsit:

...

On Sat, 2021-06-12 at 15:38 -0400, Graydon wrote:

...
This test is meant to test only that no words have been lost or re-ordered; that the transformation is semantically correct is out of scope for it.

Somerandomwitterings...

So, i'd probably consider (1) make a sequence of words from document A

Now, if you really hate your CPU :) you could transform A.seq into a regular expression, w0.*w1.*w2... and match it against the extracted string value of A.

I would have to hate my CPU intensely; some of the real documents run to a thousand or more pages in PDF.

[snip]

...

Doug Lenat i think has written a book around parsing algorithms, as has Anne Brüggemann-Klein; Michael Sperberg-McQueen gave a paper at Balisage about applications to Schema Validation (or at Extreme Markup). Anne's abstraction, whose namei can't remember (sorry), is most promising since your problem can be recast as equivalent to matching XML Schema grammars to input documents, with the unique particle attribution restriction lifted; RelaxNG does this with a hedge automaton and that's another approach.

I think this will be helpful in the longer term, since more general solutions and solutions for whether the transformation is semantically conformant will be wanted.

(Also likely another few buckets of water will be required. :)

Thanks!

Graydon

-- Graydon Saunders | graydonish@gmail.com Þæs oferéode, ðisses swá mæg. -- Deor ("That passed, so may this.")

Christian Grün

4:36 p.m.

Thanks. Here’s one way how you could handle that:

declare function local:correct($old, $new) { empty(local:compare($old, $new)) };

declare function local:compare($old, $new) { if(head($old) = head($new)) then ( (: skip identical lines :) local:compare(tail($old), tail($new)) ) else if(empty($old)) then ( (: old lines are consumed, all fine :) () ) else if(empty($new)) then ( (: old lines remain, bad :) 'sigh' ) else ( (: lines differ: skip new line :) local:compare($old, tail($new)) ) };

let $old := unparsed-text-lines('before.txt') for $file in ( 'after-correct.txt', 'after-fails1.txt', 'after-fails2.txt' ) let $new := unparsed-text-lines($file) return $file || ': ' || local:correct($old, $new)

Here’s another solution, which avoids recursive calls and the need to optimize it for tail calls. It’s based on the handsome built-in higher-order function hof:until. Once again, it was Leo (Wörteler) who introduced it to BaseX [1]:

declare function local:correct($old, $new) { let $result := hof:until( (: test if there’s something to compare :) function($m) { empty($m?old) or empty($m?new) }, (: compare, create new input :) function($m) { map { 'old': if(head($m?old) = head($m?new)) then tail($m?old) else $m?old, 'new': tail($m?new) } }, (: initial input :) map { 'old': $old, 'new': $new } ) (: check if there’s old input left :) return empty($result?old) };

Cheers, Christian

[1] https://docs.basex.org/wiki/Higher-Order_Functions_Module#hof:until _________________________________

On Sat, Jun 12, 2021 at 9:38 PM Graydon graydonish@gmail.com wrote:

...

Hello!

On Sat, Jun 12, 2021 at 07:11:00PM +0200, Christian Grün scripsit:

...
Could you share exemplary and minimized input documents with us?

I have created some text documents and attached them. (The "remember to throw away some metadata markup, etc." step on the way to getting text to compare from the before and after vocabularies is believed to work reliably.)

...
Can the structure of the documents (hierarchy of nodes, element names, etc.) be completely ignored?

It can! This test is meant to test only that no words have been lost or re-ordered; that the transformation is semantically correct is out of scope for it.

Thank you! Graydon

Graydon

5:51 p.m.

On Sat, Jun 12, 2021 at 10:36:13PM +0200, Christian Grün scripsit:

...

Thanks. Here’s one way how you could handle that:

[snip]

This is similar to the word-level recursive approach I am currently taking. (though there is something to learn in the double function approach for sure; thank you!)

...

Here’s another solution, which avoids recursive calls and the need to optimize it for tail calls. It’s based on the handsome built-in higher-order function hof:until. Once again, it was Leo (Wörteler) who introduced it to BaseX [1]:

declare function local:correct($old, $new) { let $result := hof:until( (: test if there’s something to compare :) function($m) { empty($m?old) or empty($m?new) }, (: compare, create new input :) function($m) { map { 'old': if(head($m?old) = head($m?new)) then tail($m?old) else $m?old, 'new': tail($m?new) } }, (: initial input :) map { 'old': $old, 'new': $new } ) (: check if there’s old input left :) return empty($result?old) };

This is tremendously cool and I think it will be useful; thank you!

(I also think I am going to have to pour a few buckets of water over my head in the process of trying to fully comprehend it. :)

Much appreciated!

Graydon

-- Graydon Saunders | graydonish@gmail.com Þæs oferéode, ðisses swá mæg. -- Deor ("That passed, so may this.")

1497

Age (days ago)

1498

Last active (days ago)

basex-talk@mailman.uni-konstanz.de

9 comments

5 participants

tags (0)

participants (5)

Christian Grün
Graydon
Graydon Saunders
Liam R. E. Quin
Martin Honnen