Hi,
recently I ran into serious (as in SERIOUS) performance trouble regarding expensive regexes, and no wonder why.
Here is my scenario:
- XML text documents with a total of 1m text nodes
- a regex search string built from a huge string dictionary of 50,000 strings (some of them containing 50 words each)
- a match must be wrapped in an element (= create a link); a minimal sketch of this wrapping step follows the list
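For illustration, here is a minimal sketch of that wrapping step (placeholder code, not the actual script; the dictionary regex and the element names are made up), using fn:analyze-string to wrap every dictionary match in a link element:

let $regex := 'Verordnung über Luftfahrtpersonal|Luftverkehrsgesetz'  (: placeholder dictionary regex :)
let $textnodes := <doc><p>Siehe die Verordnung über Luftfahrtpersonal und das Luftverkehrsgesetz.</p></doc>//text()
for $textnode in $textnodes
return element p {
  for $part in fn:analyze-string($textnode, $regex)/*
  return
    if ($part/self::fn:match)
    then element link { string($part) }   (: wrap the match, i.e. create the link :)
    else text { $part }                   (: keep the surrounding text as is :)
}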
I could drink many cups of tea while waiting for the regex to complete... hours... I ran out of tea!
Then I found the 2021 Balisage paper by Mary Holstege titled "Fast Bulk String Matching" [1], in which she explores the Aho-Corasick algorithm and implements it in XQuery - marvellous! Following this, I can build a data structure that gives me fast lookups, but building that structure is still very slow because of the amount of text it has to be built from (an illustrative sketch of the construction step follows). So this was not fast enough for my use case - or I may simply not be smart enough to apply it correctly :-|
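To give an idea of why the construction phase dominates: the automaton has to be built character by character from the dictionary. Below is a minimal trie-building sketch with nested maps (an illustrative simplification, not Holstege's implementation, and without the failure links of a real Aho-Corasick automaton; $dictionary is placeholder data):

declare function local:insert($trie as map(*), $chars as xs:string*) as map(*) {
  if (empty($chars))
  then map:put($trie, '', true())   (: mark the end of a dictionary entry :)
  else
    let $head := head($chars)
    let $sub  := (map:get($trie, $head), map { })[1]
    return map:put($trie, $head, local:insert($sub, tail($chars)))
};

let $dictionary := (
  'Verordnung über Luftfahrtpersonal',
  'Verordnung über Luftfahrtgerät'
)
return fold-left(
  $dictionary,
  map { },
  (: one map update per character of every entry - this is where the time goes :)
  function($trie, $entry) {
    local:insert($trie, string-to-codepoints($entry) ! codepoints-to-string(.))
  }
)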
So, I tried tinkering with maps which turned out to give me extraordinary performance gains:
- build a map from the string dictionary
- walk through all text nodes one by one
- for each text node, put any combination of words in the text node in word order (I need to find strings, not words) into another map
- strip punctuation (for word boundaries) and do some normalization of whitespace in both maps
- compare the keys of both maps
- give the new, reduced string dictionary to the regular regex search
While comparing the maps, I do not know where in the text my strings are, but at least I know whether they are in there. To find out where exactly (and how they fit my other regex context), I can then run a massively reduced regular regex search. Fast!
I am quite happy BUT I still cannot understand why this is so much faster for my sample data:
- plain method: 51323.72 ms
- maps method: 597.94 ms (!)
Is this due to any optimizations done by BaseX or have I simply discovered the power of maps? How would you do it? Is there still room for improvement? Why does Aho-Corasick not help much with this scenario? Is it because the list of strings is simply too massive? Why is this so much faster with text-splitting-to-map?
See the query examples below to get a better idea of what I am trying to do (bulk data not included) [2],[3]. The examples do not normalize punctuation; that would only be needed for completeness (a possible normalization sketch follows the code in [3]).
Best, Daniel
[1] http://www.balisage.net/Proceedings/vol26/print/Holstege01/BalisageVol26-Holstege01.html
[2] plain method
let $textnodes := fetch:xml(file:base-dir() || file:dir-separator() || 'example.xml')//text()
let $strings := file:read-text(file:base-dir() || file:dir-separator() || 'strings.txt')
let $regex := $strings
for $textnode in $textnodes
where matches($textnode, $regex)
return $textnode
[3] maps method
(:~
 : Create a map from the dictionary string
 : Tokenization
 :
 : @param $string
 : @return map
 :)
declare function local:createMapFromString (
  $string as xs:string
) as map(*) {
  let $map_words := map:merge(
    (: split on the literal '|' separator; a bare '|' pattern would match the zero-length string :)
    for $token in tokenize($string, '\|')
    let $key := $token
    let $val := $token
    return map:entry($key, $val),
    map { 'duplicates': 'use-first' }
  )
  return $map_words
};
(:~
 : Create a map from the text nodes
 : Write any combination of words in document order to the map
 :
 : @param $textnodes
 : @return map
 :)
declare function local:createMapFromTextnodes (
  $textnodes as xs:string+
) as map(*) {
  map:merge(
    for $node in $textnodes
    let $text := normalize-space($node)
    let $tokens := tokenize($text, ' ')
    let $map_nodes := map:merge(
      for $start in 1 to fn:count($tokens)
      for $length in 1 to fn:count($tokens) - $start + 1
      return map:entry(fn:string-join(fn:subsequence($tokens, $start, $length), ' '), 'x')
    )
    return $map_nodes
  )
};
(:~
 : Compare two maps
 :
 : @param $map1
 : @param $map2
 : @return xs:string*
 :)
declare function local:reduceMaps (
  $map1 as map(*),
  $map2 as map(*)
) as xs:string* {
  for $key in map:keys($map1)
  where map:contains($map2, $key)
  let $value := map:get($map1, $key)
  return $value
};
let $textnodes := fetch:xml(file:base-dir() || file:dir-separator() || 'example.xml')//text()
let $strings := file:read-text(file:base-dir() || file:dir-separator() || 'strings.txt')

let $map_words := local:createMapFromString($strings)
let $map_textnodes := local:createMapFromTextnodes($textnodes)
let $matches := local:reduceMaps($map_words, $map_textnodes)

let $regex := string-join(for $match in $matches group by $match return $match, '|')

for $textnode in $textnodes
where matches($textnode, $regex)
return $textnode
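As noted above, the examples skip punctuation handling. One possible normalization step (a sketch, not part of the examples) is to turn punctuation into spaces before tokenizing, applied to both the dictionary strings and the text nodes:

declare function local:normalizeText($s as xs:string) as xs:string {
  (: replace punctuation by spaces so word boundaries survive, then collapse whitespace :)
  normalize-space(replace($s, '\p{P}', ' '))
};

local:normalizeText('Piloten, Hubschrauber (und Flugingenieure).')
(: returns: Piloten Hubschrauber und Flugingenieure :)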
Hi Daniel,
Thanks for presenting your interesting use case.
Your solution looks clever. Would you mind sharing some sample data with us (an XML document and a string dictionary)?
Cheers, Christian
Hi Christian,
thanks for your interest; please find attached a complete example. It is all about the length of the strings in my dictionary, really.
test-reduce.xq: 170.82 ms
test.xq: 6572.49 ms
I noticed that if the data is larger, it is better performance-wise to build a comparison map for each text node (not as in the example; a rough sketch of that variant is below). On the other hand, if the strings are short, performance might even be worse.
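To make that concrete, here is a rough sketch of the per-text-node variant (an illustration, not the attached code; the bindings are placeholders that mimic the functions from [3]): the dictionary map is built once, and for each text node only that node's contiguous word spans are generated and looked up:

let $map_words := map { 'Verordnung über Luftfahrtpersonal': 'Verordnung über Luftfahrtpersonal' }
  (: placeholder; in practice built by local:createMapFromString from [3] :)
let $textnodes := <doc><p>Siehe die Verordnung über Luftfahrtpersonal in der aktuellen Fassung.</p></doc>//text()
for $node in $textnodes
let $tokens := tokenize(normalize-space($node), ' ')
let $spans :=
  for $start in 1 to count($tokens)
  for $length in 1 to count($tokens) - $start + 1
  return string-join(subsequence($tokens, $start, $length), ' ')
where some $span in $spans satisfies map:contains($map_words, $span)
return $node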
But with my use case, I have to search for complete strings like “Erste Durchführungsverordnung zur Verordnung über Luftfahrtpersonal (Anwendungsbestimmungen zur Lizenzierung von Piloten Flugzeuge, von Piloten Hubschrauber, von Flugingenieuren, Freiballonführern, Flugdienstberatern und Flugtechnikern auf Hubschraubern bei den Polizeien des Bundes und der Länder)“ – this does not work too well with bulk matches/replace.
So while I am quite happy with the results, I do not understand why this works so fast – or whether there is any feasible alternative to this approach (I get the feeling that there is still much for me to learn in the mysteries of XQuery ;-). Does this work because BaseX simply handles this well? Could match/replace be optimized in some other way to handle bulk searches that include long strings?
Daniel
Hi Daniel,
Thanks for your code.
My reply is a fairly general one and not specific to XQuery: regular expressions are very powerful, but they can get fairly expensive, so it’s always a good idea to simplify or optimize them before running them on large datasets. Here is an example of a regex that looks fairly simple but takes many seconds to execute:
matches( string-join((1 to 100000) ! 'x'), '.*y' )
This expression can be sped up a lot, e.g. by dropping the leading '.*' and simply searching for 'y'.
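For comparison (variants added for illustration, not from the original measurement), the same check without the leading wildcard amounts to a plain substring search and finishes almost immediately:

matches( string-join((1 to 100000) ! 'x'), 'y' ),
contains( string-join((1 to 100000) ! 'x'), 'y' )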
BaseX passes on most regex strings to Java without further changes. If an expression takes a long time to execute, chances are high that it’s due to the JDK.
Hope this helps, Christian
Catching up late, but thank you for your thoughts. I have grown so fond of using maps in XQuery that I now use them all the time, even with tiny data sets. And I have learned that, in combination with a large number of long search strings, they can speed things up a lot where a regular expression can only go so far.
Thanks!