Hi,
recently I ran into serious (as in SERIOUS) performance trouble regarding expensive regexes, and no wonder why.
Here is my scenario:
- XML text documents with a total of 1m text nodes
- a regex search string built from a huge string dictionary of 50,000 strings (some of them containing 50 words each)
- a match must be wrapped in an element (= create a link); a minimal sketch of this wrapping step follows the list
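For illustration, here is a minimal sketch of that wrapping step (placeholder code, not the actual script; the dictionary regex and the element names are made up), using fn:analyze-string to wrap every dictionary match in a link element:

let $regex := 'Verordnung über Luftfahrtpersonal|Luftverkehrsgesetz'  (: placeholder dictionary regex :)
let $textnodes := <doc><p>Siehe die Verordnung über Luftfahrtpersonal und das Luftverkehrsgesetz.</p></doc>//text()
for $textnode in $textnodes
return element p {
  for $part in fn:analyze-string($textnode, $regex)/*
  return
    if ($part/self::fn:match)
    then element link { string($part) }   (: wrap the match, i.e. create the link :)
    else text { $part }                   (: keep the surrounding text as is :)
}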
I could drink many cups of tea while waiting for the regex to complete... hours... I ran out of tea!
Then I found the 2021 Balisage paper by Mary Holstege titled "Fast Bulk String Matching" [1], in which she explores the Aho-Corasick algorithm and implements it in XQuery - marvellous! Following this, I can build a data structure that gives me fast lookups, but building that structure is still very slow because of the amount of text it has to be built from (an illustrative sketch of the construction step follows). So this was not fast enough for my use case - or I may simply not be smart enough to apply it correctly :-|
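To give an idea of why the construction phase dominates: the automaton has to be built character by character from the dictionary. Below is a minimal trie-building sketch with nested maps (an illustrative simplification, not Holstege's implementation, and without the failure links of a real Aho-Corasick automaton; $dictionary is placeholder data):

declare function local:insert($trie as map(*), $chars as xs:string*) as map(*) {
  if (empty($chars))
  then map:put($trie, '', true())   (: mark the end of a dictionary entry :)
  else
    let $head := head($chars)
    let $sub  := (map:get($trie, $head), map { })[1]
    return map:put($trie, $head, local:insert($sub, tail($chars)))
};

let $dictionary := (
  'Verordnung über Luftfahrtpersonal',
  'Verordnung über Luftfahrtgerät'
)
return fold-left(
  $dictionary,
  map { },
  (: one map update per character of every entry - this is where the time goes :)
  function($trie, $entry) {
    local:insert($trie, string-to-codepoints($entry) ! codepoints-to-string(.))
  }
)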
So, I tried tinkering with maps which turned out to give me extraordinary performance gains:
- build a map from the string dictionary
- walk through all text nodes one by one
- for each text node, put any combination of words in the text node in word order (I need to find strings, not words) into another map
- strip punctuation (for word boundaries) and do some normalization of whitespace in both maps
- compare the keys of both maps
- give the new, reduced string dictionary to the regular regex search
While comparing the maps, I do not know where in the text my strings are, but at least I know whether they are in there. To find out where exactly (and how they fit my other regex context), I can then run a massively reduced regular regex search. Fast!
I am quite happy BUT I still cannot understand why this is so much faster for my sample data:
- plain method: 51323.72 ms
- maps method: 597.94 ms (!)
Is this due to any optimizations done by BaseX or have I simply discovered the power of maps? How would you do it? Is there still room for improvement? Why does Aho-Corasick not help much with this scenario? Is it because the list of strings is simply too massive? Why is this so much faster with text-splitting-to-map?
See the query examples below to get a better idea of what I am trying to do (bulk data not included) [2],[3]. The examples do not normalize punctuation; that would only be needed for completeness (a possible normalization sketch follows the code in [3]).
Best, Daniel
[1] http://www.balisage.net/Proceedings/vol26/print/Holstege01/BalisageVol26-Holstege01.html
[2] plain method
let $textnodes := fetch:xml(file:base-dir() || file:dir-separator() || 'example.xml')//text()
let $strings := file:read-text(file:base-dir() || file:dir-separator() || 'strings.txt')
let $regex := $strings
for $textnode in $textnodes
where matches($textnode, $regex)
return $textnode
[3] maps method
(:~
 : Create a map from the dictionary string
 : Tokenization
 :
 : @param $string
 : @return map
 :)
declare function local:createMapFromString (
  $string as xs:string
) as map(*) {
  let $map_words := map:merge(
    (: split on the literal '|' separator; a bare '|' pattern would match the zero-length string :)
    for $token in tokenize($string, '\|')
    let $key := $token
    let $val := $token
    return map:entry($key, $val),
    map { 'duplicates': 'use-first' }
  )
  return $map_words
};
(:~
 : Create a map from the text nodes
 : Write any combination of words in document order to the map
 :
 : @param $textnodes
 : @return map
 :)
declare function local:createMapFromTextnodes (
  $textnodes as xs:string+
) as map(*) {
  map:merge(
    for $node in $textnodes
    let $text := normalize-space($node)
    let $tokens := tokenize($text, ' ')
    let $map_nodes := map:merge(
      for $start in 1 to fn:count($tokens)
      for $length in 1 to fn:count($tokens) - $start + 1
      return map:entry(fn:string-join(fn:subsequence($tokens, $start, $length), ' '), 'x')
    )
    return $map_nodes
  )
};
(:~
 : Compare two maps
 :
 : @param $map1
 : @param $map2
 : @return xs:string*
 :)
declare function local:reduceMaps (
  $map1 as map(*),
  $map2 as map(*)
) as xs:string* {
  for $key in map:keys($map1)
  where map:contains($map2, $key)
  let $value := map:get($map1, $key)
  return $value
};
let $textnodes := fetch:xml(file:base-dir() || file:dir-separator() || 'example.xml')//text()
let $strings := file:read-text(file:base-dir() || file:dir-separator() || 'strings.txt')

let $map_words := local:createMapFromString($strings)
let $map_textnodes := local:createMapFromTextnodes($textnodes)
let $matches := local:reduceMaps($map_words, $map_textnodes)

let $regex := string-join(for $match in $matches group by $match return $match, '|')

for $textnode in $textnodes
where matches($textnode, $regex)
return $textnode
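As noted above, the examples skip punctuation handling. One possible normalization step (a sketch, not part of the examples) is to turn punctuation into spaces before tokenizing, applied to both the dictionary strings and the text nodes:

declare function local:normalizeText($s as xs:string) as xs:string {
  (: replace punctuation by spaces so word boundaries survive, then collapse whitespace :)
  normalize-space(replace($s, '\p{P}', ' '))
};

local:normalizeText('Piloten, Hubschrauber (und Flugingenieure).')
(: returns: Piloten Hubschrauber und Flugingenieure :)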
Hi Daniel,
Thanks for presenting your interesting use case.
Your solution looks clever. Would you mind sharing some sample data with us (an XML document and a string dictionary)?
Cheers, Christian
Hi Christian,
thanks for your interest; please find attached a complete example. It is all about the length of the strings in my dictionary, really.
test-reduce.xq: 170.82 ms
test.xq: 6572.49 ms
I noticed that if the data is larger, it is better performance-wise to build a comparison map for each text node (not as in the example; a rough sketch of that variant is below). On the other hand, if the strings are short, performance might even be worse.
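To make that concrete, here is a rough sketch of the per-text-node variant (an illustration, not the attached code; the bindings are placeholders that mimic the functions from [3]): the dictionary map is built once, and for each text node only that node's contiguous word spans are generated and looked up:

let $map_words := map { 'Verordnung über Luftfahrtpersonal': 'Verordnung über Luftfahrtpersonal' }
  (: placeholder; in practice built by local:createMapFromString from [3] :)
let $textnodes := <doc><p>Siehe die Verordnung über Luftfahrtpersonal in der aktuellen Fassung.</p></doc>//text()
for $node in $textnodes
let $tokens := tokenize(normalize-space($node), ' ')
let $spans :=
  for $start in 1 to count($tokens)
  for $length in 1 to count($tokens) - $start + 1
  return string-join(subsequence($tokens, $start, $length), ' ')
where some $span in $spans satisfies map:contains($map_words, $span)
return $node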
But with my use case, I have to search for complete strings like “Erste Durchführungsverordnung zur Verordnung über Luftfahrtpersonal (Anwendungsbestimmungen zur Lizenzierung von Piloten Flugzeuge, von Piloten Hubschrauber, von Flugingenieuren, Freiballonführern, Flugdienstberatern und Flugtechnikern auf Hubschraubern bei den Polizeien des Bundes und der Länder)“ – this does not work too well with bulk matches/replace.
So while I am quite happy with the results, I do not understand why this works so fast – or whether there is any feasible alternative to this approach (I get the feeling that there is still much for me to learn in the mysteries of XQuery ;-). Does this work because BaseX simply handles this well? Could match/replace be optimized in some other way to handle bulk searches that include long strings?
Daniel
Hi Daniel,
Thanks for your code.
My reply is a fairly general one and not specific to XQuery: regular expressions are very powerful, but they can get fairly expensive, so it’s always a good idea to simplify or optimize them before running them on large datasets. Here is an example of a regex that looks fairly simple but takes many seconds to execute:
matches( string-join((1 to 100000) ! 'x'), '.*y' )
This expression can be sped up a lot, e.g. by dropping the leading '.*' and simply searching for 'y'.
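For comparison (variants added for illustration, not from the original measurement), the same check without the leading wildcard amounts to a plain substring search and finishes almost immediately:

matches( string-join((1 to 100000) ! 'x'), 'y' ),
contains( string-join((1 to 100000) ! 'x'), 'y' )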
BaseX passes on most regex strings to Java without further changes. If an expression takes a long time to execute, chances are high that it’s due to the JDK.
Hope this helps, Christian
Catching up late, but thank you for your thoughts. I have grown so fond of using maps in XQuery that I now use them all the time, even with tiny data sets. And I have learned that, in combination with a large number of long search strings, they can speed things up a lot where a regular expression can only go so far.
Thanks!