Hi Gerrit,

Thanks for both the observation and the test case. The bug has been fixed, and a new snapshot is available [1,2].

All the best,
Christian



On Thu, Dec 2, 2021 at 9:35 AM Imsieke, Gerrit, le-tex <gerrit.imsieke@le-tex.de> wrote:
Hi Christian,

I wrote a query for a customer who wants to analyze their legacy ISO
12083 math formulas, in this case to detect consecutive <roman>
elements of length >= 3 that are separated only by whitespace.

This is a synthetic test document:

<doc>
   <p>
     <formula><roman>tan</roman> <roman>tan</roman></formula>
   </p>
   <p>
     <formula><roman>sin</roman> <sup>2</sup> <roman>sin</roman></formula>
   </p>
   <p>
     <formula><roman>cos</roman><sup>3</sup> <roman>cos</roman></formula>
   </p>
</doc>

And this is the query I wrote:

let $rms := //(formula | dformula)//roman[string-length() gt 2]
              [following-sibling::node()[1]/self::text()[not(normalize-space())]]
              [following-sibling::*[1]/self::roman[string-length() gt 2]]
              /..,
     $docs := for $rm-context in $rms
              let $path := db:path($rm-context)
              group by $path
              return <doc path="{$path}">{
                $rm-context
              }</doc>
return
<result count="{count($rms)}" docs="{count($docs)}">{
   $docs
}</result>

BaseX (up to version 9.6.3) erroneously reports all three <formula>
elements as a result, while only the first should be reported.

This can be remedied by using parentheses, as in
(following-sibling::node())[1]/self::text() and
(following-sibling::*)[1]/self::roman. But this is inefficient, and the
original query should just work™.

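The optimized version of the original query contains
following-sibling::text()[fn:position() = 1] and
following-sibling::roman[fn:position() = 1]. These are incorrect
rewrites of following-sibling::node()[1]/self::text() and
following-sibling::*[1]/self::roman: the original steps test whether
the first following sibling is a text node (or the first following
element is a <roman>), whereas the rewritten steps select the first
text (or <roman>) sibling no matter what comes before it.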
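
To make the difference reproducible without BaseX, here is a minimal
Python sketch that uses only the standard library. It is an
illustration, not the BaseX implementation: all helper names are made
up for this example, the test document is a compacted copy of the one
above, and a <formula> is counted if any of its <roman> descendants
matches (which equals the parent step in this document, where every
<roman> is a direct child of <formula>).

```python
from xml.dom import minidom

# Compact copy of the synthetic test document (whitespace inside
# <formula> is identical to the original).
DOC = """<doc>
<p><formula><roman>tan</roman> <roman>tan</roman></formula></p>
<p><formula><roman>sin</roman> <sup>2</sup> <roman>sin</roman></formula></p>
<p><formula><roman>cos</roman><sup>3</sup> <roman>cos</roman></formula></p>
</doc>"""


def following_siblings(node):
    """Yield the following siblings of node in document order."""
    sib = node.nextSibling
    while sib is not None:
        yield sib
        sib = sib.nextSibling


def first(nodes, pred=lambda n: True):
    """Return the first node satisfying pred, or None."""
    return next((n for n in nodes if pred(n)), None)


def string_value(node):
    """Concatenated text content, like the XPath string value."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(string_value(c) for c in node.childNodes)


def is_elem(n):
    return n.nodeType == n.ELEMENT_NODE


def is_text(n):
    return n.nodeType == n.TEXT_NODE


def is_ws_text(n):
    return n is not None and is_text(n) and not n.data.strip()


def is_long_roman(n):
    return (n is not None and is_elem(n) and n.tagName == "roman"
            and len(string_value(n)) > 2)


def matches_intended(roman):
    # following-sibling::node()[1]/self::text()[not(normalize-space())]
    # following-sibling::*[1]/self::roman[string-length() gt 2]
    first_node = first(following_siblings(roman))
    first_elem = first(following_siblings(roman), is_elem)
    return is_ws_text(first_node) and is_long_roman(first_elem)


def matches_rewritten(roman):
    # following-sibling::text()[position() = 1][not(normalize-space())]
    # following-sibling::roman[position() = 1][string-length() gt 2]
    first_text = first(following_siblings(roman), is_text)
    first_roman = first(following_siblings(roman),
                        lambda n: is_elem(n) and n.tagName == "roman")
    return is_ws_text(first_text) and is_long_roman(first_roman)


def count_formulas(pred):
    """Count <formula> elements with at least one matching long <roman>."""
    dom = minidom.parseString(DOC)
    return sum(
        any(is_long_roman(r) and pred(r)
            for r in f.getElementsByTagName("roman"))
        for f in dom.getElementsByTagName("formula")
    )


print(count_formulas(matches_intended))   # 1
print(count_formulas(matches_rewritten))  # 3
```

The intended reading reports only the first <formula>, while the
rewritten reading reports all three, which is the behavior I see in
BaseX.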

Gerrit



--
Gerrit Imsieke
Geschäftsführer / Managing Director
le-tex publishing services GmbH
Weissenfelser Str. 84, 04229 Leipzig, Germany
Phone +49 341 355356 110, Fax +49 341 355356 510
gerrit.imsieke@le-tex.de, http://www.le-tex.de
