Thanks for the example, this makes sense.

In my case, I think I can safely add a function that ensures every “&” is passed to parse-xml-fragment() as “&amp;”, because the source of the unparsed XML will always be well-formed.
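A minimal sketch of what such an escaping function could look like. The name my:escapeForFragment and the namespace URI are made up for illustration, and handling "<" and ">" in the same step is an assumption on top of the "&" case described above:

```xquery
declare namespace my = 'urn:example:my';  (: hypothetical namespace URI :)

(: Hypothetical helper: turns literal "&", "<" and ">" characters into
   entity references, so the string can be handed to an XML parser as text.
   The ampersand has to be escaped first; otherwise the "&" of the newly
   created "&lt;" and "&gt;" references would be escaped a second time. :)
declare function my:escapeForFragment($text as xs:string) as xs:string {
  $text
  => replace('&amp;', '&amp;amp;')
  => replace('&lt;',  '&amp;lt;')
  => replace('&gt;',  '&amp;gt;')
};

(: the argument is the 11-character string "Tom", space, "&", space, "Jerry" :)
my:escapeForFragment('Tom &amp; Jerry')
```

The helper has to run on the plain text before any replacement markup is inserted; otherwise the angle brackets of the inserted elements would be escaped as well.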

For background:

I need to wrap a given list of string matches inside a text node in elements. Because I need to do this recursively (e.g. the string is found three times in the text node) and also need to pass a list of different matches with different replacements, I find it easier to do it with string replacements and then parse the resulting text fragment back to XML, rather than splitting the text node into multiple nodes directly.

declare function my:replaceText(
  $word    as xs:string,
  $search  as xs:string*,
  $replace as xs:string*
) as xs:string {
  (: e.g. $search  = ("String1", "String2", "String3")
          $replace = ("<Replace1/>", "<Replace2/>", "<Replace3/>") :)
  if (empty($search)) then $word
  else
    (: handle the remaining search/replace pairs first, then the current pair;
       the lookbehind skips matches that directly follow a quote or ">" :)
    replace(
      my:replaceText($word, tail($search), tail($replace)),
      '(?<![">])' || my:escapeText(head($search)),
      head($replace),
      'j'
    )
};
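The helper my:escapeText is not shown in this message. A plausible sketch, assuming its only job is to backslash-escape regex metacharacters so that each search string is matched literally (the namespace URI is also an assumption):

```xquery
declare namespace my = 'urn:example:my';  (: hypothetical namespace URI :)

(: Assumed behaviour of my:escapeText: put a backslash in front of every
   regex metacharacter, so the search string is treated as literal text. :)
declare function my:escapeText($text as xs:string) as xs:string {
  replace($text, '([\\\|\.\-\^\?\*\+\{\}\(\)\[\]\$])', '\\$1')
};

my:escapeText('a.b(c)')
```

With this definition, the example call should return the string a\.b\(c\).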

Perhaps there is a better way to do this: how would one do this sort of replacement on mixed content without having to parse the result back to XML?

But this works for me, as long as I am careful to add some handling for &amp;, &gt; and &lt;.
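One possible answer to the question above, as a sketch only and not something taken from this thread: fn:analyze-string can split a text node into matching and non-matching parts, so the replacement elements can be put in place directly and nothing has to be parsed again. The function names my:replaceMatches and my:replaceIn, the namespace URI, and the choice to pass the replacements as elements instead of strings are all assumptions made for illustration.

```xquery
declare namespace my = 'urn:example:my';  (: hypothetical namespace URI :)

(: Replaces every occurrence of a $search string inside one text node with
   the element at the same position in $replace. :)
declare function my:replaceMatches(
  $text    as text(),
  $search  as xs:string*,
  $replace as element()*
) as node()* {
  if (empty($search)) then $text
  else
    (: build one alternation; regex metacharacters are escaped so that the
       search strings are matched literally :)
    let $regex := string-join(
      $search ! replace(., '([\\\|\.\-\^\?\*\+\{\}\(\)\[\]\$])', '\\$1'),
      '|'
    )
    for $part in analyze-string($text, $regex)/*
    return
      if ($part/self::fn:match)
      then $replace[index-of($search, string($part))[1]]
      else text { string($part) }
};

(: Rebuilds a tree, applying my:replaceMatches to every text node. :)
declare function my:replaceIn(
  $node    as node(),
  $search  as xs:string*,
  $replace as element()*
) as node()* {
  typeswitch ($node)
    case text() return my:replaceMatches($node, $search, $replace)
    case element() return
      element { node-name($node) } {
        $node/@*,
        $node/node() ! my:replaceIn(., $search, $replace)
      }
    default return $node
};

my:replaceIn(
  <p>String1 and <b>String2</b>, then String1 again &amp; more</p>,
  ('String1', 'String2'),
  (<Replace1/>, <Replace2/>)
)
```

For this input, the result should be <p><Replace1/> and <b><Replace2/></b>, then <Replace1/> again &amp; more</p>. Because only text nodes are touched, no lookbehind is needed to protect existing markup, and "&", "<" and ">" in the text need no special handling. Two caveats: an empty string in $search would make the regex match the zero-length string, which fn:analyze-string rejects, and if one search string is a prefix of another, the order of the alternatives decides which one wins.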

+1 for internal parser velocity!

From: Christian Grün <christian.gruen@gmail.com>
Sent: Tuesday, 21 November 2023 18:54
To: Zimmel, Daniel <D.Zimmel@ESVmedien.de>
Cc: BaseX <basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] Bug in parse-xml-fragment() and ampersand entity?

Yes, I can see the problem: &DUMMY; is interpreted as an unknown entity and thus replaced with a question mark (from today's perspective, a better choice would be the Unicode replacement character U+FFFD anyway). We'll keep that in mind and think about alternatives.

If your input is supposed to be interpreted as a single text fragment, one fallback solution (for now) would be

data(parse-xml('<x>' || $string || '</x>'))

Zimmel, Daniel <D.Zimmel@esvmedien.de> wrote on Tue, 21 Nov 2023, 18:34:

Thanks for the insight!

I can see the benefit with your example. If you look at my example, though, it is clearly eating the text (“DUMMY”). That might be an edge case, but it is obviously a problem if you expect the function to raise an error for input that is not well-formed: some text has silently been deleted.

Daniel

From: Christian Grün <christian.gruen@gmail.com>
Sent: Tuesday, 21 November 2023 16:59
To: Zimmel, Daniel <D.Zimmel@ESVmedien.de>
Cc: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Bug in parse-xml-fragment() and ampersand entity?

Hi Daniel,

Yes, I assume we’ll need to call it a bug… although we are aware that what BaseX currently does is out-of-spec behavior. The function fn:parse-xml-fragment is based on our internal XML parser, which is much faster than the standard XML parser (in particular for small input), and it tolerates input that’s not perfectly well-formed. In addition, it accepts HTML entities without a linked DTD:

parse-xml-fragment(`&auml;`)

We should at least document the behavior or (better) introduce a custom BaseX function for it.

Hope this helps (for now),

Christian

On Tue, Nov 21, 2023 at 3:17 PM Zimmel, Daniel <D.Zimmel@esvmedien.de> wrote:

Hi,

is this a bug?

Query:
parse-xml-fragment('Tom & Jerry')

Result:
Tom ? Jerry

Same result with:
parse-xml-fragment('Tom &DUMMY; Jerry')

BaseX 10.7

Saxon complains correctly that the resulting document node is not well-formed.
BaseX should also return an error, shouldn't it?

Best, Daniel