Yes, I can see the problem: &DUMMY; is interpreted as an unknown entity and thus replaced with a question mark (from today's perspective, the Unicode Replacement Character U+FFFD would be a better choice anyway). We'll keep that in mind and think about alternatives.
If your input is supposed to be interpreted as a single text fragment, one fallback solution (for now) would be
data(parse-xml('<x>' || $string || '</x>'))
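As a minimal sketch of that fallback with explicit error handling: it assumes that fn:parse-xml rejects the input with err:FODC0006, the error code the specification defines for input that is not well-formed. The sample value of $string is only an illustration.

```xquery
declare namespace err = "http://www.w3.org/2005/xqt-errors";

(: Illustrative input: the string value contains a bare ampersand. :)
declare variable $string := 'Tom &amp; Jerry';

try {
  (: Wrap the text in a dummy element so that it forms a complete document. :)
  data(parse-xml('<x>' || $string || '</x>'))
} catch err:FODC0006 {
  (: Report the problem instead of silently dropping text. :)
  'Input is not well-formed: ' || $err:description
}
```

Whether to return a message, as done here, or to re-raise the error depends on how the calling code wants to treat invalid input.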
On Tue, Nov 21, 2023 at 6:34 PM Zimmel, Daniel D.Zimmel@esvmedien.de wrote:
Thanks for the insight!
I can see the benefit with your example. In my example, though, the function is clearly eating the text (“DUMMY”). That might be an edge case, but it is obviously a problem when you expect the function to raise an error for input that is not well-formed: some text has silently been deleted.
Daniel
*From:* Christian Grün christian.gruen@gmail.com *Sent:* Tuesday, November 21, 2023 16:59 *To:* Zimmel, Daniel D.Zimmel@ESVmedien.de *Cc:* basex-talk@mailman.uni-konstanz.de *Subject:* Re: [basex-talk] Bug in parse-xml-fragment() and ampersand entity?
Hi Daniel,
Yes, I assume we’ll need to call it a bug, although we know that what BaseX is currently doing is out-of-spec behavior. The function fn:parse-xml-fragment is based on our internal XML parser, which is much faster than the standard XML parser (in particular for small input), and it tolerates input that’s not perfectly well-formed. In addition, it accepts HTML entities without a linked DTD:
parse-xml-fragment(`&auml;`)
We should at least document the behavior or (better) introduce a custom BaseX function for it.
Hope this helps (for now),
Christian
On Tue, Nov 21, 2023 at 3:17 PM Zimmel, Daniel D.Zimmel@esvmedien.de wrote:
Hi,
is this a bug?
Query: parse-xml-fragment('Tom &amp; Jerry')
Result: Tom ? Jerry
Same result with: parse-xml-fragment('Tom &amp;DUMMY; Jerry')
BaseX 10.7
Saxon complains correctly that the resulting document node is not well-formed. BaseX should also return an error, shouldn't it?
Best, Daniel