Yes, you are certainly right. I think it was around 2007 when we started chopping whitespace by default, although we knew it didn't comply with the specification. One reason was that we rarely worked with mixed-content data at that time, and the whitespace indentation increased the size of databases and led to worse rendering results in the built-in visualizations (our first users were confused about that).
Maybe we’ll switch the default in a future version of BaseX.
Jos van den Oever jos@vandenoever.info wrote on Tue, 16 Feb 2021, 23:36:
Thanks for the context.
Still, it does not explain the difference in behavior between doc() and parse-xml().
As far as I understand the XDM specification, whitespace may be ignored by the parser if there is a DTD or XML Schema that says that an element is not PCDATA (DTD) or mixed (XML Schema). In the absence of (support for) schemas, all whitespace should be left in. Wendell Piez describes this in great detail.
Whitespace in XML is tricky. E.g. indenting XML cannot be done well without knowing which elements are PCDATA/mixed.
Now that I know about the CHOP option, I can use BaseX predictably. And the legacy reasons for keeping it set are understandable.
Best regards, Jos
On Tuesday, 16 February 2021 23:10:05 CET Christian Grün wrote:
There is an old (and still open) issue on GitHub [1] that might give you some more insight into the history of whitespace chopping in BaseX.
Hope this helps, Christian
[1] https://github.com/BaseXdb/basex/issues/913
Jos van den Oever jos@vandenoever.info wrote on Tue, 16 Feb 2021, 22:41:
Hi Christian,
Yes, writing 'CHOP=OFF' in .basex stops the vanishing of whitespace.
But where in the XQuery or XDM spec does it say that whitespace handling when parsing is implementation dependent?
Cheers, Jos
On Tuesday, 16 February 2021 22:10:30 CET Christian Grün wrote:
Hi Jos,
Whitespaces will be preserved if the CHOP option is disabled. You can make this a default by adding CHOP=false in your .basex configuration file [1,2].
Hope this helps, Christian
[1] https://docs.basex.org/wiki/Full-Text#Mixed_Content
[2] https://docs.basex.org/wiki/Configuration
Jos van den Oever jos@vandenoever.info wrote on Tue, 16 Feb 2021, 22:00:
Dear all,
First off: BaseX is great to work with. I use it for a few statically generated websites.
But I recently found what might be a bug.
Some whitespace vanishes when loading xml files. E.g. this xml file:
<a> a b <a> c </a> d e </a>
run like this:
doc('test.xml')
gives:
<a>a b<a>c</a>d e</a>
But running this:
parse-xml('<a> a b <a> c </a> d e </a>')
retains the whitespace.
I've tested this with BaseX 7.0, 8.0, 9.0 and 9.4.6.
Running this in saxon-he-10.3.jar retains the whitespace.
I can work around this issue by placing xml:space="preserve" in the document element.
I cannot come up with a scenario in which discarding whitespace during parsing is ok when no DTD or XML Schema is provided.
Best regards, Jos
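
For readers who want to reproduce the two results reported above outside BaseX, here is a minimal sketch in Python. It is not BaseX's implementation: `chop` is a hypothetical helper that only emulates the effect seen in the thread (leading and trailing whitespace trimmed from every text node), and Python's xml.etree.ElementTree stands in for a parser that keeps all whitespace when no DTD or schema is present. The sketch does not model xml:space="preserve".

```python
import xml.etree.ElementTree as ET

# The example document from the original report.
SOURCE = "<a> a b <a> c </a> d e </a>"


def chop(elem):
    """Emulate whitespace chopping (hypothetical helper, not BaseX code).

    Strips leading/trailing whitespace from each text node and drops
    text nodes that become empty.
    """
    for node in elem.iter():
        node.text = (node.text or "").strip() or None
        node.tail = (node.tail or "").strip() or None
    return elem


# Plain parse: ElementTree keeps every whitespace character.
preserved_xml = ET.tostring(ET.fromstring(SOURCE), encoding="unicode")

# Parse followed by the emulated chopping step.
chopped_xml = ET.tostring(chop(ET.fromstring(SOURCE)), encoding="unicode")

print(preserved_xml)  # <a> a b <a> c </a> d e </a>
print(chopped_xml)    # <a>a b<a>c</a>d e</a>
```

The first line of output corresponds to what parse-xml() returned in the report, the second to what doc() returned with the default CHOP setting.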