Dear all,
First off: BaseX is great to work with. I use it for a few statically generated websites.
But I recently found what might be a bug.
Some whitespace vanishes when loading xml files. E.g. this xml file:
```test.xml
<a> a b <a> c </a> d e </a>
```
run like this:
doc('test.xml')
gives:
<a>a b<a>c</a>d e</a>
But running this:
```
parse-xml('<a> a b <a> c </a> d e </a>')
```
retains the whitespace.
I've tested this with BaseX 7.0, 8.0, 9.0 and 9.4.6.
Running this in saxon-he-10.3.jar retains the whitespace.
I can work around this issue by placing xml:space="preserve" in the document element.
I cannot come up with a scenario in which discarding whitespace during parsing is OK when no DTD or XML Schema is provided.
Best regards, Jos
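A quick way to check whether a given setup is affected is to compare the two results directly. This is only a sketch: it assumes test.xml (the file from the report) can be resolved from the query's base URI, and it uses nothing beyond the standard deep-equal function.

```xquery
(: Compare the file as loaded by doc() with the same markup parsed from a string.
   Assumption: test.xml contains exactly the markup shown in the string below. :)
let $from-file   := doc('test.xml')
let $from-string := parse-xml('<a> a b <a> c </a> d e </a>')
return deep-equal($from-file, $from-string)
```

If whitespace is dropped on one path only, as described above, the text nodes differ and the comparison should come out false.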
Hi Jos,
Whitespace will be preserved if the CHOP option is disabled. You can make this the default by adding CHOP=false to your .basex configuration file [1,2].
Hope this helps, Christian
[1] https://docs.basex.org/wiki/Full-Text#Mixed_Content
[2] https://docs.basex.org/wiki/Configuration
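For a single query, the option can presumably also be set in the query prolog rather than globally. The following is a sketch that assumes a BaseX 9.x release in which CHOP is available as a database option and the db prefix is predeclared; test.xml is the file from the original report.

```xquery
(: Assumption: BaseX 9.x, where the CHOP option is exposed as db:chop. :)
declare option db:chop 'false';
doc('test.xml')
```

A prolog declaration of this kind only affects the query it appears in, so the global default stays untouched.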
(In reply to Jos van den Oever jos@vandenoever.info, Tue, 16 Feb 2021, 22:00)
Hi Christian,
Yes, writing 'CHOP=OFF' in .basex stops the vanishing of whitespace.
But where in the XQuery or XDM spec does it say that whitespace handling when parsing is implementation dependent?
Cheers, Jos
(In reply to Christian Grün, Tuesday, 16 February 2021, 22:10:30 CET)
There is an old (and still open) issue on GitHub [1] that might give you some more insight into the history of whitespace chopping in BaseX.
Hope this helps, Christian
[1] https://github.com/BaseXdb/basex/issues/913
(In reply to Jos van den Oever jos@vandenoever.info, Tue, 16 Feb 2021, 22:41)
Thanks for the context.
Still, it does not explain the difference in behavior between doc() and parse-xml().
As far as I understand the XDM specification, whitespace may be ignored by the parser if there is a DTD or XML Schema that says that an element is not PCDATA (DTD) or mixed (XML Schema). In the absence of (support for) schemas, all whitespace should be left in. Wendell Piez has written about this in detail.
Whitespace in XML is tricky. E.g. indenting XML cannot be done well without knowing which elements are PCDATA/mixed.
Now that I know about the CHOP option, I can use BaseX predictably. And the legacy reasons for keeping it set are understandable.
Best regards, Jos
(In reply to Christian Grün, Tuesday, 16 February 2021, 23:10:05 CET)
Yes, you are certainly right. I think it was around 2007 when we started chopping whitespace by default, although we knew it didn't comply with the specification. One reason was that we rarely worked with mixed-content data at that time, and the whitespace indentation increased the size of databases and led to worse rendering results in the built-in visualizations (our first users were confused about that).
Maybe we’ll switch the default in a future version of BaseX.
(In reply to Jos van den Oever jos@vandenoever.info, Tue, 16 Feb 2021, 23:36)
Then to pass the XQuery test suite you probably use CHOP=OFF. Are there other settings needed to be compliant?
(In reply to Christian Grün, Wednesday, 17 February 2021, 00:04:38 CET)
There are some serialization parameters that the test suite seems to rely on:
omit-xml-declaration: no (our default: 'yes')
method: xml (our default: 'basex')
indent: no (our default: 'yes')
You can set those via the SERIALIZER option.
All in all, there are just a few cases in the suite that are affected by whitespace chopping.
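Within a query, these three parameters can also be set with standard XQuery 3.0 serialization declarations instead of the SERIALIZER option. A minimal sketch: test.xml is just the example file from earlier in the thread, and the namespace declaration is spelled out for portability even though BaseX is assumed to predeclare the output prefix.

```xquery
(: Standard serialization namespace; redeclared here so the query is self-contained. :)
declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";

(: The three values the test suite seems to expect, per the list above. :)
declare option output:method "xml";
declare option output:omit-xml-declaration "no";
declare option output:indent "no";

doc('test.xml')
```

Setting them in the prolog keeps the change local to one query, which may be preferable when only a few queries need spec-conformant output.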
(In reply to Jos van den Oever jos@vandenoever.info, Wed, 17 Feb 2021, 8:36 AM)
basex-talk@mailman.uni-konstanz.de