Oops -- just sent this to Christian only, when it was meant for the list --
---------- Forwarded message ----------
From: Wendell Piez wapiez@wendellpiez.com
Date: Tue, Apr 16, 2013 at 11:15 AM
Subject: Re: [basex-talk] whitespace around comments
To: Christian Grün christian.gruen@gmail.com
Hi,
Thanks to Cerstin for once again urging that CHOP not be set by default, and to Liam for reminding readers that this is *not* a new issue in markup processing -- indeed it was in view of problems with whitespace in earlier technologies (can anyone say "pernicious mixed content"?) that the XML Rec says what it does -- including section http://www.w3.org/TR/REC-xml/#sec-white-space.
And thanks also to Christian for reminding us why much of the time, whitespace chopping is innocuous and helpful.
Personally, I'm in an uneasy place with respect to these issues. I wholeheartedly agree with Cerstin and Liam that chopping should not be on by default. Leaving it on by default is wrong and non-conformant.
However, I'm also working with an application in which it's very convenient to have it. Not only is the data highly structured and controlled (no mixed content here), but part of the application's job is to offer clean representations of external data that is much dirtier and cruftier. It is supposed to clear bad data away and clean it up, so managing the whitespace on ingest is just part of what we are doing. For reasons of both scaling and usability (as Christian has suggested), whitespace chopping is nice to have.
Yet at the same time, we are already calling XSLT for uses in which the chopping is destructive, and the day is not very far off when we will have mixed content in our primary data set too.
Liam points out something very important: it is possible in principle to distinguish between whitespace that can be safely discarded (by design) and whitespace that can't -- if you have a schema or other specification that represents this design.
As he notes, the XML Rec distinguishes between "significant" and "insignificant" whitespace by reference to content models that do and don't include #PCDATA (that is, whitespace that appears in "element content" or "mixed content"; cf http://www.w3.org/TR/REC-xml/#dt-elemcontent). If your content model for div says (p+), then whitespace between the 'p' element children of a 'div' (but not inside them) may often be judged safe to discard. (At least in a system in which a schema is used as a warrant of fitness for processing.)
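A minimal sketch of the distinction described above, using only the JDK's built-in DOM parser rather than BaseX. The sample document and its DTD are hypothetical; the point is that a validating parser knows that div has element content (p+) and can drop whitespace there, while whitespace inside the mixed-content p element is left alone.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class ElementContentWhitespace {

    // Hypothetical sample: div has element content (p+), p has mixed content.
    static final String XML =
          "<!DOCTYPE div [\n"
        + "  <!ELEMENT div (p+)>\n"
        + "  <!ELEMENT p (#PCDATA | em)*>\n"
        + "  <!ELEMENT em (#PCDATA)>\n"
        + "]>\n"
        + "<div>\n"
        + "  <p><em>a</em> <em>b</em></p>\n"
        + "  <p>c</p>\n"
        + "</div>";

    /** Parses the sample with DTD validation and returns its root element. */
    static Element parse(boolean dropElementContentWhitespace) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Validation is required so that the parser knows the content models.
            factory.setValidating(true);
            factory.setIgnoringElementContentWhitespace(dropElementContentWhitespace);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXParseException {
                    throw e; // treat validity errors as fatal in this sketch
                }
            });
            return builder.parse(new InputSource(new StringReader(XML))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Counts the text nodes that are direct children of the given node. */
    static int textChildren(Node parent) {
        int count = 0;
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Element kept = parse(false);
        Element dropped = parse(true);
        System.out.println("text children of div, whitespace kept:    " + textChildren(kept));
        System.out.println("text children of div, whitespace dropped: " + textChildren(dropped));
        System.out.println("text children of first p (mixed content): "
            + textChildren(dropped.getElementsByTagName("p").item(0)));
    }
}
```

Without a DTD (or with a non-validating parser) this switch has nothing to go on, which is exactly the limitation raised in the next paragraph.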
When technologies such as XQuery or XSLT are designed to work with and without schemas, however -- or where schemas cannot be considered as reliable indicators of markup semantics -- even relying on this mechanism can't solve the problem (to say nothing of deciding which schema languages you support). However, it can help to mitigate it.
Then too, even XSLT 1.0 has strip-space and preserve-space configuration to indicate to a processor where it can "chop" whitespace. While it's a bit crude (it treats all elements with the same name the same), it can be useful.
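As a rough illustration of the XSLT mechanism, here is a sketch that runs an identity transformation through whatever JAXP XSLT processor is on the classpath, once without and once with an xsl:strip-space declaration. The class and method names are made up for the example, and the input is the small document used elsewhere in this thread.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class StripSpaceDemo {

    /** Builds an identity stylesheet with an optional extra top-level declaration. */
    static String stylesheet(String topLevelDeclaration) {
        return "<xsl:stylesheet version=\"1.0\""
            + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
            + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\" indent=\"no\"/>"
            + topLevelDeclaration
            + "<xsl:template match=\"@*|node()\">"
            + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";
    }

    /** Applies the stylesheet to the XML string and returns the serialized result. */
    static String transform(String xslt, String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<xml> <a>X</a> </xml>";
        // No declaration: whitespace-only text nodes are copied through.
        System.out.println(transform(stylesheet(""), xml));
        // strip-space applies to every element named "xml", wherever it occurs.
        System.out.println(transform(stylesheet("<xsl:strip-space elements=\"xml\"/>"), xml));
    }
}
```

Because the declaration works by element name only, it cannot distinguish two elements of the same name that are used differently, which is the crudeness mentioned above.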
Over the longer term, therefore, I think that (1) CHOP needs to be "false" by default, (2) it should be possible to turn it on (just as I am learning how to turn it off), and (3) we need more flexible and configurable means for discriminating how it should work, with and without schemas to reference.
Cheers, Wendell
On Sat, Apr 13, 2013 at 7:05 AM, Christian Grün christian.gruen@gmail.com wrote:
I’d like to add some more info on why we initially decided to chop whitespace, and why a sudden change of the default value may break existing applications (if you know the details, simply skip this section):
Many XML documents contain whitespace-only text nodes for properly indenting elements. In highly structured data (i.e., when not working with mixed content), these nodes are in fact completely irrelevant. For example, if the following document…
<xml> <a>X</a> </xml>
…is parsed with CHOP set to true, we will get a document with a single text node. The following query…
for $t in //text() return replace node $t with 'x'
…will generate the following result:
<xml> <a>x</a> </xml>
If we set CHOP to false, the document will have three text nodes, two of them whitespace-only, and the same query will create the following result document:
<xml>x<a>x</a>x</xml>
This is just one example to demonstrate that a sudden change of the default for chop would most probably lead to unwanted side effects in existing applications. Another side effect: databases are expected to increase in size, as all whitespace nodes will get their own node ids, will be fully stored and indexed, etc.
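The effect can be reproduced outside BaseX with the JDK's DOM API. The sketch below is only an emulation under stated assumptions: dropWhitespaceOnlyText is a hypothetical helper standing in for chopping at parse time (it is not how BaseX implements CHOP), and replaceText plays the role of the update query.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class ChopEmulation {

    /** Parses an XML string and returns its root element (no DTD, no schema). */
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Removes every text node that consists of whitespace only. */
    static void dropWhitespaceOnlyText(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before a possible removal
            if (child.getNodeType() == Node.TEXT_NODE) {
                if (child.getNodeValue().trim().isEmpty()) node.removeChild(child);
            } else {
                dropWhitespaceOnlyText(child);
            }
            child = next;
        }
    }

    /** Sets every text node to the given value and returns how many were changed. */
    static int replaceText(Node node, String value) {
        int count = 0;
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) {
                c.setNodeValue(value);
                count++;
            } else {
                count += replaceText(c, value);
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String xml = "<xml> <a>X</a> </xml>";

        Element chopped = parse(xml);
        dropWhitespaceOnlyText(chopped);
        System.out.println("chopped:   " + replaceText(chopped, "x") + " text node(s) replaced");

        Element unchopped = parse(xml);
        System.out.println("unchopped: " + replaceText(unchopped, "x") + " text node(s) replaced");
    }
}
```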
However, I completely agree that removing whitespace may seriously alter mixed content, and I readily admit that we were not aware of all the implications some years ago when we started designing the database. While I still believe that our storage copes pretty well with today's requirements, I would love to have some weeks off to completely rebuild it and include optimizations for all kinds of features that are relevant today (including larger ranges for node ids and namespaces, or support for other tree formats such as JSON).
Thanks for reading, Christian
On Sat, Apr 13, 2013 at 8:28 AM, Liam R E Quin liam@w3.org wrote:
On Fri, 2013-04-05 at 11:31 +0200, Dirk Kirsten wrote:
So if you could point out some details as why this is not conforming behaviour, this would be interesting.
It's a requirement in the XML Spec that the XML parser pass all whitespace back to the application. Some whitespace may be marked as not significant - that is only possible if there's a DTD and the space is in a context where only elements would be valid, not #PCDATA. There's no formal specification, although constructing an XDM instance from an infoset, and constructing an infoset from XML, does not entail discarding these spaces: Chopping internal whitespace nodes in mixed content contexts is not sanctioned by any version of any XML specification, with any setting of xml:space. I think the onus would be on you to justify the non-standard behaviour.
On the other hand I can see its uses too. But I don't want it, and always turn it off with BaseX :-)
Best,
Liam
-- Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/ Pictures from old books: http://fromoldbooks.org/ Ankh: irc.sorcery.net irc.gnome.org freenode/#xml
BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
-- Wendell Piez | http://www.wendellpiez.com XML | XSLT | electronic publishing Eat Your Vegetables _____oo_________o_o___ooooo____ooooooo_^
Hi Wendell,
thanks for your point of view. If you decide not to introduce a schema for your data, and if you have the chance to prepare your input before adding it to the database, you may now mark all your mixed content with xml:space="preserve".
One question to Liam: do you remember why "strip" is not a valid option for the xml:space attribute?
Christian
Christian,
Yes, being able to add xml:space="preserve" should help a lot. (At any rate if done carefully: as Cerstin implies these operations can be sensitive. For example TEI has places where element content appears within mixed content, where you will want to put xml:space back to "default".)
The XML Rec (http://www.w3.org/TR/REC-xml/#sec-white-space) says 'the value "default" signals that applications' default white-space processing modes are acceptable for this element' - notice the plural "modes", which I think licenses the chop behavior should someone decide he or she wants it. (Not that they could stop us of course. :-)
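To make the xml:space scoping concrete, here is a sketch of a whitespace chopper that honours the attribute, including the case of switching back to "default" inside a preserved region. It works on a plain JDK DOM tree and is an assumption about how such a rule could be applied, not a description of BaseX internals; the sample document is invented.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class XmlSpaceAwareChop {

    // Hypothetical sample: mixed content in p, element content again in list.
    static final String SAMPLE =
        "<doc> <p xml:space=\"preserve\"><b>bold</b> <i>italic</i> "
        + "<list xml:space=\"default\"> <item>one</item> </list></p> </doc>";

    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Removes whitespace-only text nodes unless xml:space="preserve" is in scope. */
    static void chop(Node node, boolean preserve) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            // getAttribute returns "" when the attribute is absent: scope is inherited.
            String value = ((Element) node).getAttribute("xml:space");
            if (value.equals("preserve")) preserve = true;
            else if (value.equals("default")) preserve = false;
        }
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before a possible removal
            if (child.getNodeType() == Node.TEXT_NODE) {
                if (!preserve && child.getNodeValue().trim().isEmpty()) {
                    node.removeChild(child);
                }
            } else {
                chop(child, preserve);
            }
            child = next;
        }
    }

    /** Parses the XML, chops it, and returns the remaining text content. */
    static String choppedText(String xml) {
        Element root = parse(xml);
        chop(root, false);
        return root.getTextContent();
    }

    public static void main(String[] args) {
        System.out.println("[" + parse(SAMPLE).getTextContent() + "]");
        System.out.println("[" + choppedText(SAMPLE) + "]");
    }
}
```

The spaces between b, i and list survive because p declares "preserve", while the indentation inside doc and list is removed.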
I wonder if there's a way I could ask BaseX to invoke Saxon and ingest its transformation results? So I wouldn't have to cache the stuff on a disk. (Or maybe I should get myself that solid state drive. :-)
Cheers, Wendell
On Thu, Apr 18, 2013 at 4:03 PM, Christian Grün christian.gruen@gmail.com wrote:
Hi Wendell,
thanks for your point of view. If you decide not to introduce a schema for your data, and if you have the chance to prepare your input before adding it to the database, you may now mark all your mixed content with xml:space="preserve".
One question to Liam: do you remember why "strip" is not a valid option for the xml:space attribute?
Christian
Hi Wendell,
I wonder if there's a way I could ask BaseX to invoke Saxon and ingest its transformation results? So I wouldn't have to cache the stuff on a disk. (Or maybe I should get myself that solid state drive. :-)
Yes, there are lots of ways to do things such as pipelining with BaseX, but you’ll have to dig a little bit deeper and do this in Java. I know too little about the Saxon API, but you could e.g. use the following Java code snippet to pass on an input stream to BaseX:
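    import java.io.ByteArrayInputStream;
    import java.io.InputStream;
    import org.basex.core.Context;
    import org.basex.core.cmd.CreateDB;

    // Any InputStream can serve as the database input
    InputStream is = new ByteArrayInputStream("<a/>".getBytes());
    Context ctx = new Context();
    CreateDB cmd = new CreateDB("test");
    cmd.setInput(is);
    cmd.execute(ctx);
    ctx.close();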
Hope this helps, Christian
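For the transformation half of such a pipeline, a sketch along the following lines may be a starting point. It uses only the standard JAXP interface, which Saxon also implements, so it should pick up Saxon if that is the configured TransformerFactory; that part is an assumption and not tested here. The class name and the tiny stylesheet are invented. The result is kept in memory and exposed as an InputStream, which is the kind of object the snippet above hands to cmd.setInput.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class TransformToStream {

    // Hypothetical stylesheet: turns every a element into an empty b element.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
        + "<xsl:template match=\"a\"><b/></xsl:template>"
        + "</xsl:stylesheet>";

    /** Runs the transformation and returns the serialized result as bytes. */
    static byte[] transformToBytes(String xslt, String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Wraps the in-memory result in a stream; nothing is written to disk. */
    static InputStream transform(String xslt, String xml) {
        return new ByteArrayInputStream(transformToBytes(xslt, xml));
    }

    public static void main(String[] args) throws Exception {
        InputStream is = transform(XSLT, "<a/>");
        // This stream could be passed on as database input instead of being printed.
        System.out.println(new String(is.readAllBytes(), "UTF-8"));
    }
}
```

Buffering the whole result in a byte array is the simplest option; for very large results a piped stream or a temporary file would avoid holding everything in memory.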
basex-talk@mailman.uni-konstanz.de