New subject: Fwd: whitespace around comments

16 Apr 2013


Oops -- just sent this to Christian only, when it was meant for the list --
---------- Forwarded message ----------
From: Wendell Piez wapiez@wendellpiez.com
Date: Tue, Apr 16, 2013 at 11:15 AM
Subject: Re: [basex-talk] whitespace around comments
To: Christian Grün christian.gruen@gmail.com
Hi,
Thanks to Cerstin for once again urging that CHOP not be set by
default, and to Liam for reminding readers that this is *not* a new
issue in markup processing -- indeed it was in view of problems with
whitespace in earlier technologies (can anyone say "pernicious mixed
content"?) that the XML Rec says what it does -- including section
http://www.w3.org/TR/REC-xml/#sec-white-space.
And thanks also to Christian for reminding us why much of the time,
whitespace chopping is innocuous and helpful.
Personally, I'm in an uneasy place with respect to these issues. I
wholeheartedly agree with Cerstin and Liam that chopping should not be
on by default. Leaving it on by default is wrong and non-conformant.
However, I'm also working with an application in which it's very
convenient to have it -- not only is the data highly structured and
controlled -- no mixed content here -- but also part of its job is to
offer clean representations of external data that is much dirtier and
cruftier. It is supposed to clear bad data away and clean it up, so
managing the whitespace on ingest is just part of what we are doing.
For both scaling and usability reasons (as Christian has suggested),
whitespace chopping is nice to have.
Yet at the same time, we are already calling XSLT for uses in which
the chopping is destructive, and the day is not very far off when we
will have mixed content in our primary data set too.
Liam points out something very important: it is possible in principle
to distinguish between whitespace that can be safely discarded (by
design) and whitespace that can't -- if you have a schema or other
specification that represents this design.
As he notes, the XML Rec distinguishes between "significant" and
"insignificant" whitespace by reference to content models that do and
don't include #PCDATA (that is, whitespace that appears in "mixed
content" or in "element content", respectively; cf.
http://www.w3.org/TR/REC-xml/#dt-elemcontent). If your content model
for div says (p+), then whitespace between the 'p' element children of
a 'div' (but not inside them) may often be judged safe to discard. (At
least in a system in which a schema is used as a warrant of fitness
for processing.)
When technologies such as XQuery or XSLT are designed to work with and
without schemas, however -- or where schemas cannot be considered as
reliable indicators of markup semantics -- even relying on this
mechanism can't solve the problem (to say nothing of deciding which
schema languages you support). However, it can help to mitigate it.
Then too, even XSLT 1.0 has strip-space and preserve-space
configuration to indicate to a processor where it can "chop"
whitespace. While it's a bit crude (it treats all elements with the
same name the same), it can be useful.
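
To make the name-based approach concrete, here is a minimal sketch in Python (standard library xml.dom.minidom), not XSLT and not anything BaseX provides. The strip_space() helper and the sample document are hypothetical; the helper only imitates the effect of declaring strip-space for 'div', or equivalently of trusting a content model of (p+) for 'div'.

```python
from xml.dom import minidom

# The four characters that XML itself treats as whitespace.
XML_WS = " \t\r\n"

def strip_space(node, strip_names):
    """Drop whitespace-only text children of elements named in strip_names.

    Hypothetical helper: whitespace inside any other element is kept.
    """
    for child in list(node.childNodes):
        if child.nodeType == child.ELEMENT_NODE:
            strip_space(child, strip_names)
        elif (child.nodeType == child.TEXT_NODE
              and node.nodeType == node.ELEMENT_NODE
              and node.tagName in strip_names
              and not child.data.strip(XML_WS)):
            node.removeChild(child)

SOURCE = "<div>\n  <p>Some <b>bold</b> <i>text</i>.</p>\n  <p>More.</p>\n</div>"

doc = minidom.parseString(SOURCE)
strip_space(doc, {"div"})  # chop only between the children of 'div'
print(doc.documentElement.toxml())
# <div><p>Some <b>bold</b> <i>text</i>.</p><p>More.</p></div>
```

The indentation between the 'p' elements is gone, but the single space between the 'b' and 'i' elements inside the first 'p' survives. A blanket chop would remove that space as well and run the two words together.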
Over the longer term, therefore, I think that (1) CHOP needs to be
"false" by default, (2) it should be possible to turn it on (just as I
am learning how to turn it off), and (3) we need more flexible and
configurable means of controlling where it applies, with and without
schemas to reference.
Cheers, Wendell
On Sat, Apr 13, 2013 at 7:05 AM, Christian Grün
christian.gruen@gmail.com wrote:
...
I’d like to add some more info on why we initially decided to chop
whitespace, and why a sudden change of the default value may break
existing applications (if you know the details, simply skip this
section):
Many XML documents contain whitespace-only text nodes for properly
indenting elements. In highly structured data (i.e., when not working
with mixed content), these nodes are in fact completely irrelevant.
For example, if the following document…
<xml>
  <a>X</a>
</xml>
…is parsed with CHOP set to true, we will get a document with a single
text node. The following query…
for $t in //text()
  return replace node $t with 'x'
…will generate the following result:
<xml>
  <a>x</a>
</xml>
If we set CHOP to false, the document will have three text nodes, two
of them whitespace-only, and the same query will create the following
result document:
<xml>x<a>x</a>x</xml>
This is just one example to demonstrate that a sudden change of the
default for chop would most probably lead to unwanted side effects in
existing applications. Another side effect: databases are expected to
increase in size, as all whitespace nodes will get their own node ids,
will be fully stored and indexed, etc.
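
The node counts above can be reproduced outside BaseX. What follows is a minimal sketch in Python (standard library xml.dom.minidom), not BaseX code. The chop() helper is hypothetical and only approximates CHOP by dropping whitespace-only text nodes, and replace_all_text() stands in for the update query. The output is not indented, because nothing here re-indents on serialization.

```python
from xml.dom import minidom

DOC = "<xml>\n  <a>X</a>\n</xml>"
XML_WS = " \t\r\n"  # the characters XML treats as whitespace

def text_nodes(node):
    """Return all text nodes below node, in document order."""
    found = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            found.append(child)
        else:
            found.extend(text_nodes(child))
    return found

def chop(node):
    """Remove whitespace-only text nodes (rough stand-in for CHOP = true)."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE:
            if not child.data.strip(XML_WS):
                node.removeChild(child)
        else:
            chop(child)

def replace_all_text(doc, value):
    """Overwrite every text node with value and serialize the result."""
    for t in text_nodes(doc):
        t.data = value
    return doc.documentElement.toxml()

kept = minidom.parseString(DOC)      # whitespace preserved
chopped = minidom.parseString(DOC)   # whitespace-only nodes removed
chop(chopped)

print(len(text_nodes(chopped)), len(text_nodes(kept)))  # 1 3
print(replace_all_text(chopped, "x"))  # <xml><a>x</a></xml>
print(replace_all_text(kept, "x"))     # <xml>x<a>x</a>x</xml>
```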
However, I completely agree that the removal of whitespace may lead
to serious changes in mixed content, and I readily admit that we
weren’t aware of all the implications some years ago when we
started off designing the database. While I still believe that our
storage copes pretty well with today’s requirements, I would love to
have some weeks off to completely rebuild it, and include
optimizations for all kinds of features that are relevant today
(including larger ranges for node ids and namespaces, or support for
other tree formats such as JSON).
Thanks for reading,
Christian
___________________________
On Sat, Apr 13, 2013 at 8:28 AM, Liam R E Quin liam@w3.org wrote:
...
On Fri, 2013-04-05 at 11:31 +0200, Dirk Kirsten wrote:
...
So if you could point out some details as to why this is not conforming
behaviour, this would be interesting.
It's a requirement in the XML Spec that the XML parser pass all
whitespace back to the application. Some whitespace may be marked as not
significant - that is only possible if there's a DTD and the space is in
a context where only elements would be valid, not #PCDATA. There's no
formal specification, although constructing an XDM instance from an
infoset, and constructing an infoset from XML, does not entail
discarding these spaces:
Chopping internal whitespace nodes in mixed content contexts is not
sanctioned by any version of any XML specification, with any setting of
xml:space. I think the onus would be on you to justify the non-standard
behaviour.
On the other hand I can see its uses too. But I don't want it, and
always turn it off with BaseX :-)
Best,
Liam
--
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org freenode/#xml

BaseX-Talk mailing list
BaseX-Talk@mailman.uni-konstanz.de
https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
--
Wendell Piez | http://www.wendellpiez.com
XML | XSLT | electronic publishing
Eat Your Vegetables
_____oo_________o_o___ooooo____ooooooo_^