Michael (other than me :-)) you are obviously right.
On Fri, Apr 5, 2013 at 12:29 PM, Michael Piotrowski <mxp@cl.uzh.ch> wrote:
Dirk,
On 2013-04-05, Dirk Kirsten <dk@basex.org> wrote:
> You are certainly right that with mixed content and the example you have
> given here chopping does make a semantic difference.
> However, you can disable this behaviour so BaseX does what you want. So the
> only reason I see why one should change the default behaviour would be
> because the default is not conformant to some XML standard. However, I
> cannot find any specifics in the spec about which is the expected behaviour,
> so in my opinion BaseX is doing nothing wrong here.
Well, if you agree that chopping may alter the semantics of a document,
wouldn't you agree that applying such a transformation *by default* is a
bad idea?
With respect to the XML specification, section 2.10 "White Space
Handling" says:
    An XML processor MUST always pass all characters in a document that
    are not markup through to the application.
Yes, the spec is vague about whitespace handling, and the existence of
the xml:space attribute shows that different behaviors--including
potentially corrupting ones--are possible. I would therefore interpret
the spec to mean that by default all characters should be preserved, but
that other behaviors are possible.
> I see that this behaviour might be surprising for some users, but this
> might as well be the case if it were the other way round.
No, because their documents wouldn't be corrupted. You can easily
remove all whitespace afterwards if you decide you don't need it, but
once it's gone, it's gone and cannot be restored. That's the problem.
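To make the point concrete, here is a minimal Python sketch. It is only
an illustration, not BaseX code: the helper chop() is hypothetical and
assumes that "chopping" means trimming leading and trailing whitespace
from every text node and dropping nodes that end up empty.

```python
import xml.etree.ElementTree as ET


def chop(root):
    """Hypothetical approximation of whitespace chopping.

    Trims every text node and drops those that become empty.
    ElementTree stores text after a child element in its `tail`.
    """
    for node in root.iter():
        if node.text is not None:
            node.text = node.text.strip() or None
        if node.tail is not None:
            node.tail = node.tail.strip() or None


# Mixed content: the single space between <b> and <i> is significant.
doc = ET.fromstring("<p><b>Hello</b> <i>world</i></p>")

before = "".join(doc.itertext())
chop(doc)
after = "".join(doc.itertext())

print(repr(before))  # 'Hello world'
print(repr(after))   # 'Helloworld'
```

After chopping, nothing in the document records that a space ever stood
between the two elements, so no later processing step can put it back.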
> Additionally, if we changed this now it would break application
> code and unless there is a good reason (i.e. BaseX is actually doing
> something wrong or non-compliant) I don't see why one should change
> the default.
Well, I'm not on a crusade or anything, so if you believe that it's a
good idea to corrupt, by default, all documents containing mixed content
on import, or if this behavior must be kept for compatibility, so be it.
I just wanted to point out that whitespace chopping may, in fact, alter
the semantics of documents--it's not as harmless as it may seem.
Best regards
--
Dr.-Ing. Michael Piotrowski, M.A. <mxp@cl.uzh.ch>
Institute of Computational Linguistics, University of Zurich
Phone +41 44 63-54313 | OpenPGP public key ID 0x1614A044