Hi Liam,
XML's handling of space characters is understandably an improvement over SGML, but it still causes problems sometimes and seems more complex than it perhaps needs to be. Although that ship has long since sailed, out of curiosity, do you recall whether there were any suggestions for a rule to ensure that spaces (and the absence of spaces) would be consistently preserved without relying on a DTD or Schema?
A relatively safe way to "pretty print" (indent) XML is to insert or remove whitespace only inside tags, between an element's name or attributes and the closing >, and where spaces already exist in text nodes. Changing the whitespace within an element's opening tag adjusts formatting without inserting or removing text nodes. For example:
<sec sec-type="example"><p>pretty print <b>n</b><sup>2</sup>.</p></sec>
This can be indented without changing the node tree:
<sec sec-type="example"
  ><p
    >pretty print <b >n</b><sup >2</sup>.</p></sec>
However, I haven't seen any XML editor or processor implement this approach.
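To make the idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not a description of any existing tool: the function name indent_inside_tags is made up, and the regular expression assumes input with no comments, CDATA sections, processing instructions, or DOCTYPE, and no ">" inside attribute values. It puts the line break and indentation inside each start tag, just before the closing >, and copies all text between tags verbatim.

```python
import re
import xml.etree.ElementTree as ET

# Matches start tags, empty-element tags, and end tags.
# Assumption: no comments, CDATA, PIs, DOCTYPE, or '>' in attribute values.
TAG = re.compile(r"<(/?)([^\s<>/!?]+)([^<>]*?)(/?)>")


def indent_inside_tags(xml, step="  "):
    """Indent by adding whitespace only inside start tags (hypothetical helper)."""
    out, depth, pos = [], 0, 0
    for m in TAG.finditer(xml):
        out.append(xml[pos:m.start()])  # character data is copied unchanged
        pos = m.end()
        closing, name, rest, empty = m.groups()
        if closing:
            depth -= 1
            out.append(m.group(0))      # end tags are left as they are
        elif empty:
            out.append(m.group(0))      # empty-element tags are left as they are
        else:
            depth += 1
            # The line break sits before '>', so it is markup, not a text node.
            out.append("<" + name + rest.rstrip() + "\n" + step * depth + ">")
    out.append(xml[pos:])
    return "".join(out)


source = '<sec sec-type="example"><p>pretty print <b>n</b><sup>2</sup>.</p></sec>'
indented = indent_inside_tags(source)
print(indented)

# Both strings parse to the same tree, so they serialize identically.
assert ET.tostring(ET.fromstring(indented)) == ET.tostring(ET.fromstring(source))
```

Unlike the hand-written example above, this sketch breaks every start tag, including inline ones such as b and sup. A real tool would probably want a rule for which elements to break and which to leave on one line.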
Best regards, Vincent
_____________________________________________
Vincent M. Lizzi
Head of Information Standards | Taylor & Francis Group
vincent.lizzi@taylorandfrancis.com
Information Classification: General
From: BaseX-Talk basex-talk-bounces@mailman.uni-konstanz.de On Behalf Of Liam R. E. Quin
Sent: Thursday, November 17, 2022 4:44 PM
To: BaseX basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Pretty print
On Thu, 2022-11-17 at 19:05 +0100, Christian Grün wrote:
> > But is there no way to declare that when I import a file to the database?
> There's currently no way to supply this for specific elements.
Both XML Schema and DTDs do have a way to say whether text is allowed in a particular context, and the XML loader could use this information to discard whitespace-only text nodes in places where text isn't allowed.
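As a rough sketch of what such a loader step might look like, the Python below drops whitespace-only text inside elements that are known to have element-only content. It does not read a DTD or schema: the element_only set is a hand-written stand-in for that information, and the function name strip_ignorable_whitespace is invented for this illustration.

```python
import xml.etree.ElementTree as ET

XML_WHITESPACE = " \t\r\n"  # the four whitespace characters XML recognises


def strip_ignorable_whitespace(root, element_only):
    """Remove whitespace-only text directly inside element-only elements.

    element_only is an assumed set of element names whose content model
    allows child elements but no text (what a DTD or schema would tell us).
    """
    for elem in root.iter():
        if elem.tag not in element_only:
            continue  # mixed-content elements keep every space
        if elem.text is not None and not elem.text.strip(XML_WHITESPACE):
            elem.text = None            # whitespace before the first child
        for child in elem:
            if child.tail is not None and not child.tail.strip(XML_WHITESPACE):
                child.tail = None       # whitespace after each child
    return root


doc = '<sec>\n  <p>pretty <b>n</b> <sup>2</sup></p>\n</sec>'
root = strip_ignorable_whitespace(ET.fromstring(doc), element_only={"sec"})
print(ET.tostring(root, encoding="unicode"))
```

Here sec is treated as element-only, so its indentation is dropped, while p is treated as mixed content, so the space between b and sup stays.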
On how it came to be -
SGML had some really bad whitespace rules, including what was called "pernicious whitespace" - whitespace where the parser needed backtracking to know if it was text or not, but the parsers didn't actually do backtracking so they flagged it as an error. This was a very common source of problems for users.
We eliminated this for XML by requiring #PCDATA (i.e. text) always to be in a repeatable or-group, so <!ELEMENT boy (#PCDATA|noise|dirt)*> and not <!ELEMENT boy (noise*, dirt*, #PCDATA)> (to paraphrase Ambrose Bierce's Devil's Dictionary, which defined a boy as a noise with dirt on it).
liam
--
Liam Quin, https://www.delightfulcomputing.com/
Available for XML/Document/Information Architecture/XSLT/XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.
Barefoot Web-slave, antique illustrations: http://www.fromoldbooks.org