Hi Liam,

XML’s handling of space characters is understandably an improvement over SGML’s, but it still causes problems sometimes and seems more complex than it perhaps could be. Although that ship has long since sailed, out of curiosity, do you recall whether there were any suggestions for a rule to ensure that spaces (and the absence of spaces) would be consistently preserved without relying on a DTD or schema?

A relatively safe way to “pretty print” (indent) XML is to insert or remove whitespace in only two places: inside a tag, between the element’s name (or its last attribute) and the closing >, and in text nodes where whitespace already exists. Changing the whitespace inside a tag adjusts the formatting without inserting or removing text nodes. For example:

<sec sec-type="example"><p>pretty print <b>n<sup>2</sup></b>.</p></sec>

can be indented without changing the node tree:

<sec sec-type="example"
  ><p
    >pretty
    print <b
      >n<sup
        >2</sup></b>.</p></sec>

However I haven’t seen any XML editor or processor implement this approach.
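
The claim can at least be checked by hand. Below is a minimal sketch, assuming Python's standard xml.etree.ElementTree and assuming the sample is marked up with p, b and sup elements closed as shown. The helper names collapse, shape and count_text_nodes are made up for illustration. It parses both serializations and compares them after collapsing runs of whitespace inside existing text nodes:

```python
import re
import xml.etree.ElementTree as ET

# Illustrative sample; the placement of the closing tags is an assumption.
FLAT = '<sec sec-type="example"><p>pretty print <b>n<sup>2</sup></b>.</p></sec>'

# Same document, with line breaks placed only inside start tags
# and at one point where a space already existed in a text node.
INDENTED = """<sec sec-type="example"
  ><p
    >pretty
    print <b
      >n<sup
        >2</sup></b>.</p></sec>"""


def collapse(text):
    """Collapse each run of whitespace to one space; a missing node becomes ''."""
    return re.sub(r"\s+", " ", text or "")


def shape(elem):
    """Describe an element as nested data: tag, attributes, text, children, tail."""
    return (
        elem.tag,
        sorted(elem.attrib.items()),
        collapse(elem.text),
        [shape(child) for child in elem],
        collapse(elem.tail),
    )


def count_text_nodes(root):
    """Count the text and tail strings that the parser actually created."""
    return sum(
        (e.text is not None) + (e.tail is not None) for e in root.iter()
    )


flat = ET.fromstring(FLAT)
indented = ET.fromstring(INDENTED)

print(shape(flat) == shape(indented))
print(count_text_nodes(flat), count_text_nodes(indented))
```

The comparison collapses whitespace because the line break between "pretty" and "print" does change the content of that one text node. What stays the same is the number and position of the nodes, so no whitespace-only text nodes appear.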

Best regards,

Vincent

_____________________________________________

Vincent M. Lizzi

Head of Information Standards | Taylor & Francis Group

vincent.lizzi@taylorandfrancis.com


From: BaseX-Talk <basex-talk-bounces@mailman.uni-konstanz.de> On Behalf Of Liam R. E. Quin
Sent: Thursday, November 17, 2022 4:44 PM
To: BaseX <basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] Pretty print

On Thu, 2022-11-17 at 19:05 +0100, Christian Grün wrote:
> >
> > But is there no way to declare that when I import a file to the
> > database?
> >
>
> There's currently no way to supply this for specific elements

Both XML Schema and DTDs do have a way to say whether text is allowed
in a particular context, and the XML loader could use this information
to discard whitespace-only text nodes where text isn't allowed.

On how it came to be -

SGML had some really bad whitespace rules, including what was called
"pernicious whitespace" - whitespace where the parser needed
backtracking to know if it was text or not, but the parsers didn't
actually do backtracking so they flagged it as an error. This was a
very common source of problems for users.

We eliminated this for XML by requiring #PCDATA (i.e. text) always to
be in a repeatable or-group, so
<!ELEMENT boy (noise|dirt|#PCDATA)*>
and not
<!ELEMENT boy (noise*, dirt*, #PCDATA)>
(to paraphrase Ambrose Bierce's Devil's Dictionary, which defined a boy
as a noise with dirt on it).

liam

--
Liam Quin, https://www.delightfulcomputing.com/
Available for XML/Document/Information Architecture/XSLT/
XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.
Barefoot Web-slave, antique illustrations: http://www.fromoldbooks.org