Pretty print

Giuseppe G. A. Celano

17 Nov 2022

8:10 a.m.

Hi,

I am trying to pretty-print an XML file. I tried the serialization option indent=yes, but it does not work as expected. In BaseX 9, pretty-printing was the default setting: how do I get the same result in BaseX 10 (and later)? Thanks.

Best, Giuseppe


Martin Honnen

17 Nov

8:15 a.m.


Can you show us your code?

For me,

declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";
declare option output:method 'xml';
declare option output:indent 'yes';
declare context item := <root><foo>bar</foo></root>;
.

gives the result

<root>
  <foo>bar</foo>
</root>

in 10.3.

Giuseppe G. A. Celano

8:29 a.m.

Hi,

it is:

declare option output:method 'xml';
declare option output:indent 'yes';

doc("myfile.xml")

Best, Giuseppe


Martin Honnen

8:36 a.m.


How/where are you running that code? In the BaseX GUI? There I get an indented output with similar code.

Jonathan Robie

10:52 a.m.

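Using Martin's simple example, I get the same results he does without explicitly declaring the namespace:

declare option output:method 'xml';
declare option output:indent 'yes';

<root>
  <foo>bar</foo>
</root>

Without the declaration, it's not indented:

<root><foo>bar</foo></root>

But the indentation is quite different from what I see in Saxon or oXygen output when I indent. You see this with more complex examples. For instance, here is an excerpt from a book of the New Testament:

declare option output:method 'xml';
declare option output:indent 'yes';

<book id="JHN"> <sentence> <milestone unit="verse" id="JHN 1:1">JHN 1:1</milestone> Ἐν ἀρχῇ ἦν ὁ Λόγος, καὶ ὁ Λόγος ἦν πρὸς τὸν Θεόν, καὶ Θεὸς ἦν ὁ Λόγος. <wg> <wg role="g" class="group" xml:id="n430010010010170" rule="Conj3CL"> <wg class="cl" xml:id="n430010010010050" head="true" rule="P-VC-S">  <wg role="p" class="pp" xml:id="n430010010010020" head="true" rule="PrepNp">  <w ref="JHN 1:1!1" after=" " class="prep" xml:id="n43001001001" lemma="ἐν" normalized="Ἐν" strong="1722" gloss="In [the]" domain="067002" ln="67.33" morph="PREP" unicode="Ἐν">Ἐν</w> <wg class="np" xml:id="n430010010020011" head="true" rule="N2NP"> <w ref="JHN 1:1!2" after=" " class="noun" type="common" xml:id="n43001001002" lemma="ἀρχή" normalized="ἀρχῇ" strong="746" number="singular" gender="feminine" case="dative" head="true" gloss="beginning" domain="067003" ln="67.65" morph="N-DSF" unicode="ἀρχῇ">ἀρχῇ</w> </wg> </wg> <wg role="vc" class="vp" xml:id="n430010010030011" rule="V2VP">  <w ref="JHN 1:1!3" after=" " class="verb" xml:id="n43001001003" lemma="εἰμί" normalized="ἦν" strong="1510" number="singular" person="third" tense="imperfect" voice="active" mood="indicative" gloss="was" domain="013003" ln="13.69" morph="V-IAI-3S" unicode="ἦν">ἦν</w> </wg> <wg role="s" class="np" xml:id="n430010010040020" articular="true" rule="DetNP">  <w ref="JHN 1:1!4" after=" " class="det" xml:id="n43001001004" lemma="ὁ" normalized="ὁ" strong="3588" number="singular" gender="masculine" case="nominative" gloss="the" domain="092004" ln="92.24" morph="T-NSM" unicode="ὁ">ὁ</w> <wg class="np" xml:id="n430010010050011" head="true" rule="N2NP"> <w ref="JHN 1:1!5" after="," class="noun" type="common" xml:id="n43001001005" lemma="λόγος" normalized="Λόγος" strong="3056" number="singular" gender="masculine" case="nominative" head="true" gloss="Word" domain="033006" ln="33.100" morph="N-NSM" unicode="Λόγος,">Λόγος</w> </wg> </wg> </wg>

With Saxon or oXygen, I get this: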

<book id="JHN"> <sentence> <milestone unit="verse" id="JHN 1:1">JHN 1:1</milestone> Ἐν ἀρχῇ ἦν ὁ Λόγος, καὶ ὁ Λόγος ἦν πρὸς τὸν Θεόν, καὶ Θεὸς ἦν ὁ Λόγος. <wg> <wg role="g" class="group" xml:id="n430010010010170" rule= "conj3cl"> <wg class="cl" xml:id="n430010010010050" head="true" rule= "P-VC-S"> <wg role="p" class="pp" xml:id="n430010010010020" head="true" rule="PrepNp"> <w ref="JHN 1:1!1" after=" " class="prep" xml:id="n43001001001" lemma="ἐν" normalized="Ἐν" strong="1722" gloss="In [the]" domain="067002" ln="67.33" morph="PREP" unicode="Ἐν">Ἐν</w> <wg class="np" xml:id="n430010010020011" head="true" rule="N2NP"> <w ref="JHN 1:1!2" after=" " class="noun" type="common" xml:id="n43001001002" lemma="ἀρχή" normalized="ἀρχῇ" strong="746" number="singular" gender="feminine" case="dative" head="true" gloss="beginning" domain="067003" ln="67.65" morph="N-DSF" unicode="ἀρχῇ">ἀρχῇ</w> </wg> </wg> <wg role="vc" class="vp" xml:id="n430010010030011" rule= "V2VP"> <w ref="JHN 1:1!3" after=" " class="verb" xml:id="n43001001003" lemma="εἰμί" normalized="ἦν" strong="1510" number="singular" person="third" tense="imperfect" voice="active" mood="indicative" gloss="was" domain="013003" ln="13.69" morph="V-IAI-3S" unicode="ἦν">ἦν</w> </wg> <wg role="s" class="np" xml:id="n430010010040020" articular="true" rule="DetNP"> <w ref="JHN 1:1!4" after=" " class="det" xml:id="n43001001004" lemma="ὁ" normalized="ὁ" strong="3588" number="singular" gender="masculine" case="nominative" gloss="the" domain="092004" ln="92.24" morph="T-NSM" unicode="ὁ">ὁ</w> <wg class="np" xml:id="n430010010050011" head="true" rule="N2NP"> <w ref="JHN 1:1!5" after="," class="noun" type="common" xml:id="n43001001005" lemma="λόγος" normalized="Λόγος" strong="3056" number="singular" gender="masculine" case="nominative" head="true" gloss="Word" domain="033006" ln="33.100" morph="N-NSM" unicode="Λόγος,">Λόγος</w> </wg> </wg> </wg> <w ref="JHN 1:1!6" after=" " class="conj" xml:id="n43001001006" lemma="καί" 
normalized="καί" strong="2532" gloss="and" domain="089017" ln="89.93" morph="CONJ" unicode="καὶ">καὶ</w> <wg class="cl" xml:id="n430010010070060" articular="true" rule="S-VC-P"> <wg role="s" class="np" xml:id="n430010010070020" articular="true" rule="DetNP"> <w ref="JHN 1:1!7" after=" " class="det" xml:id="n43001001007" lemma="ὁ" normalized="ὁ" strong="3588" number="singular" gender="masculine" case="nominative" gloss="the" domain="092004" ln="92.24" morph="T-NSM" unicode="ὁ">ὁ</w> <wg class="np" xml:id="n430010010080011" head="true" rule="N2NP"> <w ref="JHN 1:1!8" after=" " class="noun" type="common" xml:id="n43001001008" lemma="λόγος" normalized="Λόγος" strong="3056" number="singular" gender="masculine" case="nominative" head="true" gloss="Word" domain="033006" ln="33.100" morph="N-NSM" unicode="Λόγος">Λόγος</w> </wg> </wg> <wg role="vc" class="vp" xml:id="n430010010090011" rule= "V2VP"> <w ref="JHN 1:1!9" after=" " class="verb" xml:id="n43001001009" lemma="εἰμί" normalized="ἦν" strong="1510" number="singular" person="third" tense="imperfect" voice="active" mood="indicative" gloss="was" domain="085001" ln="85.1" morph="V-IAI-3S" unicode="ἦν">ἦν</w> </wg> <wg role="p" class="pp" xml:id="n430010010100030" articular="true" head="true" rule="PrepNp"> <w ref="JHN 1:1!10" after=" " class="prep" xml:id="n43001001010" lemma="πρός" normalized="πρός" strong="4314" gloss="with" domain="089020" ln="89.112" morph="PREP" unicode="πρὸς">πρὸς</w> <wg class="np" xml:id="n430010010110020" articular="true" head="true" rule="DetNP"> <w ref="JHN 1:1!11" after=" " class="det" xml:id="n43001001011" lemma="ὁ" normalized="τόν" strong="3588" number="singular" gender="masculine" case="accusative" gloss="-" domain="092004" ln="92.24" morph="T-ASM" unicode="τὸν">τὸν</w> <wg class="np" xml:id="n430010010120011" head= "true" rule="N2NP"> <w ref="JHN 1:1!12" after="," class="noun" type="common" xml:id="n43001001012" lemma="θεός" normalized="Θεόν" strong="2316" number="singular" 
gender="masculine" case="accusative" head="true" gloss="God" domain="012001" ln="12.1" morph="N-ASM" unicode="Θεόν,">Θεόν</w> </wg> </wg> </wg> </wg> <w ref="JHN 1:1!13" after=" " class="conj" xml:id="n43001001013" lemma="καί" normalized="καί" strong="2532" gloss="and" domain="089017" ln="89.92" morph="CONJ" unicode="καὶ">καὶ</w> <wg class="cl" xml:id="n430010010140040" rule="P-VC-S"> <wg role="p" class="np" xml:id="n430010010140011" head="true" rule="N2NP"> <w ref="JHN 1:1!14" after=" " class="noun" type="common" xml:id="n43001001014" lemma="θεός" normalized="Θεός" strong="2316" number="singular" gender="masculine" case="nominative" head="true" gloss="God" domain="012001" ln="12.1" morph="N-NSM" unicode="Θεὸς">Θεὸς</w> </wg> <wg role="vc" class="vp" xml:id="n430010010150011" rule= "V2VP"> <w ref="JHN 1:1!15" after=" " class="verb" xml:id="n43001001015" lemma="εἰμί" normalized="ἦν" strong="1510" number="singular" person="third" tense="imperfect" voice="active" mood="indicative" gloss="was" domain="058010" ln="58.67" morph="V-IAI-3S" unicode="ἦν">ἦν</w> </wg> <wg role="s" class="np" xml:id="n430010010160020" articular="true" rule="DetNP"> <w ref="JHN 1:1!16" after=" " class="det" xml:id="n43001001016" lemma="ὁ" normalized="ὁ" strong="3588" number="singular" gender="masculine" case="nominative" gloss="the" domain="092004" ln="92.24" morph="T-NSM" unicode="ὁ">ὁ</w> <wg class="np" xml:id="n430010010170011" head="true" rule="N2NP"> <w ref="JHN 1:1!17" after="." class="noun" type="common" xml:id="n43001001017" lemma="λόγος" normalized="Λόγος" strong="3056" number="singular" gender="masculine" case="nominative" head="true" gloss="Word" domain="033006" ln="33.100" morph="N-NSM" unicode="Λόγος.">Λόγος</w> </wg> </wg> </wg> </wg> </wg> </sentence>


Hans-Juergen Rennau

11:40 a.m.

Sometimes indentation seems to fail because the document already contains whitespace between the elements, which the indentation (correctly) does not touch, resulting in irregular indentation. To exclude this source of confusion, I would remove the whitespace between elements and apply indentation to the result. For example, this query removes the critical whitespace:

declare function local:pretty($n as node()) {
  typeswitch($n)
  (: rebuild documents and elements, processing their children recursively :)
  case document-node() return document {$n/node() ! local:pretty(.)}
  case element() return element {node-name($n)} {$n/(@*|node()) ! local:pretty(.)}
  (: keep a text node only if its parent has no child elements or it contains non-whitespace :)
  case text() return $n[not(../*) or matches(., '\S')]
  default return $n
};
./local:pretty(.)
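For readers outside XQuery, the same rule can be sketched in Python with the standard library's `xml.dom.minidom`, whose `toprettyxml` also wraps existing whitespace text nodes and so produces irregular output on already-indented input. This is only an illustration of the rule from the query above (drop a text node when it is whitespace-only and its parent has element children); the function name `strip_boundary_whitespace` and the sample document are assumptions, not part of BaseX.

```python
from xml.dom import minidom, Node

XML_WHITESPACE = " \t\r\n"

def strip_boundary_whitespace(node):
    """Remove whitespace-only text nodes whose parent has element children."""
    has_elements = any(c.nodeType == Node.ELEMENT_NODE for c in node.childNodes)
    # Iterate over a copy, because removeChild mutates node.childNodes.
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE:
            if has_elements and not child.data.strip(XML_WHITESPACE):
                node.removeChild(child)
        else:
            strip_boundary_whitespace(child)
    return node

if __name__ == "__main__":
    # Hypothetical input with irregular whitespace between elements.
    doc = minidom.parseString("<root>\n <foo>bar</foo>\n\n      <baz> </baz></root>")
    root = strip_boundary_whitespace(doc.documentElement)
    # Whitespace inside <baz> survives: baz has no element children.
    print(root.toprettyxml(indent="  "))
```

Text inside leaf elements is never touched, which mirrors the `not(../*)` condition in the query.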

With kind regards,
Hans-Jürgen

Christian Grün

noon

> But the indentation is quite different from what I see in Saxon or oXygen output when I indent. You see this with more complex examples.

That's true; every query processor uses its own indentation algorithm, and the specification gives a lot of freedom here [1]. If indentation is important, it's always advisable to either preserve the original formatting or use xml:space='preserve' for mixed-content sections.

I’ll never be happy with the decision in XML to lump together indentation of structure and content.

[1] https://www.w3.org/TR/xslt-xquery-serialization-31/#xml-indent
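To illustrate why xml:space='preserve' matters for mixed content, here is a minimal Python sketch of an indenter that leaves such subtrees alone. It is not how BaseX, Saxon, or any other processor implements indentation; the function name `indent_unless_preserved` and the sample document are assumptions, and a nested xml:space='default' that would switch indentation back on is deliberately not handled.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:space under the fixed XML namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def indent_unless_preserved(elem, level=0, space="  "):
    """Indent in place, leaving xml:space='preserve' subtrees untouched."""
    if len(elem) == 0 or elem.get(XML_SPACE) == "preserve":
        return
    inner = "\n" + space * (level + 1)
    # Only whitespace-only (or missing) text is replaced; real text is kept.
    if not (elem.text or "").strip():
        elem.text = inner
    for child in elem:
        indent_unless_preserved(child, level + 1, space)
        if not (child.tail or "").strip():
            child.tail = inner
    last = elem[-1]
    if not last.tail.strip():
        last.tail = "\n" + space * level  # dedent before the end tag

if __name__ == "__main__":
    # Hypothetical document: structural metadata plus one mixed-content paragraph.
    src = ('<doc><meta><k>v</k></meta>'
           '<p xml:space="preserve"><b>a</b><i>b</i></p></doc>')
    root = ET.fromstring(src)
    indent_unless_preserved(root)
    print(ET.tostring(root, encoding="unicode"))
```

Without the xml:space check, a line break would be inserted between `</b>` and `<i>`, and the string value of `<p>` would no longer be "ab".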

Jonathan Robie

12:06 p.m.

On Thu, Nov 17, 2022 at 12:01 PM Christian Grün wrote:

> If indentation is important, it's always advisable to either preserve the original formatting or use xml:space='preserve' for mixed-content sections.

DOH!

I should be using xml:space="preserve". But is there no way to declare that when I import a file to the database? Sometimes I don't want to change the original file, but I do want to preserve whitespace.

> I'll never be happy with the decision in XML to lump together indentation of structure and content.

In standards groups, we always spent a LOT more time discussing whitespace than character content; it took up enormous amounts of time. Part of the problem is that there's not really a good way in XML to distinguish indentation from whitespace content. What would you have done differently? If there's an obvious, simple way this could have been improved, I'd be curious what it is.

Jonathan

Christian Grün

1:05 p.m.

> But is there no way to declare that when I import a file to the database?

There's currently no way to supply this for specific elements, but it's a good thought. We should think about it, now that all whitespace is preserved by default.

> What would you have done differently?

It's always easy to complain and much harder to improve things, even more if you can't start from scratch (I have never considered how SGML has handled this issue, or if it was an issue at all).

JSON doesn’t come with a standardized solution for mixed content, but it's impossible to corrupt the contents by using a wrong indentation.

Looking at the existing XML representation, I would certainly have preferred to have leading and trailing whitespace in elements ignored unless an element is marked as mixed-content. Next, it would have been consistent to also have an xml:space='strip' attribute value instead of 'default'.

More fundamentally, a custom node type for mixed content could have been added, and structural and content-based data could have been represented differently. If JSON and XML were merged, for example, conventional JSON could be used to store non-hierarchical metadata, and the JSON value range could be extended by a simplified XML type for mixed content. A language could support both the JavaScript dot and the XML path syntax:

books.book[text//h1 = 'Survey'].title

Well, no more ideas here. Instead of inventing something new by myself, I should definitely have a look at other projects first, e.g. Ghislain’s exciting RumbleDB project, which brings many interesting concepts together.

Liam R. E. Quin

4:44 p.m.

On Thu, 2022-11-17 at 19:05 +0100, Christian Grün wrote:

> > But is there no way to declare that when I import a file to the database?
>
> There's currently no way to supply this for specific elements

Both XML Schema and DTDs do have a way to say whether text is allowed in a particular context, and the XML loader could use this information to discard whitespace-only text nodes in places where text is not allowed.
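A minimal Python sketch of that idea, assuming the content-model information has already been extracted from a DTD or schema. The set `ELEMENT_ONLY`, the function `load_with_content_model`, and the sample element names are hypothetical; a real loader would read them from the grammar rather than hard-code them.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for what a DTD or schema would declare:
# elements whose content model allows child elements only, no text.
ELEMENT_ONLY = {"book", "sentence"}

def load_with_content_model(xml_text, element_only=ELEMENT_ONLY):
    """Parse XML and discard whitespace-only text where text is not allowed."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.tag not in element_only:
            continue  # mixed or text content: every character is data
        if elem.text is not None and not elem.text.strip():
            elem.text = None
        for child in elem:
            # A child's tail is text that belongs to this element's content.
            if child.tail is not None and not child.tail.strip():
                child.tail = None
    return root

if __name__ == "__main__":
    src = ("<book>\n  <sentence>\n"
           "    <p>In <w>the</w> <w>beginning</w></p>\n"
           "  </sentence>\n</book>")
    root = load_with_content_model(src)
    print(ET.tostring(root, encoding="unicode"))
```

The space between the two `<w>` elements survives because `<p>` is not listed as element-only, while the indentation around `<sentence>` and `<p>` is dropped.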

On how it came to be:

SGML had some really bad whitespace rules, including what was called "pernicious whitespace": whitespace where the parser needed backtracking to know whether it was text or not. The parsers didn't actually do backtracking, so they flagged it as an error. This was a very common source of problems for users.

We eliminated this for XML by requiring #PCDATA (i.e. text) always to be in a repeatable or-group, so

<!ELEMENT boy (noise|dirt|#PCDATA)*>

and not

<!ELEMENT boy (noise*, dirt*, #PCDATA)>

(to paraphrase Ambrose Bierce's Devil's Dictionary, which defined a boy as a noise with dirt on it).

liam


Lizzi, Vincent

18 Nov

1:39 p.m.

Hi Liam,

XML's handling of space characters is understandably an improvement over SGML, but it still causes problems sometimes and seems more complex than it perhaps could be. Although the ship has long since sailed, out of curiosity, do you recall if there were any suggestions for a rule to ensure that spaces (and the absence of spaces) would be consistently preserved without relying on a DTD or schema?

A relatively safe way to pretty-print (indent) XML is to insert or remove spaces only between an element's name and the closing > of its tag, and where spaces already exist in text nodes. Changing the spaces within an element's opening tag can adjust formatting without inserting or removing text nodes. For example:

<sec sec-type="example">pretty print n2.</sec>

Can be indented without changing the node tree:

<sec sec-type="example"

...

<p

>pretty print n2.</sec>

However I haven't seen any XML editor or processor implement this approach.
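A rough Python sketch of this tag-internal approach, to show that it really leaves the node tree unchanged. The serializer `tag_indent` is hypothetical and makes simplifying assumptions: no namespaces, comments, or processing instructions, and a line break before the closing > of every start and end tag.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

def tag_indent(elem, depth=0, space="  "):
    """Serialize elem with line breaks placed only inside tags, before '>'."""
    attrs = "".join(f" {name}={quoteattr(value)}"
                    for name, value in elem.attrib.items())
    # Whitespace before '>' is markup, not content, so no text node changes.
    parts = [f"<{elem.tag}{attrs}\n{space * (depth + 1)}>",
             escape(elem.text or "")]
    for child in elem:
        parts.append(tag_indent(child, depth + 1, space))
        parts.append(escape(child.tail or ""))
    parts.append(f"</{elem.tag}\n{space * depth}>")
    return "".join(parts)

if __name__ == "__main__":
    # Hypothetical input with mixed content in <p>.
    src = ('<sec sec-type="example"><title>Pretty print</title>'
           '<p>Some <b>bold</b> text.</p></sec>')
    out = tag_indent(ET.fromstring(src))
    print(out)
    # Canonical forms match, so both strings describe the same node tree.
    print(ET.canonicalize(out) == ET.canonicalize(src))
```

The output is visually structured, yet reparsing it yields exactly the original text nodes, which is the property described above.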

Best regards, Vincent


Liam R. E. Quin

19 Nov

12:06 a.m.

On Fri, 2022-11-18 at 18:39 +0000, Lizzi, Vincent wrote:

> Although the ship has long since sailed, out of curiosity do you recall if there were any suggestions for a rule to ensure that spaces (and absence of spaces) would be consistently preserved without relying on a DTD or Schema?

There were. There was a lot of discussion around this. The main proposals were:

(1) Disallow mixed content entirely, and require an element to contain text:

<T>Karen </T><emph><T>actually</T></emph><T> smiled at this idea.</T>

It's easy to see why this didn't get much traction from document people.

(2) Require mixed or text elements to use different syntax, e.g.

<@p>Karen <@emph>actually</@emph> smiled at this idea.</@p>

This would have ruled out XHTML, however, or any other pre-existing SGML vocabulary, and at that time that was 100% of all content: there was no XML content outside of the examples in the specification itself.

At one point i remember (foolishly) suggesting upper-case element names for ones that could not contain text directly (or the other way round, i forget), but of course this wouldn't work in a multilingual world where not all languages have upper and lower case.

XML was developed before XML Schema. When we started, a DTD was required; by the end, DTDs were optional (i had Charles Goldfarb calling me at home over this, trying to find ways to keep DTDs as mandatory!) but i think we didn't revisit all of the decisions in this light.

> A relatively safe way to "pretty print" indent XML is to only insert or remove spaces between an element's name and closing > and where spaces already exist in text nodes.

Yes, there are tools that can do this, too.

> However I haven't seen any XML editor or processor implement this approach.

I think maybe xmllint can, i'm not sure. And possibly xml tidy, and maybe James' xp has something like this. Overall i think it tends to confuse people more than it helps, though. I'm not sure.

liam

Eliot Kimber

17 Nov

1:44 p.m.

Note that the XML mixed content and whitespace design was inherited from SGML, where DTDs were required, and so a parser always knew for sure whether a given context was or was not mixed content.

It’s been a couple decades, but my memory is that anything we did in XML to address this in the face of not requiring any kind of grammar would have been even more disruptive, such as not allowing mixed content at all and having some special syntax just for identifying text nodes.

So it wasn’t really a decision so much as there not really being a better solution in the context of SGML as our starting point.

Cheers,

E.

Jonathan Robie

2:11 p.m.

Specifying this in the schema is actually a rather good solution, I think - at least for many cases.

Jonathan

