Hello.
I have a bug with BaseX version 10 and higher related to the handling of whitespace in documents with mixed-content elements.
I'm working with narrative documents (TEI). Following the instructions in the documentation for this type of document (containing mixed-content elements), I have disabled indentation and the STRIPWS option.
When I execute queries with connection to the database, the result is as explained in the documentation (blanks are preserved).
However, when I use the BaseX libraries to execute queries in a local context (without a connection to the database) the spaces are sometimes removed. Same error if I run the query in the Database Administration interface.
Query: db:option('stripws')
Result: false
Query: db:option('serializer')
Result: map{"omit-xml-declaration":"yes","binary":"yes","method":"basex","use-character-maps":"","tabulator":"no","allow-duplicate-names":"no","media-type":"","doctype-public":"","escape-uri-attributes":"no","standalone":"omit","csv":map{"lax":true(),"backslashes":false(),"separator":"comma","allow":"","header":false(),"quotes":true(),"format":"direct"},"indents":2,"json-node-output-method":"xml","json":map{"escape":true(),"strings":false(),"lax":false(),"indent":(),"format":"direct","merge":false()},"doctype-system":"","item-separator":(),"indent":"no","suppress-indentation":"","byte-order-mark":"no","include-content-type":"yes","encoding":"UTF-8","newline":"\n","normalization-form":"none","html-version":"","version":"","limit":-1,"undeclare-prefixes":"no","cdata-section-elements":"","parameter-document":""}
Query: let $doc := <doc><entry><sense id='1'><def>Gente grosera o vulgar.</def><cit><quote><seg>La</seg> <oRef>chusma</oRef>.</quote></cit></sense></entry></doc> return $doc/entry//sense[@id = '1']
Result: <sense id="1"><def>Gente grosera o vulgar.</def><cit><quote><seg>La</seg><oRef>chusma</oRef>.</quote></cit></sense>
The whitespace contained between the </seg> and <oRef> tags has been stripped.
In version 9.3.5 this problem did not occur.
Can you please have a look at this? In a TEI document, whitespace is important.
Thanks in advance, Montse.
___________________________________________
LEGAL NOTICE
The content of this email message, including any attachments, is confidential and is protected by Article 18.3 of the Spanish Constitution, which guarantees the secrecy of communications. If you receive this message in error, please contact the sender to inform them and do not disseminate its content or make copies. The Real Academia Española informs you that the data appearing in this communication, as well as the data it holds about you and/or your company, are processed for the purpose of maintaining contact and carrying out the tasks mentioned herein, and are used in an authorized manner by the parties without being transferred to unrelated third parties. You may exercise your rights via proteccion_de_datos@rae.es. You can find more information about data protection on our website, https://www.rae.es/aviso-legal, or by contacting us directly (Regulation (EU) 2016/679).
Hi Montse,
I believe the current behavior is correct. The important feature of your example is that the XML is constructed in the XQuery source. In this case, the somewhat obscure boundary-space setting applies [1]: with the strip policy, whitespace-only text between direct element constructors is discarded. This has been discussed on the list [2].
Your example works for me if the boundary-space preserve setting is added...
```
declare boundary-space preserve;
let $doc := <doc><entry><sense id='1'><def>Gente grosera o vulgar.</def><cit><quote><seg>La</seg> <oRef>chusma</oRef>.</quote></cit></sense></entry></doc>
return $doc/entry//sense[@id = '1']
```
Best regards
/Andy
[1] https://www.w3.org/TR/xquery-31/#id-boundary-space-decls
[2] https://mailman.uni-konstanz.de/pipermail/basex-talk/2016-September/011251.h...
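As a complement to the boundary-space declaration, here is a minimal sketch of two other ways to keep a single space inside a direct element constructor while the policy is strip. It relies on the XQuery 3.1 rule that whitespace produced by a character reference or by an enclosed expression does not count as boundary whitespace. The variable names are illustrative only, and the explicit `declare boundary-space strip` is added just to make the assumed policy visible.

```xquery
(: Assumption: the strip policy, declared explicitly for clarity. :)
declare boundary-space strip;

(: The literal space between </seg> and <oRef> is boundary whitespace and is dropped. :)
let $stripped := <quote><seg>La</seg> <oRef>chusma</oRef>.</quote>
(: A character reference is not treated as boundary whitespace. :)
let $charref := <quote><seg>La</seg>&#x20;<oRef>chusma</oRef>.</quote>
(: An enclosed expression produces a text node that holds the space. :)
let $enclosed := <quote><seg>La</seg>{ ' ' }<oRef>chusma</oRef>.</quote>
return (
  string($stripped),  (: "Lachusma." :)
  string($charref),   (: "La chusma." :)
  string($enclosed)   (: "La chusma." :)
)
```

These variants may be useful when only a few constructors in a query need the space and a query-wide policy change is not wanted.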
Hi Andy.
It was not a bug! I was totally wrong.
I didn't know about that feature. I have a lot to learn... Your proposed solution works fine.
Thank you very much for your quick response.
Regards, Montse.
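The difference between the two situations described in this thread can be sketched as follows: the boundary-space policy only affects elements written directly in the query source, not XML that is parsed at run time. This is a hedged illustration, not a statement about BaseX internals. It assumes the strip policy, and how `fn:parse-xml` treats whitespace-only text depends on the processor's parsing options (for example STRIPWS in BaseX), so no expected output is given for the parsed copy.

```xquery
(: Assumption: the strip policy, declared explicitly for clarity. :)
declare boundary-space strip;

(: The same fragment, once as a string to be parsed and once as a direct constructor. :)
let $xml := "<quote><seg>La</seg> <oRef>chusma</oRef>.</quote>"
(: Parsed at run time: boundary-space does not apply here. :)
let $parsed := parse-xml($xml)/quote
(: Constructed in the query source: boundary-space applies, the space is dropped. :)
let $built := <quote><seg>La</seg> <oRef>chusma</oRef>.</quote>
return (string($parsed), string($built))
```

Comparing the two strings shows whether a missing space comes from the query text or from the way the document was parsed or stored.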
Glad to have helped. Whitespace is important ...and hard.
I think the Wiki (https://docs.basex.org) could help more here, maybe with a "whitespace" page and/or a page listing implementation-defined [1] settings. I had a look at what Saxon does [2] (item 32). It looks like the Saxon boundary-space default is preserve.
Perhaps that default would be more in line with user expectations especially now the notorious CHOP has gone ;-)
/Andy
[1] https://www.w3.org/TR/xquery-31/#dt-implementation-defined
[2] https://www.saxonica.com/documentation12/index.html#!conformance/xquery31
Thanks for the suggestions. I agree that XML whitespace handling will always be a challenge, and hard to grasp, no matter what defaults an implementation uses. It’s true, our documentation could provide more detail about this; maybe we can spend more time on that in the future. All edits are welcome ;)
Andy wrote: "I had a look at what Saxon does [2] (item 32). It looks like the Saxon boundary-space default is preserve."
I believe that Saxon strips boundary space by default as well. At least that’s my command-line experience (I didn’t check how that correlates to the information given in their documentation):
(: query.xq :)
<x> </x>
(: call :)
java -cp "saxon-he-12.2.jar;xmlresolver-5.1.1.jar" net.sf.saxon.Query query.xq
(: result :)
<?xml version="1.0" encoding="UTF-8"?><x/>
Best, Christian
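The same check can be written so that it does not depend on how the result is serialized. The sketch below is an illustration rather than part of the original test: it measures the string value of the constructed element instead of inspecting the printed output.

```xquery
(: With this declaration the space is kept, so the query returns 1.
   Removing the declaration reveals the processor's default policy:
   a result of 0 means boundary whitespace is stripped. :)
declare boundary-space preserve;
string-length(<x> </x>)
```

Running it once with and once without the declaration gives a quick comparison between processors.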
Hi Christian,
You are right about Saxon. The documentation appeared to me to suggest that the default was preserve, and I didn't check! The documentation is now very clear [1]:
"The default is strip, in accordance with Appendix C.1 of the XQuery specification."
I have added a few words to the Wiki [2]
Thanks
/Andy
[1] https://www.saxonica.com/documentation12/index.html#!conformance/xquery31
[2] https://docs.basex.org/wiki/BaseX_10#Whitespaces
Perhaps this addition would be more findable if moved to the STRIPWS description?
/Andy
Thanks, Andy, much appreciated. I’ve added a BaseX 10 Whitespaces reference to the STRIPWS paragraph [1].
[1] https://docs.basex.org/wiki/Options#STRIPWS