I am doing some transformations of datasets, then submitting pull requests to upstream sources on GitHub. For instance, today I am inserting some attributes, but I may be restructuring in various ways or enhancing data in various ways.
To make upstreams happy, I need to be disciplined about not changing whitespace.
What do I have to do? Is it sufficient to preserve whitespace when importing, do an XQuery update, and export, or can that change whitespace beyond what the update operations explicitly say?
Thanks!
Jonathan
Hi Jonathan,
If you work with whitespace-sensitive documents, it is advisable to add the following two options at the end of your .basex configuration file:
...
# Local Options
CHOP = false
SERIALIZER = indent=no
The first option ensures that no whitespace is chopped when documents are parsed. The second one disables automatic indentation.
Apart from that, you still need to be aware that whitespace-only text in node constructors is dropped by default (that is the default behavior of the spec):
<x> </x>
You can avoid that by adding explicit spaces:
<x>{ ' ' }</x>
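A minimal sketch of another way to keep such whitespace, assuming the standard XQuery boundary-space declaration (which this thread does not mention) is acceptable for the query in question:

```xquery
(: Assumption: the standard XQuery prolog declaration for boundary
   whitespace; it is not part of the original reply. :)
declare boundary-space preserve;

(: With "preserve", the single space inside the constructor is kept
   as a text node instead of being stripped. :)
<x> </x>
```

The declaration applies to every direct constructor in the module, so the explicit `{ ' ' }` form may still be preferable when only a few constructors need it.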
Feel free to share your queries with us.
Best, Christian
Thanks, Christian, that's very helpful.
The query I am working on now simply adds a @type marker to indicate a ketiv reading.
declare updating function local:mark-ketiv($variant)
{
  if (fn:empty($variant/catchWord))
  then ()
  else
    for $ketiv in get-ketiv($variant, $variant/catchWord)
    return
      if ($ketiv/@type)
      then replace value of node $ketiv/@type with fn:string-join(($ketiv/@type, "x-ketiv"), " ")
      else insert node attribute type { "x-ketiv" } into $ketiv
};
This function is called for each note of type "variant". I am working with the Open Scriptures Hebrew Bible, which marks the Qere reading but does not explicitly mark the Ketiv reading to which it corresponds. I am inserting these attributes because my system ignores the Ketiv and builds a syntax tree from the Qere. Here is the output of the query:
<verse osisID="1Sam.9.1">
  <w lemma="c/1961" morph="HC/Vqw3ms" id="09wci">וַֽ/יְהִי</w>
  <seg type="x-maqqef">־</seg>
  <w lemma="376" morph="HNcmsa" id="09MpA">אִ֣ישׁ</w>
  <w type="x-ketiv" lemma="m/1121 a" morph="HR/Np" id="09Una">מ/בן</w>
  <seg type="x-maqqef x-ketiv">־</seg>
  <w type="x-ketiv" lemma="3225" morph="HNp" id="09jgC">ימין</w>
  <note type="variant">
    <catchWord>מ/בן־ימין</catchWord>
    <rdg type="x-qere">
      <w lemma="m/1144" n="1.0.1" morph="HR/Np" id="09EC9">מִ/בִּנְיָמִ֗ין</w>
    </rdg>
  </note>
If I can do this without messing up the whitespace, the Open Scriptures Hebrew Bible people might accept it in the upstream, which is why the whitespace is important.
Jonathan
Thanks, Jonathan, for the code snippet.
> replace value of node $ketiv/@type with fn:string-join(($ketiv/@type, "x-ketiv"), " ")
This statement should be completely safe, no matter which options you have set. If you want to avoid if/then/else, you can also do the following (but it’s not much shorter):
delete node $ketiv/@type,
insert node attribute type { string-join(($ketiv/@type, "x-ketiv"), " ") } into $ketiv
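The delete-then-insert form relies on XQuery Update collecting all updates in a pending update list and applying them only at the end of the query. The `$ketiv/@type` reference inside the insert therefore still sees the original attribute value. A small stand-alone sketch of that behavior, using a hypothetical sample element rather than the OSHB data:

```xquery
(: Hypothetical test element; the name $w and the text "sample"
   are illustrative and not taken from the thread's data. :)
copy $w := <w type="x-maqqef">sample</w>
modify (
  (: Both expressions are evaluated against the unmodified copy. :)
  delete node $w/@type,
  insert node attribute type {
    string-join(($w/@type, "x-ketiv"), " ")
  } into $w
)
return $w
```

This should return the element with type="x-maqqef x-ketiv". For an element without a type attribute, the string-join call receives only "x-ketiv", so the same expression covers both branches of the earlier if/then/else.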
Yes, that is shorter and more readable. Thanks!
And if I don't have to worry about setting options, that's nicely convenient. Again, thanks!
Jonathan
Hmmm, the original repo puts elements smack dab together on the same line to avoid whitespace issues, perhaps using CSS. When I do the update, it puts the updated elements on separate lines:
< <w lemma="1121 a" morph="HNcmsc" id="01PQe">בֶּן</w><seg type="x-maqqef">־</seg><w lemma="3967" morph="HAcbsa" id="01Exo">מֵאָ֥ה</w>
---
> <w lemma="1121 a" morph="HNcmsc" id="01PQe">בֶּן</w>
> <seg type="x-maqqef">־</seg>
> <w lemma="3967" morph="HAcbsa" id="01Exo">מֵאָ֥ה</w>
Jonathan
I tried adding these options to .basex:
# Local Options
CHOP = false
SERIALIZER = indent=no
It still seems to be putting elements on individual lines, as above, and not just for elements that have been modified. Is there a way to prevent this?
Jonathan
Hi Jonathan,
Could you provide us with a little step-by-step description that allows us to reproduce your use case?
Thanks in advance, Christian
Sure. As I said, I am using these options in .basex:
CHOP = false
SERIALIZER = indent=no
I am using data from the wlc subdirectory of this repo:
https://github.com/openscriptures/morphhb
Here is my .bxs file:
DROP DB oshb-morphology
CREATE DB oshb-morphology
ADD ./morphhb/wlc
RUN ./xquery/oshb-use-qere.xq
EXPORT ./out/oshb
This is the query (oshb-use-qere.xq):
declare default element namespace "http://www.bibletechnologies.net/2003/OSIS/namespace";
declare default function namespace "http://www.w3.org/2005/xquery-local-functions";

declare function local:get-ketiv($base, $catchword)
{
  let $prev := $base/preceding-sibling::*[1]
  let $prevstring := fn:string($prev)
  where $prev and fn:ends-with($catchword, $prevstring)
  return (
    $prev,
    if ($prevstring != $catchword)
    then get-ketiv($prev, fn:substring($catchword, 1, fn:string-length($catchword) - fn:string-length($prevstring)))
    else ()
  )
};

declare updating function local:mark-ketiv($variant)
{
  for $ketiv in get-ketiv($variant, $variant/catchWord)
  return (
    delete node $ketiv/@type,
    insert node attribute type { fn:string-join(($ketiv/@type, "x-ketiv"), " ") } into $ketiv
  )
};

let $oshb := db:open("oshb-morphology")
for $verse in $oshb//verse[note[@type='variant']]
for $variant in $verse/note[@type='variant']
return mark-ketiv($variant)
I really appreciate all your help with this!
Jonathan
…great, that was perfectly easy to reproduce.
The EXPORT command does not use the SERIALIZER option; it has its own option, called EXPORTER [1]. The following script should do the job:
SET CHOP false
SET EXPORTER indent=no,omit-xml-declaration=no
CREATE DB oshb-morphology morphhb/wlc
RUN xquery/oshb-use-qere.xq
EXPORT out/oshb
Some notes (just ignore those that you are already aware of, or that may not matter to you):
• If the options are specified in the script, the .basex file can be kept untouched
• As the input files have an XML declaration, I have added the omit-xml-declaration parameter
• DROP can be skipped, as CREATE will remove an existing database
• If the initial input is specified in the CREATE command, things will be slightly faster
I think we should merge the options SERIALIZER and EXPORTER in a future version of BaseX, as they have already been a source of confusion in the past.
> I really appreciate all your help with this!
You are welcome! Christian
[1] https://docs.basex.org/wiki/Options#EXPORTER
Works like a charm - thanks!
I do think merging SERIALIZER and EXPORTER would be helpful. And don't be afraid of over-teaching me. You already helped me simplify an XQuery, I'm not too proud to learn ;->
Jonathan
basex-talk@mailman.uni-konstanz.de