Thanks for the addition, Liam; I should have mentioned that.
If your input has mixed content, and if the relevant sections have xml:space='preserve' attributes…
<p xml:space='preserve'>The <em>very</em> <id>tc34q</id>.</p>
…whitespace stripping will be safe.
Similarly, it may be helpful to know that the whitespace gets lost if XML strings…
<p>The <em>very</em> <id>tc34q</id>.</p>
…are evaluated as XQuery. To prevent that, you can add a statement to the prolog of the query:
declare boundary-space preserve;
<p>The <em>very</em> <id>tc34q</id>.</p>
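To make the effect visible, here is a minimal sketch that returns the string value of the constructed element. It assumes the processor's default boundary-space policy is strip; the variable name $p is only for illustration.

```xquery
(: With this declaration, the whitespace-only text between </em> and <id>
   is kept by the element constructor. :)
declare boundary-space preserve;

let $p := <p>The <em>very</em> <id>tc34q</id>.</p>
return string($p)
(: Result: "The very tc34q."
   Without the declaration, under the strip policy, the same query
   yields "The verytc34q." because the boundary whitespace is dropped. :)
```

The space after "The" survives in both cases, since that text node also contains non-whitespace characters and is therefore not boundary whitespace.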
Whitespace handling is generally a tricky issue in XML.
Best, Christian
On Wed, Feb 14, 2024 at 10:38 AM Liam R. E. Quin <liam@fromoldbooks.org> wrote:
On Tue, 2024-02-13 at 20:29 +0100, Christian Grün wrote:
If your XML input has been properly indented to improve readability, you can reduce the size of your database by dropping superfluous whitespace during the import:
SET STRIPWS ON; CREATE DB ...
db:create('db', '/path/to/documents', (), map { 'stripws': true() })
Beware that this is not schema-based, and can remove whitespace nodes in mixed content -
<p>The <em>very</em> <id>tc34q</id>.</p>
may become (as i understand it)
<p>The <em>very</em><id>tc34q</id>.</p>
(i have seen this, with different software, cause potentially catastrophic problems in aircraft manuals!)
liam
--
Liam Quin, https://www.delightfulcomputing.com/ Available for XML/Document/Information Architecture/XSLT/ XSL/XQuery/Web/Text Processing/A11Y training, work & consulting. Barefoot Web-slave, antique illustrations: http://www.fromoldbooks.org