Hi,
I wanted to know if it's possible to use a regex when deleting a resource. I have documents stored in a hierarchy of collections like {year}{month}/doc.xml, e.g. 202301/abc.xml, 202302/def.xml. If I want to delete a resource "abc.xml", is it possible to issue a command like db:delete("db-name", '/*/abc.xml')? Right now, I can run an XQuery with db:list and ends-with to get the complete path of "abc.xml", but a regex would be very handy.
Similarly, I also want to run queries against a list of collections using a regex, something like "for $document in collection('db-name/20230*')" (the first 9 months of 2023). Right now, I am doing something like "for $i in ('01', '02', '03', '04', ... '09') for $document in collection('test-collection/2023' || $i)". If there are better ways, kindly let me know.
Thank you, Deepak
On 12/02/2024 18:03, Deepak Dinakara wrote:
Right now, I am doing something like "for $i in ('01', '02', '03', '04', ... '09') for $document in collection('test-collection/2023' || $i)"
That one could at least use "to" e.g.
for $i in (1 to 9)!format-integer(., '01')
Not sure that is an improvement, decide for yourself.
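A rough sketch of how that expression could slot into the loop from the question. The database name 'test-collection' and the {year}{month} folder layout are taken from the original message; the query has not been run against a real database:

```xquery
(: Build the month suffixes '01' .. '09' instead of listing them by hand. :)
for $i in (1 to 9) ! format-integer(., '01')
(: Address one month folder at a time, e.g. test-collection/202301. :)
for $document in collection('test-collection/2023' || $i)
return $document
```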
Hi Deepak,
For deletions, you can write:
let $db := 'db'
for $path in db:list($db, '2023')[matches(., '/\d\d')]
return db:delete($db, $path)
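For the specific case from the question (removing "abc.xml" from every month folder), a variant along the following lines might work. It is only a sketch: it assumes that db:list returns paths of the form 202301/abc.xml without a leading slash, and the database name 'db-name' and the regular expression are illustrative:

```xquery
(: Hypothetical variant: delete abc.xml from every {year}{month} folder. :)
let $db := 'db-name'
(: Six digits, a slash, then exactly the file name. :)
for $path in db:list($db)[matches(., '^\d{6}/abc\.xml$')]
return db:delete($db, $path)
```

Running the db:list part on its own first is a cheap way to check which paths would be affected before anything is deleted.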
When accessing documents, it’s faster to iterate over the resources:
for $doc in db:get('db', '2023')
where matches(db:path($doc), '/\d\d')
return ...
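Applied to the "first 9 months of 2023" example from the question, the same pattern could look roughly like this. The sketch assumes that db:path yields paths such as 202301/abc.xml; 'db-name' and the regular expression are illustrative, not tested:

```xquery
(: Hypothetical variant: keep only documents stored under 202301 .. 202309. :)
for $doc in db:get('db-name')
where matches(db:path($doc), '^20230[1-9]/')
return $doc
```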
Hope this helps, Christian
Hi,
I want to load 9 million XML documents into a BaseX database. I have 32,319 XML documents for testing, and the names of the XML documents are changed before each load. The first 32,319 documents are loaded in about 15 minutes. The next 32,319 documents then take 25 minutes. If I load the 32,319 documents 9 times, it already takes 1 hour and 54 minutes. I load the documents using the BaseXClient and the ADD method. Is it normal that the import becomes slower the more documents there are in the BaseX database? Or is there a way to speed this up? Have you ever loaded 9 million documents into a BaseX database?
Best regards Dietmar
Hi Dietmar,
Or is there a way to speed this up?
The fastest solution is to import all documents during database creation, either with the CREATE DB command or the corresponding XQuery function:
CREATE DB name-of-db /path/to/documents
db:create('db', '/path/to/documents')
The database command ADD, or db:add, can be used as well to import more than one document at a time.
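As a minimal sketch of that idea, assuming db:add accepts a directory path as input in the same way db:create does above (the database name and path are the placeholders used earlier):

```xquery
(: Add every document below the directory in one call,
   instead of sending one ADD per file from the client. :)
db:add('db', '/path/to/documents')
```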
If your XML input has been properly indented to improve readability, you can reduce the size of your database by dropping superfluous whitespace during the import:
SET STRIPWS ON; CREATE DB ...
db:create('db', '/path/to/documents', (), map { 'stripws': true() })
Have you ever loaded 9 million documents into a basex database?
What’s the approximate size of the 32,000 documents?
In principle, it’s no problem to add 10 million documents or more to a database as long as the input doesn’t exceed specific limits [1]. If you exceed the limits, you can create multiple databases and access them with a single query [2].
I load the documents using the BaseXClient and the ADD method.
Are you using the Java implementation of the client? Feel free to share some code with us.
Hope this helps, Christian
[1] https://docs.basex.org/wiki/Statistics
[2] https://docs.basex.org/wiki/Databases
On Tue, 2024-02-13 at 20:29 +0100, Christian Grün wrote:
If your XML input has been properly indented to improve readability, you can reduce the size of your database by dropping superfluous whitespace during the import:
SET STRIPWS ON; CREATE DB ...
db:create('db', '/path/to/documents', (), map { 'stripws': true() })
Beware that this is not schema-based, and can remove whitespace nodes in mixed content - <p>The <em>very</em> <id>tc34q</id>.</p> may become (as i understand it) <p>The <em>very</em><id>tc34q</id>.</p> (i have seen this, with different software, cause potentially catastrophic problems in aircraft manuals!)
liam
Thanks for the addition, Liam; I should have mentioned that.
If your input has mixed content, and if the relevant sections have xml:space='preserve' attributes…
<p xml:space='preserve'>The <em>very</em> <id>tc34q</id>.</p>
…whitespace stripping will be safe.
Similarly, it may be helpful to know that the whitespace gets lost if XML strings…
<p>The <em>very</em> <id>tc34q</id>.</p>
…are evaluated as XQuery. To prevent that, you can add a statement to the prolog of the query:
declare boundary-space preserve;
<p>The <em>very</em> <id>tc34q</id>.</p>
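One way to see the effect is to compare the string value of the constructed element with and without the declaration. This is a small sketch using only standard XQuery, with the element taken from the example above:

```xquery
declare boundary-space preserve;
(: The single space between </em> and <id> is boundary whitespace. :)
let $p := <p>The <em>very</em> <id>tc34q</id>.</p>
return string($p)
```

With the declaration in place, the result should be "The very tc34q."; without it, the default policy strips the space and the result should read "The verytc34q.".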
Whitespace handling is generally a tricky issue in XML.
Best, Christian
Whitespace is probably only a minor factor here. It can’t explain the loading times that grow non-linearly with document count.
Dietmar, have you looked at the memory consumption? My experience is that if memory gets scarce, garbage collection will kick in frequently, slowing down the import process. Increasing -Xmx in the startup script might improve the import speed. If your computer has 16 GB of RAM, try setting -Xmx12g, for example, and see whether there is an improvement. You can see the memory consumption in the GUI, so try to create the DB from the GUI.
Gerrit
The lack of capability to deal appropriately with whitespace (and punctuation) results in false positives in our StratML-enabled query service at https://search.aboutthem.info/
I look forward to learning whether anything can be done about it.
Owen Ambur https://www.linkedin.com/in/owenambur/
Hi Owen,
Do you have specific problems with whitespace in your query service? If yes, which version of BaseX are you using?
Best, Christian
Yes, it did help. Thanks a ton : )
basex-talk@mailman.uni-konstanz.de