Help with usage of Regex while deleting resources

Deepak Dinakara

12 Feb 2024

12:03 p.m.

Hi,

I wanted to know whether it's possible to use a regex when deleting a resource. I have documents stored in a hierarchy of collections like {year}{month}/doc.xml, e.g. 202301/abc.xml, 202302/def.xml. If I want to delete the resource "abc.xml", is it possible to issue a command like "db:delete("db-name", '/*/abc.xml')"? Right now, I can run an XQuery with db:list and ends-with to get the complete path of "abc.xml", but a regex would be very handy.

Similarly, I also want to run queries against a list of collections using a regex, something like "for $document in collection('db-name/20230*')" (the first 9 months of 2023). Right now, I am doing something like "for $i in ('01', '02', '03', '04', ... '09') for $document in collection('test-collection/2023' || $i)". If there are better ways, kindly let me know.

Thank you, Deepak

Martin Honnen

12 Feb

12:25 p.m.

On 12/02/2024 18:03, Deepak Dinakara wrote:

Similarly, I also want to run queries against a list of collections using a regex, something like "for $document in collection('db-name/20230*')" (the first 9 months of 2023). Right now, I am doing something like "for $i in ('01', '02', '03', '04', ... '09') for $document in collection('test-collection/2023' || $i)". If there are better ways, kindly let me know.

That one could at least use "to", e.g.:

for $i in (1 to 9)!format-integer(., '01')

Not sure that is an improvement, decide for yourself.
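
For readers who want to check the zero-padding outside BaseX, here is a minimal Python sketch of the list of collection paths that the loop is meant to visit. The prefix 'test-collection/2023' comes from the question; the helper name month_collections is purely illustrative.

```python
def month_collections(prefix: str, first: int = 1, last: int = 9) -> list[str]:
    """Build collection paths by appending zero-padded month numbers to a prefix."""
    # f"{m:02d}" pads to two digits, comparable to format-integer(., '01') in XQuery.
    return [f"{prefix}{m:02d}" for m in range(first, last + 1)]


paths = month_collections("test-collection/2023")
print(paths[0], paths[-1])
```

Whether the range form reads better than the literal list is a matter of taste, as noted above; it mainly pays off when the range of months changes.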

Christian Grün

1:37 p.m.

Hi Deepak,

For deletions, you can write:

    let $db := 'db'
    for $path in db:list($db, '2023')[matches(., '/\d\d')]
    return db:delete($db, $path)

When accessing documents, it’s faster to iterate over the resources:

    for $doc in db:get('db', '2023')
    where matches(db:path($doc), '/\d\d')
    return ...

Hope this helps, Christian
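
Both queries follow the same idea: list or open the resources under a path prefix, then keep only those whose path matches a regular expression. The filtering step can be tried in isolation with a small Python sketch. The paths below are made up and follow the {year}{month}/doc.xml layout from the question; the pattern and the helper name select_paths are assumptions for illustration and not part of BaseX.

```python
import re


def select_paths(paths, name, year="2023"):
    """Return the paths of resource `name` in any {year}{month} collection."""
    # Expected shape, e.g. 202301/abc.xml: year, two month digits, slash, file name.
    pattern = re.compile(rf"{re.escape(year)}\d\d/{re.escape(name)}")
    return [p for p in paths if pattern.fullmatch(p)]


# Hypothetical listing, roughly what db:list might return for such a database.
listing = ["202301/abc.xml", "202302/def.xml", "202302/abc.xml", "202401/abc.xml"]
print(select_paths(listing, "abc.xml"))
```

Inside BaseX the same test would be expressed with matches(), as in the queries above; the sketch only shows how anchoring the pattern and escaping the file name keep unrelated paths out of the result.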

Dietmar Posselt

13 Feb

9:45 a.m.

New subject: Help with loading of 9 million documents

Hi,

I want to load 9 million XML documents into a BaseX database. I have 32,319 XML documents for testing, and the names of the XML documents are changed before each load. The first 32,319 documents are loaded in about 15 minutes. The next 32,319 documents then take 25 minutes. If I load the 32,319 documents 9 times, it already takes 1 hour and 54 minutes. I load the documents using the BaseXClient and the ADD method. Is it normal that the import becomes slower the more documents there are in the BaseX database? Or is there a way to speed this up? Have you ever loaded 9 million documents into a BaseX database?

Best regards Dietmar

Christian Grün

2:29 p.m.

New subject: Help with loading of 9 million documents

Hi Dietmar,

Or is there a way to speed this up?

The fastest solution is to import all documents during database creation, either with the CREATE DB command or the corresponding XQuery function:

    CREATE DB name-of-db /path/to/documents
    db:create('db', '/path/to/documents')

The database command ADD, or db:add, can be used as well to import more than one document at a time.

If your XML input has been properly indented to improve readability, you can reduce the size of your database by dropping superfluous whitespace during the import:

    SET STRIPWS ON; CREATE DB ...
    db:create('db', '/path/to/documents', (), map { 'stripws': true() })

Have you ever loaded 9 million documents into a basex database?

What’s the approximate size of the 32,000 documents?

In principle, it’s no problem to add 10 million documents or more to a database as long as the input doesn’t exceed specific limits [1]. If you exceed the limits, you can create multiple databases and access them with a single query [2].

I load the documents using the BaseXClient and the ADD method.

Are you using the Java implementation of the client? Feel free to share some code with us.

Hope this helps, Christian

[1] https://docs.basex.org/wiki/Statistics
[2] https://docs.basex.org/wiki/Databases

Liam R. E. Quin

14 Feb

4:37 a.m.

New subject: Help with loading of 9 million documents

On Tue, 2024-02-13 at 20:29 +0100, Christian Grün wrote:

If your XML input has been properly indented to improve readability, you can reduce the size of your database by dropping superfluous whitespace during the import:

    SET STRIPWS ON; CREATE DB ...
    db:create('db', '/path/to/documents', (), map { 'stripws': true() })

Beware that this is not schema-based, and can remove whitespace nodes in mixed content:

    The very <id>tc34q</id>.

may become (as i understand it)

    The very<id>tc34q</id>.

(i have seen this, with different software, cause potentially catastrophic problems in aircraft manuals!)

liam

-- Liam Quin, https://www.delightfulcomputing.com/ Available for XML/Document/Information Architecture/XSLT/ XSL/XQuery/Web/Text Processing/A11Y training, work & consulting. Barefoot Web-slave, antique illustrations: http://www.fromoldbooks.org
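
The general hazard can be reproduced with a few lines of Python. The sketch below removes every text node that consists only of whitespace, which is one common way of stripping indentation, and shows two words running together. The sample markup is hypothetical, and the code models generic whitespace-only node removal; it does not claim to reproduce what BaseX's STRIPWS option does internally.

```python
from xml.dom import minidom


def strip_whitespace_nodes(node):
    """Remove text nodes that contain only whitespace, recursively."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        else:
            strip_whitespace_nodes(child)


def text_of(node):
    """Concatenate all text below a node, in document order."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(c) for c in node.childNodes)


# Hypothetical mixed content: the single space between </b> and <id> is a
# whitespace-only text node, yet it is the only thing separating two words.
doc = minidom.parseString("<p>The <b>very</b> <id>tc34q</id>.</p>")
before = text_of(doc.documentElement)
strip_whitespace_nodes(doc.documentElement)
after = text_of(doc.documentElement)
print(repr(before), "->", repr(after))
```

In data-oriented XML such nodes are pure indentation and can be dropped safely; in mixed content they can carry meaning, which is why a stripping step that knows nothing about the schema deserves a check against real documents before a large import.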

Christian Grün

4:48 a.m.

New subject: Help with loading of 9 million documents

Thanks for the addition, Liam; I should have mentioned that.

If your input has mixed content, and if the relevant sections have xml:space='preserve' attributes…

The very <id>tc34q</id>.

…whitespace stripping will be safe.

Similarly, it may be helpful to know that the whitespace gets lost if XML strings…

The very <id>tc34q</id>.

…are evaluated as XQuery. To prevent that, you can add a statement to the prolog of the query:

    declare boundary-space preserve;
    The very <id>tc34q</id>.

Whitespace handling is generally a tricky issue in XML.

Best, Christian

Imsieke, Gerrit, le-tex

5:38 a.m.

New subject: Help with loading of 9 million documents

Whitespace is probably only a minor factor here. It can’t explain the loading times that grow non-linearly with document count.

Dietmar, have you looked at the memory consumption? My experience is that if memory gets scarce, garbage collection will kick in frequently, slowing down the import process. Increasing -Xmx in the startup script might improve the import speed. If your computer has 16 GB of RAM, try setting -Xmx12g, for example, and see whether there is an improvement. You can see the memory consumption in the GUI, so try to create the DB from the GUI.

Gerrit

Owen Ambur

12:21 p.m.

New subject: Whitespace

Lack of capability to deal appropriately with whitespaces (and punctuation) results in false positives in our StratML-enabled query service at https://search.aboutthem.info/

Will look forward to learning if anything can be done about it.

Owen Ambur https://www.linkedin.com/in/owenambur/

Christian Grün

20 Feb

3:28 a.m.

New subject: Whitespace

Hi Owen,

Do you have specific problems with whitespace in your query service? If yes, which version of BaseX are you using?

Best, Christian

Deepak Dinakara

13 Feb

2:31 p.m.

Yes, it did help. Thanks a ton :)

basex-talk@mailman.uni-konstanz.de
