If I am running my queries and updates on a typical laptop, would they run much faster if I ran them on a suitably configured instance in the cloud? I know this is a very general question, but I'm wondering what experiences y'all have had with this.
Jonathan
On Sat, 2022-02-19 at 16:05 -0500, Jonathan Robie wrote:
If I am running my queries and updates on a typical laptop, would they run much faster if I ran them on a suitably configured instance in the cloud?
"suitably configured" is very subjective. Potentially your queries could run a lot faster.
A lot depends on the speed of the disk (or SSD) in the laptop, the amount of memory it has, and the CPU - a recent MacBook Pro will be faster than a ten-year-old Chromebook. However, server blades (the machines used in data centres) typically have much higher bandwidth between memory and devices, including both the CPU and the long-term storage, and likely have more physical RAM than your laptop.
On the other hand, connecting over the network to the cloud can be slow....
Liam
I have a 2013 MacBook Pro with 16 GB of RAM and a 1 TB SSD. So not entirely wimpy, but nowhere near as fast as the current MacBooks; I have no idea how that compares to a typical laptop these days. Most things run fairly quickly, but inserting 2.5 million attributes into a document takes perhaps 5 hours (I didn't time it). I can run that overnight, and do test runs on smaller subsets, but I want to think through my options.
Jonathan
You can use prof:track() to time your insertion operation over enough iterations to get a stable per-operation time, then multiply that by 2.5 million to get an approximate time to completion.
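A minimal sketch of such a measurement, assuming a hypothetical database 'words' whose <w> elements get a new attribute (the names are not from Jonathan's setup). Note that this times in-memory copies rather than persistent writes, so it only gives a rough lower bound:

(: time 1000 main-memory attribute insertions with prof:track(); the returned
   map contains the evaluation time, which can then be scaled up to the
   2 500 000 operations of the full job :)
let $sample := (db:open('words')//w)[position() le 1000]
return prof:track(
  $sample ! (. update { insert node attribute checked { 'yes' } into . })
)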
On my machine I'm finding times of around 0.05 seconds for my operations, which involve more than just attribute insertions and where I need to do 40K iterations. I would expect attribute insertion to be faster, especially if you can batch the insertions into a small number of transactions.
But five hours to do the update doesn’t seem entirely out of spec if your machine is significantly slower. Doing the math, I get 7ms per insertion:
Hours  Seconds/hour  Seconds  # operations  Time/operation (s)
5      3600          18000    2500000       0.0072
That seems pretty fast from a per-operation standpoint.
If you can break your content into multiple databases you could parallelize the updates across multiple BaseX instances and then combine the result back at the end.
So spin up one server for each core, have a master server that provides a REST API to kick off the processing, and then use REST to farm jobs out to each of the servers (REST makes it easy to target each server via its port; you could also do it from a shell script through the basexclient command line).
With that, you should be able to reduce the processing time to the time it takes one server to process its share, i.e. total objects / number of cores.
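A minimal sketch of that farm-out, assuming four BaseX servers already running with REST ports 8984-8987 and one database slice each, named slice1..slice4 (all names, ports and credentials here are hypothetical); xquery:fork-join dispatches the four requests in parallel:

xquery:fork-join(
  for $i in 1 to 4
  return function() {
    (: send one updating query to one server; the query in <text> is just a
       placeholder for the real per-slice update :)
    http:send-request(
      <http:request method="post" username="admin" password="admin"
                     auth-method="basic" send-authorization="true">
        <http:body media-type="application/xml">{
          <query xmlns="http://basex.org/rest">
            <text>insert node attribute checked {{ 'yes' }} into /*[1]</text>
          </query>
        }</http:body>
      </http:request>,
      'http://localhost:' || (8983 + $i) || '/rest/slice' || $i
    )
  }
)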
Cheers,
E.
_____________________________________________
Eliot Kimber
Sr Staff Content Engineer
O: 512 554 9368
M: 512 554 9368
servicenow.com (https://www.servicenow.com)
LinkedIn (https://www.linkedin.com/company/servicenow) | Twitter (https://twitter.com/servicenow) | YouTube (https://www.youtube.com/user/servicenowinc) | Facebook (https://www.facebook.com/servicenow)
Hi Jonathan!
Apologies for my late contribution...
Do you really have to use XQuery Update? Do you have to stick to a specific format? If not, maybe you could use a schema-on-read approach: you could add new data as new documents, and recombine these documents into the attribute-based format when requesting the data.
Would that be a viable solution for you?
I once had success with this solution, as BaseX is very quick at adding documents.
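A minimal sketch of the read side of that approach, assuming the original <w> elements live in a database 'corpus' and the expansions were added (e.g. with db:add) as separate <morph> documents keyed on @n and carrying only the new attributes; all of these names are hypothetical:

(: build a lookup from @n to its annotation element, then merge the
   attributes on the fly instead of updating the stored document :)
let $ann := map:merge(
  for $m in db:open('corpus')//morph
  return map:entry(string($m/@n), $m)
)
for $w in db:open('corpus')//w
let $m := $ann(string($w/@n))
return element w { $m/@* except $m/@n, $w/@*, $w/node() }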
Best regards, Fabrice
A little announcement: With BaseX 10 [1], main memory updates will get much faster:
<x>{ (1 to 1000000) ! <y/> }</x> update { y ! (insert node <z/> into .) }
BaseX 9: ages (6-7 minutes)
BaseX 10: 3 seconds
The reason: The disk-based block storage layout is now also used for the main memory representation of XML nodes.
[1] https://files.basex.org/releases/latest-10/
Wait – my tea kettle is not yet prepared for 3 seconds! When am I supposed to take my breaks?
Can't wait to try this in my updating pipelines. It is always intriguing to learn about performance optimizations from this list, be it in the implementation or in example designs. Thanks!
Daniel
Cool!
That phrase "main memory updates" also implies there's something I should learn. I am simply doing updates using a repository. Should I be doing something to trigger main memory updates?
FWIW, here is a small sample of my data (one sentence) and the query I use to expand a morphology code in each <w> element into meaningful, readable attributes. So a <w> element in the output looks like this:
<w pos="preposition" n="010010010011" morph="R" lang="H" lemma="b">בְּ</w> <w pos="noun" type="common" gender="feminine" number="singular" state="absolute" n="010010010012" morph="Ncfsa" lang="H" lemma="7225" after=" ">רֵאשִׁית</w> <w pos="verb" stem="qal" type="qatal" person="third" gender="masculine" number="singular" n="010010010021" lang="H" after=" " lemma="1254 a" morph="Vqp3ms" id="01Nvk">בָּרָ֣א</w> <w pos="noun" type="common" gender="masculine" number="plural" state="absolute" n="010010010031" lang="H" after=" " lemma="430" morph="Ncmpa" id="01TyA">אֱלֹהִ֑ים</w> <w pos="particle" type="direct object marker" n="010010010041" lang="H" after=" " lemma="853" morph="To" id="01vuQ">אֵ֥ת</w> <w pos="particle" type="definite article" n="010010010051" morph="Td" lang="H" lemma="d">הַ</w> <w pos="noun" type="common" gender="masculine" number="plural" state="absolute" n="010010010052" morph="Ncmpa" lang="H" lemma="8064" after=" ">שּׁמַ֖יִם</w> <w pos="conjunction" n="010010010061" morph="C" lang="H" lemma="c">וְ</w> <w pos="particle" type="direct object marker" n="010010010062" morph="To" lang="H" lemma="853" after=" ">אֵ֥ת</w> <w pos="particle" type="definite article" n="010010010071" morph="Td" lang="H" lemma="d">הָ</w> <w pos="noun" type="common" gender="both" number="singular" state="absolute" n="010010010072" morph="Ncbsa" lang="H" lemma="776" after=":">אָֽרֶץ</w>
These are leaf nodes in a syntax tree - for simplicity, I am not showing the syntax tree here; look to the input file for that.
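Since the query itself is not shown here, a minimal sketch of the kind of expansion described (not Jonathan's actual query): it decodes only the first letter of @morph into a @pos attribute, using a hypothetical database name and a small hypothetical lookup map:

declare variable $pos := map {
  'R': 'preposition', 'N': 'noun', 'V': 'verb',
  'T': 'particle',    'C': 'conjunction'
};

(: decode the first character of each morphology code into a readable pos
   attribute; a real query would also decode stem, gender, number, state, ... :)
for $w in db:open('corpus')//w[@morph]
let $p := $pos(substring($w/@morph, 1, 1))
where exists($p)
return insert node attribute pos { $p } into $w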
Jonathan
That phrase "main memory updates" also implies there's something I should learn.
Maybe it’s also something I should still find a better wording for ;) Previously, we used the unfortunate term "non-updating update expression"; we now refer to it as "Main-Memory Updates" [1]. It describes updates on XML nodes that do not affect the persistent database storage, but are instead returned to the user as an updated XML node.
A persistent update: delete nodes db:open('db')//text()
A main-memory update: db:open('db') update { delete nodes .//text() }
If you perform numerous updates, it’s sometimes faster to update a main-memory copy and replace the original node with the updated one:
let $node := db:open('db')/path/to/node
let $updated-node := $node update { ..... }
return replace node $node with $updated-node
But the major advantage of this pattern is that you can perform additional operations on the updated node (validations, optimizations, reports) before eventually writing it to disk.
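A minimal sketch of that check-before-persist pattern, assuming a hypothetical database 'db' with a single <book> root; the check is just a placeholder for a real validation or report:

(: update a main-memory copy, inspect it, and only then replace the stored node :)
let $node := db:open('db')/book
let $updated := $node update { insert node attribute status { 'checked' } into . }
return
  if ($updated/@status = 'checked')
  then replace node $node with $updated
  else error(xs:QName('local:check'), 'Sanity check failed, nothing written to disk.')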
Hope this helps, Christian
[1] https://docs.basex.org/wiki/XQuery_Update#Main-Memory_Updates