Hi list!
I have been experimenting with the jobs module over the last few weeks to speed up updates and to make them fit into less than 6 GB of memory. It does not work the way I expected:
* Updating jobs don't seem to run in parallel, even if they don't access and lock the same database. Or do they start much later than I expected?
* There seems to be an upper limit on the number of jobs that can be queued (about 100?). I automatically started all of the update jobs, but some updates did not run.
The upside is: I can run my updates with just about 1 GB of memory, which is much better for me.
I work with dictionary-like XML documents. They look like this:
<root>
<entry>Contents with further tags</entry>
<entry>Contents with further tags</entry>
<entry>Contents with further tags</entry>
... a few thousand more ...
<entry>Contents with further tags</entry>
<entry>Contents with further tags</entry>
</root>
I add or change larger parts of them, and I also need to keep track of changes. So a separate database holds old versions of entries with time stamps (@dt), like
<hist>
<entry dt="">Contents with further tags</entry>
<entrydt="">Contents with further tags</entry>
<entrydt="">Contents with further tags</entry>
... a few thousand more ...
<entrydt="">Contents with further tags</entry>
<entrydt="">Contents with further tags</entry>
</hist>
I tried to do everything at once, and when I need to update most of my entries (about 30,000) I exhaust my memory. So I use the jobs module to process them 100 at a time, so the pending update list does not grow beyond a reasonable size. Having those 100 as one transaction is good enough for my needs.

Then I thought maybe I should use jobs to separate the two tasks: one async job saves a history copy of the old entry, the other writes the new one. After some transformations to my XQuery, BaseX told me that the two jobs do not lock the same database and do not acquire a global lock. I expected the jobs running on different databases to work in parallel, but they don't, and I don't quite understand why. I also have a much larger dataset split across a number of databases where it would be quite useful to execute updates in parallel. Am I missing something? Is this perhaps the wrong way to tackle this scaling problem?
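For context, the batching approach I describe above looks roughly like the following sketch. The database names ("dict", "hist") and the actual update logic are placeholders, not my real code:

  (: Queue one updating job per batch of 100 entries, so the pending
     update list of each transaction stays small. :)
  for $batch in 1 to xs:integer(fn:ceiling(fn:count(db:open("dict")/root/entry) div 100))
  return jobs:eval(
    'declare variable $batch external;
     for $e in db:open("dict")/root/entry
               [position() = (($batch - 1) * 100 + 1) to $batch * 100]
     return (
       (: keep a time-stamped copy in the history database :)
       insert node <entry dt="{fn:current-dateTime()}">{$e/node()}</entry>
         as last into db:open("hist")/hist,
       (: then apply the actual change; placeholder for the real update :)
       replace value of node $e with fn:upper-case($e)
     )',
    map { 'batch': $batch },
    map { 'cache': true() }
  )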
Best regards, Omar Siam
Hi Omar,
Beforehand: You mentioned that your RESTXQ calls always cause global locks. Do you have an example of that (see my last mail)?
- Updating jobs don't seem to run in parallel, even if they don't access and lock the same database. Or do they start much later than I expected?
It should be possible indeed to run updating jobs in parallel. An example would be welcome.
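For reference, this is the kind of setup where one would expect concurrency: two updating jobs whose lock sets don't overlap. A minimal sketch (the database names "db1" and "db2" are illustrative):

  (: Each job locks only its own database, so BaseX should be able
     to execute the two updates concurrently. :)
  jobs:eval('insert node <a/> into db:open("db1")/root'),
  jobs:eval('insert node <b/> into db:open("db2")/root')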
- There seems to be an upper limit on the number of jobs that can be queued (about 100?).
There is actually no limit; I have successfully queued more than 10,000 jobs in the past.
Is this perhaps the wrong way to tackle this scaling problem?
Concurrent updates can slow down execution a lot (because of random I/O access patterns). However, from a scalability point of view, it can be the right way to go.
Cheers, Christian
On 18.10.2017 at 17:19, Christian Grün wrote:
Hi Omar,
Beforehand: You mentioned that your RESTXQ calls always cause global locks. Do you have an example of that (see my last mail)?
Sorry, no. While running that large update yesterday, I thought I saw someone running a RESTXQ call without xquery:eval that was not processed because of the update in progress, but I could not reproduce it. I'll keep an eye on it; perhaps I just did not see the xquery:eval.