It's me again .... sorry. ;-)
One question has been puzzling me these last hours ... This is the scenario on 2 DBs (same server) with the following sizes:
136746 docs, ~1 GB
194608 docs, ~0.5 GB
declare
  %rest:path('{$project}/some/url/')
  %rest:POST('{$body}')
  %rest:consumes("application/xml")
function _:insert($project, $body) {
  prof:time(
    __:insert_impl($project, $body),
    false(),
    "POST ACCOUNT"
  )
};
The function insert() is a RESTXQ entry point for document insertion. It delegates to the actual implementing function. Insertion occurs continuously. The implementing function performs queries to check the uniqueness of what is inserted and then posts the actual update, which is a db:add or db:replace (BTW, does this make a difference from a performance point of view?????). Insertion might occur either on one or on both of the DBs.
The time T output by prof:time is a steady amount of time. This means that querying and update scheduling are working in constant time. The time reported by Jetty (and perceived at the client) is T + d. The issue is that this d may vary from a few hundred milliseconds to something which is 4 or 5 times T itself. And it happens in waves: when it's slow, it's slow for a few minutes, then it gets better for a few minutes, and then back again ... When staring at the tcpdump it's evident that all the time is spent between the ack of the request and the sending of the response, so network overhead is negligible. This makes me think that the overhead is in the commit of the PUL. Is this possible?
I'm using the official BaseX 8.0 on OpenJDK 7, and monitoring through JMX doesn't show anything particularly critical. Any clues on what I should look into more closely? Thanks and sorry for bothering ... M.
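As a rough illustration of the pattern described above (a read-only uniqueness check followed by a db:add or db:replace), the delegated function might look something like the sketch below; the database name, the @id key and the whole body are hypothetical and only stand in for the actual code.

declare %updating function __:insert_impl($project, $body) {
  let $db := 'project-' || $project            (: hypothetical database name :)
  let $id := $body/*/@id/string()              (: hypothetical uniqueness key :)
  return
    if(exists(db:open($db)//*[@id = $id]))     (: uniqueness check: a read-only query :)
    then db:replace($db, $id || '.xml', $body) (: overwrite the existing resource :)
    else db:add($db, $body, $id || '.xml')     (: append a new resource :)
};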
Just one more hint: if, in parallel with the ongoing insertions, we post one directly with the BaseX client, we don't experience this lag ... M.
Hi Marco,
db:add or db:replace (BTW does this make a difference from a performance point of view?????).
It's faster to add documents, as this operation won't check if a document already exists. In future versions of BaseX, however, we will possibly merge these two functions.
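A small sketch of the difference, with an invented database name and path:

(: db:add appends a new resource without looking for an existing one at the same
   path, so running this twice leaves two documents under 'docs/item.xml' :)
db:add('mydb', <doc version='1'/>, 'docs/item.xml')

(: db:replace first looks up the path and removes any existing resource before
   adding the new one, which is the extra check mentioned above :)
db:replace('mydb', 'docs/item.xml', <doc version='2'/>)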
The time T output by prof:time is a steady amount of time. This means that querying and update scheduling are working in constant time. The time reported by Jetty (and perceived at the client) is T + d. The issue is that this d may vary from a few hundred milliseconds to something which is 4 or 5 times T itself.
Please note that it's not possible to measure the time spent for updating the documents, as this happens at the end of the query execution. In other words, the function call prof:time(delete node <a/>) will only measure how long it takes to add the delete operation to the pending update list [1].
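In XQuery terms (again with an invented database name), the timer below stops as soon as the operation has been queued on the pending update list, not when the change has actually been written:

prof:time(
  db:add('mydb', <doc/>, 'new.xml'),  (: only added to the PUL at this point :)
  false(),
  'queueing the update: '
)
(: the PUL itself is applied only after the whole query has been evaluated :)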
Hope this helps, Christian
[1] http://docs.basex.org/wiki/XQuery_Update#Pending_Update_List
On 22/02/2015 19:50, Christian Grün wrote:
Hi Marco,
db:add or db:replace (BTW does this make a difference from a performance point of view?????).
It's faster to add documents, as this operation won't check if a document already exists. In future versions of BaseX, however, we will possibly merge these two functions.
Good to know.
The time T output by prof:time is a steady amount of time. This means that querying and update scheduling are working in constant time. The time reported by Jetty (and perceived at the client) is T + d. The issue is that this d may vary from a few hundred milliseconds to something which is 4 or 5 times T itself.
Please note that it's not possible to measure the time spent for updating the documents, as this happens at the end of the query execution. In other words, the function call prof:time(delete node <a/>) will only measure how long it takes to add the delete operation to the pending update list [1].
Yes, this was clear to me. I'd just like to understand what accounts for the huge difference (d) between "querying the DB and scheduling an update on the PUL", which is what the internal function does, and the time logged by the basexhttp server as the completion time of the whole RESTXQ call, which is T + d. Mostly these RESTXQ calls return one small XML document, or in most cases even nothing at all, so it should not be related to the communication between the BaseX server and the HTTP server. My guess was that it could be related to the commit of the PUL, but I don't know whether the PUL is committed after the query is performed or after the RESTXQ call is completed ... Any insight could be useful. In the meanwhile I'll have a look at the code you pointed me to in the other mail. Thanks, Marco.
I'd just like to understand what accounts for the huge difference (d) between "querying the DB and scheduling an update on the PUL", which is what the internal function does, and the time logged by the basexhttp server as the completion time of the whole RESTXQ call, which is T + d.
Hm, difficult to tell. In all cases, the PUL will be executed before the result of a query is sent to the client.
I'm planning to add some logging here [1] and here [2]. Do you think it will reveal some details? M.
[1] https://github.com/BaseXdb/basex/blob/master/basex-api/src/main/java/org/bas... [2] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/ba...
Hi all, I've added a few log statements to [1], turning it into [2]. The objective was to understand where we are leaking arbitrarily large amounts of time in update operations. I can confirm that querying the DB and interacting with HTTP is nearly constant. All the time is spent between the "Starting updates" and "Completed updates" logs.
In some cases it takes a reasonable amount of time (this behaviour would be perfect for us, ~0.5 sec):
QUERY [MARCO]: Starting updates 5860
QUERY [MARCO]: Completed updates 6397
In other cases it's acceptable (this can be tolerated if it doesn't happen too often, ~3.2 secs):
QUERY [MARCO]: Starting updates 4650
QUERY [MARCO]: Completed updates 7832
But then there are cases where it definitely explodes (this is unacceptable, ~15.5 sec):
QUERY [MARCO]: Starting updates 4650
QUERY [MARCO]: Completed updates 20362
The three examples occur more or less at the same time (within 2-3 minutes or so). I remind you that the pattern is very insertion-intensive. There are two clients which read and write through RESTXQ to/from 2 different logical DBs on the same server. Is it possible that, by operating concurrently, we cause some sort of race condition that serializes operations for a long time period? Please help us out, because it's getting too deeply nested in BaseX's internals for us to understand.
Thanks a lot and BR, Marco.
[1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/ba...
[2]
public Iter iter() throws QueryException {
  compile();
  long time = System.currentTimeMillis();
  Util.errln("QUERY [MARCO]: Start query iter " + time);
  try {
    [...]
    final Updates updates = resources.updates;
    if(updates != null) {
      // if parent context exists, updates will be performed by main context
      Util.errln("QUERY [MARCO]: Starting updates " + (System.currentTimeMillis() - time));
      if(qcParent == null) {
        [...]
        Util.errln("QUERY [MARCO]: Completed updates " + (System.currentTimeMillis() - time));
      }
    }
    Util.errln("QUERY [MARCO]: Completed query " + (System.currentTimeMillis() - time));
    return cache;
  } catch(final StackOverflowError ex) {
    [...]
  }
}
Hi Marco,
There are two clients which read and write through RESTXQ to/from 2 different logical DBs on the same server. Is it possible that, by operating concurrently, we cause some sort of race condition that serializes operations for a long time period?
Yes, this is absolutely possible. I guess that your hard disk, or even SSD, needs to do many write operations at the same time, leading to a somewhat random access pattern.
You could try to set the PARALLEL flag to 1 and see how this affects the total performance!
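For reference, PARALLEL is a global option, so (if I recall the mechanics correctly) it can be set in the server's .basex configuration file, e.g.

# .basex configuration file
PARALLEL = 1

or, before the updates are run, with the corresponding command:

SET PARALLEL 1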
Please help us out [...]
Hope this helps ;) Christian
Hello all, setting the PARALLEL flag to 1 solved the inconsistency between the query + scheduling time and the actual operation completion time. Now it has stabilized. Thanks. Peaks are now only rarely possible, when an update from the insertion client gets serialized with one or more queries or updates from the other client. I assume this could be reduced further as soon as we complete the implementation of splitting the data into two different DBs: a cold one which keeps growing (with scheduled batch ops) but is indexed and has UPDINDEX=true (for the rare update operations that might still occur), and a hot one for real-time data which has no indexes but whose size will be reduced and kept constant. Lowering the general access times in this way should also reduce the impact of the peaks. Do the last statements make sense to the experts and implementers?
BR, Marco.
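A possible command-script sketch of such a split; the database names and the concrete index options below are only an illustration of the idea described above:

SET UPDINDEX true
SET TEXTINDEX true
SET ATTRINDEX true
CREATE DB cold

SET UPDINDEX false
SET TEXTINDEX false
SET ATTRINDEX false
CREATE DB hot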
A cold one which keeps growing (with scheduled batch ops) but is indexed and has UPDINDEX=true (for the rare update operations that might still occur), and a hot one for real-time data which has no indexes but whose size will be reduced and kept constant.
This makes perfect sense. You could also think about keeping your large database(s) completely static. The UPDINDEX option will keep your index structures updated, but it doesn't update all database statistics, which also help to speed up your queries. You can simply run some of your queries, or compare the query plans, in order to find out whether queries are evaluated more slowly after an update.
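In XQuery, a full optimize after one of those (hopefully rare) batches of updates on the cold database could look like the line below, 'cold' again being an invented name; it rebuilds the index structures together with up-to-date statistics, much like the OPTIMIZE ALL command:

db:optimize('cold', true())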
Two facts: 1) I know for sure that index accesses are boosting performance significantly, because we paid a lot of attention to writing the (semi-generated) queries so that they hit the indexes.
2) We know that no document can definitively be considered dead, so we apply a heuristic for moving docs to the cold DB and hope they will rarely be revived.
Thus, if I understood correctly: what you propose here is not to use UPDINDEX=true but rather to re-run a "create index *; optimize (all)" after every (hopefully rare) update on the cold DB?
M.