Re: [basex-talk] Performance of Add command

7 Jul 2012

Hi Manuel,
while many XML purists will hate this feature, I bet that many users
will love it: I have added a new STRIPNS option to remove namespaces
from imported XML documents [1-3]. This option is also available via
the GUI. After all, it depends on the particular use case whether stripping
namespaces makes things easier or is pretty much nuts.
I remember that you haven't actually asked for this feature, and it
may well be that you absolutely want to retain namespaces in your
database. The discussed performance bottleneck with namespaced
documents is still on the list.
Christian
[1] http://docs.basex.org/wiki/Options#STRIPNS
[2] http://files.basex.org/releases/latest/
[3] https://github.com/BaseXdb/basex/issues/537
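
For readers wondering what stripping namespaces amounts to, here is a
minimal sketch that uses only the JDK's DOM API. It is not BaseX's STRIPNS
implementation: the class NamespaceStripSketch and its methods are made up
for illustration, and the sketch assumes that no two attributes of an
element differ only by their prefix (they would collide here).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative only; not part of BaseX.
class NamespaceStripSketch {
  /** Parses the XML string and returns a copy without any namespaces. */
  static Document strip(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true); // needed so getLocalName() is set
      DocumentBuilder builder = factory.newDocumentBuilder();
      Document in = builder.parse(new InputSource(new StringReader(xml)));
      Document out = builder.newDocument();
      out.appendChild(copy(in.getDocumentElement(), out));
      return out;
    } catch (Exception ex) {
      throw new IllegalArgumentException("Input could not be parsed", ex);
    }
  }

  /** Recursively copies a node, keeping only local names. */
  static Node copy(Node node, Document out) {
    // Text, comments etc. carry no namespace and are imported unchanged.
    if (node.getNodeType() != Node.ELEMENT_NODE) return out.importNode(node, true);

    Element el = out.createElementNS(null, node.getLocalName());
    NamedNodeMap attrs = node.getAttributes();
    for (int i = 0; i < attrs.getLength(); i++) {
      Attr attr = (Attr) attrs.item(i);
      String name = attr.getName();
      // Drop namespace declarations (xmlns="..." and xmlns:p="...").
      if (name.equals("xmlns") || name.startsWith("xmlns:")) continue;
      el.setAttributeNS(null, attr.getLocalName(), attr.getValue());
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      el.appendChild(copy(c, out));
    }
    return el;
  }
}
```

Calling NamespaceStripSketch.strip(...) on a document whose root is a:doc
with an xmlns:a declaration yields a root element named doc without a
namespace URI. Whether that loss of information is acceptable is exactly
the use-case question raised above.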
___________________________
> is there a reason why inserting from a file is faster than from a
> stream? I'd expect both to use the same insertion mechanism.
There are several reasons for that, e.g.:
– as each of the ADD operations is atomic, it must be guaranteed that
a command will not lead to a corrupt database. In contrast, CREATE
will either succeed or fail as a whole.
– if data is streamed, we first need to cache the incoming data for
the same reason (if the received data is invalid, the insert operation
will fail).
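
The second point can be pictured roughly as follows. This is only a sketch
of the general "cache, check, then insert" idea, not BaseX's Add code: the
class BufferedAddSketch and its in-memory list standing in for a database
are hypothetical, and the check shown is a plain well-formedness parse with
the JDK's XML parser.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;

// Illustrative only; not part of BaseX.
class BufferedAddSketch {
  /** Hypothetical stand-in for the database: the documents stored so far. */
  final List<byte[]> stored = new ArrayList<>();

  /** Returns true if the document was stored, false if it was rejected. */
  boolean add(InputStream in) {
    byte[] bytes;
    try {
      // 1. Cache the whole stream; the store is not touched yet.
      ByteArrayOutputStream cache = new ByteArrayOutputStream();
      byte[] chunk = new byte[8192];
      for (int n; (n = in.read(chunk)) != -1;) cache.write(chunk, 0, n);
      bytes = cache.toByteArray();

      // 2. Check the cached copy for well-formedness.
      DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(bytes));
    } catch (Exception ex) {
      // Broken stream or invalid XML: nothing has been modified.
      return false;
    }
    // 3. Only now change the store.
    stored.add(bytes);
    return true;
  }
}
```

The extra copy in step 1 is the price for being able to reject bad input
before anything is written, which is one reason a streamed add can be slower
than reading from a file that is already complete on disk.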
Apart from that, your specific bottleneck seems to be related to the
Namespaces.update() method. Without that, the add operation should be
very fast, too. As malamut2 suggested in
https://github.com/BaseXdb/basex/issues/523, an additional option that
strips all namespaces in a document could be another solution (provided
that you don't really need the namespaces). Anyway, we'll give you an
update as soon as someone has time to look at this.
Christian
___________________________
Thanks,
Manuel
great, thanks! If there's anything I can do to help, let me know.
Right now I think I'm going to abort the import because it probably
will take somewhat longer.
Manuel
On Mon, Jul 2, 2012 at 3:11 AM, Christian Grün
<christian.gruen@gmail.com> wrote:
Hi Manuel,
sorry for the delayed feedback, and thanks for pointing to the
Namespaces.update() method, which in fact updates the hierarchical
namespaces structures in a database (well, you guessed that already…).
As we first need to do some more research on potential optimizations,
I have created a new GitHub issue to keep track of this bottleneck
[1].
Thanks,
Christian
[1] https://github.com/BaseXdb/basex/issues/523
___________________________
On Sat, Jun 30, 2012 at 7:01 PM, Manuel Bernhardt
<bernhardt.manuel@gmail.com> wrote:
Hi,
I'm doing some testing before migrating one of our customers to a new
version of our platform that uses BaseX in order to store documents.
They have approx. 4M documents, and I'm running an import operation on
a 1M document collection on my laptop.
The way I'm inserting documents is by firing off one Add command per
document, based on a stream of the document, at a different (unique)
path for each document, and flushing after every 10K Adds.
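
The import loop described above might look roughly like the sketch below.
The DocumentSink interface is a hypothetical stand-in for a database
session and is not a BaseX type; the path scheme "records/<n>.xml" is
likewise invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Illustrative only; not part of BaseX.
class BatchedImportSketch {
  /** Hypothetical abstraction over a database session. */
  interface DocumentSink {
    void add(String path, InputStream document);
    void flush();
  }

  /**
   * Adds every document at its own unique path and flushes after every
   * flushEvery adds, plus once at the end for a partial last batch.
   * Returns the number of flushes performed.
   */
  static int importAll(List<String> documents, DocumentSink sink, int flushEvery) {
    int flushes = 0;
    for (int i = 0; i < documents.size(); i++) {
      byte[] bytes = documents.get(i).getBytes(StandardCharsets.UTF_8);
      sink.add("records/" + i + ".xml", new ByteArrayInputStream(bytes));
      if ((i + 1) % flushEvery == 0) {
        sink.flush();
        flushes++;
      }
    }
    // Flush the remainder if the last batch was not full.
    if (documents.size() % flushEvery != 0) {
      sink.flush();
      flushes++;
    }
    return flushes;
  }
}
```

With flushEvery set to 10000, a collection of 1001712 records would be
flushed 101 times in this sketch: 100 full batches and one partial batch.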
Since most CPU usage (for one of the cores, the other ones being
untouched) is taken by the BaseX server, I fired up YourKit out of
curiosity to see where the CPU time was spent. My machine is a 2*4
core MacBook Pro with 8GB of RAM and SSD, so I think hardware-wise it
should do pretty fine.
YourKit shows that what seems to use up most time is the
Namespaces.update method:
Thread-12 [RUNNABLE] CPU time: 2h 7m 9s
  org.basex.data.Namespaces.update(NSNode, int, int, boolean, Set)
  org.basex.data.Namespaces.update(int, int, boolean, Set)
  org.basex.data.Data.insert(int, int, Data)
  org.basex.core.cmd.Add.run()
  org.basex.core.Command.run(Context, OutputStream)
  org.basex.core.Command.exec(Context, OutputStream)
  org.basex.core.Command.execute(Context, OutputStream)
  org.basex.core.Command.execute(Context)
  org.basex.server.ClientListener.execute(Command)
  org.basex.server.ClientListener.add()
  org.basex.server.ClientListener.run()
I'm not really sure what that method does - it's a recursive function
and seems to be triggered by Data.insert:
  // NSNodes have to be checked for pre value shifts after insert
  nspaces.update(ipre, dsize, true, newNodes);

The whole set of records should have no more than 5 different
namespaces in total. Thus I'm wondering if there would perhaps be some
potential for optimization here? Note that I'm completely ignorant as
to what the method does and what its exact purpose is.
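
Going only by the code comment quoted above ("NSNodes have to be checked for
pre value shifts after insert"), one possible reading is the toy model
below. It is a guess at the general shape of such an update and not the
actual org.basex.data.Namespaces code; the NsNode class and the shift
method are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only; not the BaseX implementation.
class PreShiftSketch {
  /** A namespace node: the pre value it belongs to, plus child nodes. */
  static final class NsNode {
    int pre;
    final List<NsNode> children = new ArrayList<>();
    NsNode(int pre) { this.pre = pre; }
  }

  /**
   * After inserting size nodes at position ipre, shifts every pre value
   * at or after ipre by size. Returns the number of nodes visited.
   */
  static int shift(NsNode node, int ipre, int size) {
    if (node.pre >= ipre) node.pre += size;
    int visited = 1;
    for (NsNode child : node.children) visited += shift(child, ipre, size);
    return visited;
  }
}
```

In this toy model every insert walks the whole namespace tree. If each
added document contributed its own namespace node, the work per insert
would grow with the number of documents already stored, no matter how few
distinct namespaces are in use. Whether that is what happens in BaseX is
the open question tracked in the issue mentioned in this thread.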
Thanks,
Manuel
PS: the import is now finished: Storing 1001712 records into BaseX
took 9285008 ms
_______________________________________________
BaseX-Talk mailing list
BaseX-Talk@mailman.uni-konstanz.de
https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
