Hi,
I'm doing some testing before migrating one of our customers to a new
version of our platform that uses BaseX in order to store documents.
They have approx. 4M documents, and I'm running an import operation on
a 1 M document collection on my laptop.
The way I'm inserting documents is by firing off one Add command per
document, based on a stream of the document, at a different (unique)
path for each document, and flushing every 10K Adds.
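
A minimal sketch of this insertion pattern, for illustration only. The
DocumentStore interface and the "records/" path scheme are hypothetical
stand-ins for the real client session and paths, not BaseX API; the
in-memory CountingStore only counts calls so the sketch runs on its own.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a database session; NOT a BaseX API. */
interface DocumentStore {
    void add(String path, String xml);
    void flush();
}

/** In-memory store that only counts calls, so the sketch is runnable. */
class CountingStore implements DocumentStore {
    int adds = 0;
    int flushes = 0;
    public void add(String path, String xml) { adds++; }
    public void flush() { flushes++; }
}

class BatchedImport {
    /** Adds each document at a unique path and flushes every flushEvery adds. */
    static void importAll(DocumentStore store, Iterable<String> docs, int flushEvery) {
        int count = 0;
        for (String xml : docs) {
            // Assumed path scheme: one unique path per document.
            store.add("records/" + count + ".xml", xml);
            count++;
            if (count % flushEvery == 0) {
                store.flush();
            }
        }
        // Flush the last, partially filled batch (if any).
        if (count % flushEvery != 0) {
            store.flush();
        }
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 25_000; i++) {
            docs.add("<record id='" + i + "'/>");
        }
        CountingStore store = new CountingStore();
        importAll(store, docs, 10_000);
        System.out.println(store.adds + " adds, " + store.flushes + " flushes");
    }
}
```

With 25,000 documents and a batch size of 10,000, this performs two
full-batch flushes plus one final flush for the remainder.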
Since most CPU usage (for one of the cores, the other ones being
untouched) is taken by the BaseX server, I fired up YourKit out of
curiosity to see where the CPU time was spent. My machine is a 2*4
core MacBook Pro with 8GB of RAM and an SSD, so I think hardware-wise it
should be fine.
YourKit shows that what seems to use up most time is the
Namespaces.update method:
Thread-12 [RUNNABLE] CPU time: 2h 7m 9s
org.basex.data.Namespaces.update(NSNode, int, int, boolean, Set)
org.basex.data.Namespaces.update(int, int, boolean, Set)
org.basex.data.Data.insert(int, int, Data)
org.basex.core.cmd.Add.run()
org.basex.core.Command.run(Context, OutputStream)
org.basex.core.Command.exec(Context, OutputStream)
org.basex.core.Command.execute(Context, OutputStream)
org.basex.core.Command.execute(Context)
org.basex.server.ClientListener.execute(Command)
org.basex.server.ClientListener.add()
org.basex.server.ClientListener.run()
I'm not really sure what that method does - it's a recursive function
and seems to be triggered by Data.insert:
// NSNodes have to be checked for pre value shifts after insert
nspaces.update(ipre, dsize, true, newNodes);
The whole set of records should have no more than 5 different
namespaces in total. Thus I'm wondering if there would perhaps be some
potential for optimization here? Note that I'm completely ignorant as
to what the method does and what its exact purpose is.
Thanks,
Manuel
PS: the import is now finished: Storing 1001712 records into BaseX
took 9285008 ms (about 2 h 35 min)
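
To put that figure in perspective, here is a small sketch of the
arithmetic. It computes the average throughput of this run and a naive
projection to 4M documents. The projection assumes import time grows
linearly with document count, which may well be too optimistic if the
per-insert cost grows with database size. Class and method names are
illustrative only.

```java
class ImportThroughput {
    /** Average number of documents stored per second. */
    static double docsPerSecond(long docs, long elapsedMs) {
        return docs * 1000.0 / elapsedMs;
    }

    /** Projected duration in ms for targetDocs, ASSUMING linear scaling. */
    static long projectedMs(long targetDocs, long docs, long elapsedMs) {
        return Math.round((double) targetDocs * elapsedMs / docs);
    }

    public static void main(String[] args) {
        long docs = 1_001_712L;      // records stored in this run
        long elapsedMs = 9_285_008L; // reported duration of this run

        double rate = docsPerSecond(docs, elapsedMs);
        long projected = projectedMs(4_000_000L, docs, elapsedMs);

        System.out.printf("throughput: %.1f docs/s%n", rate);
        System.out.printf("projected for 4M docs: %.1f h (linear assumption)%n",
                projected / 3_600_000.0);
    }
}
```

That works out to roughly 108 documents per second, and on the order of
10 hours for 4M documents under the linear assumption.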