Hi,
a little update on this: I started the import of 3M documents yesterday evening using this method, and after 9h it has not finished yet (at 2.29M documents at the moment). So this operation looks a lot like it is in O(n^2) (the insertion of 1M records took somewhat above 2h).
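As a rough check of that impression, the two data points can be turned into a growth exponent. This is only a sketch: it assumes the running time behaves like n^k, that both runs happened under comparable conditions, and it uses the 1M timing from the PS of the quoted mail (1001712 records in 9285008 ms) together with the unfinished 3M run (about 2.29M documents after roughly 9h). The class and method names are made up for illustration.

```java
// Rough growth-exponent estimate from two timing observations.
class GrowthEstimate {

    /** Solves t2/t1 = (n2/n1)^k for k, assuming time grows like n^k. */
    static double exponent(double n1, double t1Ms, double n2, double t2Ms) {
        return Math.log(t2Ms / t1Ms) / Math.log(n2 / n1);
    }

    public static void main(String[] args) {
        // Completed 1M import (figures from the PS of the quoted mail).
        double n1 = 1_001_712, t1 = 9_285_008;
        // Unfinished 3M import: about 2.29M documents after about 9h (approximate).
        double n2 = 2_290_000, t2 = 9 * 3_600_000.0;
        System.out.printf("estimated exponent k = %.2f%n", exponent(n1, t1, n2, t2));
    }
}
```

With these inputs the estimate comes out at roughly 1.5, so clearly worse than linear, though the second data point is approximate and the run is not finished, so it may still drift.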
Manuel
On Sat, Jun 30, 2012 at 7:01 PM, Manuel Bernhardt bernhardt.manuel@gmail.com wrote:
Hi,
I'm doing some testing before migrating one of our customers to a new version of our platform that uses BaseX to store documents. They have approx. 4M documents, and I'm running an import operation on a 1M document collection on my laptop.
The way I'm inserting documents is by firing off one Add command per document, based on a stream of the document, at a different (unique) path for each document, and flushing every 10K Adds.
Since most of the CPU usage (on one of the cores, the other ones being untouched) is taken by the BaseX server, I fired up YourKit out of curiosity to see where the CPU time was spent. My machine is a 2*4 core MacBook Pro with 8GB of RAM and an SSD, so I think hardware-wise it should do pretty well.
YourKit shows that what seems to use up the most time is the Namespaces.update method:
Thread-12 [RUNNABLE] CPU time: 2h 7m 9s
  org.basex.data.Namespaces.update(NSNode, int, int, boolean, Set)
  org.basex.data.Namespaces.update(int, int, boolean, Set)
  org.basex.data.Data.insert(int, int, Data)
  org.basex.core.cmd.Add.run()
  org.basex.core.Command.run(Context, OutputStream)
  org.basex.core.Command.exec(Context, OutputStream)
  org.basex.core.Command.execute(Context, OutputStream)
  org.basex.core.Command.execute(Context)
  org.basex.server.ClientListener.execute(Command)
  org.basex.server.ClientListener.add()
  org.basex.server.ClientListener.run()
I'm not really sure what that method does - it's a recursive function and seems to be triggered by Data.insert:
// NSNodes have to be checked for pre value shifts after insert
nspaces.update(ipre, dsize, true, newNodes);
The whole set of records should have no more than 5 different namespaces in total. Thus I'm wondering if there would perhaps be some potential for optimization here? Note that I'm completely ignorant as to what the method does and what its exact purpose is.
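To illustrate why a per-insert pass of this kind could add up, here is a toy model. It is purely hypothetical and is not BaseX code: it assumes that every inserted document leaves one entry behind in some structure, and that every later insert walks all existing entries to check whether their pre value needs shifting. Whether Namespaces.update really behaves like this is exactly the open question; the names below are invented for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Toy cost model (hypothetical, not BaseX code): each insert scans all entries stored so far.
class ShiftCostModel {

    /** Returns how many entries are visited in total while appending n documents. */
    static long totalVisits(int n) {
        // One hypothetical entry per document, holding a single pre value.
        List<int[]> entries = new ArrayList<>();
        long visits = 0;
        for (int doc = 0; doc < n; doc++) {
            int ipre = doc; // documents are appended, so the insert position is the current end
            for (int[] entry : entries) {
                visits++;                          // every existing entry is checked ...
                if (entry[0] >= ipre) entry[0]++;  // ... and shifted only if it lies at or after ipre
            }
            entries.add(new int[] { ipre });
        }
        return visits;
    }

    public static void main(String[] args) {
        for (int n : new int[] { 1_000, 2_000, 4_000 }) {
            System.out.println(n + " inserts -> " + totalVisits(n) + " visits");
        }
    }
}
```

In this model appending never actually shifts anything, yet the checking alone costs n(n-1)/2 visits, so doubling the number of documents roughly quadruples the work. If the real structure only held the handful of distinct namespaces, the pass would stay cheap, which is why the measured time is surprising.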
Thanks,
Manuel
PS: the import is now finished: Storing 1001712 records into BaseX took 9285008 ms