Hi Manuel,
while many XML purists will hate this feature, I bet that many users will love it: I have added a new STRIPNS option to remove namespaces from imported XML documents [1-3]. This option is also available via the GUI. After all, it depends on the particular use case whether stripping namespaces makes things easier or is pretty much nuts.
I remember that you haven't actually asked for this feature, and it may well be that you absolutely want to retain namespaces in your database. The discussed performance bottleneck with namespaced documents is still on the list.
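As a rough illustration of what stripping namespaces means for a document, here is a minimal sketch using the StAX API from the Java standard library. It is not the BaseX implementation of STRIPNS; the class name NamespaceStripper is hypothetical, and the sketch simply drops prefixes and namespace declarations while copying elements, attributes and text.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

/** Illustration only: NOT the BaseX implementation of STRIPNS. */
final class NamespaceStripper {
    /** Returns a copy of the XML with prefixes and namespace declarations dropped. */
    static String strip(String xml) {
        try {
            // StAX readers are namespace-aware by default, so xmlns declarations
            // are reported separately from ordinary attributes and are skipped here.
            XMLStreamReader in = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            StringWriter out = new StringWriter();
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            while (in.hasNext()) {
                switch (in.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        // Write the local name only, without prefix or namespace.
                        w.writeStartElement(in.getLocalName());
                        for (int i = 0; i < in.getAttributeCount(); i++) {
                            w.writeAttribute(in.getAttributeLocalName(i), in.getAttributeValue(i));
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        w.writeEndElement();
                        break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        w.writeCharacters(in.getText());
                        break;
                    default:
                        break; // comments, processing instructions etc. are dropped in this sketch
                }
            }
            w.flush();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(strip(
                "<a:doc xmlns:a=\"urn:example\"><a:item id=\"1\">hi</a:item></a:doc>"));
    }
}
```

Note that two attributes differing only in their prefix would collide after stripping; the sketch does not handle that case, which is one reason why the option only makes sense if the namespaces are not really needed.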
Christian
[1] http://docs.basex.org/wiki/Options#STRIPNS
[2] http://files.basex.org/releases/latest/
[3] https://github.com/BaseXdb/basex/issues/537
___________________________
is there a reason why inserting from a file is faster than from a stream? I'd expect both to use the same insertion mechanism.
There are several reasons for that, e.g.:
– as each of the ADD operations is atomic, it must be guaranteed that a command will not lead to a corrupt database. In contrast, CREATE will either succeed or fail as a whole.
– if data is streamed, we first need to cache the received data for the same reason (if it turns out to be invalid, the insert operation will fail)
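The second point can be sketched as follows. This is a simplified model under stated assumptions, not BaseX code: the class BufferedAdd and its in-memory store are hypothetical stand-ins for a database. The stream is first read completely into a buffer, the buffer is checked for well-formedness, and only then is anything committed, so invalid input never touches the stored data.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for a database; NOT a BaseX class. */
final class BufferedAdd {
    /** Documents that were committed successfully. */
    final List<byte[]> store = new ArrayList<>();

    /** Caches the stream, validates it, and commits only if it is well-formed. */
    boolean add(InputStream in) {
        try {
            // Step 1: cache the complete input (InputStream.readAllBytes needs Java 9+).
            byte[] cached = in.readAllBytes();
            // Step 2: parse the cached copy; malformed XML raises a SAXException.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(cached), new DefaultHandler());
            // Step 3: the store is modified only after validation has succeeded.
            store.add(cached);
            return true;
        } catch (IOException | SAXException | ParserConfigurationException e) {
            return false; // nothing was stored, so the store stays consistent
        }
    }
}
```

The extra buffering step is the price of atomicity for streamed input: a file can be re-read or checked in place, whereas a stream can only be consumed once.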
Apart from that, your specific bottleneck seems to be related to the namespace method. Without that, the add operation should be very fast, too. As malamut2 suggested…
https://github.com/BaseXdb/basex/issues/523
an additional option, which strips all namespaces in a document, could be another solution (provided that you don't really need the namespaces). Anyway, we'll give you an update as soon as someone has time to look at this.
Christian
___________________________
Thanks,
Manuel
great, thanks! If there's anything I can do to help, let me know. Right now I think I'm going to abort the import because it probably will take somewhat longer.
Manuel
On Mon, Jul 2, 2012 at 3:11 AM, Christian Grün christian.gruen@gmail.com wrote:
Hi Manuel,
sorry for the delayed feedback, and thanks for pointing to the Namespaces.update() method, which in fact updates the hierarchical namespaces structures in a database (well, you guessed that already…). As we first need to do some more research on potential optimizations, I have created a new GitHub issue to keep track of this bottleneck [1].
Thanks, Christian
[1] https://github.com/BaseXdb/basex/issues/523
___________________________
On Sat, Jun 30, 2012 at 7:01 PM, Manuel Bernhardt bernhardt.manuel@gmail.com wrote:
Hi,
I'm doing some testing before migrating one of our customers to a new version of our platform that uses BaseX in order to store documents. They have approx. 4M documents, and I'm running an import operation on a 1M-document collection on my laptop.
The way I'm inserting documents is by firing off one Add command per document, based on a stream of the document, at a different (unique) path for each document, and flushing every 10K Adds.
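The import loop described above can be sketched roughly like this. The DocumentSink interface is a hypothetical abstraction over a database session, introduced only for illustration; it is not part of the BaseX API, and the path scheme "records/N.xml" is likewise an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical abstraction over a database session; NOT a BaseX interface. */
interface DocumentSink {
    void add(String path, InputStream document);
    void flush();
}

final class BatchImporter {
    /**
     * Adds each document at its own unique path and flushes after every
     * flushEvery additions. Returns the number of flushes that were issued.
     */
    static int importAll(List<String> documents, DocumentSink sink, int flushEvery) {
        int added = 0;
        int flushes = 0;
        for (String xml : documents) {
            String path = "records/" + added + ".xml"; // assumed unique path scheme
            sink.add(path, new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            added++;
            if (added % flushEvery == 0) {
                sink.flush();
                flushes++;
            }
        }
        // Flush the remainder if the last batch was incomplete.
        if (added % flushEvery != 0) {
            sink.flush();
            flushes++;
        }
        return flushes;
    }
}
```

With flushEvery set to 10000, this mirrors the "one Add per document, flush every 10K Adds" pattern; the per-document cost then depends entirely on what each add has to do on the server side.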
Since most CPU usage (for one of the cores, the other ones being untouched) is taken by the BaseX server, I fired up YourKit out of curiosity to see where the CPU time was spent. My machine is a 2*4 core MacBook Pro with 8GB of RAM and SSD, so I think hardware-wise it should do pretty fine.
YourKit shows that what seems to use up most time is the Namespaces.update method:
Thread-12 [RUNNABLE] CPU time: 2h 7m 9s
org.basex.data.Namespaces.update(NSNode, int, int, boolean, Set)
org.basex.data.Namespaces.update(int, int, boolean, Set)
org.basex.data.Data.insert(int, int, Data)
org.basex.core.cmd.Add.run()
org.basex.core.Command.run(Context, OutputStream)
org.basex.core.Command.exec(Context, OutputStream)
org.basex.core.Command.execute(Context, OutputStream)
org.basex.core.Command.execute(Context)
org.basex.server.ClientListener.execute(Command)
org.basex.server.ClientListener.add()
org.basex.server.ClientListener.run()
I'm not really sure what that method does - it's a recursive function and seems to be triggered by Data.insert:
// NSNodes have to be checked for pre value shifts after insert
nspaces.update(ipre, dsize, true, newNodes);
The whole set of records should have no more than 5 different namespaces in total. Thus I'm wondering if there would perhaps be some potential for optimization here? Note that I'm completely ignorant as to what the method does and what its exact purpose is.
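To make the suspicion concrete, here is a deliberately simplified toy model of why checking namespace nodes for pre value shifts on every insert could become expensive. It is an assumption-laden sketch, not a reproduction of org.basex.data.Namespaces: the class PreShiftModel is hypothetical, the real structure is hierarchical, and the model assumes one namespace node per inserted document, which are all visited on each insert.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: NOT the structure used by org.basex.data.Namespaces. */
final class PreShiftModel {
    /** Pre values of the namespace nodes recorded so far (one per document here). */
    private final List<Integer> pres = new ArrayList<>();
    /** Total number of nodes in the toy table. */
    private int size;
    /** Number of namespace nodes visited across all inserts. */
    long visited;

    /** Inserts dsize nodes at position ipre and shifts affected pre values. */
    void insert(int ipre, int dsize) {
        for (int i = 0; i < pres.size(); i++) {
            visited++; // every existing namespace node is checked
            if (pres.get(i) >= ipre) {
                pres.set(i, pres.get(i) + dsize);
            }
        }
        pres.add(ipre); // the new document brings its own namespace node
        size += dsize;
    }

    int size() {
        return size;
    }

    int pre(int index) {
        return pres.get(index);
    }
}
```

In this model, appending n documents visits 0 + 1 + ... + (n - 1) = n(n - 1)/2 namespace nodes in total, so the work grows quadratically with the number of documents even though the number of distinct namespaces stays small. Whether the real method behaves this way is exactly the open question.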
Thanks,
Manuel
PS: the import is now finished: Storing 1001712 records into BaseX took 9285008 ms
_______________________________________________
BaseX-Talk mailing list
BaseX-Talk@mailman.uni-konstanz.de
https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk