Hi again,
the parsing isn't finished yet, and apparently there's something odd going on: the ADD calls alternate between sub-millisecond completions and roughly 2.3-second ones:
00:44:36.429 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201590 [...]
00:44:38.681 [127.0.0.1:61979]: kulturnett____arkivportalen OK 2252.36 ms
00:44:38.682 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201591 [...]
00:44:38.682 [127.0.0.1:61979]: kulturnett____arkivportalen OK 0.64 ms
00:44:38.683 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201592 [...]
00:44:41.049 [127.0.0.1:61979]: kulturnett____arkivportalen OK 2366.34 ms
00:44:41.050 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201593 [...]
00:44:43.410 [127.0.0.1:61979]: kulturnett____arkivportalen OK 2360.21 ms
00:44:43.410 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201594 [...]
00:44:43.411 [127.0.0.1:61979]: kulturnett____arkivportalen OK 0.75 ms
00:44:43.412 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201595 [...]
00:44:45.730 [127.0.0.1:61979]: kulturnett____arkivportalen OK 2318.13 ms
00:44:45.730 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201596 [...]
00:44:48.068 [127.0.0.1:61979]: kulturnett____arkivportalen OK 2337.68 ms
00:44:48.068 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201597 [...]
00:44:48.069 [127.0.0.1:61979]: kulturnett____arkivportalen OK 0.66 ms
00:44:48.070 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201598 [...]
00:44:50.401 [127.0.0.1:61979]: kulturnett____arkivportalen OK 2331.87 ms
00:44:50.402 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201599 [...]
00:44:50.403 [127.0.0.1:61979]: kulturnett____arkivportalen OK 0.7 ms
00:44:50.403 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201600 [...]
00:44:52.763 [127.0.0.1:61979]: kulturnett____arkivportalen OK 2360.16 ms
00:44:52.764 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201601 [...]
00:44:55.155 [127.0.0.1:61979]: kulturnett____arkivportalen OK 2391.05 ms
00:44:55.155 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201602 [...]
00:44:55.156 [127.0.0.1:61979]: kulturnett____arkivportalen OK 0.96 ms
00:44:55.157 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201603 [...]
00:44:57.504 [127.0.0.1:61979]: kulturnett____arkivportalen OK 2347.02 ms
00:44:57.504 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201604 [...]
00:44:59.846 [127.0.0.1:61979]: kulturnett____arkivportalen OK 2341.76 ms
00:44:59.846 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201605 [...]
00:44:59.848 [127.0.0.1:61979]: kulturnett____arkivportalen OK 1.33 ms
00:44:59.848 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201606 [...]
00:45:02.180 [127.0.0.1:61979]: kulturnett____arkivportalen OK 2332.35 ms
00:45:02.181 [127.0.0.1:61979]: kulturnett____arkivportalen ADD TO no-TKAT_arkiv000000201607 [...]
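In case it's useful, this is roughly how I'm extracting those timings from the server log to count the fast vs. slow adds (a plain regex over the log text, nothing BaseX-specific; class and method names are just my own helpers):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AddTimings {

    // Matches the duration at the end of an "... OK 2252.36 ms" log entry.
    private static final Pattern OK_MS =
        Pattern.compile("OK\\s+([0-9]+(?:\\.[0-9]+)?)\\s+ms");

    /** Extracts all command durations (in ms) from a chunk of server log text. */
    static List<Double> durations(String log) {
        List<Double> result = new ArrayList<>();
        Matcher m = OK_MS.matcher(log);
        while (m.find()) {
            result.add(Double.parseDouble(m.group(1)));
        }
        return result;
    }

    /** Counts how many durations exceed the given threshold. */
    static long slowCount(List<Double> durations, double thresholdMs) {
        return durations.stream().filter(d -> d > thresholdMs).count();
    }
}
```

On the excerpt above, roughly every other ADD lands in the multi-second bucket.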
Also, YourKit seems to have flagged some potential deadlocks or sluggish calls:
Thread-30 <--- Frozen for at least 2m 20s
  org.basex.index.value.MemValues.index(byte[], int)
  org.basex.data.MemData.index(int, int, byte[], int)
  org.basex.data.Data.text(int, int, byte[], int)
  org.basex.build.MemBuilder.addText(byte[], int, byte)
  org.basex.build.Builder.addText(byte[], byte)
  org.basex.build.Builder.text(byte[])
  org.basex.build.xml.XMLParser.parse()
  org.basex.build.SingleParser.parse(Builder)
  org.basex.build.DirParser.parseResource(Builder)
  org.basex.build.DirParser.parse(Builder, IO)
  org.basex.build.DirParser.parse(Builder)
  org.basex.build.Builder.parse()
  org.basex.build.MemBuilder.build() <2 recursive calls>
  org.basex.core.cmd.Add.run()
  org.basex.core.Command.run(Context, OutputStream)
  org.basex.core.Command.exec(Context, OutputStream)
  org.basex.core.Command.execute(Context, OutputStream)
  org.basex.core.Command.execute(Context)
  org.basex.server.ClientListener.execute(Command)
  org.basex.server.ClientListener.add()
  org.basex.server.ClientListener.run()

Thread-30 <--- Frozen for at least 44s
  org.basex.index.name.Names.index(byte[], byte[], boolean)
  org.basex.build.Builder.addElem(byte[], Atts)
  org.basex.build.Builder.startElem(byte[], Atts)
  org.basex.build.xml.XMLParser.parseTag()
  org.basex.build.xml.XMLParser.parse()
  org.basex.build.SingleParser.parse(Builder)
  org.basex.build.DirParser.parseResource(Builder)
  org.basex.build.DirParser.parse(Builder, IO)
  org.basex.build.DirParser.parse(Builder)
  org.basex.build.Builder.parse()
  org.basex.build.MemBuilder.build() <2 recursive calls>
  org.basex.core.cmd.Add.run()
  org.basex.core.Command.run(Context, OutputStream)
  org.basex.core.Command.exec(Context, OutputStream)
  org.basex.core.Command.execute(Context, OutputStream)
  org.basex.core.Command.execute(Context)
  org.basex.server.ClientListener.execute(Command)
  org.basex.server.ClientListener.add()
  org.basex.server.ClientListener.run()

Thread-30 <--- Frozen for at least 51s
  org.basex.index.resource.Docs.insert(int, Data)
  org.basex.index.resource.Resources.insert(int, Data)
  org.basex.data.Data.insert(int, int, Data)
  org.basex.core.cmd.Add.run()
  org.basex.core.Command.run(Context, OutputStream)
  org.basex.core.Command.exec(Context, OutputStream)
  org.basex.core.Command.execute(Context, OutputStream)
  org.basex.core.Command.execute(Context)
  org.basex.server.ClientListener.execute(Command)
  org.basex.server.ClientListener.add()
  org.basex.server.ClientListener.run()
I'm not exactly sure which of the above are relevant, but I thought I'd share them anyway. I'll try to get some better measurements tomorrow.
Manuel
On Mon, Jul 9, 2012 at 11:25 PM, Manuel Bernhardt bernhardt.manuel@gmail.com wrote:
Hi again,
inserting 3M records now seems to take a lot less time - I've been running an insertion for the past 40 minutes and it's close to finishing (2.8M records so far). I still have the impression that it slows down as the database grows, but much less so - and I couldn't put a finger on any particular method call with YourKit (it started at a whopping 15K documents/second and is now at 300 documents/second).
I'll leave the computer running and see tomorrow how much time it took in total (and give details on which calls took how long), but in any case this is a huge improvement over how it used to be - thanks a lot!
Manuel
On Mon, Jul 9, 2012 at 2:04 PM, Manuel Bernhardt bernhardt.manuel@gmail.com wrote:
Hi Christian,
thanks for the fix! I'll test it right away on a big import.
We don't have that many namespaces in those documents, but the general idea is to keep them, so we won't be using the STRIPNS feature for the time being (though we might in the future, depending on the use case).
Thanks,
Manuel
On Sat, Jul 7, 2012 at 4:45 PM, Christian Grün christian.gruen@gmail.com wrote:
…the problem should now be fixed. I'd be glad if you could once more test the import you've been discussing in your report with the latest code base/snapshot.
Thanks in advance, Christian
On Sat, Jun 30, 2012 at 7:01 PM, Manuel Bernhardt bernhardt.manuel@gmail.com wrote:
Hi,
I'm doing some testing before migrating one of our customers to a new version of our platform that uses BaseX to store documents. They have approx. 4M documents, and I'm running an import of a 1M-document collection on my laptop.
The way I'm inserting documents is to fire off one Add command per document, based on a stream of the document, at a different (unique) path for each document, and to flush after every 10K Adds.
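In code, that loop looks roughly like this (a minimal sketch of the pattern; the DocStore interface and the path scheme are illustrative stand-ins, not the actual BaseX client API):

```java
import java.util.List;

public class BulkImport {

    /** Minimal stand-in for a document store session; hypothetical, not the BaseX API. */
    interface DocStore {
        void add(String path, byte[] document);
        void flush();
    }

    static final int FLUSH_INTERVAL = 10_000;

    /** Adds each document under a unique path, flushing every FLUSH_INTERVAL adds. */
    static int importAll(DocStore store, List<byte[]> documents) {
        int added = 0;
        for (byte[] doc : documents) {
            store.add("docs/doc-" + added + ".xml", doc);
            added++;
            if (added % FLUSH_INTERVAL == 0) store.flush();
        }
        store.flush(); // flush the final partial batch
        return added;
    }
}
```

So the server sees a long stream of small Add commands with an explicit flush every 10K documents.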
Since most of the CPU usage (on one core; the other cores stay untouched) is taken by the BaseX server, I fired up YourKit out of curiosity to see where the CPU time is spent. My machine is a 2*4-core MacBook Pro with 8GB of RAM and an SSD, so hardware-wise it should do pretty fine.
YourKit shows that what seems to use up most time is the Namespaces.update method:
Thread-12 [RUNNABLE] CPU time: 2h 7m 9s
  org.basex.data.Namespaces.update(NSNode, int, int, boolean, Set)
  org.basex.data.Namespaces.update(int, int, boolean, Set)
  org.basex.data.Data.insert(int, int, Data)
  org.basex.core.cmd.Add.run()
  org.basex.core.Command.run(Context, OutputStream)
  org.basex.core.Command.exec(Context, OutputStream)
  org.basex.core.Command.execute(Context, OutputStream)
  org.basex.core.Command.execute(Context)
  org.basex.server.ClientListener.execute(Command)
  org.basex.server.ClientListener.add()
  org.basex.server.ClientListener.run()
I'm not really sure what that method does - it's a recursive function and seems to be triggered by Data.insert:
// NSNodes have to be checked for pre value shifts after insert
nspaces.update(ipre, dsize, true, newNodes);
The whole set of records should contain no more than 5 different namespaces in total, so I'm wondering whether there's some potential for optimization here. Note that I'm completely ignorant of what the method does and what its exact purpose is.
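To make the worry concrete: if the update has to walk work proportional to the number of namespace nodes already in the database on every insert, the total import cost grows quadratically with the number of documents, even with only a handful of distinct namespace URIs. A toy cost model (my own illustration, nothing to do with BaseX's actual data structures):

```java
public class NamespaceCostModel {

    /** Work units if each insert scans all namespace nodes inserted so far. */
    static long scanAllCost(long docs, long nsNodesPerDoc) {
        long total = 0;
        for (long i = 0; i < docs; i++) {
            total += i * nsNodesPerDoc; // scan everything inserted before this doc
        }
        return total; // quadratic in docs
    }

    /** Work units if each insert only touches its own (bounded) namespace nodes. */
    static long boundedCost(long docs, long nsNodesPerDoc) {
        return docs * nsNodesPerDoc; // linear in docs
    }
}
```

With 5 namespace nodes per document, the scan-everything model is already hundreds of times more expensive at 100K documents, which would match an import that starts fast and keeps getting slower.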
Thanks,
Manuel
PS: the import is now finished: storing 1001712 records into BaseX took 9285008 ms.
_______________________________________________
BaseX-Talk mailing list
BaseX-Talk@mailman.uni-konstanz.de
https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk