I'm finding that adding documents to a database starts at about 1ms per document, but gradually gets slower (5ms after about 700,000 documents). I'm doing this with autoflush off, and I've tried periodic flush and optimize commands, but they have no effect.
Is this expected behavior? Is there any way to keep it speedy?
Hi Gerald,
Yes, we are aware that database inserts slow down over time. I would be interested in your experience with the latest snapshot of BaseX [1], which has an improved document index [2]. In some cases, the insertion of new files may get slower, but the replacement of existing files will be sped up a lot with this index.
Thanks in advance, Christian
[1] http://files.basex.org/releases/latest/
[2] https://github.com/BaseXdb/basex/issues/804
Hi Christian,
Perhaps you can give me a hint as to why inserts slow down. I was imagining that most of the indexing work would happen in the Optimize afterwards. It also sounds like adding documents one by one is a lot slower than importing a single file that contains the same documents, right? Somehow this doesn't add up in my mind, so I must be missing something.
I will try to find the time to try out the latest snapshot, but from what I read I guess you're not expecting greater Add speeds, just a faster Replace.
Gerald
> Perhaps you can give me a hint as to why inserts slow down.
I didn't have time to check out 7.9, but I have done some testing with 8.0, and I didn't notice a real slow-down. This is my Java test script (1 million documents are added in just 17 seconds; I'm using the internal BaseX parser to speed up the import):
Performance p = new Performance();
Context ctx = new Context();

new CreateDB("db").execute(ctx);
new Set(MainOptions.AUTOFLUSH, false).execute(ctx);
new Set(MainOptions.INTPARSE, true).execute(ctx);
for(int i = 0; i < 1000000; i++) {
  new Add("db", "<a/>").execute(ctx);
}
ctx.close();
System.out.println(p);
Hope this helps, Christian
Hi Christian,
I set up the 8.0-SNAPSHOT and used the internal parser as well. In your example you're not really giving the index much of a challenge, since every doc is just <a/>.
With respect to ADD, I'm not seeing a significant performance difference:
docs      8.0-SNAPSHOT   7.9
------    ------------   -------
10000     9250ms         8229ms
20000     7626ms         7587ms
30000     7885ms         7973ms
40000     8111ms         8282ms
50000     8365ms         8717ms
60000     8784ms         9294ms
70000     9270ms         10105ms
80000     9692ms         10669ms
90000     10158ms        11301ms
100000    10612ms        11835ms
110000    11018ms        12413ms
120000    11478ms        13000ms
130000    11940ms        13577ms
140000    12505ms        14331ms
150000    13047ms        14488ms
160000    13536ms        15025ms
170000    14055ms        15463ms
180000    14371ms        15815ms
190000    14883ms        16153ms
200000    15330ms        16314ms
210000    15888ms        16562ms
220000    16398ms        17186ms
230000    16878ms        17862ms
240000    17038ms        18340ms
250000    17453ms        18790ms
260000    17965ms        19313ms
270000    18317ms        19850ms
280000    18832ms        20225ms
290000    19373ms        20650ms
300000    19735ms        21062ms
310000    20062ms        21595ms
320000    20675ms        22022ms
330000    21113ms        22414ms
340000    21754ms        22925ms
350000    22887ms        23514ms
360000    22810ms        23762ms
370000    22985ms        24360ms
380000    23506ms        25028ms
390000    23856ms        25446ms
400000    24338ms        25700ms
- Gerald de Jong
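A minimal sketch of a timing harness that would produce numbers of this shape, assuming the same Commands API as the script above (the database name, document content, and batch size are illustrative, not the actual code used):

import org.basex.core.Context;
import org.basex.core.MainOptions;
import org.basex.core.cmd.Add;
import org.basex.core.cmd.CreateDB;
import org.basex.core.cmd.Set;

public class AddTimer {
  public static void main(String[] args) throws Exception {
    Context ctx = new Context();
    new CreateDB("timing").execute(ctx);
    new Set(MainOptions.AUTOFLUSH, false).execute(ctx);
    new Set(MainOptions.INTPARSE, true).execute(ctx);

    long start = System.currentTimeMillis();
    for (int i = 1; i <= 400000; i++) {
      // a document closer to the real records than <a/> (illustrative content)
      String doc = "<narthex xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" id=\"" + i + "\">"
          + "<record><priref>" + i + "</priref></record></narthex>";
      new Add("doc" + i + ".xml", doc).execute(ctx);
      // report the elapsed time after every 10,000 documents
      if (i % 10000 == 0) {
        System.out.println(i + ": " + (System.currentTimeMillis() - start) + "ms");
      }
    }
    ctx.close();
  }
}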
> I set up the 8.0-SNAPSHOT and used the internal parser as well. In your example you're not really giving the index much of a challenge, since every doc is just <a/>.
If I get it right, you assume the slowdown is due to the index structures?
> With respect to ADD, I'm not seeing a significant performance difference:
Please give us more info on the data you are adding. Could you provide us with a sample document?
I don't know what causes the gradual slowdown. My assumption was that the "optimize" afterwards is what builds the index, so I didn't expect a slowdown at all during "add" calls, especially when autoflush is false.
I add documents with the following paths:
/f/f/e/ffe0f5be2aa14e81050f759c8f9c3eb7.xml
The XML file name is a hash of the contents, and it is placed in a path so that an export spreads the files out nicely into a file system tree, rather than putting a million docs into one directory.
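A sketch of how such a path could be derived, assuming the hash is an MD5 hex digest of the document (the actual hash function is not stated in the thread):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class HashPath {
  // Build a path like /f/f/e/ffe0f5be2aa14e81050f759c8f9c3eb7.xml from the document content.
  static String pathFor(String xml) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    byte[] digest = md5.digest(xml.getBytes(StandardCharsets.UTF_8));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) hex.append(String.format("%02x", b));
    String h = hex.toString();
    // the first three hex characters become nested directories, spreading files across the tree
    return "/" + h.charAt(0) + "/" + h.charAt(1) + "/" + h.charAt(2) + "/" + h + ".xml";
  }

  public static void main(String[] args) throws Exception {
    System.out.println(pathFor("<narthex><record/></narthex>"));
  }
}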
The document content is nothing special, wrapped in a special tag:
<narthex xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         id="20412518"
         mod="2014-09-23T11:11:51.007+02:00">
  <record>
    <priref>20412518</priref>
    <current_location>FTA</current_location>
    <current_location.type/>
    <description>Ingang op de binnenplaats van de zuidvleugel</description>
    <collection>Fotocollectie</collection>
    <production.date.start>1925-08-06</production.date.start>
    <reproduction.format/>
    <reproduction.reference>2186abf4-7108-f9b8-ffbb-902881afe836</reproduction.reference>
    <creator.role>Fotograaf</creator.role>
    <object_number>9.387</object_number>
    <monument.label/>
    <monument.zipcode/>
    <monument.name>Kasteel Hoensbroek</monument.name>
    <monument.record_number>284330</monument.record_number>
    <reproduction.date/>
    <reproduction.notes>Oude filepath: 0009\009387.jpg</reproduction.notes>
    <reproduction.type/>
    <reproduction.creator/>
    <rights.type>Copyright</rights.type>
    <technique>Neg.zw</technique>
    <creator>Scheepens, W.C.L.A.</creator>
    <order_number>avh04-2008</order_number>
    <input.date>2008-04-01</input.date>
    <edit.date>2011-05-03</edit.date>
    <edit.date>2008-04-28</edit.date>
    <monument.historical_address/>
    <content.subject.type value="SUBJECT" option="SUBJECT">
      <text language="0">subject</text>
      <text language="1">onderwerp</text>
      <text language="2">sujet</text>
      <text language="3">Thema</text>
      <text language="4">موضوع</text>
      <text language="6">θέμα</text>
    </content.subject.type>
    <content.subject.type value="SUBJECT" option="SUBJECT">
      <text language="0">subject</text>
      <text language="1">onderwerp</text>
      <text language="2">sujet</text>
      <text language="3">Thema</text>
      <text language="4">موضوع</text>
      <text language="6">θέμα</text>
    </content.subject.type>
    <content.subject>Kasteel</content.subject>
    <content.subject>Binnenplaats</content.subject>
    <monument.province>Limburg</monument.province>
    <monument.place>Hoensbroek</monument.place>
    <monument.number/>
    <monument.county/>
    <monument.country>Nederland</monument.country>
    <monument.house_number>18</monument.house_number>
    <monument.street>Klinkertstraat</monument.street>
    <monument.house_number.addition/>
    <monument.complex_number/>
    <monument.number.x_coordinates/>
    <monument.number.y_coordinates/>
    <monument.geographical_keyword/>
    <monument.complex_number.x_coordinates/>
    <monument.complex_number.y_coordinates/>
    <creator.date_of_birth/>
    <creator.date_of_death/>
    <input.name>a.vanhoute</input.name>
    <edit.name>RCEadmin</edit.name>
    <edit.name>a.vanhoute</edit.name>
    <creator.history/>
    <record_type value="OBJECT" option="OBJECT">
      <text language="0">single object</text>
      <text language="2">objet individuel</text>
      <text language="3">Einzelnes Objekt</text>
    </record_type>
    <edit.time>03:10:32</edit.time>
    <edit.time>11:17:08</edit.time>
    <input.time>09:58:28</input.time>
    <input.source>document>photographs</input.source>
    <edit.source>collect>photograph</edit.source>
    <edit.source>document>photographs</edit.source>
  </record>
</narthex>
Thanks for the document. The declaration of the (unused) namespace in the root element seems to be the cause of the decreasing performance (I noticed that the time for adding documents stays constant after removing the declaration). I'll do some profiling to find out if this can be sped up without too much effort (it may take a while, though, because I'll be on leave from tomorrow).
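A rough sketch of how this effect could be reproduced, using the same Commands API as the earlier script; the two test documents and the counts are illustrative, not the actual profiling code:

import org.basex.core.Context;
import org.basex.core.MainOptions;
import org.basex.core.cmd.Add;
import org.basex.core.cmd.Close;
import org.basex.core.cmd.CreateDB;
import org.basex.core.cmd.DropDB;
import org.basex.core.cmd.Set;

public class NamespaceEffect {
  // Add the same document n times into a fresh database and return the elapsed milliseconds.
  static long time(String doc, int n) throws Exception {
    Context ctx = new Context();
    new CreateDB("nstest").execute(ctx);
    new Set(MainOptions.AUTOFLUSH, false).execute(ctx);
    new Set(MainOptions.INTPARSE, true).execute(ctx);
    long start = System.currentTimeMillis();
    for (int i = 0; i < n; i++) new Add("d" + i + ".xml", doc).execute(ctx);
    long ms = System.currentTimeMillis() - start;
    new Close().execute(ctx);
    new DropDB("nstest").execute(ctx);
    ctx.close();
    return ms;
  }

  public static void main(String[] args) throws Exception {
    String withNs = "<narthex xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"><record/></narthex>";
    String without = "<narthex><record/></narthex>";
    System.out.println("with namespace:    " + time(withNs, 100000) + "ms");
    System.out.println("without namespace: " + time(without, 100000) + "ms");
  }
}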
WOW, really... the namespace? Because it's unused, or is it always going to be slow when there are namespaces?
I'm completely surprised by this! You're right, the add time is constant without the namespace.
This namespace happens to be unnecessary, but others won't be. I'm so curious how this can be the cause.
> This namespace happens to be unnecessary, but others won't be. I'm so curious how this can be the cause.
Unfortunately, the intricacies of namespaces have been keeping us XML implementers busy for a long time; the XPath and storage algorithms would be much simpler, if not trivial, without the notion of namespaces. That's why it would take quite a while to explain the reasons for this, and as your input document only contains one namespace, I'm not surprised that you are surprised ;) To put it in a nutshell: it's usually easy to optimize individual namespace issues, but it's difficult to optimize all the cases that occur in practice.
But I'll keep track of your use case.
The other case I'm testing has five necessary namespaces. :(
10000: 6462ms
20000: 7592ms
30000: 8689ms
40000: 9417ms
50000: 9566ms
60000: 10368ms
70000: 10963ms
80000: 12167ms
Is there any direction you can suggest to look for a workaround?
Maybe a general question: is insertion really a bottleneck in your scenario? How much data do you want to store in a single database? You could, e.g., store your data in multiple databases, which can then all be queried by a single XQuery expression.
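A sketch of that approach, assuming the Commands API used earlier and BaseX's db:open() function; the database names and the round-robin split are illustrative, not a recommendation from the thread:

import org.basex.core.Context;
import org.basex.core.cmd.Add;
import org.basex.core.cmd.CreateDB;
import org.basex.core.cmd.Open;
import org.basex.core.cmd.XQuery;

public class MultiDb {
  public static void main(String[] args) throws Exception {
    Context ctx = new Context();
    int parts = 4;

    // create the partition databases
    for (int p = 0; p < parts; p++) new CreateDB("records" + p).execute(ctx);

    // distribute documents round-robin over the partitions
    String[] docs = {
      "<narthex><record><priref>1</priref></record></narthex>",
      "<narthex><record><priref>2</priref></record></narthex>"
    };
    for (int i = 0; i < docs.length; i++) {
      new Open("records" + (i % parts)).execute(ctx);
      new Add("doc" + i + ".xml", docs[i]).execute(ctx);
    }

    // one XQuery expression over all partitions
    String query =
      "for $db in ('records0', 'records1', 'records2', 'records3') " +
      "return count(db:open($db)//record)";
    System.out.println(new XQuery(query).execute(ctx));

    ctx.close();
  }
}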
On Tue, Sep 23, 2014 at 1:50 PM, Gerald de Jong gerald@delving.eu wrote:
The other case I'm testing has five necessary namespaces. :(
10000: 6462ms 20000: 7592ms 30000: 8689ms 40000: 9417ms 50000: 9566ms 60000: 10368ms 70000: 10963ms 80000: 12167ms
Is there any direction you can suggest to look for a workaround?
On Tue, Sep 23, 2014 at 1:43 PM, Christian Grün christian.gruen@gmail.com wrote:
This namespace happens to be unnecessary, but others won't be. I'm so curious how this can be the thing.
Unfortunately, the intricacies of namespaces have been keeping us XML implementers busy for a long time, and the XPath and storage algorithms would be much simpler, if not trivial, without the notion of namespaces. This is why it would take quite a while to explain what are the reasons for that, and as your input document only contains one namespaces, I'm not surprised that you are surprised ;) To put it in a nutshell: it's usually easy to optimize single namespaces issues, but it's difficult to optimize all cases that happen in practice.
But I'll keep track of your use case.
On Tue, Sep 23, 2014 at 1:30 PM, Gerald de Jong gerald@delving.eu wrote:
On Tue, Sep 23, 2014 at 1:20 PM, Gerald de Jong gerald@delving.eu wrote:
WOW, really... the namespace? Because it's unused, or is it always going to slow when there are namespaces?
On Tue, Sep 23, 2014 at 1:13 PM, Christian Grün christian.gruen@gmail.com wrote:
Thanks for the document. The declaration of the (unused) namespace in the root element seems to be the cause for the decreasing performance (I noticed that the time for adding documents stays constant after removing the declaration). I'll do some profiling in order to find out if this can be sped up without too much effort (it may take a while, though, because I'll be on leave for a while from tomorrow).
On Tue, Sep 23, 2014 at 12:25 PM, Gerald de Jong gerald@delving.eu wrote:
I don't know what causes the gradual slowdown. My assumption was that it was the "optimize" which would cause the index to be built, so I didn't expect a slowdown at all during "add" calls, especially when autoflush is false.
I add documents with the following paths:
/f/f/e/ffe0f5be2aa14e81050f759c8f9c3eb7.xml
The xml file name is a hash of the contents, and it is placed in a path such that the export spreads out the files nicely into a file system tree, rather than putting a million docs into one directory.
The document content is nothing special, wrapped in a special tag:
<narthex xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" id="20412518" mod="2014-09-23T11:11:51.007+02:00">
  <record>
    <priref>20412518</priref>
    <current_location>FTA</current_location>
    <current_location.type/>
    <description>Ingang op de binnenplaats van de zuidvleugel</description>
    <collection>Fotocollectie</collection>
    <production.date.start>1925-08-06</production.date.start>
    <reproduction.format/>
    <reproduction.reference>2186abf4-7108-f9b8-ffbb-902881afe836</reproduction.reference>
    <creator.role>Fotograaf</creator.role>
    <object_number>9.387</object_number>
    <monument.label/>
    <monument.zipcode/>
    <monument.name>Kasteel Hoensbroek</monument.name>
    <monument.record_number>284330</monument.record_number>
    <reproduction.date/>
    <reproduction.notes>Oude filepath: 0009\009387.jpg</reproduction.notes>
    <reproduction.type/>
    <reproduction.creator/>
    <rights.type>Copyright</rights.type>
    <technique>Neg.zw</technique>
    <creator>Scheepens, W.C.L.A.</creator>
    <order_number>avh04-2008</order_number>
    <input.date>2008-04-01</input.date>
    <edit.date>2011-05-03</edit.date>
    <edit.date>2008-04-28</edit.date>
    <monument.historical_address/>
    <content.subject.type value="SUBJECT" option="SUBJECT">
      <text language="0">subject</text>
      <text language="1">onderwerp</text>
      <text language="2">sujet</text>
      <text language="3">Thema</text>
      <text language="4">موضوع</text>
      <text language="6">θέμα</text>
    </content.subject.type>
    <content.subject.type value="SUBJECT" option="SUBJECT">
      <text language="0">subject</text>
      <text language="1">onderwerp</text>
      <text language="2">sujet</text>
      <text language="3">Thema</text>
      <text language="4">موضوع</text>
      <text language="6">θέμα</text>
    </content.subject.type>
    <content.subject>Kasteel</content.subject>
    <content.subject>Binnenplaats</content.subject>
    <monument.province>Limburg</monument.province>
    <monument.place>Hoensbroek</monument.place>
    <monument.number/>
    <monument.county/>
    <monument.country>Nederland</monument.country>
    <monument.house_number>18</monument.house_number>
    <monument.street>Klinkertstraat</monument.street>
    <monument.house_number.addition/>
    <monument.complex_number/>
    <monument.number.x_coordinates/>
    <monument.number.y_coordinates/>
    <monument.geographical_keyword/>
    <monument.complex_number.x_coordinates/>
    <monument.complex_number.y_coordinates/>
    <creator.date_of_birth/>
    <creator.date_of_death/>
    <input.name>a.vanhoute</input.name>
    <edit.name>RCEadmin</edit.name>
    <edit.name>a.vanhoute</edit.name>
    <creator.history/>
    <record_type value="OBJECT" option="OBJECT">
      <text language="0">single object</text>
      <text language="2">objet individuel</text>
      <text language="3">Einzelnes Objekt</text>
    </record_type>
    <edit.time>03:10:32</edit.time>
    <edit.time>11:17:08</edit.time>
    <input.time>09:58:28</input.time>
    <input.source>document>photographs</input.source>
    <edit.source>collect>photograph</edit.source>
    <edit.source>document>photographs</edit.source>
  </record>
</narthex>
-- Delving BV, Vasteland 8, Rotterdam http://www.delving.eu http://twitter.com/fluxe skype: beautifulcode +31629339805
Considering that the dataset I just mentioned involves 1.2 million add commands, it does become a bit of an annoyance with large datasets like this. We can have some patience for insertion, even with such a slowdown, so I wouldn't call it a bottleneck exactly.
Can you point me to an example of querying multiple databases? I could try splitting the big datasets up.
The big problem I have right now is the IllegalMonitorStateException that freezes the basexserver. After this happens I even have to kill -9 the process.
On Tue, Sep 23, 2014 at 1:55 PM, Christian Grün christian.gruen@gmail.com wrote:
Maybe a general question: Is the insertion really a bottleneck in your scenario? How much data do you want to store in a single database? You could e.g. store your data in multiple databases, which can then all be queried by a single XQuery expression.
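As a sketch of what such a query could look like (not from the mail; the database names and the predicate are invented), using the Java command API from the earlier test and db:open() to address several databases in one expression:

import org.basex.core.Context;
import org.basex.core.cmd.XQuery;

public final class MultiDbQuery {
  public static void main(String[] args) throws Exception {
    Context ctx = new Context();
    // One FLWOR iterating over several (hypothetical) databases.
    String query =
      "count(" +
      "  for $db in ('narthex-1', 'narthex-2', 'narthex-3')" +
      "  return db:open($db)//record[monument.place = 'Hoensbroek']" +
      ")";
    System.out.println(new XQuery(query).execute(ctx));
    ctx.close();
  }
}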
On Tue, Sep 23, 2014 at 1:50 PM, Gerald de Jong gerald@delving.eu wrote:
The other case I'm testing has five necessary namespaces. :(
10000: 6462ms
20000: 7592ms
30000: 8689ms
40000: 9417ms
50000: 9566ms
60000: 10368ms
70000: 10963ms
80000: 12167ms
Is there any direction you can suggest to look for a workaround?
On Tue, Sep 23, 2014 at 1:43 PM, Christian Grün <christian.gruen@gmail.com> wrote:
> This namespace happens to be unnecessary, but others won't be. I'm so curious how this can be the thing.
Unfortunately, the intricacies of namespaces have been keeping us XML implementers busy for a long time, and the XPath and storage algorithms would be much simpler, if not trivial, without the notion of namespaces. This is why it would take quite a while to explain the reasons for that, and as your input document only contains one namespace, I'm not surprised that you are surprised ;) To put it in a nutshell: it's usually easy to optimize single namespace issues, but it's difficult to optimize all cases that happen in practice.
But I'll keep track of your use case.
On Tue, Sep 23, 2014 at 1:30 PM, Gerald de Jong gerald@delving.eu wrote:
On Tue, Sep 23, 2014 at 1:20 PM, Gerald de Jong gerald@delving.eu wrote:
WOW, really... the namespace? Because it's unused, or is it always going to be slow when there are namespaces?
On Tue, Sep 23, 2014 at 1:13 PM, Christian Grün christian.gruen@gmail.com wrote:
-- Delving BV, Vasteland 8, Rotterdam http://www.delving.eu http://twitter.com/fluxe skype: beautifulcode +31629339805
A philosophical question, perhaps, or one that might be easily answered by someone with a lot more BaseX experience than me:
Would it make more sense to store one big "file" in BaseX corresponding to, say, the 1.2 million records, rather than storing 1.2 million cleverly named xml documents as I'm doing now? I suppose add would then become insert (after - for speed), but would that maybe overcome the namespace-related performance issue and even be faster in general?
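For what it's worth, a hedged sketch (names invented, not from the thread) of what appending a record to one big stored document could look like via XQuery Update, run through the same Java command API; it only shows the syntax and makes no claim about which variant is faster:

import org.basex.core.Context;
import org.basex.core.cmd.XQuery;

public final class AppendRecord {
  public static void main(String[] args) throws Exception {
    Context ctx = new Context();
    String rec = "<narthex id='20412518'><record/></narthex>";
    // Assumes a database 'narthex-big' whose document has a <narthex-collection> root
    // that already contains at least one child ("insert ... after" needs a target node).
    new XQuery("insert node " + rec +
        " after db:open('narthex-big')/narthex-collection/*[last()]").execute(ctx);
    // The other standard form would be: insert node ... as last into /narthex-collection.
    ctx.close();
  }
}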
On Tue, Sep 23, 2014 at 2:05 PM, Gerald de Jong gerald@delving.eu wrote:
-- Delving BV, Vasteland 8, Rotterdam http://www.delving.eu http://twitter.com/fluxe skype: beautifulcode +31629339805
Hi Gerald,
not sure but take into account that, AFAIK, there are limitations on the size (number of nodes) that can be kept in a single DB.
M.
On 23/09/2014 15:32, Gerald de Jong wrote:
I've been looking around and I can't find what those limitations are. I'll scan the 7.9 book tonight, maybe it's there.
Alternatively, maybe it would make sense to store, say, 100,000 documents per database, and then query over multiple when necessary.
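A possible shape for that splitting, again only a sketch with made-up names: create a fresh database every 100,000 documents and keep adding to the one that is currently open, mirroring the command pattern from the earlier Java test:

import org.basex.core.Context;
import org.basex.core.MainOptions;
import org.basex.core.cmd.Add;
import org.basex.core.cmd.CreateDB;
import org.basex.core.cmd.Set;

public final class ChunkedAdd {
  static final int CHUNK = 100000; // documents per database

  public static void main(String[] args) throws Exception {
    Context ctx = new Context();
    new Set(MainOptions.AUTOFLUSH, false).execute(ctx);
    new Set(MainOptions.INTPARSE, true).execute(ctx);
    for(int i = 0; i < 1200000; i++) {
      if(i % CHUNK == 0) {
        // Starting a new chunk: CREATE DB also opens it, so the following ADDs go there.
        new CreateDB("narthex-" + (i / CHUNK)).execute(ctx);
      }
      new Add("doc" + i + ".xml", "<narthex id='" + i + "'><record/></narthex>").execute(ctx);
    }
    ctx.close();
  }
}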
On Tue, Sep 23, 2014 at 3:40 PM, Marco Lettere marco.lettere@dedalus.eu wrote:
-- Delving BV, Vasteland 8, Rotterdam http://www.delving.eu http://twitter.com/fluxe skype: beautifulcode +31629339805
I see on http://stackoverflow.com/questions/25113900/inserting-millions-of-xml-files-...
The limit on the number of stored XML documents is 2^29, which is 536,870,912. The limit for XML nodes is 2^31, which is 2,147,483,648 (although this includes all nodes, including attributes, texts, etc.).
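If it helps, a rough way to see how close a database is to those limits (database name invented; counting every node walks the whole database, so it is not cheap):

import org.basex.core.Context;
import org.basex.core.cmd.XQuery;

public final class SizeCheck {
  public static void main(String[] args) throws Exception {
    Context ctx = new Context();
    String query =
      "let $db := db:open('narthex-1') " +
      "return concat(count($db), ' documents, ', " +
      "              count($db//node()) + count($db//@*), ' descendant nodes')";
    System.out.println(new XQuery(query).execute(ctx));
    ctx.close();
  }
}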
On Tue, Sep 23, 2014 at 3:50 PM, Gerald de Jong gerald@delving.eu wrote:
I've been looking around and I can't find what those limitations are. I'll scan the 7.9 book tonight, maybe it's there.
Alternatively, maybe it would make sense to store, say, 100,000 documents per database, and then query over multiple when necessary.
On Tue, Sep 23, 2014 at 3:40 PM, Marco Lettere marco.lettere@dedalus.eu wrote:
Hi Gerald, not sure but take into account that, AFAIK, there are limitations on the size (number of nodes) that can be kept in a single DB. M.
On 23/09/2014 15:32, Gerald de Jong wrote:
A philosophical question, perhaps, or one that might be easily answered by someone with a lot more BaseX experience than me:
Would it make more sense to store one big "file" in BaseX corresponding to the, say, 1.2 million records, rather than storing 1.2 million cleverly named xml documents as i'm doing now? I suppose add would then become insert (after - for speed), but would that maybe overcome the namespace-related performance issue and even be faster in general?
On Tue, Sep 23, 2014 at 2:05 PM, Gerald de Jong gerald@delving.eu wrote:
Considering that the dataset I just mentioned involves 1.2 million add commands, it does become a bit of annoyance with some large datasets like this. We can have some patience for insertion, even with such a slowdown, so I wouldn't say bottleneck exactly.
Can you point me to an example of querying multiple databases? I could try splitting the big datasets up.
The big problem I have right now is the IllegalMonitorStateException that freezes the basexserver. After this happens I have to kill -9 the process even.
On Tue, Sep 23, 2014 at 1:55 PM, Christian Grün < christian.gruen@gmail.com> wrote:
Maybe a general question: Is the insertion really a bottleneck in your scenario? How many data do you want to store in a single database? You could e.g. store your data in multiple databases, which can then all be queried by a single XQuery expression.
On Tue, Sep 23, 2014 at 1:50 PM, Gerald de Jong gerald@delving.eu wrote:
The other case I'm testing has five necessary namespaces. :(
10000: 6462ms 20000: 7592ms 30000: 8689ms 40000: 9417ms 50000: 9566ms 60000: 10368ms 70000: 10963ms 80000: 12167ms
Is there any direction you can suggest to look for a workaround?
On Tue, Sep 23, 2014 at 1:43 PM, Christian Grün <
christian.gruen@gmail.com>
wrote:
> This namespace happens to be unnecessary, but others won't be.
I'm so
> curious how this can be the thing.
Unfortunately, the intricacies of namespaces have been keeping us XML implementers busy for a long time, and the XPath and storage algorithms would be much simpler, if not trivial, without the notion of namespaces. This is why it would take quite a while to explain
what
are the reasons for that, and as your input document only contains
one
namespaces, I'm not surprised that you are surprised ;) To put it in
a
nutshell: it's usually easy to optimize single namespaces issues, but it's difficult to optimize all cases that happen in practice.
But I'll keep track of your use case.
Hi all,
Yes, there can be up to 2^31 nodes, and up to 2^29 files (around 536 million)
http://docs.basex.org/wiki/Statistics
But whatever the strategy – many documents versus one big XML file – you will encounter the same limitation on the number of nodes.
In my experience, the document strategy is close to NoSQL document stores like Couchbase or MongoDB.
If you update your collection per document, you can use the replace command instead of XQuery Update and be free of pending update list limitations.
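As a concrete sketch of that per-document replace (the database name 'narthex' is an assumption; the path follows the hashed layout Gerald described): on the command line this is the REPLACE command, and from XQuery it is db:replace, which swaps the stored document at the given path in one step:

(: replace one stored document, addressed by its database path :)
db:replace(
  'narthex',
  'f/f/e/ffe0f5be2aa14e81050f759c8f9c3eb7.xml',
  <narthex id="20412518"><record><priref>20412518</priref></record></narthex>
)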
Christian, from what I read in the last exchanges, the document index is now a persistent data structure? Could you tell us whether document paths are indexed, and whether this index is incremental or has to be rebuilt with the optimize command?
If so, using the document strategy could be a real benefit, because you do not have to reindex the attribute or text index in order to update an entire document's content. (If you store your documents in a single big document, you have to maintain metadata in each root element in order to access them directly, and so you have to reindex after each update query.)
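For comparison, a sketch of the aggregated "big document" strategy described above (database and element names are hypothetical): each sub-record carries its key as metadata on its root element, and updating it means an XQuery Update replace inside the big document, which brings exactly the reindexing cost mentioned here:

(: replace one sub-record inside the single big document, located via its id attribute :)
let $new :=
  <record id="20412518">
    <object_number>9.387</object_number>
  </record>
for $old in db:open('big-docs')//record[@id = '20412518']
return replace node $old with $new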
Here is my use case: 80 million documents partitioned into a few collections, with about 400,000 documents inserted or replaced each week. Because of the previous limitation of the document list, I had to use the XQuery Update strategy, aggregating documents into big documents. In the end I spend more time updating than reindexing, because I have to update all the sub-documents of each collection at once in order to use the indexes.
The new document data structure is very good news!
Best regards, Fabrice Etanchaud Questel/Orbit
From: basex-talk-bounces@mailman.uni-konstanz.de [mailto:basex-talk-bounces@mailman.uni-konstanz.de] On behalf of Marco Lettere Sent: Tuesday, September 23, 2014 15:40 To: basex-talk@mailman.uni-konstanz.de Subject: Re: [basex-talk] Adding documents slows over time
Hi Gerald, not sure but take into account that, AFAIK, there are limitations on the size (number of nodes) that can be kept in a single DB. M.
Hi Fabrice,
If you update your collection per document, you can use the replace command instead of XQuery Update and be free of pending update list limitations.
I would be interested to know what limitations you have observed so far.
Christian, from what I read in the last exchanges, the document index is now a persistent data structure?
Exactly. After it has been requested for the first time, it will additionally be stored on disk and updated incrementally. I would be interested to have your feedback on the latest snapshot.
Christian
-----Original Message----- From: Fabrice Etanchaud Sent: Tuesday, September 23, 2014 18:00 To: 'Christian Grün' Subject: RE: [basex-talk] Adding documents slows over time
Dear Christian,
In our old tests, we found that in a collection with several million documents, opening the collection or replacing a document was very slow.
In the latest snapshot, could you tell us how to use the index on document names? Given 10,000,000 documents named $i.xml, each containing <xml>{$i}</xml>, we found that the text index is 470x faster than the document index:
Compiling:
- pre-evaluating (7000001 to 7001000)
Query: for $i in 7000001 to 7001000 return db:open('docs', xs:string($i) || '.xml')
Optimized Query: for $i_0 in (7000001 to 7001000) return db:open("docs", fn:concat($i_0 cast as xs:string, ".xml"))
Result:
- Hit(s): 1000 Items
- Updated: 0 Items
- Printed: 19500 Bytes
- Read Locking: local [docs]
- Write Locking: none
Timing:
- Parsing: 0.91 ms
- Compiling: 0.24 ms
- Evaluating: 68514.39 ms
- Printing: 1.61 ms
- Total Time: 68517.16 ms
Compiling:
- pre-evaluating (7000001 to 7001000)
Query: for $i in 7000001 to 7001000 return db:text('docs', xs:string($i))/root()
Optimized Query: for $i_0 in (7000001 to 7001000) return db:text("docs", $i_0 cast as xs:string)/fn:root()
Result:
- Hit(s): 1000 Items
- Updated: 0 Items
- Printed: 19500 Bytes
- Read Locking: local [docs]
- Write Locking: none
Timing:
- Parsing: 2.62 ms
- Compiling: 0.23 ms
- Evaluating: 143.72 ms
- Printing: 1.59 ms
- Total Time: 148.16 ms
In the latest snapshot, could you tell us how to use the index on document names?
The index should be created automatically after having run your first path-based query; subsequent queries should give you better results.
Dear Christian, By path-based query, do you mean db:open or collection calls? These requests are very slow; it's as if the document list were not indexed at all.
Best regards,
By path-based query, do you mean db:open or collection calls?
It should apply to both.
These requests are very slow; it's as if the document list were not indexed at all.
Have you already tried 8.0? If yes, you should find a "doc.basex" file in your database directory after running your query.
Gerald,
I'm glad to tell you that the latest snapshot [1] contains some additional optimizations for adding documents with namespaces. It should now be irrelevant whether your added document has a namespace on top or not.
I'll be offline for some days (and I hope I didn't introduce a bad bug with the latest commit ;).
Have fun, Christian
[1] http://files.basex.org/releases/latest
Wonderful, Christian! Thanks. I will try it out.