On Fri, Jan 30, 2015 at 3:55 PM, Menashè Eliezer meliezer@ogs.trieste.it wrote:
Hello, I wonder if the attached query can be optimised. I'm attaching all relevant information. BaseX 7.9, Debian, powerful server. This is just an example. The queries will be built from whatever the user fills in on a search form, so reordering the conditions to get a smaller subset right from the beginning isn't an option. Any help would be appreciated. 40 seconds is not acceptable.
Hi Menashè,
First of all, I wonder if your query really does what you want it to do. I noticed for example that some of the where conditions start with "$x/", while others start with "/" and some others start with no slash. Is this intentional?
Some more comments:
* I would recommend avoiding numeric tests in the @codeListValue tests and using string tests instead (/@codeListValue = "7827", etc.).
* Usually, you can also get rid of the xs:dateTime() conversions, because items of type date and time can also be compared as strings.
* I'm not sure what the predicates [*] are supposed to do in your query. If you remove them, you will get the same results.
* In some cases, if you know that an element name is distinct, you can get rid of all the explicit child steps and directly address the node via the descendant axis.
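To illustrate the first two points, here is a minimal sketch on invented inline data (the element and attribute names are assumptions, not taken from the actual documents, and inline data is of course not indexed). It contrasts a string test with a numeric test, and it compares timestamps as plain strings, which only gives correct results if all values share the same ISO 8601 layout and time zone notation.

```xquery
(: Illustrative data only; "records", "record", "code" and "date" are invented names. :)
let $doc :=
  <records>
    <record code="7827" date="2014-03-01T00:00:00"/>
    <record code="1234" date="2015-01-15T12:30:00"/>
  </records>
return (
  (: string test: the attribute value is compared as a string :)
  count($doc/record[@code = "7827"]),
  (: numeric test: every attribute value has to be cast to a number first :)
  count($doc/record[@code = 7827]),
  (: timestamps in the same ISO 8601 layout order correctly as strings :)
  count($doc/record[@date >= "2015-01-01T00:00:00"])
)
```

All three tests select exactly one record here; on a real database the string form is the one that can be matched against a value index.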
> So reordering the conditions for having smaller subset right from the beginning isn't relevant.
Reordering shouldn't make a big difference anyway, because BaseX tries to find the cheapest index request by itself, based on the database statistics.
Besides that, I would be interested to hear if you get better results with BaseX 8.0 [1], because we recently spent quite some time further improving our index rewriting rules.
Hope this helps, Christian
[1] http://files.basex.org/releases/latest
Hi Christian,
Thank you for your reply. Updated files are attached.
On 01/30/2015 04:35 PM, Christian Grün wrote:
> Hi Menashè,
> First of all, I wonder if your query really does what you want it to do. I noticed for example that some of the where conditions start with "$x/", while others start with "/" and some others start with no slash. Is this intentional?
I've added $x and now it takes a little less: 30 sec. I haven't seen a case with no slash.
> Some more comments:
> - I would recommend you to avoid numeric tests in the @codeListValue tests and use string tests instead (/@codeListValue = "7827", etc).
Done. Down to 23 sec.
> - Usually, you can also get rid of the xs:dateTime() conversions, because items of type date and time can also be compared as strings.
Done. Down to almost 19 sec. Still too much.
> - I'm not sure what the predicates [*] are supposed to do in your query. If you remove them, you will get the same results.
The * means that I don't know whether it's 1, 2 or any other number inside the XPath, e.g.
/gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification/gmd:descriptiveKeywords[1]/gmd:MD_Keywords/gmd:keyword[2]/sdn:SDN_ParameterDiscoveryCode/@codeListValue
How can I remove *?
> - In some cases, if you know that an element name is distinct, you can get rid of all the explicit child steps and directly address the node via the descendant axis.
Thanks, but it's not relevant in my case.
>> So reordering the conditions for having smaller subset right from the beginning isn't relevant.
> Reordering shouldn't make a big difference anyway, because BaseX tries to find the cheapest index request by itself, based on the database statistics.
Great, as I expect from a good product :)
> Beside that, I would be interested to hear if you get better results with BaseX 8.0 [1], because we recently spent quite some time to further improve our index rewriting rules.
Sure, I'll also try BaseX 8.0 and compare. Should I recreate the db by importing the XML files again to test the improved indexing?
-- With kind regards, Menashè
> /gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification/gmd:descriptiveKeywords[1]/gmd:MD_Keywords/gmd:keyword[2]/sdn:SDN_ParameterDiscoveryCode/@codeListValue
> How can I remove *?
Simply remove the predicate; a[*]/b is the same as a/b.
>> - In some cases, if you know that an element name is distinct, you can get rid of all the explicit child steps and directly address the node via the descendant axis.
> Thanks, but it's not relevant in my case.
Is it because the element names are not distinct? Or is it because your input form allows users to choose arbitrary paths for arbitrary documents?
> Sure, I'll also try BaseX 8.0 and compare. Should I recreate the db importing the xml files for testing the improved indexing?
We have actually improved support for collections, but the database format itself has not changed, so it shouldn't make a difference in your case.
Christian
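The equivalence of a[*]/b and a/b mentioned above can be checked with a small sketch on invented data (the element names a, b, c and root are placeholders). The predicate [*] only keeps those a elements that have at least one child element, and every a that has a b child satisfies that anyway, so both paths should select the same b nodes.

```xquery
(: Illustrative data only; the element names are placeholders. :)
let $doc :=
  <root>
    <a><b>1</b></a>
    <a>text only</a>
    <a><c/><b>2</b></a>
  </root>
return (
  (: with the [*] predicate :)
  $doc/a[*]/b,
  (: without the predicate :)
  $doc/a/b,
  (: compares the two result sequences item by item :)
  deep-equal($doc/a[*]/b, $doc/a/b)
)
```

A positional predicate such as [1] or [2] is a different matter: it does restrict the result, so it cannot be dropped without changing the meaning of the path.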
On 01/30/2015 05:18 PM, Christian Grün wrote:
>> /gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification/gmd:descriptiveKeywords[1]/gmd:MD_Keywords/gmd:keyword[2]/sdn:SDN_ParameterDiscoveryCode/@codeListValue
>> How can I remove *?
> Simply remove the predicate; a[*]/b is the same as a/b.
Maybe I wasn't clear. The actual number appears in the XML file, e.g., gmd:descriptiveKeywords[1]. Anyway, I've removed all [*] and I get the same correct result; however, the processing time has doubled...
>>> - In some cases, if you know that an element name is distinct, you can get rid of all the explicit child steps and directly address the node via the descendant axis.
>> Thanks, but it's not relevant in my case.
> Is it because the element names are not distinct? Or is it because your input form allows users to choose arbitrary paths for arbitrary documents?
The element names are not distinct.
With kind regards, Menashè
It's indeed interesting that your query does not use any of the existing index structures (if it did, you would find strings like "applying text index" or "applying attribute index" in the query info). Maybe/hopefully things look different with Version 8.0.
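Besides reading the query info, the index state can also be cross-checked from within XQuery via the BaseX Database Module. The following is only a sketch: it assumes the database is named ALL-CDIS as elsewhere in this thread, and "7827" is just a sample value. db:info returns the database metadata, and db:attribute performs an explicit attribute index lookup that does not depend on the optimizer rewriting anything.

```xquery
(: Assumptions: database name "ALL-CDIS"; "7827" is only a sample value. :)
(
  (: database metadata, to be inspected for the index-related entries :)
  db:info("ALL-CDIS"),
  (: explicit attribute index lookup, restricted to attributes named codeListValue :)
  count(db:attribute("ALL-CDIS", "7827", "codeListValue"))
)
```

If the explicit lookup returns quickly while the original query stays slow, that would suggest the index itself is fine and only the automatic rewriting is not being triggered.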
Almost the same speed with version 8.0. No index is used (no "applying" in the query info). As shown in the files I attached earlier, indexes are active for this DB.
With kind regards, Menashè
Could you possibly provide me with a small snapshot of your data sources (one, two documents might be sufficient)?
Hi Menashè,
Thanks for the XML samples you sent me in private. I noticed that the index rewritings will only be triggered if you formulate your query as follows:
OLD: for $x in collection("ALL-CDIS") where $x/gmd:MD_Metadata/gmd:identificationInfo/... return ...
NEW: for $x in collection("ALL-CDIS")/gmd:MD_Metadata where $x/gmd:identificationInfo/... return ...
It's difficult to explain briefly why the first variant (OLD) cannot be optimized as straightforwardly (basically, it's quite a different pattern to look for), but I'll check whether we can extend our matcher to support this kind of query as well.
So, if possible, I would recommend that for now (and at least for testing) you move the root element test to directly after the collection() function. I noticed that the first three child steps are the same in all of your conditions:
gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification
If that will always be the case, it surely makes sense to move all of them to the "for" clause.
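Spelled out, the suggested shape might look like the sketch below. The path steps and the value "7827" are taken from this thread; the two namespace URIs are assumptions (the usual ISO 19139 gmd namespace and a guessed SeaDataNet one) and need to be checked against the actual documents, and the single where condition merely stands in for whatever the search form generates.

```xquery
(: Assumption: namespace URIs; verify them against the real documents. :)
declare namespace gmd = "http://www.isotc211.org/2005/gmd";
declare namespace sdn = "http://www.seadatanet.org";

(: the three shared child steps are moved into the "for" clause :)
for $x in collection("ALL-CDIS")
            /gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification
(: string test on the attribute, without positional or [*] predicates :)
where $x/gmd:descriptiveKeywords/gmd:MD_Keywords/gmd:keyword
        /sdn:SDN_ParameterDiscoveryCode/@codeListValue = "7827"
(: returns the URI of each matching document :)
return base-uri($x)
```

With this shape, each where condition starts at $x and only contains the steps that differ between conditions, which also keeps the generated query text shorter.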
Looking forward to your updated performance tests, Christian
Hi Christian,
Interesting! I'll check it when I'm back at the office and keep you updated. I'll use for $x in collection("ALL-CDIS")/gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification as you've suggested. Should I expect to see the usage of an index for each of the where phrases?
Have a nice weekend! Menashè
Hi Menashè,
> Should I expect to see the usage of an index for each of the where phrases?
Usually, only one predicate will be rewritten for index access, and the remaining conditions will be answered sequentially.
> Have a nice weekend!
Enjoy, Christian
Hi Menashè,
With the latest snapshot [1], your original query should now be rewritten for index access as well. Looking forward to your tests,
Christian
PS: In terms of performance, it may still be worthwhile to move redundant paths to the for clause; but just try and see.
[1] http://files.basex.org/releases/latest/
Hi Christian,
Thank you very much! Unfortunately I'll be at the office only tomorrow.
Menashè
On Sat, 31 Jan 2015 16:42:32 +0100, Christian Grün christian.gruen@gmail.com wrote:
Hi Menashè,
With the latest snapshot [1], your original query should now be rewritten for index access as well. Looking forward to your tests,
Christian
PS: In terms of performance, it may still be worthwhile to move redundant paths to the for clause; but just try and see.
[1] http://files.basex.org/releases/latest/
On Fri, Jan 30, 2015 at 9:49 PM, Christian Grün christian.gruen@gmail.com wrote:
Hi Menashè,
Should I expect to see the usage of an index for each of the where
phrases?
Usually, only one predicate will be rewritten for index access, and the remaining conditions will be answered sequentially.
Have a nice weekend!
Enjoy, Christian
Menashè
On Fri, 30 Jan 2015 18:11:59 +0100, Christian Grün christian.gruen@gmail.com wrote:
Hi Menashè,
Thanks for the XML samples you sent me in private. I noticed that the index rewritings will only be triggered if you formulate your query as follows:
OLD:
for $x in collection("ALL-CDIS")
where $x/gmd:MD_Metadata/gmd:identificationInfo/...
return ...

NEW:
for $x in collection("ALL-CDIS")/gmd:MD_Metadata
where $x/gmd:identificationInfo/...
return ...
It's difficult to explain briefly why Variant 1 cannot be optimized as straightforwardly (basically, it's quite a different pattern to look for), but I'll check whether we can extend our matcher to also support this kind of query.
So, if possible, I would recommend that for now (and at least for testing) you move the root element test after the collection() function. I noticed that the first three child steps are the same in all of your conditions:
gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification
If that will always be the case, it surely makes sense to move all of them to the "for" clause.
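As a rough illustration of that suggestion, the following sketch moves all three shared steps into the "for" clause. The namespace URIs are the ones declared in the logged query later in this thread; the condition is pieced together from the path and the "7827" value mentioned elsewhere in the thread, with the positional predicates left out, so it is an assumption about the real search form rather than the original query. It needs the ALL-CDIS database to run.

```xquery
declare namespace gmd = "http://www.isotc211.org/2005/gmd";
declare namespace sdn = "http://www.seadatanet.org";

(: The shared steps are written once in the "for" clause, so each $x is an
   sdn:SDN_DataIdentification element rather than a document node. :)
for $x in collection("ALL-CDIS")
          /gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification
(: Only the condition-specific steps remain in "where" (illustrative path). :)
where $x/gmd:descriptiveKeywords/gmd:MD_Keywords/gmd:keyword
        /sdn:SDN_ParameterDiscoveryCode/@codeListValue = "7827"
return $x
```

Whether an index is actually used can be checked in the query info, which should then contain a line such as "applying attribute index".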
Looking forward to your updated performance tests, Christian _______________________________
On Fri, Jan 30, 2015 at 5:55 PM, Christian Grün christian.gruen@gmail.com wrote:
Could you possibly provide me with a small snapshot of your data sources (one, two documents might be sufficient)?
On Fri, Jan 30, 2015 at 5:52 PM, Menashè Eliezer meliezer@ogs.trieste.it wrote:
Almost the same speed with version 8.0. No indexing (no "applying" in the query info). As I've attached before, indexes are active for this DB.
With kind regards, Menashè
On 01/30/2015 05:31 PM, Christian Grün wrote:
> It's indeed interesting that your query does not use any of the existing index structures (if they did, you would find strings like "applying text index" or "applying attribute index" in the query info). Maybe/hopefully things look different with Version 8.0.
>
> On Fri, Jan 30, 2015 at 5:26 PM, Menashè Eliezer meliezer@ogs.trieste.it wrote:
>> On 01/30/2015 05:18 PM, Christian Grün wrote:
/gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification/gmd:descriptiveKeywords[1]/gmd:MD_Keywords/gmd:keyword[2]/sdn:SDN_ParameterDiscoveryCode/@codeListValue
>>>> How can I remove *?
>>> Simply remove the predicate; a[*]/b is the same as a/b.
>> Maybe I wasn't clear. The actual number appears in the xml file, e.g., gmd:descriptiveKeywords[1]
>> Anyway, I've removed all [*] and I get the same correct result, however the processing time is doubled...
>>>>> * In some cases, if you know that an element name is distinct, you can get rid of all the explicit child steps and directly address the node via the descendant axis.
>>>> Thanks, but it's not relevant in my case.
>>> Is it because the element names are not distinct? Or is it because your input form allows users to choose arbitrary paths for arbitrary documents?
>> The element names are not distinct.
>>>> Sure, I'l also try BaseX 8.0 and compare. Should I recreate the db importing the xml files for testing the improved indexing?
>>> We have actually improved support for collections, but the database format itself has not changed, so it shouldn't make a difference in your case.
>>>
>>> Christian
>>>>> [1] http://files.basex.org/releases/latest
>>>>>
>>>>> On Fri, Jan 30, 2015 at 3:55 PM, Menashè Eliezer meliezer@ogs.trieste.it wrote:
>>>>>> Hello,
>>>>>> I wonder if the attached query can be optimised. I'm attaching all relevant information.
>>>>>> Basex 7.9, Debian, powerful server.
>>>>>> This is just an example. The queries will be built based on a compilation of a search form.
>>>>>> Any help would be appreciated.
>>>>>> 40 seconds are not acceptable.
>>>>>>
>>>>>> --
>>>>>> With kind regards,
>>>>>> Menashè
>>>> --
>>>> With kind regards,
>>>> Menashè
>> With kind regards,
>> Menashè
-- Menashè
Hi Christian,
Thank you! The query time is now down to 0.5 seconds!
The biggest improvement comes from the query rephrasing you suggested, and the latest snapshot helps a lot as well! You may want to know that the log of the latest snapshot shows: applying attribute index for "7827", which is not clear to the user, whereas BaseX80-20150130.124009, which also used indexing, showed: applying attribute index for ("ALKY", "AYMD")
I'm attaching the first and the second launch of the query using BaseXGUI. Relaunching the same query reduces the time from over 1 second to 0.5 seconds. Some data:
BaseX80-20150130.124009:
Total Time: 30676.02 ms
After using "for $x in collection("ALL-CDIS")/gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification": Total Time: 5456.74 ms (applying attribute index for ("ALKY", "AYMD") in log)
Second launch: 1333.71 ms
Latest snapshot (BaseX80-20150202.121033):
1st: Total Time: 1873.02 ms
2nd: Total Time: 548.62 ms
With kind regards, Menashè
On 02/02/2015 02:02 PM, Menashè Eliezer wrote: [...]
;) Looks good!
Thanks for the updated report, Christian
On Tue, Feb 3, 2015 at 1:13 PM, Menashè Eliezer meliezer@ogs.trieste.it wrote: [...]
Hi Christian, I'm having performance problems again. I'm running BaseX 8.2.1. As you may remember, you recommended changing 'for $x in collection("CDI")' to 'for $x in collection("CDI")/gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification'. However, I've discovered that I cannot specify an XPath while working with IDs (db:node-pre). It's a multi-step process: the client program sends the search filter defined by the end user to the server and gets IDs back. Then several queries fetch different information about this specific subset. Instead of redefining the filters, the only condition is "where db:node-pre($x)=$ids", for better performance. Once I specify an XPath, the IDs seem to lose their meaning: the result set is always empty when they are used. So I've gone back to 'for $x in collection("CDI")' in the first query, which gets all the IDs, but the performance is extremely poor.
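One possible reading of the empty result, offered as an assumption that the thread does not confirm: a pre value identifies a single node, so the pre value of a document node differs from that of its sdn:SDN_DataIdentification descendant. IDs collected from one kind of node would then never match a loop over the other kind. The sketch below keeps the IDs at document level while still filtering on the hoisted path. The two steps are shown in one query for brevity, whereas the described workflow runs them as separate queries, and the single condition is taken from the logged query with a numeric literal.

```xquery
declare namespace gco = "http://www.isotc211.org/2005/gco";
declare namespace gmd = "http://www.isotc211.org/2005/gmd";
declare namespace sdn = "http://www.seadatanet.org";

(: Step 1: filter on the hoisted path, but return the pre value of the
   enclosing document node, so that the IDs stay document-level. :)
let $ids :=
  for $d in db:open("CDI")
            /gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification
  where $d/gmd:extent/gmd:EX_Extent/gmd:geographicElement
          /gmd:EX_GeographicBoundingBox/gmd:westBoundLongitude/gco:Decimal
          >= -5.8447265625
  return db:node-pre(root($d))

(: Step 2: open each document directly by its pre value instead of
   scanning the whole collection and comparing db:node-pre($x) to $ids. :)
for $pre in distinct-values($ids)
return db:path(db:open-pre("CDI", $pre))
```

Pre values can change when the database is updated, so this only holds if the database is not modified between the two steps. Whether step 1 is rewritten for index access would still need to be checked in the query info.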
I'm attaching the query and its related info using BaseXGUI (local server) with a much smaller database. The performance seems OK.
I'm using your BaseXClient.java, however I see the delay already in the BaseX server logs:
QUERY[0] xquery version "3.0"; declare namespace queryName ='GetIDS'; declare namespace gco = "http://www.isotc211.org/2005/gco"; declare namespace gmd = "http://www.isotc211.org/2005/gmd"; declare namespace gml = "http://www.opengis.net/gml"; declare namespace gmx="http://www.isotc211.org/2005/gmx"; declare namespace sdn = "http://www.seadatanet.org"; declare namespace fn = "http://www.w3.org/2005/xpath-functions"; declare namespace xs = "http://www.w3.org/2001/XMLSchema"; declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization"; declare option output:method 'xml';declare option output:item-separator ","; let $db := db:open("CDI") for $x in $db where $x/gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification/gmd:extent/gmd:EX_Extent/gmd:geographicElement/gmd:EX_GeographicBoundingBox/gmd:westBoundLongitude/gco:Decimal>="-5.8447265625" and $x/gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification/gmd:extent/gmd:EX_Extent/gmd:geographicElement... 0.17 ms 110 16:36:09.713 192.168.155.30:39211 admin OK RESULTS[0] 25957.11 ms
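A side note on the logged condition, independent of the speed issue: under the XQuery general comparison rules, an untyped value compared with a string literal is compared as a string, while the same value compared with a numeric literal is cast to xs:double. For a range test on longitudes the two can disagree. The small sketch below uses an inline element with a made-up value of -3.2 (hypothetical sample data, not from the CDI database) and assumes the default codepoint collation.

```xquery
declare namespace gco = "http://www.isotc211.org/2005/gco";

(: Hypothetical sample value; a constructed element is untyped. :)
let $west := <gco:Decimal>-3.2</gco:Decimal>
return (
  (: String comparison: "-3.2" sorts before "-5.8447265625", so false. :)
  $west >= "-5.8447265625",
  (: Numeric comparison: -3.2 is greater than -5.8447265625, so true. :)
  $west >= -5.8447265625
)
```

So for bounding-box filters it may be safer for the generated queries to emit numeric literals without quotes, even though string tests remain fine for equality checks such as @codeListValue = "7827".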
I have other slow queries as well, but at this stage I would like to focus on the biggest delay. Server: Java 1.7.0_79, VM="-XX:MaxPermSize=512m -Xms3096m -Xmx3096m". The network layer between client and server is very fast.
P.S. Is there an undocumented way to log the full XQuery in the BaseX server logs? I've seen the -V option, but I don't use the standalone version; I start the server with "java -cp /usr/share/java/basex.jar org.basex.BaseXServer -d", which doesn't give me extra query info.
With kind regards, Menashè
On 02/03/2015 01:13 PM, Menashè Eliezer wrote: [...]
Hi, I've used ssh -X to produce the query info directly on the server machine. Please see the attachment. I hope it helps.
With kind regards, Menashè
On 06/22/2015 04:48 PM, Menashè Eliezer wrote:
Hi Christian, I'm have again performance problems. I have BaseX 8.2.1. As you may remember, you've recommended changing 'for $x in collection("CDI")' to 'for $x in collection("CDI")/gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification'. However, I've discovered I cannot specify XPath while working with IDs (db:node-pre). It's a multi-step process: client program sends to the server the search filter defined by end-user and get IDs. Then there are several queries for getting different information about this specific subset. Instead of redefining the filters, the only condition is where db:node-pre($x)=$ids for having a better performance. Once I specific XPath, it seems that the ids have no meaning. The resultset is always empty once they are being used. So, I've returned to use 'for $x in collection("CDI")' in the first query of getting all IDs, but the performance is extremely low.
**I'm attaching the query and its related info using BaseXGUI (local server) with much smaller database. The performance seems ok.
I'm using your BaseXClient.java, however I see the delay already in the BaseX server logs: QUERY[0] xquery version "3.0"; declare namespace queryName ='GetIDS'; declare namespace gco = "http://www.isotc211.org/2005/gco"; declare namespace gmd = "http://www.isotc211.org/2005/gmd"; declare namespace gml = "http://www.opengis.net/gml"; declare namespace gmx="http://www.isotc211.org/2005/gmx"; declare namespace sdn = "http://www.seadatanet.org"; dec lare namespace fn = "http://www.w3.org/2005/xpath-functions"; declare namespace xs = "http://www.w3.org/2001/XMLSchema"; declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization"; declare option output:method 'xml';declare option output:item-separator ","; let $db := db:open("CDI") for $x in $db where $x/gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification/gmd:extent/gmd:EX_Exte nt/gmd:geographicElement/gmd:EX_GeographicBoundingBox/gmd:westBoundLongitude/gco:Decimal>="-5.8447265625" and $x/gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification/gmd:extent/gmd:EX_Extent/gmd :geographicElement... 0.17 ms 110 16:36:09.713 192.168.155.30:39211 admin OK RESULTS[0] 25957.11 ms
Then I have other slow queries, but I would like to focus in this phase on the biggest delay. Server: Java 1.7.0_79, VM="-XX:MaxPermSize=512m -Xms3096m -Xmx3096m" The network layer between client and server is very fast.
P.S. Id there an undocumented way to log the full xquery in BaseX server logs? I've seen the -V option, but I don't use the standalone version, but: java -cp /usr/share/java/basex.jar org.basex.BaseXServer -d doesn't give me extra query info.
With kind regards, Menashè
On 02/03/2015 01:13 PM, Menashè Eliezer wrote:
Hi Christian,
Thank you! The performance arrives to 0.5 sec!
The biggest improvement is related to the query rephrasing you've suggested. Then the latest snapshot also helps a lot! You may want to know that in the log of the latest snapshot I see applying attribute index for "7827" which is not clear to the user, instead of BaseX80-20150130.124009 which has also used indexing: applying attribute index for ("ALKY", "AYMD")
I'm attaching the first and the second launch of the query using BaseXGUI. Relaunching the same query reduces the time from over 1 second to 0.5 second. Some data: BaseX80-20150130.124009 Total Time: 30676.02 ms After using "for $x in collection("ALL-CDIS")/gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification": Total Time: 5456.74 ms applying attribute index for ("ALKY", "AYMD") in log. Second launch: 1333.71 ms Latest snapshot (BaseX80-20150202.121033): 1st: Total Time: 1873.02 ms 2nd: Total Time: 548.62 ms
With kind regards, Menashè
On 02/02/2015 02:02 PM, Menashè Eliezer wrote:
Hi Christian,
Thank you very much! Unfortunately I'll be at the office only tomorrow.
Menashè
On Sat, 31 Jan 2015 16:42:32 +0100, Christian Grün christian.gruen@gmail.com wrote:
Hi Menashè,
With the latest snapshot [1], your original query should now be rewritten for index access as well. Looking forward to your tests,
Christian
PS: In terms of performance, it may still be worthwhile to move redundant paths to the for clause; but just try and see.
[1] http://files.basex.org/releases/latest/
On Fri, Jan 30, 2015 at 9:49 PM, Christian Grün christian.gruen@gmail.com wrote:
Hi Menashè,
Should I expect to see the usage of an index for each of the where
phrases?
Usually, only one predicate will be rewritten for index access, and the remaining conditions will be answered sequentially.
Have a nice weekend!
Enjoy, Christian
Menashè
On Fri, 30 Jan 2015 18:11:59 +0100, Christian Grün christian.gruen@gmail.com wrote:
> Hi Menashè,
>
> Thanks for the XML samples you sent me in private. I noticed that the index rewritings will only be triggered if you formulate your query as follows:
>
> OLD:
> for $x in collection("ALL-CDIS")
> where $x/gmd:MD_Metadata/gmd:identificationInfo/...
> return ...
>
> NEW:
> for $x in collection("ALL-CDIS")/gmd:MD_Metadata
> where $x/gmd:identificationInfo/...
> return ...
>
> It's difficult to explain in short sentences why Variant 1 cannot be optimized that straightforward (basically, it's quite a different pattern to look for), but I'll check out if we can extend our matcher to also support these kinds of queries.
>
> So, if possible, I would recommend you for now (and at least for testing) to move the root element test after the collection() function. I noticed that the first three child steps are the same in all of your conditions:
>
> gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification
>
> If that will always be the case, it surely makes sense to move all of them to the "for" clause.
>
> Looking forward to your updated performance tests,
> Christian
> _______________________________
>
> On Fri, Jan 30, 2015 at 5:55 PM, Christian Grün christian.gruen@gmail.com wrote:
>> Could you possibly provide me with a small snapshot of your data sources (one, two documents might be sufficient)?
>>
>> On Fri, Jan 30, 2015 at 5:52 PM, Menashè Eliezer meliezer@ogs.trieste.it wrote:
>>> Almost the same speed with version 8.0.
>>> No indexing (no "applying" in the query info).
>>> As I've attached before, indexes are active for this DB.
>>>
>>> With kind regards,
>>> Menashè
>>>
>>> On 01/30/2015 05:31 PM, Christian Grün wrote:
>>>> It's indeed interesting that your query does not use any of the existing index structures (if they did, you would find strings like "applying text index" or "applying attribute index" in the query info). Maybe/hopefully things look different with Version 8.0.
>>>>
>>>> On Fri, Jan 30, 2015 at 5:26 PM, Menashè Eliezer meliezer@ogs.trieste.it wrote:
>>>>> On 01/30/2015 05:18 PM, Christian Grün wrote:
>>>>>>> /gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification/gmd:descriptiveKeywords[1]/gmd:MD_Keywords/gmd:keyword[2]/sdn:SDN_ParameterDiscoveryCode/@codeListValue
>>>>>>> How can I remove *?
>>>>>> Simply remove the predicate; a[*]/b is the same as a/b.
>>>>> Maybe I wasn't clear. The actual number appears in the xml file, e.g.,
>>>>> gmd:descriptiveKeywords[1]
>>>>> Anyway, I've removed all [*] and I get the same correct result, however the processing time is doubled...
>>>>>>
>>>>>>>> * In some cases, if you know that an element name is distinct, you can get rid of all the explicit child steps and directly address the node via the descendant axis.
>>>>>>> Thanks, but it's not relevant in my case.
>>>>>> Is it because the element names are not distinct? Or is it because your input form allows users to choose arbitrary paths for arbitrary documents?
>>>>> The element names are not distinct.
>>>>>
>>>>>>> Sure, I'll also try BaseX 8.0 and compare. Should I recreate the db importing the xml files for testing the improved indexing?
>>>>>> We have actually improved support for collections, but the database format itself has not changed, so it shouldn't make a difference in your case.
>>>>>>
>>>>>> Christian
>>>>>>
>>>>>>>> [1] http://files.basex.org/releases/latest
>>>>>>>>
>>>>>>>> On Fri, Jan 30, 2015 at 3:55 PM, Menashè Eliezer meliezer@ogs.trieste.it wrote:
>>>>>>>>> Hello,
>>>>>>>>> I wonder if the attached query can be optimised. I'm attaching all relevant information.
>>>>>>>>> BaseX 7.9, Debian, powerful server.
>>>>>>>>> This is just an example. The queries will be built based on a compilation of a search form.
>>>>>>>>> Any help would be appreciated.
>>>>>>>>> 40 seconds are not acceptable.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> With kind regards,
>>>>>>>>> Menashè
>>>>>>>>>
>>>>>>> --
>>>>>>> With kind regards,
>>>>>>> Menashè
>>>>>>>
>>>>> With kind regards,
>>>>> Menashè
>>>>>
Menashè
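As a hedged illustration of the rewrite quoted above, the sketch below moves the three shared child steps into the for clause and leaves only the remaining path in the where clause. The namespace URIs are the ones that appear in the query log later in this thread; the keyword path and the values "ALKY" and "AYMD" are assumptions pieced together from the thread, not the original attached query.

```xquery
xquery version "3.0";
declare namespace gmd = "http://www.isotc211.org/2005/gmd";
declare namespace sdn = "http://www.seadatanet.org";

(: The three steps shared by all conditions are part of the for clause,
   so $x is bound to sdn:SDN_DataIdentification elements. :)
for $x in collection("ALL-CDIS")
  /gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification
(: Assumed condition: a string comparison on @codeListValue. :)
where $x/gmd:descriptiveKeywords/gmd:MD_Keywords/gmd:keyword
        /sdn:SDN_ParameterDiscoveryCode/@codeListValue = ("ALKY", "AYMD")
return db:node-pre($x)
```

Whether an index was actually chosen can be checked in the query info, which would then contain a string such as "applying attribute index".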
Hi Menashè,
> QUERY[0] xquery version "3.0"; declare namespace queryName ='GetIDS'; declare namespace gco = "http://www.isotc211.org/2005/gco"; declare [...]
It would be great if you could help us and simplify the query, such that we can have a look at the core issue.
> Is there an undocumented way to log the full XQuery in the BaseX server logs?
The maximum size of log entries can be adjusted via the option LOGMSGMAXLEN [1].
Cheers, Christian
[1] http://docs.basex.org/wiki/Options#LOGMSGMAXLEN
> I've seen the -V option, but I don't use the standalone version, and java -cp /usr/share/java/basex.jar org.basex.BaseXServer -d doesn't give me extra query info.
With kind regards, Menashè
On 02/03/2015 01:13 PM, Menashè Eliezer wrote:
Hi Christian,
Thank you! The query time is now down to 0.5 seconds!
The biggest improvement comes from the query rephrasing you suggested. The latest snapshot also helps a lot. You may want to know that the log of the latest snapshot shows: applying attribute index for "7827", which is not clear to the user, whereas BaseX80-20150130.124009, which also used an index, showed: applying attribute index for ("ALKY", "AYMD")
I'm attaching the first and the second launch of the query using BaseXGUI. Relaunching the same query reduces the time from over 1 second to 0.5 seconds. Some data:
BaseX80-20150130.124009:
- Total Time: 30676.02 ms
- After using "for $x in collection("ALL-CDIS")/gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification": Total Time: 5456.74 ms (applying attribute index for ("ALKY", "AYMD") in log)
- Second launch: 1333.71 ms
Latest snapshot (BaseX80-20150202.121033):
- 1st: Total Time: 1873.02 ms
- 2nd: Total Time: 548.62 ms
With kind regards, Menashè
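For a quick sense of scale, the reported totals can be turned into speedup factors relative to the 30676.02 ms baseline. This is only a small arithmetic sketch over the numbers quoted above; it mixes first and second launches, so caching effects are included in the ratios.

```xquery
xquery version "3.0";
(: Total times in ms, copied from the message above. :)
let $baseline := 30676.02
for $t in (5456.74, 1333.71, 1873.02, 548.62)
(: round#2 keeps one decimal place of the ratio. :)
return concat($t, " ms: factor ", round($baseline div $t, 1))
```

The rephrased query alone comes out at a factor of about 5.6, and the second launch on the latest snapshot at roughly 56.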
On 02/02/2015 02:02 PM, Menashè Eliezer wrote:
Hi Christian,
Thank you very much! Unfortunately, I'll only be back in the office tomorrow.
Menashè
On Sat, 31 Jan 2015 16:42:32 +0100, Christian Grün christian.gruen@gmail.com wrote:
Hi Menashè,
With the latest snapshot [1], your original query should now be rewritten for index access as well. Looking forward to your tests,
Christian
PS: In terms of performance, it may still be worthwhile to move redundant paths to the for clause; but just try and see.
[1] http://files.basex.org/releases/latest/
On Fri, Jan 30, 2015 at 9:49 PM, Christian Grün christian.gruen@gmail.com wrote:
Hi Menashè,
> Should I expect to see the usage of an index for each of the where phrases?
Usually, only one predicate will be rewritten for index access, and the remaining conditions will be answered sequentially.
> Have a nice weekend!
Enjoy, Christian
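To make the remark about a single rewritten predicate more concrete, here is a rough hand-written equivalent of what such a rewrite amounts to: one condition is answered through the attribute index, and the other is tested sequentially on each hit. db:attribute is the BaseX function for explicit attribute index access; the attribute name, the value "7827", and the bounding-box condition are assumptions chosen from values mentioned in this thread, and the optimizer's internal rewrite may look different.

```xquery
xquery version "3.0";
declare namespace gco = "http://www.isotc211.org/2005/gco";
declare namespace gmd = "http://www.isotc211.org/2005/gmd";
declare namespace sdn = "http://www.seadatanet.org";

distinct-values(
  (: Step 1: one condition is answered by the attribute index. :)
  for $attr in db:attribute("ALL-CDIS", "7827", "codeListValue")
  (: Walk up from the index hit to the element the other tests start from. :)
  let $x := $attr/ancestor::sdn:SDN_DataIdentification
  (: Step 2: the remaining condition is checked sequentially per hit.
     Compared as a number here, which is an assumption. :)
  where $x/gmd:extent/gmd:EX_Extent/gmd:geographicElement
          /gmd:EX_GeographicBoundingBox/gmd:westBoundLongitude
          /gco:Decimal >= -5.8447265625
  return db:node-pre($x)
)
```

The more selective the indexed condition is, the fewer nodes remain for the sequential tests.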
Hi Christian, Even when I keep only the first filter and run it on its own, it takes more than 8 seconds:
Result:
- Hit(s): 250000 Items
- Updated: 0 Items
- Printed: 2048 KB
- Read Locking: local [CDI]
- Write Locking: none
Timing:
- Parsing: 2.0 ms
- Compiling: 107.74 ms
- Evaluating: 8085.55 ms
- Printing: 106.4 ms
- Total Time: 8301.69 ms
With kind regards, Menashè
Hi Christian, I'm having performance problems again. I have BaseX 8.2.2. With exactly the same query as below and a recreated (indexed) database, no index is used any more, whereas in the previous post one index was being used. I don't know why. I'm attaching the latest query and the related query plan (produced locally, using the client GUI). I hope you can help.
With kind regards, Menashè
On 06/22/2015 05:11 PM, Menashè Eliezer wrote:
Hi, I've used ssh -X to produce the query info directly on the server machine. Please see attached. I hope it helps.
With kind regards, Menashè
On 06/22/2015 04:48 PM, Menashè Eliezer wrote:
Hi Christian, I'm having performance problems again. I have BaseX 8.2.1. As you may remember, you recommended changing 'for $x in collection("CDI")' to 'for $x in collection("CDI")/gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification'. However, I've discovered that I cannot specify an XPath while working with IDs (db:node-pre). It's a multi-step process: the client program sends the search filter defined by the end user to the server and gets IDs back. Then there are several queries for getting different information about this specific subset. Instead of redefining the filters, the only condition is where db:node-pre($x)=$ids, for better performance. Once I specify an XPath, it seems that the IDs have no meaning: the result set is always empty once they are used. So I've gone back to using 'for $x in collection("CDI")' in the first query, which gets all IDs, but the performance is extremely poor.
I'm attaching the query and its related info from BaseXGUI (local server) with a much smaller database. The performance seems OK.
I'm using your BaseXClient.java, however I see the delay already in the BaseX server logs:
QUERY[0] xquery version "3.0"; declare namespace queryName ='GetIDS'; declare namespace gco = "http://www.isotc211.org/2005/gco"; declare namespace gmd = "http://www.isotc211.org/2005/gmd"; declare namespace gml = "http://www.opengis.net/gml"; declare namespace gmx="http://www.isotc211.org/2005/gmx"; declare namespace sdn = "http://www.seadatanet.org"; declare namespace fn = "http://www.w3.org/2005/xpath-functions"; declare namespace xs = "http://www.w3.org/2001/XMLSchema"; declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization"; declare option output:method 'xml';declare option output:item-separator ","; let $db := db:open("CDI") for $x in $db where $x/gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification/gmd:extent/gmd:EX_Extent/gmd:geographicElement/gmd:EX_GeographicBoundingBox/gmd:westBoundLongitude/gco:Decimal>="-5.8447265625" and $x/gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification/gmd:extent/gmd:EX_Extent/gmd:geographicElement... 0.17 ms 110 16:36:09.713 192.168.155.30:39211 admin OK RESULTS[0] 25957.11 ms
Then I have other slow queries, but I would like to focus in this phase on the biggest delay. Server: Java 1.7.0_79, VM="-XX:MaxPermSize=512m -Xms3096m -Xmx3096m" The network layer between client and server is very fast.
P.S. Is there an undocumented way to log the full XQuery in the BaseX server logs? I've seen the -V option, but I don't use the standalone version, and java -cp /usr/share/java/basex.jar org.basex.BaseXServer -d doesn't give me extra query info.
With kind regards, Menashè
What was the last version it was working with?
8.2.1. Not really working, but better...
I ran the attached query with 8.2.1, and no index was used either. Are you sure you sent me the correct query?
Sorry for confronting you with all those questions, but to help you, I really need your help as well. Could you check the attached files again and give me some hints on how to proceed?
I think I've already mentioned that the new query is different. The reference to 8.2.1 is in this earlier message, where the old query can also be found: https://www.mail-archive.com/basex-talk%40mailman.uni-konstanz.de/msg06544.h...
With kind regards, Menashè
Hi Menashè,
I am not sure if I can propose any way out, because there are too many factors that would need to be looked at right now (automatically composed queries, no node ids, gigabytes of data, ...).
So let's maybe go back to your original observation:
> Once I specify an XPath, it seems that the IDs have no meaning. The result set is always empty once they are used.
Can you give us more details on that? Can you give me a *simple* example that worked, and another example that does not work anymore for you? Did you think about using db:node-pre($db/root()) ?
I attached two simplified versions of your query. How do they perform on your database instance? Maybe we should first try to get these queries fast before looking at more complex examples (even if these are the ones that are composed in practice).
Christian
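One possible reading of the suggestion above: once the for clause binds $x to an element below the document node, db:node-pre($x) no longer equals the pre value of the document itself, which would explain the empty result. The sketch below assumes the IDs were collected from document nodes; the variable $ids and its sample values are hypothetical.

```xquery
xquery version "3.0";
declare namespace gmd = "http://www.isotc211.org/2005/gmd";
declare namespace sdn = "http://www.seadatanet.org";

(: Hypothetical IDs returned by an earlier query as document-level pre values. :)
declare variable $ids as xs:integer* external := (0, 57);

for $x in collection("CDI")
  /gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification
(: Compare the pre value of the document node, not of $x itself. :)
where db:node-pre(root($x)) = $ids
return $x
```

An alternative that avoids scanning the collection would be to open the stored nodes directly, for example with db:open-pre("CDI", $id) for each $id in $ids. Pre values can change when a database is updated, so either variant presumes that no updates happen between the queries.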
Hi Christian,
> Did you think about using db:node-pre($db/root()) ?
Actually, no. Now that I'm using it, everything works; only the performance problems remain. Please see the attached log files.
Is there an option to make BaseX parse only the part of the imported XML files below a specific XPath (or at least to avoid useless indexing of irrelevant components)? I don't need the rest of the XML files, even though it's not too big. Maybe it can help.
With kind regards, Menashè
On 06/23/2015 12:10 PM, Christian Grün wrote:
Hi Menashè,
I am not sure if I can propose any way out, because there are too many factors that would need to be looked at right now (automatically composed queries, no node ids, gigabytes of data, ...).
So let's maybe go back to your original observation:
Once I specify an XPath, it seems that the IDs have no meaning. The result set is always empty once they are being used.
Can you give us more details on that? Can you give me a *simple* example that worked, and another example that does not work anymore for you? Did you think about using db:node-pre($db/root()) ?
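One possible reading of the empty result set, sketched here under assumptions: with 'for $x in collection("CDI")', $x is a document node, so db:node-pre($x) returns the pre value of the document. Once the loop iterates over .../sdn:SDN_DataIdentification, $x is an element with a different pre value, so comparing it with the stored document IDs may never match. The sketch below compares the pre value of the document root instead. The external variable $ids is a hypothetical name, and the sketch assumes the IDs came from document nodes and that the database was not updated in between.

```xquery
declare namespace gmd = "http://www.isotc211.org/2005/gmd";
declare namespace sdn = "http://www.seadatanet.org";

(: Assumption: pre values of document nodes, as returned by the first query.
   The variable name is hypothetical. :)
declare variable $ids as xs:integer* external := ();

for $x in db:open("CDI")/gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification
(: Compare the pre value of the document root, not of the element itself. :)
where db:node-pre(root($x)) = $ids
return $x
```

If the IDs are already known, opening each document directly with db:open-pre("CDI", $id) and then following the path may avoid scanning the whole collection; whether it is faster on this database would need to be measured.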
I attached two simplified versions of your query. How do they perform on your database instance? Maybe we should first try to get these queries fast before looking at more complex examples (even if these are the ones that are composed in practice).
Christian
On Mon, Jun 22, 2015 at 4:48 PM, Menashè Eliezer meliezer@ogs.trieste.it wrote:
Hi Christian, I'm having performance problems again. I have BaseX 8.2.1. As you may remember, you recommended changing 'for $x in collection("CDI")' to 'for $x in collection("CDI")/gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification'. However, I've discovered that I cannot specify an XPath while working with IDs (db:node-pre). It's a multi-step process: the client program sends the search filter defined by the end user to the server and gets the IDs. Then there are several queries for getting different information about this specific subset. Instead of redefining the filters, the only condition is 'where db:node-pre($x)=$ids', for better performance. Once I specify an XPath, it seems that the IDs have no meaning. The result set is always empty once they are being used. So I've gone back to using 'for $x in collection("CDI")' in the first query, which gets all IDs, but the performance is extremely low.
I'm attaching the query and its related info using BaseXGUI (local server) with a much smaller database. The performance seems OK.
Is there an option to ask BaseX to parse only a part of the imported xml files under a specific xpath, (or at least limit useless indexing of non relevant components)? I don't need the rest of the xml files, even though it's not too big. Maybe it can help.
The usual approach is to simply create another database that only contains the relevant parts of your document. This can directly be done in XQuery (using db:create, db:add, ...), or, if memory consumption is too high, by exporting and importing parts of your document.
Hope this helps, Christian
Thank you Christian, I may try it later as a last option. I hope you can find an alternative solution. Is there also an option to define inside the part only the xpaths which I would need? Otherwise, many elements and attributes which I don't need are being indexed.
Another question, how can I know if the following values have been exceeded in a specific database? Quoting:
MAXLEN
  Signature: MAXLEN [int]
  Default: 96
  Summary: Specifies the maximum length of strings that are to be indexed by the name, path, value, and full-text index structures. The value of this option will be assigned once to a new database, and cannot be changed after that.

MAXCATS
  Signature: MAXCATS [int]
  Default: 100
  Summary: Specifies the maximum number of distinct values (categories) that will be stored together with the element/attribute names or unique paths in the Name Index (http://docs.basex.org/wiki/Index#Name_Index) or Path Index (http://docs.basex.org/wiki/Index#Path_Index). The value of this option will be assigned once to a new database, and cannot be changed after that.
With kind regards, Menashè
Is there also an option to define inside the part only the xpaths which I would need?
I guess no, but to be honest, I am not exactly sure what you mean? Would you like to restrict indexing to specific parts of the document? In that case, you'll have to wait for someone implementing [1] (contributors are always welcome)…
Another question, how can I know if the following values have been exceeded in a specific database? Quoting:
MAXLEN
You will know by looking at your data. Just write a query that returns the maximum string lengths of all distinct paths.
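A minimal sketch of such a query, assuming the database is called "CDI" and the default MAXLEN of 96 from the quoted documentation. It only looks at text nodes (attributes would need the same treatment), and string-length counts characters, which may differ from how BaseX measures the length internally for non-ASCII text.

```xquery
(: MAXLEN default, taken from the documentation quoted above. :)
declare variable $MAXLEN := 96;

for $text in db:open("CDI")//text()
(: Build a simple path from the element names of all ancestors. :)
let $path := string-join($text/ancestor::*/name(), '/')
group by $path
(: After grouping, $text holds all text nodes that share this path. :)
let $max := max($text ! string-length(.))
where $max > $MAXLEN
order by $max descending
return $path || ': ' || $max
```

An empty result would suggest that no text value exceeds the limit.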
MAXCATS
You can e.g. use index:facets().
Hope this helps, Christian
Thank you Christian for the helpful reply.
With kind regards, Menashè
Hi Christian,
The usual approach is to simply create another database that only contains the relevant parts of your document. This can directly be done in XQuery (using db:create, db:add, ...), or, if memory consumption is too high, by exporting and importing parts of your document.
I couldn't find an option in db:add to specify an XPath. In my case, I need to extract only the elements under /gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification
With kind regards, Menashè
I couldn't find an option in db:add to specify an XPath. In my case, I need to extract only the elements under /gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification
We try to avoid XPath string arguments whenever possible. Instead, simply use XQuery, which allows you to do all kinds of things.
Example 1 (add one document per element):

  for $node at $pos in /gmd:MD_Metadata/.....
  return db:add('db', $node, $pos || '.xml')

Example 2 (add single document):

  db:add('db', element xml { /gmd:MD_Metadata/..... }, 'doc.xml')
Cheers, Christian
Hi Christian,
I've created a new database with only the relevant part of each XML document. It's much smaller and I hope it will help. The created XML is not valid since the xml and xml-model tags are missing, but that shouldn't be a problem. I've used map { 'stripns': true(), 'intparse': true() }) in db:add, but the namespaces were not removed, e.g. there is gml:beginPosition. Anyway, maybe because the xml are not valid, I get always 0 hits unless I ask to return the doc itself. This happens even with 'where db:node-id($ext)=0' or without any conditions, as soon as I ask to return $ext/sdn:SDN_DataIdentification instead of $ext (the XML doc).
With kind regards, Menashè
Hi Menashè,
I've used map { 'stripns': true(), 'intparse': true() }) in db:add, but the namespaces were not removed, e.g. there is gml:beginPosition.
True. I forgot to mention that the 'stripns' option (as all other XML parsing options [1]) only applies to newly parsed XML strings.
Anyway, maybe because the xml are not valid, I get always 0 hits unless I ask to return the doc itself.
Hm, the stored data must be well-formed, otherwise it couldn't be stored. And it will always be well-formed if you store it via db:add (in XML terminology, validity requires a schema [2]).
Did you already have a look at the data stored in your new database? Christian
[1] http://docs.basex.org/wiki/Options#STRIPNS
[2] https://en.wikipedia.org/wiki/XML#Schemas_and_validation
Hi Christian,
True. I forgot to mention that the 'stripns' option (as all other XML parsing options [1]) only applies to newly parsed XML strings.
But these strings belong to new documents being added using db:add. Anyway, how can I strip the namespaces in my new database? I don't need them.
Anyway, maybe because the xml are not valid, I get always 0 hits unless I ask to return the doc itself.
Hm, the stored data must be well-formed, otherwise it couldn't be stored. And it will always be well-formed if you store it via db:add (in XML terminology, validity requires a schema [2]).
Did you already have a look at the data stored in your new database? Christian
I've used db:add. The data is not well-formed. It's just like copy&paste of the relevant xml. No headers. This is how I've created it:
  declare namespace gco = "http://www.isotc211.org/2005/gco";
  declare namespace gmd = "http://www.isotc211.org/2005/gmd";
  declare namespace gml = "http://www.opengis.net/gml";
  declare namespace gmx = "http://www.isotc211.org/2005/gmx";
  declare namespace sdn = "http://www.seadatanet.org";
  declare namespace fn = "http://www.w3.org/2005/xpath-functions";
  declare namespace xs = "http://www.w3.org/2001/XMLSchema";

  let $db := db:open("ENTIRE-CDI","Vertical_profiles")
  for $x in $db/gmd:MD_Metadata/gmd:identificationInfo/sdn:SDN_DataIdentification
  let $id := string($x/gmd:citation/gmd:CI_Citation/gmd:alternateTitle/gco:CharacterString)
  return db:add("CDI", $x, 'Vertical_profiles/' || $id || '.xml',
    map { 'stripns': true(), 'intparse': true() })
With kind regards, Menashè
Anyway, how can I strip the namespaces in my new database? I don't need them.
Just create a new database from the input data with this option turned on.
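A hedged sketch of what this could look like in XQuery, assuming the original XML files are still available on disk. The directory, the new database name and the target path are hypothetical; 'stripns' is the option name already used in the db:add call discussed above.

```xquery
(: Hypothetical directory that holds the original XML files. :)
declare variable $INPUT := '/path/to/cdi-xml/';

(: The files are parsed from disk here, so the parsing options apply.
   'CDI-NONS' and 'Vertical_profiles' are illustrative names only. :)
db:create('CDI-NONS', $INPUT, 'Vertical_profiles', map { 'stripns': true() })
```

Because the namespaces are removed while the files are parsed, queries against the new database would have to address the elements without prefixes.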
I've used db:add. The data is not well-formed. It's just like copy&paste of the relevant xml. No headers.
If it's not well-formed, you can't store it in BaseX. If you can do so, it would be an error (and rather surprising to me ;).
Cheers, C.
Hi,
Just create a new database from the input data with this option turned on.
I expected db:add to do it. Not important.
If it's not well-formed, you can't store it in BaseX. If you can do so, it would be an error (and rather surprising to me ;).
Well, "not well-formed" describes the response to my query for getting the document, after I did what I described earlier. Maybe this is the reason that the new db does not 'function'. I'll try adding the header lines to each new XML document before using db:add.
Hello, Creating a database of partial XML documents had almost no effect. Therefore I've created a database with a very simple XML structure. I'm attaching an example (demo.xml).
BaseX version: 8.2.2
Number of documents: 374739
However, the attached query takes 4 seconds (attached simple_query.log). I don't know if that is considered normal performance, but my real query is different: I'm copying all the documents that match my query to a newly created temporary collection, to get faster processing for this subset (reporting, etc.).
Adding to db (remote Java client): 12 sec.
Optimising the db (remote Java client): 23 sec.
Both the Java client and the BaseX server are installed on powerful servers. Are these numbers normal? The attached results are based on a local client (using the BaseX GUI).
In the future, I will have many more documents to handle... Any ideas? I can also change the schema of my new XML.
As for the idea of creating a new temporary db, I'm checking an alternative: returning everything I need in one query, including reports, all in one XML document.
With kind regards, Menashè
Hi Menashè,
The attached log file is empty. Maybe it's sufficient if you provide us with the query and give us information on the query compilation (are any indexes used?).
C.
Hi Christian, oops, I'm sorry. It's attached. There are text and attribute indexes.
With kind regards, Menashè
oops, I'm sorry. It's attached. There are text and attribute indexes.
It may be slightly faster if you remove the explicit string() conversion:
  for $x in db:open("CDI")
  let $beginPosition := $x//startTime
  where $beginPosition >= "1889-01-01" and $beginPosition <= "2015-07-10"
  return db:node-pre($x)
But please note that BaseX provides no native range index, which would be a good fit for your longitude/latitude filter.
Hi,
On 07/14/2015 11:05 AM, Christian Grün wrote:
It may be slightly faster if you remove the explicit string() conversion
No, it's actually slower.
But please note that BaseX provides no native range index, which would be a good fit for your longitude/latitude filter.
Should geo:within of http://docs.basex.org/wiki/Geo_Module help?
Should geo:within of http://docs.basex.org/wiki/Geo_Module help?
The functions of the Geo Module don't use any index structures, so I am afraid they won't speed up the query.
One more idea: you could convert all latitudes and longitudes to strings with a fixed number of digits:
_____________________________________

(:~ Allowed range. :)
declare variable $RANGE := 999999;
(:~ Minimum latitude. :)
declare variable $LAT-MIN := -90;
(:~ Maximum latitude. :)
declare variable $LAT-MAX := 90;

(:~
 : Converts a double value to a normalized string value
 : with a fixed size of digits.
 : @param $num number to be converted
 : @param $min minimum allowed value
 : @param $max maximum allowed value
 : @return resulting value
 :)
declare function local:normalize(
  $num as xs:double,
  $min as xs:integer,
  $max as xs:integer
) as xs:string {
  let $norm := $RANGE * ($num - $min) div ($max - $min)
  return format-number($norm, '000000')
};

(: Run code for various latitude values :)
for $latitude in (-90, -89.9999, -13.345, 0, 89.99999)
return local:normalize($latitude, $LAT-MIN, $LAT-MAX)
_____________________________________

Next, you could do string comparisons on these values:

for $doc in db:open("CDI")
let $lat := $doc//latitude
let $lon := $doc//longitude
where $lat >= "883387" and $lat <= "893463"
  and $lon >= "173467" and $lon <= "178745"
return db:node-pre($doc)
It should be fast enough if the maximum value is not much bigger than the minimum value.
Hi, it sounds like a great idea, and I can also apply it to the date comparisons, but unfortunately the new query is much slower. Please see the attached log.
With kind regards, Menashè
...it only makes sense if you store the data in its normalized representation.
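A minimal sketch of storing the normalized representation, assuming the simplified documents contain latitude and longitude elements with non-empty numeric content (as in the query above). The longitude bounds of -180 and 180 are an assumption; only the latitude bounds were given in this thread.

```xquery
(:~ Allowed range, as in the normalization function proposed earlier. :)
declare variable $RANGE := 999999;

declare function local:normalize(
  $num as xs:double,
  $min as xs:integer,
  $max as xs:integer
) as xs:string {
  format-number($RANGE * ($num - $min) div ($max - $min), '000000')
};

for $node in db:open("CDI")//(latitude | longitude)
(: Assumed bounds: latitude -90..90, longitude -180..180. :)
let $bounds := if (name($node) = 'latitude') then (-90, 90) else (-180, 180)
(: Overwrite the stored value with its fixed-width string form. :)
return replace value of node $node
  with local:normalize(xs:double($node), $bounds[1], $bounds[2])
```

After such an update the value indexes probably need to be rebuilt (for example by optimizing the database) before string comparisons can benefit from them. The search bounds sent by the client would have to pass through the same function so that both sides use the same representation.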
:) I had planned to do it as a second step, but I should do it earlier. Thank you.
With kind regards, Menashè
basex-talk@mailman.uni-konstanz.de