Hi Christian,
Thanks for following up on this. Please use the attached XML files to create the CTGov and MeSH databases (the first contains just NCT00303472, while the second contains the definitions of the 4 MeSH terms referenced in the <condition_browse> section of this CT.gov trial). Also attached is the stopwords file (containing just 'syndrome'). I verified that the issue is reproducible with these minimal files.
Note: I enabled full text indexing for both databases (using SET FTINDEX true), in case it matters.
Looking forward to having this resolved.
Best, Ron
Ron Katriel, Ph.D. | Senior Data Scientist | Medidata Solutions 350 Hudson Street, 7th Floor, New York, NY 10014 rkatriel@mdsol.com | direct: +1 201 337 3622 | mobile: +1 201 675 5598 | main: +1 212 918 1800
On September 14, 2015 at 6:28:24 AM, Christian Grün (christian.gruen@gmail.com) wrote:
Hi Ron,
Sorry for the late reply and thanks for your bug report. I am pretty sure this is a bug -- but it's difficult to guess what's going wrong. Could you possibly point me to the XML source documents or, ideally, provide me with a small example to test?
Thanks, Christian
On Sun, Aug 30, 2015 at 5:56 PM, Ron Katriel rkatriel@mdsol.com wrote:
Hi,
I encountered a peculiar error with a query using a stopwords file in the context of a full text search. The query joins two XML databases: CT.gov (containing 86635503 nodes) and the 2015 MeSH dictionary (containing 12064461 nodes). I am debugging using CT.gov trial NCT00303472, hardcoded in the 'where' clause of the following query:
let $trees := db:open('MeSH')/DescriptorRecordSet/DescriptorRecord
for $article in db:open('CTGov')/clinical_study
where $article/id_info/nct_id = 'NCT00303472'
let $mesh := $article/condition_browse/mesh_term
let $tn1 := $trees[DescriptorName/String contains text { $mesh }]
let $tn2 := $trees[DescriptorName/String contains text { $mesh }
  using stop words at "/Volumes/Extra/Documents/Standards/MeSH/stopwords.txt"]/TreeNumberList/TreeNumber
return <match> { $article/id_info/nct_id, $mesh, $tn2 } </match>
When the return clause contains the variable $tn2 (i.e., using stopwords - as shown above) a Java NullPointerException is generated (see the stack trace below). However, when only $tn1 is returned there is no problem (the code for $tn2 is removed by the optimizer).
The issue is related to a specific stopword (“syndrome”). When the stopword is removed from the file, the exception does not occur. Surprisingly, when the stopword is capitalized (“Syndrome”), the issue does not occur, even though the target MeSH term in this CT.gov trial is also capitalized, that is
<mesh_term>Syndrome</mesh_term>
Am I doing something wrong, or is this a real bug in BaseX? If the former, please suggest a workaround as I would like to filter out generic MeSH terms that match the stopwords before any further processing (I removed a lot of code from the above query to make it easier to debug).
Thanks, Ron
Error: Improper use? Potential bug? Your feedback is welcome:
Contact: basex-talk@mailman.uni-konstanz.de
Version: BaseX 8.2
Java: Oracle Corporation, 1.8.0_20
OS: Mac OS X, x86_64
Stack Trace:
java.lang.NullPointerException
  at org.basex.query.expr.ft.FTWords$1.next(FTWords.java:166)
  at org.basex.query.expr.ft.FTIndexAccess$1.next(FTIndexAccess.java:48)
  at org.basex.query.expr.ft.FTIndexAccess$1.next(FTIndexAccess.java:45)
  at org.basex.query.iter.Iter.value(Iter.java:53)
  at org.basex.query.expr.ParseExpr.value(ParseExpr.java:67)
  at org.basex.query.QueryContext.value(QueryContext.java:421)
  at org.basex.query.expr.path.CachedPath.iter(CachedPath.java:41)
  at org.basex.query.expr.path.CachedPath.iter(CachedPath.java:22)
  at org.basex.query.QueryContext.iter(QueryContext.java:410)
  at org.basex.query.expr.List$1.next(List.java:133)
  at org.basex.query.expr.constr.Constr.add(Constr.java:70)
  at org.basex.query.expr.constr.CElem.item(CElem.java:92)
  at org.basex.query.expr.constr.CElem.item(CElem.java:23)
  at org.basex.query.expr.ParseExpr.iter(ParseExpr.java:43)
  at org.basex.query.expr.gflwor.GFLWOR$1.next(GFLWOR.java:99)
  at org.basex.query.MainModule$1.next(MainModule.java:114)
  at org.basex.query.QueryContext.cache(QueryContext.java:660)
  at org.basex.query.QueryProcessor.cache(QueryProcessor.java:103)
  at org.basex.core.cmd.AQuery.query(AQuery.java:83)
  at org.basex.core.cmd.XQuery.run(XQuery.java:22)
  at org.basex.core.Command.run(Command.java:398)
  at org.basex.core.Command.execute(Command.java:100)
  at org.basex.gui.GUI.exec(GUI.java:472)
  at org.basex.gui.GUI.access$400(GUI.java:43)
  at org.basex.gui.GUI$7.run(GUI.java:412)
Compiling:
- pre-evaluating db:open("MeSH")
- pre-evaluating db:open("CTGov")
- inlining $trees_0
- applying full-text index for { $mesh_2 } using language 'English'
- applying full-text index for { $mesh_2 } using language 'English'
- inlining $tn2_4
- removing variable $tn1_3
- applying text index for "NCT00303472"
- rewriting where clause(s)
Query:
let $trees := db:open('MeSH')/DescriptorRecordSet/DescriptorRecord
for $article in db:open('CTGov')/clinical_study
where $article/id_info/nct_id = 'NCT00303472'
let $mesh := $article/condition_browse/mesh_term
let $tn1 := $trees[DescriptorName/String contains text { $mesh }]
let $tn2 := $trees[DescriptorName/String contains text { $mesh }
  using stop words at "/Volumes/Extra/Documents/Standards/MeSH/stopwords.txt"]/TreeNumberList/TreeNumber
return <match> { $article/id_info/nct_id, $mesh, $tn2 } </match>
Optimized Query:
for $article_1 in db:text("CTGov", "NCT00303472")/parent::*:nct_id/parent::*:id_info/parent::*:clinical_study
let $mesh_2 := $article_1/*:condition_browse/*:mesh_term
return element match { (($article_1/*:id_info/*:nct_id, $mesh_2, ft:search("MeSH", { $mesh_2 } using language 'English')/parent::*:String/parent::*:DescriptorName/parent::*:DescriptorRecord/TreeNumberList/TreeNumber)) }
Query plan:
<QueryPlan compiled="true">
  <GFLWOR>
    <For>
      <Var name="$article" id="1"/>
      <IterPath>
        <ValueAccess data="CTGov" type="TEXT" name="*:nct_id">
          <Str value="NCT00303472" type="xs:string"/>
        </ValueAccess>
        <IterStep axis="parent" test="*:id_info"/>
        <IterStep axis="parent" test="*:clinical_study"/>
      </IterPath>
    </For>
    <Let>
      <Var name="$mesh" id="2"/>
      <IterPath>
        <VarRef>
          <Var name="$article" id="1"/>
        </VarRef>
        <IterStep axis="child" test="*:condition_browse"/>
        <IterStep axis="child" test="*:mesh_term"/>
      </IterPath>
    </Let>
    <CElem>
      <QNm value="match" type="xs:QName"/>
      <List>
        <IterPath>
          <VarRef>
            <Var name="$article" id="1"/>
          </VarRef>
          <IterStep axis="child" test="*:id_info"/>
          <IterStep axis="child" test="*:nct_id"/>
        </IterPath>
        <VarRef>
          <Var name="$mesh" id="2"/>
        </VarRef>
        <CachedPath>
          <FTIndexAccess data="MeSH">
            <FTWords>
              <VarRef>
                <Var name="$mesh" id="2"/>
              </VarRef>
            </FTWords>
          </FTIndexAccess>
          <IterStep axis="parent" test="*:String"/>
          <IterStep axis="parent" test="*:DescriptorName"/>
          <IterStep axis="parent" test="*:DescriptorRecord"/>
          <IterStep axis="child" test="TreeNumberList"/>
          <IterStep axis="child" test="TreeNumber"/>
        </CachedPath>
      </List>
    </CElem>
  </GFLWOR>
</QueryPlan>