On September 14, 2015 at 6:28:24 AM, Christian Grün (christian.gruen@gmail.com) wrote:
Hi Ron,
Sorry for the late reply and thanks for your bug report. I am pretty sure
this is a bug -- but it's difficult to guess what's going wrong. Could
you possibly point me to the XML source documents or ideally provide
me a small example to test?
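
As a rough idea of what such a test case could look like, here is a
minimal sketch. It is only an assumption about the shape of the data:
the element names are taken from your query, while the tree number and
the inline stop word list (used in place of the stopwords file) are
made up. The reported error passes through the full-text index, so the
fragment would probably have to be stored in a database with a
full-text index before it can show the same behavior.

```xquery
(: Hypothetical miniature of the MeSH data; the TreeNumber value is a placeholder. :)
let $trees :=
  <DescriptorRecordSet>
    <DescriptorRecord>
      <DescriptorName><String>Syndrome</String></DescriptorName>
      <TreeNumberList><TreeNumber>X00.000</TreeNumber></TreeNumberList>
    </DescriptorRecord>
  </DescriptorRecordSet>/DescriptorRecord
(: Stand-in for the mesh_term of the CT.gov trial. :)
let $mesh := <mesh_term>Syndrome</mesh_term>
(: An inline stop word list replaces the external stopwords file (assumption). :)
return $trees[
  DescriptorName/String contains text { $mesh } using stop words ('syndrome')
]/TreeNumberList/TreeNumber
```
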
Thanks,
Christian
On Sun, Aug 30, 2015 at 5:56 PM, Ron Katriel <rkatriel@mdsol.com> wrote:
> Hi,
>
> I encountered a peculiar error with a query using a stopwords file in the
> context of a full text search. The query joins two XML databases: CT.gov
> (containing 86635503 nodes) and the 2015 MeSH dictionary (containing
> 12064461 nodes). I am debugging using CT.gov trial NCT00303472, hardcoded in
> the 'where' clause of the following query:
>
> let $trees := db:open('MeSH')/DescriptorRecordSet/DescriptorRecord
> for $article in db:open('CTGov')/clinical_study
> where $article/id_info/nct_id = 'NCT00303472'
> let $mesh := $article/condition_browse/mesh_term
> let $tn1 := $trees[DescriptorName/String contains text { $mesh }]
> let $tn2 := $trees[DescriptorName/String contains text { $mesh }
>   using stop words at "/Volumes/Extra/Documents/Standards/MeSH/stopwords.txt"
> ]/TreeNumberList/TreeNumber
> return <match> { $article/id_info/nct_id, $mesh, $tn2 } </match>
>
> When the return clause contains the variable $tn2 (i.e., using stopwords -
> as shown above) a Java NullPointerException is generated (see the stack
> trace below). However, when only $tn1 is returned there is no problem (the
> code for $tn2 is removed by the optimizer).
>
> The issue is related to a specific stopword (“syndrome”). When the stopword
> is removed from the file the exception does not occur. Surprisingly, when
> the stopword is capitalized (“Syndrome”) the issue does not occur either, even
> though the target MeSH term in this CT.gov trial is also capitalized, that is
>
> <mesh_term>Syndrome</mesh_term>
>
> Am I doing something wrong, or is this a real bug in BaseX? If the former,
> please suggest a workaround as I would like to filter out generic MeSH terms
> that match the stopwords before any further processing (I removed a lot of
> code from the above query to make it easier to debug).
>
> Thanks,
> Ron
>
>
> Error:
> Improper use? Potential bug? Your feedback is welcome:
> Contact: basex-talk@mailman.uni-konstanz.de
> Version: BaseX 8.2
> Java: Oracle Corporation, 1.8.0_20
> OS: Mac OS X, x86_64
> Stack Trace:
> java.lang.NullPointerException
> at org.basex.query.expr.ft.FTWords$1.next(FTWords.java:166)
> at org.basex.query.expr.ft.FTIndexAccess$1.next(FTIndexAccess.java:48)
> at org.basex.query.expr.ft.FTIndexAccess$1.next(FTIndexAccess.java:45)
> at org.basex.query.iter.Iter.value(Iter.java:53)
> at org.basex.query.expr.ParseExpr.value(ParseExpr.java:67)
> at org.basex.query.QueryContext.value(QueryContext.java:421)
> at org.basex.query.expr.path.CachedPath.iter(CachedPath.java:41)
> at org.basex.query.expr.path.CachedPath.iter(CachedPath.java:22)
> at org.basex.query.QueryContext.iter(QueryContext.java:410)
> at org.basex.query.expr.List$1.next(List.java:133)
> at org.basex.query.expr.constr.Constr.add(Constr.java:70)
> at org.basex.query.expr.constr.CElem.item(CElem.java:92)
> at org.basex.query.expr.constr.CElem.item(CElem.java:23)
> at org.basex.query.expr.ParseExpr.iter(ParseExpr.java:43)
> at org.basex.query.expr.gflwor.GFLWOR$1.next(GFLWOR.java:99)
> at org.basex.query.MainModule$1.next(MainModule.java:114)
> at org.basex.query.QueryContext.cache(QueryContext.java:660)
> at org.basex.query.QueryProcessor.cache(QueryProcessor.java:103)
> at org.basex.core.cmd.AQuery.query(AQuery.java:83)
> at org.basex.core.cmd.XQuery.run(XQuery.java:22)
> at org.basex.core.Command.run(Command.java:398)
> at org.basex.core.Command.execute(Command.java:100)
> at org.basex.gui.GUI.exec(GUI.java:472)
> at org.basex.gui.GUI.access$400(GUI.java:43)
> at org.basex.gui.GUI$7.run(GUI.java:412)
> Compiling:
> - pre-evaluating db:open("MeSH")
> - pre-evaluating db:open("CTGov")
> - inlining $trees_0
> - applying full-text index for { $mesh_2 } using language 'English'
> - applying full-text index for { $mesh_2 } using language 'English'
> - inlining $tn2_4
> - removing variable $tn1_3
> - applying text index for "NCT00303472"
> - rewriting where clause(s)
> Query:
> let $trees := db:open('MeSH')/DescriptorRecordSet/DescriptorRecord for
> $article in db:open('CTGov')/clinical_study where $article/id_info/nct_id =
> 'NCT00303472' let $mesh := $article/condition_browse/mesh_term let $tn1 :=
> $trees[DescriptorName/String contains text { $mesh }] let $tn2 :=
> $trees[DescriptorName/String contains text { $mesh } using stop words at
> "/Volumes/Extra/Documents/Standards/MeSH/stopwords.txt"]/TreeNumberList/TreeNumber
> return <match> { $article/id_info/nct_id, $mesh, $tn2 } </match>
> Optimized Query:
> for $article_1 in db:text("CTGov",
> "NCT00303472")/parent::*:nct_id/parent::*:id_info/parent::*:clinical_study
> let $mesh_2 := $article_1/*:condition_browse/*:mesh_term return element
> match { (($article_1/*:id_info/*:nct_id, $mesh_2, ft:search("MeSH", {
> $mesh_2 } using language
> 'English')/parent::*:String/parent::*:DescriptorName/parent::*:DescriptorRecord/TreeNumberList/TreeNumber))
> }
> Query plan:
> <QueryPlan compiled="true">
> <GFLWOR>
> <For>
> <Var name="$article" id="1"/>
> <IterPath>
> <ValueAccess data="CTGov" type="TEXT" name="*:nct_id">
> <Str value="NCT00303472" type="xs:string"/>
> </ValueAccess>
> <IterStep axis="parent" test="*:id_info"/>
> <IterStep axis="parent" test="*:clinical_study"/>
> </IterPath>
> </For>
> <Let>
> <Var name="$mesh" id="2"/>
> <IterPath>
> <VarRef>
> <Var name="$article" id="1"/>
> </VarRef>
> <IterStep axis="child" test="*:condition_browse"/>
> <IterStep axis="child" test="*:mesh_term"/>
> </IterPath>
> </Let>
> <CElem>
> <QNm value="match" type="xs:QName"/>
> <List>
> <IterPath>
> <VarRef>
> <Var name="$article" id="1"/>
> </VarRef>
> <IterStep axis="child" test="*:id_info"/>
> <IterStep axis="child" test="*:nct_id"/>
> </IterPath>
> <VarRef>
> <Var name="$mesh" id="2"/>
> </VarRef>
> <CachedPath>
> <FTIndexAccess data="MeSH">
> <FTWords>
> <VarRef>
> <Var name="$mesh" id="2"/>
> </VarRef>
> </FTWords>
> </FTIndexAccess>
> <IterStep axis="parent" test="*:String"/>
> <IterStep axis="parent" test="*:DescriptorName"/>
> <IterStep axis="parent" test="*:DescriptorRecord"/>
> <IterStep axis="child" test="TreeNumberList"/>
> <IterStep axis="child" test="TreeNumber"/>
> </CachedPath>
> </List>
> </CElem>
> </GFLWOR>
> </QueryPlan>
>
>
> Ron Katriel, Ph.D. | Senior Data Scientist | Medidata Solutions
> 350 Hudson Street, 7th Floor, New York, NY 10014
> rkatriel@mdsol.com | direct: +1 201 337 3622 | mobile: +1 201 675 5598 |
> main: +1 212 918 1800
>