Hi Christian,

Thanks for following up on this. Please use the attached XML files to create the CTGov and MeSH databases (the first contains just NCT00303472, while the second contains the definitions of the four MeSH terms referenced in the <condition_browse> section of this CT.gov trial). Also attached is the stopwords file (containing just 'syndrome'). I verified that the issue is reproducible with these minimal files.

Note: I enabled full-text indexing for both databases (using SET FTINDEX true) before creating them, in case it matters.
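
For reference, the setup can be sketched roughly as the BaseX commands below. This is only a sketch: the two input file names are placeholders, not the actual names of the attachments, and the option has to be set before the databases are created so that the full-text index is built.

```
SET FTINDEX true
CREATE DB CTGov ctgov-NCT00303472.xml
CREATE DB MeSH mesh-terms.xml
```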

Looking forward to having this resolved.

Best,
Ron


Ron Katriel, Ph.D. | Senior Data Scientist | Medidata Solutions
350 Hudson Street, 7th Floor, New York, NY 10014
rkatriel@mdsol.com | direct: +1 201 337 3622 | mobile: +1 201 675 5598 | main: +1 212 918 1800


On September 14, 2015 at 6:28:24 AM, Christian Grün (christian.gruen@gmail.com) wrote:

Hi Ron,

Sorry for the late reply and thanks for your bug report. I am pretty sure
this is a bug -- but it's difficult to guess what's going wrong. Could
you possibly point me to the XML source documents or ideally provide
me a small example to test?

Thanks,
Christian


On Sun, Aug 30, 2015 at 5:56 PM, Ron Katriel <rkatriel@mdsol.com> wrote:
> Hi,
>
> I encountered a peculiar error with a query using a stopwords file in the
> context of a full text search. The query joins two XML databases: CT.gov
> (containing 86635503 nodes) and the 2015 MeSH dictionary (containing
> 12064461 nodes). I am debugging using CT.gov trial NCT00303472, hardcoded in
> the 'where' clause of the following query:
>
> let $trees := db:open('MeSH')/DescriptorRecordSet/DescriptorRecord
> for $article in db:open('CTGov')/clinical_study
> where $article/id_info/nct_id = 'NCT00303472'
> let $mesh := $article/condition_browse/mesh_term
> let $tn1 := $trees[DescriptorName/String contains text { $mesh }]
> let $tn2 := $trees[DescriptorName/String contains text { $mesh }
>   using stop words at
>   "/Volumes/Extra/Documents/Standards/MeSH/stopwords.txt"]/TreeNumberList/TreeNumber
> return <match> { $article/id_info/nct_id, $mesh, $tn2 } </match>
>
> When the return clause contains the variable $tn2 (i.e., using stopwords -
> as shown above) a Java NullPointerException is generated (see the stack
> trace below). However, when only $tn1 is returned there is no problem (the
> code for $tn2 is removed by the optimizer).
>
> The issue is related to a specific stopword (“syndrome”). When the stopword
> is removed from the file, the exception does not occur. Surprisingly, when
> the stopword is capitalized (“Syndrome”), the exception does not occur either,
> even though the target MeSH term in this CT.gov trial is also capitalized:
>
> <mesh_term>Syndrome</mesh_term>
>
> Am I doing something wrong, or is this a real bug in BaseX? If the former,
> please suggest a workaround as I would like to filter out generic MeSH terms
> that match the stopwords before any further processing (I removed a lot of
> code from the above query to make it easier to debug).
>
> Thanks,
> Ron
>
>
> Error:
> Improper use? Potential bug? Your feedback is welcome:
> Contact: basex-talk@mailman.uni-konstanz.de
> Version: BaseX 8.2
> Java: Oracle Corporation, 1.8.0_20
> OS: Mac OS X, x86_64
> Stack Trace:
> java.lang.NullPointerException
> at org.basex.query.expr.ft.FTWords$1.next(FTWords.java:166)
> at org.basex.query.expr.ft.FTIndexAccess$1.next(FTIndexAccess.java:48)
> at org.basex.query.expr.ft.FTIndexAccess$1.next(FTIndexAccess.java:45)
> at org.basex.query.iter.Iter.value(Iter.java:53)
> at org.basex.query.expr.ParseExpr.value(ParseExpr.java:67)
> at org.basex.query.QueryContext.value(QueryContext.java:421)
> at org.basex.query.expr.path.CachedPath.iter(CachedPath.java:41)
> at org.basex.query.expr.path.CachedPath.iter(CachedPath.java:22)
> at org.basex.query.QueryContext.iter(QueryContext.java:410)
> at org.basex.query.expr.List$1.next(List.java:133)
> at org.basex.query.expr.constr.Constr.add(Constr.java:70)
> at org.basex.query.expr.constr.CElem.item(CElem.java:92)
> at org.basex.query.expr.constr.CElem.item(CElem.java:23)
> at org.basex.query.expr.ParseExpr.iter(ParseExpr.java:43)
> at org.basex.query.expr.gflwor.GFLWOR$1.next(GFLWOR.java:99)
> at org.basex.query.MainModule$1.next(MainModule.java:114)
> at org.basex.query.QueryContext.cache(QueryContext.java:660)
> at org.basex.query.QueryProcessor.cache(QueryProcessor.java:103)
> at org.basex.core.cmd.AQuery.query(AQuery.java:83)
> at org.basex.core.cmd.XQuery.run(XQuery.java:22)
> at org.basex.core.Command.run(Command.java:398)
> at org.basex.core.Command.execute(Command.java:100)
> at org.basex.gui.GUI.exec(GUI.java:472)
> at org.basex.gui.GUI.access$400(GUI.java:43)
> at org.basex.gui.GUI$7.run(GUI.java:412)
> Compiling:
> - pre-evaluating db:open("MeSH")
> - pre-evaluating db:open("CTGov")
> - inlining $trees_0
> - applying full-text index for { $mesh_2 } using language 'English'
> - applying full-text index for { $mesh_2 } using language 'English'
> - inlining $tn2_4
> - removing variable $tn1_3
> - applying text index for "NCT00303472"
> - rewriting where clause(s)
> Query:
> let $trees := db:open('MeSH')/DescriptorRecordSet/DescriptorRecord for
> $article in db:open('CTGov')/clinical_study where $article/id_info/nct_id =
> 'NCT00303472' let $mesh := $article/condition_browse/mesh_term let $tn1 :=
> $trees[DescriptorName/String contains text { $mesh }] let $tn2 :=
> $trees[DescriptorName/String contains text { $mesh } using stop words at
> "/Volumes/Extra/Documents/Standards/MeSH/stopwords.txt"]/TreeNumberList/TreeNumber
> return <match> { $article/id_info/nct_id, $mesh, $tn2 } </match>
> Optimized Query:
> for $article_1 in db:text("CTGov",
> "NCT00303472")/parent::*:nct_id/parent::*:id_info/parent::*:clinical_study
> let $mesh_2 := $article_1/*:condition_browse/*:mesh_term return element
> match { (($article_1/*:id_info/*:nct_id, $mesh_2, ft:search("MeSH", {
> $mesh_2 } using language
> 'English')/parent::*:String/parent::*:DescriptorName/parent::*:DescriptorRecord/TreeNumberList/TreeNumber))
> }
> Query plan:
> <QueryPlan compiled="true">
> <GFLWOR>
> <For>
> <Var name="$article" id="1"/>
> <IterPath>
> <ValueAccess data="CTGov" type="TEXT" name="*:nct_id">
> <Str value="NCT00303472" type="xs:string"/>
> </ValueAccess>
> <IterStep axis="parent" test="*:id_info"/>
> <IterStep axis="parent" test="*:clinical_study"/>
> </IterPath>
> </For>
> <Let>
> <Var name="$mesh" id="2"/>
> <IterPath>
> <VarRef>
> <Var name="$article" id="1"/>
> </VarRef>
> <IterStep axis="child" test="*:condition_browse"/>
> <IterStep axis="child" test="*:mesh_term"/>
> </IterPath>
> </Let>
> <CElem>
> <QNm value="match" type="xs:QName"/>
> <List>
> <IterPath>
> <VarRef>
> <Var name="$article" id="1"/>
> </VarRef>
> <IterStep axis="child" test="*:id_info"/>
> <IterStep axis="child" test="*:nct_id"/>
> </IterPath>
> <VarRef>
> <Var name="$mesh" id="2"/>
> </VarRef>
> <CachedPath>
> <FTIndexAccess data="MeSH">
> <FTWords>
> <VarRef>
> <Var name="$mesh" id="2"/>
> </VarRef>
> </FTWords>
> </FTIndexAccess>
> <IterStep axis="parent" test="*:String"/>
> <IterStep axis="parent" test="*:DescriptorName"/>
> <IterStep axis="parent" test="*:DescriptorRecord"/>
> <IterStep axis="child" test="TreeNumberList"/>
> <IterStep axis="child" test="TreeNumber"/>
> </CachedPath>
> </List>
> </CElem>
> </GFLWOR>
> </QueryPlan>
>
>
> Ron Katriel, Ph.D. | Senior Data Scientist | Medidata Solutions
> 350 Hudson Street, 7th Floor, New York, NY 10014
> rkatriel@mdsol.com | direct: +1 201 337 3622 | mobile: +1 201 675 5598 |
> main: +1 212 918 1800
>