I've just installed BaseX and started using it. I just wonder if the
following query is possible with BaseX.
I have a corpus in an XML format, where <s> means 'sentence', <w> means 'word'.
<s n="6"><w c5="PNP" hw="she" pos="PRON">She </w><w c5="VVD" hw="say" pos="VERB">said </w><w c5="PNP" hw="she" pos="PRON">she </w><w c5="VVD" hw="go" pos="VERB">went </w> <w c5="TO0" hw="to" pos="PREP">to </w><w c5="VVI" hw="buy" pos="VERB">buy </w> <w c5="PNI" hw="something" pos="PRON">something </w><unclear/> <w c5="PNX" hw="herself" pos="PRON">herself</w><c c5="PUN">, </c>...</s>
If you create a BaseX database with the full-text option enabled, you can extract sentences in which word A and word B appear within a designated number of words. For example, the following query extracts the sentence above: //s[. contains text 'buy' ftand 'herself' window 5 words]
So this is my question: is there a way to extract all the sentences in which a word of a particular part of speech (for example, a verb) and another specific word appear within a designated number of words, for instance any verb and "herself" within a window of 5 words?
Thank you in advance.
Best, Sam, A.
Hi Sam,
sure; I have attached a possible solution below. It's not as compact as the XPath version, though.
Hope this helps,
Christian
_______________________________________

let $word := 'herself'
let $pos := 'VERB'
let $dist := 3
(: read from file/database
let $doc := doc("your-document.xml") :)
let $doc := document {
  <s n="6">
    <w c5="PNP" hw="she" pos="PRON">She</w>
    <w c5="VVD" hw="say" pos="VERB">said</w>
    <w c5="PNP" hw="she" pos="PRON">she</w>
    <w c5="VVD" hw="go" pos="VERB">went</w>
    <w c5="TO0" hw="to" pos="PREP">to</w>
    <w c5="VVI" hw="buy" pos="VERB">buy</w>
    <w c5="PNI" hw="something" pos="PRON">something</w>
    <unclear/>
    <w c5="PNX" hw="herself" pos="PRON">herself</w>
    <c c5="PUN">,</c>...
  </s>
}
for $s in $doc//s
let $hits-word := $s/w[. contains text { $word }]
let $hits-pos := $s/w[@pos = $pos]
let $near-pos := (
  $hits-pos/following-sibling::node()[position() <= $dist] union
  $hits-pos/preceding-sibling::node()[position() <= $dist]
)
where ($near-pos intersect $hits-word)
return $s
___________________________
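A possible variant, offered only as a sketch: the query above counts every sibling node toward $dist, so elements such as <unclear/> and <c> use up part of the distance. If the distance should be measured in words only, the sibling steps could be restricted to w elements. The shortened sample sentence and the value $dist := 2 are assumptions chosen for illustration; with them, "herself" is the second w after the verb "buy", so the sentence should be returned, whereas the node()-based steps would stop at <unclear/>.

```xquery
(: Sketch: measure the distance in <w> elements only. :)
let $word := 'herself'
let $pos := 'VERB'
let $dist := 2  (: assumed value, counted in words :)
(: shortened sample sentence, for illustration only :)
let $doc := document {
  <s n="6">
    <w c5="TO0" hw="to" pos="PREP">to</w>
    <w c5="VVI" hw="buy" pos="VERB">buy</w>
    <w c5="PNI" hw="something" pos="PRON">something</w>
    <unclear/>
    <w c5="PNX" hw="herself" pos="PRON">herself</w>
    <c c5="PUN">,</c>
  </s>
}
for $s in $doc//s
(: words matching the search term :)
let $hits-word := $s/w[. contains text { $word }]
(: words with the requested part of speech :)
let $hits-pos := $s/w[@pos = $pos]
(: up to $dist words on either side of each part-of-speech hit :)
let $near-pos := (
  $hits-pos/following-sibling::w[position() <= $dist] union
  $hits-pos/preceding-sibling::w[position() <= $dist]
)
where exists($near-pos intersect $hits-word)
return $s
```

Whether punctuation and other non-word elements should count toward the distance depends on the corpus; keeping node() as in the original query counts them, while w skips them.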
(In reply to the message from 赤瀬川 史朗 akasan123@yahoo.co.jp, sent Tue, Sep 4, 2012 at 1:31 PM, reproduced in full above.)
BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk