I also note that when I try to mock up something similar with one of my texts, tokenize seems to give me a zero-length string at the start.
It’s there in the first line of the BaseX GUI output window, but it’s easy to miss that the leading whitespace is significant in this context:
(tokenize(string-join(collection('BOV')[ends-with( db:path(.), '.tei' )][1]/TEI/text/body/p//text() ), '\W+' ) => distinct-values())[not(. = ( "of", "the", "in", "and", "at","by","to" ))]
August 16 17 2015 Members Board Visitors University Virginia met Retreat Open Executive Session Forum …
But it becomes visible if I apply string-length to the sequence:
(tokenize(string-join(collection('BOV')[ends-with( db:path(.), '.tei' )][1]/TEI/text/body/p//text() ), '\W+' ) => distinct-values())[not(. = ( "of", "the", "in", "and", "at","by","to" ))] ! string-length(.)
0 6 2 2 4 7 5 8 10 8 3 7 4 9 7 …
I wonder if that’s the semantic difference here.
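A minimal sketch of the effect, assuming the joined text begins with whitespace. The input string below is a hypothetical stand-in, not the actual BOV data. When the input starts with something the separator pattern matches, fn:tokenize returns a zero-length string as its first item:

```xquery
(: Hypothetical stand-in for the joined text nodes; note the leading space :)
let $text := " August 16 17 2015"
(: The leading space matches \W+, so the first token is a zero-length string :)
let $raw := tokenize($text, '\W+')
(: Filter out zero-length tokens before counting :)
let $clean := $raw[. ne '']
return (count($raw), count($clean))
```

Under these assumptions the two counts come out as 5 and 4, so whether a consumer keeps or drops the empty string changes the total by exactly one.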
— Steve M.
On Feb 27, 2020, at 3:43 PM, Majewski, Steven Dennis (sdm7g) sdm7g@virginia.edu wrote:
So, if the counts differ depending on who is counting (R or BaseX), the first question is: who is correct? And the second question is probably: what do you mean by correct? The semantics of XQuery sequences and of whatever destination R datatype is being counted may be slightly different. I don’t know R that well, but the semantics of XQuery sequences and arrays are rather different, for example.
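To illustrate the point about differing semantics, here is a small sketch with made-up values (none of them come from the data discussed in this thread). It assumes an XQuery 3.1 processor, where the array prefix is predeclared:

```xquery
(: Sequences flatten: nesting disappears and the empty sequence contributes nothing :)
let $seq := ((1, 2, 3), ())
(: Arrays keep their members: (1, 2, 3) is one member and () is another :)
let $arr := [(1, 2, 3), ()]
(: A zero-length string is still an item, unlike the empty sequence :)
let $strs := ("", "a")
return (count($seq), array:size($arr), count($strs))
```

This should return 3, 2 and 2: the same values are counted differently depending on the container, and an empty string counts as an item while an empty sequence does not.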
— Steve M.
On Feb 27, 2020, at 2:48 PM, Ben Engbers Ben.Engbers@Be-Logical.nl wrote:
On 27-02-2020 at 19:19, Christian Grün wrote:
It’s difficult to understand what’s going on here. Could you please provide us self-contained queries without the R wrapper code?
Version 1:
import module namespace functx = 'http://www.functx.com';
(: Extract the text :)
let $txt := collection('IncidentRemarks/Incidents')/csv/record/INC_RM/text()
(: Convert to lower-case and tokenize :)
let $INC_RM := tokenize(lower-case(string-join($txt)), '(\s|[,.!:;]|[n][b][s][p][;])+')
(: Read Stopwords :)
let $Stoppers := doc('TextMining/Stopwoorden.txt')/text/line/text()
(: Remove Stopwords :)
let $Stop := functx:value-except($INC_RM, $Stoppers)
return $Stop
My R code first executes this as an XQuery and then calculates the length of the returned list (= 5842).
Version 2:
import module namespace functx = 'http://www.functx.com';
let $txt := collection('IncidentRemarks/Incidents')/csv/record/INC_RM/text()
let $INC_RM := tokenize(lower-case(string-join($txt)), '(\s|[,.!:;]|[n][b][s][p][;])+')
let $Stoppers := doc('TextMining/Stopwoorden.txt')/text/line/text()
let $Stop := functx:value-except($INC_RM, $Stoppers)
return count($Stop)
This returns the length of the sequence (it counts 5843 words).
The doubled backslash ('\\') in the regular expression in my R source is intentional (R-specific string escaping). With a single '\', the query can be executed in the BaseX GUI.
Does this help?
Ben