While experimenting (I am trying to speed up my queries), I compared the
results of these two queries:
Word_Inc_Rm_Stop_txt <- "import module namespace functx = 'http://www.functx.com';
  let $txt := collection('IncidentRemarks/Incidents')/csv/record/INC_RM/text()
  let $INC_RM := tokenize(lower-case(string-join($txt)),
                          '(\\s|[,.!:;]|[n][b][s][p][;])+')
  let $Stoppers := doc('TextMining/Stopwoorden.txt')/text/line/text()
  let $Stop := functx:value-except($INC_RM, $Stoppers)
  return $Stop"
Word_Inc_Rm_Stop <- Session$Execute(
  as.character(glue("xquery {Word_Inc_Rm_Stop_txt}")))$result[[1]]
Word_Inc_Rm_Stop_Count <- length(Word_Inc_Rm_Stop)
Word_Inc_Rm_Stop_txt_2 <- "import module namespace functx = 'http://www.functx.com';
  let $txt := collection('IncidentRemarks/Incidents')/csv/record/INC_RM/text()
  let $INC_RM := tokenize(lower-case(string-join($txt)),
                          '(\\s|[,.!:;]|[n][b][s][p][;])+')
  let $Stoppers := doc('TextMining/Stopwoorden.txt')/text/line/text()
  let $Stop := functx:value-except($INC_RM, $Stoppers)
  return count($Stop)"
Word_Inc_Rm_Stop_Count_2 <- Session$Execute(
  as.character(glue("xquery {Word_Inc_Rm_Stop_txt_2}")))$result[[1]]
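For clarity: the timings below only cover the Execute call itself; ptm is
reset just before each call, roughly like this:

ptm <- proc.time()
# ... Session$Execute(...) as above ...
print(proc.time() - ptm)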
These are the processing times:
Version 1:
> print(proc.time() - ptm)
user system elapsed
2.903 0.022 3.160
Version 2:
> print(proc.time() - ptm)
user system elapsed
0.041 0.004 1.089
I guess it makes sense to put effort into speeding up my code. But what
bothers me is the following.
The first query returns the full sequence, and I compute the length of the
resulting vector in R; that gives 5842.
The second query lets BaseX compute the count itself; that gives 5843. The
GUI also returns 5843.
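On the R side I can at least inspect the returned vector directly, for
example to see whether it contains empty strings or duplicates (a quick
check, nothing conclusive yet):

length(Word_Inc_Rm_Stop)          # 5842
sum(Word_Inc_Rm_Stop == "")       # number of empty strings in the vector
anyDuplicated(Word_Inc_Rm_Stop)   # 0 if all returned values are distinct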
I also copied the output of
..
return $Stop
into a new LibreOffice document. The word count there is 5842 as well.
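My only guess so far (completely unverified) is that the sequence might
contain an empty-string token, for example from a leading separator in the
joined text, which then gets lost on the way to R or in the copy and paste.
A small variation on the query should show whether that explains the
difference of one (untested sketch, reusing the same query body):

Check_Empty_txt <- "import module namespace functx = 'http://www.functx.com';
  let $txt := collection('IncidentRemarks/Incidents')/csv/record/INC_RM/text()
  let $INC_RM := tokenize(lower-case(string-join($txt)),
                          '(\\s|[,.!:;]|[n][b][s][p][;])+')
  let $Stoppers := doc('TextMining/Stopwoorden.txt')/text/line/text()
  let $Stop := functx:value-except($INC_RM, $Stoppers)
  return (count($Stop), count($Stop[. != '']))"
Check_Empty <- Session$Execute(
  as.character(glue("xquery {Check_Empty_txt}")))$result[[1]]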
Who is right?
Cheers,
Ben