Hi,
My RBaseX client is finally stable enough to use for real development. All regular commands are executed without errors. But now I am facing another problem.
In a client session, I want to use the following function:

fn_get_words_txt <- "declare function local:cloudWords($Veld as xs:string) as xs:string* {
  let $base  := collection('IncidentRemarks/Incidents')/csv/record
  let $txt   := string-join($base/*[name() = $Veld]/text(), ' ')
  let $words := tokenize($txt, '(\\s|[,.!:;]|[n][b][s][p][;])+')
  return ($words)
};"

(Doubling the '\' in the regular-expression string is R-specific.)
Session$Execute(fn_get_words_txt) returns:

  Gestopt bij , 1/8: Onbekend commando: declare. Probeer 'help'.
  Error in Session$Execute(fn_get_words_txt) :
    Gestopt bij , 1/8: Onbekend commando: declare. Probeer 'help'.

(The Dutch message translates to: "Stopped at , 1/8: Unknown command: declare. Try 'help'.")
fn_get_words_Query <- Session$Query(fn_get_words_txt)
fn_get_words_Query$queryObject$ExecuteQuery()

returns:

  Error in private$default_query_pattern(match.call()[[1]]) :
    Gestopt bij ., 5/20: [XPST0003] Expecting expression.

(Dutch for: "Stopped at ., 5/20: [XPST0003] Expecting expression.")
Since fn_get_words_txt is neither a regular command nor a regular function call, I understand these errors.
Before I even start trying to implement this in my package, my question is whether it should be possible to create local functions for that session. If so, any idea how to tackle this problem? Could the problem be generalized to the question of how a prolog can be added or changed?
Cheers, Ben
Hi Ben,
Session$Execute(fn_get_words_txt) returns:
If you want to evaluate XQuery, you will either need to prefix your query string with the XQUERY command, or, as you’ve already done…
fn_get_words_Query <- Session$Query(fn_get_words_txt)
…create a query object, and attach the actual function call to your query string.
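For instance (just a sketch, with a placeholder field name; shown here as plain XQuery, so in the R string literal the backslash would again need to be doubled, as you noted), the string you pass to Session$Query could combine the prolog and a call:

  declare function local:cloudWords($Veld as xs:string) as xs:string* {
    let $base  := collection('IncidentRemarks/Incidents')/csv/record
    let $txt   := string-join($base/*[name() = $Veld]/text(), ' ')
    let $words := tokenize($txt, '(\s|[,.!:;]|[n][b][s][p][;])+')
    return $words
  };
  local:cloudWords('SomeField')  (: placeholder field name :)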
If you want to make XQuery code persistent for future invocations, you can include your function in an XQuery library module and install this module in the repository [1].
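If you go that route, a rough sketch of such a module could look like this (the module URI 'urn:cloudwords' and the file name cloudwords.xqm are just placeholders):

  (: cloudwords.xqm: library module wrapping the function from your first mail :)
  module namespace cw = 'urn:cloudwords';

  declare function cw:cloudWords($Veld as xs:string) as xs:string* {
    let $base  := collection('IncidentRemarks/Incidents')/csv/record
    let $txt   := string-join($base/*[name() = $Veld]/text(), ' ')
    let $words := tokenize($txt, '(\s|[,.!:;]|[n][b][s][p][;])+')
    return $words
  };

Once installed (e.g. via the REPO INSTALL command), the module should be resolvable by its namespace URI, so the importing query does not need an "at" clause.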
Best, Christian
On 27-02-2020 at 16:41, Christian Grün wrote:
Hi Ben,
…create a query object, and attach the actual function call to your query string.
I already thought about that, but what would be the benefit of repeating the function definition every time I want to call the function? ;-(
If you want to make XQuery code persistent for future invocations, you can include your function in an XQuery library module and install this module in the repository [1].
I will probably go for this.
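Assuming the function ends up in a repository module with the namespace 'urn:cloudwords', as sketched above, each per-call query would then shrink to something like (the field name 'INC_RM' is taken from the queries below, purely as an example):

  import module namespace cw = 'urn:cloudwords';
  cw:cloudWords('INC_RM')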
While experimenting (I am trying to speed up the queries), I compared the results of these two queries:
Word_Inc_Rm_Stop_txt <- "import module namespace functx = 'http://www.functx.com';
  let $txt := collection('IncidentRemarks/Incidents')/csv/record/INC_RM/text()
  let $INC_RM := tokenize(lower-case(string-join($txt)), '(\\s|[,.!:;]|[n][b][s][p][;])+')
  let $Stoppers := doc('TextMining/Stopwoorden.txt')/text/line/text()
  let $Stop := functx:value-except($INC_RM, $Stoppers)
  return $Stop"

Word_Inc_Rm_Stop <- Session$Execute(as.character(glue("xquery {Word_Inc_Rm_Stop_txt}")))$result[[1]]
Word_Inc_Rm_Stop_Count <- length(Word_Inc_Rm_Stop)
Word_Inc_Rm_Stop_txt_2 <- "import module namespace functx = 'http://www.functx.com';
  let $txt := collection('IncidentRemarks/Incidents')/csv/record/INC_RM/text()
  let $INC_RM := tokenize(lower-case(string-join($txt)), '(\\s|[,.!:;]|[n][b][s][p][;])+')
  let $Stoppers := doc('TextMining/Stopwoorden.txt')/text/line/text()
  let $Stop := functx:value-except($INC_RM, $Stoppers)
  return count($Stop)"

Word_Inc_Rm_Stop_Count_2 <- Session$Execute(as.character(glue("xquery {Word_Inc_Rm_Stop_txt_2}")))$result[[1]]
These are the processing times:

Version 1:

print(proc.time() - ptm)
   user  system elapsed
  2.903   0.022   3.160

Version 2:

print(proc.time() - ptm)
   user  system elapsed
  0.041   0.004   1.089
I guess it makes sense to put effort into speeding up my code. But what bothers me is the following.
For the first query, R computes the length of the returned vector; the result is 5842. The second query returns the length as computed by BaseX; this result is 5843. The GUI also returns 5843.
I copied the output of "... return $Stop" to a new LibreOffice document. That document counts 5842 words.
Who is right?
Cheers, Ben
While experimenting (I am trying to speed up the queries), I compared the results of these two queries:
It’s difficult to understand what’s going on here. Could you please provide us self-contained queries without the R wrapper code?
On 27-02-2020 at 19:19, Christian Grün wrote:
It’s difficult to understand what’s going on here. Could you please provide us self-contained queries without the R wrapper code?
Version 1:
import module namespace functx = 'http://www.functx.com';

(: Extract the text :)
let $txt := collection('IncidentRemarks/Incidents')/csv/record/INC_RM/text()
(: Convert to lower-case and tokenize :)
let $INC_RM := tokenize(lower-case(string-join($txt)), '(\s|[,.!:;]|[n][b][s][p][;])+')
(: Read Stopwords :)
let $Stoppers := doc('TextMining/Stopwoorden.txt')/text/line/text()
(: Remove Stopwords :)
let $Stop := functx:value-except($INC_RM, $Stoppers)
return $Stop
My R code first executes this as an XQUERY command and then calculates the length of the returned list (= 5842).
Version 2:
import module namespace functx = 'http://www.functx.com';

let $txt := collection('IncidentRemarks/Incidents')/csv/record/INC_RM/text()
let $INC_RM := tokenize(lower-case(string-join($txt)), '(\s|[,.!:;]|[n][b][s][p][;])+')
let $Stoppers := doc('TextMining/Stopwoorden.txt')/text/line/text()
let $Stop := functx:value-except($INC_RM, $Stoppers)
return count($Stop)
This returns the length of the sequence (it counts 5843 words).
The doubled '\\' in the regular expression of the R version is intentional (R-specific); with a single '\' the query can be executed in the BaseX GUI.
Does this help?
Ben
So, if the counts differ depending on who is counting (R or BaseX), the first question is: who is correct? (And the second question is probably: what do you mean by correct? The semantics of an XQuery sequence and of whatever destination R data type is being counted may be slightly different. I don't know R that well, but the semantics of XQuery sequences and arrays are rather different, for example.)
— Steve M.
I also note that, when I try to mock up something similar with one of my texts, tokenize seems to give me a zero-length string at the start.
It’s there in the output window of the BaseX GUI, in the first line, but it is easy to miss the fact that it’s significant whitespace in this context:
(tokenize(string-join(collection('BOV')[ends-with( db:path(.), '.tei' )][1]/TEI/text/body/p//text() ), '\W+' ) => distinct-values())[not(. = ( "of", "the", "in", "and", "at","by","to" ))]
August 16 17 2015 Members Board Visitors University Virginia met Retreat Open Executive Session Forum …
But it becomes visible if I apply string-length to the sequence:
(tokenize(string-join(collection('BOV')[ends-with( db:path(.), '.tei' )][1]/TEI/text/body/p//text() ), '\W+' ) => distinct-values())[not(. = ( "of", "the", "in", "and", "at","by","to" ))] ! string-length(.)
0 6 2 2 4 7 5 8 10 8 3 7 4 9 7 …
I wonder if that’s the semantic difference here.
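A minimal, self-contained illustration of that effect (plain XQuery, independent of the data above): a leading separator makes tokenize() emit a zero-length first token, which disappears again once empty strings are filtered out.

  tokenize(' August 16 17', '\W+')
  (: returns ('', 'August', '16', '17'), so count() is 4 :)

  tokenize(' August 16 17', '\W+')[. != '']
  (: returns ('August', '16', '17'), so count() is 3 :)

If the joined INC_RM text happens to start with whitespace or punctuation, functx:value-except would keep that single empty string among its distinct values, which might explain the 5843 vs. 5842 difference if the empty string gets dropped on the R or LibreOffice side.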
— Steve M.