On 27-02-2020 at 22:03, Majewski, Steven Dennis (sdm7g) wrote:
Also: is ‘(&| )’ what you want as part of your regex to also catch the ampersand? I’m just guessing your intent here. You could also try ‘(\W|&)+’, i.e. non-word, but I’m kind of assuming that it handles non-normalized Unicode accented characters correctly and reads them as word chars and not delimiters. That would, of course, be the right thing, but I’d probably test it first.
— Steve.
I just copied the regex expression from this page: https://en.wikibooks.org/wiki/XQuery/Tag_Cloud (using regexes always gives me headaches ;-( ). But even after removing the "|[n][b][s][p][;]" from the regex, the BaseX GUI still returns 5843.
Ben
I was wondering about nbsp as well. Maybe you don’t need it at all, but we’d need to have a look at your files.
Could you additionally provide us with minimized instances of your Incidents and Stopwoorden.txt XML documents? They should have the same structure, but contain only a few lines of contents.
It should be relatively easy to create a database with the (approximately 500) stopwords and another database with the Incidents. Shall I send you a backup of those two databases?
Ben
A few incident and stories entries should be sufficient. Just attach the two XML documents to your next reply.
Hi Ben,
I had a look at the attached documents you sent to me in private. I couldn’t find out why the XQuery and R results differ (I’m pretty sure it’s due to the zero-length string, as guessed by Steve), but I noticed that your query may not necessarily do what you are trying to achieve.
Here is an alternative version that, as I believe, should match your requirements better:
let $words := distinct-values(
  for $text in db:open('Incidents')/csv/record/INC_RM
  return ft:tokenize($text)
)
let $stopwords := db:open('Stopwords')/text/line
let $result := $words[not(. = $stopwords)]
return sort($result)
There is no need to remove nbsp substrings as they’ll never occur in your input, and the ft:tokenize function will ensure that your input (case, special characters, diacritics) will be normalized (see [1,2] for more details). Using functx is perfectly valid; I only removed the reference to make the code a bit shorter.
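As a small illustration of that normalization (the exact token set can vary with your tokenizer and index options):

```xquery
(: punctuation is treated as a delimiter, and tokens are case-folded :)
ft:tokenize("Hello, WORLD!")   (: → hello world :)
(: with the default, diacritics-insensitive settings, accented input
   is normalized as well, e.g. "Café" should yield the token cafe :)
```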
Hope this helps, Christian
[1] http://docs.basex.org/wiki/Full-Text_Module#ft:tokenize [2] http://docs.basex.org/wiki/Full-Text
Hi Christian,
Since my primary goal at this moment is to see how BaseX/XQuery can be used for full-text analysis (and to compare the results and the effort required with similar tasks in R), I am very glad that you brought the ft:tokenize() function to my attention!
Ben
PS: Just for fun, I created a repository with this tiny function:

declare function tidyTM:wordFreqs($Words as xs:string*) {
  for $w in $Words
  let $f := $w
  group by $f
  order by count($w) descending
  return ($f, count($w))
};
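Applied to a small sample, it interleaves each distinct word with its frequency (the module prefix binding is omitted here; the body is just the FLWOR from the function above):

```xquery
let $Words := ("to", "be", "or", "to", "to", "be")
for $w in $Words
let $f := $w
group by $f
order by count($w) descending
return ($f, count($w))
(: → to 3 be 2 or 1 :)
```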
It took less than 10 minutes to create a repository and populate it with this function. Creating an R package takes much longer!
Hi Christian,
I don't have a separate 'Stopwords' database. The file 'Stopwoorden.txt' was used as an option while creating the 'Incidents' database. Since I have several lists with stopwords and several lists that can be used for sentiment analysis, I have stored all those files in a 'Textmining' database.
Without caring about stopwords, this query works:
let $words :=
  for $text in collection('IncidentRemarks/Incidents')/csv/record/INC_RM
  return ft:tokenize($text)
return $words
("sort($words)" returns a long list of numbers)
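One of those lists could already be applied as a plain filter after tokenizing; this is only a sketch, and the resource name and the line elements are assumptions about how the text files were converted:

```xquery
let $stopwords := db:open('Textmining', 'Stopwoorden.txt')//line
let $words :=
  for $text in collection('IncidentRemarks/Incidents')/csv/record/INC_RM
  return ft:tokenize($text)
return $words[not(. = $stopwords)]
```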
In an article ("Full-Text Search in XML Databases" by Robin Skoglund, 2009), I saw this example on page 23:

(: will match "propagating few errors" :)
/books/book[@number="1"]//p ftcontains "propagation of errors"
  with stemming with stop words ("a", "the", "of")
The query may be changed to use "stemming without stop words".
What I would like to see in BaseX is that, similar to that XQuery Full Text example, 'Stopwords' could be used as if it were a separate resource in the 'Incidents' database, and that it could be used as follows in the query:
let $words :=
  for $text in collection('IncidentRemarks/Incidents')/csv/record/INC_RM
  with stemming without stop words
  return ft:tokenize($text)
return $words
As far as I understand, 'stemming' has already been made available in the ft module. Would it also be possible to use STOPWORDS in a similar way?
Cheers, Ben
What I would like to see in BaseX is that, similar to that XQuery Full Text example, 'Stopwords' could be used as if it were a separate resource in the 'Incidents' database, and that it could be used as follows in the query:
That’s not supported by the spec (see [1] for all the details). You need to write your stop words to a local file (e.g. via file:write); after that, you can reference this file in your full-text expression:
/books/book[@number="1"]//p contains text "propagation of errors"
  using stop words at "stopwords.txt"
However, the more common and more efficient approach is to supply a stop words file when creating the full-text index. This will reduce the size of your full-text index (which is the major advantage in practice). If you call ft:search or ...[text() contains text ...] later on, this stop words file will be used to filter out terms that occur in your search terms.
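With db:optimize, such an index could be built along these lines (a sketch; the option names follow the DB Module conventions, and the file path is a placeholder):

```xquery
(: rebuild the database with a full-text index that
   skips the terms listed in the stop word file :)
db:optimize('Incidents', true(), map {
  'ftindex'  : true(),
  'stopwords': '/path/to/stopwords.txt'
})
```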
As far as I understand, 'stemming' has already been made available in the ft module. Would it also be possible to use STOPWORDS in a similar way?
Which function would you like to see extended?
[1] https://www.w3.org/TR/xpath-full-text-10/#ftstopwordoption
Please see my earlier message to the list, as I’m betting it’s the zero-length string that isn’t getting counted in R! — Steve M.