On 27-02-2020 at 22:03, Majewski, Steven Dennis (sdm7g) wrote:
Also: is ‘(&| )’ what you want as part of your regex to also catch the ampersand? I’m just guessing your intent here. You could also try ‘(\W|&)+’, i.e. non-word, but I’m kind of assuming that it handles non-normalized Unicode accented characters correctly and reads them as word chars and not delimiters. That would, of course, be the right thing, but I’d probably test it first.
— Steve.
I just copied the regex expression from this page: https://en.wikibooks.org/wiki/XQuery/Tag_Cloud (using regexes always gives me headaches ;-( ). But even after removing the "|[n][b][s][p][;]" from the regex, the BaseX GUI still returns 5843.
Ben
I was wondering about nbsp as well. Maybe you don’t need it at all, but we’d need to have a look at your files.
Could you additionally provide us with minimized instances of your Incidents and Stopwoorden.txt XML documents? They should have the same structure, but contain only a few lines of contents.
It should be relatively easy to create a database with the (approximately 500) stopwords and another database with the Incidents. Shall I send you a backup of those two databases?
Ben
A few incident and stories entries should be sufficient. Just attach the two XML documents to your next reply.
Hi Ben,
I had a look at the attached documents you sent to me in private. I couldn’t find out why the XQuery and R results differ (I’m pretty sure it’s due to the zero-length string, as guessed by Steve), but I noticed that your query may not necessarily do what you are trying to achieve.
Here is an alternative version that, as I believe, should match your requirements better:
let $words := distinct-values(
  for $text in db:open('Incidents')/csv/record/INC_RM
  return ft:tokenize($text)
)
let $stopwords := db:open('Stopwords')/text/line
let $result := $words[not(. = $stopwords)]
return sort($result)
There is no need to remove nbsp substrings as they’ll never occur in your input, and the ft:tokenize function will ensure that your input (case, special characters, diacritics) will be normalized (see [1,2] for more details). Using functx is perfectly valid; I only removed the reference to make the code a bit shorter.
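As a small illustration of that normalization (the exact token set can vary with your tokenizer and index options):

```xquery
(: punctuation is treated as a delimiter, and tokens are case-folded :)
ft:tokenize("Hello, WORLD!")   (: → hello world :)
(: with the default, diacritics-insensitive settings, accented input
   is normalized as well, e.g. "Café" should yield the token cafe :)
```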
Hope this helps, Christian
[1] http://docs.basex.org/wiki/Full-Text_Module#ft:tokenize [2] http://docs.basex.org/wiki/Full-Text
Hi Christian,
Since my primary goal at this moment is to see how BaseX/XQuery can be used for full-text analysis (and to compare the results and the effort required with similar tasks in R), I am very glad that you brought the ft:tokenize() function to my attention!
Ben
PS: Just for fun, I created a repository with this tiny function:

declare function tidyTM:wordFreqs($Words as xs:string*) {
  for $w in $Words
  let $f := $w
  group by $f
  order by count($w) descending
  return ($f, count($w))
};
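Applied to a small sample, it interleaves each distinct word with its frequency (the module prefix binding is omitted here; the body is just the FLWOR from the function above):

```xquery
let $Words := ("to", "be", "or", "to", "to", "be")
for $w in $Words
let $f := $w
group by $f
order by count($w) descending
return ($f, count($w))
(: → to 3 be 2 or 1 :)
```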
It took less than 10 minutes to create a repository and populate it with this function. Creating an R package takes much longer!
Hi Christian,
I don't have a separate 'Stopwords' database. The file 'Stopwoorden.txt' was used as an option while creating the 'Incidents' database. Since I have several lists with stopwords and several lists that can be used for sentiment analysis, I have stored all those files in a 'Textmining' database.
Without caring about stopwords, this query works:
let $words :=
  for $text in collection('IncidentRemarks/Incidents')/csv/record/INC_RM
  return ft:tokenize($text)
return $words
("sort($words)" returns a long list of numbers)
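One of those lists could already be applied as a plain filter after tokenizing; this is only a sketch, and the resource name and the line elements are assumptions about how the text files were converted:

```xquery
let $stopwords := db:open('Textmining', 'Stopwoorden.txt')//line
let $words :=
  for $text in collection('IncidentRemarks/Incidents')/csv/record/INC_RM
  return ft:tokenize($text)
return $words[not(. = $stopwords)]
```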
In an article ("Full-Text Search in XML Databases" by Robin Skoglund, 2009), I saw this example on page 23:

(: will match "propagating few errors" :)
/books/book[@number="1"]//p ftcontains "propagation of errors"
  with stemming with stop words ("a", "the", "of")
The query may be changed to use "stemming without stop words".
What I would like to see in BaseX is that, similar to that XQuery Full Text example, 'Stopwords' could be used as if it were a separate resource in the 'Incidents' database, and that it could be used as follows in the query:
let $words :=
  for $text in collection('IncidentRemarks/Incidents')/csv/record/INC_RM
  with stemming without stop words
  return ft:tokenize($text)
return $words
As far as I understand, 'stemming' has already been made available in the ft module. Would it also be possible to use STOPWORDS in a similar way?
Cheers, Ben
What I would like to see in BaseX is that, similar to that XQuery Full Text example, 'Stopwords' could be used as if it were a separate resource in the 'Incidents' database, and that it could be used as follows in the query:
That’s not supported by the spec (see [1] for all the details). You need to write your stop words to a local file (e.g. via file:write); after that, you can reference this file in your full-text expression:
/books/book[@number="1"]//p contains text "propagation of errors"
  using stop words at "stopwords.txt"
However, the more common and more efficient approach is to supply a stop words file when creating the full-text index. This will reduce the size of your full-text index (which is the major advantage in practice). If you call ft:search or ...[text() contains text ...] later on, this stop words file will be used to filter out terms that occur in your search terms.
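With db:optimize, such an index could be built along these lines (a sketch; the option names follow the DB Module conventions, and the file path is a placeholder):

```xquery
(: rebuild the database with a full-text index that
   skips the terms listed in the stop word file :)
db:optimize('Incidents', true(), map {
  'ftindex'  : true(),
  'stopwords': '/path/to/stopwords.txt'
})
```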
As far as I understand, 'stemming' has already been made available in the ft module. Would it also be possible to use STOPWORDS in a similar way?
Which function would you like to see extended?
[1] https://www.w3.org/TR/xpath-full-text-10/#ftstopwordoption
Please see my earlier message to the list, as I’m betting it’s the zero-length string that isn’t getting counted in R! — Steve M.