On 02-03-2020 at 13:27, Christian Grün wrote:
Hi Ben,
Here is an alternative version that, as I believe, should match your requirements better:
  let $words := distinct-values(
    for $text in db:open('Incidents')/csv/record/INC_RM
    return ft:tokenize($text)
  )
  let $stopwords := db:open('Stopwords')/text/line
  let $result := $words[not(. = $stopwords)]
  return sort($result)
Hi Christian,
I don't have a separate 'Stopwords' database. The file 'Stopwoorden.txt' was used as an option when creating the 'Incidents' database. Since I have several lists of stopwords and several lists that can be used for sentiment analysis, I have stored all those files in a 'Textmining' database.
Without caring about stopwords, this query works:
  let $words :=
    for $text in collection('IncidentRemarks/Incidents')/csv/record/INC_RM
    return ft:tokenize($text)
  return $words
("sort($words)" returns a long list of numbers)
In an article ("Full-Text Search in XML Databases" by Robin Skoglund, 2009), I saw this example on page 23:

  (: will match "propagating few errors" :)
  /books/book[@number="1"]//p ftcontains "propagation of errors"
    with stemming with stop words ("a", "the", "of")
The query may be changed to "stemming without stop words".
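For reference, the article's example uses an older draft syntax. In the final XQuery and XPath Full Text 1.0 syntax, `ftcontains` became `contains text` and the match options are introduced with `using` instead of `with`. A minimal sketch of the same idea follows; the inline <p> element is made-up test input rather than anything from the article, and I have not verified the result in BaseX:

```xquery
(: Hypothetical test input; the article queries /books/book[@number="1"]//p instead :)
let $p := <p>propagating few errors</p>
return $p contains text "propagation of errors"
  using stemming
  using stop words ("a", "the", "of")
```

Note that the option applies to the full-text `contains text` expression, not to a `for` clause or to a function call.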
What I would like to see in BaseX is that, as in XQuery, 'Stopwords' could be used as if it were a separate resource in the 'Incidents' database, and that it could be used in a query as follows:

  let $words :=
    for $text in collection('IncidentRemarks/Incidents')/csv/record/INC_RM
      with stemming without stop words
    return ft:tokenize($text)
  return $words
As far as I understand, 'stemming' has already been made available in the ft:module. Would it also be possible to use STOPWORDS in a similar way?
Cheers, Ben