Hi Ben,
I had a look at the documents you sent me privately. I couldn’t find out why the XQuery and R results differ (I’m pretty sure it’s due to the zero-length string, as Steve guessed), but I noticed that your query may not do what you are trying to achieve.
Here is an alternative version that I believe should match your requirements better:
let $words := distinct-values(
  for $text in db:open('Incidents')/csv/record/INC_RM
  return ft:tokenize($text)
)
let $stopwords := db:open('Stopwords')/text/line
let $result := $words[not(. = $stopwords)]
return sort($result)
There is no need to remove nbsp substrings as they’ll never occur in your input, and the ft:tokenize function will ensure that your input (case, special characters, diacritics) will be normalized (see [1,2] for more details). Using functx is perfectly valid; I only removed the reference to make the code a bit shorter.
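As a minimal sketch, the same filter can be tried without any database, on inline sample data. The two sample sentences and the three stopwords below are made up for illustration and are not taken from your files; only ft:tokenize is BaseX-specific, and it is called here with its default options.

```xquery
(: Hypothetical sample data; replace with your own input. :)
let $texts := (
  'Server crashed after the Update',
  'The server was restarted'
)
(: Hypothetical stopword list, written in lower case. :)
let $stopwords := ('the', 'was', 'after')
(: Tokenize each text with the default full-text options, then deduplicate. :)
let $words := distinct-values(
  for $text in $texts
  return ft:tokenize($text)
)
(: Keep only the tokens that equal none of the stopwords, then sort them. :)
return sort($words[not(. = $stopwords)])
```

The predicate relies on the general comparison `=`, which is true as soon as the token equals any item of $stopwords, so no explicit loop over the stopwords is needed.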
Hope this helps, Christian
[1] http://docs.basex.org/wiki/Full-Text_Module#ft:tokenize
[2] http://docs.basex.org/wiki/Full-Text
On Fri, Feb 28, 2020 at 4:57 PM Christian Grün christian.gruen@gmail.com wrote:
A few incident and stories entries should be sufficient. Just attach the two XML documents to your next reply.
On Fri, Feb 28, 2020 at 16:20, Ben Engbers Ben.Engbers@be-logical.nl wrote:
On 28-02-2020 at 14:39, Christian Grün wrote:
I was wondering about nbsp as well. Maybe you don’t need it at all, but we’d need to have a look at your files.
Could you additionally provide us with minimized instances of your Incidents and Stopwoorden.txt XML documents? They should have the same structure, but contain only a few lines of content.
It should be relatively easy to create a database with the (approximately 500) stopwords and another database with the Incidents. Shall I send you a backup of those two databases?
Ben