Hi Gerrit,
Thanks for the suggestions. I would like to retain the original diacritics (for output purposes) but only match them when warranted (e.g., match acétazolamide to acétazolamide, but not acétazolamide to acetazolamide). I am looking for a simple solution that does not involve modifying the database or maintaining multiple copies (both for processing simplicity and storage efficiency reasons).
Thanks, Ron
On August 3, 2018 at 9:08:19 AM, Imsieke, Gerrit, le-tex (gerrit.imsieke@le-tex.de) wrote:
Hi Ron,
You can add an extra element (or attribute) to the content when importing or modifying it. (Or another document in another database if you like – you can create and later find such an index document by giving it the same db:path as the original document.)
In this extra database, document, element and/or attribute, you can recreate the original text, except that you normalize the characters with diacritical marks to a canonical decomposition form and then strip away the diacritical marks like this:
replace(normalize-unicode($input, 'NFKD'), '\p{Mn}', '')
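As a minimal sketch of how that expression behaves, it could be wrapped in a helper function. The name local:strip-diacritics is hypothetical (it is not part of BaseX); only the standard functions replace() and normalize-unicode() are used:

```xquery
(: Hypothetical helper: decompose accented characters (NFKD), then
   remove the combining marks (Unicode category Mn) left behind. :)
declare function local:strip-diacritics($input as xs:string) as xs:string {
  replace(normalize-unicode($input, 'NFKD'), '\p{Mn}', '')
};

(: returns "acetazolamide" :)
local:strip-diacritics('acétazolamide')
```

Note that a value stored this way no longer distinguishes acétazolamide from acetazolamide, so it only suits lookups that are meant to ignore diacritics.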
The full updating statement is beyond my cursory XQuery capabilities – I’d probably do it in XSLT. Also I don’t know how to trigger an event that would cause an update of the auxiliary fields when the underlying data changes.
Gerrit
On 03.08.2018 14:39, Ron Katriel wrote:
Christian,
Adding the “diacritics sensitive” option slows execution by a factor of 3. My script (fragment below), which joins two large databases, namely CT.gov (http://clinicaltrials.gov) and DrugBank, takes 2 hours without the “diacritics sensitive” constraint but 6 hours with it. Given the combinatorics involved, I am wondering if there is a better way to do this in BaseX.
Thanks, Ron
for $drug in db:open('DrugBank')/drugbank/drug
let $drug_name := $drug/name/text()
let $drug_synonyms := functx:value-union(
  normalize-space(lower-case($drug/name)),
  local:drug-synonyms($drug_name))
for $synonym_name in $drug_synonyms
...
for $study in db:open('CTGov')/clinical_study[
  intervention/intervention_name contains text { $synonym_name }
    using case insensitive using diacritics sensitive]
...
Ron Katriel, Ph.D. | Principal Data Scientist | Medidata Solutions | http://www.mdsol.com/
350 Hudson Street, 7th Floor, New York, NY 10014
rkatriel@mdsol.com | direct: +1 201 337 3622 | mobile: +1 201 675 5598 | main: +1 212 918 1800
On August 1, 2018 at 12:41:26 PM, Ron Katriel (rkatriel@mdsol.com) wrote:
Thanks, Christian. Strange, prior to contacting you and on a hunch, I tried adding the missing “using” keyword but still got the syntax error. Anyway, everything is good now!
Best, Ron
On August 1, 2018 at 3:57:51 AM, Christian Grün (christian.gruen@gmail.com) wrote:
I have fixed the example in the doc. Best, Christian
On Wed, Aug 1, 2018 at 5:08 AM Ron Katriel <rkatriel@mdsol.com> wrote:
Hi,
The following example from your website (docs.basex.org/wiki/Full-Text) appears to be syntactically incorrect:
"'Äpfel' will not be found..." contains text "Apfel" diacritics sensitive
In the BaseX GUI, the keyword “diacritics” is underlined in red and the following error is reported:
Unexpected end of query: 'diacritic sens...'.
This happens in version 8.6.4 and also the latest (9.0.2).
Thanks, Ron