Hi,
I would appreciate your advice on optimizing a query against a large BaseX
(9.2.4) database. It is loaded with data from the FDA’s Adverse Event
Reporting System (FAERS). Currently this is just the 2020 dataset which
comprises 12 documents stored as 308,870,597 nodes (6,565 MB).
The queries below effectively - though not necessarily efficiently -
implement a histogram. The first, which is applied to patient gender (sex),
returns the results (3 items) in 52 seconds:
2 893694
1 583999
0 198
The second does this for patient weight - rounded to the closest 10 lbs
increment. It takes 580 seconds to place the data into 67 bins. Initially I
tried running it on the rounded weights but aborted the run as it was a
taking an inordinate amount of time (there are 217 distinct weights in the
dataset).
Is there a way to improve the performance of this type of query?
Thanks,
Ron
(: 3 items - 52 sec :)
let $safetyreport := db:open('FAERS')/ichicsr/safetyreport
for $value in distinct-values($safetyreport/patient/patientsex)
return concat($value, " ", count(index-of($safetyreport/patient/patientsex,
$value)))
(: 67 items - 580 sec :)
let $safetyreport := db:open('FAERS')/ichicsr/safetyreport
for $value in distinct-values($safetyreport/patient/patientweight ! (. div
10.0) ! round(.) ! (. * 10))
return concat($value, " ",
count(index-of($safetyreport/patient/patientweight ! (. div 10.0) !
round(.) ! (. * 10), $value)))
--
The information in this email and any attachments are intended solely for
the recipient(s) to whom it is addressed, and may be confidential and/or
privileged. Any unauthorized distribution or copying of this transmittal or
its attachments is prohibited. If you are not a named recipient or have
received this email in error: (i) you should not read, disclose, or copy
it, (ii) please notify the sender of your receipt by reply email and delete
this email and all attachments.