I would appreciate your advice on optimizing a query against a large
BaseX (9.2.4) database. It is loaded with data from the FDA's
Adverse Event Reporting System (FAERS). Currently this is just the
2020 dataset, which comprises 12 documents stored as 308,870,597
nodes (6,565 MB).
The queries below effectively - though not necessarily efficiently -
implement a histogram. The first, which is applied to patient
gender (sex), returns its results (3 items) in 52 seconds.
The second does the same for patient weight, rounded to the nearest
10 lb increment. It takes 580 seconds to place the data into 67 bins.
Initially I tried running it on the rounded weights but aborted the
run as it was taking an inordinate amount of time (there are 217
distinct weights in the dataset).
Is there a way to improve the performance of this type of query?
Thanks,
Ron
(: 3 items - 52 sec :)
let $safetyreport := db:open('FAERS')/ichicsr/safetyreport
for $value in distinct-values($safetyreport/patient/patientsex)
return concat($value, " ",
  count(index-of($safetyreport/patient/patientsex, $value)))
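The `count(index-of(...))` pattern rescans the whole patientsex sequence once per distinct value, so the work grows with (items x distinct values). A single-pass alternative worth trying - a sketch, not timed against this dataset - uses XQuery 3.0's `group by`, which BaseX supports:

(: single-pass sex histogram via group by :)
for $sex in db:open('FAERS')/ichicsr/safetyreport/patient/patientsex
group by $value := string($sex)
return concat($value, " ", count($sex))

After the `group by` clause, `$sex` is rebound to the sequence of nodes in each group, so `count($sex)` is the bin size with no second scan.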
(: 67 items - 580 sec :)
let $safetyreport := db:open('FAERS')/ichicsr/safetyreport
for $value in distinct-values(
  $safetyreport/patient/patientweight ! (. div 10.0) ! round(.) ! (. * 10))
return concat($value, " ",
  count(index-of(
    $safetyreport/patient/patientweight ! (. div 10.0) ! round(.) ! (. * 10),
    $value)))
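The weight query has the same quadratic shape, and additionally recomputes the rounded sequence once per bin. The binning can instead be folded into a `group by` key so every weight is read and rounded exactly once - again a sketch under the same assumptions, not timed here:

(: single-pass weight histogram, binned to the nearest 10 :)
for $w in db:open('FAERS')/ichicsr/safetyreport/patient/patientweight
group by $bin := round(number($w) div 10.0) * 10
order by $bin
return concat($bin, " ", count($w))

The `order by $bin` clause is optional; it just returns the 67 bins in ascending order rather than in grouping order.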