Hi,
I would appreciate your advice on optimizing a query against a large BaseX (9.2.4) database. It is loaded with data from the FDA's Adverse Event Reporting System (FAERS). Currently this is just the 2020 dataset, which comprises 12 documents stored as 308,870,597 nodes (6,565 MB).
The queries below effectively - though not necessarily efficiently - implement a histogram. The first, applied to patient gender (sex), returns its results (3 items) in 52 seconds:
2 893694
1 583999
0 198
The second does the same for patient weight, rounded to the nearest 10 lb increment. It takes 580 seconds to place the data into 67 bins. Initially I tried running it on the unrounded weights but aborted the run because it was taking an inordinate amount of time (there are 217 distinct weights in the dataset).
Is there a way to improve the performance of this type of query?
Thanks, Ron
(: 3 items - 52 sec :)
let $safetyreport := db:open('FAERS')/ichicsr/safetyreport
for $value in distinct-values($safetyreport/patient/patientsex)
return concat($value, " ",
  count(index-of($safetyreport/patient/patientsex, $value)))
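One rewrite I have been considering, using an XQuery 3.0 group by clause so the patientsex sequence is scanned once rather than once per distinct value via index-of (an untested sketch against the same database):

```xquery
(: sketch: single pass over patientsex, counting each group :)
for $sex in db:open('FAERS')/ichicsr/safetyreport/patient/patientsex
group by $value := string($sex)
return concat($value, " ", count($sex))
```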
(: 67 items - 580 sec :)
let $safetyreport := db:open('FAERS')/ichicsr/safetyreport
for $value in distinct-values(
  $safetyreport/patient/patientweight ! (. div 10.0) ! round(.) ! (. * 10))
return concat($value, " ",
  count(index-of(
    $safetyreport/patient/patientweight ! (. div 10.0) ! round(.) ! (. * 10),
    $value)))
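The same group by idea applied to the weight histogram, binning each weight once instead of recomputing the rounded sequence for every distinct value (again an untested sketch; I have not benchmarked it):

```xquery
(: sketch: compute each 10 lb bin once, then count per group :)
for $weight in db:open('FAERS')/ichicsr/safetyreport/patient/patientweight
group by $bin := round(number($weight) div 10) * 10
return concat($bin, " ", count($weight))
```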