Hi,
I would appreciate your advice on optimizing a query against a large BaseX (9.2.4) database. It is loaded with data from the FDA's Adverse Event Reporting System (FAERS). Currently this is just the 2020 dataset, which comprises 12 documents stored as 308,870,597 nodes (6,565 MB).
The queries below effectively - though not necessarily efficiently - implement a histogram. The first, applied to patient gender (sex), returns its results (3 items) in 52 seconds:
2 893694
1 583999
0 198
The second does the same for patient weight, rounded to the nearest 10 lb increment. It takes 580 seconds to place the data into 67 bins. Initially I tried running it on the unrounded weights but aborted the run because it was taking an inordinate amount of time (there are 217 distinct weights in the dataset).
Is there a way to improve the performance of this type of query?
Thanks, Ron
(: 3 items - 52 sec :)
let $safetyreport := db:open('FAERS')/ichicsr/safetyreport
for $value in distinct-values($safetyreport/patient/patientsex)
return concat($value, " ",
  count(index-of($safetyreport/patient/patientsex, $value)))
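One rewrite I have been considering, using an XQuery 3.0 group by clause so the patientsex sequence is scanned once rather than once per distinct value via index-of (an untested sketch against the same database):

```xquery
(: sketch: single pass over patientsex, counting each group :)
for $sex in db:open('FAERS')/ichicsr/safetyreport/patient/patientsex
group by $value := string($sex)
return concat($value, " ", count($sex))
```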
(: 67 items - 580 sec :)
let $safetyreport := db:open('FAERS')/ichicsr/safetyreport
for $value in distinct-values(
  $safetyreport/patient/patientweight ! (. div 10.0) ! round(.) ! (. * 10))
return concat($value, " ",
  count(index-of(
    $safetyreport/patient/patientweight ! (. div 10.0) ! round(.) ! (. * 10),
    $value)))
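The same group by idea applied to the weight histogram, binning each weight once instead of recomputing the rounded sequence for every distinct value (again an untested sketch; I have not benchmarked it):

```xquery
(: sketch: compute each 10 lb bin once, then count per group :)
for $weight in db:open('FAERS')/ichicsr/safetyreport/patient/patientweight
group by $bin := round(number($weight) div 10) * 10
return concat($bin, " ", count($weight))
```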