Re: [basex-talk] I am looking for the fastest way to sort 2.4 Mio tags by two attribute ascending and descending

14 Nov 2019

Hi Christian,
On 13.11.2019 at 18:38, Christian Grün wrote:
...
Hi Omar,
...
I am not 100% sure what redundant expressions you saw in my code. Is this about using reverse() instead of having two for loops?
In your initial query, the path…
 collection('_qdb-TEI-02__cache')//*[@order="none"]/_:d

…was evaluated four times. If you bind it to a variable, it will only
be evaluated once. In addition, using child steps instead of // is
faster, too (in many cases, BaseX will rewrite your path for you).
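
A minimal sketch of the advice above, assuming a small in-memory element in place of the real database (the `_:cache` wrapper and the sample values are hypothetical): the path is bound to a variable once, uses child steps only, and is reused for both sort keys.

```xquery
declare namespace _ = "https://www.oeaw.ac.at/acdh/tools/vle/util";

(: Hypothetical in-memory stand-in for the cache database. :)
let $cache :=
  <_:cache>
    <_:dryed order="none">
      <_:d xml:id="id1" vutlsk="b" vutlsk-archiv="x"/>
      <_:d xml:id="id2" vutlsk="a" vutlsk-archiv="y"/>
    </_:dryed>
  </_:cache>
(: The path is written once, with child steps only, and bound to $ds. :)
let $ds := $cache/_:dryed[@order = "none"]/_:d
let $by-key :=
  for $d in $ds
  order by $d/@vutlsk empty least
  return $d
let $by-archiv :=
  for $d in $ds
  order by $d/@vutlsk-archiv empty least
  return $d
return (
  (: The simple map operator (!) keeps the sorted order of its input. :)
  string-join($by-key ! string(@xml:id), " "),
  string-join(reverse($by-key) ! string(@xml:id), " "),
  string-join($by-archiv ! string(@xml:id), " "),
  string-join(reverse($by-archiv) ! string(@xml:id), " ")
)
```

The descending lists are derived with reverse() from the ascending ones, so each key is sorted only once.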
I always try to make the query optimizer's job as easy as possible, and
that makes things fast most of the time. I think the statements were
optimized to db:attribute(..., 'none'), so // was never actually used. My
current approach looks like this as an optimized query:
let $ds_0 := db:attribute("_qdb-TEI-02__cache", "none")/self::order/parent::element()/_:d
let $sorted-ascending_1 := for $d_2 in $ds_0
  order by data($d_2/@vutlsk) empty least
  return $d_2
let $sorted-ascending-archiv_3 := for $d_4 in $ds_0
  order by data($d_4/@vutlsk-archiv) empty least
  return $d_4
return (
  db:replace("_qdb-TEI-02__cache", "ascending_cache.xml",
    element Q{https://www.oeaw.ac.at/acdh/tools/vle/util}dryed { (
      attribute order { ("ascending") },
      attribute ids { (string-join(subsequence($sorted-ascending_1, 1, 15000)/((@ID, @xml:id)), " ")) }
    ) }),
  db:replace("_qdb-TEI-02__cache", "descending_cache.xml",
    element Q{https://www.oeaw.ac.at/acdh/tools/vle/util}dryed { (
      attribute order { ("descending") },
      attribute ids { (string-join(subsequence(reverse($sorted-ascending_1), 1, 15000)/((@ID, @xml:id)), " ")) }
    ) }),
  db:replace("_qdb-TEI-02__cache", "ascending-archiv_cache.xml",
    element Q{https://www.oeaw.ac.at/acdh/tools/vle/util}dryed { (
      attribute order { ("ascending") },
      attribute label { ("archiv") },
      attribute ids { (string-join(subsequence($sorted-ascending-archiv_3, 1, 15000)/((@ID, @xml:id)), " ")) }
    ) }),
  db:replace("_qdb-TEI-02__cache", "descending-archiv_cache.xml",
    element Q{https://www.oeaw.ac.at/acdh/tools/vle/util}dryed { (
      attribute order { ("descending") },
      attribute label { ("archiv") },
      attribute ids { (string-join(subsequence(reverse($sorted-ascending-archiv_3), 1, 15000)/((@ID, @xml:id)), " ")) }
    ) })
)
It is interesting to hear that BaseX does not profit from //
expressions. I think this is one thing your competing open source XML DB
stresses in their docs: to always use as few steps in an XPath
expression as possible.
...
...
I don't quite get how I would do incremental changes to the entries ordered by a key. I do an incremental update by just getting the updated pre values for the database that was changed. That is reasonably fast, even with incremental attribute index updates.
Just two ideas: You can store the data sets of your main database in a
pre-sorted fashion. Incremental entries can be sorted on-the-fly in
your query, and the results can then be merged with the sorted entries
of the main database.
Document order matters to me so I can't sort the main DB. At least not 
in this dataset.
...
Another approach is to store the references and
the index keys in your index database. The incremental entries can be
merged with the sorted index entries (by looking at the index keys,
which are available in both data structures).
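
The merge idea above could be sketched roughly as follows. This is an illustration only: the map layout (`key`, `id`), the sample values, and the helper `local:merge` are hypothetical and not taken from the thread. It assumes both inputs carry their sort key and that the main entries are already sorted.

```xquery
(: Hypothetical helper: merges two sequences that are each sorted by ?key.
   $acc collects the result, so the recursive calls are in tail position. :)
declare function local:merge(
  $a   as map(*)*,
  $b   as map(*)*,
  $acc as map(*)*
) as map(*)* {
  if (empty($a)) then ($acc, $b)
  else if (empty($b)) then ($acc, $a)
  else if (head($a)?key le head($b)?key)
    then local:merge(tail($a), $b, ($acc, head($a)))
    else local:merge($a, tail($b), ($acc, head($b)))
};

(: Assumed to be stored pre-sorted by key. :)
let $sorted-main := (
  map { "key": "apfel",     "id": "s1" },
  map { "key": "birne",     "id": "s7" },
  map { "key": "zwetschke", "id": "s3" }
)
(: Only the few incremental entries are sorted on the fly. :)
let $incremental :=
  for $e in (
    map { "key": "marille", "id": "s9" },
    map { "key": "ananas",  "id": "s8" }
  )
  order by $e?key
  return $e
return local:merge($sorted-main, $incremental, ()) ! ?id
```

Whether a recursive merge like this scales to millions of entries would have to be measured; the point is only that a merge needs the key in both data structures.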
I once tried storing the _:d tags sorted by key, ascending and
descending. That makes 2.4 million × keys (perhaps × 2) tags in the
database. Writing this of course took much longer, so a complete or
initial index generation took up to five minutes. I think that is not
worth it.
Efficiently merging by looking at the index keys is a problem because I
join all the @xml:id values that identify an entry into one long @ids
attribute. So I lose the relation between the key and the ID. I did
that because this was the fastest way to write this data to the DB.
Everything else I tried was much slower. And tokenize(@ids) is
remarkably fast, even if all 2.4 million IDs are in there. Only writing
out 2.4 million IDs to the database is slow.
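
For illustration, the @ids round trip described above might look like this, assuming three made-up IDs in place of the real sorted entries:

```xquery
declare namespace _ = "https://www.oeaw.ac.at/acdh/tools/vle/util";

(: Hypothetical IDs, assumed to be already in sorted order. :)
let $sorted-ids := ("s800_2", "s800_1", "s801_5")
(: One element with one long attribute, instead of one element per entry. :)
let $cache := element _:dryed {
  attribute order { "ascending" },
  attribute ids { string-join($sorted-ids, " ") }
}
(: The one-argument form of tokenize() splits on whitespace. :)
return tokenize($cache/@ids)
```

The attribute keeps the order of the IDs but not the keys they were sorted by, which is why a later merge by key is not possible with this layout.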
...
...
... ! db:open-pre(./@db_name, ./@pre)
In BaseX 9.3, it will be possible to supply integer sequences as
second argument; this may speed up your query a little.
I'll give it a try.
But I have to say that a request like "get me all entries with IDs
starting with s800, sorted by some key", using this query
declare namespace _ = "https://www.oeaw.ac.at/acdh/tools/vle/util";
for $key in db:attribute("_qdb-TEI-02__cache", 
index:attributes("_qdb-TEI-02__cache", 's800'))[. instance of 
attribute(xml:id)]
order by $key/../@vutlsk ascending
where starts-with($key/../@xml:id, 's800')
return db:open-pre($key/../@db_name, $key/../@pre)
only takes 140 ms for about 3900 entries. Unfortunately,
starts-with(@xml:id, 's800') is not automatically optimized in this way.
Best regards
Omar Siam
