Hi,
I have a custom index that looks like this (one db, different files):
<_:dryed xmlns:_="https://www.oeaw.ac.at/acdh/tools/vle/util" db_name="z881_qdb-TEI-02n" order="none">
  <_:d pre="15627" db_name="z881_qdb-TEI-02n" xml:id="z881_qdbn-d16e2"
       vutlsk="tsįttr Ziter [Subst]" vutlsk-archiv="HK 881, z8810118.sch#1"/>
  <_:d pre="15673" db_name="z881_qdb-TEI-02n" xml:id="z881_qdbn-d16e21"
       vutlsk="tsįttr Ziter [Subst]" vutlsk-archiv="HK 881, z8810118.sch#1"/>
  ...
</_:dryed>
<_:dryed xmlns:_="https://www.oeaw.ac.at/acdh/tools/vle/util" db_name="f227_qdb-TEI-02n" order="none">
  <_:d pre="467" db_name="f227_qdb-TEI-02n" xml:id="f237_qdb-d1e29398"
       vutlsk="(aus)faren [Verb]" vutlsk-archiv="HK 327, f227#944.1 = fare0126.eck#1.1"/>
  <_:d pre="591" db_name="f227_qdb-TEI-02n" xml:id="f237_qdb-d1e29438"
       vutlsk="(aus)faren [Verb]" vutlsk-archiv="HK 327, f227#945.1 = fare0126.eck#2.1"/>
  ...
</_:dryed>
There are about 2.4 Mio _:d tags in this db.
I need to sort them by the @vutlsk* attributes alphabetically in ascending and descending order.
With the code I have now:
declare namespace _ = "https://www.oeaw.ac.at/acdh/tools/vle/util";
let $sorted-ascending :=
  subsequence(
    for $d in collection('_qdb-TEI-02__cache')//*[@order="none"]/_:d
    order by $d/@vutlsk ascending
    return $d/(@ID, @xml:id)/data(),
    1, 10000)
let $sorted-descending :=
  subsequence(
    for $d in collection('_qdb-TEI-02__cache')//*[@order="none"]/_:d
    order by $d/@vutlsk descending
    return $d/(@ID, @xml:id)/data(),
    1, 10000)
let $sorted-ascending-archiv :=
  subsequence(
    for $d in collection('_qdb-TEI-02__cache')//*[@order="none"]/_:d
    order by $d/@vutlsk-archiv ascending
    return $d/(@ID, @xml:id)/data(),
    1, 10000)
let $sorted-descending-archiv :=
  subsequence(
    for $d in collection('_qdb-TEI-02__cache')//*[@order="none"]/_:d
    order by $d/@vutlsk-archiv descending
    return $d/(@ID, @xml:id)/data(),
    1, 10000)
return (
  db:replace("_qdb-TEI-02__cache", 'ascending_cache.xml',
    <_:dryed order="ascending" ids="{string-join($sorted-ascending, ' ')}"/>),
  db:replace("_qdb-TEI-02__cache", 'descending_cache.xml',
    <_:dryed order="descending" ids="{string-join($sorted-descending, ' ')}"/>),
  db:replace("_qdb-TEI-02__cache", 'ascending-archiv_cache.xml',
    <_:dryed order="ascending" label="archiv" ids="{string-join($sorted-ascending-archiv, ' ')}"/>),
  db:replace("_qdb-TEI-02__cache", 'descending-archiv_cache.xml',
    <_:dryed order="descending" label="archiv" ids="{string-join($sorted-descending-archiv, ' ')}"/>)
)
This takes 30 s to about a minute depending on the subsequence I choose.
I experimented with and without multithreading; multiple jobs or fork-join make it worse.
In the worst case, I need to do this every time I save a change to the original DBs for which I maintain this index.
Any ideas how to speed this up?
Best regards
Omar Siam
Dear Omar,
Some spontaneous ideas:
• You could try to evaluate redundant expressions once and bind them to a variable instead (see the attached code).
• You could save each document to a separate database via db:create (depending on your data, this may be faster than replacements in a single database), or save all new elements in a single document.
• Instead of creating full index structures with each update operation, you may save a lot of time if you only update the parts of the data that have actually changed.
• If that’s close to impossible (because the types of updates are too manifold), you could work with daily databases that only contain incremental changes, and merge them with the main database every night.
2.4 million tags are a lot, though; and the string length of the created attribute values seems to exceed 100,000 characters, which is a lot, too. What will you do with the resulting documents?
Best, Christian
declare namespace _ = "https://www.oeaw.ac.at/acdh/tools/vle/util";
let $id-string := function($nodes) {
  $nodes/(@ID, @xml:id) => subsequence(1, 10000) => string-join(' ')
}
let $db := '_qdb-TEI-02__cache'
let $nodes := db:open($db)/_:dryed[@order = 'none']/_:d
let $vutlsk := sort($nodes, (), function($n) { $n/@vutlsk })
let $archiv := sort($nodes, (), function($n) { $n/@vutlsk-archiv })
return (
  db:replace($db, 'ascending_cache.xml',
    <_:dryed order="ascending" ids="{ $id-string($vutlsk) }"/>),
  db:replace($db, 'descending_cache.xml',
    <_:dryed order="descending" ids="{ $id-string(reverse($vutlsk)) }"/>),
  db:replace($db, 'ascending-archiv_cache.xml',
    <_:dryed order="ascending" ids="{ $id-string($archiv) }" label="archiv"/>),
  db:replace($db, 'descending-archiv_cache.xml',
    <_:dryed order="descending" ids="{ $id-string(reverse($archiv)) }" label="archiv"/>)
)
On Tue, Nov 12, 2019 at 6:00 PM Omar Siam Omar.Siam@oeaw.ac.at wrote:
And I discovered a little bug just after I had dispatched the message: the reversed nodes will be rewritten to document order in the $id-string function. You can circumvent this by using the simple map operator:
Old: $nodes/(@ID, @xml:id)
New: $nodes!(@ID, @xml:id)
On Tue, Nov 12, 2019 at 6:48 PM Christian Grün christian.gruen@gmail.com wrote:
Dear Christian,
Thank you, your suggestion is indeed about 4 s faster on my machine than my code. That is quite impressive. Now I am below 20 s. Not ideal, but a good start. If I take care to run this only when necessary, I am willing to leave it at that.
I also tried some ideas from your code, but with the traditional for loop. It looks like that is even faster:
declare namespace _ = "https://www.oeaw.ac.at/acdh/tools/vle/util";
let $sorted-ascending :=
  for $key in collection('_qdb-TEI-02__cache')//*[@order="none"]/_:d/@vutlsk
  order by data($key) ascending
  return $key
let $sorted-ascending-archiv :=
  for $key in collection('_qdb-TEI-02__cache')//*[@order="none"]/_:d/@vutlsk-archiv
  order by data($key) ascending
  return $key
return (
  db:replace("_qdb-TEI-02__cache", 'ascending_cache.xml',
    <_:dryed order="ascending" ids="{string-join(subsequence($sorted-ascending, 1, 15000)/../(@ID, @xml:id), ' ')}"/>),
  db:replace("_qdb-TEI-02__cache", 'descending_cache.xml',
    <_:dryed order="descending" ids="{string-join(subsequence(reverse($sorted-ascending), 1, 15000)/../(@ID, @xml:id), ' ')}"/>),
  db:replace("_qdb-TEI-02__cache", 'ascending-archiv_cache.xml',
    <_:dryed order="ascending" label="archiv" ids="{string-join(subsequence($sorted-ascending-archiv, 1, 15000)/../(@ID, @xml:id), ' ')}"/>),
  db:replace("_qdb-TEI-02__cache", 'descending-archiv_cache.xml',
    <_:dryed order="descending" label="archiv" ids="{string-join(subsequence(reverse($sorted-ascending-archiv), 1, 15000)/../(@ID, @xml:id), ' ')}"/>)
)
This takes only 11s on my machine.
One thing I think I have also seen before: the parent axis is rather slow. Do you agree, or am I imagining things?
Some replies to your comments below:
Some spontaneous ideas:
• You could try to evaluate redundant expressions once and bind them to a variable instead (see the attached code).
I am not 100% sure what redundant expressions you saw in my code. Is this about using reverse() instead of having two for loops?
• You could save each document to a separate database via db:create (depending on your data, this may be faster than replacements in a single database), or save all new elements in a single document.
I tried that now, and it makes no difference whether I do a db:replace in _qdb-TEI-02__cache or create a separate db for each document with db:create. I already adjusted ATTRINCLUDE so it ignores the ids attribute.
• Instead of creating full index structures with each update operation, you may save a lot of time if you only update parts of the data that have actually changed.
I thought about that but could not imagine how to do it. The most probable kind of change affecting the sort order is something like removing a space or a "(" at the start, or changing the first letter. Any minimal update here would probably still mean sorting the 2.4 Mio entries.
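For what it's worth, if the sorted key sequence itself were kept around, a single changed entry would not force a full resort. This is only a sketch with made-up, simplified keys (not the actual index code):

```python
import bisect

# Sketch with made-up, simplified keys: if the sorted key sequence is kept,
# a single changed entry needs one binary-search removal and one ordered
# insertion (O(log n) search plus one list shift) instead of re-sorting
# all 2.4 Mio entries.
sorted_keys = ["(aus)faren [Verb]", "tsittr Ziter [Subst]"]

def update_key(keys: list, old: str, new: str) -> None:
    """Replace one sort key while keeping the list in ascending order."""
    keys.pop(bisect.bisect_left(keys, old))  # drop the stale key
    bisect.insort(keys, new)                 # re-insert at the right slot

# e.g. an entry's headword loses its leading "tsittr " qualifier
update_key(sorted_keys, "tsittr Ziter [Subst]", "Ziter [Subst]")
print(sorted_keys)  # → ['(aus)faren [Verb]', 'Ziter [Subst]']
```

Whether the bookkeeping pays off depends on how the sorted sequence is persisted between updates.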
• If that’s close to impossible (because the types of updates are too manifold), you could work with daily databases that only contain incremental changes, and merge them with the main database every night.
I don't quite get how I would do incremental changes to the entries ordered by a key. I do an incremental update by just getting the updated pre values for the database that was changed. That is reasonably fast, even with incremental attribute index updates.
2,4 million tags are a lot, though; and the string length of the created attribute values seem to exceed 100.000 characters, which is a lot, too. What will you do with the resulting documents?
As I mentioned, this is a custom index for a set of databases containing 2.4 million TEI entry elements with data. These are more than 700 databases with about 3,500 entries each, and an update happens to one of them at a time. That part is quite fast.
I was not sure what counts as a lot of data in BaseX. I had a feeling that my dataset is no longer medium-sized, but I am not sure up to what size a dataset should still give reasonable performance. I have to say that searching this data in BaseX has proved to be a very fast and pleasant experience. Just editing it entry by entry is tricky.
These really big attribute string values are one part of a two-step lookup I want to use for a paging feature (at least for some of the 2.4 Mio entries).
The RESTXQ user can ask for a result with 10, 25, or 100 entries per page and specify a page in the alphabetical order of one of the sort keys. The worst case is that the user does not specify any other filter criteria. If she does, then things are fast enough in all my realistic scenarios, which involve only one or two databases, so around 7,000 index entries. I implemented getting a page out of all entries with sorting and subsequence, but that way it takes 8 s or more for the first page to show. That is too long.
Using this code
declare namespace _ = "https://www.oeaw.ac.at/acdh/tools/vle/util";
let $all := collection("_qdb-TEI-02__cache")//_:dryed[@order='ascending' and not(@label)]/tokenize(@ids)
return db:attribute("_qdb-TEI-02__cache", subsequence($all, 1000, 25))
  [. instance of attribute(ID) or . instance of attribute(xml:id)]
  /.. ! db:open-pre(./@db_name, ./@pre)
showing a page takes about 500 ms.
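The two-step idea behind this (slice the precomputed sorted id list first, then resolve only the ids on the requested page) can be sketched outside XQuery; the id format and counts below are made up:

```python
# Sketch of the two-step paging lookup with made-up ids: step 1 slices the
# precomputed, sorted id list; step 2 would then resolve only the few ids
# on the requested page (db:attribute / db:open-pre in the thread), so the
# cost is driven by the page size, not by the total number of entries.
all_ids = [f"id{i:06d}" for i in range(100_000)]  # stands in for tokenize(@ids)

def page(ids, page_no, per_page):
    """Return the ids of one result page (0-based page number)."""
    start = page_no * per_page
    return ids[start:start + per_page]

# analogous to subsequence($all, 1000, 25) in the query above
ids_on_page = page(all_ids, 40, 25)
print(ids_on_page[0])  # → id001000
```

Only the slicing is shown; the per-id node lookup stays database-specific.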
Best regards
Omar Siam
Hi Omar,
I am not 100% sure what redundant expressions you saw in my code. Is this about using reverse() instead of having two for loops?
In your initial query, the path…
collection('_qdb-TEI-02__cache')//*[@order="none"]/_:d
…was evaluated four times. If you bind it to a variable, it will only be evaluated once. In addition, using child steps instead of // is faster, too (in many cases, BaseX will rewrite your path for you).
I don't quite get how I would do incremental changes to the entries ordered by a key. I do an incremental update by just getting the updated pre values for the database that was changed. That is reasonably fast, even with incremental attribute index updates.
Just two ideas: You can store the data sets of your main database in a pre-sorted fashion. Incremental entries can be sorted on-the-fly in your query, and the results can then be merged with the sorted entries of the main database. Another approach is to store the references and the index keys in your index database. The incremental entries can be merged with the sorted index entries (by looking at the index keys, which are available in both data structures).
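The merge idea can be illustrated outside XQuery; the data shapes here are assumed (simplified ASCII keys, an invented id for the incremental entry), not actual BaseX structures:

```python
from heapq import merge

# Sketch of the merge idea with assumed, simplified data: the main index is
# already sorted by key, the few incremental entries are sorted on the fly,
# and the two sorted runs are combined in a single linear pass instead of
# re-sorting everything.
main_index = [
    ("(aus)faren [Verb]", "f237_qdb-d1e29398"),
    ("tsittr Ziter [Subst]", "z881_qdbn-d16e2"),
]                                                     # pre-sorted (key, id) pairs
incremental = sorted([("faren [Verb]", "new-d1e1")])  # today's changes

# heapq.merge keeps the combined sequence sorted in O(n) time
merged_ids = [i for _, i in merge(main_index, incremental,
                                  key=lambda e: e[0])]
print(merged_ids)  # → ['f237_qdb-d1e29398', 'new-d1e1', 'z881_qdbn-d16e2']
```

The same linear merge works on any two key-sorted runs, which is what makes the nightly-merge scheme cheap.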
I was not sure what is a lot of data in BaseX.
True, that’s difficult to tell in general; it always depends on the context.
... ! db:open-pre(./@db_name, ./@pre)
In BaseX 9.3, it will be possible to supply integer sequences as second argument; this may speed up your query a little.
Best, Christian
Hi Christian,
Am 13.11.2019 um 18:38 schrieb Christian Grün:
Hi Omar,
I am not 100% sure what redundant expressions you saw in my code. Is this about using reverse() instead of having two for loops?
In your initial query, the path…
collection('_qdb-TEI-02__cache')//*[@order="none"]/_:d
…was evaluated four times. If you bind it to a variable, it will only be evaluated once. In addition, using child steps instead of // is faster, too (in many cases, BaseX will rewrite your path for you).
I always try to make the query optimizer's job as easy as possible, and that makes things fast most of the time. I think the statements were optimized to db:attribute(..., 'none'), so // was actually never used. My current approach looks like this as an optimized query:
let $ds_0 := db:attribute("_qdb-TEI-02__cache", "none")/self::order/parent::element()/_:d
let $sorted-ascending_1 :=
  for $d_2 in $ds_0
  order by data($d_2/@vutlsk) empty least
  return $d_2
let $sorted-ascending-archiv_3 :=
  for $d_4 in $ds_0
  order by data($d_4/@vutlsk-archiv) empty least
  return $d_4
return (
  db:replace("_qdb-TEI-02__cache", "ascending_cache.xml",
    element Q{https://www.oeaw.ac.at/acdh/tools/vle/util}dryed {
      (attribute order { ("ascending") },
       attribute ids { (string-join(subsequence($sorted-ascending_1, 1, 15000)/((@ID, @xml:id)), " ")) })
    }),
  db:replace("_qdb-TEI-02__cache", "descending_cache.xml",
    element Q{https://www.oeaw.ac.at/acdh/tools/vle/util}dryed {
      (attribute order { ("descending") },
       attribute ids { (string-join(subsequence(reverse($sorted-ascending_1), 1, 15000)/((@ID, @xml:id)), " ")) })
    }),
  db:replace("_qdb-TEI-02__cache", "ascending-archiv_cache.xml",
    element Q{https://www.oeaw.ac.at/acdh/tools/vle/util}dryed {
      (attribute order { ("ascending") },
       attribute label { ("archiv") },
       attribute ids { (string-join(subsequence($sorted-ascending-archiv_3, 1, 15000)/((@ID, @xml:id)), " ")) })
    }),
  db:replace("_qdb-TEI-02__cache", "descending-archiv_cache.xml",
    element Q{https://www.oeaw.ac.at/acdh/tools/vle/util}dryed {
      (attribute order { ("descending") },
       attribute label { ("archiv") },
       attribute ids { (string-join(subsequence(reverse($sorted-ascending-archiv_3), 1, 15000)/((@ID, @xml:id)), " ")) })
    })
)
It is interesting to hear that BaseX does not profit from // expressions. I think this is one thing your competing open source XML DB stresses in its docs: to always use as few steps in an XPath as possible.
I don't quite get how I would do incremental changes to the entries ordered by a key. I do an incremental update by just getting the updated pre values for the database that was changed. That is reasonably fast, even with incremental attribute index updates.
Just two ideas: You can store the data sets of your main database in a pre-sorted fashion. Incremental entries can be sorted on-the-fly in your query, and the results can then be merged with the sorted entries of the main database.
Document order matters to me so I can't sort the main DB. At least not in this dataset.
Another approach is to store the references and the index keys in your index database. The incremental entries can be merged with the sorted index entries (by looking at the index keys, which are available in both data structures).
I tried to store the _:d tags sorted by key, ascending and descending, once. That makes 2.4 Mio × number of keys (perhaps × 2) tags in the database. Writing this of course took much longer, so a complete or initial index generation took up to five minutes. I think that is not worth it.
Efficiently merging by looking at the index keys is a problem, because I join all the @xml:ids that identify an entry into that one long @ids attribute, so I lose the relation between the key and the id. I did that because it was the fastest way to write this data to the db; everything else I tried was much slower. And tokenize(@ids) is remarkably fast: even if all 2.4 Mio ids are in there, it is really fast. Just writing out 2.4 Mio individual ids to the database is slow.
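If the key–id relation ever becomes necessary, one hypothetical workaround would be a second, parallel attribute holding the keys in the same order as @ids. Since the keys contain spaces, they would need a different separator; a newline is assumed in this sketch:

```python
# Hypothetical sketch: a second attribute stores the sort keys in the same
# order as the space-joined @ids value. The keys contain spaces, so a
# newline is assumed as their separator. Position then links key and id
# again, which would make a merge by index key possible.
ids_attr  = "f237_qdb-d1e29398 z881_qdbn-d16e2"        # like the @ids value
keys_attr = "(aus)faren [Verb]\ntsittr Ziter [Subst]"  # parallel key list

ids  = ids_attr.split(" ")    # mirrors tokenize(@ids)
keys = keys_attr.split("\n")
key_to_id = dict(zip(keys, ids))
print(key_to_id["tsittr Ziter [Subst]"])  # → z881_qdbn-d16e2
```

Whether the extra attribute slows down the bulk write as much as storing individual tags did would need measuring.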
... ! db:open-pre(./@db_name, ./@pre)
In BaseX 9.3, it will be possible to supply integer sequences as second argument; this may speed up your query a little.
I'll give it a try.
But I have to say that a query like "get me all entries whose ids start with s800, sorted by some key", using this code:
declare namespace _ = "https://www.oeaw.ac.at/acdh/tools/vle/util";
for $key in db:attribute("_qdb-TEI-02__cache",
    index:attributes("_qdb-TEI-02__cache", 's800'))[. instance of attribute(xml:id)]
order by $key/../@vutlsk ascending
where starts-with($key/../@xml:id, 's800')
return db:open-pre($key/../@db_name, $key/../@pre)
only takes 140 ms for about 3,900 entries. Unfortunately, starts-with(@xml:id, 's800') is not optimized in that way automatically.
Best regards
Omar Siam
basex-talk@mailman.uni-konstanz.de