What I like about BaseX is that it is very good at optimizing
self-contained queries about the size a user can read and understand [1]
[2] and that it has a DB locking system for transaction management [3]
that is robust and easy to understand.
What I don’t like so much about BaseX is that these two mechanisms don’t
work very well with complex code that is split into various modules. I
use modules for code that may be shared among projects or just as a
means of grouping common concerns in one module.
That I don’t like this behaviour does not mean I know (or have any hope)
that this can be solved in a better way without at least making
unpleasant sacrifices elsewhere. It is just the setting I have to deal
with.
When BaseX can no longer determine which DBs a query uses, it falls
back to assuming there are no indexes, so automatic optimization in
this regard stops, and it assumes that all DBs known to BaseX are used
in that query, so it acquires a global lock. [4]
With read-only queries this is not much of a problem. Using indexes in
queries can be forced with functions or with the db:enforceindex
pragma [5].
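As an illustration of the pragma, here is a minimal sketch modelled on
the example in the BaseX documentation [5]. The database names and the
element and attribute names are made up for illustration, and db:open is
the BaseX 9 name of the function (BaseX 10 renamed it to db:get).

```xquery
(: Assumption: databases "dict_1" to "dict_3" exist and have an attribute index. :)
(# db:enforceindex #) {
  for $n in 1 to 3
  let $db := 'dict_' || $n
  return db:open($db)//*:entry[@xml:id = 'entry_0001']
}
```

The pragma only asks the optimizer to rewrite the path for index access
even though the database name is computed at runtime. It is not meant as
a way to narrow down the locks.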
Problems start to show when trying to implement a CRUD RestXQ
application. Create, update and delete can be implemented using the
XQuery Update standard, but this gets slow and cumbersome when for many
read operations it cannot be determined which DBs they use, so a global
read lock is held. That means that no global write lock can be acquired
until all read operations are finished on all DBs known to a BaseX
instance.
This is especially problematic when one instance of BaseX with a RestXQ
application is used to serve data from independent databases. Say one
instance of BaseX has a RestXQ API that serves a lot of different
dictionaries for different natural languages. This is my use case.
Although the content of dictionary entries is different, the parts in
the TEI/XML I try to manipulate, that are created, read, updated or
deleted, are the same. So, a common API should handle many independent
dictionaries, edited by many users, using one instance of BaseX.
Also, when working with my biggest XML database of several GB I ran into
problems when reindexing after an update. Reindexing all those GB of
data takes too long and makes small updates in there impossible.
Why not multiple instances of BaseX? For better or worse, BaseX runs in
a JVM, and even after I tried to minimize the memory footprint, an idle
BaseX still needs a little less than 300 MB. We run a lot of services
here on shared servers, so RAM usage matters. RAM usage is also part of
the cost when using commercial cloud services, although not running
BaseX at all while it is unused is best if you pay per minute. Also, as
recently discussed on the list, BaseX, like any Java program, gets
optimized by the JVM while running. Those optimizations, as well as
caching, benefit all the data hosted in one instance but would, I
assume, be less effective with multiple instances.
So how do I achieve four goals:
* Keep the XQuery short and concise because that is what the optimizer
can handle best?
* Keep the code separated into modules that each deal with one
particular aspect?
* Use RestXQ and not another technique to actually implement the RESTful
API?
* All this while being able to split GB of XML data into portions that
can be reindexed in a reasonable amount of time?
The two things that help a lot here are eval functions like xquery:eval
[6] and string constructors [7].
Say, I want to run a query but on different collections (databases). I
can do this by having a list of collections and executing the actual
query in a for loop with the concrete collection as a variable.
If I just write the XQuery code down like this, the problem is that the
optimizer would need to evaluate the query to find out which databases
to lock and what indexes can be used. BaseX is not built to do this
(yet). It does not do a mock run of the query. So, it decides that a
global lock needs to be used. Depending on the use of XQuery Update,
either a global write lock or a global read lock is acquired. This is
easy to understand but does not help with performance here.
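To make the problematic pattern concrete, here is a minimal sketch of
such a loop. The "settings" database, its dict/@name structure and the
entry ID are hypothetical; db:open is the BaseX 9 name (db:get in
BaseX 10).

```xquery
(: Hypothetical layout: a "settings" database lists the dictionary databases. :)
for $db in db:open('settings')//*:dict/@name/string()
(: $db is only known at runtime, so the compiler cannot tell which
   databases this expression will open. :)
return db:open($db)//*:entry[@xml:id = 'entry_0001']
```

Only 'settings' is passed as a static string here. According to the
limitations described in [4], the databases opened through $db cannot be
detected in advance, so the whole query falls back to a global lock.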
If I want to make the situation worse for the optimizer I can use
xquery:eval. That of course makes the XQuery code totally opaque to the
optimizer. A global lock is guaranteed.
Still, another eval function is a solution here: jobs:eval from the
Jobs Module [8].
If I break up my code into jobs only these jobs hold locks for as long
as they run. This can be a much shorter period of time than what it
takes to run a whole RestXQ request. It is also possible to find a place
that needs to be changed in a number of databases and then only write
lock one of them to change something.
So, if my data is stored in not one but several database files I can
make them look like one big XML for API purposes, but still have small
enough independent parts that can be indexed separately so updates with
reindexing are relatively fast.
If I have a search I want to perform on parts of databases that are in
principle independent, like dictionary entries in a large dictionary, I
can do this in parallel on each database.
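A minimal sketch of such a parallel search with plain jobs:eval and
jobs:result could look like this. The database names, the TEI-like
element names and the search term are assumptions for illustration, and
db:open is the BaseX 9 name (db:get in BaseX 10). The string constructor
writes each database name into the generated query as a static string,
which is what should let BaseX lock just that one database for the job.

```xquery
(: Hypothetical names: databases dict_a/dict_b, TEI-like entry/form/orth elements. :)
let $dbs := ('dict_a', 'dict_b')
let $ids :=
  for $db in $dbs
  (: The database name ends up as a static string in the generated query text. :)
  let $query := ``[
    declare variable $needle external;
    db:open("`{ $db }`")//*:entry[*:form/*:orth = $needle]
  ]``
  (: The search term is passed as a binding, not pasted into the query text. :)
  return jobs:eval($query, map { 'needle': 'Haus' }, map { 'cache': true() })
return (
  (: Wait for all jobs, then fetch the cached results. :)
  $ids ! jobs:wait(.),
  $ids ! jobs:result(.)
)
```

Passing the search term as an external variable keeps user input out of
the generated code. Only the database name has to be part of the query
text, because that is the part the lock detection needs to see.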
I tried to implement this idea with jobs:eval and it actually worked
very well. Only the interface of the function was cumbersome for the
way I wanted to use it.
So, I wrote a wrapper around jobs:eval and jobs:wait that makes it easy
to generate small self-contained XQuery code [9] using String
Constructors [10] and some other functions used for querying the
structure of the data stored in BaseX like listing and filtering
databases by name [11].
Another goal for this util:eval(s) function was to make it still easy
to see errors [12].
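The error handling idea can be sketched roughly as follows. This is a
hypothetical helper (local:job-result), not the util:eval function
referenced in [9]. It relies on jobs:result re-raising an error that
occurred inside the job, as the Jobs Module documentation [8] describes.

```xquery
(: Hypothetical helper, not the util:eval function referenced in [9]. :)
declare function local:job-result(
  $query as xs:string
) as item()* {
  try {
    let $id := jobs:eval($query, (), map { 'cache': true() })
    return (
      jobs:wait($id),
      (: Raises the job's error if the job failed. :)
      jobs:result($id)
    )
  } catch * {
    (: Re-raise with the generated query text attached. :)
    error($err:code, $err:description || ' in job query: ' || $query)
  }
};

(: Usage: a failing query should surface its error together with its text. :)
local:job-result('1 div 0')
```

Attaching the generated query to the message is a design choice for
debugging: with code produced by string constructors, the failing text
is otherwise not visible anywhere in the module source.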
A typical use is something like: run a filter query in all databases
[13] that are found using a database name filter in some settings
database [14] and use a string for comparison from a request URL
parameter [15].
Another: find an entry out of a few million and replace it with an
updated version. Of course, with reindexing [16].
What were some (unexpected) problems?
Because jobs, and especially write jobs, now lock databases while the
RestXQ code is running, the RestXQ code itself must not hold any read
or write lock. That is possible in BaseX, but some functions force a
global lock, for example db:list. I think there are good reasons why
you want a global lock, and therefore atomicity, during a query that
asks for the list of databases. Of course, my code happens to need to
list databases quite often, and it should not keep holding a global
lock after getting the list. My list of databases may change during a
RestXQ call, but I don’t care about that situation yet; I think it does
not matter to me.
There is now also a simple solution: outsource db:list to its own job [17].
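With plain jobs:eval (not the util:eval wrapper from [9]), a minimal
sketch of this idea might look like:

```xquery
(: The calling query only starts a job and reads its cached result. :)
let $id := jobs:eval('db:list()', (), map { 'cache': true() })
return (
  jobs:wait($id),
  jobs:result($id)
)
```

The global lock that db:list needs is then held only by the short-lived
job and not by the RestXQ function that asked for the list.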
I also remember there was a problem with an automatic conversion of
RestXQ parameters creating a randomly named lock. But it was no problem
to do the conversion explicitly in XQuery code and so have the RestXQ
code not hold any lock again.
Now there already was a question on the mailing list about BaseX
behaviour in a multithreaded environment. I don’t use that BaseX.jar in
such a way with my own Java code but jobs are (Java) threads. And the
interesting thing here is now that with a lot of threads (say 700) that
don’t lock each other, a bottleneck shows in the way BaseX handles file
access. At least the Java profiler showed me this as a primary source of
wasted time [18].
If I get it correctly then file access is as usual done in 4KB portions
which are read into a buffer and smaller parts are accessed from there.
This way is by far the most efficient way to do this on any current
operating system and file system. But now this buffer’s handling needs
protection from the buffer being manipulated in different threads.
All I found in the JDK for this was the Java NIO system [19], which is
a performance nightmare: it tries to guarantee quite a few
thread-related consistencies [20] and seems really slow. This seems to
be a well-known fact documented numerous times on the internet [21]. I
also took one of the tests BaseX contains [22] and tried to use
FileChannel instead of the current RandomAccessFile-based
implementation, and I found the documented behaviour: Java NIO file
classes are no replacement for the current implementation when it comes
to performance.
Looking at other databases I saw they implement something OS dependent
but it is hard to compare [23].
[1] https://docs.basex.org/wiki/Indexes
[2] https://docs.basex.org/wiki/XQuery_Optimizations
[3] https://docs.basex.org/wiki/Transaction_Management
[4] https://docs.basex.org/wiki/Transaction_Management#Limitations
[5] https://docs.basex.org/wiki/Indexes#Enforce_Rewritings
[6] https://docs.basex.org/wiki/XQuery_Module#xquery:eval
[7] https://www.w3.org/TR/2017/REC-xquery-31-20170321/#id-string-constructors
[8] https://docs.basex.org/wiki/Jobs_Module#jobs:eval
[9] https://github.com/acdh-oeaw/vleserver_basex/blob/main/vleserver/util.xqm#L50-L69
[10] https://github.com/acdh-oeaw/vleserver_basex/blob/main/vleserver/data/access.xqm#L76-L93
[11] https://github.com/acdh-oeaw/vleserver_basex/blob/main/vleserver/data/profile.xqm#L152-L158
[12] https://github.com/acdh-oeaw/vleserver_basex/blob/main/vleserver/util.xqm#L91-L97
[13] https://github.com/acdh-oeaw/vleserver_basex/blob/main/vleserver/data/access.xqm#L112-L119
[14] https://github.com/acdh-oeaw/vleserver_basex/blob/main/vleserver/data/access.xqm#L121-L144
[15] https://github.com/acdh-oeaw/vleserver_basex/blob/main/vleserver/entries.xqm#L93-L142
[16] https://github.com/acdh-oeaw/vleserver_basex/blob/main/vleserver/data/access.xqm#L329-L345
[17] https://github.com/acdh-oeaw/vleserver_basex/blob/main/vleserver/dicts.xqm#L49
[18] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/io/random/DataAccess.java#L184
and other read methods there
[19] https://blogs.oracle.com/javamagazine/post/java-nio-nio2-buffers-channels-async-future-callback
[20] https://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html
"The view of a file provided by an instance of this class is guaranteed
to be consistent with other views of the same file provided by other
instances in the same program."
[21] https://www.mathematik.uni-marburg.de/~alexmaurer/files/NioVsIo.pdf
as an example. Maybe more recent evaluations of Java 11 or 17 NIO or
NIO.2 performance look better?
[22] https://github.com/BaseXdb/basex/blob/master/basex-core/src/test/java/org/basex/io/random/DataAccessTest.java
[23] https://github.com/neo4j/neo4j/blob/4.4/community/native/src/main/java/org/neo4j/internal/nativeimpl/LinuxNativeAccess.java
Best regards
--
Mag. Ing. Omar Siam
Austrian Center for Digital Humanities and Cultural Heritage
Österreichische Akademie der Wissenschaften | Austrian Academy of Sciences
Stellvertretende Behindertenvertrauensperson | Deputy representative for disabled persons
Wohllebengasse 12-14, 1040 Wien, Österreich | Vienna, Austria
T: +43 1 51581-7295
omar.siam@oeaw.ac.at | www.oeaw.ac.at/acdh