Hello,
This is essentially part 2 of trying to index large amounts of web data. To summarize what happened before: the initial discussion started in [1], Christian suggested some options, and I dove into each of them, realizing that doing this on a low-memory system is harder than I initially thought. At Christian's suggestion, I tried to split the big database into smaller ones and came up with a rudimentary sharding mechanism [3]. All attempts to full-text index 30GB of data in BaseX ended, for me, in OOM (do take into consideration that I only have 3.1GB of memory to allocate to BaseX).
Where to now? I decided to look more into what Christian said in [2] about option 2: pick exactly the values that I want and transfer them to PostgreSQL. After transferring, a GiST index would have to be built there to allow full-text searches. PostgreSQL was picked because it uses an in-memory buffer for all large operations plus several files on disk; if it needs to combine results that exceed the available memory, it spills to disk, but at no point does it exceed the given amount of memory.
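To make the target concrete, the PostgreSQL side I have in mind is roughly this (hypothetical table and column names; run here through BaseX's SQL Module, but plain psql would do just as well):

  let $c := sql:connect("jdbc:postgresql://localhost:5432/forum", "basex", "basex")
  return (
    sql:execute($c,
      "CREATE TABLE IF NOT EXISTS posts (
         post_id text, db_name text, author text, posted text, message text)"),
    (: expression index for full-text search, as described in the PostgreSQL docs :)
    sql:execute($c,
      "CREATE INDEX IF NOT EXISTS posts_fts
         ON posts USING gist (to_tsvector('english', message))"),
    sql:close($c)
  )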
Variant 1 (see attached script pg-import.sh)

So, I basically started writing XQuery that would do the following:
- open a JDBC connection to PostgreSQL
- get all text content from each thread page of the forum, together with the db it belonged to
- create a prepared statement for one such thread page, populate it, and execute it

This ended up in OOM after around 250k records. Just to be clear, those 250k records did make it into PostgreSQL as rows, which is nice, but eventually the run ended in OOM. (Perhaps it has to do with how the GC works in Java; I don't know.)
Variant 2 (see attached script pg-import2.sh)

I did something similar to the above:
- open a JDBC connection to PostgreSQL
- get all posts, and for each post get the author, the date, the message content, the post id, and the BaseX db name (because we're going over all shards, and each shard is a BaseX db)
- create a prepared statement for each post with the data mentioned above, and execute it

This also ended up in OOM, after around 340k records (my estimate is that there are around 3M posts in the data).
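For reference, the core of Variant 2 is roughly the following (untested sketch, not the attached script verbatim; the posts table, the //post path and its child elements are placeholders for my actual schema):

  let $conn := sql:connect("jdbc:postgresql://localhost:5432/forum", "basex", "basex")
  let $stmt := sql:prepare($conn,
    "INSERT INTO posts (author, posted, message, post_id, db_name) VALUES (?, ?, ?, ?, ?)")
  return (
    for $dbname in db:list()                (: one BaseX db per shard :)
    for $post in db:open($dbname)//post
    return sql:execute-prepared($stmt,
      <sql:parameters>
        <sql:parameter type='string'>{ string($post/author) }</sql:parameter>
        <sql:parameter type='string'>{ string($post/date) }</sql:parameter>
        <sql:parameter type='string'>{ string($post/message) }</sql:parameter>
        <sql:parameter type='string'>{ string($post/@id) }</sql:parameter>
        <sql:parameter type='string'>{ $dbname }</sql:parameter>
      </sql:parameters>),
    sql:close($conn)
  )

The PostgreSQL JDBC driver jar has to be on BaseX's classpath for sql:connect to work.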
To summarize, I'm tempted to believe that there might be a leak in the BaseX implementation of XQuery. The relevant versions of the software used:
- BaseX 9.2.4
- java version "1.8.0_151", Java(TM) SE Runtime Environment (build 1.8.0_151-b12), Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
- JVM memory parameter: -Xmx3100m
I would be interested to hear your thoughts.
Thanks, Stefan
[1] https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-September/014715.h...
[2] https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-October/014727.htm...
[3] https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-October/014729.htm...
Some complementary notes (others may be able to tell you more about their experiences with large data sets):
> a GiST index would have to be built there, to allow full-text searches;
> PostgreSQL is picked

You could also have a look at Elasticsearch or its predecessors.

> there might be a leak in the BaseX implementation of XQuery.
I assume you are referring to the SQL Module? Feel free to attach the OOM stack trace; it might give us more insight.

I would recommend writing SQL commands or an SQL dump to disk (see the BaseX File Module for more information) and running/importing this file in a second step; this is probably faster than sending hundreds of thousands of single SQL commands via JDBC, no matter if you are using XQuery or Java.
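An untested sketch of what I mean (table and element names are just placeholders):

  let $lines :=
    for $dbname in db:list()
    for $post in db:open($dbname)//post
    return
      "INSERT INTO posts (author, message, db_name) VALUES (" ||
      string-join(
        for $v in (string($post/author), string($post/message), $dbname)
        return "'" || replace($v, "'", "''") || "'",
        ", "
      ) || ");"
  return file:write-text-lines("/tmp/posts-dump.sql", $lines)

The resulting file can then be imported in one go, e.g. with psql -f /tmp/posts-dump.sql.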
I'm currently at work and my setup is at home. In about 7 hours I'll get home and I will send the stack trace.
Meanwhile, is there any way to write a FLWOR loop in a batched style?
For example, in my case, the approach I described for migrating data from BaseX to PostgreSQL uses BaseX as an XQuery processor and hands the full-text indexing over to PostgreSQL; that is what I'm trying to do.
However, in order to avoid OOM, I am thinking of batching the transfer into chunks, and potentially restarting the BaseX server between the migration of each chunk. That's why I'm asking how I could do that in BaseX. My hope is that the OOM could be avoided this way, because not all the data would pass through main memory at once, and the JVM GC would have less data to deal with at any one time. Restarting the BaseX server between chunk transfers would help ensure that whatever memory was used is released.
So I wonder if something like (<insert-big-FLWOR-here>)[position() = <start> to <end>] would work here. Of course, some count would have to be done beforehand to know how many batches there will be. Or maybe, even without knowing the number of batches, a while-type loop could be written in Bash, with the stop condition being a check for whether the current chunk is empty.
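Something along these lines is what I have in mind (untested; paths and element names are placeholders, and $start/$size would be bound per chunk from the shell):

  (: batch.xq -- run one chunk per invocation, e.g.
     basex -b start=0 -b size=10000 batch.xq :)
  declare variable $start as xs:integer external;
  declare variable $size  as xs:integer external;

  let $chunk :=
    (for $dbname in db:list()
     for $post in db:open($dbname)//post
     return <p db="{$dbname}">{ $post }</p>
    )[position() = $start + 1 to $start + $size]
  return (
    count($chunk)      (: 0 means the Bash while-loop can stop :)
    (: ... insert the chunk via JDBC or append it to an SQL dump here ... :)
  )

Whether this actually keeps memory bounded probably depends on whether the inner FLWOR can be evaluated lazily; if the whole sequence gets materialized before the positional filter is applied, the OOM might remain.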
Would an approach like this work to mitigate the OOM? Are there alternatives or workarounds for this kind of OOM?
Thanks
On Mon, Oct 7, 2019 at 1:13 AM Christian Grün christian.gruen@gmail.com wrote:
> I would recommend writing SQL commands or an SQL dump to disk (see the BaseX File Module for more information) and running/importing this file in a second step; this is probably faster than sending hundreds of thousands of single SQL commands via JDBC, no matter if you are using XQuery or Java.
Ok, so I finally managed to reach a compromise between BaseX's capabilities and the hardware that I have at my disposal (for the time being). This message will probably answer thread [1] as well (which is separate from this one but seems to ask basically the same question: how to use BaseX as a command-line XQuery processor). The attached script takes a large collection of HTML documents, packs them into small "balanced" sets, and then runs XQuery on them using BaseX. The result is a set of SQL files ready to be imported into PostgreSQL (with some small tweaks, the data could be adapted to be imported into Elasticsearch).
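For those without the attachment, the per-set XQuery is roughly like this (simplified; the real script differs, the class names below are placeholders for my forum's markup, and html:parse needs TagSoup on the classpath):

  (: export.xq -- run per set, e.g.:  basex -b dir=/data/set-0001 export.xq > set-0001.sql :)
  declare variable $dir as xs:string external;

  for $name in file:list($dir, false(), "*.html")
  let $doc     := html:parse(file:read-text($dir || "/" || $name))
  let $author  := string(($doc//*[@class = 'author'])[1])
  let $message := string-join($doc//*[@class = 'message'], " ")
  return
    "INSERT INTO posts (author, message, source_file) VALUES ('" ||
    replace($author, "'", "''") || "', '" ||
    replace($message, "'", "''") || "', '" ||
    $name || "');"

GNU Parallel then simply starts one such BaseX process per set (that's the -j4 below), which is what keeps the peak memory per process low.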
I'm also including some benchmark data:
On system1 the following times were recorded: if run with -j4, it does 200 forum thread pages in 10 seconds, and there are about 5 posts on average per thread page. So in 85000 seconds (almost a day) it would process ~1.7M posts (in ~340k forum thread pages) and have them prepared to be imported into PostgreSQL. With -j4, the observed peak memory usage was 500MB.
I've tested the attached script on the following two systems:

system1 config:
- BaseX 9.2.4
- script (from util-linux 2.31.1)
- GNU Parallel 20161222
- Ubuntu 18.04 LTS

system1 hardware:
- cpu: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz (4 cores)
- memory: 16GB DDR3 RAM, 2 x Kingston @ 1333 MT/s
- disk: WDC WD30EURS-73TLHY0 @ 5400-7200RPM

system2 config:
- BaseX 9.2.4
- GNU Parallel 20181222
- script (from util-linux 2.34)

system2 hardware:
- cpu: Intel(R) Celeron(R) CPU J1900 @ 1.99GHz (4 cores)
- memory: 4GB DDR RAM @ 1600MHz
- disk: HDD ST3000VN007-2E4166 @ 5900 rpm
[1] https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-October/014722.htm...
I was surprised to see the 16 GB RAM machine pop up in your setup. Did you check how many gigabytes of XML data can be full-text indexed with BaseX (and a large -Xmx value, maybe 15g) on that system?