Hello,
This is essentially part 2 of trying to index large amounts of web data.
To summarize what happened before: the initial discussion started in [1], Christian suggested some options, and as I dove into each of them I realized that doing this on a low-memory system is harder than I initially thought.
At Christian's suggestion, I tried to split the big db into smaller dbs and came up with a rudimentary sharding mechanism [3].
All my attempts to build a full-text index over 30GB of data in BaseX resulted in OOM (do take into consideration that I only have 3.1GB of memory to allocate to BaseX).
Where to?
I decided to look more into what Christian said in [2] about option 2: pick the exact values that I want and transfer them to PostgreSQL
(after the transfer, a GiST index would have to be built there to allow full-text searches; PostgreSQL is picked because it performs all large operations in a bounded in-memory buffer backed by several files on disk, spilling to disk whenever intermediate results exceed the available memory, so at all times it stays within the given amount of memory).
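For illustration, the index I have in mind would be created along these lines; the posts table and message column are placeholder names, and the statement could just as well be run from psql instead of through the BaseX SQL module:

  (: placeholder JDBC URL; builds the full-text index after the transfer :)
  let $conn := sql:connect("jdbc:postgresql://localhost:5432/forum")
  return (
    sql:execute($conn,
      "CREATE INDEX posts_fts_idx ON posts" ||
      " USING GIST (to_tsvector('english', message))"),
    sql:close($conn)
  )

(As far as I know, GIN is usually recommended for mostly-static data because lookups are faster, but GiST builds faster, which matters with these memory constraints.)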
Variant 1 (see attached script pg-import.sh)
All good. So I basically started writing XQuery that would do the following (a simplified sketch follows the list):
- Open up a JDBC connection to PostgreSQL
- Get me all text content from each thread page of the forum, and the db it belonged to
- Create a prepared statement for one such thread page, populate the prepared statement, and execute it
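Stripped down, the core of it looked roughly like this (connection URL, credentials, table name, the 'shard' naming and the thread-page element are placeholders, not the exact names from pg-import.sh):

  let $conn := sql:connect("jdbc:postgresql://localhost:5432/forum", "user", "pass")
  (: one connection and one prepared statement, reused for every row :)
  let $prep := sql:prepare($conn, "INSERT INTO pages(db, content) VALUES (?, ?)")
  return (
    for $db in db:list()[starts-with(., 'shard')]
    for $page in db:open($db)//*:thread-page
    return sql:execute-prepared($prep,
      <sql:parameters>
        <sql:parameter type='string'>{ $db }</sql:parameter>
        <sql:parameter type='string'>{ string-join($page//text(), ' ') }</sql:parameter>
      </sql:parameters>),
    sql:close($conn)
  )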
This ended up in OOM after around 250k records. Just to be clear: 250k rows did make it into PostgreSQL, which is nice, but eventually the query ended up in OOM. (Perhaps it has to do with how the GC works in Java; I don't know.)
Variant 2 (see attached script pg-import2.sh)
I did something similar to the above (again, a sketch follows the list):
- Open up a JDBC connection to PostgreSQL
- Get all posts, and for each post get the author, the date, the message content, the post id, and the BaseX db name (because we're going over all shards, and each shard is a BaseX db)
- Create a prepared statement for each post with the data mentioned above, and execute it
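The corresponding core, again with placeholder names (the post structure below is made up for illustration; pg-import2.sh has the real paths):

  let $conn := sql:connect("jdbc:postgresql://localhost:5432/forum", "user", "pass")
  let $prep := sql:prepare($conn,
    "INSERT INTO posts(db, post_id, author, posted_at, message)" ||
    " VALUES (?, ?, ?, ?, ?)")
  return (
    for $db in db:list()[starts-with(., 'shard')]
    for $post in db:open($db)//*:post
    return sql:execute-prepared($prep,
      <sql:parameters>
        <sql:parameter type='string'>{ $db }</sql:parameter>
        <sql:parameter type='string'>{ $post/@id/string() }</sql:parameter>
        <sql:parameter type='string'>{ $post/*:author/string() }</sql:parameter>
        <sql:parameter type='string'>{ $post/*:date/string() }</sql:parameter>
        <sql:parameter type='string'>{ string($post/*:message) }</sql:parameter>
      </sql:parameters>),
    sql:close($conn)
  )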
This also ended up in OOM, after around 340k records (my rough estimate is that there are around 3M posts in the data).
To summarize, I'm tempted to believe that there might be a leak in the BaseX implementation of XQuery.
Here are the relevant versions of the software used:
- BaseX 9.2.4
- java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
- the JVM memory param value was -Xmx3100m
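(In case it helps with reproducing: a limit like this can be set via the BASEX_JVM environment variable, which the bin/basex startup scripts pass on to the JVM.)

  # assumption: the attached scripts go through the standard bin/basex wrapper
  export BASEX_JVM="-Xmx3100m"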
I would be interested to know your thoughts.
Thanks,