Hello,
This is essentially part 2 of trying to index large amounts of web data. To summarize what happened before: the initial discussion started in [1], Christian suggested some options, and I dove into each of them, realizing that doing this on a low-memory system is harder than I initially thought. At Christian's suggestion, I tried to split the big database into smaller ones and came up with a rudimentary sharding mechanism [3]. All attempts to full-text index 30GB of data in BaseX ended, for me, in OOM (do take into consideration that I only have 3.1GB of memory to allocate to BaseX).
Where to now? I decided to look more into what Christian said in [2] about option 2: pick exactly the values that I want and transfer them to PostgreSQL. After transferring, a GiST index would have to be built there to allow full-text searches. PostgreSQL was picked because it uses an in-memory buffer for all large operations plus several files on disk; if it needs to combine results that exceed the available memory, it spills to disk, but at no point does it exceed the given amount of memory.
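To make the target concrete, the PostgreSQL side I have in mind is roughly this (hypothetical table and column names; run here through BaseX's SQL Module, but plain psql would do just as well):

  let $c := sql:connect("jdbc:postgresql://localhost:5432/forum", "basex", "basex")
  return (
    sql:execute($c,
      "CREATE TABLE IF NOT EXISTS posts (
         post_id text, db_name text, author text, posted text, message text)"),
    (: expression index for full-text search, as described in the PostgreSQL docs :)
    sql:execute($c,
      "CREATE INDEX IF NOT EXISTS posts_fts
         ON posts USING gist (to_tsvector('english', message))"),
    sql:close($c)
  )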
Variant 1 (see attached script pg-import.sh)

So, I basically started writing XQuery that would do the following:
- open a JDBC connection to PostgreSQL
- get all text content from each thread page of the forum, together with the db it belonged to
- create a prepared statement for one such thread page, populate it, and execute it

This ended up in OOM after around 250k records. Just to be clear, those 250k records did make it into PostgreSQL as rows, which is nice, but eventually the run ended in OOM. (Perhaps it has to do with how the GC works in Java; I don't know.)
Variant 2 (see attached script pg-import2.sh)

I did something similar to the above:
- open a JDBC connection to PostgreSQL
- get all posts, and for each post get the author, the date, the message content, the post id, and the BaseX db name (because we're going over all shards, and each shard is a BaseX db)
- create a prepared statement for each post with the data mentioned above, and execute it

This also ended up in OOM, after around 340k records (my estimate is that there are around 3M posts in the data).
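For reference, the core of Variant 2 is roughly the following (untested sketch, not the attached script verbatim; the posts table, the //post path and its child elements are placeholders for my actual schema):

  let $conn := sql:connect("jdbc:postgresql://localhost:5432/forum", "basex", "basex")
  let $stmt := sql:prepare($conn,
    "INSERT INTO posts (author, posted, message, post_id, db_name) VALUES (?, ?, ?, ?, ?)")
  return (
    for $dbname in db:list()                (: one BaseX db per shard :)
    for $post in db:open($dbname)//post
    return sql:execute-prepared($stmt,
      <sql:parameters>
        <sql:parameter type='string'>{ string($post/author) }</sql:parameter>
        <sql:parameter type='string'>{ string($post/date) }</sql:parameter>
        <sql:parameter type='string'>{ string($post/message) }</sql:parameter>
        <sql:parameter type='string'>{ string($post/@id) }</sql:parameter>
        <sql:parameter type='string'>{ $dbname }</sql:parameter>
      </sql:parameters>),
    sql:close($conn)
  )

The PostgreSQL JDBC driver jar has to be on BaseX's classpath for sql:connect to work.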
To summarize, I'm tempted to believe that there might be a leak in the BaseX implementation of XQuery. The relevant versions of the software used:
- BaseX 9.2.4
- java version "1.8.0_151", Java(TM) SE Runtime Environment (build 1.8.0_151-b12), Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
- JVM memory parameter: -Xmx3100m
I would be interested to hear your thoughts.
Thanks, Stefan
[1] https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-September/014715.h...
[2] https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-October/014727.htm...
[3] https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-October/014729.htm...
Some complementary notes (others may be able to tell you more about their experiences with large data sets):
> a GiST index would have to be built there, to allow full-text searches;
> PostgreSQL is picked

You could also have a look at Elasticsearch or its predecessors.

> there might be a leak in the BaseX implementation of XQuery.
I assume you are referring to the SQL Module? Feel free to attach the OOM stack trace; it might give us more insight.

I would recommend writing SQL commands or an SQL dump to disk (see the BaseX File Module for more information) and running/importing this file in a second step; this is probably faster than sending hundreds of thousands of single SQL commands via JDBC, no matter if you are using XQuery or Java.
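An untested sketch of what I mean (table and element names are just placeholders):

  let $lines :=
    for $dbname in db:list()
    for $post in db:open($dbname)//post
    return
      "INSERT INTO posts (author, message, db_name) VALUES (" ||
      string-join(
        for $v in (string($post/author), string($post/message), $dbname)
        return "'" || replace($v, "'", "''") || "'",
        ", "
      ) || ");"
  return file:write-text-lines("/tmp/posts-dump.sql", $lines)

The resulting file can then be imported in one go, e.g. with psql -f /tmp/posts-dump.sql.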
I'm currently at work and my setup is at home. In about 7 hours I'll get home and I will send the stack trace.
Meanwhile, is there any way to write a FLWOR loop in a batched style?
For example, in my case, the approach I described for migrating data from BaseX to PostgreSQL uses BaseX as an XQuery processor and hands the full-text indexing over to PostgreSQL; that is what I'm trying to do.
However, in order to avoid OOM, I am thinking of batching the transfer into chunks, and potentially restarting the BaseX server between the migration of each chunk. That's why I'm asking how I could do that in BaseX. My hope is that the OOM could be avoided this way, because not all the data would pass through main memory at once, and the JVM GC would have less data to deal with at any one time. Restarting the BaseX server between chunk transfers would help ensure that whatever memory was used is released.
So I wonder if something like (<insert-big-FLWOR-here>)[position() = <start> to <end>] would work here. Of course, some count would have to be done beforehand to know how many batches there will be. Or maybe, even without knowing the number of batches, a while-type loop could be written in Bash, with the stop condition being a check for whether the current chunk is empty.
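Something along these lines is what I have in mind (untested; paths and element names are placeholders, and $start/$size would be bound per chunk from the shell):

  (: batch.xq -- run one chunk per invocation, e.g.
     basex -b start=0 -b size=10000 batch.xq :)
  declare variable $start as xs:integer external;
  declare variable $size  as xs:integer external;

  let $chunk :=
    (for $dbname in db:list()
     for $post in db:open($dbname)//post
     return <p db="{$dbname}">{ $post }</p>
    )[position() = $start + 1 to $start + $size]
  return (
    count($chunk)      (: 0 means the Bash while-loop can stop :)
    (: ... insert the chunk via JDBC or append it to an SQL dump here ... :)
  )

Whether this actually keeps memory bounded probably depends on whether the inner FLWOR can be evaluated lazily; if the whole sequence gets materialized before the positional filter is applied, the OOM might remain.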
Would an approach like this work to mitigate the OOM? Are there alternatives or workarounds for this kind of OOM?
Thanks
On Mon, Oct 7, 2019 at 1:13 AM Christian Grün christian.gruen@gmail.com wrote:
> I would recommend writing SQL commands or an SQL dump to disk (see the BaseX File Module for more information) and running/importing this file in a second step; this is probably faster than sending hundreds of thousands of single SQL commands via JDBC, no matter if you are using XQuery or Java.
Ok, so I finally managed to reach a compromise between BaseX's capabilities and the hardware that I have at my disposal (for the time being). This message will probably answer thread [1] as well (which is separate from this one but seems to ask basically the same question: how to use BaseX as a command-line XQuery processor). The attached script takes a large collection of HTML documents, packs them into small "balanced" sets, and then runs XQuery on them using BaseX. The result is a set of SQL files ready to be imported into PostgreSQL (with some small tweaks, the data could be adapted to be imported into Elasticsearch).
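For those without the attachment, the per-set XQuery is roughly like this (simplified; the real script differs, the class names below are placeholders for my forum's markup, and html:parse needs TagSoup on the classpath):

  (: export.xq -- run per set, e.g.:  basex -b dir=/data/set-0001 export.xq > set-0001.sql :)
  declare variable $dir as xs:string external;

  for $name in file:list($dir, false(), "*.html")
  let $doc     := html:parse(file:read-text($dir || "/" || $name))
  let $author  := string(($doc//*[@class = 'author'])[1])
  let $message := string-join($doc//*[@class = 'message'], " ")
  return
    "INSERT INTO posts (author, message, source_file) VALUES ('" ||
    replace($author, "'", "''") || "', '" ||
    replace($message, "'", "''") || "', '" ||
    $name || "');"

GNU Parallel then simply starts one such BaseX process per set (that's the -j4 below), which is what keeps the peak memory per process low.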
I'm also including some benchmark data:
On system1 the following times were recorded: if run with -j4, it does 200 forum thread pages in 10 seconds, and there are about 5 posts on average per thread page. So in 85000 seconds (almost a day) it would process ~1.7M posts (in ~340k forum thread pages) and have them prepared to be imported into PostgreSQL. With -j4, the observed peak memory usage was 500MB.
I've tested the attached script on the following two systems:

system1 config:
- BaseX 9.2.4
- script (from util-linux 2.31.1)
- GNU Parallel 20161222
- Ubuntu 18.04 LTS

system1 hardware:
- cpu: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz (4 cores)
- memory: 16GB DDR3 RAM, 2 x Kingston @ 1333 MT/s
- disk: WDC WD30EURS-73TLHY0 @ 5400-7200RPM

system2 config:
- BaseX 9.2.4
- GNU Parallel 20181222
- script (from util-linux 2.34)

system2 hardware:
- cpu: Intel(R) Celeron(R) CPU J1900 @ 1.99GHz (4 cores)
- memory: 4GB DDR RAM @ 1600MHz
- disk: HDD ST3000VN007-2E4166 @ 5900 rpm
[1] https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-October/014722.htm...
I was surprised to see the 16 GB RAM machine pop up in your setup. Did you check how many gigabytes of XML data can be full-text indexed with BaseX (and a large -Xmx value, maybe 15g) on that system?