On Mon, Oct 7, 2019 at 1:13 AM Christian Grün <christian.gruen@gmail.com> wrote:

I would recommend you to write SQL commands or an SQL dump to disk (see the BaseX File Module for more information) and run/import this file in a second step; this is probably faster than sending hundreds of thousands of single SQL commands via JDBC, no matter if you are using XQuery or Java.


OK, so I finally managed to reach a compromise between BaseX's capabilities and the hardware that I have at my disposal (for the time being).
This message will probably answer thread [1] as well (it is a separate thread, but it seems to ask basically the same question: how to use BaseX as a command-line XQuery processor).
The attached script takes a large collection of HTML documents, packs them into small "balanced" sets, and then runs XQuery on them using BaseX.
The result is a large number of SQL files ready to be imported into PostgreSQL (with some small tweaks, the data could be adapted for import into Elasticsearch).
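
As a rough illustration of what "balanced" packing can look like (this is not the exact logic of the attached script), here is a minimal Python sketch. It assumes that balance is measured by file size and uses a greedy largest-first strategy; pack_balanced and the sample file names and sizes are hypothetical.

```python
import heapq

def pack_balanced(sizes, n_sets):
    """Greedily assign each file (largest first) to the currently lightest set.

    sizes  -- dict mapping file name to size in bytes (assumed balance metric)
    n_sets -- number of sets to produce
    """
    # Min-heap of (total size so far, set index); the lightest set is on top.
    heap = [(0, i) for i in range(n_sets)]
    heapq.heapify(heap)
    sets = [[] for _ in range(n_sets)]
    # Sort by descending size; ties are broken by name for a stable result.
    for name, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        total, idx = heapq.heappop(heap)
        sets[idx].append(name)
        heapq.heappush(heap, (total + size, idx))
    return sets

# Hypothetical input: five HTML files with made-up sizes.
sizes = {"a.html": 90, "b.html": 60, "c.html": 50, "d.html": 40, "e.html": 10}
batches = pack_balanced(sizes, 2)
for batch in batches:
    print(batch, sum(sizes[name] for name in batch))
```

Each resulting set can then be handed to one worker, so that the parallel jobs finish at roughly the same time instead of one job being stuck with all the large files.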

I'm also including some benchmark data:

On system1 the following times were recorded: when run with -j4, it processes 200 forum thread pages in 10 seconds.
There appear to be about 5 posts per thread page on average. So in 85000 seconds (almost a day) it would process ~1.7M posts (in ~340k forum thread pages) and have them ready to be imported into PostgreSQL. With -j4, the observed peak memory usage was 500MB.
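
The extrapolation is a plain linear scaling of the measured rate. A minimal Python sketch of it follows. It assumes that throughput stays constant over the whole run and that the 200-per-10-seconds measurement is the quantity being scaled up to the ~1.7M total; the function and variable names are illustrative and not part of the attached script.

```python
def extrapolate(count, seconds, budget_seconds):
    """Scale a measured count linearly to a longer time budget."""
    rate = count / seconds  # units processed per second
    return rate * budget_seconds

# Measurement with -j4: 200 units in 10 seconds; budget: 85000 seconds.
total_units = extrapolate(200, 10, 85_000)

# Assumed average of 5 posts per forum thread page.
posts_per_page = 5
pages = total_units / posts_per_page

print(f"{total_units:,.0f} units in 85000 s, i.e. {pages:,.0f} pages at 5 posts/page")
```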

I've tested the script attached on the following two systems:
system1 config:
- BaseX 9.2.4
- script (from util-linux 2.31.1)
- GNU Parallel 20161222
- Ubuntu 18.04 LTS

system1 hardware:
- cpu: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz (4 cores)
- memory: 16GB DDR3 RAM, 2 x Kingston @ 1333 MT/s
- disk: WDC WD30EURS-73TLHY0 @ 5400-7200RPM

system2 config:
- BaseX 9.2.4
- GNU Parallel 20181222
- script (from util-linux 2.34)

system2 hardware:
- cpu: Intel(R) Celeron(R) CPU J1900 @ 1.99GHz (4 cores)
- memory: 4GB RAM DDR @ 1600MHz
- disk: HDD ST3000VN007-2E4166 @ 5900 rpm

[1] https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-October/014722.html