You want to run the same query over a large set of XML chunks and persist the result as a Dataset or DataFrame?

Just store the XML chunks in a sequence file or Parquet and use BaseX as a query processor. Map over your input partitions, create an in-memory database from each chunk, and apply the XQuery to it.

Note that you can’t cache an XQuery yet in BaseX (until someone generous funds the development), so you’ll be compiling the query for each chunk.
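For illustration, here is a minimal sketch of the per-chunk step using the BaseX command API. It assumes BaseX 8 or later on the executor classpath; the class name `ChunkQuery`, the method `run`, and the database name `chunk` are made up for this example and are not from our actual code.

```java
import org.basex.core.BaseXException;
import org.basex.core.Context;
import org.basex.core.MainOptions;
import org.basex.core.cmd.CreateDB;
import org.basex.core.cmd.XQuery;

public class ChunkQuery {
    /**
     * Builds a throwaway in-memory database from one XML chunk,
     * runs the query against it and returns the serialized result.
     * (Hypothetical helper; names are illustrative only.)
     */
    public static String run(String xml, String query) throws BaseXException {
        Context ctx = new Context();
        try {
            // Keep the database in main memory; no database files are written.
            ctx.options.set(MainOptions.MAINMEM, true);
            // CreateDB accepts an XML string as input and opens the new database.
            new CreateDB("chunk", xml).execute(ctx);
            // The query string is parsed and compiled again on every call.
            return new XQuery(query).execute(ctx);
        } finally {
            ctx.close();
        }
    }

    public static void main(String[] args) throws BaseXException {
        String xml = "<items><item id='1'/><item id='2'/></items>";
        System.out.println(run(xml, "count(//item)"));
    }
}
```

In a Spark job a method like this would typically be called from the function passed to `mapPartitions`, once per chunk record, with the returned strings turned into rows afterwards.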

We’ve done similar work: the overhead of creating the database on the fly is peanuts compared to the effort of getting the data across S3 and onto your executors.

From: basex-talk-bounces@mailman.uni-konstanz.de [mailto:basex-talk-bounces@mailman.uni-konstanz.de] On Behalf Of Andrey Yatsuk
Sent: 08 May 2018 15:58
To: basex-talk@mailman.uni-konstanz.de
Subject: [basex-talk] Distributed XML processing on Apache Spark

Hello.

Briefing:

I want to implement distributed work with BaseX in Hadoop using Apache Spark. Data processing will be divided into the following stages:

1) Splitting XML into chunks

2) Parallel parsing and filling the database

3) Executing queries to make the table (Apache Spark Dataset<Row>)

Stage 1 is a simple algorithmic problem. It will compose a HashMap of (ChunkNumber -> List<Xml_Path>). Each chunk contains no more than 128 MB of data.
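As a rough sketch of Stage 1, the grouping could be done greedily over a list of paths with known sizes. Everything here is an assumption made for illustration: the class name `ChunkPlanner`, the method `plan`, the use of plain strings for paths, and the choice to give a file larger than the limit a chunk of its own rather than splitting it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ChunkPlanner {
    /** Upper bound per chunk, taken from the 128 MB figure above. */
    public static final long MAX_CHUNK_BYTES = 128L * 1024 * 1024;

    /**
     * Greedily assigns paths to numbered chunks in iteration order.
     * The input maps each XML path to its size in bytes.
     */
    public static Map<Integer, List<String>> plan(Map<String, Long> sizes, long maxBytes) {
        Map<Integer, List<String>> chunks = new HashMap<>();
        int chunk = 0;
        long used = 0;
        for (Map.Entry<String, Long> e : sizes.entrySet()) {
            long size = e.getValue();
            // Start a new chunk when this file would overflow the current one.
            // A single oversized file still ends up alone in its own chunk.
            if (used > 0 && used + size > maxBytes) {
                chunk++;
                used = 0;
            }
            chunks.computeIfAbsent(chunk, k -> new ArrayList<>()).add(e.getKey());
            used += size;
        }
        return chunks;
    }

    public static void main(String[] args) {
        Map<String, Long> sizes = new LinkedHashMap<>();
        sizes.put("a.xml", 100L << 20); // 100 MB
        sizes.put("b.xml", 20L << 20);  // 20 MB
        sizes.put("c.xml", 50L << 20);  // 50 MB
        // a.xml and b.xml fit together; c.xml starts the next chunk.
        System.out.println(plan(sizes, MAX_CHUNK_BYTES));
    }
}
```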

Stage 2. A standalone instance of BaseX will be initialized on each node of the cluster. Every instance will receive files or lines from HDFS as input. The resulting XML database for each chunk will be serialized to HDFS.

Stage 3. When a query request is received, each XML database will be deserialized in turn and the query applied to it. A table will be composed from the results.

Questions:

1) Send data from HDFS to embedded BaseX:

1.1) Does BaseX support reading data by schemed URI, e.g. `hdfs://home/user/file.xml`?

1.2) Can I send XML from RAM to BaseX?

1.3) Can I send XML lines (line by line) to BaseX?

2) Can I get a database in RAM and serialize it to HDFS?

3.1) Do I need to store XML in a persistent path to query it in the future?

3.2) When executing a query on XML in HDFS, can I read it line by line if BaseX does not know how to work with it directly?

Best regards,

Andrei Iatsuk.
