Hello.
Briefing: I want to implement distributed processing with BaseX on Hadoop using Apache Spark. The processing will be divided into the following stages: 1) splitting the XML into chunks, 2) parsing the chunks in parallel and filling the databases, 3) executing queries to build a table (an Apache Spark Dataset<Row>).
Stage 1 is a simple algorithmic problem: it composes a HashMap of (ChunkNumber -> List<XmlPath>), where each chunk contains no more than 128 MB of data.
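The stage-1 chunking might be sketched as follows. This is a minimal greedy binning over (path, size) pairs; in practice the pairs would come from an HDFS directory listing, and the file names used below are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Greedily bin XML file paths into numbered chunks so that no chunk
// exceeds 128 MB. A single file larger than 128 MB gets its own chunk.
public class XmlChunker {

    static final long MAX_CHUNK_BYTES = 128L * 1024 * 1024;

    // paths and sizes are parallel lists: sizes.get(i) is the byte size
    // of the file at paths.get(i).
    public static Map<Integer, List<String>> chunk(List<String> paths, List<Long> sizes) {
        Map<Integer, List<String>> chunks = new HashMap<>();
        int chunkNo = 0;
        long used = 0;
        chunks.put(chunkNo, new ArrayList<>());
        for (int i = 0; i < paths.size(); i++) {
            long size = sizes.get(i);
            // Start a new chunk when the current one would overflow 128 MB.
            if (used + size > MAX_CHUNK_BYTES && !chunks.get(chunkNo).isEmpty()) {
                chunkNo++;
                used = 0;
                chunks.put(chunkNo, new ArrayList<>());
            }
            chunks.get(chunkNo).add(paths.get(i));
            used += size;
        }
        return chunks;
    }
}
```

The resulting map is exactly the (ChunkNumber -> List<XmlPath>) structure described above and can be broadcast to the cluster for stage 2.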
Stage 2. On each node of the cluster, a standalone embedded instance of BaseX will be initialized. Each BaseX instance will receive files/lines from HDFS as input. The resulting XML database for each chunk will be serialized to HDFS.
Stage 3. When a query request is received, each XML database will be deserialized in turn and the query applied to it. The final table will be composed from the per-chunk results.
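A minimal sketch of the stage-3 flow, assuming a caller-supplied `queryChunk` function (hypothetical) that deserializes one chunk database and returns its rows. In the real pipeline that function would open the BaseX database serialized in stage 2 and run the XQuery; here it is abstracted so the sequential "query each chunk, concatenate rows" shape is visible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Apply a per-chunk query sequentially and concatenate the results.
// Each row is modeled as a String[]; the union of all per-chunk results
// forms the table that would back the Spark Dataset<Row>.
public class ChunkQueryRunner {

    public static List<String[]> runOverChunks(List<String> chunkPaths,
                                               Function<String, List<String[]>> queryChunk) {
        List<String[]> table = new ArrayList<>();
        for (String path : chunkPaths) {          // one chunk at a time, as described above
            table.addAll(queryChunk.apply(path)); // deserialize + query happen inside
        }
        return table;
    }
}
```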
Questions:
1) Sending data from HDFS to embedded BaseX:
   1.1) Does BaseX support reading data via a schemed URI, e.g. `hdfs://home/user/file.xml`?
   1.2) Can I pass XML from RAM to BaseX?
   1.3) Can I feed XML to BaseX line by line?
2) Can I obtain a database in RAM so that I can serialize it to HDFS?
3.1) Do I need to store the XML at a persistent path to query it later?
3.2) When executing a query on XML stored in HDFS, can I read it line by line if BaseX cannot work with it directly?
Best regards, Andrei Iatsuk.