Hello.

 

Briefing:

I want to implement distributed work with BaseX in Hadoop using Apache Spark. Data processing will be divided into the following stages:

1) Splitting XML into chunks

2) Parallel parsing and filling the database

3) Executing queries to build the resulting table (an Apache Spark Dataset<Row>)

 

Stage 1 is a simple algorithmic problem: it composes a HashMap of (ChunkNumber -> List<Xml_Path>), where each chunk contains at most 128 MB of data.
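For illustration, stage 1 could look like the following greedy first-fit pass over (path -> size) pairs; the class name, method name, and the use of a pre-computed size map are my own placeholders:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ChunkSplitter {
    static final long MAX_CHUNK_BYTES = 128L * 1024 * 1024; // 128 MB per chunk

    // Greedy first-fit: add files to the current chunk until the 128 MB
    // limit would be exceeded, then open a new chunk. A single file larger
    // than the limit still gets a chunk of its own.
    static Map<Integer, List<String>> split(Map<String, Long> fileSizes) {
        Map<Integer, List<String>> chunks = new LinkedHashMap<>();
        int chunk = 0;
        long used = 0;
        for (Map.Entry<String, Long> e : fileSizes.entrySet()) {
            if (used + e.getValue() > MAX_CHUNK_BYTES && used > 0) {
                chunk++;
                used = 0;
            }
            chunks.computeIfAbsent(chunk, k -> new ArrayList<>()).add(e.getKey());
            used += e.getValue();
        }
        return chunks;
    }
}
```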

 

Stage 2. On each node of the cluster, a standalone embedded instance of BaseX will be initialized. Every BaseX instance will receive files/lines from HDFS as input. The resulting XML database for each chunk will be serialized to HDFS.
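As a sketch of what I have in mind for stage 2, using the embedded BaseX command API (the database name and input path are placeholders, and the HDFS serialization step is exactly the part I am asking about):

```java
import org.basex.core.Context;
import org.basex.core.cmd.Close;
import org.basex.core.cmd.CreateDB;

public class ChunkIndexer {
    public static void main(String[] args) throws Exception {
        // One embedded BaseX context per executor/node.
        Context ctx = new Context();
        try {
            // CreateDB(name, input): input may be a single XML file or a
            // directory of XML files (here, a local copy of one chunk).
            new CreateDB("chunk-0", "/tmp/chunk-0/").execute(ctx);
            new Close().execute(ctx);
            // At this point the on-disk database directory would be copied
            // to HDFS; whether it can instead stay in RAM is question 2 below.
        } finally {
            ctx.close();
        }
    }
}
```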

 

Stage 3. When a query request is received, each XML database will be deserialized in turn and the query applied to it. A table will be composed from the combined results.
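Stage 3 might then look roughly like this (again a sketch: the database name is a placeholder, and I flatten each serialized XQuery result line into a single-column Row for simplicity):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.basex.core.Context;
import org.basex.core.cmd.Open;
import org.basex.core.cmd.XQuery;

public class QueryRunner {
    // Open a previously created BaseX database, run one XQuery against it,
    // and turn each line of the serialized result into a Spark Row.
    public static Dataset<Row> run(SparkSession spark, String dbName, String query)
            throws Exception {
        Context ctx = new Context();
        List<Row> rows = new ArrayList<>();
        try {
            new Open(dbName).execute(ctx);
            String result = new XQuery(query).execute(ctx);
            for (String line : result.split("\n")) {
                rows.add(RowFactory.create(line));
            }
        } finally {
            ctx.close();
        }
        StructType schema = DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("value", DataTypes.StringType, false)
        });
        return spark.createDataFrame(rows, schema);
    }
}
```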

 

Questions:

1) Send data from HDFS to embedded BaseX:

1.1) Does BaseX support reading data by schemed URI, e.g. `hdfs://home/user/file.xml`?

1.2) Can I pass XML to BaseX directly from RAM (e.g. as a string or stream)?

1.3) Can I stream XML to BaseX line by line?

2) Can I create a database in RAM (in-memory), so that I can serialize it to HDFS myself?

3.1) Do I need to store XML in a persistent path to query it in the future?

3.2) When executing a query against XML stored in HDFS, can I feed it to BaseX line by line if BaseX cannot read from HDFS directly?

 

Best regards,

Andrei Iatsuk.