Hello.
Briefing:
I want to implement distributed processing with BaseX on Hadoop using Apache Spark. The data processing will be divided into the following stages:
1) Splitting XML into chunks
2) Parallel parsing and filling the database
3) Executing queries to build a table (an Apache Spark Dataset<Row>)
Stage 1 is a simple algorithmic task: it composes a HashMap of (ChunkNumber -> List<Xml_Path>), where each chunk contains no more than 128 MB of data.
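The chunking in stage 1 could be sketched as follows; this is a minimal greedy grouping, assuming the file sizes are already known (in practice they would come from the HDFS FileStatus API), and the class and method names are my own invention:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of stage 1: greedily group XML file paths into chunks of at
// most 128 MB, keyed by chunk number.
public class XmlChunker {
    static final long MAX_CHUNK_BYTES = 128L * 1024 * 1024;

    // sizes maps each XML path to its size in bytes; iteration order of
    // the passed map determines grouping order.
    public static Map<Integer, List<String>> chunk(Map<String, Long> sizes) {
        Map<Integer, List<String>> chunks = new HashMap<>();
        int chunkNumber = 0;
        long used = 0;
        List<String> current = new ArrayList<>();
        for (Map.Entry<String, Long> e : sizes.entrySet()) {
            // close the current chunk when the next file would overflow it
            if (used + e.getValue() > MAX_CHUNK_BYTES && !current.isEmpty()) {
                chunks.put(chunkNumber++, current);
                current = new ArrayList<>();
                used = 0;
            }
            current.add(e.getKey());
            used += e.getValue();
        }
        if (!current.isEmpty()) chunks.put(chunkNumber, current);
        return chunks;
    }
}
```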
Stage 2. A standalone BaseX instance will be initialized on each cluster node. Each instance will receive files/lines from HDFS as input. The resulting XML database for each chunk will be serialized back to HDFS.
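For stage 2 on a single worker, something like the following might work with BaseX's embedded command API; this is a sketch under the assumption that the XML content has already been read from HDFS into a string (e.g. via Hadoop's FileSystem.open()), and the class and method names here are hypothetical:

```java
import org.basex.core.Context;
import org.basex.core.cmd.CreateDB;
import org.basex.core.cmd.XQuery;

// Sketch: build an embedded BaseX database from in-memory XML for one
// chunk, then run a sample query against it.
public class ChunkLoader {
    public static String loadAndCount(String dbName, String xmlContent) throws Exception {
        Context ctx = new Context();
        try {
            // CreateDB's second argument may be a path, URL, or the XML
            // content itself (assumption based on the command API docs).
            new CreateDB(dbName, xmlContent).execute(ctx);
            return new XQuery("count(//*)").execute(ctx);
        } finally {
            ctx.close();
        }
    }
}
```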
Stage 3. When a query request is received, each XML database will be deserialized in turn and the query applied to it. A table will be composed from the results.
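Stage 3 might look roughly like this in Spark's Java API; the `queryChunk` helper is hypothetical (it would wrap the stage-2 deserialization and an embedded BaseX query), and a single-column string schema is assumed for illustration:

```java
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Sketch: distribute the serialized database paths, apply the query per
// chunk, and collect the results into a Dataset<Row>.
public class QueryStage {

    // Hypothetical helper: deserialize the BaseX database at dbPath from
    // HDFS, run the XQuery against it, and return one string per result.
    static List<String> queryChunk(String dbPath, String xquery) {
        throw new UnsupportedOperationException("stage-2 deserialization goes here");
    }

    public static Dataset<Row> runQuery(SparkSession spark, List<String> dbPaths, String xquery) {
        StructType schema = new StructType().add("value", DataTypes.StringType);
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
        JavaRDD<Row> rows = jsc.parallelize(dbPaths)
            .flatMap(p -> queryChunk(p, xquery).iterator())
            .map(RowFactory::create);
        return spark.createDataFrame(rows, schema);
    }
}
```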
Questions:
1) Send data from HDFS to embedded BaseX:
1.1) Does BaseX support reading data from a schemed URI, e.g. `hdfs://home/user/file.xml`?
1.2) Can I send XML from RAM to BaseX?
1.3) Can I send XML to BaseX line by line?
2) Can I obtain the database in RAM so that I can serialize it to HDFS?
3.1) Do I need to store the XML at a persistent path in order to query it later?
3.2) When executing a query against XML stored in HDFS, can I feed it to BaseX line by line if BaseX cannot read from HDFS directly?
Best regards,
Andrei Iatsuk.