Hello.
Briefing: I want to implement distributed processing with BaseX on Hadoop using Apache Spark. The processing will be divided into the following stages: 1) splitting the XML into chunks, 2) parsing the chunks in parallel and filling the databases, 3) executing queries to build a table (an Apache Spark Dataset<Row>).
Stage 1 is a simple algorithmic problem: it composes a HashMap of (ChunkNumber -> List<XmlPath>), where each chunk contains no more than 128 MB of data.
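The stage-1 chunking might be sketched as follows. This is a minimal greedy binning over (path, size) pairs; in practice the pairs would come from an HDFS directory listing, and the file names used below are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Greedily bin XML file paths into numbered chunks so that no chunk
// exceeds 128 MB. A single file larger than 128 MB gets its own chunk.
public class XmlChunker {

    static final long MAX_CHUNK_BYTES = 128L * 1024 * 1024;

    // paths and sizes are parallel lists: sizes.get(i) is the byte size
    // of the file at paths.get(i).
    public static Map<Integer, List<String>> chunk(List<String> paths, List<Long> sizes) {
        Map<Integer, List<String>> chunks = new HashMap<>();
        int chunkNo = 0;
        long used = 0;
        chunks.put(chunkNo, new ArrayList<>());
        for (int i = 0; i < paths.size(); i++) {
            long size = sizes.get(i);
            // Start a new chunk when the current one would overflow 128 MB.
            if (used + size > MAX_CHUNK_BYTES && !chunks.get(chunkNo).isEmpty()) {
                chunkNo++;
                used = 0;
                chunks.put(chunkNo, new ArrayList<>());
            }
            chunks.get(chunkNo).add(paths.get(i));
            used += size;
        }
        return chunks;
    }
}
```

The resulting map is exactly the (ChunkNumber -> List<XmlPath>) structure described above and can be broadcast to the cluster for stage 2.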
Stage 2. On each node of the cluster, a standalone embedded instance of BaseX will be initialized. Each BaseX instance will receive files/lines from HDFS as input. The resulting XML database for each chunk will be serialized to HDFS.
Stage 3. When a query request is received, each XML database will be deserialized in turn and the query applied to it. The final table will be composed from the per-chunk results.
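A minimal sketch of the stage-3 flow, assuming a caller-supplied `queryChunk` function (hypothetical) that deserializes one chunk database and returns its rows. In the real pipeline that function would open the BaseX database serialized in stage 2 and run the XQuery; here it is abstracted so the sequential "query each chunk, concatenate rows" shape is visible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Apply a per-chunk query sequentially and concatenate the results.
// Each row is modeled as a String[]; the union of all per-chunk results
// forms the table that would back the Spark Dataset<Row>.
public class ChunkQueryRunner {

    public static List<String[]> runOverChunks(List<String> chunkPaths,
                                               Function<String, List<String[]>> queryChunk) {
        List<String[]> table = new ArrayList<>();
        for (String path : chunkPaths) {          // one chunk at a time, as described above
            table.addAll(queryChunk.apply(path)); // deserialize + query happen inside
        }
        return table;
    }
}
```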
Questions:
1) Sending data from HDFS to embedded BaseX:
   1.1) Does BaseX support reading data via a schemed URI, e.g. `hdfs://home/user/file.xml`?
   1.2) Can I pass XML from RAM to BaseX?
   1.3) Can I feed XML to BaseX line by line?
2) Can I obtain a database in RAM so that I can serialize it to HDFS?
3.1) Do I need to store the XML at a persistent path to query it later?
3.2) When executing a query on XML stored in HDFS, can I read it line by line if BaseX cannot work with it directly?
Best regards, Andrei Iatsuk.