Hi Ron,
I agree that would be helpful. I’ve added a GitHub issue [1].
As you’ve already indicated, you can post-process your database instances. I think the easiest query for that is:
delete nodes db:get('db')//*[empty(node())]
…followed by an optional db:optimize('db').
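For example, both steps can be combined in a single updating script:

(: drop all empty leaf elements, then compact the database :)
delete nodes db:get('db')//*[empty(node())],
db:optimize('db')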
Best, Christian
[1] https://github.com/BaseXdb/basex/issues/2203
On Thu, Apr 20, 2023 at 1:06 PM Ron Van den Branden <ron.vdbranden@gmail.com> wrote:
Hi all,
I'm investigating a way of analysing a massive set of more than 900,000 CSV files. The CSV parsing in BaseX seems very useful for this, producing a db nicely filled with documents such as:
<csv>
  <record>
    <ResourceID>00003a92-d10e-585e-84a7-29ad17c5799f</ResourceID>
    <source.id>bbcy:vev:6860</source.id>
    <card>AA</card>
    <order>0</order>
    <source_field/>
    <source_code/>
    <Annotation>some remarks</Annotation>
    <Annotation_Language>en</Annotation_Language>
    <Annotation_Type/>
    <resource_model/>
    <!-- ... -->
  </record>
  <record>
    <ResourceID>00003a92-d10e-585e-84a7-29ad17c5799f</ResourceID>
    <source.id>bbcy:vev:6860</source.id>
    <card>BE</card>
    <order>0</order>
    <source_field/>
    <source_code>concept</source_code>
    <Annotation/>
    <Annotation_Language/>
    <Annotation_Type/>
    <resource_model/>
    <!-- ... -->
  </record>
  <!-- ... -->
</csv>
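For reference, the database is created roughly as follows (the database name, path and options are simplified):

(: build the database from a directory of CSV files,
   using the first line of each file as element names :)
db:create(
  'db',
  '/path/to/csv/files',
  (),
  map { 'parser': 'csv', 'csvparser': 'header=yes' }
)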
Yet when querying those documents, I've noticed that merely selecting non-empty elements is very slow. For example:
//source_code[normalize-space()]
...can take over 40 seconds.
Since I don't have control over the source data, it would be really great if empty cells could be skipped when parsing the CSV files. Of course this would be a trivial post-processing step in XSLT or XQuery, but that's infeasible for this mass of data.
Does BaseX provide a way of telling the CSV parser to skip empty cells?
Best,
Ron