Hi all,

I'm investigating ways of analysing a massive set of more than 900,000 CSV files, and the CSV parsing in BaseX seems very useful for this, producing a database nicely filled with documents such as:

<csv>
  <record>
    <ResourceID>00003a92-d10e-585e-84a7-29ad17c5799f</ResourceID>
    <source.id>bbcy:vev:6860</source.id>
    <card>AA</card>
    <order>0</order>
    <source_field/>
    <source_code/>
    <Annotation>some remarks</Annotation>
    <Annotation_Language>en</Annotation_Language>
    <Annotation_Type/>
    <resource_model/>
    <!-- ... -->
  </record>
  <record>
    <ResourceID>00003a92-d10e-585e-84a7-29ad17c5799f</ResourceID>
    <source.id>bbcy:vev:6860</source.id>
    <card>BE</card>
    <order>0</order>
    <source_field/>
    <source_code>concept</source_code>
    <Annotation/>
    <Annotation_Language/>
    <Annotation_Type/>
    <resource_model/>
    <!-- ... -->
  </record>

  <!-- ... -->
</csv>
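
For context, the import is done roughly along these lines (a sketch rather than my exact script; the database name and input path are placeholders):

  db:create(
    'mydata',              (: placeholder database name :)
    '/path/to/csv-files',  (: placeholder input directory :)
    (),
    map { 'parser': 'csv', 'csvparser': map { 'header': true() } }
  )

The header option turns the first row of each CSV file into the element names seen above.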

Yet, when querying those documents, I'm noticing that merely selecting non-empty elements is very slow. For example:

  //source_code[normalize-space()]

...can take over 40 seconds.

Since I don't have control over the source data, it would be really great if empty cells could be skipped when parsing the CSV files. Of course this could be a trivial post-processing step via XSLT or XQuery, as sketched below, but that's infeasible for that mass of data.
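
For concreteness, the post-processing I have in mind is a one-liner with XQuery Update, run with the database opened as context (again just a sketch):

  (: remove every field element that contains no text at all :)
  delete node //record/*[not(normalize-space())]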

Does BaseX provide a way of telling the CSV parser to skip empty cells?

Best,

Ron