I am trying to use an XPath expression, passed to basex on the macOS command line, to extract data from a downloaded third-party HTML file. The file is not entirely well-formed XML (though only in elements that my XPath does not reference), so BaseX outputs errors instead of the data I want.
How can I configure BaseX to ignore XML syntax issues that aren't fatal for my XPath?
If it cannot be so configured:
- could that option be added as a new feature?
- in the meantime, what other macOS command-line XPath processors can ignore non-fatal XML syntax issues? Which of those support the newest versions of XPath, perform well, etc.?
Thanks.
Hi Ross,
You can try TagSoup [1]. It is a library that parses HTML and tries to output it as well-formed XML.
To use it with BaseX, the TagSoup library has to be on the classpath at startup. The approach is documented here: https://docs.basex.org/main/Parsers#html_parser
If you encounter problems using it, feel free to report them back here.
Cheers, Alex
[1] http://vrici.lojban.org/~cowan/tagsoup/
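To illustrate the idea behind the suggestion above (repair the HTML into a well-formed tree first, then run the path expression), here is a minimal sketch in Python's standard library. It is not TagSoup and does not use any BaseX API; the names SoupToTree and parse_lenient are hypothetical, the repair rules are far simpler than TagSoup's, and ElementTree supports only a limited subset of XPath.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML void elements never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class SoupToTree(HTMLParser):
    """Hypothetical helper: build a well-formed tree from lenient HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags = []  # stack of currently open element names

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. <input disabled>) get an empty value.
        clean = {k: (v if v is not None else "") for k, v in attrs}
        self.builder.start(tag, clean)
        if tag in VOID:
            self.builder.end(tag)  # close void elements immediately
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: ignore it
        # Close any elements still open inside the one being closed.
        while self.open_tags:
            name = self.open_tags.pop()
            self.builder.end(name)
            if name == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # drop text outside the root element
            self.builder.data(data)

    def to_tree(self):
        # Close whatever the input left open, then return the root element.
        while self.open_tags:
            self.builder.end(self.open_tags.pop())
        return self.builder.close()


def parse_lenient(html_text):
    """Parse possibly malformed HTML and return an ElementTree root."""
    parser = SoupToTree()
    parser.feed(html_text)
    parser.close()
    return parser.to_tree()


# Made-up sample input with an unclosed <p>, an unclosed <td>,
# a void <img>, and a stray </span>.
sample = """<html><body>
<p>Unclosed paragraph
<img src="a.png">
<table><tr><td class="price">42</td><td class="price">7</tr></table>
</span>
</body></html>"""

root = parse_lenient(sample)
prices = [td.text for td in root.findall(".//td[@class='price']")]
print(prices)  # ['42', '7']
```

The sketch only shows why a repair step makes the query succeed even though the input is not well-formed. For real pages, a dedicated HTML parser such as TagSoup handles many more cases.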
On 08.08.2024 at 01:55, Ross Goldberg ross.goldberg@proton.me wrote:
I am trying to use an xpath passed to basex on the macOS command line to extract data from a downloaded third-party HTML file.
The file isn't completely valid xml syntax (but only in elements that aren't referenced by my xpath), so basex outputs errors instead of the data that I want.
How can I configure basex to ignore xml syntax issues that aren't fatal for my xpath?
If it cannot be so configured:
- could that option be added as a new feature?
- in the meantime, what other macOS command-line xpath parsers can ignore non-fatal XML syntax issues? Which of those support the newest versions of xpath, are performant, etc.?
Thanks.
basex-talk@mailman.uni-konstanz.de