Different results when importing HTML documents - BaseX-Talk - mailman.uni-konstanz.de

27 Jan 2023


Hello all,
I am trying to store HTML documents in BaseX. I set up a local instance of
BaseX on my computer using Docker, and I imported this file into it:
https://pastebin.com/HJdJgLv9
On my local BaseX instance, the document is imported and
"/html/body/article" does return the <article> node as expected.
On my remote/production BaseX instance (using the same Dockerfile and
image), the document is imported but the <article> tag is "stripped" (even
though its contents / child nodes remain in the imported document).
"/html/body/article" is empty.
If I copy the .basex files from my local database over to my remote
database, the documents are complete, just as on my local instance. However,
when I import the documents again on my local instance, the <article>
tag gets stripped there too (and the child nodes remain).
What am I doing wrong when importing my documents? And what did I do to get
them imported properly into my current local database? I have tried many
combinations of options but just can't figure out why this happens.
I used the following options when importing my documents, as per the
documentation:
SET PARSER html
SET HTMLPARSER method=xml,nons=true,nocdata=true,nodefaults=true,nobogons=true,nocolons=true,ignorable=true
SET CREATEFILTER *.html
I also use SET FTINDEX true but I don't think it would have an impact
anyway.
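Put together, the import sequence would look roughly like the following BaseX command script. This is only a sketch: the database name "htmldocs" and the input path "/path/to/html" are placeholders, not the names used on either instance, and the final query is just the check described above.

```
# Parser options as listed above
SET PARSER html
SET HTMLPARSER method=xml,nons=true,nocdata=true,nodefaults=true,nobogons=true,nocolons=true,ignorable=true
SET CREATEFILTER *.html
SET FTINDEX true

# Placeholder database name and input directory (assumptions)
CREATE DB htmldocs /path/to/html

# Check whether the <article> element survived the import
XQUERY /html/body/article
```

Running the same script against both instances should show whether the options alone account for the difference.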
Thank you very much!
- Tim