Hello Christian,

Thanks for your reply. I added the documents again with the default options and I now get satisfactory results. Not sure why I kept on using the settings recommended in the documentation...
Would it be possible to add a link to the TagSoup documentation on the parser options to the BaseX documentation? That could be helpful.

Thanks,
- Tim

On Mon, Jan 30, 2023 at 10:42 PM Christian Grün <christian.gruen@gmail.com> wrote:
Hi Tim,

I assume the article element will be preserved if you omit the
nobogons HTMLPARSER option [1]. Usually, there’s no need to set
specific options if the default behavior gives satisfying results.
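
As a rough sketch of what that could look like: the commands below repeat
the settings from the original post with only nobogons=true dropped.
TagSoup's nobogons switch suppresses elements it does not know, which would
likely explain the missing HTML5 article element. The database name
"htmldocs" and the input path are placeholders, not taken from this thread.

```
# Same options as in the original post, minus nobogons=true
SET PARSER html
SET HTMLPARSER method=xml,nons=true,nocdata=true,nodefaults=true,nocolons=true,ignorable=true
SET CREATEFILTER *.html
# Hypothetical database name and input directory
CREATE DB htmldocs /path/to/html/files
# Check whether the article element survived the import
XQUERY /html/body/article
```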

Best,
Christian

[1] http://vrici.lojban.org/~cowan/tagsoup/



On Fri, Jan 27, 2023 at 8:05 PM Timothée <timoguic@gmail.com> wrote:
>
> Hello all,
>
> I am trying to store HTML documents in BaseX. I set up a local instance of BaseX on my computer using Docker, and I imported this file into it: https://pastebin.com/HJdJgLv9
>
> On my local BaseX instance, the document is imported and "/html/body/article" does return the <article> node as expected.
>
> On my remote/production BaseX instance (using the same Dockerfile and image), the document is imported but the <article> tag is "stripped" (even though its contents / child nodes remain in the imported document). "/html/body/article" is empty.
>
> If I copy the .basex files from my local database over to my remote database, the documents are complete, as on my local instance. I also tried to import the documents again on my local instance, and the <article> tag gets stripped there too (and the child nodes remain).
>
> What am I doing wrong when importing my documents? What did I do to import them properly in my current local instance? I tried a lot of options but I just can't figure out why this happens (I fiddled a lot with it).
>
> I used the following options when importing my documents, as per the documentation:
> SET PARSER html
> SET HTMLPARSER method=xml,nons=true,nocdata=true,nodefaults=true,nobogons=true,nocolons=true,ignorable=true
> SET CREATEFILTER *.html
>
> I also use SET FTINDEX true but I don't think it would have an impact anyway.
>
> Thank you very much!
> - Tim