Hi Tim,
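I assume the article element will be preserved if you omit the
nobogons option from HTMLPARSER [1]. That option tells TagSoup to
suppress unknown elements, and it probably does not know the HTML5
article element. Usually, there’s no need to set specific options
if the default behavior gives satisfactory results.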
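As a rough sketch, this is how the import could look with your other settings unchanged and only nobogons removed. The database name "htmldocs" and the input path "/path/to/html" are placeholders I made up, and I have not run this against your file:

```
# Same options as in your message, minus nobogons=true
SET PARSER html
SET HTMLPARSER method=xml,nons=true,nocdata=true,nodefaults=true,nocolons=true,ignorable=true
SET CREATEFILTER *.html
# "htmldocs" and "/path/to/html" are hypothetical; use your own name and path
CREATE DB htmldocs /path/to/html
# Should return the <article> node if the element was preserved
XQUERY /html/body/article
```

If the final query still comes back empty, it may be worth trying the import with no HTMLPARSER options at all and comparing the result.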
Best,
Christian
[1] http://vrici.lojban.org/~cowan/tagsoup/
On Fri, Jan 27, 2023 at 8:05 PM Timothée <timoguic@gmail.com> wrote:
>
> Hello all,
>
> I am trying to store HTML documents in BaseX. I set up a local instance of BaseX on my computer using Docker, and I imported this file into it: https://pastebin.com/HJdJgLv9
>
> On my local BaseX instance, the document is imported and "/html/body/article" does return the <article> node as expected.
>
> On my remote/production BaseX instance (using the same Dockerfile and image), the document is imported but the <article> tag is "stripped" (even though its contents / child nodes remain in the imported document). "/html/body/article" is empty.
>
> If I copy over the .basex files from my local database to my remote database, then the documents are complete, as they are on my local instance. I also tried to import the documents again on my local instance, and the <article> tag gets stripped there too (and the child nodes remain).
>
> What am I doing wrong when importing my documents? What did I do to import them properly in my current local instance? I tried a lot of options but I just can't figure out why this happens (I fiddled a lot with it).
>
> I used the following options when importing my documents, as per the documentation:
> SET PARSER html
> SET HTMLPARSER method=xml,nons=true,nocdata=true,nodefaults=true,nobogons=true,nocolons=true,ignorable=true
> SET CREATEFILTER *.html
>
> I also use SET FTINDEX true but I don't think it would have an impact anyway.
>
> Thank you very much!
> - Tim