Re: [basex-talk] Different results when importing HTML documents

31 Jan 2023


Hi Tim,
I assume the article element will be preserved if you omit the
nobogons HTMLPARSER option [1]. Usually, there’s no need to set
specific options if the default behavior gives satisfactory results.
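
A minimal sketch of such an import, assuming a database named "mydb" and an input directory "/path/to/html" (both are hypothetical placeholders). The remaining options are copied unchanged from the original message, only nobogons is left out:

```
# Parse input as HTML (requires TagSoup on the classpath)
SET PARSER html
# Same option string as in the original message, minus nobogons=true
SET HTMLPARSER method=xml,nons=true,nocdata=true,nodefaults=true,nocolons=true,ignorable=true
SET CREATEFILTER *.html
# "mydb" and "/path/to/html" are placeholders, not names from this thread
CREATE DB mydb /path/to/html
# Should return the <article> node if the element was preserved
XQUERY /html/body/article
```

TagSoup's nobogons switch suppresses elements it does not know, which would plausibly explain why an HTML5 element such as <article> is dropped while its children remain.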
Best,
Christian
[1] http://vrici.lojban.org/~cowan/tagsoup/
On Fri, Jan 27, 2023 at 8:05 PM Timothée <timoguic@gmail.com> wrote:
Hello all,
I am trying to store HTML documents in BaseX. I set up a local instance of BaseX on my computer using Docker, and I imported this file into it: https://pastebin.com/HJdJgLv9
On my local BaseX instance, the document is imported and "/html/body/article" does return the <article> node as expected.
On my remote/production BaseX instance (using the same Dockerfile and image), the document is imported but the <article> tag is "stripped" (even though its contents / child nodes remain in the imported document). "/html/body/article" is empty.
If I copy over the .basex files from my local database to my remote database, then the documents are complete like on my local instance. I also tried to import the documents again on my local instance, and the <article> tag gets stripped too (and the child nodes remain).
What am I doing wrong when importing my documents? What did I do to import them properly in my current local instance? I tried a lot of options but I just can't figure out why this happens (I fiddled a lot with it).
I used the following options when importing my documents, as per the documentation:
SET PARSER html
SET HTMLPARSER method=xml,nons=true,nocdata=true,nodefaults=true,nobogons=true,nocolons=true,ignorable=true
SET CREATEFILTER *.html
I also use SET FTINDEX true but I don't think it would have an impact anyway.
Thank you very much!

Tim
