Hello all,
I am trying to store HTML documents in BaseX. I set up a local instance of BaseX on my computer using Docker, and I imported this file into it: https://pastebin.com/HJdJgLv9
On my local BaseX instance, the document is imported and "/html/body/article" does return the <article> node as expected.
On my remote/production BaseX instance (using the same Dockerfile and image), the document is imported but the <article> tag is "stripped" (even though its contents / child nodes remain in the imported document). "/html/body/article" is empty.
If I copy over the .basex files from my local database to my remote database, then the documents are complete like on my local instance. I also tried to import the documents again on my local instance, and the <article> tag gets stripped too (and the child nodes remain).
What am I doing wrong when importing my documents? What did I do to import them properly in my current local instance? I tried a lot of options but I just can't figure out why this happens (I fiddled a lot with it).
I used the following options when importing my documents, as per the documentation:
SET PARSER html
SET HTMLPARSER method=xml,nons=true,nocdata=true,nodefaults=true,nobogons=true,nocolons=true,ignorable=true
SET CREATEFILTER *.html
I also use SET FTINDEX true but I don't think it would have an impact anyway.
Thank you very much!
- Tim
Hi Tim,
I assume the article element will be preserved if you omit the nobogons HTMLPARSER option [1]. Usually, there’s no need to set specific options if the default behavior gives satisfying results.
Best, Christian
[1] http://vrici.lojban.org/~cowan/tagsoup/
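As a rough sketch of the suggestion above, the import could be repeated with the same options minus nobogons. TagSoup describes nobogons as suppressing unknown elements, so the assumption here (in line with Christian's guess) is that it does not recognize `article`. The database name `articles` and the input path `/path/to/html` are hypothetical placeholders, and the final query uses a namespace wildcard so that it does not depend on the namespace settings.

```
# Sketch only: database name and input path are placeholders, not from the thread
SET PARSER html
# Same option string as in the original message, with nobogons=true removed
SET HTMLPARSER method=xml,nons=true,nocdata=true,nodefaults=true,nocolons=true,ignorable=true
SET CREATEFILTER *.html
CREATE DB articles /path/to/html
# Check whether the article element survived the import
XQUERY //*:article
```

Leaving out the SET HTMLPARSER line altogether would import the documents with the parser defaults.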
On Fri, Jan 27, 2023 at 8:05 PM Timothée timoguic@gmail.com wrote:
[...]
Hello Christian,
Thanks for your reply. I added the documents again with the default options and I do get satisfying results. Not sure why I kept on using the settings recommended in the documentation... Would it be possible to add a link to the TagSoup documentation on the parser options to the BaseX documentation? That could be helpful.
Thanks,
- Tim
On Mon, Jan 30, 2023 at 10:42 PM Christian Grün christian.gruen@gmail.com wrote:
[...]
Hi Tim,
A link to the TagSoup documentation already exists some lines further above in the article.
I have slightly changed the text and the example; I hope this makes it less confusing.
Cheers, Christian
On Thu, Feb 2, 2023 at 6:32 PM Timothée timoguic@gmail.com wrote:
[...]
basex-talk@mailman.uni-konstanz.de