Hello --
So I have a pile of near-XML HTML with semantically significant comments to deal with. (I must have been sinning much more than I realized!)
Using BaseX866-20170818.124137, BaseX will parse the content, but all the comments go away. This is with "lexical" passed as an option on the parser tab where it asks for TagSoup options; from https://github.com/orbeon/tagsoup/blob/master/trunk/README I understand that option to pass through comments (and DOCTYPE declarations).
How do I parse HTML and keep the comments?
Thanks! Graydon
Hi Graydon,
A little example query and input file would be great (the smaller, the better).
Thanks in advance, Christian
On Fri, Aug 18, 2017 at 2:40 PM, Graydon Saunders graydonish@gmail.com wrote:
Hello --
So I have a pile of near-XML HTML with semantically significant comments to deal with. (I must have been sinning much more than I realized!)
Using BaseX866-20170818.124137, BaseX will parse the content but all the comments go away. This is with passing the "lexical" option on the parser tab where it asks for TagSoup options, which I understand from https://github.com/orbeon/tagsoup/blob/master/trunk/README to pass through comments (and DOCTYPE declarations).
How do I parse HTML and keep the comments?
Thanks! Graydon
Hi Christian --
There's no query! This is about loading the files into a DB with the GUI.
I've attached two files.
If I load them as Database->New with "input format" HTML, the comments go away.
If I load them the same way but with "lexical" as a TagSoup parser option, the comments still go away. I expected "lexical" to be the TagSoup option that keeps comments from going away (and that retains the DOCTYPE in the example that has one).
If I run
    java -jar /usr/share/java/tagsoup.jar --lexical --files *html
from the command line, the comments do NOT go away, so I don't think it's a TagSoup problem, at least not with 1.2.1.
thanks! Graydon
On Fri, Aug 18, 2017 at 9:07 AM, Christian Grün christian.gruen@gmail.com wrote:
Hi Graydon,
A little example query and input file would be great (the smaller, the better).
Thanks in advance, Christian
Hi Graydon,
Comments will be preserved if you specify "lexical=true" as the value of the HTMLPARSER option (or in the Parsing tab of the GUI Create dialog). I have added a little example to the Wiki options page [1].
Hope this helps, Christian
[1] http://docs.basex.org/wiki/Options#HTMLPARSER
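For reference, the same setting can be sketched as a BaseX command script. This is a minimal sketch, not taken from the Wiki example: the database name "htmldocs" and the input path are placeholders, and it assumes TagSoup is on the BaseX classpath so that HTML input is actually handed to it. The point is that the option is written as "lexical=true" rather than a bare "lexical".

```
# Sketch only: "htmldocs" and the path below are hypothetical placeholders
SET PARSER html
SET HTMLPARSER lexical=true
SET CREATEFILTER *.html
CREATE DB htmldocs /path/to/html-files
# Quick check: a non-zero count suggests the comments were kept
XQUERY count(//comment())
```

In the GUI, the equivalent would be entering lexical=true in the TagSoup options field on the Parsing tab of the Create dialog.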
On Fri, Aug 18, 2017 at 5:02 PM, Graydon Saunders graydonish@gmail.com wrote:
Hi Christian --
There's no query! This is about loading the files into a DB with the GUI.
I've attached two files.
If I load them as Database->New with "input format" HTML, the comments go away.
If I load them the same way but with "lexical" as a TagSoup parser option, the comments go away. I expect "lexical" is the TagSoup option that keeps comments from going away. (And for the DOCTYPE in the example that has it to be retained.)
If I use java -jar /usr/share/java/tagsoup.jar --lexical --files *html
from the command line, the comments do NOT go away, so I don't think it's a TagSoup problem, at least not with 1.2.1
thanks! Graydon
Hi Christian,
That works!
Thank you! (and next time I will try to find all the documentation, rather than supposing it works like the command line tool.)
-- Graydon
On Mon, Aug 21, 2017 at 6:35 AM, Christian Grün christian.gruen@gmail.com wrote:
Hi Graydon,
Comments will be preserved if you specify "lexical=true" as HTMLPARSER option (or in the Parsing tab of the GUI Create dialog). I have added a little example to the Wiki options page [1].
Hope this helps, Christian
[1] http://docs.basex.org/wiki/Options#HTMLPARSER
basex-talk@mailman.uni-konstanz.de