Hi Graydon,
Comments will be preserved if you specify "lexical=true" as an HTMLPARSER
option (or in the Parsing tab of the GUI Create dialog). I have added
a little example to the Options page of the Wiki [1].
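
If you would rather script the import than click through the GUI, here is a
minimal sketch. It assumes TagSoup is on the BaseX classpath; the database
name "htmldocs", the input path "./example.html" and the script name
"keep-comments.bxs" are placeholders I made up for illustration.

```shell
# Write a small BaseX command script (file name is a placeholder).
# PARSER selects the HTML parser; HTMLPARSER passes lexical=true on to TagSoup.
cat > keep-comments.bxs <<'EOF'
SET PARSER html
SET HTMLPARSER lexical=true
CREATE DB htmldocs ./example.html
EOF

# Show what was written so it can be checked before running it.
cat keep-comments.bxs
```

The script should then be runnable with "basex -c keep-comments.bxs",
assuming the basex launcher is on your PATH. When pointing CREATE DB at a
directory rather than a single file, CREATEFILTER will probably need to be
set as well, since it defaults to *.xml.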
Hope this helps,
Christian
[1] http://docs.basex.org/wiki/Options#HTMLPARSER
On Fri, Aug 18, 2017 at 5:02 PM, Graydon Saunders <graydonish@gmail.com> wrote:
> Hi Christian --
>
> There's no query! This is about loading the files into a DB with the GUI.
>
> I've attached two files.
>
> If I load them as Database->New with "input format" HTML, the comments go
> away.
>
> If I load them the same way but with "lexical" as a TagSoup parser option,
> the comments still go away. I expect "lexical" to be the TagSoup option that
> keeps comments from going away (and that retains the DOCTYPE in the example
> that has one).
>
> If I use
>
>   java -jar /usr/share/java/tagsoup.jar --lexical --files *html
>
> from the command line, the comments do NOT go away, so I don't think it's a
> TagSoup problem, at least not with 1.2.1.
>
> thanks!
> Graydon
>
> On Fri, Aug 18, 2017 at 9:07 AM, Christian Grün <christian.gruen@gmail.com>
> wrote:
>>
>> Hi Graydon,
>>
>> A little example query and input file would be great (the smaller, the
>> better).
>>
>> Thanks in advance,
>> Christian
>>
>>
>>
>> On Fri, Aug 18, 2017 at 2:40 PM, Graydon Saunders <graydonish@gmail.com>
>> wrote:
>> > Hello --
>> >
>> > So I have a pile of near-XML HTML with semantically significant comments
>> > to deal with. (I must have been sinning much more than I realized!)
>> >
>> > Using BaseX866-20170818.124137, BaseX will parse the content but all the
>> > comments go away. This is with passing the "lexical" option on the parser
>> > tab where it asks for TagSoup options, which I understand from
>> > https://github.com/orbeon/tagsoup/blob/master/trunk/README to pass
>> > through comments (and DOCTYPE declarations).
>> >
>> > How do I parse HTML and keep the comments?
>> >
>> > Thanks!
>> > Graydon