TagSoup and html5 support


Alexander Shpack

21 Dec 2016

6:33 a.m.

Hi team!

As you know, TagSoup 1.2.1 doesn't support correct HTML5 tag nesting. For example, this string Test will be parsed as

But html5 code <aside>Test</aside> will be parsed as is.

How can HTML5 support be implemented in BaseX? I've found this project, but I'm not sure whether it can be added to the BaseX code: https://github.com/UniversityofWarwick/tagsoup-html5

Thanks!


Christian Grün

21 Dec 2016

8:33 a.m.

Hi Alex,

Currently, there is no alternative I have in mind. As I assume that the original author of TagSoup stopped development quite a while ago, it could indeed be interesting to find alternatives or extended versions of the original TagSoup code. Have you already tried the project you mentioned in your e-mail?

Cheers, Christian


Alexander Shpack

8:44 a.m.

On Wed, Dec 21, 2016 at 3:33 PM, Christian Grün christian.gruen@gmail.com wrote:

> Hi Alex,
>
> Currently, there is no alternative I have in mind. As I assume that the original author of TagSoup stopped development quite a while ago, it could indeed be interesting to find alternatives or extended versions of the original TagSoup code. Have you already tried the project you mentioned in your e-mail?

Nope, I haven't. It would have to be implemented in the BaseX server, as we are using the RESTXQ functionality.

Christian Grün

8:46 a.m.

> Nope, I haven't. It would have to be implemented in the BaseX server, as we are using the RESTXQ functionality.

If we find working and lightweight alternatives, we could replace the original distribution of TagSoup with the new solution. Suggestions are welcome.

George Sofianos

8:54 a.m.

> If we find working and lightweight alternatives, we could replace the original distribution of TagSoup with the new solution. Suggestions are welcome.

Speaking of suggestions, how do you feel about adding Apache HttpClient to BaseX? It can help with requesting gzipped XML files (which makes a huge difference for large XML files), and it could possibly use the HTTP cache mechanism.

Regards, George

Christian Grün

8:58 a.m.

> Speaking of suggestions, how do you feel about adding Apache HttpClient to BaseX?

Would it also help us convert HTML5, or is it a general suggestion? ;)

> It can help with requesting gzipped XML files (which makes a huge difference for large XML files), and it could possibly use the HTTP cache mechanism.

Out of interest: Where would this come into play? When using http:send-request, or also in other places?

George Sofianos

9:06 a.m.

> Would it also help us convert HTML5, or is it a general suggestion? ;)

Unfortunately no, it was a general suggestion :( In our projects, though, we are using https://jsoup.org/; it works well and is very easy to use. I still prefer XPath over the CSS selectors.

> Out of interest: Where would this come into play? When using http:send-request, or also in other places?

I'm talking about calls that happen via XQuery doc("http://randomhost.rn/random.xml"). I'm not sure whether they request gzipped files. I think I tested it once and they didn't. For example, fetching a 233 MB XML file with gzip compression only requires transferring 27.8 MB (this is a random file; the compression ratio will vary between XML files). We are working with files that can be over 1 GB, so it can make a difference in bandwidth and execution (compilation) time.
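To put the quoted figures in perspective, here is a small sketch of the arithmetic, written in Python for brevity. The helper name `savings` and the synthetic sample document are assumptions made for illustration; only the 233 MB and 27.8 MB figures come from the message above.

```python
import gzip


def savings(original_size: float, compressed_size: float) -> float:
    """Return the fraction of bytes saved by compression."""
    return 1 - compressed_size / original_size


# Figures quoted in the message above, in megabytes.
print(f"saved: {savings(233, 27.8):.1%}")    # saved: 88.1%
print(f"ratio: {233 / 27.8:.1f}x smaller")   # ratio: 8.4x smaller

# Hypothetical, highly repetitive sample document; real XML files
# will compress better or worse depending on their content.
sample = b"<record><id>1</id><name>example</name></record>" * 10_000
compressed = gzip.compress(sample)
print(f"sample: {len(sample)} bytes raw, {len(compressed)} bytes gzipped")
```

Repetitive markup such as element names tends to compress well, which is consistent with the large saving reported here, but the exact ratio depends on the file.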

Christian Grün

9:17 a.m.

> In our projects, though, we are using https://jsoup.org/; it works well and is very easy to use.

Interesting. Is it possible to use it for converting HTML to XML?

> I'm talking about calls that happen via XQuery doc("http://randomhost.rn/random.xml").

I see. So it probably sends request headers like "Accept-Encoding: x-compress; x-zip" to the server and unzips the result, is this right?

Maybe we could easily realize something similar in BaseX without an additional library, at least for (g)zipped streams (because I still try to keep the BaseX distribution as small as possible...). There is already an existing issue for that [1]. I don’t know much about HTTP caching so far, though.

Cheers, Christian

[1] https://github.com/BaseXdb/basex/issues/1381


George Sofianos

9:36 a.m.

> Interesting. Is it possible to use it for converting HTML to XML?

I'm not really sure about that. It looks like it parses HTML into a DOM document object, so I'm not sure whether this can work with BaseX.
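To make "converting HTML to XML" concrete, here is a minimal sketch that uses only Python's standard-library `html.parser`, swapped in for TagSoup or jsoup purely for illustration. It closes elements that were left open and ignores stray end tags, but it does not implement the HTML5 nesting rules discussed in this thread. The wrapper element `html`, the class `SoupToXml`, and the function `html_to_xml` are names invented for this sketch.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Elements that never have content or an end tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class SoupToXml(HTMLParser):
    """Build a well-formed element tree from loosely structured HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("html")   # assumed wrapper element
        self.stack = [self.root]         # currently open elements

    def _add(self, tag, attrs):
        # Boolean attributes (value None) get their own name as value.
        attributes = {k: (v if v is not None else k) for k, v in attrs}
        return ET.SubElement(self.stack[-1], tag, attributes)

    def handle_starttag(self, tag, attrs):
        element = self._add(tag, attrs)
        if tag not in VOID:
            self.stack.append(element)

    def handle_startendtag(self, tag, attrs):
        self._add(tag, attrs)            # self-closing: never left open

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element, if any.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                return

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):                  # text after a child element
            parent[-1].tail = (parent[-1].tail or "") + data
        else:                            # text before any child
            parent.text = (parent.text or "") + data


def html_to_xml(html: str) -> str:
    parser = SoupToXml()
    parser.feed(html)
    parser.close()
    return ET.tostring(parser.root, encoding="unicode")


print(html_to_xml("<aside>Test</aside><p>a<br>b"))
```

The unknown-to-HTML4 element `aside` passes through unchanged here only because this sketch has no schema of allowed elements at all; a real converter needs such rules to repair nesting correctly.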

> I see. So it probably sends request headers like "Accept-Encoding: x-compress; x-zip" to the server and unzips the result, is this right?

Yes, it sends the request with Accept-Encoding for gzip, retrieves the gzipped response, and then unzips the content into a stream.
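The exchange described above can be sketched with the Python standard library alone. This is an illustration of the HTTP mechanics, not of how HttpClient or BaseX implements them; the function names `build_request`, `open_stream`, and `fetch_xml` are assumptions of this sketch.

```python
import gzip
import io
import urllib.request


def build_request(url: str) -> urllib.request.Request:
    """Ask the server for a gzip-compressed representation."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})


def open_stream(content_encoding, raw):
    """Wrap the raw response in a decompressing stream when needed.

    `raw` is any binary file-like object, such as an HTTP response.
    Servers are free to ignore Accept-Encoding, so the decision is
    based on the Content-Encoding header of the response.
    """
    encoding = (content_encoding or "").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        return gzip.GzipFile(fileobj=raw, mode="rb")
    return raw


def fetch_xml(url: str) -> bytes:
    """Fetch a document, transparently unzipping a gzipped response."""
    with urllib.request.urlopen(build_request(url)) as response:
        stream = open_stream(response.headers.get("Content-Encoding"), response)
        return stream.read()


# Offline demonstration with an in-memory "response" body.
xml = b"<root><item/></root>"
body = io.BytesIO(gzip.compress(xml))
print(open_stream("gzip", body).read())
```

Because the decompression wraps the response stream rather than buffering it, a parser reading from it never needs the whole compressed file in memory, which matters for the large files mentioned in this thread.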

> I don’t know much about HTTP caching so far, though.

HttpClient supports some caching libraries, which means it can download the XML files into custom disk storage and then just check whether they have changed on every document request. If the file hasn't changed and the server supports HTTP caching, a 304 response is returned to the client, so it doesn't need to download the file a second time.
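The revalidation step described above can be sketched as follows, again in plain Python rather than HttpClient's actual API. The `CachedDocument` type and the two helper functions are hypothetical names; the headers `ETag`, `Last-Modified`, `If-None-Match`, and `If-Modified-Since` and the 304 status are standard HTTP.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CachedDocument:
    body: bytes
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def validation_headers(cached: Optional[CachedDocument]) -> Dict[str, str]:
    """Headers that ask the server to send the body only if it changed."""
    headers: Dict[str, str] = {}
    if cached is None:
        return headers
    if cached.etag:
        headers["If-None-Match"] = cached.etag
    if cached.last_modified:
        headers["If-Modified-Since"] = cached.last_modified
    return headers


def apply_response(cached: Optional[CachedDocument], status: int,
                   headers: Dict[str, str], body: bytes) -> CachedDocument:
    """Return the document to use after the server has answered."""
    if status == 304:
        if cached is None:
            raise ValueError("304 received without a cached copy")
        return cached                    # unchanged: reuse the stored copy
    # Any other successful response replaces the stored copy.
    return CachedDocument(body=body,
                          etag=headers.get("ETag"),
                          last_modified=headers.get("Last-Modified"))


stored = CachedDocument(body=b"<root/>", etag='"v1"')
print(validation_headers(stored))
print(apply_response(stored, 304, {}, b"") is stored)
```

A 304 response carries no body, so for an unchanged 1 GB file the cost of a repeated request shrinks to a single header round trip.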

Alexander Shpack

12:05 p.m.

On Wed, Dec 21, 2016 at 3:46 PM, Christian Grün christian.gruen@gmail.com wrote:

>> Nope, I haven't. It would have to be implemented in the BaseX server, as we are using the RESTXQ functionality.
>
> If we find working and lightweight alternatives, we could replace the original distribution of TagSoup with the new solution. Suggestions are welcome.

I think TagSoup is good enough, but it requires some HTML5 tuning. I don't know of any library that is lightweight and has the same feature list as TagSoup.

-- s0rr0w


basex-talk@mailman.uni-konstanz.de
