I just prepared basex_7.1.1-2_all.deb which is available from:
deb http://files.basex.org/debian unstable/
deb-src http://files.basex.org/debian unstable/
* Problem: Want to parse non well-formed HTML
$ cat bad.html
<html>
<ul>
<li>A
<li>B
</ul>
</html>

$ basex -c 'create db html bad.html'
"/home/holu/bad.html" (Line 5): </ul> found, </li> expected.
The input may be correctly parsed after switching off the internal XML parser.
* Solution: Have tagsoup installed and set it as parser
$ sudo aptitude install libtagsoup-java
The following NEW packages will be installed:
  libtagsoup-java
0 packages upgraded, 1 newly installed, 0 to remove and 88 not upgraded.
Need to get 99.0 kB of archives. After unpacking 138 kB will be used.
Get: 1 ftp://ftp.debian.org/debian/ unstable/main libtagsoup-java all 1.2.1-1 [99.0 kB]
Fetched 99.0 kB in 0s (305 kB/s)
Selecting previously unselected package libtagsoup-java.
(Reading database ... 89487 files and directories currently installed.)
Unpacking libtagsoup-java (from .../libtagsoup-java_1.2.1-1_all.deb) ...
Processing triggers for man-db ...
Setting up libtagsoup-java (1.2.1-1) ...

$ basex -c 'set parser html; create db html bad.html'
$ basex -q "doc('html')"
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <ul>
      <li>A</li>
      <li>B</li>
    </ul>
  </body>
</html>
Available in Debian package version 7.1.1-2
Cheers, Alex
On 22.02.2012, at 12:26, Alexander Holupirek wrote:
I'll have a look at this. If tagsoup is present on a Debian system, it should be detected automatically. If not, it's a fault of the package and I'll fix it.
On 22.02.2012, at 12:24, Christian Grün wrote:
Tagsoup needs to be embedded in your classpath (which is the case if BaseX is downloaded from our homepage). If you have installed BaseX via the Debian package manager, you'll have to manually embed the tagsoup.jar in the BaseX start scripts.
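As a rough sketch of what "embedding in the start scripts" could look like: the snippet below puts both jars on one classpath and starts BaseX through a small wrapper. The jar paths under /usr/share/java and the wrapper name basex_with_tagsoup are assumptions for illustration, not something confirmed in this thread, so check where your packages actually installed the files.

```shell
#!/bin/sh
# Assumed jar locations (not stated in this thread); adjust to your system.
BASEX_JAR=/usr/share/java/basex.jar
TAGSOUP_JAR=/usr/share/java/tagsoup.jar

# Put both jars on one classpath so BaseX can find TagSoup at startup.
CP="$BASEX_JAR:$TAGSOUP_JAR"

# Hypothetical wrapper: starts the BaseX standalone client with the
# extended classpath and passes all arguments through unchanged.
basex_with_tagsoup() {
    java -cp "$CP" org.basex.BaseX "$@"
}

# Usage, once both jars exist at the paths above:
#   basex_with_tagsoup -c 'set parser html; create db html bad.html'
```

The same classpath line could instead be edited into the existing start script; the wrapper just keeps the packaged script untouched.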
Hope this helps, Christian
Well, all I know is that http://docs.basex.org/wiki/Parsers should mention what to do to read HTML, and on my machine there is:

$ apt-cache search tagsoup-java
libtagsoup-java - SAX-compliant parser for real-life HTML
libtagsoup-java-doc - API Documentation for TagSoup
Mainly it is tags like <img ...> without /> that throw basex off track.