Our mission today is to use BaseX to remove tags injected right between the bytes of multibyte UTF-8 characters.
http://www.couchsurfing.org/group_read.html?gid=430&post=13986932
"CG" == Christian Grün christian.gruen@gmail.com writes:
CG> Have you tried method=raw, as mentioned in our documentation
CG> (http://docs.basex.org/wiki/Serialization)?
Sorry. Try it yourself:

  echo '<A>你好</A>' | perl -pwle 's![^[:ascii:]]!$&<wbr/>!' |
  basex -q 'declare option db:parser "html";
            declare option output:method "raw";
            doc("/dev/stdin")//*:wbr/..'
There is no way to cleanly restore the shattered UTF-8.
I would also like to try
declare option output:encoding "RAW"; or "BYTES" or "NONE"
but on http://docs.basex.org/wiki/Serialization it just says "all encodings supported by Java". So one is supposed to look at http://www.google.com/search?q=all+encodings+supported+by+Java etc. etc.
Why doesn't BaseX have a command that would output the current "all encodings supported by Java" that it is using?