"CG" == Christian Grün christian.gruen@gmail.com writes:
CG> Jidanni,
echo '<A>你好</A>'|perl -pwle 's![^[:ascii:]]!$&<wbr/>!'|basex -q ' declare option db:parser "html"; declare option output:method "raw"; doc("/dev/stdin")//*:wbr/..'
CG> If you want help, please try to help, too. Your example is not what I
CG> would call very helpful; give us at least:
CG> a) a minimized example,
That's what it is, totally contained. Just run it on your Linux etc. shell command line.
CG> b) the returned output, and
OK, here it is QP encoded: =EF=BF=BD=EF=BF=BD=EF=BF=BD=E5=A5=BD=
CG> c) the expected result
I'm just trying to find a way to remove the <wbr/> injected here,
$ echo '<A>你好</A>'|perl -pwle 's![^[:ascii:]]!$&<wbr/>!'|qprint -e
<A>=E4<wbr/>=BD=A0=E5=A5=BD</A>
So I can get
<A>=E4=BD=A0=E5=A5=BD</A>
I am guessing that is not possible with BaseX, and one needs byte-level tools like Perl.
declare option output:encoding "RAW"; or "BYTES" or "NONE"
CG> I’m not sure if you will need any output declaration for your query at
CG> all; but we first need more details.
At http://docs.basex.org/wiki/Serialization it just says "all encodings supported by Java". So one is supposed to look at http://www.google.com/search?q=all+encodings+supported+by+Java ?
CG> I've added a link. Note, however, that the list is also dependent on
CG> the Java VM you are using.
OK, also do make a note of that fact there...
Why doesn't BaseX have a command that would output the current "all encodings supported by Java" that it is using?
CG> Try this:
CG> basex "Q{java.nio.charset.Charset}availableCharsets()"
Gawd!
$ basex "Q{java.nio.charset.Charset}availableCharsets()"|wc
      0     167    3593
One big line and everything is repeated twice!
$ basex "Q{java.nio.charset.Charset}availableCharsets()"| perl -nwle 'print for /([^\s{]+)=/g'|wc
    167     167    1713
looks much nicer and has half the bytes.
Do make a note of it on the wiki there. Thanks.