As Liam indicated (thanks!), XQuery may not be the best choice for processing data at the byte level: XQuery was built to work with Unicode characters as its basic unit, which means that it will never be possible to create illegal UTF-8 sequences with pure XQuery. This also means that the language provides no support for "repairing" invalid input.
I wonder if you have enough control over your input to avoid the UTF-8 shattering? If there's no choice, and if you still want to try XQuery/BaseX for byte processing, you can play around with the functions of the Conversion Module:
http://docs.basex.org/wiki/Conversion_Module
___________________________
On Tue, Jan 1, 2013 at 5:50 AM, jidanni@jidanni.org wrote:
"LREQ" == Liam R E Quin liam@w3.org writes:
LREQ> Treating the individual UTF-8 octets individually? Yes.
LREQ> Not in standard XQuery, but that doesn't preclude a BaseX extension...

Well no big deal, I was just curious.
I was just curious whether there was a way in BaseX to do s!<wbr/>!!g like I can in Perl, to restore the damaged UTF-8 characters.
LREQ> Note that "damaged UTF-8 characters", if by that you mean not
LREQ> well-formed UTF-8, aren't going to come through email reliably, so I
LREQ> might not be seeing what you wrote - s!<wbr/>!!g can be done with
Don't worry. I wouldn't put any illegal chars into mail.
LREQ> replace() but getting at UTF-8-encoded characters one octet at a time is
LREQ> another matter. But, my goal in replying was to tease out enough
LREQ> information from you that someone else could answer :-)
http://www.couchsurfing.org/group_read.html?gid=430&post=13998575
LREQ> This says, "this thread has been deleted" at me.

In fact they deleted the entire group it turns out.
Anyway here's what I posted there:

#!/usr/bin/perl
# Shows line where we remove couchsurfing.org's UTF-8 shattering effects.
# Must run this before the browser gets its hands on it and turns the
# shattered UTF-8 into U+FFFD REPLACEMENT CHARACTER.
# So that seems to count out greasemonkey, etc. solutions.
# I used wwwoffle -o URL|./this_program after first browsing the page logged in
# in a browser that used wwwoffle as a proxy
# Copyright       : http://www.fsf.org/copyleft/gpl.html
# Author          : Dan Jacobson -- http://jidanni.org/
# Created On      : 12/31/2012
# Last Modified On: Mon Dec 31 13:12:57 2012
# Update Count    : 27
use strict;
use warnings FATAL => 'all';
my $N = qr/[^[:ascii:]]/;
while (<>) {
    my $original_line = $_;
    ## needed on e.g., http://www.couchsurfing.org/couchmanager?read=18541584
    s!<wbr/>!!g;
    ## needed on e.g.,
    ## http://www.couchsurfing.org/couchrequest/show_couchoffer_form?city_couchrequ...
    s!($N) ($N)!$1$2!g;
    s!\t<span class="show_more_control">\s+<br />!! && chomp;
    m!^\s+...<a class="show_more_link" href="#"> (more) </a><br />! && next;
    s!\s*</span><span class="show_more_text" style="display: none;"> !!;
    print "$.: $_" if $_ ne $original_line;
}
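
For readers who want to see why this has to happen on raw bytes before anything decodes the page, here is a minimal sketch of the script's first two substitutions in Python (chosen only because it is easy to run and check; it is neither the XQuery nor the Perl from this thread). The names repair_line, WBR and SPLIT_BY_SPACE are made up for illustration, and the sample bytes are constructed by hand, not taken from the site.

```python
import re

# Hypothetical names for illustration; not part of the original script.
WBR = b"<wbr/>"
# One space wedged between two non-ASCII bytes (0x80-0xFF).
SPLIT_BY_SPACE = re.compile(rb"([\x80-\xff]) ([\x80-\xff])")


def repair_line(raw: bytes) -> bytes:
    """Undo both kinds of damage on one line of undecoded bytes."""
    raw = raw.replace(WBR, b"")              # like s!<wbr/>!!g
    return SPLIT_BY_SPACE.sub(rb"\1\2", raw)  # like s!($N) ($N)!$1$2!g


if __name__ == "__main__":
    good = "中文".encode("utf-8")  # bytes: e4 b8 ad e6 96 87
    # Hand-made damage: <wbr/> inside the first character,
    # a stray space inside the second.
    damaged = good[:1] + WBR + good[1:4] + b" " + good[4:]

    # Decoding first is what loses the data: the split sequences
    # become U+FFFD and the original bytes are gone.
    assert "\ufffd" in damaged.decode("utf-8", errors="replace")

    # Repairing the bytes first restores the text.
    assert repair_line(damaged).decode("utf-8") == "中文"
```

As written, the sketch would also drop a genuine space between two non-ASCII characters, since it only looks at the bytes on either side of the space. Requiring the byte after the space to be a continuation byte (0x80-0xBF) would be one way to narrow that, if it matters for the input at hand.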