In article <bdf1473550.boase(a)boase.demon.co.uk>,
Bernard Boase <b.boase(a)bcs.org> wrote:
Just looked at the site www.world-science.net
Netsurf renders much of its text with inter-syllable sequences Â
which, in the original HTML, are all hex C2 AD.
This is UTF-8 for "soft hyphen" (U+00AD), which is intended to give a
browser a hint as to where a word could be split across a line boundary,
as in printed hyphenation. Netsurf isn't handling this encoding, it seems.
If there is no need to break across a line boundary then the hyphen should
be silently ignored - as Firefox does.
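
For anyone who wants to check this themselves, here is a minimal Python sketch. It only illustrates the byte-level point and says nothing about how Netsurf or Firefox work internally. The helper name strip_soft_hyphens is made up for the example, and the assumption that the stray "Â" comes from reading the bytes as ISO-8859-1 is a guess based on the symptom described.

```python
import unicodedata

# The byte pair reported in the page source.
raw = bytes([0xC2, 0xAD])

# Decoded as UTF-8, the two bytes form one character: the soft hyphen.
as_utf8 = raw.decode("utf-8")
print(f"U+{ord(as_utf8):04X}", unicodedata.name(as_utf8))

# Decoded as ISO-8859-1 (assumed here), each byte becomes its own
# character: U+00C2 (the visible "Â") followed by U+00AD.
as_latin1 = raw.decode("latin-1")
print([f"U+{ord(c):04X}" for c in as_latin1])

# The byte pair C2 AD is not the same thing as the code point U+C2AD,
# which has a different, three-byte UTF-8 encoding.
print(chr(0xC2AD).encode("utf-8").hex(" "))


def strip_soft_hyphens(text: str) -> str:
    """Hypothetical helper: drop soft hyphens when no line break is needed."""
    return text.replace("\u00ad", "")


print(strip_soft_hyphens("hy\u00adphen\u00ada\u00adtion"))
```

Dropping the character outright is the crudest possible handling; a renderer that does hyphenation would instead keep it as a candidate break point and only show a hyphen when the line actually breaks there.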
Is this legitimate HTML perhaps for automatic hyphenation or
something? Should Netsurf edit it out? Firefox does.
Whilst HTML entity &#xC2AD; seems to be valid,
tell us that U+C2AD is not a valid Unicode character.
I'm sorry to say that all of the different 'encodings' on that web
document are generated on the fly as the document is being served -
auto-magically - but blindly. If the code is not valid Unicode then
that is it - all bets are off! The bytes C2 AD are the correct UTF-8
encoding of the Unicode code point U+00AD - try looking at