Pharo XMLParser claims U+00A0 is "Invalid UTF-8"

Given the input:

<?xml version='1.0' encoding='UTF-8' standalone='yes' ?> 
<sms body=". what" />

where the character after the . in the body attribute of the sms tag is U+00A0,

I get the error:

XMLEncodingException: Invalid UTF-8 character encoding (line 2) (column 13)

IIUC, the UTF-8 representation of that character is 0xC2 0xA0, per Wikipedia. Sure enough, bytes 72 and 73 of the input are 194 and 160 respectively.

This looks like a bug in XMLParser, or am I missing something?
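The byte values can be checked outside Pharo. A minimal sketch in Python, used here only as a neutral checker and not part of the original report:

```python
# U+00A0 is NO-BREAK SPACE; encode it as UTF-8 and inspect the bytes.
nbsp = "\u00a0"
encoded = nbsp.encode("utf-8")

print(encoded.hex())   # hexadecimal form of the two bytes
print(list(encoded))   # decimal form, comparable to the 194 and 160 above
```

This agrees with the two-byte sequence 0xC2 0xA0 (194, 160), so the input file itself is valid UTF-8.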

2016-07-28 Sean DeNigris

Can't reproduce with XMLDOMParser parse: '…'?

Thanks to Monty for coming to the rescue on the Pharo Users list:

You're double decoding. Use onFileNamed:/parseFileNamed: instead (and the DOM printToFileNamed: family of messages when writing) and let XMLParser take care of this for you, or disable XMLParser decoding before parsing with #decodesCharacters:.

Longer explanation:

The class-side #on: and #parse: methods take either a string or a stream (read the definitions). You gave it a FileReference, but because the argument is tested with isString and sent #readStream otherwise, it didn't blow up at that point.

File refs sent #readStream return file streams that do automatic decoding. But XMLParser automatically attempts its own decoding too, if:

The input starts with a BOM or it can be inferred by null bytes before or after the first non-null byte.

There is an encoding declaration with a non-UTF-8 encoding.

There is a UTF-8 encoding declaration but the stream is not a normal ReadStream (your case).

So it gets decoded twice, and the decoded value of the char causes the error. I'll consider changing the heuristic to make it less eager to decode.
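The double-decoding failure can be modelled outside Pharo. The following Python sketch is an illustration only. It assumes the second decoder treats each already-decoded character as a raw byte, which is a simplification and not XMLParser's actual implementation:

```python
# Bytes of the sms line as stored on disk: U+00A0 is the pair C2 A0.
raw = b'<sms body=".\xc2\xa0what" />'

# First decoding, as the file stream would do automatically.
once = raw.decode("utf-8")

# Model of the second decoder's input: every character is read back as
# one byte. latin-1 maps code points 0-255 to identical byte values.
as_bytes_again = once.encode("latin-1")

try:
    as_bytes_again.decode("utf-8")
    second_decode_failed = False
except UnicodeDecodeError as err:
    second_decode_failed = True
    error_position = err.start  # 0-based offset of the offending byte
    print("second decode failed at offset", error_position)
```

After the first pass the character is the single value 0xA0. On its own, 0xA0 is a UTF-8 continuation byte with no lead byte before it, so a second UTF-8 pass rejects it. In this model the failure lands on the 13th character of the line, which is consistent with the column reported in the error.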

2016-08-08 12:45:20
