[Vobject] unicode problem
Jeffrey Harris
jeffrey at osafoundation.org
Wed May 28 14:33:44 CDT 2008
Hi Anil,
> BEGIN:VCARD
> VERSION:2.1
> N;CHARSET=UTF-8:M.Sc.;Beno\303\256t Lef\303\251vre,
> FN;QUOTED-PRINTABLE:Beno=EEt Lef=E9vre, M.Sc.
> EMAIL;PREF;INTERNET:someone at something.com
> END:VCARD
>
>
> Since the data is in utf-8, I wanted to decode to Unicode so I do:
> name = vcard.fn.value.encode('utf-8')
>
> but that is throwing
> <type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode
> byte 0xee in position 4: ordinal not in range(128)
>
>
> I am still new to Unicoding, so I might've missed something obvious. Thanks!
You're doing everything right; this is bug 9814:
https://bugzilla.osafoundation.org/show_bug.cgi?id=9814
Just to be clear for posterity reading the mailing list, when you cut
and pasted the vCard into email, you cut and pasted utf-8 octets as
their escaped representations (\303\256, for instance), but I know what
you meant :)
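To spell out the notation: \303\256 is the octal escape for the two utf-8 octets 0xC3 0xAE. A small sketch (Python 3, not part of vobject; the variable names are only illustrative) showing that those octets and the =EE byte in the quoted-printable FN line are the same character in two different encodings:

```python
# The octal escapes \303\256 from the pasted vCard, as raw bytes.
utf8_octets = b"\303\256"

# Decoded as utf-8, the two octets form a single character.
i_circumflex = utf8_octets.decode("utf-8")

# The quoted-printable FN line carries =EE, which is the same
# character encoded as one iso-8859-1 byte.
latin1_octet = b"\xee"
same_char = latin1_octet.decode("iso-8859-1")

print(i_circumflex == same_char)
```

The 0xee byte is also the one named in the traceback above: it is not valid ASCII, so an implicit ASCII decode of the raw value fails on it.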
A helpful user has submitted a patch for this problem, and I'm
(belatedly) working on committing it now.
One thing to note: by default, vobject expects to receive more modern
streams than VCARD 2.1 (VCARD 3.0 or VCALENDAR 2.0), so it ignores the
CHARSET parameter. Instead, it hopes the entire stream is unicode; if
it's not, it tries to decode it as (in this order): 'utf-8', 'utf-16-LE',
'utf-16-BE', 'iso-8859-1'.
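The fallback order described above can be sketched roughly as follows. This is Python 3 and is not vobject's actual code; the names FALLBACK_ENCODINGS and decode_with_fallbacks are made up for illustration.

```python
# Hypothetical helper, not vobject's implementation: try each encoding
# in the order listed above and keep the first one that succeeds.
FALLBACK_ENCODINGS = ("utf-8", "utf-16-le", "utf-16-be", "iso-8859-1")


def decode_with_fallbacks(raw, encodings=FALLBACK_ENCODINGS):
    """Return (text, encoding_used) for a byte stream or unicode string."""
    if isinstance(raw, str):
        # Already unicode, nothing to decode.
        return raw, None
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("stream could not be decoded with any fallback")


text, used = decode_with_fallbacks(b"Lef\xe9vre")
print(text, used)
```

Since iso-8859-1 maps every byte to a character, it always succeeds and so has to come last. A guessing scheme like this can also pick the wrong encoding when the bytes happen to be valid in an earlier one, which is one reason an explicit CHARSET parameter is more reliable when it is honored.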
However, after I commit the fix for bug 9814, you'll be able to pass an
allowQP flag to readOne and readComponents, which turns on slower state
machine parsing instead of regular expressions (necessary to handle
quoted-printable). When this flag is passed in as True, each content
line's CHARSET parameter will be used for decoding its value into
unicode, defaulting to iso-8859-1.
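To illustrate what per-line decoding amounts to, here is a minimal sketch using only Python 3's standard quopri module. It does not call vobject, and decode_qp_value is a hypothetical name; it only shows the idea of undoing quoted-printable and then applying the line's charset, with iso-8859-1 as the default.

```python
import quopri


def decode_qp_value(raw_value, charset=None):
    """Decode a quoted-printable value into unicode.

    charset stands in for the content line's CHARSET parameter;
    when it is missing, iso-8859-1 is assumed.
    """
    octets = quopri.decodestring(raw_value)
    return octets.decode(charset or "iso-8859-1")


# The FN value from the vCard above has no CHARSET, so iso-8859-1 applies.
print(decode_qp_value(b"Beno=EEt Lef=E9vre, M.Sc."))
```

With an explicit charset, the same helper handles utf-8 data written as quoted-printable, for example decode_qp_value(b"Beno=C3=AEt", "utf-8").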
Sincerely,
Jeffrey