[Vobject] unicode problem

Jeffrey Harris jeffrey at osafoundation.org
Wed May 28 14:33:44 CDT 2008


Hi Anil,

> BEGIN:VCARD
> VERSION:2.1
> N;CHARSET=UTF-8:M.Sc.;Beno\303\256t Lef\303\251vre,
> FN;QUOTED-PRINTABLE:Beno=EEt Lef=E9vre, M.Sc.
> EMAIL;PREF;INTERNET:someone at something.com
> END:VCARD
> 
> 
> Since the data is in utf-8, I wanted to decode to Unicode so I do:
> name = vcard.fn.value.encode('utf-8')
> 
> but that is throwing
> <type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode
> byte 0xee in position 4: ordinal not in range(128)
> 
> 
> I am still new to Unicoding, so I might've missed something obvious. Thanks!

You're doing everything right; this is bug 9814:
https://bugzilla.osafoundation.org/show_bug.cgi?id=9814

Just to be clear for anyone reading the mailing list archives later: 
when you cut and pasted the vCard into email, the utf-8 octets came 
through as their escaped representations (\303\256, for instance), but 
I know what you meant :)
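
For readers who want to check what those escapes stand for, here is a 
minimal sketch in Python 3 (not vobject code, only the standard 
library). It assumes the octal escapes in the pasted N line are the raw 
utf-8 octets of the name:

```python
# The octal escapes from the pasted vCard, written as a bytes literal.
# \303\256 is the two-octet utf-8 encoding of U+00EE, \303\251 of U+00E9.
raw = b"Beno\303\256t Lef\303\251vre"

# Decoding the octets as utf-8 recovers the intended name.
name = raw.decode("utf-8")
print(name)  # Benoît Lefévre
```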

A helpful user has submitted a patch for this problem, and I'm 
(belatedly) working on committing it now.

One thing to note: by default vobject expects to receive more modern 
streams than VCARD 2.1 (VCARD 3.0 or VCALENDAR 2.0), so it ignores the 
CHARSET parameter. Instead it hopes the entire stream is unicode; if 
it's not, it tries to decode it as (in this order): 'utf-8', 
'utf-16-LE', 'utf-16-BE', 'iso-8859-1'.
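
The fallback order described above can be sketched roughly as follows. 
This is an illustration in Python 3, not vobject's actual code; the 
names FALLBACK_ENCODINGS and decode_stream are made up for the example:

```python
# Hypothetical names; only the encoding order comes from the text above.
FALLBACK_ENCODINGS = ("utf-8", "utf-16-LE", "utf-16-BE", "iso-8859-1")


def decode_stream(data):
    """Return text, trying each fallback encoding in order for bytes input."""
    if isinstance(data, str):
        # Already unicode, nothing to decode.
        return data
    for encoding in FALLBACK_ENCODINGS:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Not reachable in practice: iso-8859-1 maps every byte to a character.
    raise ValueError("could not decode stream")


print(decode_stream(b"Lef\xe9vre"))  # falls through to iso-8859-1
```

One consequence of trying encodings in a fixed order, at least in this 
sketch: the first one that decodes without error wins, so an 
even-length iso-8859-1 byte string can be accepted by the utf-16 step 
before iso-8859-1 is ever tried.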

However, after I commit the fix for bug 9814, you'll be able to pass an 
allowQP flag to readOne and readComponents, which turns on slower state 
machine parsing instead of regular expressions (necessary to handle 
quoted-printable).  When this flag is passed in as True, the charset 
parameters of individual content lines will be used for decoding values 
into unicode, defaulting to iso-8859-1.
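
To illustrate the per-line behaviour described above, here is a small 
sketch using only Python 3's standard library. It is not the patch for 
bug 9814; decode_content_value and its arguments are hypothetical, and 
the only things taken from the text are the quoted-printable handling 
and the iso-8859-1 default:

```python
import quopri


def decode_content_value(raw_value, charset=None, quoted_printable=False):
    """Decode one content line's value (bytes) into unicode text.

    Hypothetical helper: undo quoted-printable if requested, then decode
    with the line's charset, defaulting to iso-8859-1 when none is given.
    """
    if quoted_printable:
        raw_value = quopri.decodestring(raw_value)
    return raw_value.decode(charset or "iso-8859-1")


# The FN line from the vCard above has no CHARSET, so iso-8859-1 applies.
fn = decode_content_value(b"Beno=EEt Lef=E9vre, M.Sc.", quoted_printable=True)
print(fn)  # Benoît Lefévre, M.Sc.
```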

Sincerely,
Jeffrey

