[Vobject] importing UTF-8 text - UnicodeDecodeError

Tue Mar 16 10:27:34 CDT 2010

Hi Matthew, 

> I'm using vobject to parse data from a couple of CSV files and merge them to
> vcards, eventually to import into Mac Address Book.  One of the files is ASCII,
> the other UTF-8.  The UTF-8 file is survey data that I'm just munging and
> putting into the Note field.

To avoid (or at least front-load) encoding issues when working with text, always decode to unicode as soon as you read the data, and encode only when you write it back out.

It sounds to me like you aren't doing the decode step. If you're opening your UTF-8 file with the builtin open, use codecs.open instead. There's a good explanation of codecs.open versus the builtin open at:

http://stackoverflow.com/questions/491921/unicode-utf8-reading-and-writing-to-files-in-python
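As a minimal sketch of that decode-on-read, encode-on-write pattern (the helper names read_text and write_text, the file name survey.txt, and the sample text are made up for illustration, not taken from your script):

```python
import codecs
import os
import tempfile


def read_text(path, encoding="utf-8"):
    # codecs.open decodes while reading, so the caller gets unicode text,
    # never raw UTF-8 bytes.
    with codecs.open(path, "r", encoding=encoding) as handle:
        return handle.read()


def write_text(path, text, encoding="utf-8"):
    # Encoding happens only here, at the point where text leaves the program.
    with codecs.open(path, "w", encoding=encoding) as handle:
        handle.write(text)


# Round trip through a temporary file (hypothetical sample data).
demo_path = os.path.join(tempfile.mkdtemp(), "survey.txt")
write_text(demo_path, u"Zo\u00eb likes caf\u00e9s")
note = read_text(demo_path)
```

Everything between read_text and write_text then works on unicode objects only, which is where the "front-loading" comes from: a bad byte sequence fails at read time rather than deep inside serialization.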

Since you're working with CSV, though, you're probably fighting with the fact that Python's csv module doesn't support unicode. Take a look at:

http://docs.python.org/library/csv.html#csv-examples

specifically unicode_csv_reader. You probably want to use something like that to wrap your inbound CSV data.
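Here is a rough sketch of the same idea. It is not the unicode_csv_reader recipe from the docs but a simpler variant: let csv parse the raw bytes and decode each cell afterwards, which is safe for UTF-8 because the delimiter and quote characters are plain ASCII. The name decoded_rows, the file name, and the sample data are hypothetical, and the version check is only there so the sketch also runs on interpreters whose csv module handles text natively.

```python
import csv
import io
import os
import sys
import tempfile


def decoded_rows(path, encoding="utf-8"):
    """Yield each CSV row as a list of unicode strings."""
    if sys.version_info[0] >= 3:
        # Newer interpreters: csv reads decoded text directly.
        with io.open(path, "r", encoding=encoding, newline="") as handle:
            for row in csv.reader(handle):
                yield row
    else:
        # Python 2: csv only understands byte strings, so parse the bytes
        # first and decode every cell afterwards.
        with open(path, "rb") as handle:
            for row in csv.reader(handle):
                yield [cell.decode(encoding) for cell in row]


# Hypothetical UTF-8 survey file, written as raw bytes.
csv_path = os.path.join(tempfile.mkdtemp(), "survey.csv")
with open(csv_path, "wb") as handle:
    handle.write(u"name,comment\r\nZo\u00eb,likes caf\u00e9s\r\n".encode("utf-8"))

rows = list(decoded_rows(csv_path))
```

Every cell in rows is unicode, so it can be munged and merged with the ASCII file's data without any further decoding.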

> I think I understand the problem: vcard is an ASCII-only specification, and the
> serialize method can't encode the unicode text as ASCII.  

Actually, RFC2426 is all about unicode, not ASCII. vCard 2.1 used quoted-printable for encoding, which makes older vcards a pain in the ass to parse, but it sounds like you're writing your own vcards, so you shouldn't need to mess with vCard 2.1 at all.

RFC2426 *is* agnostic about how you encode your unicode. Vobject tries to decode UTF-8, UTF-16, and even iso-8859-1, but when it serializes it always encodes as UTF-8 (I ought to make the encoding optional and let the user specify the encoding, but there hasn't been much of a clamor for anything but UTF-8).

> I tried quoted-printable for the encoding, but Address Book didn't decode it
> (left it as quoted-printable gobbledygook).  Then I tried binary, and Address
> Book crashed on import.

Always assign unicode, not UTF-8 encoded data, to the value of ContentLines, and you shouldn't see an encode error. (That specific error happens when the UTF-8 encoder is handed a non-unicode value and naively tries to convert it to unicode first by decoding it as ASCII.)
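To make that concrete, here is a small sketch that does not touch vobject at all. The helper to_unicode is a made-up name; it decodes byte strings and passes unicode through unchanged. The explicit decode("ascii") below reproduces by hand the step that Python 2 performs implicitly when a byte string is asked to encode itself.

```python
def to_unicode(value, encoding="utf-8"):
    # Decode byte strings; leave values that are already unicode alone.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# UTF-8 encoded bytes, as they would come out of an undecoded file read.
raw = u"caf\u00e9".encode("utf-8")
note_text = to_unicode(raw)

# The failing step, spelled out: non-ASCII bytes cannot be decoded as ASCII.
try:
    raw.decode("ascii")
    ascii_decode_failed = False
except UnicodeDecodeError:
    ascii_decode_failed = True
```

With vobject that would look something like card.add('note').value = to_unicode(raw), so that serialize only ever sees unicode.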

Sincerely,
Jeffrey