[Vobject] importing UTF-8 text - UnicodeDecodeError

Matthew Leingang leingang at courant.nyu.edu
Wed Mar 17 15:43:59 CDT 2010


Dear Jeffrey,

Thanks for helping out with my issue.  I've got it worked out as I described.  I can now read CSV files encoded as UTF-8, and by making sure all strings were unicode, writing works, too.

Now, if I may, I'd like to revise my question.  My original data files are actually UTF-16.  When I was working on this before, I re-encoded them as UTF-8 as a first attempt to get rid of errors, not minding the loss of a little data.  But since this is a job I want to run a couple of times a semester, and then not again until the next semester, I'd really like to get to the point where I can process the data files as they are downloaded.  "file" tells me this about the files:

"Little-endian UTF-16 Unicode English text, with very long lines, with CRLF line terminators"

My first attempt was to take the example generators from the csv module page you pointed out and change "utf-8" to "utf-16" everywhere I could.  This produced a "UnicodeError: UTF-16 stream does not start with BOM" error.  Based on the hint from "file", I tried "utf-16-le" instead, and got the UnicodeEncodeError again, this time while trying to read the CSV files.
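
For reference, here is a minimal sketch of how the two codec names differ. It is not taken from the csv examples page. The plain "utf-16" stream reader looks for a byte-order mark (BOM) to decide the byte order, which is consistent with the first error above. "utf-16-le" fixes the byte order and needs no BOM, but it also does not strip one if it is present. The sketch assumes Python 3, where the csv module accepts text directly; the function name and the sample data are hypothetical.

```python
import codecs
import csv
import io


def rows_from_utf16le(raw_bytes):
    """Decode little-endian UTF-16 bytes and parse them as CSV rows."""
    # "utf-16-le" fixes the byte order, so no BOM is required...
    text = raw_bytes.decode("utf-16-le")
    # ...but it does not remove a BOM either, so drop a leading U+FEFF.
    if text.startswith("\ufeff"):
        text = text[1:]
    # newline="" leaves the CRLF terminators for the csv module to handle.
    return list(csv.reader(io.StringIO(text, newline="")))


# Hypothetical sample standing in for a downloaded data file.
sample = "name,note\r\nZoë,naïve\r\n".encode("utf-16-le")
rows_without_bom = rows_from_utf16le(sample)
rows_with_bom = rows_from_utf16le(codecs.BOM_UTF16_LE + sample)
```

Both calls should give the same rows, whether or not the file starts with a BOM.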

Text encodings are still a mystery to me, so I hope you don't mind my naïveté.  Any further advice?

Best,
Matthew

On Mar 16, 2010, at 11:27 AM, Jeffrey Harris wrote:

> Hi Matthew, 
> 
>> I'm using vobject to parse data from a couple of CSV files and merge them to
>> vcards, eventually to import into Mac Address Book.  One of the files is ASCII,
>> the other UTF-8.  The UTF-8 file is survey data that I'm just munging and
>> putting into the Note field.
>> 
> 
> To avoid (or at least front-load) encoding issues, when working with text, you want to always decode into unicode when you read the text, and encode only when writing back to text.
> 
> It sounds to me like you aren't doing the decode step. If you're opening your UTF-8 file using the builtin open, use codecs.open instead. A good explanation of using codecs.open rather than the builtin open is at:
> 
> http://stackoverflow.com/questions/491921/unicode-utf8-reading-and-writing-to-files-in-python
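
A minimal sketch of the "decode on read, encode on write" rule described above. It uses io.open rather than codecs.open (a swapped-in call that takes an encoding argument in the same way and exists in both Python 2.6+ and Python 3). The file name and contents are hypothetical.

```python
import io
import os
import tempfile

# Hypothetical file name and contents, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "survey.txt")

# Encode only at the boundary, when text is written out.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"caf\u00e9")

# Decode at the boundary, when text is read back in.
with io.open(path, "r", encoding="utf-8") as f:
    text = f.read()

# On disk the accented letter is two UTF-8 bytes; in memory it is one character.
with io.open(path, "rb") as f:
    raw = f.read()
```

Everything between the read and the write then works with unicode text only, so encoding problems surface at the file boundary instead of deep inside the program.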
> 
> Since you're working with CSV, though, you're probably fighting with the fact that Python's csv module doesn't support unicode. Take a look at:
> 
> http://docs.python.org/library/csv.html#csv-examples
> 
> specifically unicode_csv_reader; you probably want to use something like that to wrap your inbound csv data.
> 
>> I think I understand the problem: vcard is an ASCII-only specification, and the
>> serialize method can't encode the unicode text as ASCII.  
>> 
> 
> Actually, RFC2426 is all about unicode, not ASCII. VCard 2.1 used quoted-printable for encoding, which makes older vcards a pain in the ass to parse, but it sounds like you're writing your own vcards, so you shouldn't be messing with vcard 2.1.
> 
> RFC2426 *is* agnostic about how you encode your unicode. Vobject tries to decode UTF-8, UTF-16, and even iso-8859-1, but when it serializes it always encodes as UTF-8 (I ought to make the encoding optional and let the user specify the encoding, but there hasn't been much of a clamor for anything but UTF-8).
> 
>> I tried quoted-printable for the encoding, but Address Book didn't decode it
>> (left it as quoted-printable gobbledygook).  Then I tried binary, and Address
>> Book crashed on import.
> 
> Always assign unicode, not UTF-8 encoded data, to the value of ContentLines, and you shouldn't see an encode error (that specific error happens when the UTF-8 encoder encounters a non-unicode value and naively tries to convert it to unicode by decoding it as ASCII).
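
The mechanism described above can be sketched without vobject. Python 2 performed the ASCII decode implicitly when a byte string was handed to an encoder; the example below writes that step out explicitly so that it also runs on Python 3. The variable names are illustrative only.

```python
# Data that has already been encoded as UTF-8 (a byte string, not unicode).
utf8_bytes = u"na\u00efve".encode("utf-8")

# Python 2 did this ASCII decode implicitly before re-encoding; written out
# explicitly, it fails on the first non-ASCII byte.
try:
    utf8_bytes.decode("ascii")
    error_name = None
except UnicodeDecodeError as exc:
    error_name = type(exc).__name__

# Decoding with the correct codec first yields unicode text, which then
# encodes cleanly.
text = utf8_bytes.decode("utf-8")
round_trip = text.encode("utf-8")
```

Assigning text (the decoded value) rather than utf8_bytes is the pattern recommended above.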
> 
> Sincerely,
> Jeffrey
> _______________________________________________
> VObject mailing list
> VObject at lists.skyhouseconsulting.com
> http://lists.skyhouseconsulting.com/mailman/listinfo/vobject

--
Matthew Leingang
Clinical Associate Professor of Mathematics
leingang at courant.nyu.edu




