[Vobject] importing UTF-8 text - UnicodeDecodeError
Matthew Leingang
leingang at courant.nyu.edu
Thu Mar 18 15:48:53 CDT 2010
Hi everybody,
For purposes of posterity, I'll close this loop. Parsing UTF-16-LE-encoded CSV data turned out to be as easy (or as hard) as parsing UTF-8-encoded CSV data.
Basically, I followed Jeffrey's pointers to the UnicodeReader generator shown at
http://docs.python.org/library/csv.html#csv-examples
Then I just needed a line like:
    parser = UnicodeReader(codecs.open(utf_16_le_file), delimiter='\t', encoding='utf-16-le')
The encoding parameter to the UnicodeReader constructor gets passed through to the UTF8Recoder object, which decodes the file as UTF-16-LE and re-encodes each line as UTF-8 for the underlying csv reader. Hooray!
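For reference, the recipe on that page looks roughly like this (Python 2; UTF8Recoder is the piece that lets the byte-oriented csv module cope with other encodings):

    import codecs
    import csv

    class UTF8Recoder:
        """Iterator that reads an encoded stream and re-encodes the input to UTF-8."""
        def __init__(self, f, encoding):
            self.reader = codecs.getreader(encoding)(f)

        def __iter__(self):
            return self

        def next(self):
            return self.reader.next().encode('utf-8')

    class UnicodeReader:
        """A CSV reader which iterates over lines in file f, which is in the
        given encoding, yielding rows as lists of unicode strings."""
        def __init__(self, f, dialect=csv.excel, encoding='utf-8', **kwds):
            f = UTF8Recoder(f, encoding)
            self.reader = csv.reader(f, dialect=dialect, **kwds)

        def next(self):
            row = self.reader.next()
            return [unicode(s, 'utf-8') for s in row]

        def __iter__(self):
            return self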
One red herring came out of printing for diagnostic purposes: it seems that "print"ing unicode data to the terminal can raise a UnicodeEncodeError. Appending ".encode('utf-8')" to the strings I wanted to dump fixed that, too.
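A minimal illustration of that trap (Python 2; whether the bare print fails depends on your terminal's encoding):

    word = u'na\xefvet\xe9'          # u'naïveté'
    print word                       # can raise UnicodeEncodeError, e.g. when piped
    print word.encode('utf-8')       # explicit encoding sidesteps the guesswork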
I still don't grok all this character-encoding stuff too well, but I feel a bit better, and I achieved the goal of parsing the data as it comes.
Best,
Matthew
On Mar 17, 2010, at 4:43 PM, Matthew Leingang wrote:
> Dear Jeffrey,
>
> Thanks for helping out with my issue. I've got it worked out as I described: I can now read CSV files encoded as UTF-8, and by making sure all my strings are unicode, writing works, too.
>
> Now, if I may revise my question: my original data files are actually UTF-16. When I was working on this before, I re-encoded them as UTF-8 as a first attempt at getting rid of the errors, not minding losing a little data. But since this is a job I want to run a couple of times a semester, and then not again until the next semester, I'd really like to get to the point where I can process the data files just as they are downloaded. "file" tells me this about them:
>
> "Little-endian UTF-16 Unicode English text, with very long lines, with CRLF line terminators"
>
> My first attempt was to take the example generators from the csv module page you pointed out and change "utf-8" to "utf-16" everywhere I could. This produced a "UnicodeError: UTF-16 stream does not start with BOM" error. Based on "file"'s hint, I tried "utf-16-le" instead, and I got the UnicodeEncodeError again, this time while trying to read the CSV files.
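>
> (A minimal sketch of the difference I ran into, assuming Python 2's stream readers, which is what the recipe's UTF8Recoder uses under the hood:)
>
>     import codecs
>     from StringIO import StringIO
>
>     raw = u'abc'.encode('utf-16-le')   # little-endian, no BOM, like my files
>     codecs.getreader('utf-16-le')(StringIO(raw)).read()  # u'abc'
>     codecs.getreader('utf-16')(StringIO(raw)).read()
>     # UnicodeError: UTF-16 stream does not start with BOM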
>
> Text encodings are still a mystery to me, so I hope you don't mind my naïveté. Any further advice?
>
> Best,
> Matthew
>
> On Mar 16, 2010, at 11:27 AM, Jeffrey Harris wrote:
>
>> Hi Matthew,
>>
>>> I'm using vobject to parse data from a couple of CSV files and merge them to
>>> vcards, eventually to import into Mac Address Book. One of the files is ASCII,
>>> the other UTF-8. The UTF-8 file is survey data that I'm just munging and
>>> putting into the Note field.
>>>
>>
>> To avoid (or at least front-load) encoding issues when working with text, you want to decode into unicode as soon as you read the data, and encode back to bytes only when you write it out.
>>
>> It sounds to me like you aren't doing the decode step. If you're opening your UTF-8 file with the builtin open, just use codecs.open instead. A good explanation of using codecs.open rather than the builtin open is at:
>>
>> http://stackoverflow.com/questions/491921/unicode-utf8-reading-and-writing-to-files-in-python
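>>
>> In other words, something like this (a minimal sketch; the file names are made up):
>>
>>     import codecs
>>
>>     # codecs.open decodes as it reads, handing you unicode objects...
>>     notes = codecs.open('survey.txt', 'r', encoding='utf-8').read()
>>     # ...and a file opened the same way for writing accepts unicode
>>     # and encodes it on the way out.
>>     codecs.open('notes-copy.txt', 'w', encoding='utf-8').write(notes)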
>>
>> Since you're working with CSV, though, you're probably fighting with the fact that Python's csv module doesn't support unicode. Take a look at:
>>
>> http://docs.python.org/library/csv.html#csv-examples
>>
>> specifically unicode_csv_reader; you probably want to use something like that to wrap your inbound CSV data.
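>>
>> For reference, that generator is roughly:
>>
>>     import csv
>>
>>     def utf_8_encoder(unicode_csv_data):
>>         for line in unicode_csv_data:
>>             yield line.encode('utf-8')
>>
>>     def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
>>         # csv.py doesn't do unicode; encode temporarily as UTF-8:
>>         csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
>>                                 dialect=dialect, **kwargs)
>>         for row in csv_reader:
>>             # decode UTF-8 back into unicode, cell by cell:
>>             yield [unicode(cell, 'utf-8') for cell in row]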
>>
>>> I think I understand the problem: vcard is an ASCII-only specification, and the
>>> serialize method can't encode the unicode text as ASCII.
>>>
>>
>> Actually, RFC 2426 is all about unicode, not ASCII. vCard 2.1 used quoted-printable for encoding, which makes older vCards a pain in the ass to parse, but it sounds like you're writing your own vCards, so you shouldn't need to mess with vCard 2.1.
>>
>> RFC 2426 *is* agnostic about how you encode your unicode, though. Vobject tries to decode UTF-8, UTF-16, and even ISO-8859-1, but when it serializes it always encodes as UTF-8 (I ought to make the output encoding optional and let the user specify it, but there hasn't been much of a clamor for anything but UTF-8).
>>
>>> I tried quoted-printable for the encoding, but Address Book didn't decode it
>>> (left it as quoted-printable gobbledygook). Then I tried binary, and Address
>>> Book crashed on import.
>>
>> Always assign unicode, not UTF-8-encoded byte strings, to the value of ContentLines, and you shouldn't see an encode error (that specific error happens when the UTF-8 encoder encounters a byte string and Python naively tries to convert it to unicode by decoding it as ASCII first).
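>>
>> For example (a minimal sketch along the lines of the vobject README; the name is made up):
>>
>>     import vobject
>>
>>     card = vobject.vCard()
>>     card.add('n').value = vobject.vcard.Name(family=u'User', given=u'Na\xefve')
>>     card.add('fn').value = u'Na\xefve User'   # unicode, not a UTF-8 byte string
>>     print card.serialize()                    # vobject encodes the output as UTF-8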
>>
>> Sincerely,
>> Jeffrey
>
> --
> Matthew Leingang
> Clinical Associate Professor of Mathematics
> leingang at courant.nyu.edu
>
--
Matthew Leingang
Clinical Associate Professor of Mathematics
leingang at courant.nyu.edu