[Vobject] importing UTF-8 text - UnicodeDecodeError
Matthew Leingang
leingang at courant.nyu.edu
Thu Mar 18 15:48:53 CDT 2010
Hi everybody,
For purposes of posterity, I'll close this loop. Parsing UTF-16-LE-encoded CSV data turned out to be as easy (or as hard) as parsing UTF-8-encoded CSV data.
Basically, I followed Jeffrey's pointers to the UnicodeReader generator shown at
http://docs.python.org/library/csv.html#csv-examples
Then I just needed a line like:
    parser = UnicodeReader(codecs.open(utf_16_le_file), delimiter='\t', encoding='utf-16-le')
The encoding parameter to the UnicodeReader constructor gets passed through to the UTF8Recoder object, which decodes the file as UTF-16-LE and re-encodes each line as UTF-8 for the underlying csv reader. Hooray!
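For reference, the recipe on that page looks roughly like this (Python 2; UTF8Recoder is the piece that lets the byte-oriented csv module cope with other encodings):

    import codecs
    import csv

    class UTF8Recoder:
        """Iterator that reads an encoded stream and re-encodes the input to UTF-8."""
        def __init__(self, f, encoding):
            self.reader = codecs.getreader(encoding)(f)

        def __iter__(self):
            return self

        def next(self):
            return self.reader.next().encode('utf-8')

    class UnicodeReader:
        """A CSV reader which iterates over lines in file f, which is in the
        given encoding, yielding rows as lists of unicode strings."""
        def __init__(self, f, dialect=csv.excel, encoding='utf-8', **kwds):
            f = UTF8Recoder(f, encoding)
            self.reader = csv.reader(f, dialect=dialect, **kwds)

        def next(self):
            row = self.reader.next()
            return [unicode(s, 'utf-8') for s in row]

        def __iter__(self):
            return self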
One red herring came out of printing for diagnostic purposes: it seems that "print"ing unicode data to the terminal can raise a UnicodeEncodeError. Appending ".encode('utf-8')" to the strings I wanted to dump fixed that, too.
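A minimal illustration of that trap (Python 2; whether the bare print fails depends on your terminal's encoding):

    word = u'na\xefvet\xe9'          # u'naïveté'
    print word                       # can raise UnicodeEncodeError, e.g. when piped
    print word.encode('utf-8')       # explicit encoding sidesteps the guesswork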
I still don't grok all this character-encoding stuff too well, but I feel a bit better, and I achieved the goal of parsing the data as it comes.
Best,
Matthew
On Mar 17, 2010, at 4:43 PM, Matthew Leingang wrote:
> Dear Jeffrey,
>
> Thanks for helping out with my issue. I've got it worked out as I described: I can now read CSV files encoded as UTF-8, and by making sure all my strings are unicode, writing works, too.
>
> Now, if I may revise my question: my original data files are actually UTF-16. When I was working on this before, I re-encoded them as UTF-8 as a first attempt at getting rid of the errors, not minding losing a little data. But since this is a job I want to run a couple of times a semester, and then not again until the next semester, I'd really like to get to the point where I can process the data files just as they are downloaded. "file" tells me this about them:
>
> "Little-endian UTF-16 Unicode English text, with very long lines, with CRLF line terminators"
>
> My first attempt was to take the example generators from the csv module page you pointed out and change "utf-8" to "utf-16" everywhere I could. This produced a "UnicodeError: UTF-16 stream does not start with BOM" error. Based on "file"'s hint, I tried "utf-16-le" instead, and I got the UnicodeEncodeError again, this time while trying to read the CSV files.
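>
> (A minimal sketch of the difference I ran into, assuming Python 2's stream readers, which is what the recipe's UTF8Recoder uses under the hood:)
>
>     import codecs
>     from StringIO import StringIO
>
>     raw = u'abc'.encode('utf-16-le')   # little-endian, no BOM, like my files
>     codecs.getreader('utf-16-le')(StringIO(raw)).read()  # u'abc'
>     codecs.getreader('utf-16')(StringIO(raw)).read()
>     # UnicodeError: UTF-16 stream does not start with BOM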
>
> Text encodings are still a mystery to me, so I hope you don't mind my naïveté. Any further advice?
>
> Best,
> Matthew
>
> On Mar 16, 2010, at 11:27 AM, Jeffrey Harris wrote:
>
>> Hi Matthew,
>>
>>> I'm using vobject to parse data from a couple of CSV files and merge them to
>>> vcards, eventually to import into Mac Address Book. One of the files is ASCII,
>>> the other UTF-8. The UTF-8 file is survey data that I'm just munging and
>>> putting into the Note field.
>>>
>>
>> To avoid (or at least front-load) encoding issues when working with text, you want to decode into unicode as soon as you read the data, and encode back to bytes only when you write it out.
>>
>> It sounds to me like you aren't doing the decode step. If you're opening your UTF-8 file with the builtin open, just use codecs.open instead. A good explanation of using codecs.open rather than the builtin open is at:
>>
>> http://stackoverflow.com/questions/491921/unicode-utf8-reading-and-writing-to-files-in-python
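>>
>> In other words, something like this (a minimal sketch; the file names are made up):
>>
>>     import codecs
>>
>>     # codecs.open decodes as it reads, handing you unicode objects...
>>     notes = codecs.open('survey.txt', 'r', encoding='utf-8').read()
>>     # ...and a file opened the same way for writing accepts unicode
>>     # and encodes it on the way out.
>>     codecs.open('notes-copy.txt', 'w', encoding='utf-8').write(notes)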
>>
>> Since you're working with CSV, though, you're probably fighting with the fact that Python's csv module doesn't support unicode. Take a look at:
>>
>> http://docs.python.org/library/csv.html#csv-examples
>>
>> specifically unicode_csv_reader; you probably want to use something like that to wrap your inbound CSV data.
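>>
>> For reference, that generator is roughly:
>>
>>     import csv
>>
>>     def utf_8_encoder(unicode_csv_data):
>>         for line in unicode_csv_data:
>>             yield line.encode('utf-8')
>>
>>     def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
>>         # csv.py doesn't do unicode; encode temporarily as UTF-8:
>>         csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
>>                                 dialect=dialect, **kwargs)
>>         for row in csv_reader:
>>             # decode UTF-8 back into unicode, cell by cell:
>>             yield [unicode(cell, 'utf-8') for cell in row]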
>>
>>> I think I understand the problem: vcard is an ASCII-only specification, and the
>>> serialize method can't encode the unicode text as ASCII.
>>>
>>
>> Actually, RFC 2426 is all about unicode, not ASCII. vCard 2.1 used quoted-printable for encoding, which makes older vCards a pain in the ass to parse, but it sounds like you're writing your own vCards, so you shouldn't need to mess with vCard 2.1.
>>
>> RFC 2426 *is* agnostic about how you encode your unicode, though. Vobject tries to decode UTF-8, UTF-16, and even ISO-8859-1, but when it serializes it always encodes as UTF-8 (I ought to make the output encoding optional and let the user specify it, but there hasn't been much of a clamor for anything but UTF-8).
>>
>>> I tried quoted-printable for the encoding, but Address Book didn't decode it
>>> (left it as quoted-printable gobbledygook). Then I tried binary, and Address
>>> Book crashed on import.
>>
>> Always assign unicode, not UTF-8-encoded byte strings, to the value of ContentLines, and you shouldn't see an encode error (that specific error happens when the UTF-8 encoder encounters a byte string and Python naively tries to convert it to unicode by decoding it as ASCII first).
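>>
>> For example (a minimal sketch along the lines of the vobject README; the name is made up):
>>
>>     import vobject
>>
>>     card = vobject.vCard()
>>     card.add('n').value = vobject.vcard.Name(family=u'User', given=u'Na\xefve')
>>     card.add('fn').value = u'Na\xefve User'   # unicode, not a UTF-8 byte string
>>     print card.serialize()                    # vobject encodes the output as UTF-8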
>>
>> Sincerely,
>> Jeffrey
>
> --
> Matthew Leingang
> Clinical Associate Professor of Mathematics
> leingang at courant.nyu.edu
>
--
Matthew Leingang
Clinical Associate Professor of Mathematics
leingang at courant.nyu.edu