pyshapelib unicode saga

Bram de Greve bram.degreve at gmail.com
Thu Mar 15 16:25:44 CET 2007


typo: first sentence: change -> chance =)

Update on linux filenames: it seems to work after all, I was merely testing
old code ;)  Py_FileSystemDefaultEncoding (which resides in bltinmodule.c)
is initially set to NULL on linux, but Py_InitializeEx in
pythonrun.c reinitializes it to nl_langinfo(CODESET).
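
A quick way to check this from the Python side (a minimal sketch, not the C
code in pyshapelib): compare the interpreter's filesystem encoding with what
nl_langinfo(CODESET) reports. The helper name filesystem_encoding_report is
made up for illustration, and locale.nl_langinfo is assumed to exist, which
is only the case on POSIX systems.

```python
import locale
import sys


def filesystem_encoding_report():
    """Return (filesystem encoding, locale codeset or None)."""
    fs_encoding = sys.getfilesystemencoding()
    codeset = None
    # nl_langinfo is not available on every platform (e.g. not on Windows).
    if hasattr(locale, "nl_langinfo"):
        try:
            # Pick up the user's locale so CODESET reflects the environment.
            locale.setlocale(locale.LC_CTYPE, "")
        except locale.Error:
            pass  # unsupported locale settings; keep whatever is active
        codeset = locale.nl_langinfo(locale.CODESET)
    return fs_encoding, codeset


if __name__ == "__main__":
    print(filesystem_encoding_report())
```

On a linux box with a UTF-8 locale the two values would be expected to name
the same encoding, though the exact spelling may differ.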

So that still leaves the issue of the wide character support on windows >NT,
but that's a matter that first must be resolved by the shapelib library.
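
To illustrate why the narrow API is a problem, here is a small sketch under
stated assumptions: cp1252 stands in for the windows "mbcs" codec (which only
exists on windows), and encode_for_narrow_api is a hypothetical helper, not
part of shapelib or pyshapelib.

```python
def encode_for_narrow_api(filename, encoding="cp1252"):
    """Encode a unicode filename for a char*-based API.

    Returns the encoded bytes, or None when the name cannot be
    represented in the given narrow encoding.
    """
    try:
        return filename.encode(encoding)
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    # Representable in cp1252, so a narrow API could open it.
    print(encode_for_narrow_api("caf\u00e9.shp"))
    # Greek capital omega is not in cp1252, so the name is lost.
    print(encode_for_narrow_api("\u03a9mega.shp"))
```

Names that come back as None are exactly the ones that would need the wide
character API to be opened correctly.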

Bram

On 3/15/07, Bram de Greve <bram.degreve at gmail.com> wrote:
>
> Hi there,
>
> For a moment there I thought I've seen my change to support unicode for
> the filenames.  But it was only for a moment =)
>
> I've looked in Python's source code how they handled things for their own
> file object, and I've mimicked it as far as I could.
> Key aspect seems to be to parse a string argument using "et" instead of
> "s" and to use Py_FileSystemDefaultEncoding as encoding.
> Except that it doesn't work ...
>
> First of all, Py_FileSystemDefaultEncoding is only defined for windows (mbcs)
> and apple (utf-8),
> and not for Linux (NULL, meaning default encoding, meaning ascii).  So
> linux still gets plagued by the same error Didrik had before.
> And yet, Python's file() seems to be able to cope with unicode filenames
> in Linux.
>
> Secondly, for windows mbcs is used, which is a lossy encoding (not all
> unicode can be represented using mbcs).
> This is necessary because the original shapelib library only uses the
> narrow (char*) API, and on windows that means mbcs encoding.
> To get full unicode support, the wide character API must be used instead
> (_wfopen), but shapelib simply doesn't support that.
> (Python's file() does precisely that on windows, in case of unicode it
> tries to use the wide character API)
>
> Then there's also the issue of the encoding of the field names and the
> string values.  The easiest solution would be to fix everything
> on UTF-8 but I believe we could do better.  It should be possible to specify
> the encoding when opening or creating a DBFFile, defaulting
> to perhaps something specified by the locale.
>
> There's also the issue of backwards compatibility.  Getting strings into the
> DBFFile isn't a problem since we can check whether the
> caller passes a unicode or a classic string, but getting them out is.  Should
> we always return unicode strings and risk some
> incompatibilities with calling code, or should we try to diversify
> (perhaps based on the used encoding,
> ascii encoding could return classic strings, or maybe based on another
> flag ...)
>
> Bram
>
> --
> hi, i'm a signature viruz, plz set me as your signature and help me spread
> :)
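
On the DBF encoding and backwards compatibility questions in the quoted
message, one possible shape is sketched below. Everything here is an
assumption for illustration: FieldCodec, its encoding argument and its
return_unicode flag are hypothetical and not part of the pyshapelib API.

```python
import locale


class FieldCodec:
    """Hypothetical helper: converts DBF strings with a per-file encoding."""

    def __init__(self, encoding=None, return_unicode=True):
        # Fall back to the locale's preferred encoding when none is given.
        self.encoding = encoding or locale.getpreferredencoding(False)
        self.return_unicode = return_unicode

    def to_bytes(self, value):
        # Accept both text and already-encoded bytes from the caller.
        if isinstance(value, bytes):
            return value
        return value.encode(self.encoding)

    def from_bytes(self, raw):
        # Either decode to text or hand back the raw bytes unchanged.
        if self.return_unicode:
            return raw.decode(self.encoding)
        return raw


if __name__ == "__main__":
    codec = FieldCodec("latin-1")
    stored = codec.to_bytes("Gen\u00e8ve")
    print(stored, codec.from_bytes(stored))
```

Writing accepts either string type, while the flag decides what comes back
out, which is one way to keep older calling code working.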




-- 
hi, i'm a signature viruz, plz set me as your signature and help me spread
:)


More information about the Thuban-devel mailing list