[wellylug] utility to determine text file encoding?

Daniel Pittman daniel at rimspace.net
Mon Mar 2 18:24:06 NZDT 2009


Bruce Hoult <bruce at hoult.org> writes:
> On Mon, Mar 2, 2009 at 5:21 PM, Joe Mahoney <joe at cheerschopper.com> wrote:
>> Hi All
>>
>> Is there a nice little command line app that, given a text file, will
>> tell me the encoding/charset of the file.
>
> That would be a highly uncertain thing to try to figure out.  Ok, sure
> if there is nothing above 127 then call it USASCII,

KOI7, MIK, VISCII, Shift-JIS[1], ISO 2022-based encodings and ANSEL are
all 7-bit encodings that are incompatible with ASCII; the last three
are routinely seen in the wild.[2]
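
As a rough illustration of why "nothing above 127" is not enough, here
is a minimal Python sketch.  The helper names are made up for this
example, and looking for ESC bytes is only a crude hint that ISO 2022
shift sequences might be present, not a real detector.

```python
def is_seven_bit(data: bytes) -> bool:
    """True if no byte has the high bit set."""
    return all(b < 0x80 for b in data)


def has_escape_byte(data: bytes) -> bool:
    """True if the data contains ESC (0x1B), which ISO 2022 encodings
    use to switch character sets.  A hint only: plain ASCII files with
    terminal control codes contain ESC too."""
    return b"\x1b" in data


# Three Japanese characters, written with escapes so this source
# stays pure ASCII.
text = "\u65e5\u672c\u8a9e"
sample = text.encode("iso2022_jp")

print(is_seven_bit(sample))             # every byte is below 128
print(has_escape_byte(sample))          # but ESC sequences are present
print(sample.decode("ascii") == text)   # decodes as ASCII, wrongly
```

The sample decodes as ASCII without any error, yet the result is not
the original text, so a "7-bit clean" test alone cannot justify calling
a file ASCII.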

> and if you see UTF-8 2 or 3 octet sequences used correctly then call
> it that.

You would need to verify more than the first occurrence: it isn't all
that unlikely that random binary data could generate a sequence of "high
bit off" followed by valid "high bit on" characters that resemble UTF-8.
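
A minimal sketch of that check in Python, assuming the whole file fits
in memory.  The function name is hypothetical; it leans on Python's
strict UTF-8 decoder to validate every sequence rather than stopping at
the first plausible one.

```python
from typing import Optional


def utf8_evidence(data: bytes) -> Optional[int]:
    """Validate the entire buffer as UTF-8.

    Returns None if any byte sequence is invalid; otherwise returns the
    number of non-ASCII characters, each of which came from a
    well-formed multi-byte sequence.
    """
    try:
        text = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None
    return sum(1 for ch in text if ord(ch) > 0x7F)


good = "caf\u00e9 na\u00efve".encode("utf-8")
bad = b"ok \xc3\xa9 then \xe9 alone"   # one valid sequence, one stray byte

print(utf8_evidence(good))
print(utf8_evidence(bad))
print(utf8_evidence(b"plain"))
```

A result of zero means the data is valid UTF-8 only in the trivial
sense that it has no high-bit bytes at all, which says nothing either
way; a larger count with no failures is stronger evidence, though still
not proof.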

> But otherwise ... how on earth would you tell some strange national
> language encoding (let alone the three or four different encodings for
> Russian alone) without a dictionary for every possible language?

*nod*

Regards,
        Daniel

Footnotes: 
[1]  This substitutes the Yen character into ASCII.

[2]  Although y'all can be thankful that you don't have to deal with
     ANSEL outside the MARC 21 / Z39.50 protocol.


