[wellylug] utility to determine text file encoding?
Daniel Pittman
daniel at rimspace.net
Mon Mar 2 18:24:06 NZDT 2009
Bruce Hoult <bruce at hoult.org> writes:
> On Mon, Mar 2, 2009 at 5:21 PM, Joe Mahoney <joe at cheerschopper.com> wrote:
>> Hi All
>>
>> Is there a nice little command line app that, given a text file, will
>> tell me the encoding/charset of the file.
>
> That would be a highly uncertain thing to try to figure out. Ok, sure
> if there is nothing above 127 then call it USASCII,
KOI7, MIK, VISCII, Shift-JIS[1], ISO 2022-based encodings and ANSEL are
all 7-bit encodings that are incompatible with ASCII; the latter three
are routinely seen in the wild.[2]
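
As a rough sketch of why "nothing above 127" is not sufficient on its own,
the helper below reports whether the data is 7-bit clean and whether it
contains ESC (0x1B), which ISO 2022-based encodings use to switch character
sets. The function name and the labels it returns are my own invention, and
it makes no attempt to tell ASCII apart from the other 7-bit encodings
listed above.

```python
def classify_7bit(data: bytes) -> str:
    """Very rough guess for byte data; the labels are illustrative only."""
    if any(b > 0x7F for b in data):
        return "not-7bit"
    if b"\x1b" in data:
        # ISO 2022-based encodings switch character sets with ESC sequences,
        # so an ESC byte is a hint (not proof) that this is not plain ASCII.
        return "7bit-with-escapes"
    # Still only "ASCII-compatible at best": other 7-bit encodings pass too.
    return "7bit"
```

A "7bit" result only rules out the high-bit encodings; it does not prove
that the file is ASCII.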
> and if you see UTF-8 2 or 3 octet sequences used correctly then call
> it that.
You would need to verify more than the first occurrence: it isn't all
that unlikely that random binary data could generate a sequence of "high
bit off" followed by valid "high bit on" characters that resemble UTF-8.
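
A minimal sketch of checking the whole file rather than the first match,
assuming the file fits in memory: it relies on Python's strict UTF-8 decoder
to reject any invalid sequence anywhere in the input. The name
looks_like_utf8 is hypothetical, and a True result is still only a guess,
since other data can happen to be valid UTF-8.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if ALL of data is valid UTF-8 and it has a multi-byte sequence."""
    try:
        # Strict decoding fails on the first invalid byte anywhere in data.
        text = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    # Pure 7-bit input decodes fine too, so require a non-ASCII character
    # before claiming the file "uses UTF-8 sequences correctly".
    return any(ord(ch) > 0x7F for ch in text)
```

For a real file you would pass in something like open(path, "rb").read().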
> But otherwise ... how on earth would you tell some strange national
> language encoding (let alone the three or four different encodings for
> Russian alone) without a dictionary for every possible language?
*nod*
Regards,
Daniel
Footnotes:
[1] This substitutes the Yen character into ASCII.
[2] Although y'all can be thankful that you don't have to deal with
ANSEL outside the MARC 21 / Z39.50 protocol.