text encoding detect

How can the encoding of a text file be detected using Tcl? Does anyone know of a way?

There is a Python implementation called chardet [L1 ].

It works well.

gobvip 2010/07/21


The file command can do it too:

bash-5.1$ file --mime-encoding foobar.txt
foobar.txt: iso-8859-1
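
From within Tcl, the same check can be run through exec. The following is only a minimal sketch: it assumes an external file binary that supports --mime-encoding is on the PATH, and the proc names guessEncodingWithFile and tclEncodingName are hypothetical, chosen for illustration. The name mapping covers only a few common cases.

```tcl
# Hypothetical helper: ask the external file(1) program for its encoding guess.
# Assumes a `file` binary supporting --mime-encoding is installed.
proc guessEncodingWithFile {fname} {
    # -b (brief) suppresses the leading "filename:" prefix;
    # "--" ends option parsing so names starting with "-" are safe.
    return [string trim [exec file -b --mime-encoding -- $fname]]
}

# Hypothetical helper: map a name reported by file(1) to a Tcl encoding name.
# Returns "" when there is no obvious Tcl counterpart.
proc tclEncodingName {mimeName} {
    set name [string map {iso-8859- iso8859- us-ascii ascii} \
                  [string tolower $mimeName]]
    if {$name in [encoding names]} {
        return $name
    }
    return ""
}
```

The result of tclEncodingName, when non-empty, can be passed to fconfigure -encoding before reading the file as text.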

The encguess command from Perl can also do this. There are pathological cases that encguess handles better than the file command, and vice versa.

pi31415 2024/01/26

MiR 2024/01/30

If it's only to decide whether a file is a text or a binary file, one can use the character table from file's encoding.c like this (proof of concept, extend to your needs if necessary):

proc checkFileText {fname} {
    # Check whether this is a text file (returns 1) or a binary file (returns -1).
    # Character table taken from:
    # https://github.com/file/file/blob/f2a6e7cb7db9b5fd86100403df6b2f830c7f22ba/src/encoding.c#L151-L228
    #   F: character never appears in text
    #   T: character appears in plain ASCII text
    #   I: character appears in ISO-8859 text
    #   X: character appears in non-ISO extended ASCII (Mac, IBM PC)
    
    set text_chars  {
        F F F F F F F T T T T T T T F F  
        F F F F F F F F F F F T F F F F  
        T T T T T T T T T T T T T T T T  
        T T T T T T T T T T T T T T T T  
        T T T T T T T T T T T T T T T T  
        T T T T T T T T T T T T T T T T  
        T T T T T T T T T T T T T T T T  
        T T T T T T T T T T T T T T T F  
        X X X X X T X X X X X X X X X X  
        X X X X X X X X X X X X X X X X  
        I I I I I I I I I I I I I I I I  
        I I I I I I I I I I I I I I I I  
        I I I I I I I I I I I I I I I I  
        I I I I I I I I I I I I I I I I  
        I I I I I I I I I I I I I I I I  
        I I I I I I I I I I I I I I I I   
    }
    
    # Read the first 1024 bytes as raw binary data
    # (the "b" in the open mode already selects binary translation).
    set fn [open $fname rb]
    set teststring [read $fn 1024]
    close $fn

    # Convert the bytes to a list of unsigned values (0..255).
    set codes {}
    binary scan $teststring cu* codes

    # A single byte that never appears in text marks the file as binary.
    foreach code $codes {
        if {[lindex $text_chars $code] eq "F"} {
            return -1
        }
    }
    return 1
}
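
One way to extend this toward an actual encoding guess is to look for a byte-order mark and then test whether the bytes form well-formed UTF-8. The sketch below is an assumption-laden illustration, not a port of chardet or of file's logic: the proc names guessEncoding and guessFileEncoding are hypothetical, the returned labels are descriptive strings rather than guaranteed Tcl encoding names, UTF-32 marks are not checked, and the UTF-8 test is simplified (it does not reject overlong 3- and 4-byte forms or surrogates).

```tcl
# Hypothetical helper: guess an encoding label from a string of raw bytes.
proc guessEncoding {bytes} {
    # 1. A byte-order mark is the strongest hint.
    if {[string range $bytes 0 2] eq "\xEF\xBB\xBF"} { return utf-8 }
    if {[string range $bytes 0 1] eq "\xFF\xFE"}     { return utf-16le }
    if {[string range $bytes 0 1] eq "\xFE\xFF"}     { return utf-16be }

    # 2. Walk the bytes and check the UTF-8 lead/continuation structure.
    set codes {}
    binary scan $bytes cu* codes
    set n [llength $codes]
    set highBit 0
    for {set i 0} {$i < $n} {incr i} {
        set c [lindex $codes $i]
        if {$c < 0x80} continue
        set highBit 1
        if {$c >= 0xC2 && $c <= 0xDF} {
            set need 1
        } elseif {$c >= 0xE0 && $c <= 0xEF} {
            set need 2
        } elseif {$c >= 0xF0 && $c <= 0xF4} {
            set need 3
        } else {
            return unknown-8bit
        }
        # Each continuation byte must lie in 0x80..0xBF.
        for {set k 0} {$k < $need} {incr k} {
            incr i
            if {$i >= $n} { return unknown-8bit }
            set t [lindex $codes $i]
            if {$t < 0x80 || $t > 0xBF} { return unknown-8bit }
        }
    }
    # 3. Pure 7-bit data is reported as ascii, otherwise as utf-8.
    return [expr {$highBit ? "utf-8" : "ascii"}]
}

# Hypothetical wrapper: read a whole file in binary mode and guess.
proc guessFileEncoding {fname} {
    set ch [open $fname rb]
    set bytes [read $ch]
    close $ch
    return [guessEncoding $bytes]
}
```

A result of unknown-8bit only means the data is neither 7-bit ASCII nor well-formed UTF-8; telling ISO-8859-1 apart from other 8-bit encodings needs statistical methods such as those used by chardet.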