text encoding detect

***How can the encoding of a text file be detected using Tcl? Does anyone know?***

There is a Python implementation called '''chardet''' [https://github.com/chardet/chardet].

It works well.

[gobvip] 2010/07/21

-----

The Unix file(1) utility (not Tcl's built-in file command) can do it too:

======none
bash-5.1$ file --mime-encoding foobar.txt
foobar.txt: iso-8859-1
======
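
From Tcl, the same check can be run through exec. The following is a minimal sketch, assuming the file utility is installed and on the PATH. The proc names guessEncodingWithFile and mimeToTclEncoding are hypothetical helpers, and the name mapping only covers a few common cases.

======
# Hypothetical helper: translate a MIME charset name reported by file(1)
# into a Tcl encoding name, or return "" if Tcl has no matching encoding.
proc mimeToTclEncoding {mime} {
    set name [string tolower [string trim $mime]]
    switch -glob -- $name {
        binary - unknown-8bit { return "" }
        us-ascii   { set name ascii }
        iso-8859-* { set name [string map {iso-8859- iso8859-} $name] }
    }
    if {$name in [encoding names]} {
        return $name
    }
    return ""
}

# Hypothetical helper: ask file(1) for the encoding of a file.
# --brief suppresses the "filename: " prefix in the output.
proc guessEncodingWithFile {fname} {
    set mime [exec file --brief --mime-encoding -- $fname]
    return [mimeToTclEncoding $mime]
}
======

A non-empty result can be passed to fconfigure -encoding before reading the file. An empty result means the guess could not be mapped to one of Tcl's encodings, so the caller has to fall back to something else.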

The encguess command from Perl can also do this.  There are pathological cases that encguess handles better than the file command, and vice versa.

[pi31415] 2024/01/26

-----

[MiR] 2024/01/30

If the goal is only to decide whether a file is a text or a binary file, one can port the lookup table from file's encoding.c, like this
(proof of concept; extend to your needs if necessary):

======
proc checkFileText {fname} {
    # check if this is a text or a binary file
    # https://github.com/file/file/blob/f2a6e7cb7db9b5fd86100403df6b2f830c7f22ba/src/encoding.c#L151-L228
    # Character classes, as defined there:
    #   F  character never appears in text
    #   T  character appears in plain ASCII text
    #   I  character appears in ISO-8859 text
    #   X  character appears in non-ISO extended ASCII (Mac, IBM PC)
    
    set text_chars  {
        F F F F F F F T T T T T T T F F  
        F F F F F F F F F F F T F F F F  
        T T T T T T T T T T T T T T T T  
        T T T T T T T T T T T T T T T T  
        T T T T T T T T T T T T T T T T  
        T T T T T T T T T T T T T T T T  
        T T T T T T T T T T T T T T T T  
        T T T T T T T T T T T T T T T F  
        X X X X X T X X X X X X X X X X  
        X X X X X X X X X X X X X X X X  
        I I I I I I I I I I I I I I I I  
        I I I I I I I I I I I I I I I I  
        I I I I I I I I I I I I I I I I  
        I I I I I I I I I I I I I I I I  
        I I I I I I I I I I I I I I I I  
        I I I I I I I I I I I I I I I I   
    }
    
    # "rb" already puts the channel into binary mode
    set fn [open $fname rb]
    set teststring [read $fn 1024]
    close $fn

    # one signed 8-bit integer per byte read (empty list for an empty file)
    set bytes {}
    binary scan $teststring c* bytes

    foreach byte $bytes {
        # mask to 0..255 so the value can index the table
        set class [lindex $text_chars [expr {$byte & 0xFF}]]
        if {$class eq "F"} {
            # this byte never appears in text: treat the file as binary
            return -1
        }
    }
    return 1
}

======
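
Once a file looks like text, a byte order mark (BOM) at its start is a cheap additional hint about the encoding. The following is a minimal sketch of such a check. The proc name bomEncoding is hypothetical, and the returned names are plain labels that are not necessarily valid Tcl encoding names on every Tcl version.

======
# Hypothetical helper: inspect the first bytes of a binary string and
# return a label for the BOM found there, or "" if there is none.
proc bomEncoding {data} {
    # hex dump of at most the first four bytes, e.g. "efbbbf41"
    set hex ""
    binary scan [string range $data 0 3] H* hex
    set hex [string tolower $hex]

    # Longer marks are tested first, because the UTF-32LE mark begins
    # with the UTF-16LE mark. Note that ff fe 00 00 could also be
    # UTF-16LE text starting with a NUL character; this sketch ignores that.
    foreach {prefix name} {
        0000feff utf-32be
        fffe0000 utf-32le
        efbbbf   utf-8
        feff     utf-16be
        fffe     utf-16le
    } {
        if {[string match ${prefix}* $hex]} {
            return $name
        }
    }
    return ""
}
======

The argument is expected to be raw bytes, for example the result of read on a channel opened with mode rb. Files without a BOM return an empty string and need a different heuristic.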




<<categories>>Category Printing | Category HTML | Category Word and Text Processing