Extensions to scan and format

Arjen Markus (1 December 2003) I ran into a peculiar problem the other day, when I needed to read data from a string like:

   var1 mor.inp 0.0  0.0 10.0 "Some input variable"

that is, a string containing simple numbers and plain words, but also strings delimited by quotes (or apostrophes).

Such strings cannot easily be read by means of [scan] or [split]: the %s format code of [scan] stops at the first whitespace, and there is no way to make [split] recognise that the substring delimited by quotes is to become a single list element.
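A short sketch of the problem, using the sample line above (the variable names are chosen for illustration only): [split] breaks the quoted description into separate words, and the %s conversion of [scan] stops at the first blank.

```tcl
set line {var1 mor.inp 0.0  0.0 10.0 "Some input variable"}

# split cuts at every blank: the double blank gives an empty element
# and the quoted description falls apart into three pieces
set pieces [split $line]
puts [llength $pieces]   ;# 9 elements instead of the 6 we want
puts [lindex $pieces 6]  ;# "Some (with the opening quote)

# %s reads up to the next whitespace, so only the first word is stored
scan $line "%s %s %f %f %f %s" varname filename dvalue minvalue maxvalue descr
puts $descr              ;# "Some
```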

I tried such solutions as:

    scan $line "%s %s %f %f %f %40s" varname filename dvalue \
       minvalue maxvalue descr

but invariably the description (variable descr) would be a single word only. My first way out was:

   scan $line "%s %s %s %s %s" varname filename dvalue minvalue maxvalue
   regexp {"(.*)\"$} $line => text

but this is very unsatisfactory: it assumes that the description is always surrounded by quotes and that it is the last part of the line (otherwise the preceding variables would get the wrong value).

A workaround is a format that uses a character-set conversion (%[^"]), like:

    scan $line {%s %s %f %f %f "%[^"]"} varname filename dvalue \
       minvalue maxvalue descr

but that is still fragile: if the description is only one word, people might leave out the quotes, and then the format no longer matches.
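How a format with literal quotes reacts to a missing quote can be checked via the return value of [scan], which is the number of conversions performed. The sketch below uses a character-set conversion and two sample lines made up for illustration.

```tcl
set fmt {%s %s %f %f %f "%[^"]"}

set good {var1 mor.inp 0.0  0.0 10.0 "Some input variable"}
set bad  {var1 mor.inp 0.0  0.0 10.0 description}

# scan returns the number of conversions it performed
set ngood [scan $good $fmt varname filename dvalue minvalue maxvalue descr]
puts "$ngood: $descr"     ;# 6 conversions, descr holds the full text

unset descr
set nbad [scan $bad $fmt varname filename dvalue minvalue maxvalue descr]
puts $nbad                ;# 5: the literal quote in the format did not match
puts [info exists descr]  ;# 0: descr was never set
```

Checking the count against the number of variables is a cheap way to detect lines that do not follow the expected layout.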

Instead, I have tried to write a procedure that is more general than the ad hoc solution: it works in the same way as [scan] but recognises the delimited strings. Mind you, it is much less general than [scan]:

it assumes that the format codes are separated by whitespace;
it can therefore not handle formats like "%f,%f".

But it does do the job for me!

AK: Another possible solution: treat this as a CSV format, using SPACE as the separator character. The tcllib csv routines will then return a list of the elements, properly handling the quotes. The elements of the returned list can then be validated in a simple manner.

AM Interesting - I did not know it had such an option.

AK: Well, the csv package allows any character to be used as the separator (the sepChar argument to all commands). That means that quite a lot of formats can be seen as csv, just with different separator characters.
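A minimal sketch of this approach, assuming tcllib is installed and that its ::csv::split command accepts the separator as its second argument. Dropping the empty fields is my own addition to cope with the double blank in the sample line; it would also discard a deliberately empty field, so take it as an illustration only.

```tcl
package require csv   ;# part of tcllib

set line {var1 mor.inp 0.0  0.0 10.0 "Some input variable"}

set fields {}
foreach field [::csv::split $line " "] {
    # Consecutive blanks yield empty fields: skip them
    if { $field ne "" } {
        lappend fields $field
    }
}

foreach {varname filename dvalue minvalue maxvalue descr} $fields break
puts "Description: $descr"
```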

SMH: The following worked for me if I only use double quotes (Tcl 8.4.2, Windows):

 foreach {varname filename dvalue  minvalue maxvalue descr} $line break

AM I considered using that too, but I thought an approach via scan would be more robust ...

TV Not having given the problem more than a quick glance, I'd say that if you have no substitution problem (with $ [ ] { } \), or if you quote those characters, the example line is interpreted as a list by:

 % list var1 mor.inp 0.0  0.0 10.0 "Some input variable"
 var1 mor.inp 0.0 0.0 10.0 {Some input variable}

which is probably why the foreach works.

For the general quoting problem, you would at least have to know how the source handles quoting and the quote characters themselves. There is no point in parsing lines with inconsistent, ambiguous, or undefined syntax.
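The list interpretation and its limits can be tried out directly. The sketch below uses the sample line plus two made-up variants: one with an unbalanced quote and one with apostrophes.

```tcl
set line {var1 mor.inp 0.0  0.0 10.0 "Some input variable"}
puts [llength $line]    ;# 6
puts [lindex $line 5]   ;# Some input variable

# An unbalanced quote makes the string an invalid list
set broken {var1 mor.inp 0.0 0.0 10.0 "Some input}
if { [catch {llength $broken} msg] } {
    puts "Not a list: $msg"
}

# Apostrophes are not quoting characters for Tcl lists
set apos {var1 mor.inp 0.0 0.0 10.0 'Some input variable'}
puts [llength $apos]    ;# 8
```

Wrapping the list command in [catch] is one way to reject lines that cannot be treated as a list before any values are extracted.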

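 # qscan --
 #    Scan a line that holds quoted and unquoted strings
 # Arguments:
 #    data      String to be scanned
 #    format    Format codes, separated by whitespace
 #    args      Names of the variables to be set in the caller
 # Note:
 #    Quoted fields are stored as is (minus the quotes), whatever
 #    the corresponding format code
 #
 proc qscan { data format args } {
    # Split the data into plain words and quoted strings (quotes kept)
    set data_list   [regexp -inline -all {[^ "]+|"[^"]*"|'[^']*'} $data]
    # Split on runs of whitespace, so extra blanks give no empty codes
    set format_list [regexp -inline -all {\S+} $format]

    foreach form $format_list arg $args datum $data_list {
       # Stop as soon as the format codes or the variables run out
       if { $form == "" || $arg == "" } {
          break
       }
       upvar 1 $arg var
       if { [string index $datum 0] != "\"" &&
            [string index $datum 0] != "'"   } {
          scan $datum $form var
       } else {
          set var [string range $datum 1 end-1]
       }
    }
 }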

 #
 # Example of its use:
 #
 set line "var1 mor.inp 0.0  0.0 10.0 \"Some input variable\""

 qscan $line "%s %s %f %f %f %s" varname filename def min max descr

 puts "Variable: $varname"
 puts "Filename: $filename"
 puts "Default:  $def"
 puts "Minimum:  $min"
 puts "Maximum:  $max"
 puts "Description: $descr"
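The tokenising step can be tried on its own. The sketch below applies the same regular expression to a made-up line that uses apostrophes; because Tcl picks the longest of the alternatives that match at a given position, the apostrophe-delimited text stays in one piece.

```tcl
set data {var1 mor.inp 0.0  0.0 10.0 'Some input variable'}

# Plain words, "..." strings and '...' strings, delimiters included
set tokens [regexp -inline -all {[^ "]+|"[^"]*"|'[^']*'} $data]
puts [llength $tokens]               ;# 6

# Strip the delimiters from the last token
set last  [lindex $tokens end]
set descr [string range $last 1 end-1]
puts $descr                          ;# Some input variable
```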

In a similar vein, I wrote a small package that deals with differences in the representation of decimal numbers: in some countries a comma is used instead of a point to separate the integer part from the fraction, and a period is used to group the digits in threes:

   one million and 1 tenth: 1,000,000.1 in English becomes
                            1.000.000,1 in Dutch or German.
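For a string that is already formatted, the two conventions differ only by a swap of the two characters, which [string map] can do in a single pass. This is merely an illustration of the difference, not the way the package works.

```tcl
set english "1,000,000.1"

# Both replacements are applied in one pass, so they do not interfere
set dutch [string map {, . . ,} $english]
puts $dutch   ;# 1.000.000,1
```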

Here is the code of the package. Its use should be clear from the test at the end.

 # dformat.tcl --
 #    Script to handle alternative decimal characters (comma say
 #    instead of dot) on output
 #

 # Format --
 #    Namespace for the two commands
 #
 namespace eval ::Format:: {
    variable decimal    "."
    variable thousands  ""

    namespace export setDecimalChars dformat
 }

 # setDecimalChars --
 #    Set the decimal characters (decimal and thousands separators)
 # Arguments:
 #    decimal_char    Character to use for decimals (default: .)
 #    thousands_char  Character to use for thousands (default: nothing)
 # Result:
 #    Nothing
 # Side effect:
 #    Private variables set
 # Note:
 #    To do: simple sanity checks
 #
 proc ::Format::setDecimalChars { {decimal_char .} {thousands_char "" } } {
    variable decimal
    variable thousands

    set decimal   $decimal_char
    set thousands $thousands_char
 }

 # DecFormat --
 #    Private routine to handle the decimal characters
 # Arguments:
 #    format    Single numerical format string
 #    value     Single numerical value
 # Result:
 #    Correctly formatted string
 #
 proc ::Format::DecFormat { format value } {
    variable decimal
    variable thousands

    set string  [format $format $value]
    set posdot  [string first "." $string]
    if { $posdot >= 0 } {
       set indices [regexp -inline -indices {([0-9]+)\.} $string]
    } else {
       set indices [regexp -inline -indices {([0-9]+)} $string]
    }
    foreach {first last} [lindex $indices 1] {break}
  #  puts "$first $last -- $string ($indices)"
    set prefix  [string range $string $first $last]
    set idx     [expr {[string length $prefix]-3}]
    while { $idx > 0 } {
       set posidx [string index $prefix $idx]
       set prefix [string replace $prefix $idx $idx "$thousands$posidx"]
       incr idx -3
    }

    #
    # Be careful with these replacements: otherwise interference
    # may occur
    #
    set string [string replace $string $posdot $posdot $decimal]
    set string [string replace $string $first $last $prefix]
    return $string
 }

 # dformat --
 #    Format the given variables according to the format string
 #    keeping in mind the current decimal settings
 # Arguments:
 #    format_string   String containing formats
 #    args            Values to be formatted
 # Result:
 #    Formatted string
 # Note:
 #    No support for %*s and the like
 #    To do: error checks
 #
 proc ::Format::dformat { format_string args } {

    set codes_re {%[^%cdfegsx]*[%cdefgsx]}
    set codes      [regexp -all -inline -indices $codes_re $format_string]
    regsub -all $codes_re $format_string "%s" new_format

    set idx 0
    set vars {}
    foreach code $codes {
       foreach {start stop} $code {break}
       set substr [string range $format_string $start $stop]
  #     puts "$code -- $substr -- [lindex $args $idx]"
       if { $substr == "%%" } {
          lappend vars "%"
          continue
       }
       set value [lindex $args $idx]
       if { [string first [string index $substr end] "defg"] >= 0 } {
          set result [DecFormat $substr $value]
       } else {
          set result [format $substr $value]
       }
       lappend vars $result
  #     puts $vars
       incr idx
    }

    return [eval format [list $new_format] $vars]
 }


 # main --
 #    Testing the routines
 #
 #set test 1

 if { [file tail $::argv0] == [file tail [info script]] || ([info exists test] && $test == 1) } {

    namespace import -force ::Format::*

    for { set i 0 } { $i < 2 } { incr i } {
       puts [dformat %.3f 1000]
       puts [dformat "A %f %d B" 1000.0 10000]
       puts [dformat "%d %d %d %d %d" 1 10 100 1000 10000]
       puts [dformat "%g %g %g %g %g" 1.0 10.0 100.0 1000.0 10000.0]
       puts [dformat "%g %g %g %g %g" 1.1 10.1 100.1 1000.1 10000.1]
       setDecimalChars "," "."
    }
 }
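The digit-grouping loop in DecFormat can be tried in isolation. The sketch below repackages it as a stand-alone procedure; the name groupDigits is hypothetical and not part of the package.

```tcl
# groupDigits --
#    Insert a separator before every group of three digits,
#    counting from the right (hypothetical helper)
proc groupDigits { digits sep } {
    set idx [expr {[string length $digits] - 3}]
    while { $idx > 0 } {
        set digit  [string index $digits $idx]
        set digits [string replace $digits $idx $idx "$sep$digit"]
        incr idx -3
    }
    return $digits
}

puts [groupDigits 100 .]       ;# 100
puts [groupDigits 1000 ,]      ;# 1,000
puts [groupDigits 1000000 .]   ;# 1.000.000
```

Working from right to left means that the positions still to be handled are not shifted by the separators already inserted.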

Category Parsing	Category Local	Category String Processing