Finnish Hyphenation

mjk: I played with Finnish grammar and hyphenation rules and figured out a simple Tcl program for hyphenating Finnish text.

 ######################################################################
 ##                                                                  ##
 ## Finnish Hyphenation                                              ##
 ##                                                                  ##
 ## Matti J. Kärki 19.5.2004                                         ##
 ##                                                                  ##
 ## The following code is my personal experiment. I tried to find an ##
 ## easy way to hyphenate Finnish text. In theory, it is not an easy ##
 ## task (trust me). However, I did find a simple rule set for       ##
 ## hyphenation. The following code is an implementation of my idea. ##
 ## I don't know whether this method has already been invented. If   ##
 ## it has, then I have reinvented the wheel. If not, well... cool :)##
 ##                                                                  ##
 ######################################################################

 ##
 ## Tools:
 ##

 # Returns true if the character is a vowel. Otherwise returns false.
 proc vowel chr {
     if {$chr != {}} then {
         if {[regexp {[aeiouyäö]} [string tolower $chr]]} then {
             return true
         }
     }

     return false
 }

 # Returns true if the character is a consonant. Otherwise returns false.
 proc consonant chr {
     if {$chr != {}} then {
         if {[regexp {[bcdfghjklmnpqrstvwxz]} [string tolower $chr]]} then {
             return true
         }
     }

     return false
 }

 # Returns true only if the character is neither a vowel nor a consonant
 # (this includes the empty string). Otherwise returns false.
 proc nonalphabet chr {
     if {[vowel $chr] == false && [consonant $chr] == false} then {
         return true
     }

     return false
 }

 ##
 ## Hyphenation rules:
 ##
 ## c = expects consonant
 ## v = expects vowel
 ## x = expects non-alphabet character (or no character, empty {})
 ##
 ## The hyphen is placed before the _current_ character ($cc in the engine code)
 ##

 # -lla, for example: si-vuil-la
 proc rule-ccv {a b c} {
     if {[consonant $a] && [consonant $b] && [vowel $c]} then {
         return true
     }

     return false
 }

 # -uli-, for example: tu-lin
 proc rule-vcv {a b c} {
     if {[vowel $a] && [consonant $b] && [vowel $c]} then {
         return true
     }

     return false
 }

 # -aa-il-, for example: maa-il-ma
 proc rule-vvvc {a b c d} {
     if {[vowel $a] && [vowel $b] && [vowel $c] && [consonant $d]} then {
         return true
     }

     return false
 }

 # -ian, -uan ending (two different vowels), for example: pi-an, ha-lu-an.
 # Endings with a doubled vowel (-aan, -ään) are not split here: ku-kaan
 proc rule-vvcx {a b c d} {
     if {[vowel $a] && [vowel $b] && [consonant $c] && [nonalphabet $d] && \
             ($a ne $b)} then {
         return true
     }

     return false
 }

 # -ia ending (two different vowels), for example: vaa-li-a. A doubled
 # vowel (-aa, -ää) is not split by this rule.
 proc rule-vvx {a b c} {
     if {[vowel $a] && [vowel $b] && [nonalphabet $c] && ($a ne $b)} then {
         return true
     }

     return false
 }

 ##
 ## Hyphenation engine:
 ##

 # Hyphenates the given text and returns a list of characters, including the
 # inserted hyphens.
 proc hyphenate text {
     set chars  [split $text ""]
     set len    [llength $chars]
     set hyphen false
     set result {}

     for {set i 0} {$i < $len} {incr i} {
         set cc  [lindex $chars $i]              ;# Current character
         set cpp [lindex $chars [expr {$i - 2}]] ;# Character before previous
         set cp  [lindex $chars [expr {$i - 1}]] ;# Previous character
         set cn  [lindex $chars [expr {$i + 1}]] ;# Next character
         set cnn [lindex $chars [expr {$i + 2}]] ;# Character after next

         if {$hyphen} then {
             set cp {}
         }

         if {![consonant $cc] && ![vowel $cc]} then {
             set cp {}
         }

         if { \
                  [rule-ccv $cp $cc $cn] || \
                  [rule-vcv $cp $cc $cn] || \
                  [rule-vvvc $cpp $cp $cc $cn] || \
                  [rule-vvcx $cp $cc $cn $cnn] || \
                  [rule-vvx $cp $cc $cn] } then {
             lappend result "-"
             set hyphen true
         } else {
             set hyphen false
         }

         lappend result $cc
     }

     return $result
 }

 # Reads one line from stdin and writes the hyphenated line to stdout.
 puts [join [hyphenate [gets stdin]] ""]
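
To see which rule is responsible for each hyphen, the rules can be restated as patterns over the character classes c, v and x. The following is only a minimal sketch, not the program above: the names `class`, `rule-at` and `trace-word` are hypothetical, and the sketch assumes the same rule order and the same two resets of the previous character as the engine.

```tcl
# Hypothetical helper: class letter for one character.
# v = vowel, c = consonant, x = anything else (including "" at the text edges).
proc class chr {
    set chr [string tolower $chr]
    if {$chr eq ""} { return x }
    if {[string first $chr "aeiouy\u00e4\u00f6"] >= 0} { return v }
    if {[string first $chr "bcdfghjklmnpqrstvwxz"] >= 0} { return c }
    return x
}

# Returns the name of the first rule that fires at index $i, or "" if none.
# $prevHyphen is true when a hyphen was placed before the previous character.
proc rule-at {chars i prevHyphen} {
    set cc  [lindex $chars $i]
    set cpp [lindex $chars [expr {$i - 2}]]
    set cp  [lindex $chars [expr {$i - 1}]]
    set cn  [lindex $chars [expr {$i + 1}]]
    set cnn [lindex $chars [expr {$i + 2}]]

    # Same resets as in the engine: forget the previous character right
    # after a hyphen, or when the current character is not a letter.
    if {$prevHyphen || [class $cc] eq "x"} { set cp {} }

    set w3 [class $cp][class $cc][class $cn]
    if {$w3 eq "ccv"} { return ccv }
    if {$w3 eq "vcv"} { return vcv }
    if {"[class $cpp]$w3" eq "vvvc"} { return vvvc }
    if {"$w3[class $cnn]" eq "vvcx" && $cp ne $cc} { return vvcx }
    if {$w3 eq "vvx" && $cp ne $cc} { return vvx }
    return ""
}

# Returns a two-element list: the hyphenated text and the rules that fired.
proc trace-word text {
    set chars [split $text ""]
    set out ""
    set fired {}
    set prevHyphen 0
    for {set i 0} {$i < [llength $chars]} {incr i} {
        set rule [rule-at $chars $i $prevHyphen]
        set prevHyphen [expr {$rule ne ""}]
        if {$prevHyphen} {
            append out "-"
            lappend fired $rule
        }
        append out [lindex $chars $i]
    }
    return [list $out $fired]
}

puts [trace-word maailma]   ;# maa-il-ma {vvvc ccv}
puts [trace-word pian]      ;# pi-an vvcx
```

The vowels ä and ö are written as `\u00e4` and `\u00f6` here so that the sketch does not depend on the encoding of the script file.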

An example:

Source text (from my home page):

Pitkään mietittyäni kotisivujani ja sitä, mitä haluan maailmalle näillä sivuilla tarjota, päädyin lopulta tähän varsin karuun esitystapaan. Tulin siihen tulokseen, että on parempi yrittää tarjota näillä sivuilla jotain minulle tärkeää ja aidosti muillekin hyödyllistä kuin julkaista taas uutta linkkilistaa, jollaisia Internet on nykyään väärällään. Sivut ovat suomeksi, koska äidinkieltä pitää vaalia. Poikkeuksena ovat ohjelmointisivut, koska ohjelmoinnin de facto -kielenä on englanti.

Hyphenated text:

Pit-kään mie-tit-tyä-ni ko-ti-si-vu-ja-ni ja si-tä, mi-tä ha-lu-an maa-il-mal-le näil-lä si-vuil-la tar-jo-ta, pää-dy-in lo-pul-ta tä-hän var-sin ka-ruun e-si-tys-ta-paan. Tu-lin sii-hen tu-lok-seen, et-tä on pa-rem-pi y-rit-tää tar-jo-ta näil-lä si-vuil-la jo-ta-in mi-nul-le tär-keää ja ai-dos-ti muil-le-kin hyö-dyl-lis-tä ku-in jul-kais-ta taas uut-ta link-ki-lis-taa, jol-lai-si-a In-ter-net on ny-kyä-än vää-räl-lään. Si-vut o-vat suo-mek-si, kos-ka äi-din-kiel-tä pi-tää vaa-li-a. Poik-keuk-se-na o-vat oh-jel-moin-ti-si-vut, kos-ka oh-jel-moin-nin de fac-to -kie-le-nä on eng-lan-ti.

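The algorithm is not perfect. In some (rare?) cases the program places hyphens in the wrong places. In the sample output above, for example, pää-dy-in, jo-ta-in and ku-in split a diphthong (the expected results are pää-dyin, jo-tain and kuin), ny-kyä-än should be ny-ky-ään, and tär-keää is missing a break (tär-ke-ää).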
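
Some of the misplaced hyphens in the hyphenated text above fall between two vowels that normally form a diphthong (for example ku-in). A rough way to flag such candidates for manual review might look like the sketch below. It is not part of the program above: the proc name `suspect-breaks` and the diphthong list are assumptions made for this illustration, and the check is only a heuristic that can both miss errors and raise false alarms.

```tcl
# Returns the list of "x-y" breaks in $hyphenated that split one of the
# assumed diphthongs. Breaks right after a doubled vowel are left alone.
proc suspect-breaks hyphenated {
    # Assumed list of Finnish diphthongs (hypothetical, for illustration).
    set diphthongs [list ai ei oi ui yi "\u00e4i" "\u00f6i" \
                         au eu iu ou ey iy "\u00e4y" "\u00f6y" \
                         ie uo "y\u00f6"]
    set text [string tolower $hyphenated]
    set suspects {}
    set start 0
    while {[set pos [string first "-" $text $start]] >= 0} {
        set earlier [string index $text [expr {$pos - 2}]]
        set before  [string index $text [expr {$pos - 1}]]
        set after   [string index $text [expr {$pos + 1}]]
        # "maa-il-ma": the vowel before the hyphen belongs to a long vowel,
        # so the break is not reported.
        if {$before ne $earlier &&
            [lsearch -exact $diphthongs $before$after] >= 0} {
            lappend suspects "$before-$after"
        }
        set start [expr {$pos + 1}]
    }
    return $suspects
}

puts [suspect-breaks "jo-ta-in ku-in maa-il-ma"]   ;# a-i u-i
```

Errors of other kinds, such as ny-kyä-än, are not detected by this check.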

Category Application | Category Human Language