Version 7 of Finnish Hyphenation

Updated 2004-05-19 21:07:08 by mjk

mjk: I played with Finnish grammar and hyphenation rules and figured out a simple Tcl program for hyphenating Finnish text.

 ######################################################################
 ##                                                                  ##
 ## Finnish Hyphenation                                              ##
 ##                                                                  ##
 ## Matti J. Kärki, 19.5.2004                                        ##
 ##                                                                  ##
 ## The following code is my personal experiment. I tried to find an ##
 ## easy way to hyphenate Finnish text. In theory, it is not an easy ##
 ## task (trust me). However, I did find a simple rule set for       ##
 ## hyphenation. The following code is an implementation of my idea. ##
 ## I don't know whether this method has already been invented. If   ##
 ## it has, then I have reinvented the wheel. If not, well... cool :)##
 ##                                                                  ##
 ######################################################################

 ##
 ## Tools:
 ##

 # Returns true if the character is a vowel. Otherwise returns false.
 proc vowel chr {
     if {$chr != {}} then {
         if {[regexp {[aeiouyäö]} [string tolower $chr]]} then {
             return true
         }
     }

     return false
 }

 # Returns true if the character is a consonant. Otherwise returns false.
 proc consonant chr {
     if {$chr != {}} then {
         if {[regexp {[bcdfghjklmnpqrstvwxz]} [string tolower $chr]]} then {
             return true
         }
     }

     return false
 }

 # Returns true only if the character is neither a vowel nor a consonant
 # (this includes the empty string). Otherwise returns false.
 proc nonalphabet chr {
     if {![vowel $chr] && ![consonant $chr]} then {
         return true
     }

     return false
 }

 ##
 ## Hyphenation rules:
 ##
 ## c = expects consonant
 ## v = expects vowel
 ## x = expects non-alphabet character (or no character, empty {})
 ##
 ## hyphen is placed before _current_ character ($cc in the engine code)
 ##

 # -lla, for example: si-vuil-la
 proc rule-ccv {a b c} {
     if {[consonant $a] && [consonant $b] && [vowel $c]} then {
         return true
     }

     return false
 }

 # -uli-, for example: tu-lin
 proc rule-vcv {a b c} {
     if {[vowel $a] && [consonant $b] && [vowel $c]} then {
         return true
     }

     return false
 }

 # -aa-il-, for example: maa-il-ma
 proc rule-vvvc {a b c d} {
     if {[vowel $a] && [vowel $b] && [vowel $c] && [consonant $d]} then {
         return true
     }

     return false
 }

 # Word-final -ian, -uan (two different vowels), for example: pi-an, ha-lu-an.
 # Identical vowels (-aan, -ään) are not split, for example: ku-kaan
 proc rule-vvcx {a b c d} {
     if {[vowel $a] && [vowel $b] && [consonant $c] && [nonalphabet $d] && \
             ($a ne $b)} then {
         return true
     }

     return false
 }

 # Word-final -ia, -ua (two different vowels), for example: vaa-li-a.
 # Identical vowels (-aa, -ää) are not split by this rule.
 proc rule-vvx {a b c} {
     if {[vowel $a] && [vowel $b] && [nonalphabet $c] && ($a ne $b)} then {
         return true
     }

     return false
 }

 ##
 ## Hyphenation engine:
 ##

 # Hyphenates the given text and returns a list of characters, including the
 # inserted hyphens.
 proc hyphenate text {
     set chars  [split $text ""]
     set len    [llength $chars]
     set hyphen false
     set result {}

     for {set i 0} {$i < $len} {incr i} {
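         set cc  [lindex $chars $i]              ;# Current character
         set cpp [lindex $chars [expr {$i - 2}]] ;# Character before previous
         set cp  [lindex $chars [expr {$i - 1}]] ;# Previous character
         set cn  [lindex $chars [expr {$i + 1}]] ;# Next character
         set cnn [lindex $chars [expr {$i + 2}]] ;# Character after next

         # Right after a hyphen, hide the previous character so that no
         # rule can place a second hyphen immediately after the first.
         if {$hyphen} then {
             set cp {}
         }

         # A non-alphabet current character never takes a hyphen.
         if {![consonant $cc] && ![vowel $cc]} then {
             set cp {}
         }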

         if { \
                  [rule-ccv $cp $cc $cn] || \
                  [rule-vcv $cp $cc $cn] || \
                  [rule-vvvc $cpp $cp $cc $cn] || \
                  [rule-vvcx $cp $cc $cn $cnn] || \
                  [rule-vvx $cp $cc $cn] } then {
             lappend result "-"
             set hyphen true
         } else {
             set hyphen false
         }

         lappend result $cc
     }

     return $result
 }

 # Reads one line from stdin and writes the hyphenated line to stdout.
 puts [join [hyphenate [gets stdin]] ""]

An example:

Source text (from my home page):

Pitkään mietittyäni kotisivujani ja sitä, mitä haluan maailmalle näillä sivuilla tarjota, päädyin lopulta tähän varsin karuun esitystapaan. Tulin siihen tulokseen, että on parempi yrittää tarjota näillä sivuilla jotain minulle tärkeää ja aidosti muillekin hyödyllistä kuin julkaista taas uutta linkkilistaa, jollaisia Internet on nykyään väärällään. Sivut ovat suomeksi, koska äidinkieltä pitää vaalia. Poikkeuksena ovat ohjelmointisivut, koska ohjelmoinnin de facto -kielenä on englanti.

Hyphenated text:

Pit-kään mie-tit-tyä-ni ko-ti-si-vu-ja-ni ja si-tä, mi-tä ha-lu-an maa-il-mal-le näil-lä si-vuil-la tar-jo-ta, pää-dy-in lo-pul-ta tä-hän var-sin ka-ruun e-si-tys-ta-paan. Tu-lin sii-hen tu-lok-seen, et-tä on pa-rem-pi y-rit-tää tar-jo-ta näil-lä si-vuil-la jo-ta-in mi-nul-le tär-keää ja ai-dos-ti muil-le-kin hyö-dyl-lis-tä ku-in jul-kais-ta taas uut-ta link-ki-lis-taa, jol-lai-si-a In-ter-net on ny-kyä-än vää-räl-lään. Si-vut o-vat suo-mek-si, kos-ka äi-din-kiel-tä pi-tää vaa-li-a. Poik-keuk-se-na o-vat oh-jel-moin-ti-si-vut, kos-ka oh-jel-moin-nin de fac-to -kie-le-nä on eng-lan-ti.

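The algorithm is not perfect. Most of the extract above is hyphenated correctly, but the program does put some hyphens in the wrong places: it splits the diphthongs in pää-dy-in, jo-ta-in and ku-in (expected: pää-dyin, jo-tain and kuin), and it produces tär-keää and ny-kyä-än instead of tär-ke-ää and ny-ky-ään.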

So, how is Finnish normally hyphenated, and how does my implementation differ? I have seen only a few hyphenation implementations, and as far as I recall, they all worked through words backwards. My code, which is not scientifically exact in any way, walks through the text from beginning to end. It doesn't look for exceptions or specific rules in the text (in other words, it doesn't depend directly on the grammatical rules); it just applies a few predefined rules, crafted to represent the way the human brain "hears" the correct hyphenation. Also, other implementations usually look only forward in the string, checking the current character and the next one. My implementation also examines a word one character at a time, but it looks both backward and forward to see what kinds of characters surround the current one.
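To make the backward and forward look concrete, here is a small sketch that prints the five-character window the engine sees around each letter, written in the c/v/x notation used by the rules. The procedures classify and trace_window are names made up for this illustration; they are not part of the program above. The sketch also ignores the engine's extra step of blanking the previous character right after a hyphen.

```tcl
# Classify a single character: v = vowel, c = consonant, x = anything else
# (including the empty string found past either end of the text).
proc classify chr {
    set chr [string tolower $chr]
    if {$chr ne "" && [string first $chr "aeiouyäö"] >= 0} { return v }
    if {$chr ne "" && [string first $chr "bcdfghjklmnpqrstvwxz"] >= 0} { return c }
    return x
}

# Return one {character window} pair per position in the word. The window
# holds the classes of the two characters before the current one, the
# current one itself, and the two characters after it.
proc trace_window word {
    set chars [split $word ""]
    set rows {}
    for {set i 0} {$i < [llength $chars]} {incr i} {
        set window ""
        foreach offset {-2 -1 0 1 2} {
            # lindex returns an empty string for out-of-range indices
            append window [classify [lindex $chars [expr {$i + $offset}]]]
        }
        lappend rows [list [lindex $chars $i] $window]
    }
    return $rows
}

foreach row [trace_window tulin] {
    puts [join $row " "]
}
```

For tulin, the row for l shows the window cvcvc. Its middle three classes are vcv, which is the pattern rule-vcv looks for, so the hyphen goes before the l: tu-lin.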

Because the current implementation doesn't follow the written rules, I'm going to improve the code so that it follows the official grammatical rules of the Finnish language.

Update: I checked the grammar, and it seems that my code meets the standards pretty well. The only weak spot is the set of exceptions, which is hard to come up with anyway, because there is no single rule for exceptions. They are something that "you just have to know". Oh well...
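One possible way to handle words that "you just have to know" is a lookup table that is consulted before any rules are applied. The following is only a sketch under assumptions: the exceptions array, hyphenate_word and no_rules are hypothetical names, and the two entries are illustrative, not an authoritative exception list.

```tcl
# Hypothetical exception table: lower-case word -> hyphenated form.
# The entries are illustrative only.
array set exceptions {
    kuin   kuin
    jotain jo-tain
}

# Returns the stored hyphenation if the word is a known exception.
# Otherwise calls the fallback command with the word as its last argument.
# Note: the stored form is lower-case, so capitalization is not preserved.
proc hyphenate_word {word fallback} {
    global exceptions
    set key [string tolower $word]
    if {[info exists exceptions($key)]} {
        return $exceptions($key)
    }
    return [uplevel #0 [linsert $fallback end $word]]
}

# Stand-in fallback for this sketch: returns the word unchanged.
proc no_rules word { return $word }

puts [hyphenate_word jotain no_rules]
puts [hyphenate_word talo no_rules]
```

In real use the fallback would be a rule-based hyphenator. The hyphenate procedure above works on whole text and returns a list of characters, so it would need a small wrapper that joins its result before it could serve as the fallback here.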


Category Application | Category Human Language