ListsProgramming Practice (CS109)ArrayBuffer and StringBuilderMaps

Maps

What is a map?

To compare different authors, or to identify a good match in a web search, we can use a histogram of a document. It contains all the words used, and for each word how often it was used.

In other words, given an input text, we want to compute a mapping

\[ \textit{words} \rightarrow \mathbb{N} \]
that maps a word \(w\) to the number of times it is used in the text.

We therefore need a data type that can store pairs of (word, count), that is, pairs of type (String, Int). It should support the following operations:

This data type is called a map or dictionary. A map implements a mapping from some key type to some value type.

Scala provides a parameterized data type Map[K,V], where K is the key type and V is the value type. We can think of it as a container for (K,V) pairs.

We can create such a map as follows:

scala> val m1 = Map(("A", 3), ("B", 7))
m1: scala.collection.immutable.Map[String,Int] = Map(A -> 3, B -> 7)

As you can see, when Scala prints the map, it shows nice mappings A -> 3 and B -> 7 instead of the pairs. We can use the same syntax when we create maps, because Scala interprets the -> operator as building a pair:

scala> 23 -> 19
res0: (Int, Int) = (23,19)
scala> "CS109" -> "Otfried"
res1: (String, String) = (CS109,Otfried)
scala> val m = Map("A" -> 7, "B" -> 13)
m: scala.collection.immutable.Map[String,Int] = Map(A -> 7, B -> 13)

We can access the mappings using the nice mathematical syntax \(m(x)\) for the result of mapping key \(x\) by the mapping \(m\):

scala> m("A")
res2: Int = 7
scala> m("B")
res3: Int = 13
scala> m("C")
java.util.NoSuchElementException: key not found: C
Note that an exception is thrown when the key is not in the map.

We can check if a mapping is defined for a given key using contains, and there is a nice method getOrElse that provides a default mapping for keys that are not mapped:

scala> m contains "C"
res5: Boolean = false
scala> m contains "A"
res6: Boolean = true
scala> m.getOrElse("A", 99)
res7: Int = 7
scala> m.getOrElse("C", 99)
res8: Int = 99

We can add or change mappings in a map using the + operator, and we can remove mappings using the - operator. Note that maps are immutable, so all of these operators return a new map:

scala> m + ("C" -> 13)
res9: scala.collection.immutable.Map[String,Int] = Map(A -> 7, B -> 13, C -> 13)
scala> m - "A"
res10: scala.collection.immutable.Map[String,Int] = Map(B -> 13)
scala> m - "C"
res11: scala.collection.immutable.Map[String,Int] = Map(A -> 7, B -> 13)
scala> m + ("A" -> 99)
res12: scala.collection.immutable.Map[String,Int] = Map(A -> 99, B -> 13)

Mutable maps

However, Scala also provides mutable maps. Mutable maps have operators += and -= for adding (updating) and removing mappings. We can also use the syntax m(key) on the left side of an assignment to add or change a mapping:

scala> import scala.collection.mutable.Map
import scala.collection.mutable.Map
scala> val m = Map("A" -> 7, "B" -> 9)
m: scala.collection.mutable.Map[String,Int] = Map(A -> 7, B -> 9)
scala> m += ("C" -> 13)
res0: m.type = Map(A -> 7, C -> 13, B -> 9)
scala> m
res1: scala.collection.mutable.Map[String,Int] = Map(A -> 7, C -> 13, B -> 9)
scala> m("C")
res2: Int = 13
scala> m -= "A"
res3: m.type = Map(C -> 13, B -> 9)
scala> m
res4: scala.collection.mutable.Map[String,Int] = Map(C -> 13, B -> 9)
scala> m("A")
java.util.NoSuchElementException: key not found: A
scala> m("A") = 19
scala> m("B") = 99
scala> println(m)
Map(A -> 19, C -> 13, B -> 99)

Note that since I use the import statement in the first line, I can simply write Map for mutable maps. Of course I could also create a mutable map with the long name (in this case making an empty map and adding mappings later):

scala> val s = scala.collection.mutable.Map[String,Int]()
s: scala.collection.mutable.Map[String,Int] = Map()
scala> s("Otfried") = 13
scala> s("Youjin") = 37
scala> s("Yeonhee") = 59
scala> println(s)
Map(Youjin -> 37, Yeonhee -> 59, Otfried -> 13)

Computing a word histogram

Here is my first attempt at a histogram program:

import scala.collection.mutable.Map

def histogram(fname: String): Map[String, Int] = {
  val F = scala.io.Source.fromFile(fname)
  val hist = Map[String, Int]()

  for (line <- F.getLines()) {
    if (line != "") {
      val words = line.split("[ ,:;.?!()-]+")
      for (word <- words) {
	val upword = word.toUpperCase
	if (hist contains upword)
	  hist(upword) += 1
	else
	  hist(upword) = 1
      }
    }
  }
  hist
}

if (args.length != 1) {
  println("Usage: scala histogram1.scala <file name>")
  sys.exit(1)
}
  
val fname = args(0)
val hist = histogram(fname)
println(hist)

The function histogram creates an empty map hist from strings to integers. It then looks at all words in the file, converts them to uppercase, and checks if the word is already in the map. If not, it adds the mapping \(w \rightarrow 1\) to the map. If it is already in the map, it changes the mapping by increasing its value by one (so if the old mapping was \(w \rightarrow n\), the new mapping is \(w \rightarrow n+1\)).

When I run it on the text file scala.txt, I get this output:

$ scala histogram1.scala scala.txt 
Map(START -> 1, THAT -> 3, PROGRAMMING -> 4, SO -> 1, STYLE -> 3, WELL
-> 1, USING -> 1, IT -> 2, FIRST -> 1, DURING -> 1, IN -> 5, SCALA ->
4, PRACTICE -> 1, JAVA -> 1, OBJECT -> 1, THOSE -> 1, CODE -> 2, A ->
6, FIND -> 1, EXPERIENCE -> 1, BUT -> 1, ORIENTED -> 1, LANGUAGES ->
1, YET -> 1, THIS -> 3, LANGUAGE -> 5, AS -> 1, AFTER -> 1, ALTHOUGH
-> 1, SUPPORTS -> 1, GOOD -> 1, I -> 4, BETTER -> 1, NOT -> 1, IS ->
1, EVEN -> 1, OF -> 1, MOST -> 1, OR -> 1, PLAN -> 1, USE -> 2,
DIFFERENT -> 2, COURSE -> 1, SEMESTER -> 1, DEVELOP -> 1, ABLE -> 1,
AND -> 4, EARLY -> 1, POPULAR -> 1, BE -> 2, WRITING -> 1, WRITE -> 1,
AND/OR -> 1, TRY -> 1, DIFFICULT -> 1, CLEAN -> 1, WILL -> 3, TO -> 7,
IF -> 1, SHOULD -> 1, LEARN -> 2, STUDENTS -> 3, ELEGANT -> 1, STUDIES
-> 1, SUCCESSFUL -> 1, HAVE -> 1, GETTING -> 1, MEANS -> 1, LATER ->
2, ENCOURAGES -> 1, THEREFORE -> 1, BELIEVE -> 2, THEY -> 3,
FUNCTIONAL -> 1, THE -> 1, MY -> 1, MODERN -> 1, WANT -> 1, C++ -> 1,
SWITCH -> 1, THEIR -> 2)

The output is not very pretty, so I should use a nicer way of printing out the map. This can be done by iterating over the contents of the map, as follows:

def printHistogram(h: Map[String, Int]) {
  for ((word, count) <- h)
    printf("%20s: %d\n", word, count)
}

The output is much nicer:

$ scala histogram2.scala scala.txt 
               START: 1
                THAT: 3
         PROGRAMMING: 4
                  SO: 1
               STYLE: 3
                WELL: 1
               USING: 1
                  IT: 2
               FIRST: 1
              DURING: 1
                  IN: 5
               SCALA: 4
            PRACTICE: 1
                JAVA: 1
              OBJECT: 1
               THOSE: 1
                CODE: 2
                   A: 6
                FIND: 1
          EXPERIENCE: 1
                 BUT: 1
            ORIENTED: 1
           LANGUAGES: 1
                 YET: 1
                THIS: 3
            LANGUAGE: 5
                  AS: 1
               AFTER: 1
            ALTHOUGH: 1
            SUPPORTS: 1
                GOOD: 1
                   I: 4
              BETTER: 1
                 NOT: 1
                  IS: 1
                EVEN: 1
                  OF: 1
                MOST: 1
                  OR: 1
                PLAN: 1
                 USE: 2
           DIFFERENT: 2
              COURSE: 1
            SEMESTER: 1
             DEVELOP: 1
                ABLE: 1
                 AND: 4
               EARLY: 1
             POPULAR: 1
                  BE: 2
             WRITING: 1
               WRITE: 1
              AND/OR: 1
                 TRY: 1
           DIFFICULT: 1
               CLEAN: 1
                WILL: 3
                  TO: 7
                  IF: 1
              SHOULD: 1
               LEARN: 2
            STUDENTS: 3
             ELEGANT: 1
             STUDIES: 1
          SUCCESSFUL: 1
                HAVE: 1
             GETTING: 1
               MEANS: 1
               LATER: 2
          ENCOURAGES: 1
           THEREFORE: 1
             BELIEVE: 2
                THEY: 3
          FUNCTIONAL: 1
                 THE: 1
                  MY: 1
              MODERN: 1
                WANT: 1
                 C++: 1
              SWITCH: 1
               THEIR: 2

This is still not perfect, because it's quite hard to find the word we are looking for. It would be better if the list was sorted. We can do this by first converting the map to an array, and sorting the array:

def printHistogram(h: Map[String, Int]) {
  val sorted = h.toArray.sorted
  for ((word, count) <- sorted)
    printf("%20s: %d\n", word, count)
}

Now the output becomes:

$ scala histogram3.scala scala.txt 
                   A: 6
                ABLE: 1
               AFTER: 1
            ALTHOUGH: 1
                 AND: 4
              AND/OR: 1
                  AS: 1
                  BE: 2
             BELIEVE: 2
              BETTER: 1
                 BUT: 1
                 C++: 1
               CLEAN: 1
                CODE: 2
              COURSE: 1
             DEVELOP: 1
           DIFFERENT: 2
           DIFFICULT: 1
              DURING: 1
               EARLY: 1
             ELEGANT: 1
          ENCOURAGES: 1
                EVEN: 1
          EXPERIENCE: 1
                FIND: 1
               FIRST: 1
          FUNCTIONAL: 1
             GETTING: 1
                GOOD: 1
                HAVE: 1
                   I: 4
                  IF: 1
                  IN: 5
                  IS: 1
                  IT: 2
                JAVA: 1
            LANGUAGE: 5
           LANGUAGES: 1
               LATER: 2
               LEARN: 2
               MEANS: 1
              MODERN: 1
                MOST: 1
                  MY: 1
                 NOT: 1
              OBJECT: 1
                  OF: 1
                  OR: 1
            ORIENTED: 1
                PLAN: 1
             POPULAR: 1
            PRACTICE: 1
         PROGRAMMING: 4
               SCALA: 4
            SEMESTER: 1
              SHOULD: 1
                  SO: 1
               START: 1
            STUDENTS: 3
             STUDIES: 1
               STYLE: 3
          SUCCESSFUL: 1
            SUPPORTS: 1
              SWITCH: 1
                THAT: 3
                 THE: 1
               THEIR: 2
           THEREFORE: 1
                THEY: 3
                THIS: 3
               THOSE: 1
                  TO: 7
                 TRY: 1
                 USE: 2
               USING: 1
                WANT: 1
                WELL: 1
                WILL: 3
               WRITE: 1
             WRITING: 1
                 YET: 1

How do maps work?

Maps are implemented using a hash table, which allows extremely fast insertion, removal, and search, but does not maintain any ordering on the keys. (Come to CS206 to learn about hash tables.)

A pronounciation dictionary

Let's build a real "dictionary", one that maps words to other words. In this case, we want to build a map that maps English words to their pronounciation (English pronounciation is so unpredictable, it would be nice to have a tool to help with this).

We will use the data file cmudict.txt. Here are a few example lines from this file:

## Date: 9-7-94
##
...
ADHERES AH0 D HH IH1 R Z
ADHERING AH0 D HH IH1 R IH0 NG
ADHESIVE AE0 D HH IY1 S IH0 V
ADHESIVE(2) AH0 D HH IY1 S IH0 V
...
The format is roughly this: A line starting with # is a comment. Other lines start with a word in upper case, then a space, then the pronounciation in a particular format (the phonemes are separated by spaces, we will not go into details).

Some words have more than one correct pronounciation, see "adhesive" above. As you can see, the additional pronounciations are indicated by the number in parentheses after the word. Here is another example for a word with three pronounciations (and partly different meanings):

MINUTE  M IH1 N AH0 T
MINUTE(2)  M AY0 N UW1 T
MINUTE(3)  M AY0 N Y UW1 T

Here is a function that reads the file and constructs a mapping. It simply ignores the additional pronounciations and only uses the first one for each word (note that it uses an immutable map):

def readPronounciations(): Map[String,String] = {
  val F = scala.io.Source.fromFile("cmudict.txt")
  var M = Map[String,String]()
  for (l <- F.getLines()) {
    if (l(0).isLetter) {
      val p = l.trim.split("\\s+", 2)
      val word = p(0).toLowerCase
      if (!(word contains '(')) {
	val pro = p(1)
	M = M + (word -> pro)
      }
    }
  }
  M
}

Here's a quick test of the function:

scala> val m = readPronounciations()
m: Map[String,String] = Map(frane -> F R EY1 N, ciresi -> S ER0 EH1 S
IY0, hilyer -> HH IH1 L IY0 ER0, scorsone -> S K AO1 R S AH0 N,
professed -> P R AH0 F EH1 S T, pathogens -> P AE1 TH AH0 JH AH0 N Z,
senger -> S EH1 N JH ER0, metacarpals -> M EH2 T AH0 K AA1 R P AH0 L
Z, hogston -> HH AA1 G S T AH0 N, stum -> S T AH1 M, chary -> CH AA1 R
IY0, lillich -> L IH1 L IH0 K, motts -> M AA1 T S, cerio -> CH EH1 R
IY0 OW0, laferriere -> L AE1 F ER0 IY0 EH0 R, phosphates -> F AA1 S F
EY0 T S, quotient -> K W OW1 SH AH0 N T, hirschhorn -> HH ER1 SH HH
ER0 N, clarissa -> K L ER0 IH1 S AH0, incident -> IH1 N S AH0 D AH0 N
T, stucker -> S T AH1 K ER0, aldys -> AA1 L D Y S, transrapid -> T R
AE1 N Z R AE1 P IH0 D, meteorologist -> M IY2 T IY0 ER0 AA1 L AH0 JH
IH0 S T, misfire -> M IH0 S F AY1 ER0,...
scala> m("minute")
res0: String = M IH1 N AH0 T
scala> m("knight")
res2: String = N AY1 T
scala> m("night")
res3: String = N AY1 T
scala> m("weird")
res4: String = W IH1 R D

Now let's put our map to some use. English has many words that are homophones: they sound the same, like “be” and “bee”, or ”sewing” and ”sowing”. We want to find some more examples. What would be the largest number of different words that all have the same pronounciation?

To determine this, we need to create a dictionary for the opposite direction: mapping pronounciations to words. Since there may be several words with the same pronounciation, this will be a Map[String,Set[String]], that is a mapping from strings to sets of strings.

We write a general function that computes the inverse function of a map:

def reverseMap(M: Map[String,String]): Map[String,Set[String]] = {
  var R = Map[String,Set[String]]()
  for ((word, pro) <- M) {
    val S = R.getOrElse(pro, Set())
    R = R + (pro -> (S + word))
  }
  R
}

We test with some examples:

scala> val r = reverseMap(m)
r: Map[String,Set[String]] = Map(T EH1 L AH0 D AY2 N Z ->
Set(teledyne's), M OW0 M EY1 EH0 Z -> Set(momayez), S P AH1 D ->
Set(spud), AH2 N D ER0 EH1 S T IH0 M EY2 T IH0 NG ->
Set(underestimating), P R IY0 K UH1 K T -> Set(precooked), AE1 G R IY0
-> Set(agri), S AW0 TH HH AE1 M P T AH0 N -> Set(southampton), AA0 D
AO1 N -> Set(adon), SH IY1 K S -> Set(sheik's, sheeks, sheiks), N OW1
T B UH2 K S -> Set(notebooks), AE1 TH L IY2 T -> Set(athlete), M EY1 B
AH0 L -> Set(mable, mabel), ER1 M IH0 N IY0 -> Set(erminie), HH ER0
AE1 NG D -> Set(harangued), L IH1 P F ER0 D -> Set(lipford), L EH1 N D
Z -> Set(lends), P AE1 D L AA2 K -> Set(padlock), D OW2 B R AH0 ZH IH1
N S K IY0 -> Set(dobrzynski), M IH2 S D AH0 M IY1 N ER0 Z ->
Set(misdemeanors), R OW1 B AA2 T -> Set(robot), R EY1 TH IY0 AA0 N
->...
scala> r(m("knight"))
res5: Set[String] = Set(night, nite, knight)
scala> r(m("weird"))
res6: Set[String] = Set(weird)
scala> r(m("minute"))
res7: Set[String] = Set(minott, minute, minot, mynatt)
scala> r(m("be"))
res8: Set[String] = Set(bea, b., b, bee, be)

And now we use the reverse map to display all the words that have at least a certain number of homophones:

def showHomophones(k: Int) {
  val m = readPronounciations()
  var r = reverseMap(m)
  for ((pro, words) <- r) {
    if (words.size >= k) {
      printf("%s (%d words):\n", pro, words.size)
      println("  " + words.mkString(" "))
    }
  }
}

Here is the output for \(k = 10\):

scala> showHomophones(10)
S IY1 Z (11 words):
  c.s sea's sease sees seese c.'s saez seas cees sies seize
S IY1 (10 words):
  sie sci sea si cie cea c. c sieh see
M EY1 Z (10 words):
  mae's mayes may's maze mays mais maes maize mayse mase
K EH1 R IY0 (10 words):
  cary kerry kary carrie kairey carey cheri kerrey kari carie
OW1 (10 words):
  eau oh owe o. aux eaux o' au ow o
F R IY1 Z (10 words):
  frees frieze frease freeze friese freas freis freese frese friis
R OW1 (10 words):
  reaux roe rohe row wroe roh rowe rho rheault ro
SH UW1 (11 words):
  shu schue schou shue schuh hsu shew schoo shoo shoe shiu
L AO1 R IY0 (11 words):
  laury lori lorey lorrie lawry laurie lawrie lorry lorie lowrie lory

Let's try another puzzle: There are words in English that sound the same if you remove the first letter: "knight" and "night" is an example. Let's try to find all such words:

def findWords() {
  val m = readPronounciations()
  for ((word, pro) <- m) {
    val ord = word.substring(1)
    if (pro == m.getOrElse(ord, ""))
      println(word)
  }
}

And here is the output:

scala> findWords()
knack
pfister
kneed
knapp
eau
knoble
knuckles
wreck
lloyd
pfohl
ar
eiseman
ngo
knoles
kness
knaus
knee
wrights
whorton
knoell
knauer
wracked
pfizer
writer
scent
knott
wrath
knoll
wresting
hwang
wrench
wrage
mme
llanes
knicks
wrapped
wrecker
wroten
wright's
llewellyn
wright
el
pfund
wrote
psalters
schau
knope
knipper
kneel
knuts
knauss
aisle
knight's
knew
wroe
knill
wruck
ourso
knell
knees
eula
hwan
wrather
gnu
hour
gnats
wriggle
wrobel
wrye
eng
eide
knipple
wringer
wring
em
wrisley
ailes
eiler
llana
mmonoxide
yu
knights
hours
knit
eerie
knightly
wrap
ex
pty
kneller
knots
pfarr
eaux
wrung
wringing
wracking
wriston
pfeffer
knut
knode
knot
wray
en
wrought
knab
pfahl
ai
kneale
kneer
knipp
wrist
es
hmong
knick
wren
llama
knicely
knock
extra
whole
wrappers
eudy
wrack
knobbe
llano
wrubel
eiden
eike
knoth
kneece
knave
knebel
knabb
scents
ee
knies
wrona
knight
wrenn
writes
aisles
herb
write
wrest
kniess
eury
psalter
knapper
eisenberg
How many of these words do you know?

There is a word in English where you can remove both the first letter or the second letter, and it still sounds the same. (In other words, if the whole word is XYABCDE, then YABCDE and XABCDE sound the same as XYABCDE.)

Can you figure out which word this is?

ListsProgramming Practice (CS109)ArrayBuffer and StringBuilderMaps