Maps |
To compare different authors, or to identify a good match in a web search, we can use a histogram of a document. It contains all the words used, and for each word how often it was used.
In other words, given an input text, we want to compute a mapping
We therefore need a data type that can store pairs of (word, count), that is, pairs of type (String, Int). It should support the following operations:
This data type is called a map or dictionary. A map implements a mapping from some key type to some value type.
Scala provides a parameterized data type Map[K,V], where K is the key type and V is the value type. We can think of it as a container for (K,V) pairs.
We can create such a map as follows:
scala> val m1 = Map(("A", 3), ("B", 7)) m1: scala.collection.immutable.Map[String,Int] = Map(A -> 3, B -> 7)
As you can see, when Scala prints the map, it shows nice mappings A -> 3 and B -> 7 instead of the pairs. We can use the same syntax when we create maps, because Scala interprets the -> operator as building a pair:
scala> 23 -> 19 res0: (Int, Int) = (23,19) scala> "CS109" -> "Otfried" res1: (String, String) = (CS109,Otfried) scala> val m = Map("A" -> 7, "B" -> 13) m: scala.collection.immutable.Map[String,Int] = Map(A -> 7, B -> 13)
We can access the mappings using the nice mathematical syntax \(m(x)\) for the result of mapping key \(x\) by the mapping \(m\):
scala> m("A") res2: Int = 7 scala> m("B") res3: Int = 13 scala> m("C") java.util.NoSuchElementException: key not found: CNote that an exception is thrown when the key is not in the map.
We can check if a mapping is defined for a given key using contains, and there is a nice method getOrElse that provides a default mapping for keys that are not mapped:
scala> m contains "C" res5: Boolean = false scala> m contains "A" res6: Boolean = true scala> m.getOrElse("A", 99) res7: Int = 7 scala> m.getOrElse("C", 99) res8: Int = 99
We can add or change mappings in a map using the + operator, and we can remove mappings using the - operator. Note that maps are immutable, so all of these operators return a new map:
scala> m + ("C" -> 13) res9: scala.collection.immutable.Map[String,Int] = Map(A -> 7, B -> 13, C -> 13) scala> m - "A" res10: scala.collection.immutable.Map[String,Int] = Map(B -> 13) scala> m - "C" res11: scala.collection.immutable.Map[String,Int] = Map(A -> 7, B -> 13) scala> m + ("A" -> 99) res12: scala.collection.immutable.Map[String,Int] = Map(A -> 99, B -> 13)
However, Scala also provides mutable maps. Mutable maps have operators += and -= for adding (updating) and removing mappings. We can also use the syntax m(key) on the left side of an assignment to add or change a mapping:
scala> import scala.collection.mutable.Map import scala.collection.mutable.Map scala> val m = Map("A" -> 7, "B" -> 9) m: scala.collection.mutable.Map[String,Int] = Map(A -> 7, B -> 9) scala> m += ("C" -> 13) res0: m.type = Map(A -> 7, C -> 13, B -> 9) scala> m res1: scala.collection.mutable.Map[String,Int] = Map(A -> 7, C -> 13, B -> 9) scala> m("C") res2: Int = 13 scala> m -= "A" res3: m.type = Map(C -> 13, B -> 9) scala> m res4: scala.collection.mutable.Map[String,Int] = Map(C -> 13, B -> 9) scala> m("A") java.util.NoSuchElementException: key not found: A scala> m("A") = 19 scala> m("B") = 99 scala> println(m) Map(A -> 19, C -> 13, B -> 99)
Note that since I use the import statement in the first line, I can simply write Map for mutable maps. Of course I could also create a mutable map with the long name (in this case making an empty map and adding mappings later):
scala> val s = scala.collection.mutable.Map[String,Int]() s: scala.collection.mutable.Map[String,Int] = Map() scala> s("Otfried") = 13 scala> s("Youjin") = 37 scala> s("Yeonhee") = 59 scala> println(s) Map(Youjin -> 37, Yeonhee -> 59, Otfried -> 13)
Here is my first attempt at a histogram program:
import scala.collection.mutable.Map def histogram(fname: String): Map[String, Int] = { val F = scala.io.Source.fromFile(fname) val hist = Map[String, Int]() for (line <- F.getLines()) { if (line != "") { val words = line.split("[ ,:;.?!()-]+") for (word <- words) { val upword = word.toUpperCase if (hist contains upword) hist(upword) += 1 else hist(upword) = 1 } } } hist } if (args.length != 1) { println("Usage: scala histogram1.scala <file name>") sys.exit(1) } val fname = args(0) val hist = histogram(fname) println(hist)
The function histogram creates an empty map hist from strings to integers. It then looks at all words in the file, converts them to uppercase, and checks if the word is already in the map. If not, it adds the mapping \(w \rightarrow 1\) to the map. If it is already in the map, it changes the mapping by increasing its value by one (so if the old mapping was \(w \rightarrow n\), the new mapping is \(w \rightarrow n+1\)).
When I run it on the text file scala.txt, I get this output:
$ scala histogram1.scala scala.txt Map(START -> 1, THAT -> 3, PROGRAMMING -> 4, SO -> 1, STYLE -> 3, WELL -> 1, USING -> 1, IT -> 2, FIRST -> 1, DURING -> 1, IN -> 5, SCALA -> 4, PRACTICE -> 1, JAVA -> 1, OBJECT -> 1, THOSE -> 1, CODE -> 2, A -> 6, FIND -> 1, EXPERIENCE -> 1, BUT -> 1, ORIENTED -> 1, LANGUAGES -> 1, YET -> 1, THIS -> 3, LANGUAGE -> 5, AS -> 1, AFTER -> 1, ALTHOUGH -> 1, SUPPORTS -> 1, GOOD -> 1, I -> 4, BETTER -> 1, NOT -> 1, IS -> 1, EVEN -> 1, OF -> 1, MOST -> 1, OR -> 1, PLAN -> 1, USE -> 2, DIFFERENT -> 2, COURSE -> 1, SEMESTER -> 1, DEVELOP -> 1, ABLE -> 1, AND -> 4, EARLY -> 1, POPULAR -> 1, BE -> 2, WRITING -> 1, WRITE -> 1, AND/OR -> 1, TRY -> 1, DIFFICULT -> 1, CLEAN -> 1, WILL -> 3, TO -> 7, IF -> 1, SHOULD -> 1, LEARN -> 2, STUDENTS -> 3, ELEGANT -> 1, STUDIES -> 1, SUCCESSFUL -> 1, HAVE -> 1, GETTING -> 1, MEANS -> 1, LATER -> 2, ENCOURAGES -> 1, THEREFORE -> 1, BELIEVE -> 2, THEY -> 3, FUNCTIONAL -> 1, THE -> 1, MY -> 1, MODERN -> 1, WANT -> 1, C++ -> 1, SWITCH -> 1, THEIR -> 2)
The output is not very pretty, so I should use a nicer way of printing out the map. This can be done by iterating over the contents of the map, as follows:
def printHistogram(h: Map[String, Int]) { for ((word, count) <- h) printf("%20s: %d\n", word, count) }
The output is much nicer:
$ scala histogram2.scala scala.txt START: 1 THAT: 3 PROGRAMMING: 4 SO: 1 STYLE: 3 WELL: 1 USING: 1 IT: 2 FIRST: 1 DURING: 1 IN: 5 SCALA: 4 PRACTICE: 1 JAVA: 1 OBJECT: 1 THOSE: 1 CODE: 2 A: 6 FIND: 1 EXPERIENCE: 1 BUT: 1 ORIENTED: 1 LANGUAGES: 1 YET: 1 THIS: 3 LANGUAGE: 5 AS: 1 AFTER: 1 ALTHOUGH: 1 SUPPORTS: 1 GOOD: 1 I: 4 BETTER: 1 NOT: 1 IS: 1 EVEN: 1 OF: 1 MOST: 1 OR: 1 PLAN: 1 USE: 2 DIFFERENT: 2 COURSE: 1 SEMESTER: 1 DEVELOP: 1 ABLE: 1 AND: 4 EARLY: 1 POPULAR: 1 BE: 2 WRITING: 1 WRITE: 1 AND/OR: 1 TRY: 1 DIFFICULT: 1 CLEAN: 1 WILL: 3 TO: 7 IF: 1 SHOULD: 1 LEARN: 2 STUDENTS: 3 ELEGANT: 1 STUDIES: 1 SUCCESSFUL: 1 HAVE: 1 GETTING: 1 MEANS: 1 LATER: 2 ENCOURAGES: 1 THEREFORE: 1 BELIEVE: 2 THEY: 3 FUNCTIONAL: 1 THE: 1 MY: 1 MODERN: 1 WANT: 1 C++: 1 SWITCH: 1 THEIR: 2
This is still not perfect, because it's quite hard to find the word we are looking for. It would be better if the list was sorted. We can do this by first converting the map to an array, and sorting the array:
def printHistogram(h: Map[String, Int]) { val sorted = h.toArray.sorted for ((word, count) <- sorted) printf("%20s: %d\n", word, count) }
Now the output becomes:
$ scala histogram3.scala scala.txt A: 6 ABLE: 1 AFTER: 1 ALTHOUGH: 1 AND: 4 AND/OR: 1 AS: 1 BE: 2 BELIEVE: 2 BETTER: 1 BUT: 1 C++: 1 CLEAN: 1 CODE: 2 COURSE: 1 DEVELOP: 1 DIFFERENT: 2 DIFFICULT: 1 DURING: 1 EARLY: 1 ELEGANT: 1 ENCOURAGES: 1 EVEN: 1 EXPERIENCE: 1 FIND: 1 FIRST: 1 FUNCTIONAL: 1 GETTING: 1 GOOD: 1 HAVE: 1 I: 4 IF: 1 IN: 5 IS: 1 IT: 2 JAVA: 1 LANGUAGE: 5 LANGUAGES: 1 LATER: 2 LEARN: 2 MEANS: 1 MODERN: 1 MOST: 1 MY: 1 NOT: 1 OBJECT: 1 OF: 1 OR: 1 ORIENTED: 1 PLAN: 1 POPULAR: 1 PRACTICE: 1 PROGRAMMING: 4 SCALA: 4 SEMESTER: 1 SHOULD: 1 SO: 1 START: 1 STUDENTS: 3 STUDIES: 1 STYLE: 3 SUCCESSFUL: 1 SUPPORTS: 1 SWITCH: 1 THAT: 3 THE: 1 THEIR: 2 THEREFORE: 1 THEY: 3 THIS: 3 THOSE: 1 TO: 7 TRY: 1 USE: 2 USING: 1 WANT: 1 WELL: 1 WILL: 3 WRITE: 1 WRITING: 1 YET: 1
Maps are implemented using a hash table, which allows extremely fast insertion, removal, and search, but does not maintain any ordering on the keys. (Come to CS206 to learn about hash tables.)
Let's build a real "dictionary", one that maps words to other words. In this case, we want to build a map that maps English words to their pronounciation (English pronounciation is so unpredictable, it would be nice to have a tool to help with this).
We will use the data file cmudict.txt. Here are a few example lines from this file:
## Date: 9-7-94 ## ... ADHERES AH0 D HH IH1 R Z ADHERING AH0 D HH IH1 R IH0 NG ADHESIVE AE0 D HH IY1 S IH0 V ADHESIVE(2) AH0 D HH IY1 S IH0 V ...The format is roughly this: A line starting with # is a comment. Other lines start with a word in upper case, then a space, then the pronounciation in a particular format (the phonemes are separated by spaces, we will not go into details).
Some words have more than one correct pronounciation, see "adhesive" above. As you can see, the additional pronounciations are indicated by the number in parentheses after the word. Here is another example for a word with three pronounciations (and partly different meanings):
MINUTE M IH1 N AH0 T MINUTE(2) M AY0 N UW1 T MINUTE(3) M AY0 N Y UW1 T
Here is a function that reads the file and constructs a mapping. It simply ignores the additional pronounciations and only uses the first one for each word (note that it uses an immutable map):
def readPronounciations(): Map[String,String] = { val F = scala.io.Source.fromFile("cmudict.txt") var M = Map[String,String]() for (l <- F.getLines()) { if (l(0).isLetter) { val p = l.trim.split("\\s+", 2) val word = p(0).toLowerCase if (!(word contains '(')) { val pro = p(1) M = M + (word -> pro) } } } M }
Here's a quick test of the function:
scala> val m = readPronounciations() m: Map[String,String] = Map(frane -> F R EY1 N, ciresi -> S ER0 EH1 S IY0, hilyer -> HH IH1 L IY0 ER0, scorsone -> S K AO1 R S AH0 N, professed -> P R AH0 F EH1 S T, pathogens -> P AE1 TH AH0 JH AH0 N Z, senger -> S EH1 N JH ER0, metacarpals -> M EH2 T AH0 K AA1 R P AH0 L Z, hogston -> HH AA1 G S T AH0 N, stum -> S T AH1 M, chary -> CH AA1 R IY0, lillich -> L IH1 L IH0 K, motts -> M AA1 T S, cerio -> CH EH1 R IY0 OW0, laferriere -> L AE1 F ER0 IY0 EH0 R, phosphates -> F AA1 S F EY0 T S, quotient -> K W OW1 SH AH0 N T, hirschhorn -> HH ER1 SH HH ER0 N, clarissa -> K L ER0 IH1 S AH0, incident -> IH1 N S AH0 D AH0 N T, stucker -> S T AH1 K ER0, aldys -> AA1 L D Y S, transrapid -> T R AE1 N Z R AE1 P IH0 D, meteorologist -> M IY2 T IY0 ER0 AA1 L AH0 JH IH0 S T, misfire -> M IH0 S F AY1 ER0,... scala> m("minute") res0: String = M IH1 N AH0 T scala> m("knight") res2: String = N AY1 T scala> m("night") res3: String = N AY1 T scala> m("weird") res4: String = W IH1 R D
Now let's put our map to some use. English has many words that are homophones: they sound the same, like “be” and “bee”, or ”sewing” and ”sowing”. We want to find some more examples. What would be the largest number of different words that all have the same pronounciation?
To determine this, we need to create a dictionary for the opposite direction: mapping pronounciations to words. Since there may be several words with the same pronounciation, this will be a Map[String,Set[String]], that is a mapping from strings to sets of strings.
We write a general function that computes the inverse function of a map:
def reverseMap(M: Map[String,String]): Map[String,Set[String]] = { var R = Map[String,Set[String]]() for ((word, pro) <- M) { val S = R.getOrElse(pro, Set()) R = R + (pro -> (S + word)) } R }
We test with some examples:
scala> val r = reverseMap(m) r: Map[String,Set[String]] = Map(T EH1 L AH0 D AY2 N Z -> Set(teledyne's), M OW0 M EY1 EH0 Z -> Set(momayez), S P AH1 D -> Set(spud), AH2 N D ER0 EH1 S T IH0 M EY2 T IH0 NG -> Set(underestimating), P R IY0 K UH1 K T -> Set(precooked), AE1 G R IY0 -> Set(agri), S AW0 TH HH AE1 M P T AH0 N -> Set(southampton), AA0 D AO1 N -> Set(adon), SH IY1 K S -> Set(sheik's, sheeks, sheiks), N OW1 T B UH2 K S -> Set(notebooks), AE1 TH L IY2 T -> Set(athlete), M EY1 B AH0 L -> Set(mable, mabel), ER1 M IH0 N IY0 -> Set(erminie), HH ER0 AE1 NG D -> Set(harangued), L IH1 P F ER0 D -> Set(lipford), L EH1 N D Z -> Set(lends), P AE1 D L AA2 K -> Set(padlock), D OW2 B R AH0 ZH IH1 N S K IY0 -> Set(dobrzynski), M IH2 S D AH0 M IY1 N ER0 Z -> Set(misdemeanors), R OW1 B AA2 T -> Set(robot), R EY1 TH IY0 AA0 N ->... scala> r(m("knight")) res5: Set[String] = Set(night, nite, knight) scala> r(m("weird")) res6: Set[String] = Set(weird) scala> r(m("minute")) res7: Set[String] = Set(minott, minute, minot, mynatt) scala> r(m("be")) res8: Set[String] = Set(bea, b., b, bee, be)
And now we use the reverse map to display all the words that have at least a certain number of homophones:
def showHomophones(k: Int) { val m = readPronounciations() var r = reverseMap(m) for ((pro, words) <- r) { if (words.size >= k) { printf("%s (%d words):\n", pro, words.size) println(" " + words.mkString(" ")) } } }
Here is the output for \(k = 10\):
scala> showHomophones(10) S IY1 Z (11 words): c.s sea's sease sees seese c.'s saez seas cees sies seize S IY1 (10 words): sie sci sea si cie cea c. c sieh see M EY1 Z (10 words): mae's mayes may's maze mays mais maes maize mayse mase K EH1 R IY0 (10 words): cary kerry kary carrie kairey carey cheri kerrey kari carie OW1 (10 words): eau oh owe o. aux eaux o' au ow o F R IY1 Z (10 words): frees frieze frease freeze friese freas freis freese frese friis R OW1 (10 words): reaux roe rohe row wroe roh rowe rho rheault ro SH UW1 (11 words): shu schue schou shue schuh hsu shew schoo shoo shoe shiu L AO1 R IY0 (11 words): laury lori lorey lorrie lawry laurie lawrie lorry lorie lowrie lory
Let's try another puzzle: There are words in English that sound the same if you remove the first letter: "knight" and "night" is an example. Let's try to find all such words:
def findWords() { val m = readPronounciations() for ((word, pro) <- m) { val ord = word.substring(1) if (pro == m.getOrElse(ord, "")) println(word) } }
And here is the output:
scala> findWords() knack pfister kneed knapp eau knoble knuckles wreck lloyd pfohl ar eiseman ngo knoles kness knaus knee wrights whorton knoell knauer wracked pfizer writer scent knott wrath knoll wresting hwang wrench wrage mme llanes knicks wrapped wrecker wroten wright's llewellyn wright el pfund wrote psalters schau knope knipper kneel knuts knauss aisle knight's knew wroe knill wruck ourso knell knees eula hwan wrather gnu hour gnats wriggle wrobel wrye eng eide knipple wringer wring em wrisley ailes eiler llana mmonoxide yu knights hours knit eerie knightly wrap ex pty kneller knots pfarr eaux wrung wringing wracking wriston pfeffer knut knode knot wray en wrought knab pfahl ai kneale kneer knipp wrist es hmong knick wren llama knicely knock extra whole wrappers eudy wrack knobbe llano wrubel eiden eike knoth kneece knave knebel knabb scents ee knies wrona knight wrenn writes aisles herb write wrest kniess eury psalter knapper eisenbergHow many of these words do you know?
There is a word in English where you can remove both the first letter or the second letter, and it still sounds the same. (In other words, if the whole word is XYABCDE, then YABCDE and XABCDE sound the same as XYABCDE.)
Can you figure out which word this is?
Maps |