Maps |
To compare different authors, or to identify a good match in a web search, we can use a histogram of a document. It contains all the words used, and for each word how often it was used.
In other words, given an input text, we want to compute a mapping
We therefore need a data type that can store pairs of (word, count), that is, pairs of (String, Int). It should support the following operations:
This data type is called a map or dictionary. A map implements a mapping from some key type to some value type.
The Java library provides a parameterized data type Map<K,V>, where K is the key type and V is the value type. We can think of it as a container for (K,V) pairs.
We can create such a map as follows:
>>> val m1 = mapOf(Pair("A", 3), Pair("B", 7)) >>> m1 {A=3, B=7}
Instead of having to write Pair for every mapping, we can use the small utility function to. It does nothing but combine two elements into a pair, and makes the syntax for creating a map nicer:
>>> 23 to 19 (23, 19) >>> "CS109" to "Otfried" (CS109, Otfried) >>> val m = mapOf("A" to 7, "B" to 13) >>> m {A=7, B=13}
We can access the mappings using the nice mathematical syntax \(m[x]\) for the result of mapping key \(x\) by the mapping \(m\):
>>> m["A"] 7 >>> m["B"] 13 >>> m["C"] nullNote that requesting a key that is not in the map will return the value null. This means that the result type of the map access is actually V? (see nullable types).
This means that you have to first check the result for null before you can do anything with it:
>>> m["B"] + 7 error: infix call corresponds to a dot-qualified call 'm["B"].plus(7)' which is not allowed on a nullable receiver 'm["B"]'. Use ?.-qualified call instead
Alternatively, you can use the getOrElse method: if the key is not in the map, it will use the provided code to compute a default value:
>>> m.getOrElse("B") { 99 } 13 >>> m.getOrElse("C") { 99 } 99 >>> m.getOrElse("B") { 99 } + 7 20 >>> m.getOrElse("C") { 99 } + 7 106The result type of getOrElse is the (not nullable) value type V, so there is no need to check for null.
We can check if a mapping is defined for a given key using the in operator:
>>> "C" in m false >>> "A" in m true
You can determine the number of entries in a map m as m.size, use m.isEmpty() to determine if the map has no mappings, and you can iterate over all the entries of a map using a for-loop:
>>> val m = mapOf("A" to 7, "B" to 13) >>> m.size 2 >>> for ((k, v) in m) ... println("$k -> $v") A -> 7 B -> 13
Often we need a mutable map, where we can add, update, and remove mappings. We use the syntax m[key] on the left side of an assignment to add or change a mapping. Mappings are removed using m.remove(key).
>>> val m = mutableMapOf("A" to 7, "B" to 13) >>> m {A=7, B=13} >>> m["C"] = 13 >>> m {A=7, B=13, C=13} >>> m.remove("A") 7 >>> m {B=13, C=13} >>> m["B"] = 42 >>> m {B=42, C=13}
Another useful method is getOrPut. If the given key is in the map, it returns the value from the map. Otherwise, it executes the given piece of code, stores the value in the map (for the given key), and returns the value:
>>> m.getOrPut("B") { 99 } 42 >>> m {B=42, C=13} >>> m.getOrPut("D") { 99 } 99 >>> m {B=42, C=13, D=99}
Here is my first attempt at a histogram program (histogram1.kts):
fun histogram(fname: String): Map<String, Int> { val file = java.io.File(fname) val hist = mutableMapOf<String, Int>() file.forEachLine { if (it != "") { val words = it.split(Regex("[ ,:;.?!<>()-]+")) for (word in words) { if (word == "") continue val upword = word.toUpperCase() hist[upword] = hist.getOrElse(upword) { 0 } + 1 } } } return hist } if (args.size != 1) { println("Usage: kts histogram1.kts <file name>") kotlin.system.exitProcess(1) } val fname = args[0] val hist = histogram(fname) println(hist)
The function histogram creates an empty map hist from strings to integers. It then looks at all words in the file, converts them to uppercase. Using getOrElse, we obtain the current mapping of the word or zero if the word is not yet there. We increase the number by one, and store the new number in the map.
When I run it on the text file text.txt, I get this output:
$ kts histogram1.kts text.txt {WHEN=2, I=3, STARTED=1, PROGRAMMING=1, THERE=1, WERE=2, NO=1, GRAPHICAL=2, DISPLAYS=2, ALL=1, COMPUTER=7, INPUT=1, AND=2, OUTPUT=2, WAS=5, DONE=1, USING=1, TEXT=1, A=9, SHARED=1, BY=4, MANY=1, USERS=2, EACH=1, USER=2, CONNECTED=1, TO=3, THE=14, FROM=2, TERMINAL=3, WHICH=1, IS=2, CALLED=1, THIS=1, WAY=1, BECAUSE=1, IT=1, ENDPOINT=1, OF=4, CONNECTION=1, EARLIEST=1, TERMINALS=1, LOOKED=1, LIKE=1, COMBINATION=1, PRINTER=1, WITH=4, KEYBOARD=2, IN=1, MIDDLE=1, SCHOOL=1, SOMETIMES=1, ALLOWED=1, PLAY=1, THAT=2, USED=1, SUCH=1, PRINTING=2, PRINTERS=1, LATER=1, REPLACED=1, CRT=1, COULD=1, TYPICALLY=1, DISPLAY=1, MATRIX=1, 25X80=1, CHARACTERS=2, ASCII=1, ALPHABET=1, LETTERS=1, DIGITS=1, SOME=1, SPECIAL=1, INTERACTED=1, TYPING=1, COMMAND=2, ON=2, RESPONDED=1}
The output is not very pretty, so I should use a nicer way of printing out the map. This can be done by iterating over the contents of the map, as follows (histogram2.kts):
fun printHistogram(h: Map<String, Int>) { for ((word, count) in h) println("%20s: %d".format(word, count)) }
The output is much nicer:
$ kts histogram2.kts text.txt WHEN: 2 I: 3 STARTED: 1 PROGRAMMING: 1 THERE: 1 WERE: 2 NO: 1 GRAPHICAL: 2 DISPLAYS: 2 ALL: 1 COMPUTER: 7 INPUT: 1 AND: 2 OUTPUT: 2 WAS: 5 DONE: 1 USING: 1 TEXT: 1 A: 9 SHARED: 1 BY: 4 MANY: 1 USERS: 2 EACH: 1 USER: 2 CONNECTED: 1 TO: 3 THE: 14 FROM: 2 TERMINAL: 3 WHICH: 1 IS: 2 CALLED: 1 THIS: 1 WAY: 1 BECAUSE: 1 IT: 1 ENDPOINT: 1 OF: 4 CONNECTION: 1 EARLIEST: 1 TERMINALS: 1 LOOKED: 1 LIKE: 1 COMBINATION: 1 PRINTER: 1 WITH: 4 KEYBOARD: 2 IN: 1 MIDDLE: 1 SCHOOL: 1 SOMETIMES: 1 ALLOWED: 1 PLAY: 1 THAT: 2 USED: 1 SUCH: 1 PRINTING: 2 PRINTERS: 1 LATER: 1 REPLACED: 1 CRT: 1 COULD: 1 TYPICALLY: 1 DISPLAY: 1 MATRIX: 1 25X80: 1 CHARACTERS: 2 ASCII: 1 ALPHABET: 1 LETTERS: 1 DIGITS: 1 SOME: 1 SPECIAL: 1 INTERACTED: 1 TYPING: 1 COMMAND: 2 ON: 2 RESPONDED: 1
This is still not perfect, because it's quite hard to find the word we are looking for. It would be better if the list was sorted. We can do this by converting the map to a sorted map (histogram3.kts):
fun printHistogram(h: Map<String, Int>) { val s = h.toSortedMap() for ((word, count) in s) println("%20s: %d".format(word, count)) }
Now the output becomes:
$ kts histogram3.kts text.txt 25X80: 1 A: 9 ALL: 1 ALLOWED: 1 ALPHABET: 1 AND: 2 ASCII: 1 BECAUSE: 1 BY: 4 CALLED: 1 CHARACTERS: 2 COMBINATION: 1 COMMAND: 2 COMPUTER: 7 CONNECTED: 1 CONNECTION: 1 COULD: 1 CRT: 1 DIGITS: 1 DISPLAY: 1 DISPLAYS: 2 DONE: 1 EACH: 1 EARLIEST: 1 ENDPOINT: 1 FROM: 2 GRAPHICAL: 2 I: 3 IN: 1 INPUT: 1 INTERACTED: 1 IS: 2 IT: 1 KEYBOARD: 2 LATER: 1 LETTERS: 1 LIKE: 1 LOOKED: 1 MANY: 1 MATRIX: 1 MIDDLE: 1 NO: 1 OF: 4 ON: 2 OUTPUT: 2 PLAY: 1 PRINTER: 1 PRINTERS: 1 PRINTING: 2 PROGRAMMING: 1 REPLACED: 1 RESPONDED: 1 SCHOOL: 1 SHARED: 1 SOME: 1 SOMETIMES: 1 SPECIAL: 1 STARTED: 1 SUCH: 1 TERMINAL: 3 TERMINALS: 1 TEXT: 1 THAT: 2 THE: 14 THERE: 1 THIS: 1 TO: 3 TYPICALLY: 1 TYPING: 1 USED: 1 USER: 2 USERS: 2 USING: 1 WAS: 5 WAY: 1 WERE: 2 WHEN: 2 WHICH: 1 WITH: 4
Maps are implemented using a hash table, which allows extremely fast insertion, removal, and search, but does not maintain any ordering on the keys. (Come to CS206 to learn about hash tables.)
Let's build a real "dictionary", one that maps words to other words. In this case, we want to build a map that maps English words to their pronounciation (English pronounciation is so unpredictable, it would be nice to have a tool to help with this).
We will use the data file cmudict.txt. Here are a few example lines from this file:
## Date: 9-7-94 ## ... ADHERES AH0 D HH IH1 R Z ADHERING AH0 D HH IH1 R IH0 NG ADHESIVE AE0 D HH IY1 S IH0 V ADHESIVE(2) AH0 D HH IY1 S IH0 V ...The format is roughly this: A line starting with # is a comment. Other lines start with a word in upper case, then a space, then the pronounciation in a particular format (the phonemes are separated by spaces, we will not go into details).
Some words have more than one correct pronounciation, see "adhesive" above. As you can see, the additional pronounciations are indicated by the number in parentheses after the word. Here is another example for a word with three pronounciations (and partly different meanings):
MINUTE M IH1 N AH0 T MINUTE(2) M AY0 N UW1 T MINUTE(3) M AY0 N Y UW1 T
Here is a function that reads the file and constructs a mapping. It simply ignores the additional pronounciations and only uses the first one for each word (note that it returns an immutable map even though it internally works with a mutable map): (cmudict1.kts):
fun readPronounciations(): Map<String,String> { val file = java.io.File("cmudict.txt") var m = mutableMapOf<String, String>() file.forEachLine { l -> if (l[0].isLetter()) { val p = l.trim().split(Regex("\\s+"), 2) val word = p[0].toLowerCase() if (!("(" in word)) m[word] = p[1] } } return m }
Here's a quick test of the function:
>>> val m = readPronounciations() >>> m["minute"] M IH1 N AH0 T >>> m["knight"] N AY1 T >>> m["night"] N AY1 T >>> m["weird"] W IH1 R D
Now let's put our map to some use. English has many words that are homophones: they sound the same, like “be” and “bee”, or ”sewing” and ”sowing”. We want to find some more examples. What would be the largest number of different words that all have the same pronounciation?
To determine this, we need to create a dictionary for the opposite direction: mapping pronounciations to words. Since there may be several words with the same pronounciation, this will be a Map<String, Set<String>>, that is a mapping from strings to sets of strings.
We write a general function that computes the inverse function of a map:
fun reverseMap(m: Map<String, String>): Map<String,Set<String>> { var r = mutableMapOf<String,MutableSet<String>>() for ((word, pro) in m) { val s = r.getOrElse(pro) { mutableSetOf<String>() } s.add(word) r[pro] = s } return r }
We test with some examples:
>>> val r = reverseMap(m) >>> r[m["knight"]] [knight, night, nite] >>> r[m["weird"]] [weird] >>> r[m["minute"]] [minot, minott, minute, mynatt] >>> r[m["be"]] [b, b., be, bea, bee]
And now we use the reverse map to display all the words that have at least a certain number of homophones:
fun showHomophones(k: Int) { val m = readPronounciations() var r = reverseMap(m) for ((pro, words) in r) { if (words.size >= k) { print("$pro (${words.size} words):") println(" " + words.joinToString(separator=" ")) } } }
Here is the output for \(k = 10\):
>>> showHomophones(10) OW1 (10 words): au aux eau eaux o o' o. oh ow owe S IY1 (10 words): c c. cea cie sci sea see si sie sieh S IY1 Z (11 words): c.'s c.s cees saez sea's seas sease sees seese seize sies K EH1 R IY0 (10 words): carey carie carrie cary cheri kairey kari kary kerrey kerry F R IY1 Z (10 words): freas frease frees freese freeze freis frese friese frieze friis SH UW1 (11 words): hsu schoo schou schue schuh shew shiu shoe shoo shu shue L AO1 R IY0 (11 words): laurie laury lawrie lawry lorey lori lorie lorrie lorry lory lowrie M EY1 Z (10 words): mae's maes mais maize mase may's mayes mays mayse maze R OW1 (10 words): reaux rheault rho ro roe roh rohe row rowe wroe
Let's try another puzzle: There are words in English that sound the same if you remove the first letter: "knight" and "night" is an example. Let's try to find all such words:
fun findWords() { val m = readPronounciations() for ((word, pro) in m) { val ord = word.substring(1) if (pro == m[ord]) println(word) } }
And here is the output:
>>> findWords() ai ailes aisle aisles ar eau eaux ee eerie eide eiden eike eiler eiseman eisenberg el em en eng es eudy eula eury ex extra gnats gnu herb hmong hour hours hwan hwang knab knabb knack knapp knapper knauer knaus knauss knave kneale knebel knee kneece kneed kneel kneer knees knell kneller kness knew knicely knick knicks knies kniess knight knight's knightly knights knill knipp knipper knipple knit knobbe knoble knock knode knoell knoles knoll knope knot knoth knots knott knuckles knut knuts llama llana llanes llano llewellyn lloyd mme mmonoxide ngo ourso pfahl pfarr pfeffer pfister pfizer pfohl pfund psalter psalters pty scent scents schau whole whorton wrack wracked wracking wrage wrap wrapped wrappers wrath wrather wray wreck wrecker wren wrench wrenn wrest wresting wriggle wright wright's wrights wring wringer wringing wrisley wrist wriston write writer writes wrobel wroe wrona wrote wroten wrought wrubel wruck wrung wrye yuHow many of these words do you know?
There is a word in English where you can remove both the first letter or the second letter, and it still sounds the same. (In other words, if the whole word is XYABCDE, then YABCDE and XABCDE sound the same as XYABCDE.)
Can you figure out which word this is?
Maps |