Dissociated Press |
into a catalyst for mid to better serve a world leader in march 2009 further expanded kaist’s academic asset in 1997 as a student faculty ratio of korean government and applied studies kaist was a producer of transforming korea from the major provider of engineering education and doctoral programs in science and communications university in its school programs in 1997 as the united states research university has pioneered the university’s strong faculty educated in principle to better serve a student faculty conducts internationally recognized research emphasis from the outset has in its school of high tech manpower for mid to
The algorithm is based on n-grams: An \(n\)-gram is a sequence of \(n\) words from the original text. We will not take frequencies of \(n\)-grams into account. For instance, the text
The bee is the bee of the bees.has the 2-grams
the bee the bees bee is bee of is the of the
The dissociated press algorithm starts by printing a random \(n\)-gram. Then it takes the last \(n-1\) words it has printed, and chooses a random \(n\)-gram that starts with these \(n-1\) words. It prints the last word of this \(n\)-gram, and repeats. So every consecutive \(n\) words of the output text are an \(n\)-gram of the original text. In can sometimes happen that the original text contains no \(n\)-gram starting with the \(n-1\) words just printed. In this case, the algorithm simply stops.
Your task is to implement this algorithm as a compiled program ngram.kt. The program takes three command line parameters: a text file with the original text, the parameter n determining the size of the n-grams (n is at least two), and the number of words to generate. For instance, the example above could have been generated like this:
$ ktc ngram.kt $ kt NgramKt kaist.txt 2 100
As \(n\) gets larger, the generated text will become more similar to the original text.
Your program must be efficient even for large input file and larger values of \(n\). To achieve this, you must use a Map to store the \(n\)-grams. The map should map the sequence of \(n-1\) initial words (for instance, as a List<String>) to the possible final words (for instance, as a Set<String>).
Test your program with an interesting, long text, for instance with alice.txt.
To keep it simple, you can completely ignore punctuation and produce a text without punctuation, as I did above. As a bonus task you can add punctuation to your random text output in a reasonable way.
Dissociated Press |