CS109 Programming Projects -- Dissociated Press

Dissociated Press

In this project we implement the Dissociated press algorithm, which can generate a random text based on a given text. For instance, based on kaist.txt, it could generate this text:

into a catalyst for mid to better 
serve a world leader in march 2009 further 
expanded kaist’s academic asset in 1997 as a 
student faculty ratio of korean government and applied 
studies kaist was a producer of transforming korea 
from the major provider of engineering education and 
doctoral programs in science and communications university in 
its school programs in 1997 as the united 
states research university has pioneered the university’s strong 
faculty educated in principle to better serve a 
student faculty conducts internationally recognized research emphasis from 
the outset has in its school of high 
tech manpower for mid to

The algorithm is based on n-grams: An \(n\)-gram is a sequence of \(n\) words from the original text. We will not take frequencies of \(n\)-grams into account. For instance, the text

The bee is the bee of the bees.

has the 2-grams

the bee
the bees
bee is
bee of
is the
of the

The dissociated press algorithm starts by printing a random \(n\)-gram. Then it takes the last \(n-1\) words it has printed, and chooses a random \(n\)-gram that starts with these \(n-1\) words. It prints the last word of this \(n\)-gram, and repeats. So every consecutive \(n\) words of the output text are an \(n\)-gram of the original text. In can sometimes happen that the original text contains no \(n\)-gram starting with the \(n-1\) words just printed. In this case, the algorithm simply stops.

Your task is to implement this algorithm as a compiled program ngram.kt. The program takes three command line parameters: a text file with the original text, the parameter n determining the size of the n-grams (n is at least two), and the number of words to generate. For instance, the example above could have been generated like this:

$ ktc ngram.kt
$ kt NgramKt kaist.txt 2 100

As \(n\) gets larger, the generated text will become more similar to the original text.

Your program must be efficient even for large input file and larger values of \(n\). To achieve this, you must use a Map to store the \(n\)-grams. The map should map the sequence of \(n-1\) initial words (for instance, as a List<String>) to the possible final words (for instance, as a Set<String>).

Test your program with an interesting, long text, for instance with alice.txt.

To keep it simple, you can completely ignore punctuation and produce a text without punctuation, as I did above. As a bonus task you can add punctuation to your random text output in a reasonable way.

Dissociated Press