I entered the ICFP 2001
Programming Contest. My team name was "dqd", and I was the
only team member. The program name was "
which is short for "compress". Originally I called it
z is the standard
letter meaning "compression", but got tired of typing that.
(Filename completion didn't work because
sml.h, one of my header files.)
I am the guy who asked FAQ #30, regarding the output speed. I specifically mentioned a disk file and a 9600 bps connection in my e-mail.
press in a combination of
flex and C. The
flex code parsed the
input and compressed whitespace. The C code did everything
else. The C code used a number of functions marked as "public
domain" in the
daemontools 0.76 distribution.
The submitted code, in a
.tar.gz file. Actually, the tar file I
submitted contained an executable program, but this tar file
does not. You'll have to build it yourself.
The "fixed" code, which includes a bug fix and some additional enhancements that didn't make the deadline.
To ensure that
press always produces
some valid output no longer than the input, I took
flexscanner makes an exact copy of the input. This is stored in a buffer called
safe_output. After the entire input has been read,
pressallocates a second buffer,
new_output, of the same size as
- As the optimizer builds a new output stream, it
stores it in
new_output. If at any point it is about to append bytes to
new_outputthat would make
safe_output, the optimizer abandons the current optimization level, resets
new_outputto empty, and starts over with the next optimization level. If the optimizer completes a pass and
new_outputis shorter than
safe_output, then it swaps
safe_output. So in the worst case,
safe_outputwill remain a copy of the original input for the entire run of the program.
presssets a timer to go off five seconds before the time limit is about to run out. When the timer goes off,
pressdumps the contents of
safe_outputto standard output and exits.
SIGSEGV, indicating a bug, then it dumps
presstries to allocate memory but fails, then it dumps
presslimits its total memory usage to 250 MB using the
setrlimitsystem call. If the system were to go deeply into swap space, then
presscould end up taking more than five seconds to write its output and run out of time due to excessive paging.
press will virtually always produce
some valid output before the time expires, and the output
will never be longer than the original input.
Aside from the basic whitespace optimizations (which I did
flex parser), I know of two basic strategies
for the task at hand. One, the "rebuild" strategy, is to parse
the document into a representation of its meaning - an array
of decorated characters - and then build a new output document
from scratch by analyzing the array and deciding what tags
to emit. The other, the "transform" strategy, is to perform
transformations on the existing tag structure of the document. I
use the rebuild strategy.
For the rebuild strategy, the document can be viewed as a
list of runs of characters. All of the characters in
each run have the same decorations; adjacent runs may have
completely different decorations. So my
builds two arrays: one contains the document characters (with
whitespace compressed) and the other contains the runs; each
run consists of a context and the number of characters in the
run. The context specifies how the characters in the run should
be decorated; it consists of one byte for the color, one byte
for the size, two bits for the
U level, and one
bit each for
Note that there is no tag for the document's initial size or color. This means (due to the rebuild strategy) that I must consider the initial size to be distinct from all the sizes I can describe with tags; likewise for the initial color. In my internal representation, I simply store the ASCII character representing the size or color of the run, or \000 for the initial size or color. That's why I allocate a whole byte for each of size and color.
My basic optimization strategy is simple. The first run always has no decoration flags turned on, so I simply emit the characters in the first run. (The first run may be empty depending on the input document.) Then, for each run, I consider its context (the "current" context) and the next run's context. I also know what start tags I have emitted for which I have not yet emitted end tags, and what context will result from emitting each of those end tags; this constitutes the "context stack". My goal is to get into the next context as cheaply as possible. So I consider all the ways I can get to that context by emitting start and end tags, emit the cheapest combination, and update the context stack to reflect those tags.
In deciding what combination of tags will be cheapest, I have a lookahead limit; initially the limit is one and I double it on each optimization pass until I run out of time or memory. To evaluate each tag combination, I consider its combined cost. The cost of a tag combination is the length of the start tags to emit, plus the length of the end tags I will eventually have to emit to match those start tags. Examples:
||7: 3 for "
||0: end tags cost nothing to emit; I considered their cost when I emitted their corresponding start tags.|
If my lookahead limit is 1, then the combined cost of those tags is the total cost. If my lookahead limit is 2, then I consider the tag combinations I will have to emit to get from the next context to the context after that, and add the cheapest of those to the cost of the tag combination under consideration. If my lookahead limit is 3, then I also account for the tag combination after that. In other words, the lookahead limit means exactly what you think it means.
Of course, I need an algorithm for enumerating the tag combinations that I wish to evaluate for emission. Here is the algorithm:
If the current context does not have the initial size or color, but the next context does, then all tag combinations must begin with sufficient close tags (as dictated by the context stack) to reach the initial size and/or color as necessary.
Then, if the current context has any decoration flags set
that are not set in the next context, then I must emit
PL start tag, or sufficient end tags
(as dictated by the context stack) to turn all of those flags
off. I consider both possibilities.
Then I can consider emitting any number of additional end tags, up to the number on the context stack. I call these "optional pops". I try each of those possibilities, unless I detect one of two situations:
- If I just emitted a
PLstart tag, I don't allow any optional pops.
- If n optional pops would leave me in a
context where additional pops (or a
PLtag) are required to reach the next context, then I skip n optional pops and go on to n+1. For example, if the context stack says "
<B><I><PL><U>", and I want to be in a context of
B, then I do not consider emitting "
</U></PL>" because more pops (or a
PLtag) would be necessary to remove the
Then I determine what start tags I must emit to reach the next context, given all of the above pushes and pops. If the lookahead limit is 1, then I just count the cost of emitting them in a hardcoded order. Otherwise, if the lookahead limit is larger than 1, then I consider each order in which I can emit the start tags, and determining the minimum cost after doing so of reaching the context after the next context, and so on until the lookahead limit is reached.
Once I have considered all of the tag combinations and determined the cheapest, I emit it and move to the next state. I don't save any of the lookahead work I did when starting on the next state.
I need memory dynamically to store the tag combinations I'm considering, and the resulting context stacks. For complex documents or large lookaheads, this will be a large amount of memory.
I store the tag combinations in dynamically-sized strings which I allocate and free as I go.
For the context stacks, I use a special-purpose allocator/garbage collector. The context stack is stored as a linked list, and context stacks for tag combinations under consideration share tails. The allocator allocates a large number of stack elements in advance and doles them about very rapidly (essentially by incrementing a pointer). I call the garbage collector after I finish computing the tag combination to emit for the next run. I give the garbage collector a pointer to the top of the new context stack. The GC relocates all the of elements in the stack to the beginning of the arena, marks the rest of the arena as available, and returns the new pointer to the relocated top of the context stack. This makes allocation O(1) and garbage collection O(N), where N is the depth of the context stack.
The version of
press that I submitted has a
specific algorithmic error and various other flaws.
The algorithmic error is that it abandons the lookahead
process if it needs to emit only a single tag (other than
PL) to get to the next state, and in
those cases that lookahead path is considered to have zero cost.
This means that
press will produce suboptimal
output in these cases. The fix is simple: remove "
ntags > 1" from line 100 of
The "fixed" code above includes this fix.
One flaw is that
press uses a very large amount
of memory because it garbage collects so rarely. It can be made
to garbage collect much more often.
press also formats tag combinations into
strings unnecessarily. When
press is looking ahead
past the tag combination needed to get to the next state, it
does not actually need to generate the string representation of
the lookahead tag combinations, but it does anyway. This
wastes additional memory.
In its submitted state, it quickly grows to its 250 MB limit in only a few levels of lookahead and then exits. With smarter garbage collection and suppression of tag generation during lookahead, it stays under 200K even for very complex inputs; it becomes entirely CPU-bound. The "fixed" code above includes these changes.
Another flaw is in its permutation generator. If
press needs to push
<I>, and two
it considers 4! = 24 orderings for those tags, but there are
only 4!/2! = 12 distinguishable orderings because the two
U tags are identical. If
to generate three
U tags, the waste is even greater
(a factor of six instead of two).
I have thought of various ways to reduce the number of lookahead tag combinations to consider that I did not have time to implement. The "fixed" code above contains some of these.
dumpdecorated program uses the same
flex parser as
press but simply
dumps the decorated characters to standard output. The
try shell script runs
press on a file
and then uses
testpermute program calls the permutation
generator and writes the results to standard output. The
runtestpermute shell script runs
testpermute and validates the output.
maketest program takes two arguments,
depth. It generates a
random but valid SML/NG output stream of approximately
size bytes with a maximum tag nesting depth of
The first line of
conf-cc contains the name of
the compiler and the compiler flags to use to generate a
.o file from a
.c file. The remaining
lines are not used.
The first line of
conf-link contains the name of
the linker and the linker flags to use to generate an
executable file from
.o files. The remaining
lines are not used.
The first line of
conf-flex contains the name
flex program and the
options to use to generate a
.c file from a
.l file. The remaining lines are not used.
make-makefile script generates
Makefile from the
(each of which contains a list of
Makefile.tail. You must re-run
make-makefile if you add or remove a
*=x file or change the
used by any
.l file. You do
not need to re-run
make-makefile after modifying
conf-flex, an existing
*=x file, or