I entered the ICFP 2001
Programming Contest. My team name was "dqd", and I was the
only team member. The program name was "press
",
which is short for "compress". Originally I called it
smlngz
, because z
is the standard
letter meaning "compression", but got tired of typing that.
(Filename completion didn't work because smlngz
collided with sml.h
, one of my header files.)
I am the guy who asked FAQ #30, regarding the output speed. I specifically mentioned a disk file and a 9600 bps connection in my e-mail.
I wrote press
in a combination of
flex
and C. The flex
code parsed the
input and compressed whitespace. The C code did everything
else. The C code used a number of functions marked as "public
domain" in the daemontools
0.76 distribution.
The submitted code, in a
.tar.gz
file. Actually, the tar file I
submitted contained an executable program, but this tar file
does not. You'll have to build it yourself.
The "fixed" code, which includes a bug fix and some additional enhancements that didn't make the deadline.
Safety
To ensure that press
always produces
some valid output no longer than the input, I took
several steps:
- The
flex
scanner makes an exact copy of the input. This is stored in a buffer calledsafe_output
. After the entire input has been read,press
allocates a second buffer,new_output
, of the same size assafe_output
. - As the optimizer builds a new output stream, it
stores it in
new_output
. If at any point it is about to append bytes tonew_output
that would makenew_output
longer thansafe_output
, the optimizer abandons the current optimization level, resetsnew_output
to empty, and starts over with the next optimization level. If the optimizer completes a pass andnew_output
is shorter thansafe_output
, then it swapsnew_output
andsafe_output
. So in the worst case,safe_output
will remain a copy of the original input for the entire run of the program. -
press
sets a timer to go off five seconds before the time limit is about to run out. When the timer goes off,press
dumps the contents ofsafe_output
to standard output and exits. - If
press
receives aSIGSEGV
, indicating a bug, then it dumpssafe_output
and exits. - If
press
tries to allocate memory but fails, then it dumpssafe_output
and exits. -
press
limits its total memory usage to 250 MB using thesetrlimit
system call. If the system were to go deeply into swap space, thenpress
could end up taking more than five seconds to write its output and run out of time due to excessive paging.
Thus, press
will virtually always produce
some valid output before the time expires, and the output
will never be longer than the original input.
Strategy
Aside from the basic whitespace optimizations (which I did
in the flex
parser), I know of two basic strategies
for the task at hand. One, the "rebuild" strategy, is to parse
the document into a representation of its meaning - an array
of decorated characters - and then build a new output document
from scratch by analyzing the array and deciding what tags
to emit. The other, the "transform" strategy, is to perform
transformations on the existing tag structure of the document. I
use the rebuild strategy.
Parsing
For the rebuild strategy, the document can be viewed as a
list of runs of characters. All of the characters in
each run have the same decorations; adjacent runs may have
completely different decorations. So my flex
parser
builds two arrays: one contains the document characters (with
whitespace compressed) and the other contains the runs; each
run consists of a context and the number of characters in the
run. The context specifies how the characters in the run should
be decorated; it consists of one byte for the color, one byte
for the size, two bits for the U
level, and one
bit each for B
, I
, S
,
EM
, and TT
.
Note that there is no tag for the document's initial size or color. This means (due to the rebuild strategy) that I must consider the initial size to be distinct from all the sizes I can describe with tags; likewise for the initial color. In my internal representation, I simply store the ASCII character representing the size or color of the run, or \000 for the initial size or color. That's why I allocate a whole byte for each of size and color.
Optimizing
My basic optimization strategy is simple. The first run always has no decoration flags turned on, so I simply emit the characters in the first run. (The first run may be empty depending on the input document.) Then, for each run, I consider its context (the "current" context) and the next run's context. I also know what start tags I have emitted for which I have not yet emitted end tags, and what context will result from emitting each of those end tags; this constitutes the "context stack". My goal is to get into the next context as cheaply as possible. So I consider all the ways I can get to that context by emitting start and end tags, emit the cheapest combination, and update the context stack to reflect those tags.
In deciding what combination of tags will be cheapest, I have a lookahead limit; initially the limit is one and I double it on each optimization pass until I run out of time or memory. To evaluate each tag combination, I consider its combined cost. The cost of a tag combination is the length of the start tags to emit, plus the length of the end tags I will eventually have to emit to match those start tags. Examples:
Tags | Cost |
---|---|
<I> |
7: 3 for "<I> " and 4 for the
"</I>" that will eventually match it. |
<PL><I> |
16 |
</2></r> |
0: end tags cost nothing to emit; I considered their cost when I emitted their corresponding start tags. |
</2><r> |
7 |
If my lookahead limit is 1, then the combined cost of those tags is the total cost. If my lookahead limit is 2, then I consider the tag combinations I will have to emit to get from the next context to the context after that, and add the cheapest of those to the cost of the tag combination under consideration. If my lookahead limit is 3, then I also account for the tag combination after that. In other words, the lookahead limit means exactly what you think it means.
Tag Combinations
Of course, I need an algorithm for enumerating the tag combinations that I wish to evaluate for emission. Here is the algorithm:
If the current context does not have the initial size or color, but the next context does, then all tag combinations must begin with sufficient close tags (as dictated by the context stack) to reach the initial size and/or color as necessary.
Then, if the current context has any decoration flags set
that are not set in the next context, then I must emit
either a PL
start tag, or sufficient end tags
(as dictated by the context stack) to turn all of those flags
off. I consider both possibilities.
Then I can consider emitting any number of additional end tags, up to the number on the context stack. I call these "optional pops". I try each of those possibilities, unless I detect one of two situations:
- If I just emitted a
PL
start tag, I don't allow any optional pops. - If n optional pops would leave me in a
context where additional pops (or a
PL
tag) are required to reach the next context, then I skip n optional pops and go on to n+1. For example, if the context stack says "<B><I><PL><U>
", and I want to be in a context ofB
, then I do not consider emitting "</U></PL>
" because more pops (or aPL
tag) would be necessary to remove theI
state.
Then I determine what start tags I must emit to reach the next context, given all of the above pushes and pops. If the lookahead limit is 1, then I just count the cost of emitting them in a hardcoded order. Otherwise, if the lookahead limit is larger than 1, then I consider each order in which I can emit the start tags, and determining the minimum cost after doing so of reaching the context after the next context, and so on until the lookahead limit is reached.
Once I have considered all of the tag combinations and determined the cheapest, I emit it and move to the next state. I don't save any of the lookahead work I did when starting on the next state.
Memory Management
I need memory dynamically to store the tag combinations I'm considering, and the resulting context stacks. For complex documents or large lookaheads, this will be a large amount of memory.
I store the tag combinations in dynamically-sized strings which I allocate and free as I go.
For the context stacks, I use a special-purpose allocator/garbage collector. The context stack is stored as a linked list, and context stacks for tag combinations under consideration share tails. The allocator allocates a large number of stack elements in advance and doles them about very rapidly (essentially by incrementing a pointer). I call the garbage collector after I finish computing the tag combination to emit for the next run. I give the garbage collector a pointer to the top of the new context stack. The GC relocates all the of elements in the stack to the beginning of the arena, marks the rest of the arena as available, and returns the new pointer to the relocated top of the context stack. This makes allocation O(1) and garbage collection O(N), where N is the depth of the context stack.
Bugs
The version of press
that I submitted has a
specific algorithmic error and various other flaws.
The algorithmic error is that it abandons the lookahead
process if it needs to emit only a single tag (other than
possibly a PL
) to get to the next state, and in
those cases that lookahead path is considered to have zero cost.
This means that press
will produce suboptimal
output in these cases. The fix is simple: remove "&&
ntags > 1
" from line 100 of optimize.c
.
The "fixed" code above includes this fix.
One flaw is that press
uses a very large amount
of memory because it garbage collects so rarely. It can be made
to garbage collect much more often.
press
also formats tag combinations into
strings unnecessarily. When press
is looking ahead
past the tag combination needed to get to the next state, it
does not actually need to generate the string representation of
the lookahead tag combinations, but it does anyway. This
wastes additional memory.
In its submitted state, it quickly grows to its 250 MB limit in only a few levels of lookahead and then exits. With smarter garbage collection and suppression of tag generation during lookahead, it stays under 200K even for very complex inputs; it becomes entirely CPU-bound. The "fixed" code above includes these changes.
Another flaw is in its permutation generator. If
press
needs to push <B>
,
<I>
, and two <U>
tags,
it considers 4! = 24 orderings for those tags, but there are
only 4!/2! = 12 distinguishable orderings because the two
U
tags are identical. If press
needs
to generate three U
tags, the waste is even greater
(a factor of six instead of two).
I have thought of various ways to reduce the number of lookahead tag combinations to consider that I did not have time to implement. The "fixed" code above contains some of these.
Other Stuff
The dumpdecorated
program uses the same
flex
parser as press
but simply
dumps the decorated characters to standard output. The
try
shell script runs press
on a file
and then uses dumpdecorated
and cmp
to
validate the press
output.
The testpermute
program calls the permutation
generator and writes the results to standard output. The
runtestpermute
shell script runs
testpermute
and validates the output.
The maketest
program takes two arguments,
size
and depth
. It generates a
random but valid SML/NG output stream of approximately
size
bytes with a maximum tag nesting depth of
depth
.
The first line of conf-cc
contains the name of
the compiler and the compiler flags to use to generate a
.o
file from a .c
file. The remaining
lines are not used.
The first line of conf-link
contains the name of
the linker and the linker flags to use to generate an
executable file from .o
files. The remaining
lines are not used.
The first line of conf-flex
contains the name
of the flex
program and the flex
options to use to generate a .c
file from a
.l
file. The remaining lines are not used.
The make-makefile
script generates
Makefile
from the *=x
files
(each of which contains a list of .o
filenames) and Makefile.tail
. You must re-run
make-makefile
if you add or remove a
*=x
file or change the #include
files
used by any .c
or .l
file. You do
not need to re-run make-makefile
after modifying
conf-cc
, conf-link
,
conf-flex
, an existing *=x
file, or
Makefile.tail
.