62.306 Systems Programming

UUencode

Uuencoding transforms binary data into a text-based form suitable for delivery through a text-only mailer. The uu in uuencode comes from the phrase UNIX-to-UNIX. Prior to the general availability of the Internet, UNIX computers would store and forward mail through telephone connections. Many of these connections would support only 7-bit ASCII. Uuencoding was a way for users to send binary files through this early mail system. Even though mailers can now transport binary attachments, they still do so by using a binary-to-text encoding (Base64), as specified by the MIME standard.

Using uuencode

Uuencode is a filter; it reads from standard input and writes to standard output. For example, to uuencode the file test.bin into test.bin.uue, use the command

    uuencode test.bin < test.bin > test.bin.uue

Why does test.bin appear both as a command-line argument and as redirected input? Because the argument is used to create a header line within the uuencoded file, naming the file to be recreated by the uudecode command. Also, the mode (protection bits) of test.bin is incorporated into the header so that it can be reproduced at the destination. Clearly, the files specified by the command-line argument and the input redirection need not be the same. Regardless of whether the file specified by the command-line argument exists, the mode in the header line is taken from the mode of standard input.
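The point about the mode coming from standard input rather than from the named file can be sketched in a few lines of Python. This is only an illustration, not the source of any real uuencode; the function name header_line and its fd parameter are invented here, and the sketch assumes a POSIX system.

```python
import os
import stat

def header_line(name: str, fd: int = 0) -> str:
    """Build a 'begin <mode> <name>' line.

    The mode is read from the open file descriptor fd (0 is standard
    input), never from the file called name, which need not even exist.
    """
    mode = stat.S_IMODE(os.fstat(fd).st_mode) & 0o777  # keep rwx bits only
    return f"begin {mode:o} {name}"
```

Called with the default fd, the function reports the mode of whatever file the shell redirected into the program, whatever name is passed.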

Using uudecode

Continuing with the example, after test.bin.uue has been transmitted (through a mailer, via FTP, or even using a simple mv command) the command

    uudecode < test.bin.uue

reproduces test.bin in the current directory with the mode specified by the header.

A mailer will often prepend and append additional lines to mail messages. Most uudecode programs therefore scan the source for the uuencode header line and ignore everything before it. After uudecoding the file, some uudecode programs scan for an additional header line, allowing the user to encode several files into one mail message.
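The scanning step might look like the following Python sketch. The names BEGIN_RE and find_header are invented for illustration, and real uudecode programs differ in how strictly they match the header.

```python
import re

# Hypothetical pattern: keyword, octal mode, then the file name.
BEGIN_RE = re.compile(r"^begin\s+([0-7]{3,4})\s+(\S.*)$", re.IGNORECASE)

def find_header(lines, start=0):
    """Return (index, mode, name) for the first header line at or after
    start, or None if there is none. Lines added by a mailer are skipped."""
    for i in range(start, len(lines)):
        match = BEGIN_RE.match(lines[i])
        if match:
            return i, int(match.group(1), 8), match.group(2)
    return None
```

A decoder that supports several files per message would call find_header again, starting just past the end line of the file it has finished.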

Format of a uuencoded file

A uuencoded file contains the following sections, in order:

    begin <mode> <file-name>
    <full-data>
    <residual-data>
    <zero-length data line>
    end

Each of these sections consists of exactly one line, except for the full-data section, which is zero or more lines, and the residual-data section, which is zero or one line. The keywords begin and end are case insensitive. <mode> is an octal number, with each octal digit representing the 3-bit protection code for user, group, and other respectively, as used by chmod. Lines are delimited by the line-end convention of the file system (that is, <cr><nl> for DOS, <nl> for UNIX, binary line-length prefix for VMS, etc.).
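As a small illustration of how the octal digits of <mode> map onto protection bits, the Python sketch below expands a mode string into the familiar rwx form. The function name describe_mode is invented for this example.

```python
def describe_mode(mode_text: str) -> str:
    """Expand an octal mode such as '644' into 'rw-r--r--'."""
    bits = int(mode_text, 8)
    parts = []
    for shift in (6, 3, 0):               # user, group, other
        triple = (bits >> shift) & 0o7    # one octal digit = 3 bits
        parts.append(("r" if triple & 4 else "-")
                     + ("w" if triple & 2 else "-")
                     + ("x" if triple & 1 else "-"))
    return "".join(parts)
```

For example, describe_mode("644") gives rw-r--r-- and describe_mode("755") gives rwxr-xr-x.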

Format of uuencoded data

Each line of uuencoded data consists of an origin-32 line length followed by 4-character tuples, each containing a modified origin-32 representation of four numbers in the range 0 to 63. Each 4-tuple represents 3 bytes of binary data. The line length is the number of bytes of binary data contained in the line, not the number of 4-tuples. This anomaly shows up only on the residual-data line; the full-data lines contain 60 encoded characters, representing 45 bytes of data. Thus the length of a full-data line is always 60 * 3 / 4 = 45, transmitted as 45 + 32 = 77.

(Actually, since each line contains its own length, the format shown above is a simplification showing the usual format of uuencoded files.)
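The arithmetic can be checked with a short Python sketch. It assumes the usual 45 bytes per full-data line; the function name line_layout is invented here.

```python
def line_layout(nbytes: int, per_line: int = 45):
    """Return (full_lines, residual_bytes, residual_line_chars).

    residual_line_chars counts the length character plus the 4-character
    tuples on the residual-data line; it is 0 when there is no such line.
    """
    full, residual = divmod(nbytes, per_line)
    tuples = (residual + 2) // 3          # round up to whole 3-byte groups
    chars = 1 + 4 * tuples if residual else 0
    return full, residual, chars
```

For a 217-byte input (4 * 45 + 37) this gives four full-data lines and a residual-data line that carries 37 bytes in 13 tuples, 53 characters in all.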

Origin-32 is used because 32 (the space character) is the first "printable" character in the ASCII collating sequence. The line length in a full-data line shows as the character 'M' because ord('M') = 45 + ord(' '). In origin-32, the binary value zero would be encoded as 32 (the space character). Since some mailers truncate trailing spaces and others replace internal spaces by tabs and spaces, data would be lost. Modified origin-32 uses 32+64 = 96 (the "`" character) to represent zero.
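The character mapping is small enough to write out in Python. This is a sketch with invented names (enc_char, dec_char); masking with 0x3F on the decoding side is one common way to accept both the space and the "`" form of zero.

```python
def enc_char(value: int) -> str:
    """Map a 6-bit value (0..63) to its modified origin-32 character."""
    if not 0 <= value <= 63:
        raise ValueError("value must fit in 6 bits")
    return "`" if value == 0 else chr(value + 32)

def dec_char(ch: str) -> int:
    """Map a character back to 0..63; ' ' and '`' both decode to 0."""
    return (ord(ch) - 32) & 0x3F
```

With these definitions enc_char(45) is 'M', the length character of a full-data line.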

3-to-4 encoding

It takes 8 bits to represent 256 values, 6 bits to represent 64 values. There are 24 bits in 3 bytes, requiring four 6-bit values. For example, the 3-byte value 0x3C4854 is represented in binary as 0011 1100 0100 1000 0101 0100. Taken as four 6-bit groups, this is 001111 000100 100001 010100, or in decimal, 15, 4, 33, 20. In origin-32, these four decimal numbers become 47, 36, 65, 52, which are the characters /$A4.
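The same regrouping, written as a Python sketch (the function name split_3_to_4 is invented here):

```python
def split_3_to_4(b0: int, b1: int, b2: int) -> list:
    """Regroup three 8-bit bytes into four 6-bit values, high bits first."""
    word = (b0 << 16) | (b1 << 8) | b2            # 24 bits in one integer
    return [(word >> shift) & 0x3F for shift in (18, 12, 6, 0)]

values = split_3_to_4(0x3C, 0x48, 0x54)
chars = "".join(chr(v + 32) for v in values)      # plain origin-32, no zeros here
```

Here values is [15, 4, 33, 20] and chars is "/$A4", matching the worked example.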

If there are fewer than 3 bytes left to encode, as will happen two times out of three in a file of arbitrary length, one or two bytes of zeros are supplied and 3-to-4 encoding takes place as usual. However, the transmitted length for the data line counts just the residual data bytes; the extra zero bytes supplied at encoding time are deleted at decoding time.
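Putting the pieces together, a data line could be encoded roughly as in the Python sketch below. The names are invented for illustration; the sketch pads with zero bytes as described above and uses "`" for zero.

```python
def enc_char(value: int) -> str:
    """Modified origin-32: zero becomes '`', everything else value + 32."""
    return "`" if value == 0 else chr(value + 32)

def encode_line(chunk: bytes) -> str:
    """Encode 1..45 data bytes as one uuencoded line (no line end)."""
    if not 0 < len(chunk) <= 45:
        raise ValueError("a data line carries 1 to 45 bytes")
    padded = chunk + b"\0" * (-len(chunk) % 3)    # pad to a multiple of 3
    out = [enc_char(len(chunk))]                  # length counts real bytes only
    for i in range(0, len(padded), 3):
        word = int.from_bytes(padded[i:i + 3], "big")
        out.extend(enc_char((word >> s) & 0x3F) for s in (18, 12, 6, 0))
    return "".join(out)
```

For instance, encode_line(b"<HT") yields #/$A4, while the two-byte input b"<H" yields "/$@` : a length character for 2 followed by one full 4-tuple whose last character encodes padding.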

An example

The example uses text rather than binary because it's easier to show here. The 6-line file x.html contains

<HTML>
<HEAD>
   <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
   <META NAME="GENERATOR" CONTENT="Mozilla/4.02 [en] (WinNT; I) [Netscape]">
   <TITLE>Process Synchronization</TITLE>
</HEAD>

When uuencoded, this becomes

begin 644 x.html
M/$A434P^"CQ(14%$/@H@(" \345402!(5%10+45154E6/2)#;VYT96YT+51Y
M<&4B($-/3E1%3E0](G1E>'0O:'1M;#L@8VAA<G-E=#UI<V\M.#@U.2TQ(CX*
M(" @/$U%5$$@3D%-13TB1T5.15)!5$]2(B!#3TY414Y4/2)-;WII;&QA+S0N
M,#(@6V5N72 H5VEN3E0[($DI(%M.971S8V%P95TB/@H@(" \5$E43$4^4')O
E8V5S<R!3>6YC:')O;FEZ871I;VX\+U1)5$Q%/@H\+TA%040^"DE4
 
end
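As a check on the example, the following Python sketch decodes its first and last data lines. The function name decode_line is invented here, and the decoder is deliberately lenient: it masks each character to 6 bits, so it accepts both space and "`" for zero (the first line above contains a space inside a tuple).

```python
def decode_line(line: str) -> bytes:
    """Decode one uuencoded data line, dropping any padding bytes."""
    count = (ord(line[0]) - 32) & 0x3F            # number of real data bytes
    values = [(ord(c) - 32) & 0x3F for c in line[1:]]
    out = bytearray()
    for i in range(0, len(values) - 3, 4):        # one 4-tuple at a time
        word = ((values[i] << 18) | (values[i + 1] << 12)
                | (values[i + 2] << 6) | values[i + 3])
        out += word.to_bytes(3, "big")
    return bytes(out[:count])                     # padding is cut off here

first = decode_line(r'M/$A434P^"CQ(14%$/@H@(" \345402!(5%10+45154E6/2)#;VYT96YT+51Y')
last = decode_line(r'''E8V5S<R!3>6YC:')O;FEZ871I;VX\+U1)5$Q%/@H\+TA%040^"DE4''')
```

The first line decodes to the 45 bytes that open the file, ending in Content-Ty, and the last line to the final 37 bytes, ending with </HEAD> and a newline. In this particular output the two padding bytes of the last tuple happen not to be zero; since the decoder keeps only the first 37 bytes, their value does not matter.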