Skip to content

Homework 6: Compress

Due Tuesday, August 1, 2023 at 11:59pm

In this assignment, you will write a C program that compresses (and extracts) student records stored in plain-text format to binary format.

  • Compile and test frequently
  • Read the entire assignment first before you start
  • Start early and do not do all of the assignment in one sitting; coding is fun but fighting for hours with broken code is not
  • Do not hesitate to seek help if you are stuck

Synopsis

In this homework, you will work in the hw6 directory.

students.{h,c}: This function implements reading and writing student records from the plain-text format for you. You should read students.h thoroughly before starting writing your code.

compress.c: The entry point to your program. The main function is implemented for you to handle command-line parsing.

Written: You will answer some simple questions at the end.

Learning Objectives:

  • Bits
  • Writing and debugging C programs from scratch.

Getting started

See homework 3.

Specification

You will write a program called compress, which converts between the plain-text format and the binary format of student records.

./compress -c file should read from a file called file containing plain-text student records and outputs the binary records to the standard output.

./compress -x file should read from a file called file containing binary records and outputs the plain-text records to the standout output.

In either cases, if file is -, the program should read from the standard input instead.

If file cannot be opened or the program is called with incorrect number of arguments, the program should terminate after printing an error message.

Student record

A student record contains the following six fields (all ranges below are inclusive on both ends):

  • UCID: an unsigned number from 10000000 to 19999999
  • Late days: an unsigned number from 0 to 3
  • Point adjustments: a signed number from -5 to 5
  • Grade: one of A, B, C, D, and F
  • Major: one of the 64 predefined majors
  • Hometown: a string of at most 17 characters

Plain-text Format

When the student record is stored in plain-text format, the file consists of a sequence of lines, each of which contains one record. Within a line, the fields are written in the above order, separated by a \t character.

As an example, the following file contains one student record with

  • UCID: 14247517
  • Late days: 0
  • Point adjustment: -5
  • Grade: D
  • Major: Clinical and Translational Science
  • Hometown: Moreno Valley
14247517    0   -5  D   Clinical and Translational Science  Moreno Valley

Binary Format

Each student record is represented by 22 bytes (176 bits), containing a 5-byte (40-bit) bit-packed word ("the word") and 17-byte character array ("the array").

The bit-packed word has the following bit fields:

 39                       15 13    9   6      0
+-+------------------------+--+----+---+------+
|0| ucid                   |  |    |   |major |
+-+------------------------+--+----+---+------+
                            ^  ^    ^--- grade
                            |  +-point
                            +-late

Field Offset Width Encoding
Major 0 6 An 6-bit unsigned number corresponding to enum assignment
Grade 6 3 A 3-bit number where 'A'-'D' is 0-3 and 'F' is 4
Point adjustment 9 4 A 4-bit signed number
Late days 13 2 A 2-bit unsigned number
UCID 15 24 A 24-bit unsigned storing the last 7 digits in UCID (the first one is always one; range: 0-9999999)
Placeholder 39 1 This bit is always 0

The binary file begins with a header spelled CS143 student binary format followed by a newline \n character. Then, each 22-byte record is written contiguously (without spaces or newlines). In each record, the bit-packed word is written in big-endian order, and then the 17-byte character array is written in its index order.

Byte # Content
0 Bit 32 to bit 39 of the word
1 Bit 24 to bit 31 of the word
2 Bit 16 to bit 23 of the word
3 Bit 8 to bit 15 of the word
4 Bit 0 to bit 7 of the word
5 array[0]
6 array[1]
7 array[2]
... ...
21 array[16]

NOTE The hometown name is stored in a fixed-size 17-byte character array. If the name has fewer than 17 characters, 0 ('\0') is written to the unused space of the array; if the name has 17 characters, all characters are written without additional 0.

Testing

There are some reference files in tests. Files with extension .txt are plain-text files, while files with extension .bin are corresponding binary files. ./compress -c tiny.txt should yield tiny.bin and ./compress -x tiny.bin should yield tiny.txt.

Furthermore, ./compress -c tiny.txt | ./compress -x - should be identical to outputting tiny.txt itself. Therefore, you can perform round-trip test by ./compress -c tiny.txt | ./compress -x - | diff - tiny.txt. You should see no output from diff.

Written

You need to answer some questions in hw6/WRITTEN.txt.

Tips and Pitfalls

  • In C, integer literals (e.g. 42) are presumed to be signed, 4-byte integer. Thus, 42 >> 1 will perform signed right shift. Writing 42u tells the compiler 42 should be interpret as an unsigned integer.
  • Beware of signed integer promotion. When performing operations on integers of different sizes, C "promotes" shorter integers to match the size of the longer one. When C promotes a signed number, it extends the first bit. Sometimes, this is not what you want.
  • When the result of a compound expression (e.g. a >> 3 << 4 >> (5 & 2)) is not what you have expected, break it apart and print it out step by step. Sometimes, a number can be turned to a signed number, or worse bits of a number can be discarded, behind your back.
  • To print a 64-bit number in hex: printf("%016llx", number);. 016 means that we want a 16-digit number with 0 padded at the front. ll means it is a long long (64-bit) number. x means we want to see it in hex.
  • To write a byte to or to read a byte from a file is the same as writing and reading a byte. Use fputc(byte, file) and byte = fgetc(file).
  • feof(file) will not return true until you have read the EOF character. When EOF is encountered, fgetc will return an integer EOF. Remember to check for EOF in your extract function.
  • To see a file in hex, use xxd file.

Submission checklist

In hw6/:

  • make produce no errors and produce an executable called compress.

All changes are committed and pushed to your github repository. Submit your program to Gradescope by selecting your coursework directory and the correct branch.

Grading

Percentage
Correctness 70%
Style 20%
Written 10%

Warning: If your program cannot be compiled using the commands above without error or warning, you will receive 0 points in correctness since there is no executables for us to run.