Homework 6: Compress
Due Tuesday, August 1, 2023 at 11:59pm
In this assignment, you will write a C program that compresses (and extracts) student records stored in plain-text format to binary format.
- Compile and test frequently
- Read the entire assignment first before you start
- Start early and do not do all of the assignment in one sitting; coding is fun but fighting for hours with broken code is not
- Do not hesitate to seek help if you are stuck
Synopsis
In this homework, you will work in the hw6
directory.
students.{h,c}
: This function implements reading and writing student
records from the plain-text format for you. You should read students.h
thoroughly before starting writing your code.
compress.c
: The entry point to your program. The main
function is
implemented for you to handle command-line parsing.
Written: You will answer some simple questions at the end.
Learning Objectives:
- Bits
- Writing and debugging C programs from scratch.
Getting started
See homework 3.
Specification
You will write a program called compress
, which converts between the
plain-text format and the binary format of student records.
./compress -c file
should read from a file called file
containing plain-text
student records and outputs the binary records to the standard output.
./compress -x file
should read from a file called file
containing binary
records and outputs the plain-text records to the standout output.
In either cases, if file
is -
, the program should read from the standard
input instead.
If file
cannot be opened or the program is called with incorrect number of
arguments, the program should terminate after printing an error message.
Student record
A student record contains the following six fields (all ranges below are inclusive on both ends):
- UCID: an unsigned number from
10000000
to19999999
- Late days: an unsigned number from 0 to 3
- Point adjustments: a signed number from -5 to 5
- Grade: one of
A
,B
,C
,D
, andF
- Major: one of the 64 predefined majors
- Hometown: a string of at most 17 characters
Plain-text Format
When the student record is stored in plain-text format, the file consists of a
sequence of lines, each of which contains one record.
Within a line, the fields are written in the above order, separated by a \t
character.
As an example, the following file contains one student record with
- UCID: 14247517
- Late days: 0
- Point adjustment: -5
- Grade: D
- Major: Clinical and Translational Science
- Hometown: Moreno Valley
Binary Format
Each student record is represented by 22 bytes (176 bits), containing a 5-byte (40-bit) bit-packed word ("the word") and 17-byte character array ("the array").
The bit-packed word has the following bit fields:
39 15 13 9 6 0
+-+------------------------+--+----+---+------+
|0| ucid | | | |major |
+-+------------------------+--+----+---+------+
^ ^ ^--- grade
| +-point
+-late
Field | Offset | Width | Encoding |
---|---|---|---|
Major | 0 | 6 | An 6-bit unsigned number corresponding to enum assignment |
Grade | 6 | 3 | A 3-bit number where 'A' -'D' is 0-3 and 'F' is 4 |
Point adjustment | 9 | 4 | A 4-bit signed number |
Late days | 13 | 2 | A 2-bit unsigned number |
UCID | 15 | 24 | A 24-bit unsigned storing the last 7 digits in UCID (the first one is always one; range: 0-9999999) |
Placeholder | 39 | 1 | This bit is always 0 |
The binary file begins with a header spelled CS143 student binary format
followed by a newline \n
character. Then, each 22-byte record is written
contiguously (without spaces or newlines). In each record, the bit-packed word
is written in big-endian order, and then the 17-byte character array is
written in its index order.
Byte # | Content |
---|---|
0 | Bit 32 to bit 39 of the word |
1 | Bit 24 to bit 31 of the word |
2 | Bit 16 to bit 23 of the word |
3 | Bit 8 to bit 15 of the word |
4 | Bit 0 to bit 7 of the word |
5 | array[0] |
6 | array[1] |
7 | array[2] |
... | ... |
21 | array[16] |
NOTE The hometown name is stored in a fixed-size 17-byte character array. If
the name has fewer than 17 characters, 0 ('\0'
) is written to the unused space
of the array; if the name has 17 characters, all characters are written without
additional 0.
Testing
There are some reference files in tests
. Files with extension .txt
are
plain-text files, while files with extension .bin
are corresponding binary
files.
./compress -c tiny.txt
should yield tiny.bin
and ./compress -x tiny.bin
should yield tiny.txt
.
Furthermore, ./compress -c tiny.txt | ./compress -x -
should be identical to
outputting tiny.txt
itself. Therefore, you can perform round-trip test by
./compress -c tiny.txt | ./compress -x - | diff - tiny.txt
. You should see no
output from diff
.
Written
You need to answer some questions in hw6/WRITTEN.txt
.
Tips and Pitfalls
- In C, integer literals (e.g.
42
) are presumed to be signed, 4-byte integer. Thus,42 >> 1
will perform signed right shift. Writing42u
tells the compiler42
should be interpret as an unsigned integer. - Beware of signed integer promotion. When performing operations on integers of different sizes, C "promotes" shorter integers to match the size of the longer one. When C promotes a signed number, it extends the first bit. Sometimes, this is not what you want.
- When the result of a compound expression (e.g.
a >> 3 << 4 >> (5 & 2)
) is not what you have expected, break it apart and print it out step by step. Sometimes, a number can be turned to a signed number, or worse bits of a number can be discarded, behind your back. - To print a 64-bit number in hex:
printf("%016llx", number);
.016
means that we want a 16-digit number with 0 padded at the front.ll
means it is along long
(64-bit) number.x
means we want to see it in hex. - To write a byte to or to read a byte from a file is the same as writing and
reading a byte. Use
fputc(byte, file)
andbyte = fgetc(file)
. feof(file)
will not return true until you have read the EOF character. WhenEOF
is encountered,fgetc
will return an integerEOF
. Remember to check forEOF
in your extract function.- To see a file in hex, use
xxd file
.
Submission checklist
In hw6/
:
make
produce no errors and produce an executable calledcompress
.
All changes are committed and pushed to your github repository. Submit your program to Gradescope by selecting your coursework directory and the correct branch.
Grading
Percentage | |
---|---|
Correctness | 70% |
Style | 20% |
Written | 10% |
Warning: If your program cannot be compiled using the commands above without error or warning, you will receive 0 points in correctness since there is no executables for us to run.