Benford’s Law¶

Due: Friday, October 6th at 4pm

The goal of this assignment is to give you practice with the basics of Python and to get you to think about how to translate a few simple algorithms into code. You will be allowed to work in pairs on some of the later assignments, but you must work alone on this assignment.

In 1881, Simon Newcomb observed that some pages in his library’s book of log tables were more worn than others. Specifically, the pages for numbers starting with the numeral one were much more worn than the pages corresponding to numbers that started with larger digits. From this observation, he conjectured that numbers that start with one occurred more frequently in the data he and his colleagues used regularly than numbers that start with two, three, etc. Newcomb calculated the probability that the leading significant digit, $d$, in a number from a data set to be:

\[P(d) = log_{10}(1 + 1/d)\]

If we plot this function for one significant digit (that is, $d \in \lbrace 1,2,...,9 \rbrace$), we get the following figure:

Frank Benford, who made the same observation in 1938, studied a large number of different data sets and came up with the same equation. This rule of thumb has become known as Benford’s Law and it has been used to detect fraud in expense reports, vendor payments, elections, macro economic data, etc. Obviously, this rule of thumb does not hold for all data sets, but it does work surprisingly often. Here is a link to an article from the Information Systems Audit and Control Association that describes when it is valid to use Benford’s Law.

In this assignment, you will write code to generate the distribution of leading significant digits predicted by Benford’s Law and to compute the actual Benford distribution of a given data set.

Getting started¶

In the first lab, you learned the basics of how to use git and our git server. We will use git for all the programming assignments and labs in this course.

We have seeded your repository with a directory for this assignment. To pick it up, change to your cmsc12100-aut-17-username directory (where the string username should be replaced with your username) and then run the command: git pull upstream master. You should also run git pull to make sure your local copy of your repository is in sync with the server.

At the first lab, you already ran this command, which pulled the pa1 sub-directory into your 121 directory. However, it is good practice to always run git pull upstream master before you start working, since we may occasionally update files (e.g., if we notice bugs in our code, etc.). For example, some of the files for this assignment may have changed since you downloaded the initial distribution. After you’ve run git pull upstream master, you can proceed as described in the lab: work on your code and then run git add <filename> for each file you change, followed by git commit -m"some message" and git push to upload your changes to the server before you log out.

You should always sync the latest version of your work to the server using the commands described above before you log out, then run git pull before you resume work to retrieve the latest files. This discipline will guarantee that you always have the latest version, no matter which machine you are using. Also, it will be easier for us to help you recover from git and chisubmit problems if you consistently push updates to the server.

Getting help¶

If after carefully reading the details of any part of the assignment, you are still confused about how to get started or make progress:

post a question on Piazza to ask for help,
come to office hours, or
see the CS “Harper Tutors” who are available in North Reading Room of the Arley D. Cathey Learning Center on Sunday through Thursday evenings from 7pm to 11pm.

Before you post a question on Piazza, please check to see if someone else has already asked the same question. We especially encourage you to check the “Must read posts for PA #1” post, which we will update over time to be a compendium of important questions and answers. Also, please add, commit, and push the most recent version of your code to the server (as described above). Syncing your code will allow us to look at it, which may speed up the process of helping you.

Style¶

We will deduct points for poor style. Please see the Python Style Guide for CS121 for style guidelines.

For this assignment, you may assume that the input passed to your functions has the correct format. You may not change any of the input that is passed to your functions. In general, it is bad style to repurpose a data structure passed as input to a function, unless that is the explicit purpose of the function.

Your tasks¶

For this assignment, we will specify a set of specific functions that you must implement. You will start with basic functions and work your way up to more complex tasks. We will also supply extensive test code. Over the course of the term, we will provide less and less guidance on the appropriate structure for your code.

All the functions you need to complete can be found in the file benford.py.

Task 1: Extract amount¶

Your first task is to complete the function extract_amount. This function should take two strings, a currency symbol (e.g. "$" or "C$") and an amount in that currency as arguments and return a float.

Your code will need to:

determine the length of the currency symbol,
use that information to extract the portion of the string (e.g., "C$2.34") that represents the amount ("2.34"),
convert it to a floating point value (2.34), and finally
return the computed value.

To try out your function, fire up ipython3 from the Linux command-line, import your code, and call the function with sample arguments. Before we show you some examples, let’s set up autoreload and import your code:

$ ipython3

In [1]: %load_ext autoreload

In [2]: %autoreload 2

In [3]: import benford

(Note: we will use $ to signal the Linux command-line prompt and In [<number>] to represent the ipython3 prompt. Your prompts may look different. Do not type either prompt when issuing commands.)

The commands %load_ext autoreload and %autoreload 2 tell ipython3 to reload your code automatically whenever it changes. We encourage you to use this package whenever you are developing and testing code.

Here are some sample calls to the function:

In [4]: benford.extract_amount("$", "$2.34")
Out[4]: 2.34

In [5]: benford.extract_amount("C$", "C$2.34")
Out[5]: 2.34

In [6]: benford.extract_amount("\u00A3", "\u00A32.34")
Out[6]: 2.34

In [7]: benford.extract_amount("\u20AC", "\u20AC2.34")
Out[7]: 2.34

In [7]: benford.extract_amount("", "2.34")
Out[7]: 2.34

The third and fourth examples use unicode encodings to describe the symbols for British Pounds (￡) and Euros (€) respectively. When you work with strings, an encoding like \u00A3 is treated as a single character. For example, the length of the string "\u00A3" is one. The last example uses the empty string as the currency symbol, that is, it illustrates the case where the currency symbol is omitted.

It is common in Python to write helper functions that are only a few lines long. extract_amount is an example of such a function.

Testing for Task 1

The file test_benford.py contains test code for the tasks in this assignment. The names of the tests for a given function share a common pattern: a prefix that includes the function name followed by the test number. For example, the second test for extract_amount is named test_extract_amount_2.

The table below provides information about the tests for extract_amount. Each row contains a test number, the values that will be passed for the currency_symbol and amount_str arguments for that test, and the expected result.

Tests for `extract_amount`¶
Test number	Currency Symbol	Amount String	Expected result
1	`"$"`	`"$2.34"`	2.34
2	`"C$"`	`"C$2.34"`	2.34
3	`"\u00A3"`	`"\u00A32.34"`	2.34
4	`""`	`"2.34"`	2.34

We will be using the pytest Python testing framework for this and subsequent assignments. To run our tests, you will use the py.test command from the Linux command line (not from within ipython3). We recommend opening a new terminal window for running this command, which will allow you to go back and forth easily between testing code by hand in ipython3 and running the test suite using py.test.

Pytest, which is available on both the lab machines and your VM, has a lots of options. We’ll use three of them: -v, which means run in verbose mode, -x, which means that pytest should stop running tests after a single test failure, and -k, which allows you to describe a subset of the test functions to run. You can see the rest of the options by running the command py.test -h.

For example, running the following command from the Linux command-line:

$ py.test -v -x -k test_extract_amount test_benford.py

will run all the functions in test_benford.py that have names that start with test_extract_amount. (Recall that the $ represents the prompt and is not included in the command.)

Here is (slightly-modified) output from using this command to test our reference implementation of extract_amount:

$ py.test -v -x -k test_extract_amount test_benford.py
collected 30 items

test_benford.py::test_extract_amount_1 PASSED
test_benford.py::test_extract_amount_2 PASSED
test_benford.py::test_extract_amount_3 PASSED
test_benford.py::test_extract_amount_4 PASSED

==== 25 tests deselected by '-ktest_extract_amount' ====
==== 4 passed, 25 deselected, 1 pytest-warnings in 1.03 seconds ====

This output shows that out code passed all four tests in the test_extract_amount suite. It also shows that there were 25 tests that were deselected (that is, were not run) because they did not match the test selection criteria specified by the argument to -k.

If you fail a test, pytest will tell you the name of the test function that failed and the line in the test code at which the failure was detected. This information can be very helpful in determining what is wrong with your program. Read it carefully!

By default, pytest will run any function that starts with test_. You can limit the tests that get run using the -k option along with a string that identifies the desired tests. The string is not required to be a prefix. For example, if you specify -k amount, pytest will run all tests that start with test_ and include the word amount.

Also, by default, if you do not supply the name of a specific test file, pytest will look in the current directory tree for Python files that have names that start with test_.

In subsequent examples, we will leave out the name of the file with the test code (test_benford.py) and we will use short substrings to describe the desired tests.

Hint

Remember to save any changes you make to your code in your editor as you are debugging. Skipping this step is a common error. Fortunately, we’ve eliminated another common error— forgetting to reload code after it changes —by using the autoreload package.

Task 2: Extracting leading digits¶

Your second task is to complete the function extract_leading_digits. This function should take an amount as a float and a number of digits and return an integer that contains the specified number of leading digits from the amount.

You should use the following formula:

\[trunc(10^{(-\lfloor log_{10}\,{amount\_as\_float} \rfloor + {num\_digits} - 1)} * {amount\_as\_float})\]

to calculate leading digits as an integer. The formula uses several operations from the math library that you may not have seen:

Operation Description math library function Example call Example result

\(trunc(x)\)

throws away the fractional part of $x$

math.trunc

math.trunc(200.97)

math.trunc(-200.97)

200

-200

\(\lfloor x \rfloor\)

computes largest integer less than or equal to $x$

math.floor

math.floor(200.97)

math.floor(-200.97)

200

-201

\(log_{10} x\)

computes the base-10 logarithm of $x$

math.log10

math.log10(2.34)

math.log10(23.4)

0.3692158574101428

1.3692158574101427

See the Python math library documentation for more details. Please note that our test code will assume you use math.log10(x), rather than math.log(x, 10) because it is likely to yield a more accurate answer.

The remaining operations–addition (+), multiplication (*), and exponentiation (**)–needed for the formula are built-in.

Here are some examples of using this function:

In [8]: benford.extract_leading_digits(2.34, 1)
Out[8]: 2

In [9]: benford.extract_leading_digits(2.34, 2)
Out[9]: 23

In [10]: benford.extract_leading_digits(0.002034, 2)
Out[10]: 20

In [11]: benford.extract_leading_digits(2.01, 3)
Out[11]: 200

The results for the last two calls might seem surprising at first. The result of the third call (benford.extract_leading_digits(0.002034, 2)) is merely definitional: the correct result is 20, because zero is not a valid value for the first significant digit. Zero is a valid value for all subsequent significant digits.

The result of the fourth call (benford.extract_leading_digits(2.01, 3)), which does not match the definition, occurs due to floating point error. When the amount is 2.01 and the number of digits is 3, the result of the multiplication in the formula above is 200.99999999999997. Truncating the fractional part of this value yields 200. This issue may be your first encounter with floating point error, but it certainly won’t be your last! Should you have a need to do serious numerical calculations in the future, you will need to learn about the origin of floating point errors and how they propagate.

Testing for Task 2

The table below provides information about the tests for extract_leading_digits. Each row contains a test number, the values that will be passed for the amount and num_digits arguments for that test, and the expected result.

Tests for `extract_leading_digits`¶
Test number	Amount	Number of leading digits	Expected result
1	2.34	1	2
2	2.34	2	23
3	2.34	3	234
4	0.002034	2	20
5	2.01	3	200
6	1000	1	1

The first three tests check for basic functionality. The next two check for the corner (exceptional) cases discussed above. The last test checks for a common error. Test cases should cover both basic functionality and corner cases.

You can run these tests, by executing the following command from the Linux command-line:

$ py.test -v -x -k extract_leading

Task 3: Compute leading digit range¶

Your third task is to complete the function get_leading_digits_range, which takes the number of leading digits and returns a tuple of integers with the lower bound (inclusive) and upper bound (exclusive) on the range of the values for that number of leading digits. For example, when called with a value of 2, the range of possible values (inclusive) is 10,..,99, so the function should return the tuple (10, 100).

Testing Task 3

We have provided three tests for this task:

Tests for `get_leading_digits_range`¶
Test number	Number of leading digits	Expected result
1	1	(1, 10)
2	2	(10, 100)
3	3	(100, 1000)

You can run these tests, by executing the following command from the Linux command-line.

$ py.test -v -x -k range

Task 4: Compute expected Benford distribution¶

Your fourth task is to complete the function compute_expected_benford_dist, which takes the number of digits as an argument and returns a list of floats with the expected distribution, that is, probability of occurrence of a leading digit as defined by the formula in the introduction. The value that corresponds to the lower bound of the range of leading digits should be at index zero in the result. For example, if the number of digits is 1, the array should have nine values with the value at index 0 holding the expected distribution for the numeral 1. If the number of digits is 2, the array should have 90 values, with the value at index 0 holding the expected distribution for 10.

Here is an example use of this function:

In [24]: benford.compute_expected_benford_dist(1)
Out[24]:
[0.3010299956639812,
17609125905568124,
12493873660829993,
09691001300805642,
07918124604762482,
06694678963061322,
05799194697768673,
05115252244738129,
04575749056067514]

The implementation of this function can be done with a simple loop to construct the desired list. You must use your get_leading_digits_range function to determine the range of possible values for the specified number of digits.

Testing for Task 4

We have provided two tests for this task:

Tests for `compute_expected_benford_dist`¶
Test number	Number of leading digits	Expected output filename
1	1	`expected_benford_dist_1_output.txt`
2	2	`expected_benford_dist_2_output.txt`

The first calls compute_expected_benford_dist with an argument of 1 (as shown in the example above) and the other calls it with an argument of 2. Our test code matches your results against the expected values, which are stored in the specified files. These files and the files used in subsequent tests can be found in the data subdirectory of your pa1 directory. You can look at the contents of a text file using the Linux commands cat, more, or less.

Our test code can be run from the Linux command-line with the command:

$ py.test -v -x -k expected

Task 5: Computing the Benford distribution¶

Your fifth task is to complete the function compute_benford_dist, which takes a currency symbol as a string, a non-empty list of strings that represent positive amounts in that currency, and a number of digits as arguments. It should compute the Benford distribution of the given data for the specified number of digits.

You can do this work in two steps: determine the number of times each value in the range of leading digits occurs, using a list to hold the counts, and then generate a list with the distribution from the counts. You can also compute the distribution directly using one loop, rather than two. Either approach is fine.

You will need to determine the number of values in the range of leading digits. Use your get_leading_digits_range function to determine the range. Also, do not repeat the code for extracting the amount and leading digits from a string. Instead, you must call the functions you wrote for tasks 1 and 2 to perform this work.

You may be tempted to implement this task by creating a list of the leading digits and then calling the list count method once for each possible value in the range of leading digits. You may not use this approach, because it is inefficient. Let $N$ be the length of the list of leading digits and $M$ be the number of values in the leading digit range. Using this approach takes time proportional to $N*M$, because each call to the count method takes time proportional to $N$ and it would need to be called $M$ times, once for each value in the range.

Part of your task is to come up with an algorithm for doing this computation in time proportional to $N$. Here’s a hint: you will need to figure out how to transform a leading digit value into the corresponding index in the list of counts.

Here are some example uses of this function:

In [25]: benford.compute_benford_dist("$", ["$2.34"], 1)
Out[25]: [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

In [26]: AMOUNTS = ["C$1.01", "C$2.20", "C$3.333", "C$44.4", "C$0.000055",
   ...:                "C$6.006", "C$99999.9999", "C$99"]

In [27]: benford.compute_benford_dist("C$", AMOUNTS, 1)
Out[27]: [0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.0, 0.0, 0.25]

In the first example, all of the entries except the value at index 1 are zero, because the list has only one value and it has a leading digit of 2. In the second example, the numerals 1, 2, 3, 4, 5, and 6 all occur once as leading digits in a list of length eight, the numerals 7 and 8 do not occur at all, and the numeral 9 appears twice as a leading digit in the list.

Testing Task 5

The table below contains information about the test cases for compute_benford_dist. For each test case, our code reads the list of amounts from a file, calls your function with appropriate arguments, reads the expected return value from the specified file, and then compares the actual return value with the expected return value.

Our test code includes the two examples above, a test that computes the Benford distribution for the same values as the AMOUNTS variable defined above with two leading digits, and four tests that use data extracted from payment records downloaded from the City of Chicago’s data portal. The file payments_50_input.csv contains data on 50 payments. The amount of interest in is column labeled “AMOUNT”. The file payments_50000_input.csv contains information on about 50,000 payments.

Tests for `compute_benford_dist`¶
Test number	Input filename	Number of leading digits	Expected output filename
1	`simple_input.csv`	1	`simple_computed_benford_dist_1_output.txt`
2	`amounts_input.csv`	1	`amounts_computed_benford_dist_1_output.txt`
3	`amounts_input.csv`	2	`amounts_computed_benford_dist_2_output.txt`
4	`payments_50_input.csv`	1	`payments_50_computed_benford_dist_1_output.txt`
5	`payments_50_input.csv`	2	`payments_50_computed_benford_dist_2_output.txt`
6	`payments_50000_input.csv`	1	`payments_50000_computed_benford_dist_1_output.txt`
7	`payments_50000_input.csv`	2	`payments_50000_computed_benford_dist_2_output.txt`

Our test code for this task can be run from the Linux command-line using the command:

$ py.test -v -x -k compute_benford_dist

Task 6: Compute Benford mean absolute difference¶

You can plot the actual distribution for a set of data and the expected distribution to get a sense of the difference between the two. In fact, we have included a function to do just that in benford.py. It is also useful to have a quantitative measure of the difference between the distributions as well. For this assignment, we’re going to use the mean absolute difference (MAD) of the expected distribution and the actual distribution for a given set of data and specified number of digits as our quantitative measure.

Given two lists, x and y, of length $N$, the MAD is:

\[MAD(x, y) = 1/N * \sum \limits_{i=1}^N | x_i - y_i |\]

where $x_i$ and $y_i$ are the ith elements of x and y respectively.

You can find a brief description of how to interpret this value here. For a more detailed analysis, see the book Digital Analysis Using Benford’s Law by Mark Nigrini.

Your last task is to complete the function compute_benford_MAD, which should take a currency symbol as a string, a non-empty list of strings that represent positive amounts in that currency, and a number of digits as arguments. As in the previous task, it does not make sense to call this function on an empty list.

Your implementation should call your functions from Tasks 4 and 5 to get the distributions and then compute and return the mean absolute difference between the two. Do not repeat the code from Tasks 4 and 5 in this function.

Testing Task 6

Our test code for this function uses the same inputs as the tests for Task 4. The expected results are read from files (simple_computed_benford_mad_1_output.txt, for example).

Run the command:

$ py.test -v -x -k MAD

from the Linux command-line to run our test code for this task.

Putting it all together¶

We have included code in benford.py that calls your functions to compute and then plot the expected and actual Benford distributions for a data set and to compute and report the MAD of the data set. It takes four arguments: the name of the input data file, the number of the column that contains the amounts (zero-based), the currency symbol, and the number of leading digits.

Here is a sample use of this program:

$ python3 benford.py data/payments_50_input.csv 2 "$" 2
MAD: 0.01211

and here is the plot that it generates:

Submission¶

To submit your assignment, make sure that you have:

put your name at the top of your file,
registered for the assignment using chisubmit (if you have not done so already),
added, committed, and pushed your code to the git server, and
run the chisubmit submission command.

chisubmit student assignment register pa1

git add benford.py
git commit -m"final version of PA #1 ready for submission"
git push

chisubmit student assignment submit pa1

We recommend cutting and pasting these commands rather than re-typing them!

Remember to push your code to the server early and often!