====== Pandas ====== Pandas is a Python library for working with data. Pandas is a large library with many useful features. In this lab, we will focus on concepts that will be useful for PA #6. Getting started --------------- Navigate (``cd``) to your ``cmsc12100-aut-17-username`` directory. Run ``git pull upstream master`` to collect the lab materials and ``git pull`` to sync with your personal repository. We will be using the same data as PA #6. If you have not do so already, please change to your ``pa6/data`` directory and run the script to get the data (``./get_files.sh``). The ``lab7`` directory contains a file named ``lab7.py``, where you will write your code for this lab. We have included code to import pandas as ``pd``. Fire up ``ipython3`` and run ``lab7.py`` after you finish each task to test it out. Please note that ``lab7.py`` will fail to run until you've implemented Task 1 below. Reading the data ---------------- As mentioned, we will be using one of the sample data sets from PA #6 (``morg_d07.csv``) for this assignment. Please look at the `Data `__ and `MORG Files `__ sections of PA #6 get a quick overview of this data. **Task 1**: Use ``pd.read_csv`` to read the sample data into a pandas dataframe and save the result in a variable named ``morg_df``. Use ``h_id`` as the index_col, which uniquely identifies each row and as the row index. Basics ------ If you are unsure of how to perform the following tasks, make sure to review the textbook's chapter on Pandas. **Task 2**: Extract the ``"age"`` column from ``morg_df``. **Task 3**: Extract the row that corresponds to ``h_id`` 1_2_2 from ``morg_df``. **Task 4**: Use slicing to extract the first four rows of ``morg_df``. **Task 5**: Use boolean indexing to extract all rows that correspond to a person who works 35 or more hours per week (``hours_worked_per_week``). .. **Task 6**: Use ``fillna`` to convert the missing values in ethnicity, which are currently represented using ``NaN``, which stands for *Not a Number*, to zero. Categorical values ------------------ Categoricals are a pandas data type that is used to represent variables which can take on only a limited, and usually fixed, number of possible values. We saw a way to convert strings to categorical values using ``.astype`` in lecture. This method is less useful for the Current Population Survey Data, because the categorical values are encoded as small integers. We could work with these small integers but it is much better to work with meaningful values. The Pandas library function ``pd.Categorical.from_codes`` allows you to convert codes into meaningful categorical values. For example, we could convert the ``ethnicity_code`` (after having replaced the ``NaN`` values with zeros using the ``fillna`` method) from an integer encoding into more meaningful values using the following: :: # convert NaN to 0 morg_df["ethnicity_code"] = morg_df["ethnicity_code"].fillna(0) # convert codes to categorical values ethnic_categories = ['Non-Hispanic', 'Mexican', 'PuertoRican', 'Cuban', 'Dominican', 'Salvadoran', 'CentralAmericanExcludingSalvadoran', 'SouthAmerican', 'OtherSpanish'] morg_df["ethnicity_code"] = pd.Categorical.from_codes(morg_df["ethnicity_code"], ethnic_categories) Having done so, we would probably want to also change the name of the column: :: morg_df.rename(columns={"ethnicity_code":"ethnicity"}, inplace=True) to reflect the fact that it no longer contains integer codes. **Task 6**: Convert the employment status code column (``employment_status_code``), which has values that range from 1 through 7 to categorical values and rename it "status". We have defined a variable, ``status_categories``, to simplify this task for the purposes of this lab. For the assignment, you will need to read this information from a file. Keep in mind that ``from_codes`` is expecting the values to be zero-based and so you will need to transform the values to range from 0 through 6 before you can convert them to categorical values. **Task 7**: Now that you have converted the status column into categorical values, use boolean indexing to extract the rows that correspond to the people who are not working. **Task 8**: Use boolean indexing to extract the rows that correspond to people who worked at least 35 hours per week and have a status of "Working". Converting floats to bins ------------------------- On occasion, it is useful to group floating point values, such as ages or salaries, into bins. Pandas provides a function `cut `__ that serves this purpose. Let's look at some code for adding an "age bin" column to the Morg dataframe. But first, let's look at the range of ages seen in our data: :: In [19]: morg_df["age"].min() Out[19]: 16 In [20]: morg_df["age"].max() Out[20]: 85 We'd like to group the data into 9 bins. For this lab, we'll use ``range(16, 89, 8)`` for this purpose. This approach is not ideal, since the last bin includes four years that do not correspond to the ages of any of the people in our dataset. For PA #6, we recommend using Numpy's `linspace `__ function, which allows you to get a list of values equally spaced between a lower bound and an upper bound. The bins need to be labelled. For simplicity, we'll label the bins with integers. Keep in mind that we need :math:`N+1` boundaries for :math:`N` bins (and :math:`N` labels). Once we have defined the bin boundaries and the labels, we can use ``cut`` to compute a new column where the ith entry correponds to the appropriate bin for the age of the person in the ith row of ``morg_df``: :: boundaries = range(16, 89, 8) morg_df["age_bin"] = pd.cut(morg_df["age"], bins=boundaries, labels=range(len(boundaries)-1), include_lowest=True, right=False) We use ``include_lowest=True`` because we have at least one person who is 16 and we want to include his age in a bin. We use ``right=False``, because we do not want to include 89 in our last bin. **Task 9**: Add a column to ``morg_df`` for hours worked per week that assigns a person's hours worked per week to one of ten buckets, evenly spaced from the fewest number of hours worked (0) to the most (99) inclusive. Counting -------- Counting the number of occurrences of different values is a very common operation. We saw two ways to do this task in class: using ``value_counts`` and ``groupby`` plus ``size``: :: In [59]: morg_df["age_bin"].value_counts() Out[59]: 3 41893 4 40857 0 39945 2 37416 1 37368 5 32701 6 23293 7 16891 8 13656 dtype: int64 In [60]: morg_df.groupby("age_bin").size() Out[60]: age_bin 0 39945 1 37368 2 37416 3 41893 4 40857 5 32701 6 23293 7 16891 8 13656 dtype: int64 **Task 10**: Use these methods to count the number of people in each of your hours-worked-per-week bins. Note that you can get the values sorted by bin number, for either approach, using the ``sort_index`` method: :: In [63]: morg_df["age_bin"].value_counts().sort_index() Out[63]: 0 39945 1 37368 2 37416 3 41893 4 40857 5 32701 6 23293 7 16891 8 13656 dtype: int64 In [62]: morg_df.groupby("age_bin").size().sort_index() Out[62]: age_bin 0 39945 1 37368 2 37416 3 41893 4 40857 5 32701 6 23293 7 16891 8 13656 dtype: int64 When Finished ------------- When finished with the lab please check in your work (assuming you are inside the lab directory): .. code:: git add lab7.py git commit -m "Finished with lab7" git push No, we're not grading this, we just want to look for common errors.