Teacher Salaries

Extracting Data from PDFs with Tabula

After reading a news article about teacher pay in the US, I was curious and wanted to look into the source data myself. Unfortunately, the source that was mentioned was a publication by the National Education Association (NEA) which had the data as tables embedded inside a PDF report. As those who know me can attest, I don’t like hand-copying data. It is slow and error-prone. Instead, I decided to use the tabula package to extract the information from the PDFs directly into a Pandas dataframe. In this post, I will show you how to extract the data and how to clean it up for analysis.

The Data Source

Several years’ worth of data are available in PDF form on the NEA website. Reading through the technical notes, they highlight that the NEA did not collect all of the salary information itself. Some states’ information is calculated from the American Community Survey (ACS) conducted by the Census Bureau - a great resource whose API I have covered in a different post. Each report includes accurate data for the previous school year, as well as estimates for the current school year. As of this post, the newest report is the 2020 report, which includes data for the 2018-2019 school year as well as estimates for the 2019-2020 school year.

The 2020 report has the desired teacher salary information in two separate locations. One is in table B-6 on page 26 of the PDF, which ranks the states by average salary in addition to listing the salaries themselves:

A second location is table E-7 on page 46, which gives salary data for the completed school year as well as each state’s estimate for the 2019-2020 school year:

Note that table E-7 lacks the star annotation marking NEA-estimated values. This, and the lack of the ranking column, makes table E-7 easier to parse. In the main example below, it will be the source of our salary data. I will, however, also show how to parse table B-6 at the end of this post for completeness.

Loading the Data

As of October 2020, the NEA site has five years’ worth of reports online. Unfortunately, these are not labeled consistently across all five years, and the page number of the relevant table differs from report to report. Prior to the 2018 report, the tables also used inconsistent formats, which would require the earlier years to be parsed separately from the newer ones. For this reason, I’ll make a dictionary for the 2018-2020 reports only, which will simplify the example below.

report = {
    '2020' : {
        'url'  : "https://www.nea.org/sites/default/files/2020-10/2020%20Rankings%20and%20Estimates%20Report.pdf",
        'page' : 46,
    },
    '2019' : {
        'url'  : "https://www.nea.org/sites/default/files/2020-06/2019%20Rankings%20and%20Estimates%20Report.pdf",
        'page' : 49,
    },
    '2018' : {
        'url'  : "https://www.nea.org/sites/default/files/2020-07/180413-Rankings_And_Estimates_Report_2018.pdf",
        'page' : 51,
    },
}

We can now use a dictionary comprehension to fill in a dictionary with all the source tables of interest. We will be using the tabula package to extract data from the PDFs. If you don’t have it installed, you can use pip install tabula-py to get a copy. The method that reads in a PDF is aptly called read_pdf. Its first argument is a path to the PDF, and it also accepts a URL directly, which is what we use here. We pass the keyword argument stream=True to select tabula’s stream extraction mode, which suits tables like these whose cells are not separated by ruling lines, and we use pages to name the specific page in each PDF that contains the information we are after. By default, read_pdf returns a list of dataframes, one per detected table, so we just save the first element from the list, which is the table we are interested in.

Note: if you are using WSL, depending on your settings, you may get the error Exception in thread "main" java.awt.AWTError: Can't connect to X11 window server using 'XXX.XXX.XXX.XXX:0' as the value of the DISPLAY variable. when running read_pdf. This is fixed by having an X11 server running and a DISPLAY variable that points to it.
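
If an X server (e.g. VcXsrv) is running on the Windows side but DISPLAY is not set in your WSL session, one possible workaround is to set it from Python before calling read_pdf. This is only a sketch: it assumes the WSL2 default where the Windows host’s address appears as the nameserver entry in /etc/resolv.conf.

import os

# point DISPLAY at the X server on the Windows host; under WSL2's default
# config, the host's IP is the nameserver listed in /etc/resolv.conf
with open("/etc/resolv.conf") as f:
    host = next(line.split()[1] for line in f if line.startswith("nameserver"))
os.environ["DISPLAY"] = f"{host}:0"  # assumes the X server listens on display :0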

import tabula
import pandas as pd

# extract the table of interest from each year's report; [0] selects the
# first table that tabula detects on the given page
source_df = {year : tabula.read_pdf(report[year]['url'], stream=True, pages=report[year]['page'])[0]
             for year in report.keys()}

And that’s it in principle. How cool is that! Of course, we still need to clean our data a little bit.

Cleaning the Data

Let’s take a look at the first and last few entries of the 2020 report:

pd.concat([source_df['2020'].head(), 
           source_df['2020'].tail()])

Unnamed: 0 2018-19 2019-20 From 2018-19 to 2019-20 From 2010-11 to 2019-20 (%)
0 State Salary($) Salary($) Change(%) Current Dollar Constant Dollar
1 Alabama 52,009 54,095 4.01 13.16 -2.58
2 Alaska 70,277 70,877 0.85 15.36 -0.69
3 Arizona 50,353 50,381 0.06 8.03 -7.00
4 Arkansas 49,438 49,822 0.78 8.31 -6.75
48 Washington 73,049 72,965 -0.11 37.86 18.69
49 West Virginia 47,681 50,238 5.36 13.51 -2.28
50 Wisconsin 58,277 59,176 1.54 9.17 -6.02
51 Wyoming 58,861 59,014 0.26 5.19 -9.44
52 United States 62,304 63,645 2.15 14.14 -1.73

We see that each column is treated as a string object (which you can confirm by running source_df['2020'].dtypes) and that the first row of data is actually at index 1, because the PDF report used a two-row header. This means we can safely drop the row at index 0 in every dataframe. We can also drop the last row, since it only contains summary data for the US as a whole, which we can easily regenerate as necessary. So row indices 0 and 52 can go for all of our data sets.

# drop the stray header row (index 0) and the US summary row (index 52)
for df in source_df.values():
    df.drop([0, 52], inplace=True)

Next up, I’d like to fix the column names. The first column is clearly the name of the state (except in the case of Washington D.C.), while the next two columns give the years for which the salary information is given. Let’s rename the second and third columns according to the pattern Salary YYYY-YY using Python’s f-string syntax.

for df in source_df.values():
    df.rename(columns={
        df.columns[0] : "State",
        df.columns[1] : f"Salary {df.columns[1]}",
        df.columns[2] : f"Salary {df.columns[2]}",
    }, inplace=True)
    
source_df["2020"].head()  # show the result of our edits so far

State Salary 2018-19 Salary 2019-20 From 2018-19 to 2019-20 From 2010-11 to 2019-20 (%)
1 Alabama 52,009 54,095 4.01 13.16 -2.58
2 Alaska 70,277 70,877 0.85 15.36 -0.69
3 Arizona 50,353 50,381 0.06 8.03 -7.00
4 Arkansas 49,438 49,822 0.78 8.31 -6.75
5 California 83,059 84,659 1.93 24.74 7.39

Looks like we’re almost done! Let’s drop the unnecessary columns and check our remaining column names:

for year, df in source_df.items():
    df.drop(df.columns[3:], axis=1, inplace=True)
    print(f"{year}:\t{df.columns}")
2020:	Index(['State', 'Salary 2018-19', 'Salary 2019-20'], dtype='object')
2019:	Index(['State', 'Salary 2017-18', 'Salary 2018-19'], dtype='object')
2018:	Index(['State', 'Salary 2017', 'Salary 2018'], dtype='object')

We can see that the column naming scheme in the 2018 report was different from that of the newer reports. To make them all compatible for our merge, we’re going to have to do some more editing. Based on the other reports, it appears that the 2018 report used the calendar year in which the school year ended, while the others used a range. This can easily be fixed with a regex substitution. We’ll do that now.

import re

for year, df in source_df.items():
    if year != "2018":
        # strip the first half of the school-year range,
        # e.g. "Salary 2018-19" becomes "Salary 2019"
        df.rename(columns={
            df.columns[1] : re.sub(r"\d{2}-", '', df.columns[1]),
            df.columns[2] : re.sub(r"\d{2}-", '', df.columns[2]),
        }, inplace=True)
    # print the output for verification
    print(f"{year}:\t{df.columns}")
2020:	Index(['State', 'Salary 2019', 'Salary 2020'], dtype='object')
2019:	Index(['State', 'Salary 2018', 'Salary 2019'], dtype='object')
2018:	Index(['State', 'Salary 2017', 'Salary 2018'], dtype='object')

Now that everything works, we can do our merge to create a single dataframe with the information for all of the school years we have downloaded. We drop the duplicated salary columns from the older reports, since the newer reports carry the final figures where the older ones only had estimates. After the drops, the only column the frames share is State, so the default merge joins on it.

merge_df = (source_df["2018"].drop(["Salary 2018"], axis=1)
            .merge(source_df["2019"].drop(["Salary 2019"], axis=1))
            .merge(source_df["2020"]))

merge_df.head()

State Salary 2017 Salary 2018 Salary 2019 Salary 2020
0 Alabama 50,391 50,568 52,009 54,095
1 Alaska 6 8,138 69,682 70,277 70,877
2 Arizona 4 7,403 48,723 50,353 50,381
3 Arkansas 4 8,304 50,544 49,438 49,822
4 California 7 9,128 80,680 83,059 84,659

Numeric Conversion

We’re almost done! Notice that we have not yet dealt with the fact that every column is treated as a string. Before we can use the to_numeric function, we need to take care of two issues:

  • The commas in the numbers. While they are nice for our human eyes, Pandas doesn’t like them.
  • In the 2017 salary column, there appears to be extraneous white space after the first digit for some entries.

Luckily, both of these problems can be remedied with a simple string replacement operation.

# strip thousands separators and stray spaces from the salary columns;
# inside a character class '|' would be a literal pipe, so [, ] is all we need
merge_df.iloc[:,1:] = merge_df.iloc[:,1:].replace(r"[, ]", '', regex=True)

# convert each salary column to a numeric dtype
for col in merge_df.columns[1:]:
    merge_df[col] = pd.to_numeric(merge_df[col])

Now we’re done! We have created an overview of annual teacher salaries from the 2016-17 school year through 2019-20, extracted from a series of PDFs published by the NEA. We have cleaned up the data and converted everything to numerical values. We can now get summary statistics and do any analysis of interest with this data.

merge_df.describe() # summary stats of our numeric columns

Salary 2017 Salary 2018 Salary 2019 Salary 2020
count 51.000000 51.000000 51.000000 51.000000
mean 56536.196078 57313.039216 58983.254902 60170.647059
std 9569.444674 9795.914601 10286.843230 10410.259274
min 42925.000000 44926.000000 45105.000000 45192.000000
25% 49985.000000 50451.500000 51100.500000 52441.000000
50% 54308.000000 53815.000000 54935.000000 57091.000000
75% 61038.000000 61853.000000 64393.500000 66366.000000
max 81902.000000 84227.000000 85889.000000 87543.000000
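
As a quick sketch of the kind of question we can now answer, here is one way to find the states with the largest nominal salary growth over the period (the Growth (%) column name is just an illustration):

# percent change in nominal salary from the 2016-17 to the 2019-20 school year
merge_df["Growth (%)"] = (merge_df["Salary 2020"] / merge_df["Salary 2017"] - 1) * 100

# the five states with the largest nominal increase
merge_df.nlargest(5, "Growth (%)")[["State", "Salary 2017", "Salary 2020", "Growth (%)"]]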

Table B-6

As mentioned above, table B-6 in the 2020 report presents slightly greater challenges. Much of the cleaning is similar or identical to what we did above, so I will not reproduce it in full. Instead, I have loaded a subset of table B-6 and will show how it can be cleaned up as well. But first, let’s look at the first several entries:

Unnamed: 0 2017-18 (Revised) 2018-19
0 State Salary($) Rank Salary($)
1 Alabama 50,568 36 52,009
2 Alaska 69,682 7 70,277
3 Arizona 48,315 45 50,353
4 Arkansas 49,096 44 49,438
5 California 80,680 2 83,059 *
6 Colorado 52,695 32 54,935
7 Connecticut 74,517 * 5 76,465 *
8 Delaware 62,422 13 63,662

We can see that there is an additional hurdle compared to the previous tables: the second column now contains data from two columns of the original table, holding both the salary information and the state’s salary rank. For a few states, there is additionally a ‘*’ denoting values that were estimated rather than reported. We can again use a regex replacement, this time with a capture group, to parse out only the values we are interested in while dropping the extraneous information, using the code below.

# keep only the leading digits and commas (the salary itself),
# dropping the rank and any trailing '*'
b6.iloc[:,1:] = b6.iloc[:,1:].replace(r"([\d,]+).*", r"\1", regex=True)
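
For completeness, here is a minimal sketch of the remaining cleanup, mirroring the steps from the main example (it assumes the subset was loaded into a dataframe named b6, as above):

# drop the embedded second header row and give the columns their final names
b6.drop([0], inplace=True)
b6.rename(columns={
    b6.columns[0] : "State",
    b6.columns[1] : "Salary 2018",
    b6.columns[2] : "Salary 2019",
}, inplace=True)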

And now we’re back to where we were above before we did the numeric conversion. This is what it looks like after also dropping the first row and renaming the columns:

State Salary 2018 Salary 2019
1 Alabama 50,568 52,009
2 Alaska 69,682 70,277
3 Arizona 48,315 50,353
4 Arkansas 49,096 49,438
5 California 80,680 83,059
6 Colorado 52,695 54,935
7 Connecticut 74,517 76,465
8 Delaware 62,422 63,662

From here on out, we can proceed as in the previous example.
