Some Basic SQL Joins

A non-technical friend recently asked me for help with a merge problem. They had two separate data pulls of electronic medical records based on specific study parameters. The set of people in the database who fit the study parameters changed between the data pulls, for example because people aged into or out of the study, or because new diagnoses were added to their records that caused them to be newly included or excluded. Let’s call the older data set A and the newer data set B. The goal was to get all the entries from B that don’t also show up in A. The data sets were pulled by a staff data scientist at that company who, despite their title, said they couldn’t figure out how to remove the entries from B that were already in A. Barring any special circumstances, this is a fairly standard problem, so let’s look at a couple of tools we could use to solve it. ...
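As a rough illustration of the problem described above, here is a minimal sketch of an anti-join ("rows in B that are not in A") using Python's built-in sqlite3 module. The table names `a` and `b`, the `patient_id` column, and the helper `rows_only_in_b` are all hypothetical; the real data pulls would have their own schema and key, and this is only one of several ways to express the query.

```python
import sqlite3


def rows_only_in_b(ids_a, ids_b):
    """Return the IDs that appear in B but not in A, using a LEFT JOIN anti-join."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one key column per table.
    conn.execute("CREATE TABLE a (patient_id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE b (patient_id INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO a VALUES (?)", [(i,) for i in ids_a])
    conn.executemany("INSERT INTO b VALUES (?)", [(i,) for i in ids_b])

    # Keep every row of B; rows without a partner in A get NULL for a.patient_id.
    query = """
        SELECT b.patient_id
        FROM b
        LEFT JOIN a ON a.patient_id = b.patient_id
        WHERE a.patient_id IS NULL
        ORDER BY b.patient_id
    """
    result = [row[0] for row in conn.execute(query)]
    conn.close()
    return result


print(rows_only_in_b([1, 2, 3], [2, 3, 4, 5]))  # prints [4, 5]
```

The same result could be written with `NOT EXISTS` or `EXCEPT`; the LEFT JOIN form is used here only because it makes the "no match in A" condition explicit.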

Sep 5, 2023 · 4 min · 789 words · D. Michael Senter

North Carolina Housing Data

A popular beginner’s machine learning problem is the prediction of housing prices. A frequently used data set for this purpose combines housing prices in California with some additional data gathered through the 1990 Census. One such data set is available here at Kaggle. Unfortunately, that data set is rather old. And I live in North Carolina, not California! So I figured I might as well create a new housing data set, but this time with more up-to-date information and using North Carolina as the state to be analyzed. One thing that may be interesting about North Carolina as compared to California is the position of major population centers. In California, major population centers are near the beach, while major population centers in North Carolina are in the interior of the state. Both large cities and proximity to the beach tend to correlate with higher housing prices. In California, unlike in North Carolina, both of these go together. ...

Nov 6, 2020 · 8 min · 1644 words · D. Michael Senter

Teacher Salaries

What do you do when your data table is in PDF format? Let’s use tabula-py to extract teacher salary information from PDFs directly into Pandas dataframes. We’ll also use some regex to clean up the results.
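To give a flavor of the regex cleanup step, here is a small sketch that turns salary strings into numbers using only the standard library. The sample strings and the helper `parse_salary` are made up for illustration; the actual cells extracted from the PDFs may look different and need different patterns.

```python
import re

# One number with optional thousands separators and an optional decimal part.
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_salary(cell):
    """Convert a string such as '$35,000' to a float; return None if no number is found."""
    match = _NUMBER.search(cell)
    if match is None:
        return None
    # Drop the thousands separators before converting.
    return float(match.group().replace(",", ""))


# Hypothetical cell values, not taken from the real salary tables.
for raw in ["$35,000", " $ 3,500.50 ", "n/a"]:
    print(repr(raw), "->", parse_salary(raw))
```

Returning `None` for cells without a number keeps the bad rows visible, so they can be inspected later instead of being silently converted to zero.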

Oct 29, 2020 · 9 min · 1788 words · D. Michael Senter

Accessing Census Data via API

The Census Bureau makes an incredible amount of data available online. In this post, I will summarize how to get access to this data via Python by using the Census Bureau’s API. The Census Bureau makes a pretty useful guide available here - I recommend checking it out. ...
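As a sketch of what a request looks like, the snippet below only builds a query URL and does not contact the server. The endpoint pattern, the data set path `acs/acs5`, the variable `B01001_001E`, and the geography `state:37` are assumptions for illustration, and `build_census_url` is a hypothetical helper; check the Census Bureau's guide for the data sets and variables you actually need.

```python
from urllib.parse import urlencode


def build_census_url(year, dataset, variables, geography, api_key=None):
    """Assemble a Census API query URL from its parts (assumed URL layout)."""
    base = f"https://api.census.gov/data/{year}/{dataset}"
    params = {"get": ",".join(variables), "for": geography}
    if api_key:
        # An API key is optional here; pass one if you have it.
        params["key"] = api_key
    # Keep commas, colons, and asterisks readable instead of percent-encoding them.
    return f"{base}?{urlencode(params, safe=',:*')}"


url = build_census_url(2019, "acs/acs5", ["NAME", "B01001_001E"], "state:37")
print(url)
```

The resulting string can then be fetched with any HTTP client, for example `urllib.request.urlopen` or the third-party `requests` package.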

Aug 22, 2020 · 5 min · 999 words · D. Michael Senter