[{"content":"I wrote a blog post for blogs.sas.com that went live yesterday. This post walks you through how to run a Bayesian regression model with a multiply imputed data set such as those produced by PROC MI. What I like about this process is that you can get started working with a Bayesian model that includes missing data even if you\u0026rsquo;re not familiar with joint modeling techniques. It\u0026rsquo;s pretty neat and allows for knowledge transfer between your frequentist and Bayesian methods.\nGo over to SAS to read my article entitled \u0026ldquo;Getting started performing a Bayesian analysis with missing data in SAS.\u0026rdquo;\n","permalink":"https://dmsenter89.github.io/post/25/02-new-mi-post-at-sas/","summary":"My new post demonstrating how to do Bayesian analysis with MI is live.","title":"New MI Post at SAS"},{"content":"Background I have come across an interesting question in a kashrut course that I\u0026rsquo;m taking. What I find particularly interesting is the fact that since then, I have shared the question with many people and barring a solitary exception, the result has always been the same. The intuitive answer is given, which is incorrect, despite the person knowing all the relevant physics facts that would lead to the correct answer. But for some reason, it\u0026rsquo;s not put together. I myself first answered the question incorrectly, and when it was pointed out to me I had to think about the relevant facts before I was able to really get it. Since I have already shared this with several people and found it interesting, I thought I\u0026rsquo;d write it up here.\nThe question we had talked about was this: hot food is taken off the stove and moved into utensils. I put some on a plate and some in a bowl. Which one cools off to an edible temperature faster, the food on the plate or in the bowl?\nBuilding Intuition To make this problem tractable, let\u0026rsquo;s add a few details and make some assumptions. 
Let\u0026rsquo;s start with the food. Imagine something that\u0026rsquo;s liquid-y enough to be able to assume the shape of the container, but thick enough it can hold its shape on the plate for long enough to solve our problem. Think something like thick mashed potatoes, but with the \u0026ldquo;spherical cow\u0026rdquo;-style assumption of it being homogeneous so we don\u0026rsquo;t have to deal with pockets of different densities in the food. Let\u0026rsquo;s label our scenarios:\nScenario I: I put the food inside a hemispherical bowl whose internal radius is $r$, so my food takes on a half-sphere shape.\nScenario II: I turn it upside down onto a plate so that I still have a half-sphere of hot food cooling down, just on a flat surface now.\nFor temperature purposes, imagine the cooked food being very hot \u0026ndash; 70C (158 F) \u0026ndash; and edible temperature being about 50C (122 F). Air is at a \u0026ldquo;room temperature\u0026rdquo; of 20C (68 F). Imagine both utensils \u0026ndash; the plate and bowl \u0026ndash; are made from the same homogeneous material with thickness $d$.\nThe main thing we worry about with cooling is surface area. We know the rate that heat leaves our system is the surface integral of the heat flux over the object\u0026rsquo;s area, i.e.\n$$ \\dot{Q} = \\oint_A q \\,dA.$$\nSo to make cooling quicker, we either need to manipulate the surface area or change the heat flux, which depends on the temperature gradient and the materials involved.\nFor our purposes, we have two interfaces to deal with: the circular interface that touches the air in scenario I or the plate in scenario II, and the round surface that touches either the inside of the bowl in scenario I or the air in scenario II. By construction of the example those are nice values: $A_\\text{circ} = \\pi r^2$ and $A_\\text{hemi}= 2 \\pi r^2$. 
With that in hand:\n$$\\begin{aligned} \\dot{Q}_{I} \u0026amp;= q_b\\, 2 \\pi r^2 + q_a \\,\\pi r^2 \\\\ \\dot{Q}_{II} \u0026amp;= q_{b}\\, \\pi r^2 + q_a \\, 2 \\pi r^2 \\end{aligned}$$\nwhere $q_b$ is the heat flux at the utensil interface and $q_a$ the heat flux at the air interface. Since I\u0026rsquo;m mainly interested in which is faster, I can write this as a fraction:\n$$\\frac{\\dot{Q}_{I}}{\\dot{Q}_{II}} = \\frac{2 q_b + q_a}{q_b + 2 q_a} $$\nVirtually everyone I have talked to about this assumes that the plate cools faster, which for our purposes means the above fraction is less than 1, which is fulfilled when $q_a \u0026lt; q_b \u0026lt; 0$ (remember that heat is leaving the system, so the fluxes are negative; in other words, the air would have to carry heat away faster than the utensil does).\nAnd there is our first snag: most everyone knows that air, like gases in general, is a terrible conductor compared to solid materials, particularly ceramics and the like that are often used for utensils. As it turns out $|q_b| \\gg |q_a|$ so we could frankly neglect the air interface when thinking about the cooling time and simply declare the bowl in scenario I the winner without further work.\nGo Newton If we really wanted to, we could try to go further and solve something more precise. Newton\u0026rsquo;s law of cooling, familiar to all calculus students, lets me calculate the rate of heat loss using only temperature as\n$$ \\dot{Q} = - h A \\left(T(t) - T_\\text{env}\\right).$$\nWith $\\dot{Q} = m c \\frac{dT}{dt}$ where $m$ is the mass and $c$ is the specific heat capacity, this gives me an ODE in temperature only as\n$$\\frac{dT}{dt} = - {\\frac{h A}{m c}} \\left(T(t) - T_\\text{env}\\right)$$\nwhich has the solution\n$$T(t) = T_\\text{env} + \\left(T_0 - T_\\text{env}\\right)\\,\\mathrm{exp}(-K t)$$\nwhere $K=hA/mc$. We\u0026rsquo;re only interested in the comparison of $K_I$ and $K_{II}$, so the work mostly comes down to finding $h$. The heat transfer coefficient of air for relevant scenarios can be looked up. 
The question is how to treat the utensil. It\u0026rsquo;s interesting to know that we can deal with heat transfer through walls by an inverse additive relation like this:\n$$ \\frac{1}{h_\\text{b}} = \\frac{1}{h_{a}} + \\frac{d}{k_b} + \\frac{1}{h_f}$$\nIn this case, $h_a$ is the coefficient for air, $k_b$ is the thermal conductivity of the utensil material, and $h_f$ the relevant coefficient for the food. Recall that $d$ is the thickness of the utensil.\nWhile this is fine for a back-of-the-envelope calculation, technically it doesn\u0026rsquo;t apply. If the Biot number\n$$\\mathrm{Bi} = h L / k, $$\nis less than about 0.1 it is fine to ignore thermal processes inside of bodies and make the isothermal assumption inherent in the formulation of Newton\u0026rsquo;s law of cooling.\nAs per the literature the thermal conductivity of mashed potatoes near our target temperatures is approximately $k =0.58 \\mathrm{W} \\mathrm{m}^{-1} \\mathrm{K}^{-1}$. For our purposes, the characteristic length can be taken to be the hemisphere volume over area:\n$$L = V/A = \\left(\\frac 23 \\pi r^3 \\right) \\left( 3 \\pi r^2 \\right)^{-1} = \\frac 2 9 r. $$\nThe value $h$ should be averaged over the surfaces and I don\u0026rsquo;t feel like doing that. Let\u0026rsquo;s see how far we get with this:\n$$\\begin{aligned} \\mathrm{Bi} \u0026amp;\u0026lt; 0.1 \\\\ \\frac{100}{261} h\\,r \u0026amp;\u0026lt; \\frac {1}{10} \\\\ \\Rightarrow h \u0026amp;\u0026lt; 261 / (1000 r) \\end{aligned}$$\nAn Ikea bowl is about 5-6 inches across, so let\u0026rsquo;s use $r=0.07 \\mathrm{m}$. With that we get the Biot number we want when $h\u0026lt;3.7$. Since even for air we have $\\mathcal{O}\\left(h_\\text{air}\\right)= 10$ this looks like it won\u0026rsquo;t really hold. Oh well. 
Being more precise than this is getting to be overkill and I already spent too much time on this problem.\nLong story short: if you want your food to cool faster, maximize surface area but also maximize surface area in contact with the utensil, not with air.\n","permalink":"https://dmsenter89.github.io/post/25/01-cooling-utensils/","summary":"A fun physics problem people keep getting wrong, despite knowing all the relevant physics.","title":"Cooling Utensils"},{"content":"Substack\u0026rsquo;s algorithm for figuring out what I like to read is still serving up some odd stuff, but this time it found a good article. In The Mensa Fallacy Emil Kirkegaard takes on Karpinski et al (2018) which claimed that high intelligence is a risk factor for both psychological and physiological disease. The paper presents this as a novel finding, but Kirkegaard cites a number of studies that would indicate problems with Karpinski\u0026rsquo;s arguments.\nWhile the language in the post is a bit on the irreverent side, this seems to be a good example of the general difficulty with convenience samples. The latter are all over science, and many times used without much thought. You have a theory about a population $P$ but it\u0026rsquo;s hard to actually sample $p\\in P$? Luckily for you there is a population $Q$ that is easy to sample such that $q\\in Q \\Rightarrow q\\in P$. That\u0026rsquo;s where I\u0026rsquo;ve often seen it end. Maybe there\u0026rsquo;s an acknowledgment that $p\\in P \\nRightarrow p\\in Q$, but that\u0026rsquo;s obviously not surprising or else $P=Q$ and there wasn\u0026rsquo;t a problem to begin with.\nSo what\u0026rsquo;s the problem? We\u0026rsquo;ve done random sampling on $Q$ and we have a nice unbiased estimator $\\hat\\theta_Q$ for our population. Is my estimator also unbiased for $P$? A priori we have no grounds to think so. A popular illustration: assume I want to know the average height at my school or university (population $P$). 
I have no idea how tall random students are and it\u0026rsquo;s awkward to ask, but luckily the basketball team (population $Q$) publishes player stats that include the height of the players. So I go online, put all the stats in an Excel sheet, calculate the mean, and call it a day. That result would clearly not be particularly valuable for estimating the average height of all students at the school/uni. The problem is that a random sample of $Q$ does not necessarily translate into a random sample of $P$.\nSubtle versions of this problem show up in real life research all the time. If we know something about (relevant) ways in which $Q$ differs from $P$, then we can make some adjustments to correct for the bias introduced by our sampling technique. Aside from us sampling from a specific group, the issue can also crop up when we try to \u0026ldquo;hijack\u0026rdquo; an existing RCT for secondary analysis, as for example discussed in this post on the Kindergarten study that I\u0026rsquo;ve been meaning to comment on for a bit. A related issue that requires care is the use of surrogate endpoints in clinical trials, where for example an easy and cheap to acquire lab value functions as a stand-in for the actual issue of interest that may be difficult, expensive or unpleasant to collect. Without a good understanding of the relationship between a population defined by certain lab value cutoffs compared to the endpoint of interest, the result may not be interpretable or may provide much weaker evidence than we\u0026rsquo;d like.\n","permalink":"https://dmsenter89.github.io/post/24/12-convenience-sampling-and-mensa/","summary":"\u003cp\u003eSubstack\u0026rsquo;s algorithm for figuring out what I like to read is still serving up\nsome odd stuff, but this time it found a good article. 
In \u003ca href=\"https://www.emilkirkegaard.com/p/the-mensa-fallacy\"\u003eThe Mensa\nFallacy\u003c/a\u003e Emil Kirkegaard\ntakes on \u003ca href=\"https://doi.org/10.1016/j.intell.2017.09.001\"\u003eKarpinski et al (2018)\u003c/a\u003e\nwhich claimed that high intelligence is a risk factor for both psychological\nand physiological disease. The paper presents this as a novel finding, but\nKirkegaard cites a number of studies that would indicate problems with Karpinski\u0026rsquo;s\narguments.\u003c/p\u003e","title":"Convenience Sampling and Mensa"},{"content":"If you visit the Project Jupyter website you\u0026rsquo;ll encounter a bunch of \u0026ldquo;try it in your browser\u0026rdquo; buttons. If you\u0026rsquo;ve used Jupyter for a decade or so like me, you probably have also been ignoring these buttons. And if you have clicked on them, you might have been led to mybinder.org. Don\u0026rsquo;t get me wrong, mybinder is cool. It creates a docker image that remote-hosts a live environment so that you can share your interactive notebooks on the web. Cool stuff. But I just found something better.\nHave you ever heard of WebAssembly, or wasm as it\u0026rsquo;s abbreviated? The elevator pitch for WebAssembly can be summarized as a binary format for a locally running, sandboxed JavaScript VM. The idea being that computation can essentially be off-loaded from a remote server to the machine running the browser that\u0026rsquo;s viewing the website. And you can now probably guess where this is going\u0026hellip;\nThanks to wasm, there\u0026rsquo;s now an online JupyterLab instance featuring a Python kernel and a SQLite kernel. The Python kernel is powered by Pyodide, a WASM port of CPython. While it\u0026rsquo;s not 1-1 feature complete, it\u0026rsquo;s quite impressive already. Solve an ODE with SciPy? Data analytics with Pandas? Visualization with Matplotlib? Fancy plots with Bokeh? This instance has got you covered. 
The governing JupyterLite project is open-source and even includes instructions on how to deploy JupyterLite on GitHub Pages.\nWhat\u0026rsquo;s wild to me is that yes, you can run this on a GitHub Pages instance. Because you don\u0026rsquo;t actually need a backend, only a static site server, since all computations happen locally. When you upload a file, it goes to the VM running locally in your browser, so even though you\u0026rsquo;re \u0026ldquo;uploading\u0026rdquo; you don\u0026rsquo;t actually need to share your data with the remote server. It\u0026rsquo;s quite impressive. Again, not everything works 100% yet, but you can get surprisingly far with this setup. You can even get a taste of it without leaving this blog because you can embed a REPL provided by JupyterLite\u0026rsquo;s demo instance as an iframe. Feel free to play with the REPL below and check out the JupyterLite project. It\u0026rsquo;s really cool.\n","permalink":"https://dmsenter89.github.io/post/24/12-remote-hosted-local-jupyter/","summary":"\u003cp\u003eIf you visit the \u003ca href=\"https://jupyter.org/\"\u003eProject Jupyter website\u003c/a\u003e you\u0026rsquo;ll\nencounter a bunch of \u0026ldquo;try it in your browser\u0026rdquo; buttons. If you\u0026rsquo;ve used Jupyter\nfor a decade or so like me, you probably have also been ignoring these buttons.\nAnd if you have clicked on them, you might have been led to\n\u003ca href=\"https://mybinder.org/\"\u003emybinder.org\u003c/a\u003e. Don\u0026rsquo;t get me wrong, mybinder is cool. It\ncreates a docker image that remote-hosts a live environment so that you can\nshare your interactive notebooks on the web. Cool stuff. But I just found\nsomething better.\u003c/p\u003e","title":"Remote Hosted, Local Jupyter?!"},{"content":"Most have heard of Pascal\u0026rsquo;s wager, but have you heard of the thought experiment known as Pascal\u0026rsquo;s mugging? 
The mugging attempts to reframe the essence of the wager argument using only finite values, thereby getting around some standard objections to the wager argument.\nPascal\u0026rsquo;s Mugging It appears the term was coined in a blog post by Eliezer Yudkowsky and framed in terms of potential risks posed by AI tasked with solving a problem. Nick Bostrom retells the mugging as a conversation between Pascal and a hypothetical extra-dimensional mugger. The second part of the \u0026ldquo;deal\u0026rdquo; in the paper is a bit extreme, but the initial part \u0026ndash; offering a fixed cost of money now for a low probability payoff tomorrow \u0026ndash; reminded me a bit of one of my most frequently discussed posts from 2022 on whether it makes sense to play the lottery. To be fair, that was two posts rolled into one \u0026ndash; the first part is about how point estimates can be misleading, particularly for skewed distributions, and the second consisted of a mini-benchmark using a simple example calculation to make that point.\nThe beginning of Bostrom\u0026rsquo;s version goes something like this: a mugger approaches Pascal and asks for his wallet. Unfortunately, the mugger forgot his weapon so Pascal is disinclined to acquiesce to his request. To still get the wallet, the mugger offers Pascal a deal: give the mugger the wallet anyways, valued at $x$ USD, and the next day the mugger will return and pay Pascal $N x$ USD in return. As the story progresses, $N$ gets larger. We obviously don\u0026rsquo;t just believe the mugger, so there is some (small) probability $p$ that the mugger will return with the promised reward. The idea behind the experiment is that if $N$ grows sufficiently large then for any non-zero $p$ the expected value of paying the mugger becomes positive: the expected net gain $pNx - x$ is positive once $N \u0026gt; 1/p$. It then veers off talking about other utility issues to make the payout better, but this early part of the conversation is essentially equivalent to a lottery game. 
Pay for the ticket now in the hopes that come game night the right numbers show up and you\u0026rsquo;re rich.\nAs stated, it would appear that it is reasonable for Pascal to pay the mugger and \u0026ndash; for a sufficiently large jackpot \u0026ndash; to play the lottery, even though intuitively it strikes us as the \u0026ldquo;wrong answer\u0026rdquo; given the low probability of winning. This got me thinking: can we steelman the case for playing the lottery by ignoring the magnitude of the win?\nLet\u0026rsquo;s Crunch Some Numbers If we ignore the question of the exact magnitude of the lottery win, we can divide the event space into three possibilities \u0026ndash; we are either worse off (cost of buying ticket), win back the ticket cost only, or win more money than the ticket cost so that we have a net gain from playing. Working off of the published odds again as a shortcut, we wind up with the following probabilities:\n$$ \\begin{cases} Pr(\\text{loss}) = 24/25 \\\\ Pr(\\text{even}) = 1/38 \\\\ Pr(\\text{gain}) = 13/950 \\end{cases} $$\nI\u0026rsquo;ll postulate that people don\u0026rsquo;t really care about just breaking even on the ticket, so from a decision point of view it probably makes sense to reduce this to a binomial problem: I either win or lose on each ticket, where losing includes the case of breaking even. People usually buy a handful of tickets, call it $n$. So now I can write a random variable representing my number of winning tickets as\n$$ X \\sim \\mathrm{Bin}\\left(n, \\frac{13}{950} \\right). $$\nWe already showed that within any reasonable lifetime, there is a vanishingly small chance you\u0026rsquo;ll become rich from playing the lottery. To steelman, we\u0026rsquo;ll say that the utility of playing the lottery comes not from the money won but from the fun of playing.\nI\u0026rsquo;ll start by asserting that winning is more fun than losing. So how often can we lose and still have fun? 
We definitely don\u0026rsquo;t want to always lose, so $x\u0026gt;0$ is required. Winning more often than we lose is unlikely by design of the lottery so that can be our upper bound. Our realistic expectation should then be something like $0 \u0026lt; x \\leq \\left \\lfloor{n/2}\\right \\rfloor$. If $x$ is in this window, I\u0026rsquo;ll call it a \u0026ldquo;Good Game\u0026rdquo; of lottery.\nRealistically, you\u0026rsquo;re not going to buy 100+ lottery tickets. Let\u0026rsquo;s say an average person likely wouldn\u0026rsquo;t buy more than 20 tickets, which still feels like plenty. So how likely is it you\u0026rsquo;ll have a good game, given as a function of the number of lottery tickets purchased? Let\u0026rsquo;s do a quick DATA step and find out.\ndata lottery; p=13/950; do N=2 to 20; gg=cdf(\u0026#39;Binomial\u0026#39;, floor(N/2), p, N) - cdf(\u0026#39;Binomial\u0026#39;, 0, p, N); output; end; run; proc sgplot data=lottery; scatter x=N y=gg; xaxis label=\u0026#39;Number of Tickets\u0026#39;; yaxis label=\u0026#39;Probability of a Good Game\u0026#39;; run; So in the best case scenario, where we buy 20 tickets, we only have a chance of approximately 0.24 of having a good time. In other words, even with the generous assumption you\u0026rsquo;d be happy to lose 19 games if you win 1 you\u0026rsquo;d still be disappointed most of the time. Given my experience, I\u0026rsquo;d say many people will probably purchase fewer than 20 tickets. Perhaps 2 to 6. In such cases, you\u0026rsquo;re still bound to be disappointed. I\u0026rsquo;d say even with this steelmanning, it doesn\u0026rsquo;t make sense. 
Instead of buying a lottery ticket, perhaps buy a coffee and a donut for similar cost but with a guaranteed happiness payoff.\n","permalink":"https://dmsenter89.github.io/post/24/12-lotteries-and-pascals-mugging/","summary":"\u003cp\u003eMost have heard of \u003ca href=\"https://en.wikipedia.org/wiki/Pascal's_wager\"\u003ePascal\u0026rsquo;s wager\u003c/a\u003e, but have you heard of the thought experiment known as \u003ca href=\"https://en.wikipedia.org/wiki/Pascal%27s_mugging\"\u003ePascal\u0026rsquo;s mugging\u003c/a\u003e? The mugging attempts to reframe the essence of the wager argument using only finite values, thereby getting around some standard objections to the wager argument.\u003c/p\u003e","title":"Lotteries and Pascal's Mugging"},{"content":"Ever heard someone say they were \u0026ldquo;letting the data speak for itself?\u0026rdquo; I often encounter this phrase on the internet by someone claiming not to be interpreting the data, but merely relaying facts. I don\u0026rsquo;t believe that\u0026rsquo;s actually true in the sense that it\u0026rsquo;s typically meant.\nHere\u0026rsquo;s a good minimal example I\u0026rsquo;ve used in a math modeling class before. I got this table from slides in an epidemiology course where it was presented in the context of racial disparities in health care. It shows rates of infant mortality for two time points in two different racial groups in the US. Here\u0026rsquo;s a reproduction of the table:\nYear White (Non-Hispanic) Black (Non-Hispanic) 1950 26.8 43.9 1998 6 13.8 This represents the number of infants who died per 1,000 live births, so lower is better. The first thing to note and credit is that the numbers have been plummeting in both columns, a testament to our improvements in infant care.\nNow, if all I\u0026rsquo;m trying to do is to say that there is a disparity because the numbers don\u0026rsquo;t match, that\u0026rsquo;s true. But it is also trivial and uninteresting. 
I\u0026rsquo;m usually much more interested in patterns and trends. Given that we have two different time points, it is reasonable to ask a question like \u0026ldquo;are the disparities getting better or worse?\u0026rdquo; As it turns out, and we did this exercise in class, the question happens to be a bit vague, so it\u0026rsquo;s easy to come up with different answers both in direction and magnitude of the disparity. I\u0026rsquo;ll group a couple of examples each by outcome.\nSay I want to see the data as saying the racial disparity is getting bigger, i.e. the gaps are widening in some sense. To measure the disparity, the US Office of Minority Health uses the ratio of the non-Hispanic Black to the non-Hispanic White rates. That would give us $\\frac{43.9}{26.8} \\approx 1.63$ for 1950 versus $\\frac{13.8}{6} = 2.3$ for 1998. This would tell a story of the gap widening: by this measure, the gap has grown by about 40%.\nAlternatively, we could construct the relative percentage difference between the Black and White rates, which would give $\\frac{43.9 - 26.8}{26.8} \\approx 0.64$ for 1950 vs $\\frac{13.8 - 6}{6} = 1.3$ for 1998. This also tells a story of disparities growing, but it appears even more alarming than using the previous method \u0026ndash; the 1998 difference is about twice that of 1950!\nIf I want to take the opposite route and see the disparities as improving, I could compare the rates of improvement in the infant mortality rates. I have two data points for each racial category, so I can fit a line to each and compare the slopes. That would give us $\\frac{6 - 26.8}{1998 - 1950} \\approx - 0.43$ for Whites and $\\frac{13.8 - 43.9}{1998 - 1950} \\approx - 0.63$ for Blacks. If I choose this metric, improvements have been substantial. The rates have gone down for Blacks about 50% faster than for Whites.\nWe could also work with the number of deaths more directly. 
In 1950, there were $43.9 - 26.8 = 17.1$ excess infant deaths per 1,000 live births amongst Blacks compared to Whites, while in 1998 there were only $13.8 - 6 = 7.8$ excess deaths per 1,000 live births. This metric could also be framed as a success \u0026ndash; thanks to improvements in disparities, we have nearly 10 fewer Black infant deaths per 1,000 live births than we would have had the disparities of the 1950s persisted.\nAnybody wanting to opine on the matter could calculate any of these measures and claim they are fairly representing the data as they see them. All of these choices reflect what in statistics would be called an estimand \u0026ndash; a particular, mathematically defined answer to a question that we seek to infer from the data using an estimator. Each estimand is related to the scientific question being asked, but makes it more precise. One issue that arises is that because the scientific question of interest is by necessity a bit vague, at least when offered initially, there are multiple valid routes to go about answering it. This leads to what Gelman termed the \u0026ldquo;Garden of Forking Paths\u0026rdquo; and the different possible answers we can get to our question from a data set.\nAnother issue to consider is proper conditioning on factors playing a role in the outcome of interest. A relatable illustration of this is the famous \u0026ldquo;80 cents on the dollar\u0026rdquo; line about the US gender pay gap. This is an overall measure of group differences often raised in a political context. While this represents true observable group differences in wages, it is often used to imply wage discrimination. Because that\u0026rsquo;s a more interesting question from a policy perspective. 
But wage discrimination is usually thought of as two individuals with the same background characteristics, say qualifications and personality traits indicative of productivity, and differing in only one aspect of interest, say race, gender, or sexual orientation, having meaningfully different wages. So to be able to infer discrimination, we have to also collect and analyze the necessary covariates that can reflect differences in qualifications and ability. Wages have high variability and different industries have very different pay-bands which are often stratified to some degree by educational achievement. And we know there are large gender gaps in fields of study. Here\u0026rsquo;s an interesting link to an overview of this issue by Bankrate. Out of the six majors listed with a median salary of $100,000 or more, only one was even close to even in enrollment by gender. Without taking differences such as these (and others) into account, the noted group difference in and of itself is not particularly enlightening.\nBack to our example on infant mortality. Since this is a medical issue, we can look at whether there are established risk factors for it whose distribution might differ between groups. Here is a list of the top 5 risk factors identified by the CDC:\nbirth defects, preterm birth and low birth weight, sudden infant death syndrome, unintentional injuries, and maternal pregnancy complications. Since each of these factors is believed to affect the infant mortality rate, it would be appropriate to include them in an analysis. After all, it\u0026rsquo;s not a given that they do not differ by racial groups and variation in risk factors may explain some of the variation in the observed difference in infant mortality rates. For example: Blacks have about twice the rate of low birth weight infants compared to Whites. Among pregnancy complications, preeclampsia is more common among Blacks than Whites. 
For other risk factors, like gestational diabetes, the rates are similar. Including these risk factors in an analysis could increase or decrease the observed disparity, but would simultaneously provide a richer picture and ultimately a better answer to our scientific question. This is particularly true if we\u0026rsquo;re asking this question from a public health perspective to guide funding allocations in an effort to alleviate disparities.\nSo all this to say that I don\u0026rsquo;t believe the data speak for themselves. Analysts use data to tell a story. And I don\u0026rsquo;t mean that maliciously. You can sincerely tell different stories with the same data. Since there are many ways of reasoning with and analyzing data, openness is key. And part of that openness is clarity on the estimands we choose, and what they imply about the specific question we\u0026rsquo;re asking. This becomes especially important when the stakes are high, such as in regulatory review of clinical trials or when policy decisions are at stake. We\u0026rsquo;re currently seeing a move to more transparency here, see for example here for the E9(R1) Addendum on estimands and sensitivity analysis in clinical trials. There\u0026rsquo;s also a relatively recent book by Mallinckrodt et al offering relevant examples from the clinical setting on picking and justifying appropriate estimands.\nThis also shows that it pays to take a second look at published data, something I\u0026rsquo;ve recently become more interested in. 
See this post by John Mandrola on the topic of reanalyzing clinical trial data and their perhaps surprising finding: in their admittedly small sample, 35% of the reanalyses of existing, published trial data led to different interpretations than the originally published article.\n","permalink":"https://dmsenter89.github.io/post/24/11-the-data-dont-speak-for-themselves/","summary":"\u003cp\u003eEver heard someone say they were \u0026ldquo;letting the data speak for itself?\u0026rdquo; I\noften encounter this phrase on the internet by someone claiming\nnot to be interpreting the data, but merely relaying facts. I don\u0026rsquo;t\nbelieve that\u0026rsquo;s actually true in the sense that it\u0026rsquo;s typically meant.\u003c/p\u003e","title":"The Data Don't Speak for Themselves"},{"content":"Most medical papers featuring statistical analysis still utilize a hypothesis testing framework. Data is collected, an analysis is run, a p-value reported, and \u0026ndash; if it is found to be below the magic threshold of 0.05 \u0026ndash; the finding is declared \u0026ldquo;(statistically) significant.\u0026rdquo; The authors then suggest, with varying degrees of explicitness, that their results support their preferred hypothesis, or \u0026ndash; in some less modest cases \u0026ndash; may go as far as claiming to have found \u0026ldquo;proof\u0026rdquo; of their preferred hypothesis. Criticism of this methodology is old, and will not be rehashed here. Suffice it to say that the p-value is typically constructed conditional on a strawman \u0026ldquo;null hypothesis\u0026rdquo; being true, and as such doesn\u0026rsquo;t provide direct evidence concerning any specific alternative hypothesis the researchers are actually interested in. 
What we\u0026rsquo;re left with is a general desire to say something along the lines of \u0026ldquo;Based on the data I have collected, I now have more/less reason to believe that my preferred hypothesis is correct.\u0026rdquo; P-values aside, there exists a measure to express this idea: it is often called the Bayes factor.1\nThe Bayes Factor and Its Bound Let\u0026rsquo;s start with the basic idea. We can frame the problem of how likely a particular model $M$ is given data $D$ that we have collected by using Bayes\u0026rsquo; theorem:\n$$\\Pr(M|D) = \\frac{\\Pr(D|M)\\Pr(M)}{\\Pr(D)}$$\nSuppose you have two competing models as explanations for your data, $M_{1}$ and $M_{2}$. If you want to know if the data favors one model over another, you could simply take that ratio:\n$$\\frac{\\Pr(M_{2}|D)}{\\Pr(M_{1}|D)} = \\underset{\\text{ Bayes Factor}}{\\underbrace{\\frac{\\Pr(D|M_{2})}{\\Pr(D|M_{1})}}} \\times \\underset{\\text{Prior Odds}}{\\underbrace{\\frac{\\Pr(M_{2})}{\\Pr(M_{1})}}}$$\nWhich model is in the numerator or denominator can be chosen by convenience. As written above, a larger value of this ratio favors $M_2$ over $M_1$, while small values of the ratio favor $M_1$ over $M_2$.\nSubstitute in your traditional notation for a null-hypothesis $H_{0}$ and alternative hypothesis $H_{1}$ and we can compare two different Bayes factors, $\\text{BF}_{10}$ which compares the alternative hypothesis to the null, or $\\text{BF}_{01}$ which compares the null to the alternative. For illustrative purposes, consider the case of\n$$\\mathrm{BF}_{10} = \\frac{1}{\\mathrm{BF}_{01}} = 10$$\nWe can interpret this to say that the data favor the alternative hypothesis 10 to 1 compared to the null. In other words, bigger values of $\\text{BF}_{10}$ correspond to greater evidence in favor of $H_{1}$ over $H_{0}$, whereas smaller values of $\\text{BF}_{01}$ favor $H_{1}$ over $H_{0}$.\nIn general, constructing a Bayes factor requires modeling. 
What we can do without modeling, however, is provide reasonable bounds for the Bayes factor - a minimum bound for $\\text{BF}_{01}$ or \u0026ndash; equivalently \u0026ndash; a maximum bound for $\\text{BF}_{10}$. Benjamin and Berger (2019) recommend a particularly simple upper bound for the Bayes factor that can be shown to hold in a wide variety of situations:\n$$\\text{BF}_{10} \\leq \\text{BFB} = \\frac{1}{-e\\,p\\log(p)}$$\nThis bound is valid for $p \u0026lt; \\frac{1}{e} \\approx 0.368$.\nAdding Bayes Factors to SAS Output There is an easy way to start adding such Bayes factor bounds to your existing SAS workflow. Did you know that you can convert any ODS output table to a SAS data set? That way you can access any reported value for later analysis. For our purposes, this means we can access any test statistic or p-value reported by SAS and use them to calculate the appropriate Bayes factor bounds. I\u0026rsquo;ll show two simple examples.\nLet\u0026rsquo;s say I want to employ a bound for the output of PROC TTEST. I can look up the relevant table name in the SAS/STAT User\u0026rsquo;s Guide in the \u0026ldquo;ODS Table Names\u0026rdquo; section under the PROC\u0026rsquo;s \u0026ldquo;Details\u0026rdquo; \u0026ndash; in this case, I want to use the table named TTests. I can then save this table to a SAS data set using ods output. To save typing, we\u0026rsquo;ll use the filename and include statements to utilize the available code for the Getting Started example in TTest:\nfilename tgs url \u0026#39;https://raw.githubusercontent.com/sassoftware/doc-supplement-statug/refs/heads/main/Examples/r-z/ttegs1.sas\u0026#39;; ods output TTests=res; /* this line exports the TTests table to \u0026#34;res\u0026#34; */ %include tgs; Looking at my new data set, I can see that the p-value is saved to a variable named Probt. 
I can now use a DATA step to calculate both the BFB and its reciprocal, the minimum Bayes factor:\ndata bayes; set res; BFB = 1/(-CONSTANT(\u0026#39;E\u0026#39;)*Probt*Log(Probt)); BFmin = 1/BFB; run; proc print; run; I\u0026rsquo;m omitting the data keyword from the PROC PRINT call to keep things concise. This way, it automatically uses the most recently created data set. If you run these two code snippets, you\u0026rsquo;ll get the following table:\nThe Bayes factor bounds from our PROC TTest example.\nThis suggests that for our particular example, the most favorable interpretation would favor the alternative hypothesis over the null by a ratio of about 5.4-to-1.\nAnother common source for p-values is regression output. Each parameter estimate is accompanied by a p-value. We can use the same procedure as above to look up the relevant ODS table name and use ODS OUTPUT to save that table for later use:\nods output ParameterEstimates=ParmEst; proc reg data=sashelp.baseball; id name team league; model logSalary = nhits nbb yrmajor crhits; run; This table contains a little more detail and a pretty large BFB, so I decided to specify which variables I want to print and added a FORMAT to the PRINT call.\ndata bayes; set ParmEst; BFB = 1/(-CONSTANT(\u0026#39;E\u0026#39;)*Probt*Log(Probt)); BFmin = 1/BFB; run; proc print noobs; var Variable Label Estimate Probt BF:; format BF: COMMA14.2; run; One thing that\u0026rsquo;s neat to notice here is that the p-value is printed by SAS using a special format. Since users are normally not interested in the exact value when it is less than $10^{-4}$, ODS just prints \u0026ldquo;\u0026lt;.0001.\u0026rdquo; This doesn\u0026rsquo;t mean SAS doesn\u0026rsquo;t calculate the exact p-value, as we can see from the data set produced with ODS OUTPUT. It stores the actual numeric value, so the BFB computation can proceed without issues. 
This is what it looks like:\nThe Bayes factor bounds from our PROC REG example.\nYou can try these code snippets out yourself using SAS OnDemand for Academics or Viya for Learners.\nThe bound used in this post is one recommendation by Benjamin and Berger (2019) to improve scientific result reporting during this time in which we\u0026rsquo;re slowly trying to move away from p-values. To learn more about alternative bounds and the conditions in which they hold, I would recommend the very readable overview of the subject by Held and Ott (2018).\nReferences Benjamin, D. J., and Berger, J. O. (2019), \u0026ldquo;Three Recommendations for Improving the Use of p-Values,\u0026rdquo; The American Statistician, ASA Website, 73, 186–191. https://doi.org/10.1080/00031305.2018.1543135.\nHeld, L., and Ott, M. (2018), \u0026ldquo;On p-Values and Bayes Factors,\u0026rdquo; Annual Review of Statistics and Its Application, Annual Reviews, 5, 393–419. https://doi.org/10.1146/annurev-statistics-031017-100307.\nThis doesn\u0026rsquo;t solve all problems with hypothesis testing. In particular, see section 7.4 in BDA3 for limitations. You may also enjoy the critique offered at DataColada.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://dmsenter89.github.io/post/24/11-from-p-values-to-bayes-factor/","summary":"Improve the interpretation of your frequentist analysis output\u0026rsquo;s strength of evidence by incorporating Bayes factor bounds using SAS.","title":"From p-Values to Bayes Factors"},{"content":"I have recently returned to work from my paternity leave. I really enjoyed my time with my youngest and am grateful to SAS for providing 8 weeks of paid paternity leave \u0026ndash; a benefit that remains uncommon in the United States. 
This precious time allowed me to bond with my youngest child and navigate the dynamic world of parenting four children whose ages span from infancy to the teenage years.\nThe playgrounds and family events I attended with my kids this summer provided me not only with a lot of joy, but also an opportunity to think about how parenting differs between my native Germany and my current home in the U.S. Overall, I\u0026rsquo;ve found that some of the cultural norms I\u0026rsquo;ll note below make parenting more burdensome and difficult than they have to be. I\u0026rsquo;ve noticed myself gradually adapting to some of these attitudes, despite trying to maintain awareness of my own ideas and ideals about parenting. But there is good reason to believe that most parenting choices do not permanently affect our children\u0026rsquo;s future in the way we\u0026rsquo;d like to think, unless we parent somewhat far outside the norm of our cultural environment.\nOne of the cultural norms I find unhelpful I mostly encounter at American playgrounds. I have observed a pattern of near-constant supervision, with parents actively engaging with their children almost non-stop. It seems like many parents can\u0026rsquo;t let 90 seconds pass without saying something to their child, whether that\u0026rsquo;s praise, a word of caution, or instructions on how to play. It\u0026rsquo;s rare to see a parent sitting back to chat with fellow adults while their children explore independently \u0026ndash; unless they\u0026rsquo;ve arrived as a prearranged group. This approach often seems odd to me and the other German expatriates I\u0026rsquo;ve spoken with.\nThis may be partially attributed to America\u0026rsquo;s obsession with child safety, particularly fears of child abduction. Parents follow their children closely, avoiding any loss of visual contact or distances of more than a few feet. 
As a personal example of this, I can think of an incident where my roughly 2-year-old playfully closed a store\u0026rsquo;s glass door with me just on the other side. Although the surrounding area was clear and we had good visibility due to the glass door and glass walls, my wife\u0026rsquo;s anxiety kicked in. What if someone ran up to snatch our child away? This echoes a common American concern and other American moms could relate to her feelings. As an immigrant, this level of concern strikes me as out of proportion, especially when considering actual child abduction statistics. According to the National Center for Missing \u0026amp; Exploited Children, there were only 366 reported cases of non-family abductions over the five years from 2016-2020, with over half involving someone known to the family. In the 0-5 age group, a mere 190 cases were reported. While each case is undoubtedly tragic, we have to weigh the actual risks to our children against the potential drawbacks of our attempts at minimizing said risks. Given that the risk of pediatric stranger abduction is lower than the risk of death by pediatric vehicular heatstroke, it seems to me it\u0026rsquo;s not worth it to impede our children\u0026rsquo;s growth toward independence and self-confidence in public by attempts to further minimize such a remote risk.\nAnother observation I\u0026rsquo;ve made is the tendency to treat children of various ages as if they were in the 7-12 age range. Toddlers face unrealistic expectations regarding emotional regulation, particularly in public, while teens often appear to be coddled, delaying their progression into adulthood.\nOverall, parenting in the U.S. appears excessively child-centric. There\u0026rsquo;s a prevailing cultural expectation to constantly entertain children, as if boredom is a calamity to be avoided at all costs. But children are good at entertaining themselves when we let them. 
The ultimate goal of parenting should be to raise self-sufficient adults, not perpetual playmates.\nA book that I\u0026rsquo;ve found helpful on these cultural differences and that explores alternative approaches is Michaeleen Doucleff’s \u0026ldquo;Hunt, Gather, Parent.\u0026rdquo; I find her writing style a bit annoying, but the book\u0026rsquo;s insights are incredibly valuable. Jeremy Kun\u0026rsquo;s blog provides an excellent summary of the key insights. I\u0026rsquo;ve condensed them below. See Jeremy\u0026rsquo;s blog for more details.\nHow to think about your child and your role as a parent: Your child \u0026ldquo;has no brain\u0026rdquo; (=is predictably irrational). Your job is to teach your child to think. Your child mirrors your energy. Speech is a stimulant. Children want to be like their parents. Actionable advice: Say less. Quiz your child on good behavior, and focus on bad outcomes. Do your own thing, and let your kid participate (but never force them). Use monsters and stories to drive values.\nBryan Caplan\u0026rsquo;s book \u0026ldquo;Selfish Reasons to Have More Kids\u0026rdquo; offers another perspective that can help alleviate the pressures of modern American parenting. Caplan in essence argues that many parents today engage in intensive parenting practices that the parents don\u0026rsquo;t enjoy (and sometimes, the kids don\u0026rsquo;t either) but that have little long-term impact on children, while overlooking simpler, more effective strategies that foster strong family bonds. He encourages parents to focus on what really matters and to keep our worries and fears about our children\u0026rsquo;s future in check.\nHere are 5 takeaways from Caplan\u0026rsquo;s book:\nRethink Parental Investment: Caplan suggests that parents often overestimate the influence of intensive parenting on their children\u0026rsquo;s outcomes. 
He encourages a more relaxed approach that doesn\u0026rsquo;t sacrifice parental happiness for marginal gains in child development.\nGenetics Play a Major Role: The book emphasizes the significant impact of genetics over upbringing. Caplan argues that nature has a stronger hand than nurture in many aspects of a child\u0026rsquo;s future, such as intelligence and personality traits.\nEnjoy Parenting More: By worrying less about optimizing every aspect of their children\u0026rsquo;s lives, parents can enjoy the experience of parenting more, reducing stress and increasing the overall happiness of the family.\nLong-term Relationship Building: Caplan advises parents to focus on cultivating a positive, long-lasting relationship with their children, as this has a profound and enduring impact on both the parents\u0026rsquo; and children\u0026rsquo;s well-being.\nConsider Having More Kids: With the understanding that parenting can be less intensive and still very successful, Caplan encourages parents to consider the benefits of having more children, such as the joys of a larger family and the support siblings can provide to each other throughout their lives.\nBoth books effectively challenge the conventional approach to modern American parenting and offer valuable insights on creating a more relaxed, independent, and I would argue ultimately healthier approach to raising future adults.\n","permalink":"https://dmsenter89.github.io/post/24/08-parenting/","summary":"\u003cp\u003eI have recently returned to work from my paternity leave. I really enjoyed my\ntime with my youngest and am grateful to SAS for providing 8 weeks of paid\npaternity leave \u0026ndash; a benefit that remains uncommon in the United States. 
This\nprecious time allowed me to bond with my youngest child and navigate the dynamic\nworld of parenting four children whose ages span from infancy to the teenage\nyears.\u003c/p\u003e","title":"Some Thoughts on Parenting"},{"content":" From the null program: Object Oriented C Skeeto\u0026rsquo;s C coding style Portable Makefiles Beating NumPy\u0026rsquo;s Matrix Multiplication in C. A list of cli-tools provided by Python. ","permalink":"https://dmsenter89.github.io/post/24/07-programming-links/","summary":"\u003col\u003e\n\u003cli\u003eFrom the \u003ca href=\"https://nullprogram.com/\"\u003enull program\u003c/a\u003e:\n\u003col\u003e\n\u003cli\u003e\u003ca href=\"https://nullprogram.com/blog/2014/10/21/\"\u003eObject Oriented C\u003c/a\u003e\u003c/li\u003e\n\u003cli\u003eSkeeto\u0026rsquo;s \u003ca href=\"https://nullprogram.com/blog/2023/10/08/\"\u003eC coding style\u003c/a\u003e\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://nullprogram.com/blog/2017/08/20/\"\u003ePortable Makefiles\u003c/a\u003e\u003c/li\u003e\n\u003c/ol\u003e\n\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://salykova.github.io/matmul-cpu\"\u003eBeating NumPy\u0026rsquo;s Matrix Multiplication\u003c/a\u003e in C.\u003c/li\u003e\n\u003cli\u003eA list of \u003ca href=\"https://www.pythonmorsels.com/cli-tools/\"\u003ecli-tools\u003c/a\u003e provided by Python.\u003c/li\u003e\n\u003c/ol\u003e","title":"Programming Links - July"},{"content":"Did you know you can get your public SSH keys via GitHub? I recently installed Ubuntu Server to VM. During the installation process, it asked for my GitHub username and then populated the authorized_keys file with my public keys. That was super nifty! My new install never needed to accept password-based logins and I didn\u0026rsquo;t have to worry about onboarding different machines\u0026rsquo; keys. 
I was curious how it worked and it turns out you can do this directly yourself via cURL:\ncurl https://github.com/${GITUSERNAME}.keys \u0026gt;\u0026gt; ~/.ssh/authorized_keys The same thing works if you have uploaded GPG keys for signing:\ncurl https://github.com/${GITUSERNAME}.gpg The GitHub API also exposes this info, but then you\u0026rsquo;ll have to process the returning JSON if you want to use it for your keys file:\ncurl https://api.github.com/users/${GITUSERNAME}/keys Neat!\n","permalink":"https://dmsenter89.github.io/post/24/04-get-ssh-keys-from-github/","summary":"\u003cp\u003eDid you know you can get your public SSH keys via GitHub? I recently installed\nUbuntu Server to VM. During the installation process, it asked for my GitHub\nusername and then populated the authorized_keys file with my public keys. That\nwas super nifty! My new install never needed to accept password-based logins\nand I didn\u0026rsquo;t have to worry about onboarding different machines\u0026rsquo; keys.\nI was curious how it worked and it turns out you can do this directly yourself\nvia cURL:\u003c/p\u003e","title":"Get SSH Keys From Github"},{"content":"The Viya 2024.04 release includes a brand new MI feature: new missing data statistics. An important choice when building an imputation model is the selection of variables to be included. One method to help in the variable selection process is the usage of summary statistics such as influx and outflux, as proposed by van Buuren. In his words: \u0026ldquo;Influx and outflux are summaries of the missing data pattern intended to aid in the construction of imputation models. Keeping everything else constant, variables with high influx and outflux are preferred. Realize that outflux indicates the potential (and not actual) contribution to impute other variables\u0026rdquo;\nThe MI statement now supports the new FLUX option. 
When specified, MI produces a table including the influx, outflux, average inbound and outbound, and FICO statistics along with a column indicating the percent of cases for which the particular variable has been observed. When ODS graphics are turned on, MI additionally produces a scatter plot of the variables\u0026rsquo; influx and outflux. For details, see the new section on Missing Data Statistics in the MI chapter of the SAS/STAT User\u0026rsquo;s Guide.\nOne thing that\u0026rsquo;s cool about this new feature for all users, not just those interested in multiple imputation, is that it gives you a complete overview of the percent of observed/missing cases for all variables \u0026ndash; both character and numeric! Previously, you either needed to use separate procedures for character and numeric variables, or expend some work to get a macro written that creates a table of both types of variables for you.\nWith this new feature, you can simply use\n/* optional: creates output ds with PctObs and PctMiss vars */ ods output Flux=Flux; /* sample code using the sashelp.heart data set */ proc mi data=sashelp.heart flux nimpute=0 displaypattern=nomeans; class _character_; var _all_; fcs; run; Note that we can include all variables in our data set with var _all_. If our data set includes character variables, we need class _character_ to label all character variables as classification variables. If you are only interested in a subset of the variables, you can of course specify them here. We use the FCS statement to accommodate classification variables and we set nimpute=0 since we don\u0026rsquo;t actually want to create imputations, just view the missing data statistics. The ods output statement is completely optional. 
It creates a data set with variables PctObs and PctMiss for every variable in the analysis that you could then further process with PROC SQL or some other method.\nIn this example, the table will look as follows:\nFor a full walkthrough of this code, see the new example in the MI chapter of the SAS/STAT User\u0026rsquo;s Guide.\n","permalink":"https://dmsenter89.github.io/post/24/04-new-mi-feature-flux/","summary":"\u003cp\u003eThe \u003ca href=\"https://go.documentation.sas.com/doc/en/pgmsascdc/v_050/pgmsaswn/n1l6ng10yj6s1an1v0rt9nj79ktc.htm\"\u003eViya 2024.04 release\u003c/a\u003e\nincludes a brand new MI feature: new missing data statistics. An important\nchoice when building an imputation model is the selection of variables to be\nincluded. One method to help in the variable selection process is the usage of\nsummary statistics such as influx and outflux, as proposed by \u003ca href=\"https://stefvanbuuren.name/fimd/missing-data-pattern.html\"\u003evan\nBuuren\u003c/a\u003e. In his\nwords: \u0026ldquo;Influx and outflux are summaries of the missing data pattern intended to\naid in the construction of imputation models. Keeping everything else constant,\nvariables with high influx and outflux are preferred. Realize that outflux\nindicates the potential (and not actual) contribution to impute other variables\u0026rdquo;\u003c/p\u003e","title":"New MI Feature: Flux Statistics"},{"content":"We live in the age of evidence-based medicine and an increasing willingness on the part of patients to review medical guidance and actively participate in their care. This includes such personal and emotional areas like pregnancy and birth. 
Some popular tools to help would-be parents include Emily Oster\u0026rsquo;s famous \u0026ldquo;Expecting Better\u0026rdquo; and evidencebasedbirth.com.\nWe\u0026rsquo;ve had our fourth child this year and each time we go through prenatal appointments, birth, and post-partum care I encounter some differences between recommendations I\u0026rsquo;ve found in the literature and what providers say or do. It sticks out to me more than other branches of medicine I\u0026rsquo;ve encountered, but that might just be a difference in the amount of experience I\u0026rsquo;ve had versus willingness to go on Google Scholar for a lit review due to personal nature of the topic. Either way, in this post I wanted to share some notes on two particular issues that I\u0026rsquo;ve encountered and what I\u0026rsquo;ve seen in the literature on those topics. Obviously I\u0026rsquo;m not a medical provider and this is not medical advice, just a literature review by a curious statistician.\nLow Amniotic Fluid Volume (AFV) Measurement Techniques Management Continuous EFM Concluding Thoughts References Low Amniotic Fluid Volume (AFV) When your little baby is in utero it is surrounded by amniotic fluid. A sufficient amount of amniotic fluid is important for the baby, as it provides a cushion and allows for nutrient and oxygen flow through the umbilical cord. When amniotic fluid levels are low, baby\u0026rsquo;s ability to receive nutrients and oxygen can be hindered. This is called oligohydramnios in the literature.\nIt is important to know that we can divide cases of low amniotic fluid volume (AFV) into two broad categories:\nthe idiopathic, that is cases where no explanation for the low fluid level is known. it happens \u0026ldquo;on its own;\u0026rdquo; this is also called isolated oligohydramnios. This is the most common. low AFV as a side effect of another pregnancy complication. This can be further divided into maternal causes, i.e. 
a complication arising in the mother, the fetus, or having a placental cause. See the StatPearls article for relevant citations on these categories. This post focuses on idiopathic oligohydramnios, or isolated oligohydramnios (IO).\nMeasurement Techniques How is IO diagnosed? Accurately measuring AFV is difficult. The most accurate technique is invasive and involves injecting a dye into the amniotic cavity and sampling the amniotic fluid to determine a dilution curve. This technique is accurate but impractical, so most AFV assessments during pregnancy are done via ultrasound measurements. Two techniques have been developed: the amniotic fluid index (AFI) and the maximum vertical pocket (MVP) technique, sometimes called single deepest pocket (SDP).\nFor MVP, a sonographer identifies the deepest pocket of amniotic fluid not including the umbilical cord or body parts and measures the length from the 12 o\u0026rsquo;clock to the 6 o\u0026rsquo;clock position. The normal range is 2-8 cm, with \u0026lt;2 cm diagnosed as oligohydramnios and \u0026gt;8 cm as polyhydramnios. For the AFI method, the uterus is divided into 4 quadrants and the MVP is obtained in each quadrant. A sum of less than 5 cm is diagnosed as oligohydramnios (Keilman and Shanks 2022).\nThere is a long-running debate on which method is better, but a Cochrane review of five trials suggested providers use MVP as opposed to AFI due to an increased number of labor inductions and Cesarean deliveries among women screened using AFI, with no concomitant improvement in perinatal outcomes (Nabhan and Abdelmoula 2008). As of 2014, ACOG followed suit and recommends the MVP method (Landon et al. 2018). But a decade later many providers continue to rely on the AFI. That was my experience with our providers in fetal-maternal medicine at Wake Forest in Winston-Salem, for example.\nUnfortunately, both MVP and AFI are known to be poor predictors of actual AFV. One paper by Hughes et al. 
(2020) for example pits the two methods against each other on women with known reference AFV. Performance of the ultrasound measurements are stratified by the true AFV classified as low, normal or high. The AUC result for both methods are disappointingly low. The AUC for low and normal volumes is estimated as being between 0.53 and 0.59. For high AFV, minor improvement is seen with the AUC around 0.62 and 0.63. As a rule of thumb, an AUC of 0.5 is roughly equivalent to randomly selecting a test result, with a minimum acceptable AUC of 0.7 being suggested for a test to have utility (Hosmer et al. 2013).\nA further complication is that an abnormal AFV is relatively rare. I\u0026rsquo;ve found estimates anywhere from 0.5-8.0% of pregnancies being affected. So what is the diagnostic value of an ultrasound reading indicating low AFV? The sensitivity for low AFV is quite poor, with Hughes giving point estimates of 15% and 7.69% for AFI and MVP, respectively. For specificity, Hughes gives point estimates of 97.54% and 99.08% for AFI and MVP. Plug the paper\u0026rsquo;s values into Bayes\u0026rsquo; theorem and prepare to be underwhelmed.\nManagement In short, we have a relatively infrequent medical condition that is looked for using a crude test with a very high false positive rate. Even if we allow for retesting after a few days, results are not necessarily \u0026ldquo;reassuring\u0026rdquo; towards the end of pregnancy due to the natural decline of AFV this far along. A study by Brace and Wolf (1989) estimates changes in amniotic fluid volume throughout pregnancy. What is interesting is that according to their measurements, AFV peaks between 32 and 34 weeks and then starts to decline. This paper is cited and reproduced in a few standard textbooks on obstetrics, so you would imagine providers are familiar with it.\nEstimated amniotic fluid at various stages of pregnancies. Based on Brace and Wolf (1989) as reproduced in Landon et al. (2020). 
Note that average AFV declines near term at all percentiles.\nThis creates a risk if an AFV assessment is made after 34 weeks and the result is towards the low end. If a provider waits a few days, chances are that the reading will be similar to or lower than the previous one for entirely natural reasons. Note also the very high variation between readings as mentioned in Landon et al. (2020). Couple that with the recent trend of interpreting the ARRIVE study as demonstrating that there is little to no risk when inducing an early-term pregnancy (cf. Carmichael and Snowden (2019) for some good thoughts, and see the abstract of ARRIVE itself for selection criteria) and we have a recipe for otherwise unnecessary medical interventions. Even before ARRIVE, medical providers routinely felt that low AFV alone is a reason for offering inductions; see for example Schwartz et al. (2009) and Keilman and Shanks (2022) and many secondary citations in other papers on this topic. But induction is not as easy a solution as it is made out to be. It can be more uncomfortable for the mother and rates of Cesareans may be higher (though this is a bit disputed in the literature). Despite this, I\u0026rsquo;ve repeatedly seen a borderline reading interpreted as an indication for an expedited induction (within a day or less), even absent other issues or stress-tests.\nJust to reiterate, these comments are regarding isolated oligohydramnios, that is, low AFV that is not a side-effect of some condition. Interventions for IO may be thought to prevent stillbirth in part due to an otherwise missed comorbidity such as fetal growth restriction, but the literature simply does not bear this out. Even for low AFV associated with a comorbidity, a relatively recent meta-analysis suggests basing practice decisions on the relevant comorbidity, and not the AFV results (Rabie et al. 
2017).\nContinuous EFM A quick summary of the issue is available from the abstract to Knupp et al. (2020):\nContinuous electronic fetal monitoring (EFM) was first introduced commercially over 50 years ago with the hope of improving perinatal outcomes during labor. However, despite the increased use of EFM, definitive improvements in perinatal outcomes have not been demonstrated. Variance in tracing interpretation and intervention has led to increased rates of cesarean and operative vaginal deliveries and perhaps increased maternal and neonatal morbidity. Since its inception, several strategies have been developed in hopes of optimizing EFM and improving these outcomes.\nProviders remain stubbornly committed to EFM despite the limitations and potential risks, while the promised improvements always remain somewhere beyond the horizon. Ironically, Knupp et al. continue their paper by recommending Bluetooth-enabled remote EFM solutions and the ever-popular \u0026ldquo;machine learning\u0026rdquo; to finally realize those longed-for improvements.\nIt seems hard to pick any particular study showing issues with EFM, because they are so common. This makes it somewhat difficult to understand why the use of continuous EFM remains so popular. For some references, see the aforementioned paper by Knupp. For a more colorful summary of the many issues with EFM, see Sartwelle and Johnston (2018). They seem to hit the nail on the head when showing that routine continuous EFM may be junk science unwilling to die.\nACOG guidelines on EFM are hidden behind a paywall, but judging by the public FAQ\u0026rsquo;s statement on fetal heart rate monitoring, it appears as though they do not recommend routine continuous EFM. 
The American Academy of Nursing does not recommend routine EFM, and does not recommend using a routine EFM as part of admission since it is linked to higher rates of continued EFM utilization.\nIt seems to me that EFM is popular because it gives providers a sense of control. Continuous feedback gives the illusion of having immediate and continued access to a fetus\u0026rsquo; wellbeing. But that is precisely why this technology is problematic. Using crude tools during fetal monitoring has been shown to lead to unnecessary interventions, all of which have associated risks. During birth, patients are in a particularly vulnerable state \u0026ndash; especially after prolonged labor \u0026ndash; and any indication of risk to a fetus is likely to evoke an emotional response, as opposed to a rational or scientific one. This in turn can lead to poor outcomes that could have otherwise been avoided.\nConcluding Thoughts In my personal experience, obstetrics providers seem to focus on narrow, short-term goals using outdated or unscientific methodology. Long-term outcomes seem less of a concern, as mother and baby are soon moved to another department after birth, so any subsequent negative outcomes aren\u0026rsquo;t \u0026ldquo;their problem\u0026rdquo; anymore. I don\u0026rsquo;t mean that this is done purposefully, but rather that the compartmentalization of hospital care and evaluation metrics for providers may inadvertently encourage this type of behavior. Additionally, I\u0026rsquo;ve been surprised several times at the datedness of a provider\u0026rsquo;s knowledge of the literature on a topic or what the result of a study was. In one case that readily comes to mind, a provider heavily pushed for a scheduled Cesarean based on the results of a study she had collaborated on. Yet, when we asked for a reference and I then looked up the study, the results actually suggested the opposite \u0026ndash; expectant management \u0026ndash; for our particular case. 
So it\u0026rsquo;s definitely worth reviewing treatment suggestions with your provider as we continue moving away from paternalistic models of care to one where patient and provider collaborate on finding the best treatment.\nReferences Brace, R. A., and Wolf, E. J. (1989), “Normal amniotic fluid volume changes throughout pregnancy,” American Journal of Obstetrics and Gynecology, Elsevier BV, 161, 382–388. https://doi.org/10.1016/0002-9378(89)90527-9.\nCarmichael, S. L., and Snowden, J. M. (2019), “The ARRIVE Trial: Interpretation from an Epidemiologic Perspective,” Journal of Midwifery \u0026amp; Women’s Health, Wiley, 64, 657–663. https://doi.org/10.1111/jmwh.12996.\nGrobman, W. A., Rice, M. M., Reddy, U. M., Tita, A. T. N., Silver, R. M., Mallett, G., Hill, K., Thom, E. A., El-Sayed, Y. Y., Perez-Delboy, A., Rouse, D. J., Saade, G. R., Boggess, K. A., Chauhan, S. P., Iams, J. D., Chien, E. K., Casey, B. M., Gibbs, R. S., Srinivas, S. K., Swamy, G. K., Simhan, H. N., and Macones, G. A. (2018), “Labor Induction versus Expectant Management in Low-Risk Nulliparous Women,” New England Journal of Medicine, 379, 513–523. https://doi.org/10.1056/NEJMoa1800566.\nHughes, D. S., Magann, E. F., Whittington, J. R., Wendel, M. P., Sandlin, A. T., and Ounpraseuth, S. T. (2019), “Accuracy of the Ultrasound Estimate of the Amniotic Fluid Volume (Amniotic Fluid Index and Single Deepest Pocket) to Identify Actual Low, Normal, and High Amniotic Fluid Volumes as Determined by Quantile Regression,” Journal of Ultrasound in Medicine, Wiley, 39, 373–378. https://doi.org/10.1002/jum.15116.\nKeilman, C., and Shanks, A. L. (2022), “Oligohydramnios,” in StatPearls [Internet], Treasure Island, FL: StatPearls Publishing. Available at https://www.ncbi.nlm.nih.gov/books/NBK562326/.\nKnupp, R. J., Andrews, W. W., and Tita, A. T. N. (2020), “The future of electronic fetal monitoring,” Best Practice \u0026amp; Research Clinical Obstetrics \u0026amp; Gynaecology, Elsevier BV, 67, 44–52. 
https://doi.org/10.1016/j.bpobgyn.2020.02.004.\nLandon, M. B., Driscoll, D. A., Jauniaux, E. R. M., Galan, H. L., Grobman, W. A., and Berghella, V. (2018), Gabbe’s Obstetrics Essentials: Normal and Problem Pregnancies, Philadelphia, PA: Elsevier, p. 496.\nLandon, M. B., Galan, H. L., Jauniaux, E. R. M., Driscoll, D. A., Berghella, V., Grobman, W. A., Kilpatrick, S. J., and Cahill, A. G. (2020), Gabbe’s Obstetrics: Normal and Problem Pregnancies, Philadelphia, PA: Saunders, p. 1280.\nLevin, G., Rottenstreich, A., Tsur, A., Cahan, T., Shai, D., and Meyer, R. (2020), “Isolated oligohydramnios – should induction be offered after 36 weeks?,” The Journal of Maternal-Fetal \u0026amp; Neonatal Medicine, Informa UK Limited, 35, 4507–4512. https://doi.org/10.1080/14767058.2020.1852546.\nNabhan, A. F., and Abdelmoula, Y. A. (2008), “Amniotic fluid index versus single deepest vertical pocket as a screening test for preventing adverse pregnancy outcome,” Cochrane Database of Systematic Reviews, Wiley, 2010. https://doi.org/10.1002/14651858.cd006593.pub2.\nPatrelli, T. S., Gizzo, S., Cosmi, E., Carpano, M. G., Di Gangi, S., Pedrazzi, G., Piantelli, G., and Modena, A. B. (2012), “Maternal Hydration Therapy Improves the Quantity of Amniotic Fluid and the Pregnancy Outcome in Third‐Trimester Isolated Oligohydramnios: A Controlled Randomized Institutional Trial,” Journal of Ultrasound in Medicine, Wiley, 31, 239–244. https://doi.org/10.7863/jum.2012.31.2.239.\nRabie, N., Magann, E., Steelman, S., and Ounpraseuth, S. (2017), “Oligohydramnios in complicated and uncomplicated pregnancy: a systematic review and meta-analysis,” Ultrasound in Obstetrics \u0026amp; Gynecology, Wiley, 49, 442–449. https://doi.org/10.1002/uog.15929.\nSartwelle, T., and Johnston, J. (2018), “Continuous Electronic Fetal Monitoring during Labor: A Critique and a Reply to Contemporary Proponents,” The Surgery Journal, Georg Thieme Verlag KG, 04, e23–e28. 
https://doi.org/10.1055/s-0038-1632404.\nSchwartz, N., Sweeting, R., and Young, B. K. (2009), “Practice patterns in the management of isolated oligohydramnios: a survey of perinatologists,” The Journal of Maternal-Fetal \u0026amp; Neonatal Medicine, Informa UK Limited, 22, 357–361. https://doi.org/10.1080/14767050802559103.\n","permalink":"https://dmsenter89.github.io/post/24/04-obs-on-obstetrics-and-ebm/","summary":"\u003cp\u003eWe live in the age of evidence-based medicine and an increasing willingness on the part of patients to review medical guidance and actively participate in their care. This includes such personal and emotional areas as pregnancy and birth. Some popular tools to help would-be parents include Emily Oster\u0026rsquo;s famous \u0026ldquo;Expecting Better\u0026rdquo; and \u003ca href=\"https://evidencebasedbirth.com/\"\u003eevidencebasedbirth.com\u003c/a\u003e.\u003c/p\u003e","title":"Observations on Obstetrics and EBM"},{"content":"This weekend I found an interesting new preprint by Carlin and Moreno-Betancur on arxiv titled \u0026ldquo;On the Uses and Abuses of Regression Models\u0026rdquo; so I had to check it out.\nThe article focuses on medical literature, where regressions \u0026ndash; even in my experience \u0026ndash; often seem done almost automatically and then interpreted depending on the desired question as opposed to with respect to model construction. \u0026ldquo;Garbage can\u0026rdquo; regressions to find \u0026ldquo;important risk factors\u0026rdquo; abound, as do repeat fittings of simple models in an attempt to describe a joint distribution. One of my favorite examples to show in class of the issues with the latter is a 2008 paper by Wang et al. that to date has been cited more than 1,000 times. The topic of the paper is an analysis of NHANES data with the aim of predicting the prevalence of obesity in the US. 
They desire to describe how different subgroups of Americans, that is the different genders and ethnicities, fare. Instead of fitting a joint model, they fit multiple linear models. This leads to fun results in their Table 2, where all Americans of all races and ethnicities will be obese by 2048, yet all men won\u0026rsquo;t be obese until 2051. Mexican-American men fare the best, as they escape being part of all Americans somehow and won\u0026rsquo;t reach 100% prevalence until 2126.\nCarlin and Moreno-Betancur describe these and other issues they encountered in the literature. One I find notable is what they call the \u0026ldquo;true model myth.\u0026rdquo; Essentially, this is the idea that the \u0026ldquo;best\u0026rdquo; fitted model represents the data generating process, ergo the coefficients are easily interpreted in a causal manner so we can derive practice recommendations from these models without much discussion. That is of course not accurate.\nOverall, Carlin and Moreno-Betancur propose a simple classification scheme for the different purposes of a regression analysis:\n\u0026ldquo;descriptive:\u0026rdquo; characterise the distribution of a feature or health outcome in a population, \u0026ldquo;predictive:\u0026rdquo; produce a model or algorithm for predicting future values given certain predictors, \u0026ldquo;causal:\u0026rdquo; investigate the extent to which a health outcome in some population would be different if a particular intervention were made. Since they understand the problem with the misuse and misinterpretation of regression to (at least partially) be due to a certain vagueness with respect to the purpose for which the regression is fit, they propose a teaching framework centered around these types of research questions. This is in opposition to the more \u0026ldquo;traditional\u0026rdquo; focus on the \u0026ldquo;maths\u0026rdquo; of the problem. 
In other words, focus more on using standard tools to answer specific questions, rather than on learning how to do simple problems \u0026ldquo;by hand\u0026rdquo; and then hoping that eventually people will figure out the hard parts of applying the theory to real life on their own.\nOverall, I think the paper is a worthwhile read. It reminds me of two other books I like that take a model-centric approach to teaching regression methods \u0026ndash; McElreath\u0026rsquo;s \u0026ldquo;Statistical Rethinking\u0026rdquo; and \u0026ldquo;Regression and Other Stories\u0026rdquo; by Gelman, Hill, and Vehtari. Gelman also posted about this preprint on his blog. Check out the discussion section there for other good insights and some additional reading material on this topic.\n","permalink":"https://dmsenter89.github.io/post/24/03-takeaways-uses-and-abuses-of-regression/","summary":"\u003cp\u003eThis weekend I found an interesting new preprint by Carlin and Moreno-Betancur\non arxiv titled \u003ca href=\"https://arxiv.org/abs/2309.06668\"\u003e\u0026ldquo;On the Uses and Abuses of Regression\nModels\u0026rdquo;\u003c/a\u003e so I had to check it out.\u003c/p\u003e\n\u003cp\u003eThe article focuses on medical literature, where regressions \u0026ndash; even in my\nexperience \u0026ndash; often seem done almost automatically and then interpreted\ndepending on the \u003cem\u003edesired\u003c/em\u003e question as opposed to with respect to model\nconstruction. \u003ca href=\"https://doi.org/10.1080/07388940500339167\"\u003e\u0026ldquo;Garbage can\u0026rdquo;\nregressions\u003c/a\u003e to find \u0026ldquo;important risk\nfactors\u0026rdquo; abound, as do repeat fittings of simple models in an attempt to\ndescribe a joint distribution. One of my favorite examples to show in class of\nthe issues with the latter is a 2008 paper by \u003ca href=\"https://doi.org/10.1038/oby.2008.351\"\u003eWang et al.\u003c/a\u003e that to date has been cited more than\n1,000 times. 
The topic of the paper is an analysis of NHANES data with the aim\nof predicting the prevalence of obesity in the US. They desire to describe how\ndifferent subgroups of Americans, that is the different genders and ethnicities,\nfare. Instead of fitting a joint model, they fit multiple linear models. This\nleads to fun results in their Table 2, where \u003cem\u003eall\u003c/em\u003e Americans of \u003cem\u003eall\u003c/em\u003e races and\nethnicities will be obese by 2048, yet \u003cem\u003eall men\u003c/em\u003e won\u0026rsquo;t be obese until 2051.\nMexican-American men fare the best, as they escape being part of \u003cem\u003eall Americans\u003c/em\u003e\nsomehow and won\u0026rsquo;t reach 100% prevalence until 2126.\u003c/p\u003e","title":"Takeaways from 'On the uses and abuses of regression models'"},{"content":"The statistics literature is filled with example code and sample data in R. Sometimes I find myself wanting to work through some provided sample data and compare the output from R with SAS code. In this post, I\u0026rsquo;ll show how to connect R and SAS so that you can load and execute R code straight from within SAS.\nSetup In order to use this feature, you will want to have both R and SAS/IML installed on the same computer. Make sure both SAS and R are in your path. In order to call R code from SAS, you will need to start SAS with the rlang option. You can call SAS from the command line with the -rlang option or you can add the option in your \u0026ldquo;sasv9.cfg\u0026rdquo; file.\nOnce SAS is started, you can verify that the setup worked by running\nproc options option=rlang; run; The log will list RLANG if the option was specified. If you forgot to add the option prior to startup, you\u0026rsquo;ll see NORLANG in the log instead.\nUsage R code can be called from within IML via a submit statement. The basic structure is this:\nproc iml; submit / R; /* R code here */ endsubmit; quit; With this we can run R code from within SAS. 
But the real power comes from our ability to move data between R and SAS. The following functions are available:\nExportDatasetToR(\u0026quot;libname.dsname\u0026quot;, RDataFrame); ExportMatrixToR(IMLMatrix, RMatrix); ImportDataSetFromR(\u0026quot;libname.dsname\u0026quot;, r-expr) ImportMatrixFromR(IMLMatrix, r-expr) Parameters can be passed to R as well, similar to how parameters can be passed from IML to SAS PROCs.\nFor more details, see the SAS/IML manual.\n","permalink":"https://dmsenter89.github.io/post/23-12-calling-r-from-sas/","summary":"\u003cp\u003eThe statistics literature is filled with example code and sample data in R. Sometimes I\nfind myself wanting to work through some provided sample data and compare the output from\nR with SAS code. In this post, I\u0026rsquo;ll show how to connect R and SAS so that you can load and\nexecute R code straight from within SAS.\u003c/p\u003e","title":"Calling R From SAS"},{"content":"For a while now I\u0026rsquo;ve recycled an old iMac running Fedora Workstation as a simple homeserver. It\u0026rsquo;s been working well in the past, but only now with the EOL of Fedora 37 did I get around to updating from Fedora 36 to Fedora 38.\nUnfortunately for my use case, there is an unannounced change in Gnome\u0026rsquo;s power settings when moving to Fedora 38 from 37. See the fedora forum for a description of the issue and a suggested fix. Unfortunately, for me at least, the suggested fix doesn\u0026rsquo;t work and my system would still suspend, even with it disabled via the suggested commands. I had to use gsettings directly from my user account in addition to the linked fix to get it to work and persist.\nOnly a couple of weeks after getting this fixed, that workstation suffered a hard drive failure that caused it to become unbootable. I have since scrapped that workstation and bought a Beelink mini-pc for about $200. 
I\u0026rsquo;m running Arch Linux on it.\nI had previously used Arch extensively as my daily driver and for various other machines, and never had problems with it. My recent Fedora experience really drove me back to committing to Arch. Does Arch take longer to install and set up than Fedora? Sure. But it\u0026rsquo;s well documented and you actually understand the system you have and what packages are on it. It\u0026rsquo;s actually a transparent system, unlike the convenience-packaged distros with all their preinstalled stuff. You see so many people online worried about Arch not being \u0026ldquo;stable,\u0026rdquo; but in over ten years with different desktops and laptops running on Arch I haven\u0026rsquo;t had any major issues. Truth be told, I\u0026rsquo;ve had fewer issues with Arch than in the past five years with Ubuntu and Fedora systems.\nA particular example of something nice about Arch Linux: I have a ten year old workstation running on Arch that hadn\u0026rsquo;t booted in 4 years. The only difficulty in getting it up and running again was updating the archlinux-keyring for pacman. That didn\u0026rsquo;t take long and was well-documented. Since then, it\u0026rsquo;s humming along great with more up-to-date software than recent Ubuntu releases. Updating Fedora from 36 to 38 on the other hand was much more of a hassle. I had multiple reboots, hunting for what default packages changed, what changed without being documented, etc. My older Arch system got up and running in no time after scanning through the Arch Linux news for any required manual interventions that popped up in the past few years. Arch for the win.\n","permalink":"https://dmsenter89.github.io/post/23-12-automatic-suspend-in-fedora-38/","summary":"\u003cp\u003eFor a while now I\u0026rsquo;ve recycled an old iMac running Fedora Workstation as a simple homeserver. 
It\u0026rsquo;s been working well in the past, but only now with the EOL of Fedora 37 did I get around to updating from Fedora 36 to Fedora 38.\u003c/p\u003e","title":"Automatic Suspend in Fedora 38"},{"content":"Whenever I speak with students, I emphasize the need to share as much code and data as is feasible to enable reproducibility. The fact that a large amount of research is not reproducible is a big issue that has gotten a lot of traction in the past two decades since Ioannidis published his influential paper. Given these issues, it is important to try to minimize other sources of variation in the process that could lead to reproducibility problems, such as choices in how to conduct statistical tests or how data is prepared. The many variations of the basic research problem are something Gelman has termed the garden of forking paths.\nThis week I came across a paper by Ostropolets et al (2023) that really exemplifies this. The short version is this: ask 54 researchers across 9 teams to reproduce the cohort used in a well-documented paper in their field from the same source data set and compare the outcomes to both the original paper and a “master implementation” that was recreated with one of the original authors of the reference paper. All researchers had access to the same tools and data set.\nThe result? Substantial variations in the final data set that was selected. Only four out of ten inclusion criteria fully aligned with the reference implementation. Note that this is just a cohort selection problem; they did not attempt to reproduce other steps from the paper.\nThis goes to show how important it is to share source code in order to achieve reproducibility. If cohort selection had been done programmatically and the source code shared, we would have greater assurances that future teams trying to work with this data would be able to reproduce these findings and build on them. 
As the paper puts it:\nIn this regard, if we truly aspire to reproducible science, we should not hope that good documentation is sufficient and tolerate optional sharing of code, but rather make code sharing a hard requirement that can be complemented by free text descriptions.\nSee citation below for the paper\u0026rsquo;s DOI. Gelman shared a PDF of this paper on his blog.\nIn a similar vein, he shared a paper by Menkveld et al. (2022) in which several teams attempt to reproduce hypothesis tests based on the same data set, again leading to substantial variations.\nA solution to this issue is regular sharing of both data and code, as much as is feasible. In medical research in particular there are questions of confidentiality that we need to be concerned with, but this shouldn’t stop us from making de-identified data available either via a supplement or having them saved and ready should a third party request them. This of course requires some coordination amongst the study authors. Too often I have seen researchers not be organized enough to be able to recover the original data set or steps in creating an analysis they themselves did when this was needed some years after the original study. My advice for this is to keep all information pertaining to a particular paper in one folder as you work on it, preferably together with the paper draft itself so it will be obvious to future-you what paper those source files go with. When the study is done, just archive that folder and keep a cold copy on an external drive as well as on a storage server. This is often provided by the university free of charge. If multiple studies use the same data set, you\u0026rsquo;ll end up with multiple copies of that data using this method, but in my opinion that\u0026rsquo;s not a problem - storage is cheap these days, particularly cold storage. And if there are concerns about data privacy with your university\u0026rsquo;s storage, you can always encrypt the archive prior to uploading. 
GPG is your friend here.\nReferences Ioannidis, J. P. A. (2005), \u0026ldquo;Why Most Published Research Findings Are False,\u0026rdquo; PLOS Medicine, 2(8):e124, DOI: 10.1371/journal.pmed.0020124.\nMenkveld, A. J. et al (2022), \u0026ldquo;Non-standard errors\u0026rdquo;, available online at https://wrap.warwick.ac.uk/176566/1/WRAP-non-standard-errors-Kozhan-2023.pd.\nOstropolets, A. et al. (2023), \u0026ldquo;Reproducible variability: assessing investigator discordance across 9 research teams attempting to reproduce the same observational study,\u0026rdquo; Journal of the American Medical Informatics Association, Oxford University Press (OUP), 30, 859–868. DOI: 10.1093/jamia/ocad009.\n","permalink":"https://dmsenter89.github.io/post/23-09-reproducibility-by-sharing-code/","summary":"Whenever I speak with students, I emphasize the need to share as much code and data as is feasible to enable reproducibility. The fact that a large amount of research is not reproducible is a big issue that has gotten a lot of traction in the past two decades since Ioannidis published his influential paper.","title":"Reproducibility by Sharing Code"},{"content":"A non-technical friend recently asked me for help with a merge problem. They had two separate data pulls of electronic medical records based on specific study parameters. The set of people in the database who fit the study parameters changed in between the data pulls, for example by having people age into or out of a study, or by having new diagnoses added to their records that cause them to either be newly included or excluded. Let\u0026rsquo;s call the older data set A and the newer data set B. The goal was to get all those entries from B that don\u0026rsquo;t also show up in A. The data sets were pulled by a staff data scientist at that company who, despite their title, said they couldn\u0026rsquo;t figure out how to remove those entries from B that were already in A. 
Barring any special circumstances, this is a fairly standard problem so let\u0026rsquo;s look at a couple of tools we could use to solve it.\nLet\u0026rsquo;s start with some made-up sample data:\nMinimal sample data for demonstration purposes.\nThe MRN here stands for \u0026ldquo;medical record number,\u0026rdquo; a common unique identifier present in clinical data sets. Each of our data sets has three rows, but only one row is shared between both - that associated with MRN 79602. We could theoretically merge on multiple columns or coalesce data if we think some missing fields might have been updated in the meantime, but for purposes of this example we\u0026rsquo;ll keep it simple and just merge on the MRN.\nSQL Merge Types There are four basic types of merge: left join, right join, outer join, and inner join. There\u0026rsquo;s also the cross join but that one shows up less frequently in my experience. A picture speaks a thousand words, so here\u0026rsquo;s a Venn diagram illustrating the idea behind these joins.\nStandard SQL Joins.\nIn our case, we actually want left/right \u0026ldquo;inner\u0026rdquo; or \u0026ldquo;exclusive\u0026rdquo; joins, like this:\nImplementations I figured I would go over three basic tools: SAS, SQL, and Pandas.\nOnly in A For starters, we want all entries in $A$ that are not also in $B$. In set notation that is the set denoted $A-B$ (sometimes $A\\backslash B$). Merges like this are what SQL excels at, so let\u0026rsquo;s see the SQL statement first:\ncreate table res1 as select A.* from A left join B on A.MRN=B.MRN where B.MRN is NULL; This should run in any typical SQL implementation, including PROC SQL in SAS and SQLite3. We expect the following table as output:\nMRN Weight Chol_Status 23356 140 74592 Desirable To do a left outer join instead, we would just omit the where clause. 
We could do the same with a data step merge statement, but unlike SQL this would assume our input data sets are sorted by the merge key:\ndata res1; merge A (IN = X) B (IN=Y); by MRN; If X and not Y; run; Pandas\u0026rsquo; merge statement allows for the creation of an indicator variable, similar to the in keyword used in the SAS data step merge. That indicator will tell us if the particular row is present in both the left and the right tables (value both), or just in one of them (values left_only and right_only). We can then query on that indicator variable to subset:\nres1 = (pd.merge(A, B, how=\u0026#39;outer\u0026#39;, indicator=True) .query(\u0026#39;_merge==\u0026#34;left_only\u0026#34;\u0026#39;) .drop(\u0026#39;_merge\u0026#39;, axis=1)) Only in B Same idea, but reversed: $B-A$. The implementation is identical except that we are using a right join instead:\ncreate table res2 as select B.* from A right join B on A.MRN=B.MRN where A.MRN is NULL; Expected output:\nMRN Weight Chol_Status 64836 129 High 2466 127 Borderline It is interesting to note that SQLite, at least as of 3.37.2, still doesn\u0026rsquo;t support right joins, so if you\u0026rsquo;re using that you\u0026rsquo;ll just want to use the left join method above but switch the A and B around. The data step implementation is also straightforward:\ndata res2; merge A (IN = X) B (IN=Y); by MRN; If Y and not X; run; as is the Pandas version:\nres2 = (pd.merge(A, B, how=\u0026#39;outer\u0026#39;, indicator=True) .query(\u0026#39;_merge==\u0026#34;right_only\u0026#34;\u0026#39;) .drop(\u0026#39;_merge\u0026#39;, axis=1)) What\u0026rsquo;s in common? Finally, you might be curious to see which rows both data sets have in common, that is $A \\cap B$. 
That\u0026rsquo;s a simple inner join:\ncreate table res3 as select A.* from A inner join B on A.MRN=B.MRN; Expected output:\nMRN Weight Chol_Status 79602 139 High In SAS:\ndata res3; merge A (IN = X) B (IN=Y); by MRN; If X and Y; run; and in Pandas:\nres3 = (pd.merge(A, B, how=\u0026#39;outer\u0026#39;, indicator=True) .query(\u0026#39;_merge==\u0026#34;both\u0026#34;\u0026#39;) .drop(\u0026#39;_merge\u0026#39;, axis=1)) And that\u0026rsquo;s it. All that\u0026rsquo;s left to do is to save the data in a format your customer or colleagues can work with and we\u0026rsquo;re done.\n","permalink":"https://dmsenter89.github.io/post/23-09-basic-sql-joins/","summary":"\u003cp\u003eA non-technical friend recently asked me for help with a merge problem. They had two separate data pulls of electronic medical records based on specific study parameters. The set of people in the database who fit the study parameters changed in between the data pulls, for example by having people age into or out of a study, or by having new diagnoses added to their records that cause them to either be newly included or excluded. Let\u0026rsquo;s call the older data set A and the newer data set B. The goal was to get all those entries from B that don\u0026rsquo;t also show up in A. The data sets were pulled by a staff data scientist at that company who, despite their title, said they couldn\u0026rsquo;t figure out how to remove those entries from B that were already in A. Barring any special circumstances, this is a fairly standard problem so let\u0026rsquo;s look at a couple of tools we could use to solve it.\u003c/p\u003e","title":"Some Basic SQL Joins"},{"content":"In Chapter 3 of van Buuren\u0026rsquo;s Flexible Imputation of Missing Data a variety of methods for imputing univariate missing data are presented. 
This post will summarize these techniques and show how to implement them in SAS.\nPreliminaries Imputing under a Normal Linear Model Regression Imputation Stochastic Regression Imputation Bayesian/Bootstrap Multiple Imputation What if my data are non-normal? Predictive Mean Matching Classification and Regression Trees The Propensity Score Method Categorical and Count Data The Logistic and Logit Models The Discriminant Function Method References Preliminaries Van Buuren demonstrates various techniques using data set 88 from Hand et al (1994). This data set is available from R\u0026rsquo;s MASS library as data(\u0026quot;whiteside\u0026quot;). The original data set can be downloaded from the publisher\u0026rsquo;s website. The name of the relevant data file is INSULATE.DAT. If you want to follow along using SAS, you can use this data step. It matches the way the data appears in R except that I have added a variable R indicating the observation that van Buuren deletes for demonstration purposes.\nFor purposes of this post, we assume one or more predictors $x$ are completely observed, while some variable of interest $y$ is only partially observed. Methods for dealing with this type of problem are available using the monotone keyword in PROC MI. A data set has a monotone missing pattern if it consists of variables $Y_1$, $Y_2$, $\\ldots$, $Y_p$ such that if $Y_j$ is missing for one individual, all subsequent variables $Y_k$ for $j \u0026lt; k \\leq p$ are also missing. Schematically, the data set will look like this:\n$$R = \\begin{bmatrix} 1 \u0026amp; 1 \u0026amp; 1 \\\\ 1 \u0026amp; 1 \u0026amp; 0 \\\\ 1 \u0026amp; 0 \u0026amp; 0 \\end{bmatrix}$$\nwhere 1 indicates an observed value and 0 a missing value. The monotone statement in SAS can impute missing values by completing the columns in turn using univariate methods. 
See the SAS documentation for specifics.\nImputing under a Normal Linear Model For completeness, I will mention all of the main linear model methods van Buuren mentions in his text, even though the first two are not implemented in PROC MI.\nRegression Imputation Van Buuren also refers to this as the \u0026ldquo;prediction\u0026rdquo; method. In essence, the complete cases are used to create a linear model. This linear model is then used to fill in the missing values:\n$$ \\dot{y} = \\hat\\beta_0 + X_\\text{mis}\\,\\hat\\beta_1$$\nwhere $\\hat\\beta_i$ are least squares estimates.\nThis method has a variety of drawbacks. For one, it artificially strengthens the relationships between variables as they appear in the linear model by increasing correlations. Variability in the data is reduced. See section 1.3.4 in van Buuren for details.\nThe mice package implements this method as norm.predict. PROC MI does not implement this method; to use this technique in SAS, you could use the regression PROCs or IML.\nStochastic Regression Imputation This method proceeds as above, except that Gaussian noise is added to the imputed value: $$ \\dot{y} = \\hat\\beta_0 + X_\\text{mis}\\,\\hat\\beta_1 + \\dot\\epsilon$$ where $\\dot\\epsilon \\sim N(0, \\hat\\sigma^2)$. An advantage of this method over plain regression is that it can preserve correlation between variables.\nThe mice package implements this method as norm.nob. 
It is not available in PROC MI but can be implemented with IML.\nBayesian/Bootstrap Multiple Imputation Van Buuren also refers to this as \u0026ldquo;predict + noise + parameters uncertainty.\u0026rdquo; This technique is based on a Bayesian linear regression using draws from the posterior as parameters: $$\\dot y = \\dot\\beta_0 + X_\\text{mis}\\, \\dot\\beta_1 + \\dot\\epsilon$$ where $\\dot\\epsilon\\sim N(0,\\dot\\sigma^2)$ and $\\dot\\beta_i$, $\\dot\\sigma$ are random draws from the posterior distribution.\nThis is the default method in PROC MI for continuous data. Both SAS and mice use an algorithm based on Rubin (1987, pp. 166-167). See the SAS documentation and Algorithm 3.1 in van Buuren for details. The mice package implements this method as norm. Here is an example of how the Bayesian regression can be used in PROC MI:\nproc mi data=whiteside_miss out=regimp nimpute=5; var temp gas; monotone regression(gas); run; The regression keyword may be abbreviated as reg. A fully worked example is available in the SAS documentation, with the associated code available on GitHub.\nThe mice package also implements a variant of this method using bootstrapping instead of a Bayesian model. This method is available as norm.boot.\nWhat if my data are non-normal? In case the data are non-normal, one could proceed to a non-regression technique like predictive mean matching. Alternatively, one could adjust the regression methods to utilize a generalized linear model instead. That technique is implemented in the ImputeRobust package for R. See section 3.3 in van Buuren for details.\nPredictive Mean Matching Similar to Bayesian regression above, a predicted value is calculated for each missing observation. Instead of adding noise to this prediction, however, a set of $k$ observations whose predicted values are close to the predicted missing value are sought. The missing value is then replaced by a random draw from this set of candidate donors. 
In mice, this method is available as pmm. In PROC MI, you can use the regpredmeanmatch keyword:\nproc mi data=whiteside_miss out=regimp nimpute=5; var temp gas; monotone regpredmeanmatch(gas); run; The keyword regpredmeanmatch may be abbreviated as regpmm.\nThe predictive mean matching method is robust to transformations of the target variable. It may be used with both continuous and discrete data and will always generate realistic data in the sense that all generated data has been observed. Since this does not require an explicit model to describe the distribution of missing values, it is more resilient to model misspecification.\nSee the SAS documentation and section 3.4 in van Buuren for details.\nClassification and Regression Trees This is an idea borrowed from the machine learning community and implemented in some R packages. In essence, the idea is similar to utilizing linear regression models except that a regression tree is utilized instead. See section 3.5 in van Buuren.\nThe Propensity Score Method With this method, propensity scores are generated for each observation estimating the probability of it being missing. The observations are then grouped by their propensity scores and an approximate Bayesian bootstrap imputation is carried out for each group. See SAS documentation for details.\nA fully worked example is available in the SAS documentation, with the associated code available on GitHub:\nproc mi data=Fish1 out=outex2; monotone propensity; var Length1 Length2 Length3; run; This method is not implemented in the mice package.\nCategorical and Count Data The Logistic and Logit Models Logit based regression models can be used both for nominal and ordinal data. The imputed value is generated from a Bayesian logistic regression model. The mice package implements this method as logreg. PROC MI uses the logistic keyword. An example of its usage is given in the SAS documentation, with the associated code available on GitHub. 
Here\u0026rsquo;s the example code:\nproc mi data=Fish2 out=outex4; class Species; monotone reg(Width/ details) logistic(Species = Length Width Length*Width/ details); var Length Width Species; run; This imputes the width variable using the Bayesian linear regression while imputing the categorical species variable using the logistic regression method.\nSee the SAS documentation for details.\nThe Discriminant Function Method This method is the default for categorical data in PROC MI. See the SAS documentation for details.\nA fully worked example is available in the SAS documentation, with the associated code available on GitHub. Here is the MI call:\nproc mi data=Fish2 out=outex5; class Species; monotone discrim(Species = Length Width/ details); var Length Width Species; run; The mice package does not implement this method.\nReferences Hand, D. J., F. Daly, A. D. Lunn, K. J. McConway, and Ostrowski, E. (1994), A Handbook of Small Data Sets, London: Chapman \u0026amp; Hall.\nRubin, D. B. (1987), Multiple Imputation for Nonresponse in Surveys, New York: John Wiley \u0026amp; Sons.\nvan Buuren, S. (2018), Flexible Imputation of Missing Data, Chapman and Hall/CRC interdisciplinary statistics series, Boca Raton: CRC Press, Taylor and Francis Group. Available at https://stefvanbuuren.name/fimd/.\n","permalink":"https://dmsenter89.github.io/post/23-08-univariate-mi/","summary":"\u003cp\u003eIn Chapter 3 of van Buuren\u0026rsquo;s \u003cem\u003eFlexible Imputation of Missing Data\u003c/em\u003e a variety of methods for imputing univariate missing data are presented. This post will summarize these techniques and show how to implement them in SAS.\u003c/p\u003e","title":"Univariate Missing Data with PROC MI"},{"content":"I love simple CLI tools and am a big fan of the Unix philosophy. Recently I came across The Dam, a public Unix server that implements a clever tool they termed suc - the Simple Unix Chat. 
Essentially, it applies the Unix philosophy to create a simple chat tool that can be used on any modern Unix server. The key code consists of just a few lines of Bash code. Check out the documentation for it here.\n","permalink":"https://dmsenter89.github.io/post/23-07-suc-a-unix-slack-clone/","summary":"\u003cp\u003eI love simple CLI tools and am a big fan of the \u003ca href=\"https://en.wikipedia.org/wiki/Unix_philosophy\"\u003eUnix philosophy\u003c/a\u003e.\nRecently I came across \u003ca href=\"https://the-dam.org/\"\u003eThe Dam\u003c/a\u003e, a public Unix server that implements a clever tool\nthey termed suc - the Simple Unix Chat. Essentially, it applies the Unix philosophy to create a simple chat tool\nthat can be used on any modern Unix server. The key code consists of just a few lines of Bash code.\nCheck out the documentation for it \u003ca href=\"https://the-dam.org/docs/explanations/suc.html\"\u003ehere\u003c/a\u003e.\u003c/p\u003e","title":"SUC - a Slack Clone for Modern Unix"},{"content":"This post will introduce three GNU tools to help you explore your C code: ctags, cscope, and cflow. The first two can help you navigate your code as you work on it and can be used directly within Vim. Cflow on the other hand produces control charts that help you get to know the control flow in a project, which is particularly helpful if you are new to the codebase.\nctags In short, ctags is a program that can generate a file listing C symbols in a way that can be used by Vim (ctags) or by Emacs (etags). Various versions exist. See this wiki page for some links. A current maintained version of universal ctags can be found on GitHub. Universal ctags expands on the original ctags by including support for additional languages.\nThe first step to using a ctags file is to generate one for your source code. Just run ctags followed by the location of your source files. 
If you have multiple directories, you can list them sequentially like this:\n$ ctags h/* src/* This will generate a \u0026ldquo;tags\u0026rdquo; file in the current folder. If you open Vim from the same folder, the tags file is automatically loaded. What is particularly useful about the tag file is that saved keywords are addressed by patterns, not line numbers. This way, minor edits don\u0026rsquo;t require ctags to be re-run.\nBasic Usage To find the definition of a C symbol in your source code, put your cursor on the symbol and press \u0026lt;Ctrl-]\u0026gt; to jump to that symbol\u0026rsquo;s definition. To get back to where you were, press \u0026lt;Ctrl-t\u0026gt;. If the symbol has multiple definitions and you jumped to the wrong one, try using the :tselect command to bring up a list of all matches.\nCommand Effect \u0026lt;Ctrl-]\u0026gt; Jump to definition of the keyword under the cursor. :ta[g] {ident} Jump to the definition of {ident}. \u0026lt;Ctrl-t\u0026gt; Jump back up the tag stack. :tags Show content of tag stack. :po[p] Jump to older entry in tag stack. :ta[g] Jump to newer entry in tag stack. :ts[elect] [ident] List tags that match [ident]. :sts[elect] Same as above, but splits window for tag. For details, see :help tags in Vim.\ncscope The cscope program has more advanced features compared to ctags. In addition to finding symbol definitions, it can gather more advanced information than ctags. Specifically, it can tell you\nwhere a symbol is used in the code, where the symbol was defined, where a variable got its value from, what other functions call this function, what functions are called by a specific function, and more. Similar to ctags, a database file is created by the cscope program. You can run it like this:\n$ cscope -b h/* src/* This will generate a cscope.out file that can be used with Vim. To make the cscope database available, you need to add it during your Vim session by using :cs[cope] add {file|dir}. 
By adding the following to your .vimrc you can automate this:\nif has(\u0026#34;cscope\u0026#34;) \u0026#34; add any database in current directory if filereadable(\u0026#34;cscope.out\u0026#34;) cs add cscope.out \u0026#34; else add database pointed to by environment elseif $CSCOPE_DB != \u0026#34;\u0026#34; cs add $CSCOPE_DB endif endif Basic Usage The basic command used is :cs find {querynum|querytype} {name}, with the following main query types:\nquerynum querytype Effect 0 s Find this C symbol 1 g Find this definition 2 d Find functions called by this function 3 c Find functions calling this function 4 t Find this text string 6 e Find this egrep pattern 7 f Find this file 8 i Find files #including this file 9 a Find places where this symbol is assigned a value For details, see :help cscope in Vim. For some suggested options and keymappings that make using cscope more convenient, see :help cscope-suggestions. You can also use the querynum to perform a single search using the cscope cli interface, e.g.: cscope -L{querynum} {name} [-d] where [-d] suppresses updating the cscope database.\ncflow GNU cflow is a tool that creates charts showing control flow within your program. It has a lot of options and settings, so you\u0026rsquo;ll definitely want to check out its documentation.\nThe most basic usage is cflow {file[s]} which creates an indented listing of function calls starting from main(). Two important command line options are --main which allows you to set a different starting function, and --target which allows you to set a target function below which you don\u0026rsquo;t want to investigate. If you want to include functions that aren\u0026rsquo;t directly reachable from main() or --main in your chart, use the --all flag.\nA particularly nifty feature is that cflow can generate valid dot files using cflow -f dot {file[s]}. 
These can be piped to graphviz to produce visual charts of your function calls, e.g.:\n$ cflow -f dot example.c | dot -Tpng -o flow-example.png ","permalink":"https://dmsenter89.github.io/post/23-07-explore-c-code-with-gnu-tools/","summary":"\u003cp\u003eThis post will introduce three GNU tools to help you explore your C code: ctags, cscope, and cflow.\nThe first two can help you navigate your code as you work on it and can be used directly within Vim.\nCflow on the other hand produces control charts that help you get to know the control flow in a project,\nwhich is particularly helpful if you are new to the codebase.\u003c/p\u003e","title":"Explore C Code With GNU Tools"},{"content":"This 1-hour workshop is intended to help funded CMSRP students prepare for their summer research project. Current best practices in data analytics will be discussed, with emphasis on the pre-analysis phase of the research cycle. Topics covered include the following:\nUsing scientific knowledge to guide variable and model selection. Best practices for data collection, storage, and documentation. Validating data and dealing with formatting issues. Reporting and analysis of missing data. ","permalink":"https://dmsenter89.github.io/talk/23-cmsrp-data-analytics-workshop/","summary":"This 1-hour workshop is intended to help funded CMSRP students prepare for their summer research project. Current best practices in data analytics will be discussed, with emphasis on the pre-analysis phase of the research cycle.","title":"CMSRP Data Analytics Workshop"},{"content":"Last week we saw how to generate posterior samples using PROC MCMC for simple linear and logistic regression models. 
This week, I want to show how to sample regression lines from the data set returned by MCMC by plotting several sample regression lines on top of a scatter plot of the source data.\nWriting the Macro Since the majority of the steps are identical irrespective of what data set we use, and because we might want to use this iteratively during model building, I decided to write this up as a macro. To some degree this is required since I will be using a macro do-loop, which is only valid when embedded inside of a macro.\nThis macro will assume we have fitted a simple linear model of the form\n\\begin{aligned} y_i \u0026amp;\\sim \\mathrm{Normal}(\\mu_i, \\sigma) \\\\ \\mu_i \u0026amp;= \\beta_0 + \\beta_1 x_i \\end{aligned}\nStep 1: Get an SRS Sample The first step is the simplest - selecting a subset of the posterior samples. This is easily achieved by calling PROC SURVEYSELECT.\nproc surveyselect data=\u0026amp;posterior. method=srs n=\u0026amp;n. out=SRS; run; Step 2: Make a Macro List of Parameters Next we need to generate a list of intercepts and slopes. I find it easiest to read those in PROC SQL using the into operation. Additionally, we\u0026rsquo;ll also collect the $x$- and $y$-ranges of our data. This will be used to make sure our plot is centered on the scatter-plot values of our source data set.\nproc sql noprint; select beta0, beta1 into :intercepts separated by \u0026#39; \u0026#39;, :slopes separated by \u0026#39; \u0026#39; from SRS; select min(x), max(x), min(y), max(y) into :minx, :maxx, :miny, :maxy from \u0026amp;ds.; quit; Step 3: Macro-Loop to Add Lines to a Scatter Plot Now all the parts have been assembled and you can call PROC SGPLOT. We use the scatter statement to create a scatter plot of the source data. Then we use a do-loop to iteratively paste different lineparm statements corresponding to our different samples into the SGPLOT statement. 
Lastly, use the xaxis and yaxis statements to focus the graph on the scatter plot data, rather than forcing the x-intercepts of the different fitted lines into view:\nproc sgplot data=\u0026amp;ds. noautolegend; scatter x=x y=y; %do i = 1 %to \u0026amp;n.; lineparm x=0 y=%scan(\u0026amp;intercepts, \u0026amp;i, \u0026#39; \u0026#39;) slope=%scan(\u0026amp;slopes, \u0026amp;i, \u0026#39; \u0026#39;) / transparency=0.7; %end; xaxis min=\u0026amp;minx. max=\u0026amp;maxx.; yaxis min=\u0026amp;miny. max=\u0026amp;maxy.; run; Calling the Macro And that\u0026rsquo;s it. Assuming we declared the macro as follows\n%macro sample_regression(ds=, posterior=, n=); we can now call it.\nAs a particular example, let\u0026rsquo;s run PROC MCMC\u0026rsquo;s getting started example 1 straight from GitHub:\nfilename gs1 url \u0026#39;https://raw.githubusercontent.com/sassoftware/doc-supplement-statug/main/Examples/m-n/mcmcgs1.sas\u0026#39;; %include gs1; In this example, we predict weight based on height in the sashelp.class data set. The posterior samples are available as work.classout. We\u0026rsquo;ll want to rename the height and weight variables to $x$ and $y$ in order to work with the chosen macro names. This is easily accomplished by using the rename= data set option in the macro call itself. We\u0026rsquo;ll call it with $n=15$.\n%sample_regression(ds=sashelp.class(rename=(Height=x Weight=y)), posterior=work.classout, n=15); This will produce the following output:\nWith slight modifications you can also use this macro to help you refine your priors. By using the\nmodel general(0); statement in PROC MCMC in lieu of a regular model you will get estimates of the prior parameters. 
See the documentation for examples.\n","permalink":"https://dmsenter89.github.io/post/23-05-sample-regression-lines/","summary":"\u003cp\u003e\u003ca href=\"/post/23-05-simple-regression-with-proc-mcmc/\"\u003eLast week\u003c/a\u003e we saw how to generate posterior samples using PROC MCMC for simple linear and logistic regression models. This week, I want to show how to sample regression lines from the data set returned by MCMC by plotting several sample regression lines on top of a scatter plot of the source data.\u003c/p\u003e","title":"Sampling Regression Lines"},{"content":"In this post I\u0026rsquo;ll show how to fit simple linear and logistic regression models using the MCMC procedure in SAS. Note that the point of this post is to show how the mathematical model is translated into PROC MCMC syntax and not to discuss the method itself. I will include links to relevant sections in Johnson, Ott, and Dogucu (2022) if you\u0026rsquo;d like to read more about Bayesian modeling.\nThe MCMC Statement The basic syntax for MCMC is as follows:\nproc mcmc \u0026lt;options\u0026gt;; parms parameter \u0026lt;=\u0026gt; number \u0026lt;/options\u0026gt;; prior parameter ~ distribution; programming statements; model variable ~ distribution \u0026lt;/options\u0026gt;; This covers the basic components of most Bayesian models - the model itself (model), the parameters that need to be fit (parms), and their prior distribution (prior). Note that PROC MCMC requires you to always specify your priors, unlike some Bayesian modeling software that will default to some diffuse priors when they are omitted from the problem statement.\nThe most common options you\u0026rsquo;ll use in the MCMC statement will be\ndata= the name of the input data set. outpost= the name of the output data set for posterior samples of parameters. nbi= the number of burn-in iterations. nmc= the number of MCMC iterations, excluding the burn-in iterations. seed= specify a random seed for the simulation. 
thin= specify the thinning rate; see here for more details. The names and function calls for the included distributions are described in the documentation on the MODEL statement. Their density definitions are documented here.\nA Simple Linear Model A basic linear model looks something like this:\n\\begin{aligned} y_i \u0026amp;\\sim \\mathrm{Normal}(\\mu_i, \\sigma) \\\\ \\mu_i \u0026amp;= \\beta_0 + \\beta_1 x_i \\end{aligned}\nwhich will need to be combined with priors for $\\beta_0$, $\\beta_1$, and $\\sigma$. Assume we have a data set work.mydata that contains two variables: our predictor x and our measured variable y. Assume we use the above model together with the following priors:\n\\begin{aligned} \\beta_0 \u0026amp;\\sim \\mathrm{Normal}(0, 10) \\\\ \\beta_1 \u0026amp;\\sim \\mathrm{Normal}(0, 10) \\\\ \\sigma \u0026amp;\\sim \\mathrm{Uniform}(0,50). \\end{aligned}\nTranslating this into PROC MCMC is straightforward. Even though we can specify the statements in any order, it is common to define the model \u0026ldquo;upside down\u0026rdquo; so that each line contains only variables that have already been defined. This is for convenience, so you don\u0026rsquo;t forget to specify something before hitting \u0026ldquo;run.\u0026rdquo;\nproc mcmc data=mydata outpost=posterior /* posterior sim */ nmc=2000; /* # of data points in posterior */ /* define the parameters. 
Optionally give an initial value */ parms beta0 0 beta1 1; parms sigma; /* no initial value - mcmc finds its own */ * define your priors; prior beta: ~ normal(mean=0, sd=10); prior sigma ~ uniform(0, 50); * define the mean and the model; mu = beta0 + beta1*x; model y ~ normal(mu, sd=sigma); run; For more info on simple regression, check out chapters 9-11 in Bayes Rules!\nA Simple Logistic Model A basic logistic model will look as follows:\n\\begin{aligned} y_i \u0026amp;\\sim \\mathrm{Binomial}(n_i, p_i) \\\\ \\mathrm{logit}(p_i) \u0026amp;= \\beta_0 + \\beta_1 x_i \\end{aligned}\ncombined with appropriate priors for $\\beta_0$, $\\beta_1$. Here $y_i$ is a positive integer response, $n_i$ is a count, and $x_i$ is still our predictor. In many medical studies we are interested in the special case where $y_i \\in \\{0,1\\}$ so that the model becomes\n\\begin{aligned} y_i \u0026amp;\\sim \\mathrm{Bern}(p_i) \\\\ \\mathrm{logit}(p_i) \u0026amp;= \\beta_0 + \\beta_1 x_i. \\end{aligned}\nLet\u0026rsquo;s assume a diffuse prior like this:\n\\begin{aligned} \\beta_0 \u0026amp;\\sim \\mathrm{Normal}(0, 100) \\\\ \\beta_1 \u0026amp;\\sim \\mathrm{Normal}(0, 100). \\end{aligned}\nThen we can translate to PROC MCMC as follows:\nproc mcmc data=mydata outpost=posterior /* posterior sim */ nmc=2000; /* # of data points in posterior */ parms (beta0 beta1) 0; prior beta: ~ normal(0, sd=100); /* now define the logistic part: */ p = logistic(beta0 + beta1*x); model y ~ bern(p); run; Often we are not so much directly interested in the $\\beta$ coefficients, but in the odds $e^{\\beta_0}$ and the multiplicative change in odds $e^{\\beta_1}$. While these values can be calculated and analyzed from the outpost data set, we can use the nodata block (delimited by beginnodata and endnodata statements) to directly calculate these values in our simulation. 
The amended procedure reads like this:\nproc mcmc data=mydata outpost=posterior nmc=2000 monitor=(odds modds); parms (beta0 beta1) 0; prior beta: ~ normal(0, sd=100); beginnodata; odds = exp(beta0); modds = exp(beta1); endnodata; p = logistic(beta0 + beta1*x); model y ~ bern(p); run; See chapter 13 in Bayes Rules! for more info on logistic regression models.\nAdding a Random-Effect Random-effects, also known as hierarchical modeling, looks at group structures in the data set and models group-specific effects. In a clinical setting, this might be the study site.\nAs a simple example, assume we have a data set ht containing the height (h) and sex (s) of a population sample. Assume we are interested in modeling the distribution of height in our data set. We know that on average males are taller than females (mean 167 cm vs 156 cm based on NHANES 2006). We could build a model similar to this:\n\\begin{aligned} h_i \u0026amp;\\sim \\mathrm{Normal}(\\mu_i, \\sigma) \\\\ \\mu_i \u0026amp;= \\alpha_{\\mathrm{sex[i]}} \\\\ \\alpha_j \u0026amp;\\sim \\mathrm{Normal}(160, 20) \\quad \\text{for } j=1,2 \\\\ \\sigma \u0026amp;\\sim \\mathrm{Uniform}(0,50). \\end{aligned}\nIn SAS, the random effect is specified with the random statement. We specify the categories with the subject keyword in the random statement. SAS will then automatically create the necessary number of parameters for the random effect. Our model translates to the following MCMC call:\nproc mcmc data=ht outpost=posterior nmc=2000; parms sigma 5 sigmaA 6; prior sigma: ~ uniform(0,50); random alpha ~ normal(160, sd=sigmaA) subject=s monitor=(alpha); mu = alpha; model h ~ normal(mu, sd=sigma); run; Assuming $s\\in\\{1,2\\}$ this will cause SAS to create two alpha variables for the two levels of s: alpha_1 and alpha_2. 
Had s been a character variable, say with values m and f, then SAS would have created alpha_m and alpha_f instead.\nFor more information on this, check out unit IV of Bayes Rules!\nReference Johnson, A. A., Ott, M. Q., and Dogucu, M. (2022), Bayes Rules!: An Introduction to Applied Bayesian Modeling, Boca Raton, FL: CRC Press, DOI: 10.1201/9780429288340. Available online at www.bayesrulesbook.com.\n","permalink":"https://dmsenter89.github.io/post/23-05-simple-regression-with-proc-mcmc/","summary":"\u003cp\u003eIn this post I\u0026rsquo;ll show how to fit simple linear and logistic regression models using the \u003ca href=\"https://support.sas.com/rnd/app/stat/procedures/mcmc.html\"\u003eMCMC\u003c/a\u003e procedure in SAS. Note that the point of this post is to show how the mathematical model is translated into PROC MCMC syntax and not to discuss the method itself. I will include links to relevant sections in Johnson, Ott, and Dogucu (2022) if you\u0026rsquo;d like to read more about Bayesian modeling.\u003c/p\u003e","title":"Simple Regression With PROC MCMC"},{"content":"The SAS Transport File Format (XPORT) is an open file format maintained by SAS for exchanging datasets. Its use is mandated by the FDA for data set submission for new drug or device applications and the CDC uses this format to distribute public data. For details regarding this format, see this Library of Congress page. This post will explore how to read several of these files into a SAS session with the URL filename statement using the National Health and Nutrition Examination Survey, or NHANES, as an example.\nLoading a Single XPT File By far the easiest way to read an XPT file is to use the XPT2LOC autocall macro if it is available on your SAS installation. 
As an example, this snippet would load the demographics table from the 2017-2018 NHANES data set into the work library:\nfilename demo \u0026#34;/data/Nhanes/2017-2018/DEMO_J.XPT\u0026#34;; %XPT2LOC(filespec=demo, libref=work); This macro correctly resolves the name of the data set, and it would be available as work.demo_j now.1 See the documentation for more details on this macro.\nIf we cannot or do not want to use this macro, we\u0026rsquo;ll have to assign a LIBREF to the XPT file. This might seem weird at first, because you typically will only find a single data set in an XPT file. But if you consider that the file standard allows for multiple data sets to reside in the same XPT file, it makes sense. Using the LIBREF, we can achieve the same result as above using this snippet:\nfilename xpt url \u0026#34;https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.XPT\u0026#34;; libname xpt xport; proc copy in=xpt out=work; run; Loading Multiple XPT Files with a Macro This is all fine if you only need to load one or two files that way, but becomes tedious (and repetitive) if you have to load many data sets this way. Ignoring the restricted data sets for a minute, NHANES contains many data sets spread across five domains:\nDomain # of Data Sets Demographics Data 1 Dietary Data 14 Examination Data 14 Laboratory Data 51 Questionnaire Data 44 TOTAL 124 Even if you only need a subset of this, you\u0026rsquo;ll find yourself wanting to shortcut having to type out all the repetitive information. This is where a macro call comes in handy.\nA great trick for this is to use a codebook like data set that you can iterate over. Here is a minimal example using four data sets from NHANES:\n/* create a location to hold saved data */ libname nhanes \u0026#39;~/data/NHANES\u0026#39;; data nhanes.datasets; length df $10. 
dfname $100.; input df $ dfname $; infile datalines dsd; datalines; DEMO_J,Demographic Variables and Sample Weights BPX_J,Blood Pressure BMX_J,Body Measures OHXDEN_J,Oral Health - Dentition ; run; You can either create this data set yourself or use a webscraping tool to make it for you. Wrapping the autocall macro or the PROC COPY into a macro is straightforward:\n%macro load_data(name=); /* allow bad SSL; this is due to an issue with cdc.gov */ options set=SSLREQCERT=\u0026#34;allow\u0026#34;; /* set up for import */ filename xpt url \u0026#34;https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/\u0026amp;name..XPT\u0026#34;; libname xpt xport; proc copy in=xpt out=nhanes; run; /* make sure to clear libname \u0026amp; filename for next macro call */ filename xpt; libname xpt; %mend; The only question now is how to trigger this macro for each data set listed in nhanes.datasets. That\u0026rsquo;s where the CALL EXECUTE routine comes in. It allows us to invoke a macro for each line in the source data set while giving us access to the variables in the source data. Since this is executed as part of a data step, you can use more fine grained control by having if/else conditions, where clauses, etc. In our example, we\u0026rsquo;d use this data step:\ndata _NULL_; set nhanes.datasets; call execute(\u0026#39;%load_data(name=\u0026#39;||df||\u0026#39;);\u0026#39;); run; After running our script, the folder specified by libname nhanes will contain both our \u0026ldquo;codebook\u0026rdquo; of data sets, as well as all of the data sets listed in the file.\nNote that using this macro requires you to first download the file for processing. You can do this easily with a TEMP filename statement.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://dmsenter89.github.io/post/23-04-loading-several-xpt-files-from-a-url/","summary":"\u003cp\u003eThe SAS Transport File Format (XPORT) is an open file format maintained by SAS for exchanging datasets. 
Its use is mandated by the FDA for data set submission for new drug or device applications and the CDC uses this format to distribute public data. For details regarding this format, see \u003ca href=\"https://www.loc.gov/preservation/digital/formats/fdd/fdd000464.shtml\"\u003ethis Library of Congress page\u003c/a\u003e. This post will explore how to read several of these files into a SAS session with the URL filename statement using the National Health and Nutrition Examination Survey, or \u003ca href=\"https://www.cdc.gov/nchs/nhanes/index.htm\"\u003eNHANES\u003c/a\u003e, as an example.\u003c/p\u003e","title":"Loading Several XPT Files From a URL"},{"content":" Today is tax day in the US. In celebration we\u0026rsquo;re going to take a look at some of the data available on the IRS Statistics page. Since the current administration focuses a lot of rhetoric on making the \u0026ldquo;rich pay their fair share,\u0026rdquo; I thought it might be interesting to see if any of the data made available by the IRS could be used to look at whether or not this is true at the moment. Luckily for us, the IRS makes an Excel sheet available that includes what income percentiles earn what share of taxable income and what portion of the tax burden is borne by these same percentiles.\nIt is important to note that this Excel sheet only reports on what the IRS refers to as \u0026ldquo;Adjusted Gross Income,\u0026rdquo; which is - as the name suggests - gross income after adjustment by tax breaks and the like. While this accounts for capital gains, it does not account for all taxes paid by the population. Consumption taxes, for example sales tax revenue, are not included. State taxes are also excluded from this data set. 
Nevertheless I think this data set is interesting to look at, and we will note a research paper below that shows this trend continues even after adding in consumption taxes.\nDefining a Notion of \u0026lsquo;Fairness\u0026rsquo; The question of \u0026ldquo;fairness\u0026rdquo; can be viewed as a question of proportion. There are two obvious ways to do this: tax share being proportional to one\u0026rsquo;s share in the population, or being proportional to one\u0026rsquo;s share of the income. Let\u0026rsquo;s use an example to illustrate these two different options. You and two of your friends go out and have dinner together. The total tab amounts to 90 USD. What would be a fair way of splitting that tab?\nThe simplest option would be to divide the entire bill by the number of people in the group. That way, everybody pays an equal amount regardless of their respective incomes. In this case, everybody would pay 30 USD towards the bill.\nA second option would be to make each person\u0026rsquo;s contribution equal to their ability to provide as measured by their income. Let\u0026rsquo;s say you have one friend who is currently between jobs, i.e. with no income, and your other friend makes twice what you make. If we wanted to scale by income, your friend between jobs would pay nothing towards the bill, you would pay 30 USD, and your friend with greater income would cover the remaining 60 USD of the bill.\nWe can expect neither of these two options to be the case with the US tax system, which is decidedly progressive - the more money you earn, the higher a percentage of your income goes towards the federal government.1 In my experience, many people actually underestimate how robust the American welfare system is. 
Blanchet, Chancel and Gethin (2020) for example have shown that the US redistributes a greater share of national income to its low-income population than any European country.2 This paper also shows that the trend in the taxes we find in the IRS data holds after considering additional taxes such as consumption taxes.\nLoading the Data The data we are interested in come in two separate ranges of the Excel sheet. First, we\u0026rsquo;ll load the XLS file using a temporary object:\nfilename _httpin temp; proc http method=\u0026#34;get\u0026#34; url=\u0026#34;https://www.irs.gov/pub/irs-soi/20in41ts.xls\u0026#34; out=_httpin; run; We can then use PROC IMPORT with the range option to read the relevant ranges of the Excel sheet that we are interested in. For the AGI share information that is the range given by TAB1$A134:P153. For the tax share, the relevant range is TAB1$A155:P174. Both tables are in wide format and will need transposing. Since the code is nearly identical for both ranges, I wrapped it in a macro function:\n%macro get_table(name=, range=); proc import datafile=_httpin out=\u0026amp;name. dbms=XLS replace; range=\u0026#34;\u0026amp;range.\u0026#34;; run; proc transpose data=\u0026amp;name.(drop=B) out=\u0026amp;name.(drop=_LABEL_ rename=(Col1=\u0026amp;name.)) name=ExcelCol; by A; run; %mend; We can then easily call this for both the tax share and AGI ranges:\n%get_table(name=AGISH, range=TAB1$A134:P153); %get_table(name=TAXSH, range=TAB1$A155:P174); One tricky part of this is having to deal with transforming the Excel column headers into the requisite percentiles. An easy way of doing this is to specify a SAS informat that can be applied to the data. 
Here we go:\nproc format; invalue coln \u0026#39;C\u0026#39;=0.00001 \u0026#39;D\u0026#39;=0.0001 \u0026#39;E\u0026#39;=0.001 \u0026#39;F\u0026#39;=0.01 \u0026#39;G\u0026#39;=0.02 \u0026#39;H\u0026#39;=0.03 \u0026#39;I\u0026#39;=0.04 \u0026#39;J\u0026#39;=0.05 \u0026#39;K\u0026#39;=0.10 \u0026#39;L\u0026#39;=0.20 \u0026#39;M\u0026#39;=0.25 \u0026#39;N\u0026#39;=0.30 \u0026#39;O\u0026#39;=0.40 \u0026#39;P\u0026#39;=0.50; run; This can now be used with the input function to relabel our columns as percentiles. At this point we\u0026rsquo;re ready to merge. Here\u0026rsquo;s what the 2020 data looks like:\nYear Top Percentile Cumulative Tax Share Cumulative AGI Share 2020 0.001% 4.143% 2.379% 2020 0.010% 10.214% 5.530% 2020 0.100% 22.056% 11.322% 2020 1.000% 42.313% 22.187% 2020 2.000% 50.375% 27.723% 2020 3.000% 55.508% 31.801% 2020 4.000% 59.495% 35.171% 2020 5.000% 62.742% 38.107% 2020 10.000% 73.670% 49.453% 2020 20.000% 84.929% 64.869% 2020 25.000% 88.508% 70.713% 2020 30.000% 91.357% 75.709% 2020 40.000% 95.334% 83.734% 2020 50.000% 97.678% 89.819% We see from this table that US taxes are quite progressive. The Top 2% of tax payers pay about half of all the income taxes paid to the federal government, while earning only a quarter of all the taxable income generated. The bottom 50% of tax payers contribute nearly nothing to these federal taxes (less than 3%). We can also see a glimpse of income inequality - nearly half of all income earned is earned by the top 10% of income earners.\nSome Visualizations Let\u0026rsquo;s start by looking at a few simple visualizations of the data. In the following image, the dashed grey line represents equality between the top income percentiles (x-axis) and the cumulative tax share (y-axis). Anything above the line indicates larger-than-proportional contributions.\nTop tax paying percentiles versus cumulative tax share. The reference line indicated equal shares. 
A point above the line indicates greater contribution.\nThis graph basically looks as expected. We can see three distinct slopes in the graph: the steepest section is for the top 1% of tax payers. The next part is the 2-5% range, and then the 10-50% range is likewise fairly consistent.\nWe can use this graph to calculate something akin to a Gini coefficient. The Gini coefficient represents the ratio of the area under the Lorenz curve compared to the area under the equality line, which would represent perfect equality between the cumulative population share versus cumulative share of income. A Gini coefficient of zero would imply perfect equality, while a Gini coefficient of one would imply perfect inequality.\nWe don\u0026rsquo;t have the full population size, so our coefficient won\u0026rsquo;t be completely comparable. We will define our tax GINI coefficient, let\u0026rsquo;s call it \u0026ldquo;TINI\u0026rdquo; coefficient, as the ratio of the area between the cumulative Tax Share and the equality line to the total area above the equality line on the box $(0,0.5)\\times(0,1)$. The total area above the equality line is $3/8$. We can run through this with a data step using a trapezoidal approximation. See the accompanying SAS script for details.\nMy calculated TINI coefficient over time.\nThis TINI coefficient is centered at 0.74 over the data period. While quite stable from about 2008 to 2016, the overall trend has been upwards towards greater inequality. Compare this to the US\u0026rsquo; Gini coefficient which has fluctuated around 0.4 during that same time period. From this, it would appear that the inequality is greater when looking at tax contributions than wealth distribution.\nAnother interesting way to look at this data is to think about the ratio of cumulative share of taxes paid over the cumulative share of AGI earned by the different percentiles. 
For the top 5% of earners, this ratio tends to hover between 1.6 and 1.9 over the data period as can be seen from the plot below.\nDot plot of the tax obligation to AGI share over the data set. Points represent the mean, with bars indicating ±1 standard deviation.\nEven though I expected to see some inequality due to the progressive nature of the tax system, I was surprised by how \u0026ldquo;top-heavy\u0026rdquo; the tax burden is given the popular rhetoric.\nSee this IRS page for the marginal rates for the 2023 tax year. In this year, the rates range from a low of 10% to a high of 37%.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nBlanchet, T., Chancel, L., and Gethin, A. (2020), \u0026ldquo;Why Is Europe More Equal than the United States?\u0026rdquo;, American Economic Journal: Applied Economics, 14 (4), 450\u0026ndash;518. DOI: 10.1257/app.20200703.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://dmsenter89.github.io/post/23-04-are-the-rich-paying-their-fair-share/","summary":"Today is tax day in the US. In celebration we\u0026rsquo;re going to take a look at some of the data available on the IRS Statistics page.","title":"Are the Rich Paying Their Fair Share?"},{"content":"I have recently read Cal Newport\u0026rsquo;s book \u0026ldquo;Deep Work\u0026rdquo; (2016). Overall, it is a short but engaging read discussing his tips for how to spend more time doing intellectually focused and engaging work in a society whose attention and focus is ever more divided. Below are my takeaways from the book.\nThe Big Picture The book focuses on what Newport calls the \u0026ldquo;Deep Work Hypothesis:\u0026rdquo;\nThe ability to perform deep work is becoming increasingly rare at exactly the same time it is becoming increasingly valuable in our economy. 
As a consequence, the few who cultivate this skill, and then make it the core of their working life, will thrive.\nHe defines \u0026ldquo;deep work\u0026rdquo; as\nProfessional activities performed in a state of distraction-free concentration that push your cognitive capabilities to the limit. These efforts create new value, improve your skill, and are hard to replicate.\nThis is in contrast to \u0026ldquo;shallow work\u0026rdquo;, which he describes as\nNoncognitively demanding, logistical-style tasks, often performed while distracted. These efforts tend to not create much new value in the world and are easy to replicate.\nOverall, he splits the book into two parts. The first describes what he considers to be \u0026ldquo;deep work\u0026rdquo; by illustration with various examples. This is coupled with an explanation for why he thinks deep work is valuable, rare, and meaningful, with a chapter dedicated to each of these three topics. In the second part of the book he lists his four major rules for accomplishing an increasing amount of deep work.\nPart 1 of the book struck me as the weakest, but that may be based on my background. The first chapter prepares a good definition of what Cal means by \u0026ldquo;deep work\u0026rdquo; and why it is increasingly valuable in our modern economy. His chapter showing deep work is rare will likely not come as a surprise to most. The key insight here is that while productivity is all the rage these days, most people are actually merely \u0026ldquo;busy,\u0026rdquo; but not productive. His chapter tying deep work to the good life should be familiar to all with even rudimentary exposure to philosophy.\nHere is a brief summary of his four rules:\nRule #1: Work Deeply. His first rule is a bit of a catch-all, describing several different methods of deep work at different levels, but the heart of it is the idea that deep work requires deliberate effort, planning, and ritual. Rule #2: Embrace Boredom. 
The theme of his chapter reminds me a lot of Sir Bertrand Russell and his wonderful book The Conquest of Happiness, in which boredom plays a prominent role. In essence, we are hurting ourselves by our constant efforts to escape the feeling of boredom. By embracing the idea that boredom is not inherently bad, we can relearn proper focus which aids in deep work. Rule #3: Quit Social Media. Here he focuses on one particular source of distraction in our world - social media. The essence of the chapter is that as a society, we are not sufficiently skeptical of the benefit social media consumption offers compared to its pitfalls. Rule #4: Drain the Shallows. Part of what keeps us from doing deep work is the distraction of shallow work. This chapter centers around a few different tips for reducing the shallow work load, thereby making time for more deep work. Specifically, he suggests using proper time management, saying no to taking on additional but unnecessary obligations, and restructuring one\u0026rsquo;s email habits. Part 1 - The Idea Deep work is Valuable This chapter is largely a reflection of the economic import of deep work for knowledge workers. Our economy is shifting and even prior to COVID, regional limitations on finding workers have started to become less meaningful. That means that knowledge workers compete with an increasing number of other knowledge workers. 
At the same time, technological advances make technological skill and adaptability ever more important.\nNewport identifies two core abilities for thriving in our new economy:\n\u0026ldquo;The ability to quickly master hard things.\u0026rdquo; \u0026ldquo;The ability to produce at an elite level, in terms of both quality and quantity.\u0026rdquo; Learning hard things quickly is identified with deep work.\nTo learn hard things quickly, you must focus intensely without distraction. To learn, in other words, is an act of deep work. If you\u0026rsquo;re comfortable going deep, you\u0026rsquo;ll be comfortable mastering the increasingly complex systems and skills needed to thrive in our economy. If you instead remain one of the many for whom depth is uncomfortable and distraction ubiquitous, you shouldn\u0026rsquo;t expect these systems and skills to come easily to you.\nImproving your skill requires deliberate practice, whose two key components he lists as\nfocusing your attention tightly on a specific skill you are attempting to improve, and receiving feedback that allows for correction without loss of attention. Deep Work is Rare The Principle of Least Resistance: In a business setting, without clear feedback on the impact of various behaviors to the bottom line, we will tend toward behaviors that are easiest in the moment.\nMany knowledge workers struggle with being productive. This is despite exhibiting external signs of \u0026ldquo;productivity:\u0026rdquo;\nKnowledge workers, I\u0026rsquo;m arguing, are tending toward increasingly visible busyness because they lack a better way to demonstrate their value. Let\u0026rsquo;s give this tendency a name. 
Busyness as Proxy for Productivity: In the absence of clear indicators of what it means to be productive and valuable in their jobs, many knowledge workers turn back toward an industrial indicator of productivity: doing lots of stuff in a visible manner.\nThe problem he identifies is that it is difficult for a knowledge worker to demonstrate their value, and that modern society has developed a \u0026ldquo;productivity-fetish\u0026rdquo; for lack of better term. Everything should be measured and quantified and be \u0026ldquo;efficient.\u0026rdquo; The difficulty with this is that knowledge-work is inherently different from the type of labor many of these productivity measures are derived from.\nKnowledge work is not an assembly line, and extracting value from information is an activity that\u0026rsquo;s often at odds with busyness, not supported by it.\nAll this amounts to actual productivity, which is ascribed to deep work, being rare.\nDeep Work is Meaningful The previous chapters provided external motivation for deep work. In this chapter, Cal argues that deep work is \u0026ldquo;meaningful,\u0026rdquo; in the sense of being part of living \u0026ldquo;the good life.\u0026rdquo;\nThe goal of this chapter is to convince you that deep work can generate as much satisfaction in an information economy as it so clearly does in a craft economy. [\u0026hellip;] The thesis of this final chapter in Part 1, therefore, is that a deep life is not just economically lucrative, but also a life well lived.\nHe points out research concerning the effects of deliberate attention in general, and how we allocate deliberate attention. \u0026ldquo;Skillful management of attention is the sine qua non of the good life and the key to improving virtually every aspect of your experience.\u0026rdquo;\nDeep work aids a person in achieving happiness in two ways. 
One, it provides distraction that keeps us from noticing the \u0026ldquo;many smaller and less pleasant things that unavoidably and persistently populate our lives:\u0026rdquo;\nOur brains instead construct our worldview based on what we pay attention to. If you focus on a cancer diagnosis, you and your life become unhappy and dark, but if you focus instead on an evening martini, you and your life become more pleasant—even though the circumstances in both scenarios are the same. As Gallagher summarizes: “Who you are, what you think, feel, and do, what you love — is the sum of what you focus on.”\nSecond, being engaged in deep work is connected to flow:\nDeep work is an activity well suited to generate a flow state (the phrases used by Csikszentmihalyi to describe what generates flow include notions of stretching your mind to its limits, concentrating, and losing yourself in an activity—all of which also describe deep work).\nSince flow has been shown to generate happiness, deep work is argued to generate happiness as well.\nPart 2 - The Rules Rule #1: Work Deeply Cal lists four main methods, or \u0026ldquo;philosophies,\u0026rdquo; for practicing deep work:\nThe monastic philosophy. What it sounds like. Isolate yourself completely from shallow work for extended periods of time. Some famous examples are cited, but Cal notes this is not realistic for most workers. You can imagine your employer not taking kindly to not being able to reach you at all for a few weeks at a time because you are secluded into a deep work state.\nThe bimodal philosophy. Exemplified in the book by Carl Jung. Divide your time into clearly defined stretches of deep-work, at least a full day but preferably several days a week. During these stretches of time you essentially act monastically. This method is substantially more realistic for most workers, particularly for academics like Cal. As an example, you could set aside Thursdays/Fridays for deep work. 
During these days, you wouldn\u0026rsquo;t be checking email, be on Teams/Slack, you would have no meetings, etc. You just focus on whatever deep work project you are working on. During the rest of the week on the other hand, you\u0026rsquo;d act like a regular worker with meetings, email checking, etc.\nThe rhythmic philosophy. This method focuses most heavily on routines. Create a regular deep work routine during each work day. Create a consistent block of time set aside for deep work, each and every day. While you don\u0026rsquo;t have the same amount of time to dig as \u0026ldquo;deep\u0026rdquo; as in the previous methods, \u0026ldquo;by supporting deep work with rock-solid routines that make sure a little bit gets done on a regular basis, the rhythmic scheduler will often log a larger total number of deep hours per year.\u0026rdquo; This method is also being pushed into more office workers\u0026rsquo; consciousness with Microsoft\u0026rsquo;s Viva Insights. Now integrated into Outlook, it suggests automatically scheduling \u0026ldquo;focus hours\u0026rdquo; for you, helping you get a head start with this method.\nThe journalistic philosophy. Basically switch between deep and shallow work in undefined blocks, perhaps multiple times a day. Not recommended as most people are not able to do this, and even if they can they do so only with substantial training.\nOverall, the rhythmic and bimodal philosophies seem to be most achievable for most workers. 
He gives further support for the rhythmic philosophy by noting research that suggests a person can only engage in deep work for about one to four hours a day, depending on the level of training they have.\nCal thoroughly stresses the need for routines in helping someone get into a deep work state of mind.\nThe key to developing a deep work habit is to move beyond good intentions and add routines and rituals to your working life designed to minimize the amount of your limited willpower necessary to transition into and maintain a state of unbroken concentration.\nThere is a popular notion that artists work from inspiration—that there is some strike or bolt or bubbling up of creative mojo from who knows where… but I hope [my work] makes clear that waiting for inspiration to strike is a terrible, terrible plan. In fact, perhaps the single best piece of advice I can offer to anyone trying to do creative work is to ignore inspiration. In a New York Times column on the topic, David Brooks summarizes this reality more bluntly: “[Great creative minds] think like artists but work like accountants.”\nTo make the most out of your deep work sessions, build rituals of the same level of strictness and idiosyncrasy as the important thinkers mentioned previously. There’s a good reason for this mimicry. Great minds like Caro and Darwin didn’t deploy rituals to be weird; they did so because success in their work depended on their ability to go deep, again and again—there’s no way to win a Pulitzer Prize or conceive a grand theory without pushing your brain to its limit.\nThis means you have to plan your deep work sessions ahead of time. Decide\nWhere you’ll work and for how long. How you’ll work once you start to work. How you’ll support your work. He discusses the 4DX framework, originally intended for businesses, and adapts it to personal deep work. 
The 4DX framework is built up of four key disciplines:\nFocus on the Wildly Important Act on the Lead Measures Keep a Compelling Scoreboard Create a Cadence of Accountability Lastly, he argues for the importance of \u0026ldquo;injecting regular and substantial freedom from professional concerns into your day, providing you with the idleness paradoxically required to get (deep) work done.\u0026rdquo; This is discussed in the context of attention restoration theory. The basic idea is that concentration, or \u0026ldquo;directed attention,\u0026rdquo; is a finite resource that can get exhausted. For this reason, he argues to keep all work strictly confined to the work day. This way, you can spend your time outside of work hours recharging. This in turn helps you be more productive while working. To facilitate this move away from work and towards attention restoring activities, he argues for a commitment to a strict shutdown ritual. This is Cal\u0026rsquo;s description:\nIn more detail, this ritual should ensure that every incomplete task, goal, or project has been reviewed and that for each you have confirmed that either (1) you have a plan you trust for its completion, or (2) it’s captured in a place where it will be revisited when the time is right. The process should be an algorithm: a series of steps you always conduct, one after another. When you’re done, have a set phrase you say that indicates completion (to end my own ritual, I say, “Shutdown complete”). This final step sounds cheesy, but it provides a simple cue to your mind that it’s safe to release work-related thoughts for the rest of the day. [\u0026hellip;] The concept of a shutdown ritual might at first seem extreme, but there’s a good reason for it: the Zeigarnik effect.\nRule #2: Embrace Boredom The ability to concentrate intensely is a skill that must be trained.\nDeep work requires focus, and sometimes this focus can be difficult and uncomfortable. 
In our society we are accustomed to constantly available, on-demand distraction that can keep us from having to have uncomfortable moments where we are left alone with our thoughts. These distractions keep us from being bored. But this comes with a downside: the more accustomed we are to the easy availability of distraction, the harder it becomes to focus when we need to. This problem is particularly pronounced in individuals who frequently multitask:\nPeople who multitask all the time can’t filter out irrelevancy. They can’t manage a working memory. They’re chronically distracted. They initiate much larger parts of their brain that are irrelevant to the task at hand\u0026hellip; they’re pretty much mental wrecks. [\u0026hellip;] Once your brain has become accustomed to on-demand distraction, Nass discovered, it’s hard to shake the addiction even when you want to concentrate. To put this more concretely: If every moment of potential boredom in your life—say, having to wait five minutes in line or sit alone in a restaurant until a friend arrives—is relieved with a quick glance at your smartphone, then your brain has likely been rewired to a point where, like the “mental wrecks” in Nass’s research, it’s not ready for deep work—even if you regularly schedule time to practice this concentration.\nTo counteract this, you need to train yourself with two goals in mind: \u0026ldquo;improving your ability to concentrate intensely and overcoming your desire for distraction.\u0026rdquo; As part of this training, Cal suggests a type of intermittent \u0026ldquo;internet Sabbath.\u0026rdquo; The intent is to schedule a break from focus where you\u0026rsquo;re allowed to give in to distraction.\nWith these rough categorizations established, the strategy works as follows: Schedule in advance when you’ll use the Internet, and then avoid it altogether outside these times. I suggest that you keep a notepad near your computer at work. 
On this pad, record the next time you’re allowed to use the Internet. Until you arrive at that time, absolutely no network connectivity is allowed—no matter how tempting.\nHe emphasizes the need to rigorously stick to this schedule and really, completely avoid the internet during the scheduled focus time. For best results, he suggests restricting internet use at home as well as at work. This further helps train yourself to use the internet as an intentional tool, as opposed to as an escape mechanism from potential boredom.\nTo summarize, to succeed with deep work you must rewire your brain to be comfortable resisting distracting stimuli. This doesn’t mean that you have to eliminate distracting behaviors; it’s sufficient that you instead eliminate the ability of such behaviors to hijack your attention. The simple strategy proposed here of scheduling Internet blocks goes a long way toward helping you regain this attention autonomy.\nHe also suggests taking up what he calls productive meditation. Essentially, taking a time where you are physically but not mentally occupied (e.g., walking, jogging, driving) and focusing on a single, well-defined problem. This activity helps practice focusing and makes productive use of these times we might otherwise waste by distracting ourselves with things like podcasts.\nLastly, he suggests practicing memorization. He gives the example of learning how to memorize a deck of cards. The point here is not the party-trick of being able to quickly memorize a deck of cards, but rather to provide a workout for the mind.\nRule #3: Quit Social Media This chapter focuses on one particular source of distraction - social media. He lists sites like Facebook, Twitter and Instagram (the book was written prior to TikTok). Most people are signed up for several of these services, despite them offering limited benefits in Cal\u0026rsquo;s view. 
He proposes the following root-cause of this:\nThe Any-Benefit Approach to Network Tool Selection: You’re justified in using a network tool if you can identify any possible benefit to its use, or anything you might possibly miss out on if you don’t use it. The problem with this approach, of course, is that it ignores all the negatives that come along with the tools in question. These services are engineered to be addictive—robbing time and attention from activities that more directly support your professional and personal goals (such as deep work).\nHe repeatedly emphasizes that there are legitimate benefits to social media; his point is not that they are morally \u0026ldquo;bad\u0026rdquo; or \u0026ldquo;useless,\u0026rdquo; but rather that they are attention robbing while providing limited benefit. In other words, we can spend our time more profitably by doing something else with it rather than scrolling through Facebook or TikTok feeds. Our days are short and we only have limited time. Spending time on social media typically is either (1) a distraction, filling otherwise empty space (see his rule #2 above), or (2) a suboptimal use of our free time.\nCal proposes the following alternative:\nThe Craftsman Approach to Tool Selection: Identify the core factors that determine success and happiness in your professional and personal life. Adopt a tool only if its positive impacts on these factors substantially outweigh its negative impacts.\nHe suggests writing a list of high-level goals of what\u0026rsquo;s most important in your personal and professional life, then listing the two or three most important activities that can help you satisfy that goal. If a networking tool doesn\u0026rsquo;t fit with your goals, it might be more of a distraction than a useful tool. 
We might worry about missing out if we eliminate some or all of our social media, but\nStuff accumulates in people’s lives, in part, because when faced with a specific act of elimination it’s easy to worry, “What if I need this one day?,”and then use this worry as an excuse to keep the item in question sitting around. Nicodemus’s packing party provided him with definitive evidence that most of his stuff was not something he needed, and it therefore supported his quest to simplify.\nCal suggests cold-quitting the use of all social media for 30 days, without announcing it. Note that he suggests quitting the use, not shutting the services. Re-evaluate your use of the services after these 30 days. Overall, Cal argues that we deserve to put more thought into our leisure activities, and shouldn\u0026rsquo;t just default to whatever is easily available at the moment. Even if the torrent of short videos is funny.\nTo summarize, if you want to eliminate the addictive pull of entertainment sites on your time and attention, give your brain a quality alternative. Not only will this preserve your ability to resist distraction and concentrate, but you might even fulfill Arnold Bennett’s ambitious goal of experiencing, perhaps for the first time, what it means to live, and not just exist.\nRule #4: Drain the Shallows This chapter focuses on how to make more room for deep work by reducing the amount of low-value shallow work. 
He acknowledges that some shallow work is necessary, so he doesn\u0026rsquo;t encourage us to \u0026ldquo;quixotically pursue a schedule in which all of [our] time is invested in depth.\u0026rdquo; Nevertheless, he argues much shallow work can be eliminated without loss since its value is frequently overestimated.\nPart of his advice is to utilize block-scheduling as a time management routine.\nHere\u0026rsquo;s my suggestion: At the beginning of each workday, turn to a new page of lined paper in a notebook you dedicate to this purpose. 
Down the left-hand side of the page, mark every other line with an hour of the day, covering the full set of hours you typically work. Now comes the important part: Divide the hours of your workday into blocks and assign activities to the blocks. For example, you might block off nine a.m. to eleven a.m. for writing a client\u0026rsquo;s press release. To do so, actually draw a box that covers the lines corresponding to these hours, then write “press release”inside the box. Not every block need be dedicated to a work task. There might be time blocks for lunch or relaxation breaks. To keep things reasonably clean, the minimum length of a block should be thirty minutes (i.e., one line on your page). This means, for example, that instead of having a unique small box for each small task on your plate for the day—respond to boss\u0026rsquo;s e-mail, submit reimbursement form, ask Carl about report—you can batch similar things into more generic task blocks.\nHe advises not to stick too rigidly to these blocks, but instead use them to make sure you use your time intentionally instead of haphazardly. He also advises scheduling more time than you think you need at first, until you are used to the method and have a better sense of being able to gauge how long tasks will actually take when you give them your full attention.\nTo summarize, the motivation for this strategy is the recognition that a deep work habit requires you to treat your time with respect. A good first step toward this respectful handling is the advice outlined here: Decide in advance what you’re going to do with every minute of your workday. It’s natural, at first, to resist this idea, as it’s undoubtedly easier to continue to allow the twin forces of internal whim and external requests to drive your schedule. 
But you must overcome this distrust of structure if you want to approach your true potential as someone who creates things that matter.\nWhen deciding on a task schedule, quantify the depth of every activity. This can help in both prioritization and in figuring out what is actually shallow work masquerading as deep work. He advocates the following heuristic in judging where an activity falls on the shallow-depth continuum:\nevaluate activities by asking a simple (but surprisingly illuminating) question: How long would it take (in months) to train a smart recent college graduate with no specialized training in my field to complete this task?\nThe shorter the amount of time, the more shallow the task. Conversely, the longer it would take, the more important that task is for you as an individual since it utilizes the skills you have acquired and puts you to \u0026ldquo;best use.\u0026rdquo;\nExpanding on a topic brought up in rule #1, he advocates for what he terms fixed-schedule productivity. In essence: fix a time of day that ends your work day (he suggests 5:30 PM). From there, work backwards into making your day\u0026rsquo;s work fit that schedule. This way you are guaranteed to be done with work and available to engage in attention restoration and living your life outside of work, while at the same time nudging yourself to use the now more limited work time to the best of your ability.\nSince you will have more limited time in your workday, this motivates you to start saying no to potential commitments that don\u0026rsquo;t actually further your professional goals significantly. 
He brings up his own example of earning tenure and that of a colleague to show that even though we might think we need to say \u0026ldquo;yes\u0026rdquo; to all opportunities and work all hours of the day in order to succeed, we may not need to do either if we use our time productively.\nTo summarize these observations, Nagpal and I can both succeed in academia without Tom-style overload due to two reasons. First, we’re asymmetric in the culling forced by our fixed-schedule commitment. By ruthlessly reducing the shallow while preserving the deep, this strategy frees up our time without diminishing the amount of new value we generate. Indeed, I would go so far as to argue that the reduction in shallow frees up more energy for the deep alternative, allowing us to produce more than if we had defaulted to a more typical crowded schedule. Second, the limits to our time necessitate more careful thinking about our organizational habits, also leading to more value produced as compared to longer but less organized schedules. The key claim of this strategy is that these same benefits hold for most knowledge work fields.\nThis observation definitely matches up with my own personal observations of academic life. An incredible amount of time is wasted there at various levels.\nHis final piece of advice is mostly related to email - \u0026ldquo;become hard to reach.\u0026rdquo; Email sucks up a large amount of time in many workers\u0026rsquo; day. And if we really think about it, many emails are not effective uses of our time. He offers three tips for reducing the impact of this potential time waster:\nMake People Who Send You E-mail Do More Work. Set people up with the expectation that you won\u0026rsquo;t reply unless it is actually worth your time. It is common for people to expect replies to something as mundane and pointless as a forwarded email with the single line \u0026ldquo;what are your thoughts?\u0026rdquo; prefixed to it. 
Start to make clear that you won\u0026rsquo;t respond to emails of this type. Raise the expectation that people need to communicate to you why you ought to reply to the email in the first place. This can be done by using a sender filter. On your website, where you list your email, you could communicate that you won\u0026rsquo;t respond unless the message fits your schedule and interests. By communicating what you will respond to ahead of time, you can reset people\u0026rsquo;s expectation.\nDo More Work When You Send or Reply to E-mails. This is related to the above. Just as you won\u0026rsquo;t reply to pointless emails, don\u0026rsquo;t write pointless emails. But also think ahead and make sure your emails include all of the necessary information. This avoids short, chat-style back-and-forths. As a particular example, this goes with one of my pet peeves: people sending meeting requests with absolutely no indication of what times they might be available. Don\u0026rsquo;t do that. If you want to meet with somebody, include why you want to meet and give several time windows. If your organization uses Outlook and basic calendar sharing, take a look at the other parties\u0026rsquo; calendars and make sure to only suggest times that aren\u0026rsquo;t already blocked off on their calendars. That way, finding a meeting time can happen in two or three emails as opposed to five. In more detail, he writes:\npause a moment before replying [to an email] and take the time to answer the following key prompt: What is the project represented by this message, and what is the most efficient (in terms of messages generated) process for bringing this project to a successful conclusion? Once you’ve answered this question for yourself, replace a quick response with one that takes the time to describe the process you identified, points out the current step, and emphasizes the step that comes next. 
I call this the process-centric approach to e-mail, and it’s designed to minimize both the number of e-mails you receive and the amount of mental clutter they generate.\nJust don\u0026rsquo;t respond. You can use this technique as a heuristic for gauging when to not respond to an email: Professorial E-mail Sorting: Do not reply to an e-mail message if any of the following applies:\n• It\u0026rsquo;s ambiguous or otherwise makes it hard for you to generate a reasonable response.\n• It\u0026rsquo;s not a question or proposal that interests you.\n• Nothing really good would happen if you did respond and nothing really bad would happen if you didn\u0026rsquo;t.\nReference Newport, C. (2016), Deep Work: Rules for Focused Success in a Distracted World, New York, NY: Grand Central Publishing.\n","permalink":"https://dmsenter89.github.io/post/23-04-takeaways-deep-work/","summary":"\u003cp\u003eI have recently read Cal Newport\u0026rsquo;s book \u0026ldquo;Deep Work\u0026rdquo; (2016). Overall, it is a short but engaging read discussing his tips for how to spend more time doing intellectually focused and engaging work in a society whose attention and focus is ever more divided. Below are my takeaways from the book.\u003c/p\u003e","title":"Takeaways from 'Deep Work'"},{"content":"Andrew Gelman\u0026rsquo;s statmodeling blog recently contained a link to an interesting document by Jonas Lindeløv. It tries to explain various statistical tests in terms of linear models. Here\u0026rsquo;s the chart from the post:\nThe summary chart from Lindeløv\u0026rsquo;s post.\nThe entire notebook can be viewed here, and Andrew\u0026rsquo;s comments are available here.\n","permalink":"https://dmsenter89.github.io/post/23-04-statistical-tests-as-linear-models/","summary":"Andrew Gelman\u0026rsquo;s statmodeling blog recently contained a link to an interesting document by Jonas Lindeløv. 
It tries to explain various statistical tests in terms of linear models.","title":"Statistical Tests as Linear Models"},{"content":"VS Code devcontainers are a great resource for creating reusable containers to share between developers on the same project. When properly set up, they automatically pass your SSH credentials to the container. When this is not set up, the git push/pull functionality in VS Code won\u0026rsquo;t work (you will still be able to make commits in the devcontainer and then push/pull from the CLI you launched Code with).\nThe way to do this is to use an SSH agent to forward your credentials. On most systems these aren\u0026rsquo;t started automatically, so for convenience you will probably want to add the startup to your .bash_profile or .bashrc.\nI have found the following useful; it includes a short check to make sure you aren\u0026rsquo;t running multiple ssh-agents in one session:\nSSH_ENV=\u0026#34;$HOME/.ssh/agent-environment\u0026#34; function start_agent { echo \u0026#34;Initialising new SSH agent...\u0026#34; /usr/bin/ssh-agent | sed \u0026#39;s/^echo/#echo/\u0026#39; \u0026gt; \u0026#34;${SSH_ENV}\u0026#34; echo succeeded chmod 600 \u0026#34;${SSH_ENV}\u0026#34; . \u0026#34;${SSH_ENV}\u0026#34; \u0026gt; /dev/null /usr/bin/ssh-add; } # Source SSH settings, if applicable if [ -f \u0026#34;${SSH_ENV}\u0026#34; ]; then . \u0026#34;${SSH_ENV}\u0026#34; \u0026gt; /dev/null ps -ef | grep ${SSH_AGENT_PID} | grep ssh-agent$ \u0026gt; /dev/null || { start_agent; } else start_agent; fi You can verify that your keys are working by running ssh-add -l from the VS Code terminal. This should print your host SSH key.\nSee also the official documentation.\n","permalink":"https://dmsenter89.github.io/post/23-04-sharing-ssh-keys-with-devcontainer/","summary":"\u003cp\u003eVS Code devcontainers are a great resource for creating reusable containers to share between developers on the same project. 
When properly set up, they automatically pass your SSH credentials to the container. When this is not set up, the git push/pull functionality in VS Code won\u0026rsquo;t work (you will still be able to make commits in the devcontainer and then push/pull from the CLI you launched Code with).\u003c/p\u003e","title":"Sharing SSH Keys With a Devcontainer"},{"content":"I recently read a great blog post by Jordan Tigani about Big Data. While Jordan\u0026rsquo;s post focuses on enterprise needs, I believe it contains relevant insights to individual researchers as well.\nIf you\u0026rsquo;ve had any exposure to the technology space in the past decade, you will have heard of big data. Advances in storage capabilities have unleashed a massive data collection effort across the board. Everyone was excited and assumed that soon we would be completely inundated by data. New methods and technology were needed so we won\u0026rsquo;t drown in all that data and lose out on important insights along the way.\nWhile the amount of data stored has grown significantly, it has not grown as massively as many had feared. Jordan mentions that most enterprises require approximately 100 GB of storage for their data warehousing needs. More interesting to me though is the mentioned difference between compute and storage needs. In his experience, the vast majority of compute queries used 100 MB or less of data - even for companies with terabytes of data in storage.\nIntuitively this should make sense. As data accumulate, a large portion of it will be historic and won\u0026rsquo;t need continuing re-analysis. There are also many techniques that have been developed to update models as data come in, so that you don\u0026rsquo;t need to re-run the entire modeling process each time you get new data, further reducing compute needs. A good example is large language models like ChatGPT. 
Training the model requires massive amounts of data and compute capabilities, but once the model is trained, storing and working with the computed weights is nearly trivial by comparison.\nSo much for the blog post. Let\u0026rsquo;s turn to the academic space. The rising popularity of big data and big data concepts has started spilling over into academic research outside of cs/math/stats as well. I encounter increasing numbers of academics in other fields referring to their data as \u0026ldquo;big\u0026rdquo; or aspiring to have \u0026ldquo;big data.\u0026rdquo; The numbers in Jordan\u0026rsquo;s post are a good reminder of what we are actually talking about when discussing big data and related concepts. Jordan quotes the definition that I tend to give researchers that aren\u0026rsquo;t familiar with the big data field: big data is \u0026ldquo;whatever doesn\u0026rsquo;t fit on a single machine.\u0026rdquo;\nUsing that definition, average researchers hoping for \u0026ldquo;big data\u0026rdquo; run into the issue of massive improvements in storage and RAM on consumer grade laptops. Hundreds of gigabytes of SSD storage are the norm and laptop reviews these days frequently complain if a laptop is equipped with less than 8 GB of RAM. Outgrowing those needs will be challenging for individual researchers or small teams of researchers in most fields, considering an Excel file with 360,000 observations and 90 variables is still only about 90 MB.1 If you\u0026rsquo;re able to use Excel, you are very, very far removed from having big data - even if your data set is significantly larger than what is typical in your field.\nBut that\u0026rsquo;s frankly not a bad thing. Most research doesn\u0026rsquo;t require big data. Acquiring quality big data is also very difficult. As is dealing with big data tools and infrastructure if you\u0026rsquo;re not used to it. And that\u0026rsquo;s not even touching on the analytical methodology relevant to big data. 
So most researchers who aren\u0026rsquo;t already in the big data field probably shouldn\u0026rsquo;t worry about it.\nReference Tigani, J. (2023), \u0026ldquo;Big Data is Dead\u0026rdquo;, MotherDuck, Available at motherduck.com/blog/big-data-is-dead.\nSee this quora answer. For illustration purposes only. Don\u0026rsquo;t use Excel for that many rows/columns. Just avoid Excel in general.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://dmsenter89.github.io/post/23-03-takeaways-big-data-is-dead/","summary":"\u003cp\u003eI recently read a great \u003ca href=\"https://motherduck.com/blog/big-data-is-dead/\"\u003eblog post\u003c/a\u003e by Jordan Tigani about Big Data. While Jordan\u0026rsquo;s post focuses on enterprise needs, I believe it contains relevant insights to individual researchers as well.\u003c/p\u003e","title":"Takeaways - 'Big Data Is Dead'"},{"content":"I\u0026rsquo;m excited to announce that the new SASPy v4.6.0 release includes a pull request of mine that adds PROC MI to the SAS/STAT procedures directly exposed in SASPy. This procedure allows you to analyze missing data patterns and create imputations for missing data.\nSyntax PROC MI is accessed via the mi function that has been added to the SASstat class. 
Like other procedures, the SAS statements in MI are called as keyword arguments to the function whose name matches the SAS syntax:1\nPROC MI options; BY variables; CLASS variables; EM \u0026lt;options\u0026gt;; FCS \u0026lt;options\u0026gt;; FREQ variable; MCMC \u0026lt;options\u0026gt;; MNAR options; MONOTONE \u0026lt;options\u0026gt;; TRANSFORM transform (variables\u0026lt;/ options\u0026gt;) \u0026lt;…transform (variables\u0026lt;/ options\u0026gt;)\u0026gt;; VAR variables; Here is the corresponding function signature in Python:\ndef mi(self, data: (\u0026#39;SASdata\u0026#39;, str) = None, by: (str, list) = None, cls: (str, list) = None, em: str = None, fcs: str = None, freq: str = None, mcmc: str = None, mnar: str = None, monotone: str = None, transform: str = None, var: str = None, procopts: str = None, stmtpassthrough: str = None, **kwargs: dict) -\u0026gt; \u0026#39;SASresults\u0026#39;: Statements like EM or MCMC, which can stand alone in SAS, are called with an empty string argument in Python.\nBasic Example To use the new MI functionality, make sure you have updated to the newest SASPy release. In addition to starting a SAS Session as per usual, you will also want to enable access to the SAS/STAT procedures:\nsas = saspy.SASsession() # loads a session using your default profile stat = sas.sasstat() # gives access to SAS/STAT procedures Once these session objects are loaded, you can start using the mi function with stat.mi. The simplest possible call is to invoke MI with a built-in data set and all defaults as stat.mi(data='sashelp.heart'). For best results, store the output in a SASResults object. From there you can access the SAS log associated with the function call (LOG) as well as all ODS Output using the ODS table names in all caps. 
The default uses the EM method with 25 imputations.\nA more realistic use might look something like this:\nods = stat.mi(data=\u0026#39;sashelp.heart\u0026#39;, em=\u0026#34;outem=outem\u0026#34;, var=\u0026#34;Cholesterol Height Smoking Weight\u0026#34;, procopts=\u0026#34;simple nimpute=20 out=imp\u0026#34;) This is equivalent to running\nproc mi data=sashelp.heart simple nimpute=20 out=imp; em outem=outem; var Cholesterol Height Smoking Weight; run; in SAS. This call uses the EM procedure to impute values for the cholesterol, height, smoking, and weight variables. The simple option displays univariate statistics and correlations. The outem option saves a data set containing the computed MLE to work.outem. The imputed data sets are saved to work.imp, which contains the additional variable _IMPUTATION_ with the imputation number. This can be used as a by variable in other procedures, and the results can later be pooled using PROC MIANALYZE.\nThe resulting ods object for our example exposes the following ODS outputs to your Python instance, in addition to the log:\n[\u0026#39;CORR\u0026#39;, \u0026#39;EMESTIMATES\u0026#39;, \u0026#39;EMINITESTIMATES\u0026#39;, \u0026#39;EMPOSTESTIMATES\u0026#39;, \u0026#39;MISSPATTERN\u0026#39;, \u0026#39;MODELINFO\u0026#39;, \u0026#39;PARAMETERESTIMATES\u0026#39;, \u0026#39;UNIVARIATE\u0026#39;, \u0026#39;VARIANCEINFO\u0026#39;] See the SAS documentation for details. To use the imputed data with Python tools, create a SAS data object. 
We\u0026rsquo;ll also print the first few entries so we can see what it looks like:\nimputed = sas.sasdata(table=\u0026#34;imp\u0026#34;, libref=\u0026#34;work\u0026#34;) imputed.head() One exception is the SAS class statement, which is implemented as cls due to class being a reserved keyword in Python.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://dmsenter89.github.io/post/23-02-proc-mi-added-to-saspy/","summary":"\u003cp\u003eI\u0026rsquo;m excited to announce that the new \u003ca href=\"https://github.com/sassoftware/saspy/releases/tag/v4.6.0\"\u003eSASPy v4.6.0\u003c/a\u003e release includes a pull request of mine that adds \u003ca href=\"https://go.documentation.sas.com/doc/en/statug/15.2/statug_mi_toc.htm\"\u003ePROC MI\u003c/a\u003e to the SAS/STAT procedures directly exposed in SASPy. This procedure allows you to analyze missing data patterns and create imputations for missing data.\u003c/p\u003e","title":"PROC MI Added to SASPy"},{"content":"ChatGPT has been all over my newsfeed lately, with a considerable amount of hype. In particular, many are wondering or even worrying whether the emergence of this technology will threaten jobs with moderate to high education requirements. See for example \u0026ldquo;How ChatGPT Will Destabilize White-Collar Work\u0026rdquo; (The Atlantic), where Annie Lowrey leads with \u0026ldquo;In the next five years, it is likely that AI will begin to reduce employment for college-educated workers.\u0026rdquo; I do not share these views. In fact, I am somewhat underwhelmed by the threat of ChatGPT for a number of reasons. Since this topic has come up a few times for me lately, I will write down my thoughts here so I can reference them more easily.\nChatGPT Cannot Think The first issue I take with many of the AI hype articles is that despite what the news coverage may imply, ChatGPT cannot think. 
To be honest, when I see articles talking about ChatGPT as \u0026ldquo;intelligent\u0026rdquo; or \u0026ldquo;thinking,\u0026rdquo; the first thing that comes to mind is this SMBC from 2011-08-17:\nIn my view, ChatGPT is a lot like this parrot - except that I do think it is fundamentally different, and ChatGPT is not \u0026ldquo;conscious\u0026rdquo; and does not \u0026ldquo;think\u0026rdquo; in a meaningful way. Despite the many advances made, artificial intelligence (AI) functions differently than a natural intelligence (NI), and in any case ChatGPT is not designed to \u0026ldquo;think.\u0026rdquo;\nBefore giving a big picture view of how a large language model like ChatGPT works, I want to illustrate the limited flexibility of AI with an example from image recognition. An NI can readily distinguish between what is in the foreground and background of an image. Think of an image like this:\nA human will have no problem distinguishing between the insect in the image, the flower it is on, and the foliage in the background as distinct objects in different planes. This holds true even if the individual is not familiar with the particular plant or insect in the image. If given additional images of either the insect on a different background or the same background without the insect, we would not mistake the plant for the insect or the other way around.\nNow consider an AI model trained to recognize insects. The algorithm doesn\u0026rsquo;t have a concept of \u0026ldquo;insect\u0026rdquo; or \u0026ldquo;plant,\u0026rdquo; per se. Rather, it notices patterns in images that are labeled \u0026ldquo;insect\u0026rdquo; or labeled with a particular insect. The pattern it learns does not depend on it having a concept of \u0026ldquo;insect.\u0026rdquo; What that means in practice is that our model might learn that the background is equally or even more important than the foreground. 
If we train our model with bees on flowers, but not flowers without bees, we may end up with a model that declares flower photos \u0026ldquo;bees.\u0026rdquo; This phenomenon is known in image recognition, and people are actively working on methods around this problem. But it nicely illustrates how AI is not \u0026ldquo;smart,\u0026rdquo; and humans need to do a lot of heavy lifting to get the AI algorithm to perform as intended, even if the application domain is relatively limited. For more information on this application, see this article from GradientScience.\nHow does it Work? With this background, let\u0026rsquo;s get an overview of how models like ChatGPT work. A good summary of the techniques involved is detailed in this post by AssemblyAI. In simple terms, a model is exposed to large amounts of data in order to learn about the structure of words and how they are aligned in sentences. In principle, this is not too different from the text prediction feature you have on your phone while texting. But this methodology only works to help produce coherent or seemingly coherent sentences by completion. Masked language modeling is a method used to help the model learn about syntax as well to improve the output.\nWhat is new with ChatGPT is that in addition to labeled training material, it utilizes human feedback to improve its output. Deep down, AI models can be thought of as optimizing some (very complicated) function. This goal function need not necessarily be written down explicitly. OpenAI uses a method where a model gives two possible outputs for a prompt, and then a human judges which is \u0026ldquo;better,\u0026rdquo; somewhat similar to when an optometrist asks you if \u0026ldquo;1\u0026rdquo; or \u0026ldquo;2\u0026rdquo; is better. It then uses this feedback to improve its output iteratively. 
See this blog post from OpenAI where they use this methodology to animate a backflip.\nChatGPT uses three steps for human feedback based reinforcement learning. 
You can already imagine some of the issues that can arise from using this method. For one, if human feedback is used to train the model, then we can expect the model to reflect the thoughts and opinions of the labelers to some degree. Labelers may be mistaken and might not be experts in whatever topic they are reviewing. They may be fundamentally mistaken or biased about what we would consider high school-level knowledge.1 This is on top of the issues of the large amount of source text used in the initial training phase. These source texts may vary wildly in style and accuracy. Even humans reviewing an article may not be able to distinguish facts from opinion, let alone a language model using many source texts as input. Which leads us to what I see as a main problem for ChatGPT.\nFactual Inaccuracies Despite the confidence exuded by ChatGPT\u0026rsquo;s output, it will readily produce a number of factual inaccuracies or give bad advice when explaining how to do tasks. See for example Avram Piltch\u0026rsquo;s \u0026ldquo;I Asked ChatGPT How to Build a PC. It Told Me to Break My CPU\u0026rdquo; (Tom\u0026rsquo;s Hardware), where ChatGPT gives instructions for a computer assembly that is potentially damaging to the hardware.\nOr this article (ToolGuyd) where Stuart asked ChatGPT to recommend a cordless power drill. ChatGPT made three recommendations. In explaining its recommendations, it gave several tech specs about the recommended products. The only problem is that it got several of these items wrong. It made mistakes about what type of drill a particular model was, whether the battery is included in the particular SKU it listed or not, and how many BPM the model delivers. It also recommended a discontinued model.\nAs a third example, consider this post where economics professor Bryan Caplan attempts to let ChatGPT take one of his more recent midterms. It\u0026rsquo;s quite detailed and includes the questions, answers, and grading rubric Bryan used. 
He gave ChatGPT a D on this exam, substantially below the average grade human students in the class received.\nI would like to highlight that my argument isn\u0026rsquo;t that ChatGPT gets everything wrong - it doesn\u0026rsquo;t. It can even perform exceptionally well at certain tasks. See this white paper by Christian Terwiesch grading ChatGPT\u0026rsquo;s attempt at the final exam of a Wharton Business School MBA core course for just one example. A little googling will quickly lead to other examples, such as it passing law school exams or giving decent answers to tech sector interview questions.\nMy concern is that it sounds very confident in its answers, but it is not always trivial for the average person to verify whether or not ChatGPT\u0026rsquo;s output is trustworthy. As Rupert Goodwin put it, ChatGPT is \u0026ldquo;a Dunning-Kruger effect knowledge simulator par excellence.\u0026rdquo; And that\u0026rsquo;s a problem if people decide to just trust it to produce truth, when ChatGPT has no idea what \u0026ldquo;truth\u0026rdquo; is. It\u0026rsquo;s important to know that OpenAI is aware of this and it even says so on its FAQ page:\nCan I trust that the AI is telling me the truth? a. ChatGPT is not connected to the internet, and it can occasionally produce incorrect answers. It has limited knowledge of world and events after 2021 and may also occasionally produce harmful instructions or biased content.\nWe\u0026rsquo;d recommend checking whether responses from the model are accurate or not. If you find an answer is incorrect, please provide that feedback by using the \u0026ldquo;Thumbs Down\u0026rdquo; button.\nIn my opinion this is reasonable and to be expected. I think some people may get too excited and feel too confident in this technology when it just isn\u0026rsquo;t as reliable as many would wish at this stage. 
And for those reasons, I don\u0026rsquo;t think it\u0026rsquo;s coming for our jobs any time soon.\nNote: If you use ChatGPT, be careful to not give it any sensitive information. OpenAI isn\u0026rsquo;t making this very expensive model available to you for free out of the goodness of their hearts. They\u0026rsquo;re using your interaction with it to further train the model.\nUpdate 3/21: There is a good article in the New Yorker regarding my point that ChatGPT doesn\u0026rsquo;t \u0026ldquo;think.\u0026rdquo; This is contra Daniel Miessler\u0026rsquo;s argument that ChatGPT and similar models exhibit \u0026ldquo;understanding.\u0026rdquo;\nUpdate 4/4: And here is a good post by Michael Huemer on this issue.\nFor a good review of the many ways in which typical adults are uninformed and mistaken about issues contra accepted expert opinion, see: B. Caplan, The Myth of the Rational Voter: Why Democracies Choose Bad Policies, Princeton University Press, Princeton, NJ, 2007. And B. Caplan, The Case against Education: Why the Education System Is a Waste of Time and Money, Princeton University Press, Princeton, NJ, 2019.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://dmsenter89.github.io/post/23-01-why-im-not-worried-about-chatgpt/","summary":"\u003cp\u003eChatGPT has been all over my newsfeed lately, with a considerable amount of hype. In particular, many are wondering or even worrying whether the emergence of this technology will threaten jobs with moderate to high education requirements. See for example \u003ca href=\"https://www.theatlantic.com/ideas/archive/2023/01/chatgpt-ai-economy-automation-jobs/672767/\"\u003e\u0026ldquo;How ChatGPT Will Destabilize White-Collar Work\u0026rdquo; (The Atlantic)\u003c/a\u003e, where Annie Lowrey leads with \u0026ldquo;In the next five years, it is likely that AI will begin to reduce employment for college-educated workers.\u0026rdquo; I do not share these views. 
In fact, I am somewhat underwhelmed by the threat of ChatGPT for a number of reasons. Since this topic has come up a few times for me lately, I will write down my thoughts here so I can reference them more easily.\u003c/p\u003e","title":"Why I'm Not Worried About ChatGPT"},{"content":"During December 2022, SAS ODA received substantial updates - see the upgrade page for details. It\u0026rsquo;s really nice to see that ODA is now using SAS 9.4M7. If you are a SASPy user, you may now bump into an error while logging in with your existing configuration. The specific error I encountered was \u0026ldquo;An exception was thrown during the encryption key exchange.\u0026rdquo; Nothing is wrong with your password, however. Due to changes with the AES encryption, SASPy will now need access to 3 encryption JARs in its classpath. See this note in the official SASPy docs. Download the required JAR files here (requires login) and add them to your SASPy package\u0026rsquo;s path here:\npath/to/python/site-packages/saspy/java/iomclient/ Make sure your JAR files are set to executable and you\u0026rsquo;ll be good to go again.\n","permalink":"https://dmsenter89.github.io/post/23-01-sas-oda-update-saspy-impact/","summary":"\u003cp\u003eDuring December 2022, SAS ODA received substantial updates - see the \u003ca href=\"https://support.sas.com/ondemand/upgrade2022.html\"\u003eupgrade page\u003c/a\u003e for details. It\u0026rsquo;s really nice to see that ODA is now using SAS 9.4M7. If you are a SASPy user, you may now bump into an error while logging in with your existing configuration. The specific error I encountered was \u0026ldquo;An exception was thrown during the encryption key exchange.\u0026rdquo; Nothing is wrong with your password, however. Due to changes with the AES encryption, SASPy will now need access to 3 encryption JARs in its classpath. 
See \u003ca href=\"https://sassoftware.github.io/saspy/configuration.html#attn-as-of-saspy-version-3-3-3-the-classpath-is-no-longer-required-in-your-configuration-file\"\u003ethis note\u003c/a\u003e in the official SASPy docs. Download the required JAR files \u003ca href=\"https://support.sas.com/downloads/package.htm?pid=2494\"\u003ehere\u003c/a\u003e (requires login) and add them to your SASPy package\u0026rsquo;s path here:\u003c/p\u003e","title":"Dec22 SAS ODA Update - Impact on SASPy Users"},{"content":"Dealing with computer resources in a modern lab can be tricky. Even if all participating researchers have laptops, a central location for storage or to host licensed software is desirable. While a physical computer can be set up for such a use, that is not always the most desirable solution. We want multiple people to have concurrent access to our resources while providing safe, sandboxed environments. Sometimes lab members want/need root access to learn certain tasks, but we don\u0026rsquo;t want them to accidentally take down our carefully configured systems. This leads us to the idea of containerization which can provide various failsafes. In this post, we will be setting up the LXD (ArchWiki) environment as a virtual lab computer. This solution gives an entire working system, including systemd, similar to but more lightweight than VirtualBox.\nSetting up a successful virtual computer is very similar to setting up a regular Linux machine, with some minor LXD overhead. Note that only the individual administering the containers will need to deal with that LXD overhead. From the point of view of the end user, it\u0026rsquo;ll look the same as if they were interacting with a \u0026ldquo;regular\u0026rdquo; computer. This post deals with the setup from the point of view of the admin. 
The lab members should be set up as users inside the container and can then SSH into the container or use VNC if a GUI is needed, similar to how they interact with a regular remote computer.\nInstall and Setup of LXD Container Setup Getting the Image Basic Container Management Setting up the Container Networking Giving the Container Access to the Internet Network Forwarding Getting a GUI Running Install and Setup of LXD LXD can be installed from a snap with sudo snap install lxd, but that requires you to have snap running. On Arch, LXD is available in the repos with pacman -S lxd. To get an RPM install in Fedora, you\u0026rsquo;ll need to use an additional COPR repository like this:\ndnf copr enable ganto/lxc4 dnf install lxd Once installed, you\u0026rsquo;ll need to either enable the lxd.socket or lxd.service (if you want instances to be able to autostart). You\u0026rsquo;ll want to modify the subuid and subgid files so you can run unprivileged containers (recommended), e.g.:\n# for root user and systemd: usermod -v 1000000-1000999999 -w 1000000-1000999999 root With this done, run lxd init to go through a configuration guide for your new setup. If this is your first time using LXD, you will likely be fine just using the default settings, except maybe the size of the storage pool - but you can always attach other storage to your containers later, so if you will run tasks producing a lot of data you might want to consider just mounting a dedicated filesystem later. If you have multiple computers available in your lab, you might want to consider turning on clustering (documentation).\nFor more details, see the ArchWiki or the Official Getting Started Guide.\nContainer Setup Getting the Image The first step in setting up a container is picking a suitable image to start from. Similar to Docker, many distributions are available to choose from. There are also arm and amd64 images available, so you can pick what works with your platform. 
To list available images on the image server, use the syntax lxc image list images:\u0026lt;keyword\u0026gt;.\n# ArchLinux images: lxc image list images:archlinux amd64 # Fedora images, using key/value pairs: lxc image list images:fedora arch=amd64 type=container To create a new container without starting it, use lxc init \u0026lt;image\u0026gt; \u0026lt;container-name\u0026gt;. To both initialize and start a new container, use lxc launch \u0026lt;image\u0026gt; \u0026lt;container-name\u0026gt;. For example:\n# create a container called myarch without starting it: lxc init images:archlinux myarch # you can also specify version and arch, e.g. Fedora 36 / 64bit: lxc init images:fedora/36/amd64 myfedora # create and launch a container: lxc launch images:rockylinux/9 myrocky Note that you can have a large number of concurrent containers in use, which may but need not share the same base image. This can be useful for larger teams, where you can set up systems for particular tasks or projects. For example, you could have a main machine for your graduate students, a separate one for people moving in and out of the lab like REU students, and a third container for a class that you\u0026rsquo;re teaching to use.\nBasic Container Management # Starting, stopping etc. is intuitive lxc start \u0026lt;container\u0026gt; # starts container lxc stop \u0026lt;container\u0026gt; [--force] # stops the container lxc restart \u0026lt;container\u0026gt; [--force] # restart lxc pause \u0026lt;container\u0026gt; # send SIGSTOP to all container processes # what containers do I have? 
lxc list lxc info myarch # get detailed info about this container lxc copy \u0026lt;name1\u0026gt; \u0026lt;name2\u0026gt; # make a copy of an existing container lxc delete \u0026lt;container\u0026gt; [--force] # edit container configuration lxc config edit \u0026lt;container\u0026gt; # launches config in VISUAL editor lxc config set \u0026lt;container\u0026gt; \u0026lt;key\u0026gt; \u0026lt;value\u0026gt; # change a single config item lxc config device add \u0026lt;container\u0026gt; \u0026lt;dev\u0026gt; \u0026lt;type\u0026gt; \u0026lt;key\u0026gt;=\u0026lt;value\u0026gt; lxc config show [--expanded] \u0026lt;container\u0026gt; In addition to these commands, you can also snapshot your containers. This creates a restorable copy of your container in case something bad happens - like someone typing rm -rf * into the wrong root shell. By default, snapshots are named in a numbered pattern snapX where X is an integer.\nlxc snapshot \u0026lt;container\u0026gt; \u0026lt;snap\u0026gt; # create new snapshot lxc restore \u0026lt;container\u0026gt; \u0026lt;snap\u0026gt; # restore container to snapshot lxc copy \u0026lt;container\u0026gt;/\u0026lt;snap\u0026gt; \u0026lt;new-container\u0026gt; # new container from snapshot lxc delete \u0026lt;container\u0026gt;/\u0026lt;snap\u0026gt; # delete the snapshot lxc info \u0026lt;container\u0026gt; # lists available snapshots, plus other info lxc move \u0026lt;container\u0026gt;/\u0026lt;snap\u0026gt; \u0026lt;container\u0026gt;/\u0026lt;new-snap\u0026gt; # rename snapshot Setting up the Container You can enter your container immediately with a root shell with lxc shell \u0026lt;container\u0026gt; and proceed with your regular setup, such as updating and installing packages, setting up new users, etc. To make this process more repeatable, you can also just move a setup script from the host to the container first, and then execute that script inside the container. 
That way you can have a record of what you did when you first set up the container.\n# drop into root shell lxc shell \u0026lt;container\u0026gt; # execute arbitrary command in container lxc exec \u0026lt;container\u0026gt; -- \u0026lt;program\u0026gt; [\u0026lt;options\u0026gt;] # move a file from host to container lxc file push /host/file \u0026lt;container\u0026gt;/path/on/container # move a file from container to host lxc file pull \u0026lt;container\u0026gt;/path/to/file /path/on/host # edit a file inside container lxc file edit \u0026lt;container\u0026gt;/etc/passwd See my earlier blog post for a list of some CLI tools I like to install on new systems.\nNetworking Giving the Container Access to the Internet The first step in container networking is to make sure your container can access the network. This may require your firewall to let traffic through on the default bridge. On an ArchLinux host, use\nufw route allow in on lxdbr0 ufw allow in on lxdbr0 while on a Fedora host you might use\nfirewall-cmd --zone=trusted --change-interface=lxdbr0 --permanent Network Forwarding At this point your container is available from the host on which the LXD service is running. But the whole point of the exercise is to make the container accessible from the lab members\u0026rsquo; various devices. I\u0026rsquo;ll present two options here for setting this up, depending on whether you need access from outside of your local network or not. Either way, make sure SSH is set up inside your container and you can SSH into the container from the host shell. Both methods rely on using the network forward feature built into LXD. See the documentation for details.\nFor network forwarding to work we need to know two things about our container: what device our container is using to connect to the internet; on a default setup, this will be lxdbr0 but check with lxc network list to be sure. 
The second item we need is the IP address of our container, which can be displayed with lxc list \u0026lt;container\u0026gt;.\nThe next item we need is an IP address to forward from. We can either get an IP address dedicated to the container, or hijack some ports from our host for re-routing.\nTo add a second IP to your existing network device, use the ip a command to find the device name (on your host) of the network device connected to your network. If you use wifi, this might be something like wlp4s0. Then pick an IP not otherwise assigned by the router and assign it to this device - in addition to the existing IP - using the following command:\nip -4 a add dev \u0026lt;device-name\u0026gt; \u0026lt;free-ip\u0026gt;/24 Note that this will only persist until the host reboots. You can then create a network forward on the container\u0026rsquo;s device (e.g., lxdbr0) with the newly assigned IP as the listening address. Using this command will let the container handle all incoming traffic to the new IP:\nlxc network forward create lxdbr0 \u0026lt;listening_address\u0026gt; You can then edit the target address with\nlxc network forward edit lxdbr0 \u0026lt;listening_address\u0026gt; and specify the container\u0026rsquo;s IP as the target_address.\nThe alternative method is to use the host\u0026rsquo;s IP as the listening address and then just forward particular ports to the container, e.g. port 22 for SSH or 590x for VNC servers. This way you skip creating the second IP above, and just start by creating and editing a network forward with the host IP as the listening address. The edit can then list the ports you want forwarded. 
Here\u0026rsquo;s an example of a valid file:\ndescription: Sample Forward config: {} ports: - description: ssh protocol: tcp listen_port: \u0026#34;10022\u0026#34; # any unused host port target_port: \u0026#34;22\u0026#34; target_address: \u0026lt;container-ip\u0026gt; - description: VNC servers protocol: tcp listen_port: 15901-15904 # any unused host ports (must be below 65536) target_port: 5901-5904 target_address: \u0026lt;container-ip\u0026gt; listen_address: \u0026lt;host-ip\u0026gt; Aside from these forwards, you may consider setting up a postfix server and associated forward so you can use the mail command to programmatically send emails to users. One great use case for this is the sending of log files after completion of long-running jobs. This keeps your users from needing to manually log in and check the status of their jobs. If you have used HPC services at your campus, you may have experienced the utility of this firsthand.\nNetwork forwarding options are explained in more detail in the documentation, which also contains a link to a short YouTube video demonstrating these commands in a shell session.\nGetting a GUI Running First, think about whether the tools you use require a GUI. A lot of research work can be done entirely within the command line or by using servers with particular software. So instead of installing a regular RStudio instance, you could install RStudio Server. Jupyter is already designed around the client/server model, as are RDBMS systems. If your team doesn\u0026rsquo;t feel comfortable with ViM and prefers VS Code, use the remote extension to use a VS Code server that can be opened up from your team members\u0026rsquo; local computers using SSH.\nIf you only need a GUI to use one GUI app at a time, say Mathematica/Matlab, then the simplest option will be to use X-forwarding via SSH. Make sure that X11Forwarding yes is set in your sshd_config file and restart the sshd service to turn it on. 
You\u0026rsquo;ll also need to install xorg-xauth on an ArchLinux container. From then on, connecting via SSH with the -X flag should work as desired.\nIf you need an entire desktop environment available, you can set up VNC or NoMachine the same way you would for a regular system. I have seen a lot of comments arguing for NoMachine being more performant, but the default TigerVNC on Arch/Fedora has worked sufficiently well for most of my needs.\n","permalink":"https://dmsenter89.github.io/post/23-01-virtual-lab/","summary":"\u003cp\u003eDealing with computer resources in a modern lab can be tricky. Even if all participating researchers have laptops, a central location for storage or to host licensed software is desirable. While a physical computer can be setup for such a use, that is not always the most desirable solution. We want multiple people to have concurrent access to our resources while providing safe, sandboxed environments. Sometimes lab members want/need root access to learn certain tasks, but we don\u0026rsquo;t want them to accidentally take down our carefully configured systems. This leads us to the idea of containerization which can provide various failsafes. In this post, we will be setting up the \u003ca href=\"https://wiki.archlinux.org/title/LXD\"\u003eLXD (ArchWiki)\u003c/a\u003e environment as a virtual lab computer. This solution gives an entire working system, including systemd, similar to but more lightweight than VirtualBox.\u003c/p\u003e","title":"Setting up a Virtual Lab Computer"},{"content":"Understanding whether a variable\u0026rsquo;s missingness from a dataset is related to the underlying value of the data is a key concept in the field of missing data analysis. We distinguish three broad categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). 
In his book Statistical Rethinking, McElreath1 gives an amusing example to illustrate this concept: he considers variants of a dog eating homework and how the dog chooses - if at all - to eat the homework. The examples he gives show substantial shifts in observed values, which make for a good illustration of the types of problems you might encounter. A lecture corresponding to the example from the book can be found on YouTube. In this post, I will first briefly review the different missing data mechanisms before implementing McElreath\u0026rsquo;s examples in SAS.\nOverview of Missing Data Mechanisms My presentation here follows van Buuren2. Let $Y$ be an $n \\times p$ matrix representing a sample of $p$ variables for $n$ units of the sample and $R$ be a corresponding $n \\times p$ indicator matrix, so that\n$$r_{i,j} = \\begin{cases} 1 \u0026amp; y_{i,j} \\text{ is observed} \\\\ 0 \u0026amp; y_{i,j} \\text{ not observed.}\\end{cases} $$\nWe denote the observed data by $Y_\\text{obs}$ and the missing data by $Y_\\text{miss}$ so that $Y=(Y_\\text{obs},Y_\\text{miss})$.\nWe distinguish three main categories for how the distribution of $R$ may depend on $Y$. This relationship is described as the missing data model. Let $\\psi$ contain the parameters of this model. The general expression of the missing data model is $\\mathrm{Pr}(R|Y_\\text{obs}, Y_\\text{miss}, \\psi)$.\nMissing Completely at Random (MCAR). This implies that the cause of the missing data is unrelated to the data itself. In this case,\n$$ \\mathrm{Pr}(R=0| Y_\\text{obs}, Y_\\text{miss}, \\psi) = \\mathrm{Pr}(R=0|\\psi).$$\nThis is the ideal case, but unfortunately rare in practice. 
Many researchers implicitly assume this when using methods such as list-wise deletion, otherwise known as complete case analysis, which can produce unbiased estimates of sample means if the data are MCAR, although the reported standard error will be too large.\nMissing at Random (MAR). Missingness is the same within groups defined by the observed data, so that\n$$ \\mathrm{Pr}(R=0| Y_\\text{obs}, Y_\\text{miss}, \\psi) = \\mathrm{Pr}(R=0|Y_\\text{obs},\\psi).$$\nThis is often a more reasonable assumption in practice and the starting point for modern missing data methods.\nMissing not at Random (MNAR). If neither the MCAR nor the MAR assumption holds, then we may find that missingness depends on the missing data itself, in which case the missing data model does not simplify and remains $$ \\mathrm{Pr}(R=0| Y_\\text{obs}, Y_\\text{miss}, \\psi).$$\nAs you can imagine, this is the trickiest case to deal with.\nDogs Eating Homework Consider dogs (D) eating students\u0026rsquo; homework. Each student\u0026rsquo;s homework score (H) is graded on a 10-point scale and each student\u0026rsquo;s score varies in proportion to how much they study (S). We assume the amount of time they study is normally distributed. A binomial is used to generate homework scores from the normed time spent studying. McElreath uses the following code to simulate the full data set:\nN \u0026lt;- 100 S \u0026lt;- rnorm( N ) H \u0026lt;- rbinom( N, size=10, inv_logit(S) ) where inv_logit(x) = exp(x)/(1+exp(x)), the definition used by the LOGISTIC function in SAS. 
With a data step, this can be represented in SAS as follows:\ndata full; DO i=1 to 100; S=RAND(\u0026#39;NORM\u0026#39;); H=RAND(\u0026#39;BINO\u0026#39;, LOGISTIC(S), 10); output; END; label S=\u0026#39;Amount of Studying\u0026#39; H=\u0026#39;Homework Score\u0026#39;; drop i; run; We can get closer in form to the R code by using PROC IML, but that\u0026rsquo;s a story for a different post.\nSay we are interested in estimating the relationship between $S$ and $H$. In our example, we assume that $H$ is not directly observable. Instead, $H^*$ is observed - a subset of the full data set $H$ with some homework values missing. We can now look at why some of those values are missing. Specifically, in McElreath\u0026rsquo;s example each student has a dog $D$ and sometimes the dog eats the homework. But here we can again ask, why is the dog eating the homework? McElreath uses directed acyclic graphs (DAGs) to represent different missing data models, reproduced below. As we will see, these are some intuitive examples for our three missing data mechanism categories.\nThe directed acyclic graphs corresponding to McElreath\u0026rsquo;s examples of missing data models. $S$ represents the amount of time spent studying, which in turn influences the homework score $H$, which is only partially observed (indicated by the circle). Alas, dogs $D$ eat some of the homework. The actually observed scores - those not eaten - are indicated by $H^*$. Adapted from figure 15.4 in Statistical Rethinking.\nMissing Completely At Random (MCAR) In the first example, the dogs eat homework completely at random. This is the most basic and benign case, and corresponds to DAG 1) in Figure 1. 
McElreath\u0026rsquo;s R code is given by\nD \u0026lt;- rbern( N ) Hm \u0026lt;- H # H*, but * is not a valid char for varnames in R Hm[D==1] \u0026lt;- NA We can implement this in SAS by using the RAND function with the Bernoulli argument in an if/else clause:\nif RAND(\u0026#39;BERN\u0026#39;, 0.5) then Hm = .; else Hm = H; This causes about half of our data to be hidden, but not in a biased way.\nMissing at Random (MAR) In the second example, we assume the amount of time a student spends studying decreases the amount of time they have to play with and exercise their dog. This, in turn, influences whether the homework gets eaten. Or, as McElreath puts it, the \u0026ldquo;dog eats conditional on the cause of homework.\u0026rdquo; In his particular example, the homework is eaten whenever a student spends more time studying than the average $S=0$. This corresponds to DAG 2) in Figure 1.\nD \u0026lt;- ifelse( S\u0026gt;0 , 1 , 0 ) Hm \u0026lt;- H Hm[D==1] \u0026lt;-NA In SAS:\nif S\u0026gt;0 then Hm = .; else Hm = H; Missing not at Random (MNAR) In this case, we have some correspondence between the missing variable\u0026rsquo;s value and whether or not it is missing from the data set. Here, the \u0026ldquo;dog eats conditional on the homework itself.\u0026rdquo; Suppose that dogs prefer to eat bad homework. In such a case, the value of $H$ is directly related to whether or not $H$ is observed in the particular unit or not. His example R code is as follows:\n# dogs prefer bad homework D \u0026lt;- ifelse( H\u0026lt;5 , 1 , 0 ) Hm \u0026lt;- H Hm[D==1] \u0026lt;- NA And in SAS:\nif H\u0026lt;5 then Hm = .; else Hm = H; The Full SAS Code We can now build a SAS data set that contains a full copy of the original data set, together with our various examples of missing data mechanisms. 
I have added a seed to the data step for reproducibility.\ndata full; CALL streaminit( 451 ); LABEL Type = \u0026#39;Missing Data Mechanism\u0026#39; S = \u0026#39;Amount of Studying\u0026#39; H = \u0026#39;Homework Score\u0026#39; Hm = \u0026#39;Observed Homework Score\u0026#39; ; DO i=1 to 100; TYPE = \u0026#39;FULL\u0026#39;; S = RAND(\u0026#39;NORM\u0026#39;); H = RAND(\u0026#39;BINO\u0026#39;, LOGISTIC(S), 10); Hm = H; output; /* Example 1) MCAR */ TYPE = \u0026#39;MCAR\u0026#39;; if RAND(\u0026#39;BERN\u0026#39;, 0.5) then Hm = .; else Hm = H; output; /* Example 2) MAR */ TYPE = \u0026#39;MAR\u0026#39;; if S\u0026gt;0 then Hm = .; else Hm = H; output; /* Example 3) MNAR */ TYPE = \u0026#39;MNAR\u0026#39;; if H\u0026lt;5 then Hm = .; else Hm = H; output; END; drop i; run; You may want to run a PROC SORT or PROC SQL afterwards to group the different categories together, as they will be alternating in this data set.\nResults Boxplot of our example data. Note that the MCAR data looks very similar to the original data set, unlike the MAR and MNAR versions.\nWe can see that MCAR leads to minimal bias in our example data, while both the MAR and MNAR variations lead to substantial differences in observed vs actual homework scores for our synthetic population. For a more subtle example, see section 2.2.4 in van Buuren,2 available online here.\nR. McElreath, Statistical Rethinking, 2nd ed, Chapman and Hall/CRC, Boca Raton, FL, 2020.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nS. van Buuren, Flexible Imputation of Missing Data, 2nd ed, Chapman and Hall/CRC, Boca Raton, FL, 2019.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://dmsenter89.github.io/post/23-01-missing-data-mechanisms/","summary":"\u003cp\u003eUnderstanding whether a variable\u0026rsquo;s missingness from a dataset is related to the underlying value of the data is a key concept in the field of missing data analysis. 
We distinguish three broad categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In his book \u003cem\u003eStatistical Rethinking\u003c/em\u003e, McElreath\u003csup id=\"fnref:1\"\u003e\u003ca href=\"#fn:1\" class=\"footnote-ref\" role=\"doc-noteref\"\u003e1\u003c/a\u003e\u003c/sup\u003e gives an amusing example to illustrate this concept: he considers variants of a dog eating homework and how the dog chooses - if at all - to eat the homework. The examples he gives show substantial shifts in observed values, which make for a good illustration of the types of problems you might encounter. A lecture corresponding to the example from the book can be found on \u003ca href=\"https://www.youtube.com/watch?v=oMiSb8GKR0o\u0026amp;list=PLDcUM9US4XdMROZ57-OIRtIK0aOynbgZN\u0026amp;index=18\"\u003eYouTube\u003c/a\u003e. In this post, I will first briefly review the different missing data mechanisms before implementing McElreath\u0026rsquo;s examples in SAS.\u003c/p\u003e","title":"Missing Data Mechanisms"},{"content":"There are certain CLI tools that I find myself installing whenever I set up a new system. I\u0026rsquo;m not talking about the general system setup, like installing vim or Python, but some drop-in replacements for older Linux tools and some CLI solutions that I use quite regularly. I thought I would collect them here for convenience. The headers are sorted alphabetically, except that \u0026ldquo;Other\u0026rdquo; is last because that seems most sensible.\n\u0026ldquo;Better\u0026rdquo; Drop-Ins Data and File Wrangling Development Resource Management Other \u0026ldquo;Better\u0026rdquo; Drop-Ins The following tools I use as \u0026ldquo;better\u0026rdquo; drop-ins for other commands. Instead of ls, I typically now use exa. Instead of grep, I tend to use ripgrep. Instead of find, I tend to use fd. For quick viewing of text files with syntax highlighting, I like bat over cat. 
As a git/diff pager with syntax highlighting I use delta. All of these tools are written in Rust. The old df command can be improved with duf, a Go implementation with output that\u0026rsquo;s nicer to read (and alternatively, outputs as JSON).\nData and File Wrangling Calculator A great CLI calculator implemented in C++ is Qalculate!. It is mainly intended to be run with a Qt or GTK GUI, but does include a CLI version that can be invoked with qalc. A Rusty alternative is rink. A benefit of rink is that if you are trying to convert incompatible units, it will make a suggestion for a transformation that makes each side compatible, which can help with dimensional analysis. A good Rusty calculator app without units but wide support of operations, including functions, is kalker.\nCSV and JSON CSV files are ubiquitous, and being able to manipulate them and get an overview of what is contained without needing to actually load them in Excel/Python/SAS/etc is very useful. I used to really like csvkit for that. The main drawback here is speed for large CSV files, due to it being implemented in Python. A much faster program written in Rust is qsv, successor to BurntSushi\u0026rsquo;s xsv. It has more features than csvkit, is faster, and seems more flexible.\nAnother cool Python program for interacting with CSV files is visidata. It is a CSV viewer that doubles as a spreadsheet program, allows you to make plots and statistics and just do a ton of different things with your file in-memory.\nThe C program jq aims to be the sed of working with JSON files.\nFile Managers There are a lot of file managers to choose from these days. There are lots of popular options like mc, ranger, nnn, but I tend to keep falling back on vifm.\nDevelopment If you need to benchmark something, try Rusty old Hyperfine. 
There is an interesting make alternative called just, which looks promising but I haven\u0026rsquo;t played with it yet.\nResource Management Disk Space There are several excellent tools here. One that can be found in most repos I\u0026rsquo;ve encountered is Ncdu, a disk usage analyzer with an ncurses interface written in C. It is reasonably fast and lets you interactively delete folders while you\u0026rsquo;re at it. A parallel implementation of the same idea but written in Go can be found with gdu. For non-interactive use, dust (du + rust) is available.\nSystem Status - General One of the first tools I usually install is htop. It is a widely available and fast process viewer written in C. You can use it to kill or renice a process interactively without needing to find its PID. An alternative to this is glances, \u0026ldquo;an eye on your system\u0026rdquo; written in Python. It has a lot more information, including disk usage, sensor temperatures, battery information (on laptops), etc. and can be extended with plugins. It can be used interactively on the CLI, but it also gives the option of running in client/server mode which is nifty.\nSimilar to glances, bottom is a Rust program giving general system information including plots, but it does not have quite the same range of information to it as glances does.\nSystem Status - Networking To see what is clogging up your internet pipes, try nethogs written in C++. A nice rusty alternative is bandwhich.\nOther Trying to figure out when it\u0026rsquo;s a good time to speak with a colleague in a different time zone? Install the Python package undertime.\nDon\u0026rsquo;t want to remember different package managers\u0026rsquo; syntax? 
Install pacapt and use ArchLinux\u0026rsquo; pacman syntax on your system instead.\n","permalink":"https://dmsenter89.github.io/post/22-12-some-cli-tools/","summary":"A few convenience CLI tools I find myself installing on new systems regularly.","title":"Some CLI Tools"},{"content":"Creating a minimum working example (MWE) is a relatively frequent task. It is no problem to share an MWE for a feature in SAS because a large number of example data sets are shipped and installed by default. But sometimes you need an MWE because you are having trouble accomplishing a particular task with particular input data. At that point, you will need to share the data or a subset thereof together with your code. In SAS Forums, the preferred way to do this is with a datastep using a datalines/cards statement. Writing these by hand can be tedious since the data source is not typically a datalines statement to begin with. I have previously seen a SAS macro that can be used to generate a datalines statement from a SAS data set, but can\u0026rsquo;t seem to locate it at the moment. The data source I personally encounter the most often in my work is either in CSV or Excel formats. Since the latter can easily be exported to CSV, I decided to write a program that generates a SAS data step given a CSV file.\nFor the implementation language I chose to use Go. I started learning about Go back in May when I implemented a simple CLI version of Wordle. Since then I have increasingly used Go to write various small tools at work. It has been a very enjoyable language to write in and distribution via GitHub is easy. If you have the Go toolchain installed, you can get the latest copy of csv2ds using\ngo install github.com/dmsenter89/csv2ds@latest The tool is very simple to use. Give it a CSV file or list of CSV files and it will generate a data step for each file using the CSV\u0026rsquo;s base name as the data set name. 
Variable names and the data set name are processed to comply with SAS\u0026rsquo; naming scheme. The tool will attempt to guess if a particular column is numeric or not. If a column is determined to not be numeric, the longest cell will be used to set that variable\u0026rsquo;s length via a length statement to prevent truncation.\nI often work with the csvkit suite of command-line tools. It\u0026rsquo;s a wonderful collection of Python programs that can import data into CSV, generate basic column statistics, and use grep and SQL to extract data from a CSV file, amongst other things. This collection is designed to allow you to pipe the output from one as input to the next. Consider this example. Csvcut is used to extract only certain columns from the file data.csv. Then csvgrep is used to keep only the data pertaining to one particular county. Then the data is sorted by the total_cost variable and displayed. I wanted my tool to be compatible with this suite, so if - is passed as the filename, csv2ds will read the contents of STDIN instead. Changing the above csvkit example by replacing csvlook with my tool will generate the corresponding SAS data set:\ncsvcut -c county,item_name,total_cost data.csv | csvgrep -c county -m LANCASTER | csvsort -c total_cost -r | csv2ds - At this point csv2ds is quite simple, but sufficient for my needs. Some minor intervention may be needed to make the data step template work for your data. Informats like DOLLAR are not recognized as numeric and minor edits would need to be made to the produced template.\nCheck out my new tool over on GitHub.\n","permalink":"https://dmsenter89.github.io/post/22-11-csv2ds/","summary":"CSV2DS is a new program I wrote in Go to help me create minimum working examples for SAS that can be shared as a single SAS script.","title":"CSV2DS"},{"content":"Windows Subsystem for Linux (WSL) is an important part of my daily workflow. 
Unfortunately, the main distro supplied by Windows is Ubuntu, which - for a variety of reasons - is not exactly my favorite distro. Luckily, WSL2 allows you to import an arbitrary Linux distro to use instead. I got the idea from an article (Dev.to) by Jonathan Bowman explaining how to get Fedora up and running in WSL2. This article summarizes the key points of Bowman\u0026rsquo;s post and includes information for my long time daily driver, Arch Linux.\nThe short of it is that you can import a root filesystem tarball into WSL2 from Windows terminal using the following command:\nwsl --import \u0026lt;distro-name\u0026gt; \u0026lt;distro-target-location\u0026gt; \u0026lt;path-to-tarball\u0026gt; Once imported, you can launch into your distro using wsl -d \u0026lt;distro-name\u0026gt;. The only question is how to get the root filesystem for this import step.\nThere are two options we can go with: using a pre-fabricated root filesystem (rootfs), or creating our own using Docker.\nUsing an Existing Root Filesystem Some distros publish these. For Arch Linux, you can find them on GitLab. Two main images are available: base and base-devel. The latter has the base-devel package group pre-installed.\nFor Fedora, you can head over to GitHub to get a copy of the rootfs. Note that for Fedora, the rootfs is merely part of the repo and not a separate release page. You\u0026rsquo;ll be able to pick your base version of Fedora by switching branches in the repository.\nThese rootfs images are usually compressed. Before you can use them with WSL2, the tarball needs to be extracted. 
The Arch Linux rootfs can be extracted with zstd and the Fedora rootfs can be extracted using 7z.\nMaking your Own Root Filesystem Docker allows you to export a container to a root filesystem tarball:\ndocker export -o \u0026lt;rootfs-name\u0026gt;.tar \u0026lt;container-name\u0026gt; Note that docker export operates on a container, so if you are starting from an image, create a container from it first (for example with docker create).\nArch Linux images are available from DockerHub. Available tags include the above mentioned base and base-devel. Fedora is also available on DockerHub and its tags include version numbers (e.g., 37 or 36).\nAdditional Setup Once you have imported the distro you only have a barebones system available. Likely only the root user is available, which is not ideal. You\u0026rsquo;ll want to install the packages you want to use and set up your own user in addition to root. If you are building your own rootfs using Docker, you can build everything interactively in your container by running docker run -it \u0026lt;image-name\u0026gt;:\u0026lt;tag\u0026gt; to drop into a shell and do all your setup there. Alternatively, you can create a Dockerfile with the basic setup and build an image from that.\nArch Linux Pacman won\u0026rsquo;t work out-of-the-box because it doesn\u0026rsquo;t ship with keys. You\u0026rsquo;ll need to run pacman-key --init first. Install your favorite software using pacman, e.g.\npacman -Syu exa htop vim User management and other common setup tasks are covered in the Arch Wiki\u0026rsquo;s General Recommendations. Key tasks include adding a new user:\nuseradd -m -G wheel $username Fedora Make sure to run dnf upgrade to get the latest version of your packages. You may need to install either the util-linux or util-linux-core packages in order to get the mount command working (used by WSL to mount the Windows filesystem). 
To be able to add a non-root user with a password you\u0026rsquo;ll need to make sure that passwd is installed.\nTo add a non-root user in Fedora, use\nuseradd -G wheel $username passwd $username # in interactive mode, you\u0026#39;ll type in your password here General Case In order to actually start the WSL instance as your non-root user, you\u0026rsquo;ll need to edit /etc/wsl.conf inside of your distro. If the user section doesn\u0026rsquo;t exist yet, you can just run\necho -e \u0026#34;\\n[user]\\ndefault = ${username}\\n\u0026#34; \u0026gt;\u0026gt; /etc/wsl.conf Those are the basics to get you up and running. Not everything will necessarily work smoothly out-of-the box as you may be missing some packages that you\u0026rsquo;re not aware of until you need them, but overall I\u0026rsquo;ve had a positive experience with this setup.\n","permalink":"https://dmsenter89.github.io/post/22-11-setup-an-arbitrary-wsl2-distro/","summary":"\u003cp\u003e\u003ca href=\"https://learn.microsoft.com/en-us/windows/wsl/\"\u003eWindows Subsystem for Linux\u003c/a\u003e (WSL) is an important part of my daily work flow. Unfortunately, the main distro supplied by Windows\nis Ubuntu, which - for a variety of reasons - is not exactly my favorite distro. Luckily, WSL2 allows you to import an arbitrary Linux distro\nto use instead. I got the idea from \u003ca href=\"https://dev.to/bowmanjd/install-fedora-on-windows-subsystem-for-linux-wsl-4b26\"\u003ean article (Dev.to)\u003c/a\u003e\nby Jonathan Bowman explaining how to get Fedora up and running in WSL2. This article summarizes the\nkey points of Bowman\u0026rsquo;s post and includes information for my long time daily driver, \u003ca href=\"https://archlinux.org/\"\u003eArch Linux\u003c/a\u003e.\u003c/p\u003e","title":"Setup an Arbitrary WSL2 Distro"},{"content":"One of the coolest packages for R is knitr. 
Essentially, it allows you to combine explanatory writing, such as a paper or blog post, directly with your analysis code in a Markdown document. When the target document is compiled (\u0026lsquo;knitted\u0026rsquo;), the R code in the document is run and the results inserted into the final document. The target document could be an HTML or a PDF file, for example. This is great for many reasons. You have a regular report you want to run, but the data updates? Just re-knit and your entire report is updated. No more separate running of the code followed by copying the results into whatever software you use to build the report itself. This makes it not just less cumbersome, but less error prone. It also improves reproducibility. Somebody wants to see your work, perhaps because they are unsure of your results or they want to extend your work? You can share the markdown file and the other party can see exactly what code was used to generate what part of your report or paper.\nWhile knitr is certainly not the first package that allows for this workflow, and also not the only one, I have found it to be the most consistent and easy to use.1 Luckily, knitr supports a variety of languages, including SAS. And you can even mix and match multiple languages in one document.2\nYou might think that this sounds similar to Jupyter notebooks. While that is true, and there is a Jupyter kernel for SAS as well, knitr has some advantages over Jupyter for report-generation. Without additional tools, you have the option to execute but not display the code that generates your results, making a cleaner report. You can also elect to only show part of the code, with manual setup code running behind the scenes without being printed to the report itself. Additionally, the entire document is executed linearly. 
That means that if you update a code chunk towards the beginning of your document, it affects the code chunks following it, while in Jupyter you easily get in the habit of executing the chunks independently which can lead to inconsistencies if you don\u0026rsquo;t pay attention to the cell numbers.3\nIn this post, I\u0026rsquo;ll demonstrate the basics of setting up a reproducible report using the SAS engine in knitr.\nInstall Perhaps the easiest way to get started for beginners is to use RStudio and Anaconda. With that you can create a sample R Markdown document (File -\u0026gt; New File -\u0026gt; R Markdown). Press the knit button. If any packages required by knitr are missing, RStudio will install them for you. This way you can be sure that all the R parts are set up correctly. Additionally, I recommend installing the SASmarkdown package with\n# from CRAN: install.packages(\u0026#34;SASmarkdown\u0026#34;) # from GitHub: devtools::install_github(\u0026#34;Hemken/SASmarkdown\u0026#34;) Once install is complete, load the package (library(SASmarkdown)) and check the output. If you see a message that SAS was found, you are good to go. If not, you will either need to add SAS to your PATH or simply provide the path to SAS as an option in your document (see below).\nA Basic Markdown File The important thing is to load the SASMarkdown package in your document. I recommend making a setup chunk at the very top of your document and setting include to FALSE. That way the setup chunk is executed, but not printed to your final document.\n```{r setup, include=FALSE} library(SASmarkdown) # if SAS is not in your path, define it manually: saspath \u0026lt;- \u0026#34;C:/Program Files/SASHome/SASFoundation/9.4/sas.exe\u0026#34; knitr::opts_chunk$set(engine=\u0026#34;sashtml\u0026#34;, engine.path=saspath) ``` With that, we\u0026rsquo;re ready to run a basic SAS chunk using just the SAS option. 
This produces the typewriter-style output that is familiar from Enterprise Guide for example:\n```{sas example1} proc print data=sashelp.class; run; ``` If we want to take advantage of the modern HTML output that is standard in SAS Studio, we use the sashtml engine instead:\n```{sashtml example2} /* if you want, you can set an ODS style for HTML output: */ ods html style=journal; proc print data=sashelp.class; run; ``` If you want graphical output, for example from SGPLOT, you\u0026rsquo;ll need to use the sashtml engine. To get the default blue look from SAS Studio, use the HTMLBLUE style:\n```{sashtml example3} ods html style=HTMLBLUE; proc sgplot data=sashelp.cars; scatter x=EngineSize y=MPG_CITY; run; ``` Some Additional Comments The first thing that is important to note is that each chunk is processed separately. That means each chunk should be written so as to be capable of being executed independent of the others. It is possible to get around this using the collectcode=TRUE chunk option. This chunk will then subsequently be executed prior to the code from a following chunk. So for example:\n```{sashtml save1, collectcode=TRUE} data sample; set sashelp.class; run; ``` And now use it again: ```{sashtml save2} proc means data=sample; run; ``` This is particularly useful for libnames and setting the preferred ODS style, so you don\u0026rsquo;t have to keep doing it again in each cell.\nThe other thing to note is that knitr for SAS works best with HTML output. It can use SAS styles and produce output looking like what you would expect running in SAS Studio. If you want PDF output, you can get nicer output using LaTeX Tagsets for ODS and the StatRep System.\nknitr itself was based on Sweave, but uses Markdown instead of LaTeX code. Other languages have similar packages, for example pweave for Python or Weave for Julia.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nThe chunks from different languages do not have access to each other\u0026rsquo;s data. 
To move data between the different engines, more setup work is needed.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nIf you code in Julia, there is an interesting new reactive notebook called Pluto that promises to always keep your cells in sync, while being geared towards a Jupyter-style workflow.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://dmsenter89.github.io/post/22-11-sas-markdown-for-reproducibility/","summary":"\u003cp\u003eOne of the coolest packages for R is \u003ca href=\"https://yihui.org/knitr/\"\u003eknitr\u003c/a\u003e. Essentially, it allows you to combine explanatory writing, such as a paper or blog post, directly with your analysis code in a Markdown document. When the target document\nis compiled (\u0026lsquo;knitted\u0026rsquo;), the R code in the document is run and the results inserted into the final document. The target document could\nbe an HTML or a PDF file, for example. This is great for many reasons. You have a regular report you want to run, but the data updates?\nJust re-knit and your entire report is updated. No more separate running of the code followed by copying the results into whatever\nsoftware you use to build the report itself. This makes it not just less cumbersome, but less error prone. It also improves reproducibility.\nSomebody wants to see your work, perhaps because they are unsure of your results or they want to extend your work? You can share the\nmarkdown file and the other party can see exactly what code was used to generate what part of your report or paper.\u003c/p\u003e","title":"SAS Markdown for Reproducibility"},{"content":"In a first semester probability course, students encounter combinatorics and point estimates such as the mean and median of a data set. A common example is the low odds of winning the lottery. 
When discussing the topic of point estimates, students are exposed to the idea of a \u0026ldquo;fair bet\u0026rdquo; or \u0026ldquo;fair game\u0026rdquo; - one in which the expected value of the random variable associated with the game is equal to the cost of participation or zero, depending on if a fixed cost is included in the game or tracked separately. This year, the Mega Millions had a jackpot in excess of one billion dollars. This had me thinking - mathematically, this is likely a fair game. But I still would expect to lose out playing it. In this article, I want to explore this idea further using the Mega Millions lottery as a particular example.\nMega Millions is played by choosing five numbers from 1 to 70 (the white balls) and one number from 1 to 25 (the golden \u0026ldquo;Mega Ball\u0026rdquo;). Five white balls (W) and one golden ball (G) are drawn without replacement twice per week. Prizes are earned by matching the drawn numbers. Payouts generally follow a fixed schedule for everything but the jackpot, at least outside of California where the payouts for all prizes are pari-mutuel instead. Below is a table of all possible events as given on the Mega Millions website, sorted by increasing odds.\nEvent Variable Value Odds 5 W + G $x_1$ Jackpot 1/302,575,350 5 W $x_2$ $1,000,000 1/12,607,306 4 W + G $x_3$ $10,000 1/931,001 4 W $x_4$ $500 1/38,792 3 W + G $x_5$ $200 1/14,547 2 W + G $x_6$ $10 1/693 3 W $x_7$ $10 1/606 1 W + G $x_8$ $4 1/89 G $x_9$ $2 1/37 No Match $x_{10}$ $0 24/1 The Fair Bet Analysis A fair bet or fair game is one in which the expected value of the random variable doesn\u0026rsquo;t favor either the player or the house. 
Given a cost of 2 USD per game, we can say that Mega Millions is fair when $E[X]=2$, or more specifically when\n$$E[X] = \\sum_{i=1}^{10} x_i P(X=x_i) = \\frac{n}{302,575,351} + \\sum_{i=2}^{10} x_i P(X=x_i) = 2$$\nI use Maxima to solve for the jackpot representing a fair game and to print a few representative values of the expected value for some jackpot options.\n/* Define Expectation as dependent on n */ E(n) := n/302575351 + 1000000/12607307 + 10000/931002 + 500/38793 + 200/14548 + 10/694 + 10/607 + 4/90 + 2/38; /* solve for fair game */ float(solve(E(n)=2,n)); /* give expected return for different jackpot values */ jackpots : [5e7, 1e8, 2.5e8, 5e8, 7.5e8, 1e9, 2e9]; for i in jackpots do printf(true, \u0026#34;E(~:D) = $~$ ~%\u0026#34;, i, E(i))$ From this we learn that to have a fair jackpot, we require $n = 531,123,698.80$. Even with a fair bet, the expected value is very modest. For example, a 2 billion USD jackpot has $E[X] \\approx 6.85$ - less than 5 USD above the ticket price.\nHow long until we Profit? Most people don\u0026rsquo;t play the lottery to win small amounts like 5 USD. They want to become millionaires. Given that our expected values are so low, let\u0026rsquo;s take a look at how long it will take us to become rich if we take the lottery game route.\nThe Geometric Distribution and Our Lottery Let\u0026rsquo;s start by considering only the events that would result in a win of a million dollars or more. In other words, events $x_1$ and $x_2$. We have\n$$P(x_1 \\vee x_2) = \\frac{315,182,658}{3,814,660,340,689,757} \\approx 8.262\\times 10^{-8}. $$\nIf we are only interested in this outcome, we can treat each game as a Bernoulli trial with $p=P(x_1 \\vee x_2)$. Then the number of games $G$ we need to play to win a million dollars or more follows a geometric distribution with $E[G] = 1/p$. For our specific case:\n$$ E[G] = \\frac 1 p = \\frac{3,814,660,340,689,757}{315,182,658} \\approx 12,103,014.$$\nRecall that two games are played per week. 
Converting this expected number of games to years, it would take approximately $115,977$ years for us to win. Even if one drawing were held each day, we would expect to take more than $33,000$ years to win.\nSince the geometric distribution has a closed-form CDF, $P(G \\le n) = 1 - (1-p)^n$, we can use it to estimate the number of games required for a certain likelihood of having a win of at least a million dollars. To have roughly 50% odds of winning, we need to play about $8,400,000$ games of Mega Millions. Note that in this case you would still likely be in the hole since the $1,000,000$ USD prize is nearly 24 times more likely than the main jackpot and each game costs 2 USD to play.\nSimulating a Lifetime of Playing At this point you might agree that the lottery is not a good get-rich-quick scheme. That alone doesn\u0026rsquo;t mean that you are all but guaranteed to lose money over a lifetime of playing. So let\u0026rsquo;s run some simulations and see what the distribution of our net worth is after taking everything into account. To make things as fair as possible, we will assume a constant jackpot of 750 million USD.\nLet\u0026rsquo;s say you spend 50 years playing the Mega Millions at 2 USD for one ticket at each of the two weekly drawings. That comes out to just about $2,609$ weeks or $5,218$ games for a total price of $10,436$ USD. I simulated $50,000$ individuals each playing $5,218$ games for a constant jackpot of $750,000,000$ USD - much higher than the typical jackpot and advantageous to the players. This will cost them each $10,436$ USD in ticket costs over the 50 years they play. 
Yet, despite the simulated lottery being rigged in the players\u0026rsquo; favor, 99% of my players win less than 600 USD total over this 50 year time period.\nStatistic Value ($) Mean -9,169.43 SD 20,985.05 Min -9,974.00 25% -9,912.00 50% -9,898.00 75% -9,890.00 99% -9,884.00 Max 1,990,204.00 From these results it is clear that for all but the luckiest few, even just saving the money under a mattress outperforms playing the lottery. You can explore a distribution plot of my simulation with Plotly here. Note that this page may take a moment to load due to the many data points. You will need to zoom in on the left-hand side to be able to really make anything out.\nImplementation Note The number of simulations grows quickly given the $5,218$ games we are using. Doing $50,000$ simulations of that many games requires over 260 million random draws. Prototyping in Python often makes sense because of the many features available for analysis and plotting, but this seems like an example where a compiled language might outperform by a considerable amount. I decided to try this out (GitHub).\nAll of the programs were written with an emphasis on simplicity over performance so as to avoid biasing the results. Since the different individuals play their games independently, I wrote both a single-threaded C++ version as well as one utilizing OpenMP\u0026rsquo;s parallel for loop. As alternative compiled languages I added implementations in Go and Rust.\nFor scripting languages I included Python and Julia. In Julia the main loop can trivially be set to run concurrently by prepending Threads.@threads to the for loop, so I included that as an option as well. This instructs Julia to run this loop with the available threads. 
By default this is one, but it can be set higher using an environment variable or by starting Julia with the -t flag and specifying the desired number of threads.\nI used hyperfine (GitHub) to benchmark the performance of my programs in WSL; see output below for details.\nCommand Mean [s] Min [s] Max [s] Relative C++ (Single Thread) 4.602 ± 0.087 4.514 4.828 1.88 ± 0.06 C++ (OpenMP) 2.653 ± 0.170 2.506 3.092 1.08 ± 0.07 Go 9.362 ± 0.058 9.258 9.417 3.83 ± 0.10 Julia 28.606 ± 0.372 28.198 29.520 11.69 ± 0.33 Julia (4 Threads) 19.016 ± 0.274 18.673 19.511 7.77 ± 0.23 Python 57.727 ± 0.530 59.783 ± 3.811 57.833 70.242 Rust 2.447 ± 0.062 2.391 2.579 1.00 I was surprised by Rust\u0026rsquo;s performance. I only looked up enough Rust to be able to implement this simple example, so I find it surprising that it can keep up with a multi-threaded C++ implementation.\nUpdates: This blogpost has been updated with new benchmark values. The original post did not include results in Rust.\n","permalink":"https://dmsenter89.github.io/post/22-09-lottery/","summary":"\u003cp\u003eIn a first semester probability course, students encounter combinatorics and point estimates such as the mean and median of a data set. A common example is the low odds of winning the lottery. When discussing the topic of point estimates, students are exposed to the idea of a \u0026ldquo;fair bet\u0026rdquo; or \u0026ldquo;fair game\u0026rdquo; - one in which the expected value of the random variable associated with the game is equal to the cost of participation or zero, depending on if a fixed cost is included in the game or tracked separately. This year, the Mega Millions had a jackpot in excess of one billion dollars. This had me thinking - mathematically, this is likely a fair game. But I still would expect to lose out playing it. 
In this article, I want to explore this idea further using the Mega Millions lottery as a particular example.\u003c/p\u003e","title":"Does it ever make sense to play the Lottery?"},{"content":"Most people can guess the current life expectancy for Americans at birth as being in the high 70s or around 80. In fact, given the current mortality table published by the Social Security Administration (SSA), males have a life expectancy of about 76 compared to a female life expectancy of about 81. Of course that is only an expected value. Guessing the distribution of a person\u0026rsquo;s life expectancy is somewhat more difficult. In this post, we\u0026rsquo;ll take a look at some simulated lives to get a feel for the distribution of life expectancy and its implications for retirement planning.\nLet\u0026rsquo;s begin by looking at our mortality table. The rows indicate an individual\u0026rsquo;s current age. For both a male and a female, three values are then given: the probability of death in a given year, the \u0026ldquo;Number of Lives\u0026rdquo;, and the life expectancy for this individual. The probability of death in a given year is somewhat self-explanatory. The \u0026ldquo;Number of Lives\u0026rdquo; variable starts with 100,000 individuals and gives the number of survivors at a given age. So for example, of the 100,000 males \u0026ldquo;born\u0026rdquo; at age 0, we expect 99,392 to be alive at age 1. The life expectancy is the expected number of years of life remaining for an individual. We can start by plotting this to get a feel for the data.\nSurvival curve for 100,000 males and females given the 2019 SSA mortality tables. The dashed line indicates the typical retirement age of 67.\nTo get a feeling for the distribution of age at death, I ran 10,000 simulations each for males and females starting at ages 0, 25, 40, 60, and 80. These ages were chosen to represent the full range of possibilities at birth, followed by early, mid- and late career individuals. 
Age 80 was included for comparison as an older retiree value. Since the probability of death by age 50 is so low, we expect very little difference for the first three ages, with differences becoming more pronounced as age progresses, but it is still useful to visualize.\nOverview of the distribution of age at death by sex for different ages at the beginning of the simulation. Outliers are not represented. Note how the results for ages 0 to 40 are nearly indistinguishable.\nAs expected, we see relatively little variation between birth and age 40, with some recognizable changes beginning at age 60. Given that, I will visualize the distribution for an individual starting at age 40. A 40 year old is about 25-30 years away from retirement and has probably at least started thinking about saving and how much they\u0026rsquo;ll need to put away to last through retirement.\nDistribution of age at death for males and females given a starting age of 40. Half of the starting population is expected to make it to at least 81/85 (Male/Female), and a quarter will make it at least to 88/91.\nNow that we have seen the distribution, let\u0026rsquo;s consider how long we\u0026rsquo;ll live past the typical retirement age of 67. The table below lists the ages by sex for the top percentiles given a starting age of 40.\nTop Percentiles - Age at Death Males Females 5% 95 98 10% 93 95 20% 89 92 30% 86 90 40% 84 87 50% 81 85 We see that 40% of females and 30% of males are expected to live at least 20 years past retirement age. A little more than 5% of females will make it thirty years past retirement, but only 2.5% of males will. While only a small minority of retirees will need to fund their retirement for thirty or more years, it is not unreasonable to target retirement funds to last until we reach age 90.\nUnfortunately, a large share of Americans have insufficient 401k balances to cover their expected longevity (see here, here, and here for some estimates of savings by age group). 
Many are likely relying on social security benefits to cover some of the difference. This system may not last that long, or at least not with current benefit levels. Social Security outlays have exceeded allocated revenues since 2010 and are currently expected to continue to do so well into the 2090s (see Table A-1). Social security trust fund balances for old-age and survivor benefits are rapidly declining. Between 2020 and 2030, the CBO expects a drop of 80% in this fund. Curiously, over the same time period the trust fund for military personnel is expected to grow by more than 70%, while the fund for civilian government employees is expected to grow by more than 20% (see CBO report). As such, younger Americans not working for the government will need to consider how to fund a multi-decade retirement in the face of potentially large reductions in social security benefits.\n","permalink":"https://dmsenter89.github.io/post/22-09-life-expectancy/","summary":"A look at the distribution of age at death based on social security mortality tables to see how long we can expect to be in retirement for.","title":"Life Expectancy Data"},{"content":"This post is a follow-up to my post on how to load data from Zillow. Housing prices have soared through the COVID-19 pandemic, leading to a lot of discussion about housing affordability. The quickly growing home values coupled with the subsequent raising of interest rates on mortgages are seeing more and more people priced out of the ability to purchase a home. While rent prices have increased as well, they haven\u0026rsquo;t increased as sharply as home prices.\nIn this post, I will look at the Zillow data set and consider a popular question for millennials - is it better to buy or rent in the current market? For this analysis, I will use the zip code level data of the Zillow Home Value Index (ZHVI) and the Zillow Observed Rent Index (ZORI) for North Carolina metropolitan areas in the years 2019 through June 2022. 
To evaluate the monthly costs of owning a home, I will utilize the 30 year fixed rate mortgage average from the FRED database, a comprehensive database aggregating various economic time series maintained by the St. Louis Federal Reserve.\nDisclaimer: This post is not financial advice. We are only exploring some aggregate data sets. Past performance is not indicative of future performance.\nInitial Thoughts Crafting a Mortgage-to-Rent Index A Naive Implementation A Slightly Better Implementation Further Thoughts Risk and Other Costs Housing as an Investment Last Thoughts Initial Thoughts Both Zillow data sets are available with monthly data published at the zip code level. As such, they are equally well-spaced in time. One issue we notice at initial inspection is that the ZHVI observations are dated to the last day of every month, while the ZORI observations are dated to the first of the month. For merging and comparison, we will set the ZHVI to the first of the month.\nBoth data sets are available for download at the zip code level. The ZORI data set is much smaller, however. It covers 2,453 distinct zip codes compared to the ZHVI\u0026rsquo;s 27,366 distinct zip codes.\nAn initial graph of average ZHVI and ZORI for North Carolina shows the dramatic growth in home values and the degree to which rent is lagging behind. Both y-axes have been scaled proportionally.\nAverage Zillow Home Value Index (ZHVI) and Zillow Observed Rent Index (ZORI) values for North Carolina. The left and right axes have been scaled proportionally.\nThe FRED data set is published weekly on Thursdays and provides a seasonally unadjusted look at the average, actual mortgage rates that homebuyers have received throughout the US. So while this data set has the largest number of data points in time, it is the least granular on a geographic level. I don\u0026rsquo;t have detailed information about the geographic variability in mortgage rates and will ignore this for now. 
To allow merging with the Zillow data, the FRED data set will need to be averaged in some form. For this post, I will use simple averages by month. It is important to note that the FRED data is not in decimal notation. That means that 3.5% is written as 3.5 as opposed to 0.035.\nOne thing that is interesting about the FRED data is that it goes back as far as 1971. Looking at the overall historic values, we see that the past decade\u0026rsquo;s very low interest rates (less than 5%) are an anomaly.\nCrafting a Mortgage-to-Rent Index I will create two indices, one being a very naive implementation focusing just on the actual monthly payment on the mortgage loan and the monthly rent payments. This is followed by an improved index that takes into account additional monthly expenses. After both indices have been created, I will discuss some of their shortcomings.\nA Naive Implementation For monthly cost comparisons, we don\u0026rsquo;t actually care about the value of the home directly. What matters is the monthly mortgage cost. There is a lot of variability here and I will make the following assumptions:\nThe mortgage originates in the month given by Date at the rate given by the monthly average of the Mortgage30US time series. A 20% down payment was made, so that the mortgage amount is for 80% of the ZHVI. A 20% down payment is often recommended because it avoids the need for mortgage insurance, which would create an additional monthly expense. Closing costs are handled separately and not rolled into the mortgage. Monthly mortgage payments can be calculated with the PMT function in SAS. The general syntax is PMT( rate, number-of-periods, principal-amount [, future-amount] [, type]). The rate is the APR divided by the number of periods in a year, in our case 12. The number of periods is 12 payments over 30 years, i.e. 360. The principal amount is 80% of ZHVI, while the future amount is zero (default value). 
SAS allows us to pick whether we want to use end of period (type=0) or beginning of period (type=1) payments. For the monthly amount the difference is small, so I will use the default end of period scheme. With that, our formula for the monthly payment is PMT(M30Rate/12, 360, 0.8*ZHVI), with M30Rate expressed in decimal notation.\nWe can now construct a simple unit-less ratio of monthly mortgage costs over monthly rent. If this index is greater than 1, it is cheaper to rent, while if it is less than 1 it is cheaper to purchase a home. Below is a figure demonstrating the distribution of this index in North Carolina for 2019 through June 2022.\nMean index value for North Carolina. The shaded region represents the 25th through 75th percentile of index values. Note that the index is \u0026lt;1 until 2022, despite the sharp increase in ZHVI across the state seen above.\nThe index indicates that the situation was favorable to home buyers until 2022, when it flips to being generally favorable to renters. FRED data indicate that the average mortgage rate didn\u0026rsquo;t substantially begin growing until winter 2021. This lends credence to the view that historically low interest rates have helped buffer would-be homeowners from increases in home value.\nIndividual zip codes\u0026rsquo; index values follow a pattern similar to the figure above, making it a good approximation to observed behavior.\nA Slightly Better Implementation The previous implementation underestimates the true monthly cost of home ownership. For one, homeowners are required to pay property taxes each year. While an individual doesn\u0026rsquo;t directly pay the local government on a monthly basis, funds for this purpose are typically collected in an escrow account so it is an ongoing cost. Property tax rates are set at the county and municipal levels and mapping them directly to the ZIP codes requires some work. The rates themselves are fixed fees for every $100 of assessed home value. 
For the purpose of taxation, the home value used is not the ZHVI, but the value assigned to the home by the county. In North Carolina these assessments must take place at least once every 8 years, but individual counties may choose to do more frequent assessments. As such, there can be a large discrepancy between the ZHVI and the assessed value. I have seen cases where the ZHVI is roughly double that of the county\u0026rsquo;s assessed value. Finding the average assessed value is easy for individual homes (Zillow displays it on its website), but finding an aggregate value that can be used for analysis is tricky. For ease of use I\u0026rsquo;ll assume that assessed value is 70% of ZHVI. I expect this measure to underestimate the assessed values of homes in 2019 but to be close to target in 2022.\nBased on the most recent effective tax rates published by the North Carolina Department of Revenue, the average tax rate is approximately 1.02%, with a minimum of 0.33% and a maximum of 1.7%.\nIn addition to taxes, escrow accounts will typically collect monthly insurance fees as well. The two main types of insurance included are homeowners insurance and mortgage insurance. Mortgage insurance is mandatory when purchasing a home with less than a 20% down payment and protects the lender from the borrower defaulting on the loan repayment. In the above example we assumed a 20% down payment, so mortgage insurance would be optional.\nThe cost of homeowners insurance varies wildly as can be seen in this article from NerdWallet. Their calculated average monthly cost for North Carolina is $142. Renters may be required to purchase renters insurance, but this is not a requirement for all properties and the cost is substantially lower than homeowners insurance. NerdWallet\u0026rsquo;s analysis gives a monthly average cost of $12 for North Carolina.\nMean adjusted index value for North Carolina. 
The shaded region represents the 25th through 75th percentile of index values.\nAs expected, the adjusted index shows the same overall behavior but is shifted up. This index gives a more complicated picture, in which the zip code starts mattering somewhat more in whether renting or buying is cheaper on a monthly basis.\nFurther Thoughts The above analysis was relatively simple, but a good excuse to take a look at the data made available by Zillow. This is not the end of the story, however. I have made several simplifications here, although on average I would argue they were in favor of home purchases. There are additional matters to consider, which I will take up below. We also only discussed aggregate data. Particular housing may offer other benefits, especially if you have knowledge of some likely future events. For example, a large tech company planning to move to an area tends to increase both rent and housing prices, so purchasing now may be beneficial by locking in today\u0026rsquo;s prices. We also haven\u0026rsquo;t considered whether there are differences in housing stock and location that may outweigh purely financial considerations.\nRisk and Other Costs Another item to consider is the question of risk in either scenario. A homeowner incurs somewhat larger risks of on-going expenses that we have conveniently ignored. Roofs, HVAC systems, etc. need maintenance and eventually will need to be repaired. Neither lasts forever, so each will need to be replaced at least once, if not more often, during the 30 years the house is being paid down. A majority of new construction and a large number of existing homes are part of a Home Owners Association (HOA). Being part of an HOA not only increases monthly cost due to HOA fees, but also exposes a homeowner to the risk of special assessments. 
These are one-time financial obligations on homeowners to cover some expense for the HOA.\nWhile a renter may be immune to costs such as these, they are not immune to changing rent prices. With a fixed-rate mortgage, on the other hand, the monthly mortgage payment stays locked at the current rate. Furthermore, while rent is not tax deductible, mortgage interest is. Under current guidelines, the interest paid on up to 750,000 USD of mortgage debt on your primary home can be deducted from your taxable income. This is especially valuable during the first 15 years of the loan, when interest accounts for the majority of the mortgage payment.\nHousing as an Investment One issue that we have ignored so far is the idea of home purchases as investments. Home prices have historically increased. Each year lived and mortgage paid on a property increases your share in the home\u0026rsquo;s value, otherwise known as equity. Some would argue that even if the monthly cost of homeownership is in excess of the cost of renting an equivalent home, it\u0026rsquo;s worth it in the long run as an investment in your financial future. Proponents may point to homeownership being a key component in long term wealth accrual.\nThe value of equity growth in wealth building needs to be weighed against investing the potential price differential. We are only talking about the difference in housing cost being utilized here, not about using additional funds to pay down a mortgage early. Paying down a mortgage early has not been an effective strategy for nearly 30 years.1 The S\u0026amp;P 500\u0026rsquo;s average annual return during the period of 1975-2021 has been 10.2%. Home values on the other hand increased annually by an average of only 4.5% between 1975 and today.2 This would indicate investment in the S\u0026amp;P 500 would be expected to outperform real estate investment for the average person over the long run.\nThe variability of the two is rather different, however. 
For comparison, we can show a distribution of annual returns from the S\u0026amp;P 500 compared to FRED\u0026rsquo;s All-Transactions House Price Index.3\nHistograms showing the distribution of annual returns of the All-Transactions House Price Index for North Carolina and the S\u0026amp;P 500 for 1976-2021.\nWhile this shows that we can expect greater payoff from stock market investments compared to real estate, this is not helpful if we have to pay for housing and don\u0026rsquo;t have excess funds to throw at the stock market. If the actual monthly cost of homeownership and rent are about equal and max out our available funds, then purchasing a home builds at least some equity, no matter how small, compared to no investment being made. Only when rent is lower than mortgage costs and fees does it make sense to start comparing rates of return.\nWe should never underestimate the power of compounding, however. Consider the following example: you have the option of purchasing a 400,000 USD home with an 80,000 USD (20%) down payment at 5% interest as a 30 year fixed rate mortgage. Assume no other costs. In total, this housing purchase will cost you approximately 698,000 USD over the course of 30 years. Assuming a 4.53% annual increase in housing value, this leaves you with about 1,553,000 USD in equity at the end. Given your total cost you have made a profit of approximately 855,000 USD. Now compare this to stock market returns, assuming the 10.22% annual growth we found above. Investing the down payment only, with no further payments made, would leave you with approximately 1,695,000 USD after 30 years. Matching the amount of equity in our example home after 30 years would require an initial investment of only 73,500 USD. If we only wanted to match the profit of about 855,000 USD with a one-time investment, we would require an initial investment of approximately 42,500 USD.\nDespite these superior returns from investment, we still need to live somewhere. 
I have already mentioned that unlike a mortgage payment, rent is not fixed over the 30-year timespan. How much should I expect renting to cost me? Let\u0026rsquo;s assume we invest the entire down payment and rent a home whose initial monthly rent is exactly the same as the mortgage payment would have been. Assuming rent increases once a year by 4.17%, we spend a total of nearly 1,190,000 USD on rent over the course of 30 years. Given these expenses, the profit from investing the 80,000 USD shrinks to a mere 500,000 USD.\nThis suggests a strategy of minimizing the down payment and investing the difference, if possible. One downside of this is that having a smaller down payment may lead to less favorable mortgage terms. In this second scenario, let\u0026rsquo;s assume that we make a 40,000 USD down payment on a 400,000 USD home with an increased, but still fixed, mortgage rate of 5.25% over 30 years coupled with a 40,000 USD initial investment in the S\u0026amp;P 500 growing at 10.22% annually with no additional investments made. Here we invest a total of approximately 756,000 USD into our home, reducing our profit on the home purchase to approximately 797,000 USD. But our initial 40,000 USD investment has accumulated to about 847,000 USD. This brings our total profit to approximately 1,644,000 USD in this scenario - nearly double the 855,000 USD profit in the 80,000 USD down payment scenario.\nLast Thoughts Overall, we see that the more precise we want to be, the more difficult it becomes to estimate an ideal strategy as there are a lot of moving pieces, not all of which are well approximated by a geographic average value. Given the very different variability in housing value increase compared to stock market returns, it may make sense to generate a large number of random walks to get a better feel for the distribution of outcomes after 30 years. Finally, this entire post has assumed that we remain living in the same home for the entire 30-year life of the mortgage. 
Many will want to move at some point during their next thirty years, perhaps to make room for a larger family, to down-size, or to follow a job opportunity. This creates additional costs and considerations for deciding between renting and owning a home.\nAverage mortgage rates have been below average annual return of the S\u0026amp;P 500 since the 1990s.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nAnnual returns of S\u0026amp;P 500 were calculated from data provided by Investopedia. Annual returns of housing prices were calculated from the All-Transactions House Price Index for North Carolina (FRED time series NCSTHPI).\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nI have chosen this measure for historic house price increases as this index is available on a quarterly basis starting in 1975, unlike Zillow data which only goes back a little more than 20 years. Both US wide data (USSTHPI) and state specific data (e.g., for North Carolina: NCSTHPI) are available from FRED. I will use the North Carolina data set for my comparison. An alternative measure to consider is the median sales price of houses sold in the South census region, available on a quarterly basis starting in 1965 from FRED (MSPS).\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","permalink":"https://dmsenter89.github.io/post/22-08-should-you-buy-or-rent/","summary":"\u003cp\u003eThis post is a follow-up to my post on how to load \u003ca href=\"/post/22-08-zillow-data/\"\u003edata from Zillow\u003c/a\u003e. Housing prices have soared through the COVID-19 pandemic, leading to a lot of discussion about housing affordability. The quickly growing home values coupled with the subsequent raising of interest rates on mortgages are seeing more and more people priced out of  the ability to purchase a home. 
While rent prices have increased as well, they haven\u0026rsquo;t increased as sharply as home prices.\u003c/p\u003e","title":"Is it better to buy or rent housing?"},{"content":"A big difference between compiled languages like C/C++ and scripting languages like JavaScript or Python is that compiled languages require an explicit compilation step prior to execution, in which the human-readable code is translated to machine code, the so-called \u0026ldquo;binary\u0026rdquo; of the code. Typically, a separate binary needs to be compiled for each target operating system and architecture. Compiling for your own machine is not a problem. The difficulty lies in creating binaries for machines that you don\u0026rsquo;t normally use; you might not have an extra Mac lying around just to compile your program for other Mac users. Compiling programs for an operating system or architecture other than the one you are working with is called cross-compiling. Cross-compiling allows a Linux developer to create binaries for Windows and Mac computers, for example.\nFor most languages, this requires installing additional development tools and increases the complexity of the compilation workflow. I have found Go to be a pleasant exception to this, because cross-compilation is built into the standard Go tools. There is no need to learn any additional build tools. All you need to learn about are some environment variables that you need to set when compiling.\nGo build tools know which system you are building for by checking the GOOS and GOARCH environment variables. If they are unset, the tools fall back to GOHOSTOS and GOHOSTARCH. In other words, to change the target OS/architecture for your build, all you have to do is set the GOOS and GOARCH variables during the build. So say you want to build a simple program hello.go for a Windows computer with the same architecture as your development machine. 
All you have to do is write\nGOOS=windows go build hello.go\ninstead of just go build hello.go and you\u0026rsquo;re good to go. This would produce a hello.exe binary you could copy to a Windows machine to run.\nTo check what combinations of GOOS and GOARCH are valid, run go tool dist list. To see which environment variables Go currently sees, run go env.\n","permalink":"https://dmsenter89.github.io/post/22-08-cross-compiling-with-go/","summary":"When using compiled languages like C/C++, Go, Rust, etc. a separate binary needs to be created for every OS and architecture the software is meant to run on. Cross-compilation can be daunting at first. In Go, it is built-in and straightforward to use.","title":"Cross Compiling With Go"},{"content":"Zillow is a well-known website widely used by those searching for a home or curious to find out the value of their current home. What you may not know is that Zillow has a dedicated research page. To make their website work optimally, they churn through tons of data on the American housing market. They share insights they gleaned via zillow.com/research. If you visit their research website you\u0026rsquo;ll notice they have a data page where you can download some really cool data sets for your own research. They even have an API with which you can load data directly, but you\u0026rsquo;ll have to register for access. In this post, we\u0026rsquo;ll look at how to load the CSV files that are available for direct download into SAS for analysis.\nThe CSV files can be downloaded here. In the example below, I\u0026rsquo;m working with the Zillow Home Value Index file for all homes, seasonally adjusted at the ZIP code level. The file is fairly large. It has data going from January 2000 through June 2022 in more than 27,000 rows of data and about 280 columns. Below is an image of the beginning of this file.\nWhen working with large CSV files, I find it useful to get a feel for them in the CLI with csvkit. 
This is especially important when importing with a SAS data step, because we need to know the number of columns and their order, amongst other things, for our code. To get an overview of the total number of columns and their contents, run\ncsvcut -n Zip_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv\nThe output is fairly long, so you may prefer piping to a pager. I don\u0026rsquo;t need all the different identifiers in the file, so I\u0026rsquo;m going to exclude those I won\u0026rsquo;t need and write the remaining columns to a separate, smaller CSV.\n# ignore these four columns which I won\u0026#39;t need\ncsvcut -C RegionID,SizeRank,RegionType,StateName Zip_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv \u0026gt; Zip_zhvi_small.csv\n# alternatively, also cut down on date columns to only 2022 for debugging\ncsvcut -C RegionID,SizeRank,RegionType,StateName,10-273 Zip_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_month.csv \u0026gt; Zip_zhvi_small.csv\nYou can also reduce the file size by using csvgrep to filter any of the columns. For example, if we only wanted the data for North Carolina we could run csvgrep -c State -m NC in the pipe.\nFor SAS, we need to know the maximum length of string columns so we can allocate the appropriate length to the corresponding SAS variables. This is easily done with the csvstat tool:\ncsvcut -c Metro,City,CountyName Zip_zhvi_small.csv | csvstat --len\nYou can also specify the list of columns in csvstat directly, but in my experience that tends to be slower.\nAlright, now we have everything we need to start on our DATA step! We start with the ATTRIB statement. One problem with importing this file is that everything is in wide format, with the dates used as headers. We will get around this shortly. I have seen people use transpose etc. for similar problems online, but this is unnecessary if we feel comfortable with the DATA step. We\u0026rsquo;ll start by naming the identifying columns just as in the CSV file. 
For the date columns, we will use a numeric range prefixed by date (date1-date270). You can use csvcut to find the exact number of date columns you have. We will also allocate the same number of columns for the ZHVI values, so we\u0026rsquo;ll need to add a val1-val270. This and the date variable are temporary and will be dropped later, in favor of the Date and ZHVI variables.\nattrib ZIP informat=best12. format=z5. State informat=$2. City informat=$30. Metro informat=$42. CountyName informat=$29. date1-date270 informat=YYMMDD10. format=DATE9. val1-val270 informat=best16. Date format=Date9. ZHVI format=Dollar16. ; Now we will allocate an array to hold all of the date and ZHVI values during the processing of each row. Since the date column won\u0026rsquo;t change, we\u0026rsquo;ll tell SAS to retain its values.\nretain date1-date270; array d(270) date1-date270; array v(270) val1-val270; This is where the magic happens now. You may not know it, but you are not limited to a single INPUT statement in a DATA step. We use this and start by reading in only the first row. Because we use an OUTPUT statement later, this reading of row 1 will be processed, but not saved into the output data set.\nif _n_ = 1 then do; input ZIP $ State $ City $ Metro $ CountyName $ date1-date270; PUT _ALL_; /* if you want to see what that looks like */ end; With this if clause, the date1 through date270 variables will be populated, and because we used a retain statement earlier, these values remain available to us during the processing of every other row. 
You can probably guess where this is going now: we will process each row, and then OUTPUT one line per date which we have access to now thanks to our array and the retain statement.\ninput ZIP $ State $ City $ Metro $ CountyName $ val1-val270; do i=1 to 270; Date = d(i); /* look up date for column i */ ZHVI = v(i); /* use the corresponding i-th value for ZHVI */ OUTPUT; /* This output creates one line per date column */ end; At the end of your data step, don\u0026rsquo;t forget to\ndrop i date1-date270 val1-val270; so those variables don\u0026rsquo;t clutter your data set. And that\u0026rsquo;s it! You now have the data set loaded and available in SAS.\n","permalink":"https://dmsenter89.github.io/post/22-08-zillow-data/","summary":"\u003cp\u003eZillow is a well-known website widely used by those searching for a home or curious to find out\nthe value of their current home. What you may not know is that Zillow has a dedicated research page.\nTo make their website work optimally, they churn through tons of data on the American housing market.\nThey share insights they gleaned via \u003ca href=\"https://www.zillow.com/research/\"\u003ezillow.com/research\u003c/a\u003e. If you\nvisit their research  website you\u0026rsquo;ll notice they have a data page where you can download some really\ncool data sets for your own research. They even have an API with which you can load data directly, but\nyou\u0026rsquo;ll have to register for access. In this post, we\u0026rsquo;ll look at how to load the CSV files that are\navailable for direct download into SAS for analysis.\u003c/p\u003e","title":"Loading Zillow Housing Data in SAS"},{"content":"Background Sudden death accounts for 10% of deaths in the United States. Prior research has focused on sudden death in older victims, leaving much unknown about risk factors for younger, working age adults. Compared to older adults, younger adults may be more vulnerable to genetic factors. 
Understanding age related differences in sudden death risk factors may guide future prevention efforts, as the factors contributing to sudden death in younger patients may warrant different types of prevention than those affecting older adults.\nMethods From 2013-2015, out-of-hospital deaths among adults aged 18-64 in Wake County, NC were screened and adjudicated to identify 306 sudden death victims. A comparison group of 1,113 patients matched for age, gender, and residence was formed by randomly sampling individuals from the Carolina Data Warehouse. For sudden death victims and controls, the prevalence of sudden death risk factors, comprising comorbidities, mental illnesses, and family history variables, were assessed in three age groups (18-41, 42-54, and 55-64 years old). Hypothesis testing was conducted for each variable across all pairwise combinations of age groups using an unpaired, two-sample proportion Z-test with $p\u0026lt;0.05$ as statistically significant.\nResults None of the variables analyzed were more prevalent in younger sudden death victims compared to older victims. 
However, family history variables were largely missing (\u0026gt;80%) compared to comorbidities and mental illnesses (\u0026lt;10%), with younger adults being disproportionately affected.\nConclusions Underreporting of family history in medical records leaves younger adults as a poorly understood subset of sudden death victims who are currently unable to be appropriately screened for lifesaving preventive measures.\nPublic Health Implications Better family history documentation may identify younger adults at higher risk of sudden death, allowing more comprehensive population level screening and implementation of prevention measures appropriate for addressing genetic factors instead of the environmental factors more common in older adults.\n","permalink":"https://dmsenter89.github.io/publication/doshi-2022-family-history/","summary":"\u003ch2 id=\"background\"\u003eBackground\u003c/h2\u003e\n\u003cp\u003eSudden death accounts for 10% of deaths in the United States. Prior research has\nfocused on sudden death in older victims, leaving much unknown about risk factors for younger,\nworking age adults. Compared to older adults, younger adults may be more vulnerable to genetic factors.\nUnderstanding age related differences in sudden death risk factors may guide future prevention efforts,\nas the factors contributing to sudden death in younger patients may warrant different types of prevention\nthan those affecting older adults.\u003c/p\u003e","title":"Family History And Chronic Medical Conditions Associated With Sudden Death Among Working Age Adults"},{"content":"Recently I\u0026rsquo;ve needed to capture the entries of a datalines statement in SAS for editing. Generally, this is a straightforward problem if I only need to do it with one file or all of the files that I am using are formatted identically. But then I started thinking about the more general case. SAS doesn\u0026rsquo;t care about the case of my keywords, so I need a case-insensitive match. 
I need to account for possible extra whitespace. So far so good. But what if I have two different keywords that can start my data section, and the end of the data section is indicated with different characters depending on the chosen keyword? Could I still use a single regular expression?\nSAS does in fact allow a number of different keywords to enter data in a data step. In my experience, the most common are the datalines and datalines4 statements. The main difference between them is how the end of the data is indicated. For datalines, a single semicolon is used, while datalines4 uses a sequence of four semicolons, thereby allowing the use of semicolons in the data itself. There are some aliases for these commands that can be used: cards/lines and cards4/lines4 with matching behavior. A simple data step with these statements could look like this:\ndata person;\ninput name $ sex $ age;\ndatalines; /* or `cards` or `lines` */\nAlfred M 14\nAlice F 13\n;\ndata person4;\ninput name $ sex $ age;\ndatalines4; /* or `cards4` or `lines4` */\nAlfred M 14\nAlice F 13\n;;;;\nWe could write two separate regular expressions, one for the datalines/cards/lines statement and a second one for the datalines4/cards4/lines4 statement. But, if the RegEx engine we are using allows conditionals, e.g. the Python RegEx engine, then we can write a single expression that can capture both types of statements. The basic format of the conditional capture is (?(D)A|B), which can be read as \u0026ldquo;if capture group D is set, then match A, otherwise match B.\u0026rdquo; For more details, see here.\nUsing this technique, we can capture both types of statements in one go. The short form of the solution I found is this regular expression: r\u0026quot;(?:(?:(?:data)?lines)|cards)(4)?\\s*;(.*?)(?(1);{4}|;)\u0026quot; with two flags set: case-insensitive and dot-all. 
If we utilize Python\u0026rsquo;s verbose flag, we can format this a bit more nicely as well:\nre.compile(\nr\u0026#34;\u0026#34;\u0026#34;(?:(?: # mark groups as non-capture groups\n(?:data)? # maybe match `data`, but don\u0026#39;t capture\nlines) # matches `lines`\n|cards) # alternatively, matches `cards`\n(4)? # a `4` may be present\n\\s*; # there might be whitespace before the ;\n(.*?) # lazy-match data content\n(?(1) # check if capture group 1 is set, if so\n;{4} # match `;;;;`\n|;) # otherwise, match a single ;\n\u0026#34;\u0026#34;\u0026#34;, flags=re.DOTALL | re.X | re.I)\nA great website to help you build up a regular expression is regex101.com. It allows you to paste in a sample text and a regular expression. It then explains your expression and lists the capture groups by number, which can be convenient. It also allows you to try out different RegEx engines. Try setting it to Python with the flags we mentioned, and see how it works!\n","permalink":"https://dmsenter89.github.io/post/22-05-conditional-regex-python/","summary":"When the endpoint of your match depends on an earlier term, try conditional regex matching in Python.","title":"Conditional RegEx Matching with Python"},{"content":"Lately I\u0026rsquo;ve been playing around with Go. I\u0026rsquo;ve read about Go for a few years and have been using some software written in Go (this website is built with Hugo), but never tried writing it myself before. So what better way to give Go a shake than to write some code? Since Wordle has been popular, I thought I\u0026rsquo;d write a very simple Wordle implementation in Go; you can check it out on GitHub. It\u0026rsquo;s been a good way for me to get familiar with some of the basics of Go, such as variables and their types, functions, etc. So far I\u0026rsquo;ve been enjoying it.\nThe Go website has very nicely written documentation and a package page. The Go Playground lets you test out Go in your browser without needing to install anything. 
I\u0026rsquo;ve also found Bodner\u0026rsquo;s \u0026ldquo;Learning Go\u0026rdquo; to be helpful.\nGo is a compiled language with a pretty picky compiler. It won\u0026rsquo;t let you compile code with unused imports or unused variable declarations, which helps keep your code clean. Cross-compilation is built in. While Go is not a common language in scientific computing, the gonum package has implemented a number of important functions and seems to be well developed. I look forward to learning more about Go in the future.\n","permalink":"https://dmsenter89.github.io/post/22-05-go-wordle/","summary":"Learning the basics of Go with a Wordle app.","title":"Wordle in Golang"},{"content":"I frequently find myself needing to concatenate data sets but also wanting to be able to distinguish which row came from which data set originally. Introductory SAS courses tend to teach the in= data set option, for a workflow similar to this:\ndata Concat1; set data1(in = ds0) data2(in = ds1); if ds0 then source = \u0026#34;data1\u0026#34;; else if ds1 then source = \u0026#34;data2\u0026#34;; run; With more than two input data sets, this can get unwieldy and repetitive. In an old blog post on Rick Wicklin\u0026rsquo;s DO LOOP, a better method is introduced - the indsname option. Using this method, the above code looks much nicer:\ndata Concat2; set data1-data2 indsname = source; /* the INDSNAME= option is on the SET statement */ libref = scan(source,1,\u0026#39;.\u0026#39;); /* extract the libref */ dsname = scan(source,2,\u0026#39;.\u0026#39;); /* extract the data set name */ run; As long as your input data sets are reasonably named, you\u0026rsquo;ll now have access to all the information needed.\n","permalink":"https://dmsenter89.github.io/post/22-04-sas-indsname-option/","summary":"\u003cp\u003eI frequently find myself needing to concatenate data sets but also wanting to be able to distinguish\nwhich row came from which data set originally. 
Introductory SAS courses tend to teach the \u003ccode\u003ein\u003c/code\u003e keyword,\nfor a workflow similar to this:\u003c/p\u003e","title":"The INDSNAME Option in SAS"},{"content":"In a previous post, I have shown how to connect to the Census API and load data with Python. In this post, I will do the same using SAS instead. Before we get started, two important links from last time: a guide to the API can be found here and a list of the available data sets can be accessed here.\nPicking the Data For this post, I\u0026rsquo;ll use the same data as last time. There we used the 2018 American Community Survey 1-Year Detailed Table and asked for three variables - total population, household income, and median monthly cost for Alamance and Orange counties in North Carolina (FIPS codes 37001 and 37135). The variable names are not very intuitive, so I highly recommend starting your code with a comment section that includes a markdown-style table of the variables that you want to use. Here is an example table for our data:\nVariable Label B01003_001E Total Population B19001_001E Household Income (12 Month) B25105_001E Median Monthly Housing Cost Building the Query The next step is to build the query. Like last time, the API consists of a base URL that points us to the data set we are looking for, a list of the variables we want to request, and a description of the geography for which we want to request those variables. Just like last time, I\u0026rsquo;ll build the query using several macros for flexibility purposes. 
Note that since \u0026amp; has a special meaning in SAS, we need to use %str(\u0026amp;) when referring to it to avoid having the log clobbered with warnings about unresolved macros.\n%let baseurl=https://api.census.gov/data/2018/acs/acs1; %let varlist=NAME,B01003_001E,B19001_001E,B25105_001E; %let geolist=for=county:001,135%str(\u0026amp;)in=state:37; %let fullurl=\u0026amp;baseurl.?get=\u0026amp;varlist.%str(\u0026amp;)\u0026amp;geolist.; %put \u0026amp;=fullurl; Your log should now show the full query URL:\nFULLURL=https://api.census.gov/data/2018/acs/acs1?get=NAME,B01003_001E,B19001_001E,B25105_001E\u0026amp;for=county:001,135\u0026amp;in=state:37 Making the API Request The API call is achieved with a simple PROC HTTP call using a temporary file to hold the response from the server.\nfilename response temp; proc http url=\u0026#34;\u0026amp;fullurl.\u0026#34; method=\u0026#34;GET\u0026#34; out=response; run; Handling the JSON Response We read the JSON response by utilizing the LIBNAME JSON Engine in SAS.\nlibname manual JSON fileref=response; Now run proc datasets lib=manual; quit;. You\u0026rsquo;ll see two data sets that were created: ALLDATA which contains the whole JSON file\u0026rsquo;s contents in a single data set, and ROOT which is a data set of all the root-level data. The latter one is the one we want. Here\u0026rsquo;s what the first few observations in each look like:\nFirst few observations in the automatically created data sets.\nJust like with Python, all columns are treated as character variables at first. Because of the way the Census API is structured, the first row consists of headers, which SAS didn\u0026rsquo;t use. This is something we\u0026rsquo;ll need to fix. 
At this point we have two main routes we can use to fix these issues - we can manually create a new data set from ROOT with PROC SQL and address the issues in that way, or we can take advantage of SAS\u0026rsquo; JSON map feature to define how we want to load the JSON when the LIBNAME statement is executed. There are good use cases for each, so I will show both methods.\nCleaning up via PROC SQL Using PROC SQL, you can rename all the character variables you want to keep. To change from character to numeric, you\u0026rsquo;ll use the input function. You can then assign formats and labels as desired. To get rid of the first row, you can just add a conditional having ordinal_root ne 1 to avoid loading that line.\nproc sql; create table census as select element1 as Name, input(element2, best12.) as B01003_001E format=COMMA12. label=\u0026#39;Total Population\u0026#39;, input(element3, best12.) as B19001_001E format=DOLLAR12. label=\u0026#39;Household Income (12 Month)\u0026#39;, input(element4, best12.) as B25105_001E format=DOLLAR12. label=\u0026#39;Median Monthly Housing Cost\u0026#39;, element5 as state, element6 as county from manual.root having ordinal_root ne 1; quit; Result from the PROC SQL method.\nA benefit of this method is that as you fix the input table, you can already begin to work with it thanks to the calculated keyword in PROC SQL. Say we weren\u0026rsquo;t actually interested in housing cost and household income, but instead would like to know what percent of their annual income a household spends on housing in a given county. We could just add a new variable to our PROC SQL call and build our table like this:\nproc sql; create table census as select element1 as Name, input(element2, best12.) as B01003_001E format=COMMA12. label=\u0026#39;Total Population\u0026#39;, input(element3, best12.) as B19001_001E format=DOLLAR12. label=\u0026#39;Household Income (12 Month)\u0026#39;, input(element4, best12.) as B25105_001E format=DOLLAR12. 
label=\u0026#39;Median Monthly Housing Cost\u0026#39;, /* Now calculate what we want from the new columns: */ 12*(calculated B25105_001E)/calculated B19001_001E as HousingCostPCT format=PERCENT10.2, element5 as state, element6 as county from manual.root having ordinal_root ne 1; quit; Using a JSON MAP Alternatively, we could change the way SAS reads the JSON data by editing the JSON map it uses to decode the JSON file. The first step is to ask SAS to create a map for us to edit:\nfilename automap \u0026#34;sas.map\u0026#34;; libname autodata JSON fileref=response map=automap automap=create; The map will look something like this: Beginning of the automatically created JSON map.\nNote that this is also a JSON file which you can edit in a text editor. With this map, you can change the names of the data sets and variables, assign labels and formats, and also re-format incoming data. Variables and data sets you don\u0026rsquo;t want to read can simply be deleted from the map. Here\u0026rsquo;s the beginning of my edited file: Beginning of my edited JSON map.\nSince the first row of observations in the JSON are actually a header and non-numeric, I add ? prior to the specified informat. This prevents errors in the log and simply replaces non-matching variables with missing values. We can now reload the JSON using our custom map by dropping the automap=create option from the LIBNAME statement:\nlibname autodata JSON fileref=response map=automap; When I now print the resulting data set, the header row is still there, but replaced by missing values in numeric columns: The data set as a result of the edited JSON map.\nThis means we\u0026rsquo;ll need to additionally drop this row in a separate step using a delete statement either in a PROC SQL or DATA step.\nWhichever method you choose, you now can access data via an API call from SAS. 
Happy exploring!\n","permalink":"https://dmsenter89.github.io/post/22-04-census-api-with-sas/","summary":"A post showing how PROC HTTP and LIBNAME JSON can be used to directly work with the Census API from SAS.","title":"Working with the Census API Directly from SAS"},{"content":"Background In the United States, former incarceration is a risk factor for chronic conditions and sudden death (SD) due to poor healthcare continuity after release and lack of community support. During the COVID-19 pandemic, all-cause mortality increased, and preexisting risk factors and social limitations of having an incarceration history were exacerbated. We hypothesized that sudden deaths among the formerly incarcerated increased during the pandemic.\nMethods North Carolina death certificates from pre-COVID-19 (2014) and during the COVID-19 pandemic (2020) were screened for presumed SD. Individuals were excluded based on age ($\u0026lt;18$ or $\u0026gt;65$), violent or expected deaths, and deaths in hospitals or care facilities. Deaths were matched to the North Carolina Department of Public Safety Criminal Offender Database for a history of incarceration. ICD-10 codes for hypertension, diabetes, chronic respiratory disease, obesity, mental health, and substance abuse were extracted from the top four causes of death on the death certificates.\nResults We found no significant difference in the prevalence of former incarceration in SD victims from 2014 to 2020. In 2020, the odds of substance abuse among SD victims with a history of incarceration were significantly greater compared to those without a history of incarceration (OR (95CI): 2.29 (1.91-2.73)). The odds of substance abuse among the formerly incarcerated were greater in 2020 SD victims compared to 2014 SD victims (2.29 (1.76-2.99)).\nConclusion Sudden death among the formerly incarcerated did not significantly increase during the COVID-19 pandemic. 
Increased public health funding may have limited the expected effects of COVID-19 on rates of sudden death in formerly incarcerated individuals. However, the formerly incarcerated appear to be vulnerable to dying of sudden death associated with substance abuse in 2020. Improving statewide transition programs targeting substance abuse counseling during healthcare crises should improve health outcomes and reduce the rate of sudden death among the formerly incarcerated population.\n","permalink":"https://dmsenter89.github.io/publication/raghunathan-2022-incarceration/","summary":"\u003ch2 id=\"background\"\u003eBackground\u003c/h2\u003e\n\u003cp\u003eIn the United States, former incarceration is a risk factor for chronic conditions and sudden death (SD)\ndue to poor healthcare continuity after release and lack of community support. During the COVID-19 pandemic,\nall-cause mortality increased, and preexisting risk factors and social limitations of having an incarceration\nhistory were exacerbated. We hypothesized that sudden deaths among the formerly incarcerated increased during\nthe pandemic.\u003c/p\u003e","title":"Former Incarceration As A Risk Factor For COVID-19 Associated Sudden Death"},{"content":"Background Housing insecurity is a powerful social determinant of health that is associated with increased all-cause mortality. The health consequences of and contributors to housing insecurity are poorly studied, which makes preventative care elusive for this population. In order to address these issues, we assessed the prevalence of housing insecurity among sudden death victims and examined its relationship to sudden death, mental illness, and clinical comorbidities.\nMethods From 1 March 2013 to 28 February 2015, out-of-hospital deaths in Wake County, North Carolina, were screened and adjudicated to identify 399 sudden deaths among residents between the ages of 18 and 64. 
A control sample of 1,101 living patients were generated by randomly sampling for age, gender, and Wake County residence from the Carolina Data Warehouse for Health. Housing status was abstracted from clinical records from sudden death victims and controls.\nResults Housing insecurity was documented in 28 (7.1%) of victims and 47 (4.3%) of controls (OR(95CI): 1.71(1.05-3.2)). This difference remained significant after adjusting for hypertension, age, and diabetes. However, when additionally adjusting for depression, anxiety, alcohol abuse, substance abuse, schizophrenia, and bipolar disorder, the increased prevalence of housing insecurity in the sudden death group became insignificant.\nConclusions Housing insecure individuals experience a higher burden of sudden death than housing secure individuals. Mental illness appears to confound the relationship between housing insecurity and sudden death. Treating underlying mental illness may be a path towards specializing clinical care to optimize health outcomes for housing insecure individuals.\nClinical Implications Optimize health outcomes for housing insecure individuals by acknowledging that housing insecurity appears to be a risk factor for sudden death - a relationship complicated by mental illness. Patients with housing insecurity should be screened for mental illness, and treatment and referral to a mental illness specialist should be considered.\n","permalink":"https://dmsenter89.github.io/publication/vrooman-2022-housing-insecurity/","summary":"\u003ch2 id=\"background\"\u003eBackground\u003c/h2\u003e\n\u003cp\u003eHousing insecurity is a powerful social determinant of health that is associated with increased all-cause mortality.\nThe health consequences of and contributors to housing insecurity are poorly studied, which makes preventative care\nelusive for this population. 
In order to address these issues, we assessed the prevalence of housing insecurity\namong sudden death victims and examined its relationship to sudden death, mental illness, and clinical comorbidities.\u003c/p\u003e","title":"Housing Insecurity: Effects on Sudden Death and Interaction with Mental Illness"},{"content":"I\u0026rsquo;ve recently needed to append several lines of data to a SAS data step that I collected and built via a shell script. For search-and-replace in bash I typically use sed, but this time I ran into a problem - sed does not like multiline shell variables. Thanks to Stack, I found a way to accomplish this task using awk instead.\nSuppose you have a file called data.sas with the following contents:\ndata person; infile datalines delimiter=\u0026#39;,\u0026#39;; input name :$10. dept :$30.; datalines4; John,Sales Mary,Accounting Theresa,Management Stewart,HR ;;;; run; Note that I am using a datalines4 statement so that I get an easy-to-identify target for the substitution. I want to insert a multiline shell variable before the ;;;; to add my data to this data step. Say I have the following variable:\nNEWDATA=$(cat \u0026lt;\u0026lt;-END Will,Compliance Sidney,Management END ) If I try to use sed (sed \u0026quot;s/\\;\\{4\\}/$NEWDATA\\n;;;;/\u0026quot; data.sas) I will get an error about an unterminated s command. Instead of sed, I can use awk with a variable to achieve the same goal:\nawk -v r=\u0026#34;$NEWDATA\\n;;;;\u0026#34; \u0026#39;{gsub(/;{4}/, r)}1\u0026#39; data.sas The one downside is that awk does not have an in-place option like sed, and if I try to redirect to the same file I\u0026rsquo;m reading from I get an empty file out. 
So you\u0026rsquo;ll have to rename the original file in your processing script to achieve an effect similar to the in-place option in sed.\nFor additional approaches, see this Stack Overflow question.\n","permalink":"https://dmsenter89.github.io/post/22-03-multiline-replacement/","summary":"\u003cp\u003eI\u0026rsquo;ve recently needed to append several lines of data to a SAS data step that I collected and built\nvia a shell script. For search-and-replace in bash I typically use sed, but this time I ran into a problem -\nsed does not like multiline shell variables. Thanks to Stack, I found a way to accomplish this task using awk instead.\u003c/p\u003e","title":"Multiline Bash Variable Replacement"},{"content":"I love using SASPy, but the setup can take a minute. I used to do the setup via the CLI until I started thinking I might be able to just do it straight from a Jupyter notebook. Having just a couple of cells in a Jupyter notebook makes for easy copy-and-paste and reduces setup time. The code below has been tested on both Windows and Linux. As a bonus, this also works on Google Colab.\nYou can easily install packages via pip from Jupyter either by using a shell cell (!) or by using the pip magic command: %pip install saspy. 
Once done, copy and paste the following into a code cell and run to create the sascfg_personal.py file:\nimport saspy, platform from pathlib import Path # get path for configuration file cfgpath = saspy.__file__.replace(\u0026#39;__init__.py\u0026#39;,\u0026#39;sascfg_personal.py\u0026#39;) # To pick the path for Java, we need to know whether we\u0026#39;re on Windows or not if platform.system()==\u0026#39;Windows\u0026#39;: print(\u0026#34;Windows detected.\u0026#34;) javapath = !where java authfile = Path(Path.home(),\u0026#34;_authinfo\u0026#34;) else: javapath = !which java authfile = Path(Path.home(),\u0026#34;.authinfo\u0026#34;) # the `!` command returns a string list, we want only the string javapath = javapath[0] print(f\u0026#34;Java is present at {javapath}\u0026#34;) # US home Region configuration string set up via string-replacement. # For other server addresses, see https://support.sas.com/ondemand/saspy.html cfgtext = f\u0026#34;\u0026#34;\u0026#34;SAS_config_names=[\u0026#39;oda\u0026#39;] oda = {{\u0026#39;java\u0026#39; : \u0026#39;{repr(javapath).strip(\u0026#34;\u0026#39;\u0026#34;)}\u0026#39;, #US Home Region \u0026#39;iomhost\u0026#39; : [\u0026#39;odaws01-usw2.oda.sas.com\u0026#39;,\u0026#39;odaws02-usw2.oda.sas.com\u0026#39;,\u0026#39;odaws03-usw2.oda.sas.com\u0026#39;,\u0026#39;odaws04-usw2.oda.sas.com\u0026#39;], \u0026#39;iomport\u0026#39; : 8591, \u0026#39;authkey\u0026#39; : \u0026#39;oda\u0026#39;, \u0026#39;encoding\u0026#39; : \u0026#39;utf-8\u0026#39; }}\u0026#34;\u0026#34;\u0026#34; # write the configuration file with open(cfgpath, \u0026#39;w\u0026#39;) as file: file.write(cfgtext) print(f\u0026#34;Wrote configuration file to {cfgpath}\u0026#34;) print(f\u0026#34;Content of file: \\n```\\n{cfgtext}\\n```\u0026#34;) Optionally, you can set up an authentication file with your username and password. 
Without this file, you\u0026rsquo;ll be prompted for your username and password each time you log in.\n# change variables to match your username and password omr_user_id = r\u0026#34;max.mustermann@sample.com\u0026#34; omr_user_password = r\u0026#34;K5d7#QBPw\u0026#34; with open(authfile, \u0026#34;w\u0026#34;) as file: file.write(f\u0026#34;oda user {omr_user_id} password {omr_user_password}\u0026#34;) And that\u0026rsquo;s it! You\u0026rsquo;re now ready to connect to SASPy. In my experience you don\u0026rsquo;t even need to restart the kernel to begin work with SAS on ODA. You can try the following snippet in a new cell:\n# starts a new SAS session with the `oda` configuration we set up sas_session = saspy.SASsession(cfgname=\u0026#39;oda\u0026#39;) # load a SAS data set and make a scatter plot cars = sas_session.sasdata(\u0026#39;cars\u0026#39;, \u0026#39;sashelp\u0026#39;) cars.scatter(x=\u0026#39;msrp\u0026#39;, y=\u0026#39;horsepower\u0026#39;) # directly run SAS code to print a table sas_session.submitLST(\u0026#34;proc print data=sashelp.cars(obs=6); run;\u0026#34;) # quit SAS connection sas_session.endsas() ","permalink":"https://dmsenter89.github.io/post/22-03-saspy-setup/","summary":"\u003cp\u003eI love using SASPy, but the setup can take a minute. I used to do the setup via the CLI until I\nstarted thinking I might be able to just do it straight from a Jupyter notebook. Having just a\ncouple of cells in Jupyter notebook makes for easy copy-and-paste and reduces setup time. The code\nbelow has been tested on both Windows and Linux. 
As a bonus,\nthis also works on  \u003ca href=\"https://colab.research.google.com/\"\u003eGoogle Colab\u003c/a\u003e.\u003c/p\u003e","title":"Easy SASPy Setup from Jupyter"},{"content":"Sometimes we have to deal with manually entered data, which means there is a good chance that the data needs to be cleaned for consistency due to the inevitable errors that creep in when typing in data, not to speak of any inconsistencies between individuals entering data.\nIn my particular case, I was recently dealing with a data set that included manually calculated ages that had been entered as a complete string of the number of years, months, and days of an individual. Such a string is not particularly useful for analysis and I wanted to have the age as a numeric variable instead. Regular expressions can help out a lot in this type of situation. In this post, we will look at a few representative examples of the type of entries I\u0026rsquo;ve encountered and how to read them using RegEx in SAS.\nLet\u0026rsquo;s Look at the Data What we\u0026rsquo;re starting from.\nIf we look at our sample data, we notice a few things. The data is consistently ordered from largest to smallest, in the order of year, month, and day. For some lines, only the year variable is available. In all cases, the string starts with two digits.\nSeparation of the time units is inconsistent; occasionally they are separated by commas, sometimes by hyphens, and in some cases by spaces alone. The terms indicating the units are spelled and capitalized inconsistently as well. There are some abbreviations and occasionally the plural \u0026rsquo;s\u0026rsquo; in days is wrapped in parentheses.\nIf you want to follow along, you can create the sample data with the following code:\ndata raw; infile datalines delimiter = \u0026#39;,\u0026#39; MISSOVER DSD; attrib ID informat=best32. format=1. STR_AGE informat=$500. format=$500. label=\u0026#39;Age String\u0026#39; VAR1 informat=best32. 
format=1.; input ID STR_AGE $ VAR1; datalines; 1,\u0026#34;62 Years, 5 Months, 8 Days\u0026#34;,1 2,43 Yrs. -2 Months -4 Day(s), 2 3,33 years * months 24 days, 1 4,58,1 5,\u0026#34;47 Yrs. -11 Months -27 Day(s)\u0026#34;,2 ; run; The RegEx Patterns We will use a total of three regex patterns, one for each of the time units: year, month, day. SAS uses Perl regular expressions and the function prxparse to define the regex patterns that are supposed to be searched for.\nFor the year variable, we need to match the first two digits in our string. Therefore, the correct call is prxparse('/^(\\d{2}).*/'). Note that the ( and ) delimit the capture group.\nThe month and day regex patterns are very similar. For the months, we want to lazy-match until we hit one or two digits followed by any single character, an \u0026rsquo;m\u0026rsquo;, and some number of other characters. We use the i flag since we cannot guarantee capitalization: prxparse('/.*?(\\d{1,2}).M.*/i'). The day pattern is nearly identical: prxparse('/.*?(\\d{1,2}).D\\D*$/i').\nWe can extract our matches using the prxposn function. We use the prxmatch function to check if we actually have a match:\n/* match into strings */ if prxmatch(year_rxid, STR_AGE) then year_dig_str = prxposn(year_rxid,1,STR_AGE); if prxmatch(month_rxid, STR_AGE) then month_dig_str = prxposn(month_rxid,1,STR_AGE); if prxmatch(day_rxid, STR_AGE) then day_dig_str = prxposn(day_rxid,1, STR_AGE); The extracted strings can then be converted to numeric variables using the input function.\nThe last step is the calculation of the age from the three components. Since not all three time units are specified for every row, we cannot use the standard arithmetic of years + months + days, because the missing values would propagate. 
We need to use the sum function instead.\nPutting it all together, we get the correct output:\nThe Result\nComplete Code data fixed; set raw; /* define the regex patterns */ year_rxid = prxparse(\u0026#39;/^(\\d{2}).*/\u0026#39;); month_rxid = prxparse(\u0026#39;/.*?(\\d{1,2}).M.*/i\u0026#39;); day_rxid = prxparse(\u0026#39;/.*?(\\d{1,2}).D\\D*$/i\u0026#39;); /* match 1-2 digits, any one character, then D and non-digit chars */ /* make sure we have enough space to store the extraction */ length year_dig_str month_dig_str day_dig_str $4; /* match into strings */ if prxmatch(year_rxid, STR_AGE) then year_dig_str = prxposn(year_rxid,1,STR_AGE); if prxmatch(month_rxid, STR_AGE) then month_dig_str = prxposn(month_rxid,1,STR_AGE); if prxmatch(day_rxid, STR_AGE) then day_dig_str = prxposn(day_rxid,1, STR_AGE); /* use input to convert str -\u0026gt; numeric */ years = input(year_dig_str, ? 12.); months = input(month_dig_str, ? 12.); days = input(day_dig_str, ? 12.); /* Use SUM function when calculating age to avoid missing values propagating */ age = sum(years,months/12,days/365.25); /* get rid of temporary variables */ drop month_rxid month_dig_str year_rxid year_dig_str day_rxid day_dig_str; run; proc print data=fixed; run; ","permalink":"https://dmsenter89.github.io/post/21-09-sas-regex-date-cleanup/","summary":"\u003cp\u003eSometimes we have to deal with manually entered data, which means there is a good chance that the data needs to be cleaned for consistency due to the\ninevitable errors that creep in when typing in data, not to speak of any inconsistencies between individuals entering data.\u003c/p\u003e","title":"Cleaning up a Date String with RegEx in SAS"},{"content":"I find myself needing to import CSV files with a relatively large number of columns. In many cases, proc import works surprisingly well in giving me what I want. 
But sometimes, I need to do some work while reading in the file and it would be nice to just use a data step to do so, but I don\u0026rsquo;t want to type it in by hand. That\u0026rsquo;s when a combination of proc import and some regex substitution can come in handy.\nFor the first step, run a proc import, like this sample code that is provided by SAS Studio when you double-click on a CSV file:\nFILENAME REFFILE \u0026#39;/path/to/file/data.csv\u0026#39;; PROC IMPORT DATAFILE=REFFILE DBMS=CSV OUT=WORK.IMPORT; GETNAMES=YES; RUN; If you run this code, you will see that SAS generates a complete data step for you. This is what the beginning of one looks like:\nSample log output.\nThere will be two lines for each variable, one giving the informat and one giving the format that SAS decided on. This will be followed by an input statement. You can copy that from the log into a text editor such as VSCode, but unfortunately the line numbering of the LOG will carry over. One convenient way of fixing this is to use regex search-and-replace. Each line starts with a space followed by 1-3 digits, followed by a variable number of spaces until the next word. To capture this I use ^\\s\\d{1,3}\\s+ as my search term and replace with nothing. This will left-align the whole data step, but this can be adjusted later.\nAt this point the data step can be saved as a SAS file or copied back over to the file you are working with in SAS Studio, but I like to do one more adjustment. I really like using the attrib statement, see documentation, because it allows me to see the informat, format, and label of a variable all in one place. So I use regex to re-arrange my informat statement into the beginnings of an attribute statement. Use the search term informat\\s([^\\s]+)\\s([^\\s]+)\\s+; to capture each informat line and create two capture groups - the variable name as group 1 and the informat as group 2. 
If you use the replace code $1 informat=$2 format=$2, you will see the beginnings of an attribute statement. In this replacement scheme, each informat matches each format. This is fine for date and character variables, but you may want to adjust the display format for some of your numeric variables.\nTo clean this up, get rid of the format lines (you can search for ^format.+\\n and replace with an empty replace to delete them), add the attrib statement below the infile and make sure to end the block of attributes with a semicolon, and indent your code as desired.\nSample data step view.\nAnd there you have it! The beginning of a nicely formatted data step that you can start to work with.\n","permalink":"https://dmsenter89.github.io/post/21-07-proc-import-to-data-step-with-regex/","summary":"\u003cp\u003eI find myself needing to import CSV files with a relatively large number of columns. In many cases, \u003ccode\u003eproc import\u003c/code\u003e works surprisingly well in giving me what I want. But sometimes, I need to do some work while reading in the file and it would be nice to just use a data step to do so, but I don\u0026rsquo;t want to type it in by hand. That\u0026rsquo;s when a combination of \u003ccode\u003eproc import\u003c/code\u003e and some regex substitution can come in handy.\u003c/p\u003e","title":"From Proc Import to a Data Step with Regex"},{"content":"One of the editors I use regularly is VS Code. I work a lot with Python, but when installing Anaconda using default settings on a Windows machine already having VSC installed there\u0026rsquo;s a good chance you\u0026rsquo;ll run into an issue. When attempting to run Python code straight from VSC you may get an error. 
This should be fixed on some newer versions of Anaconda, but I\u0026rsquo;ve needed to do something about it often enough I feel it\u0026rsquo;s useful to save the solution janh posted on StackExchange.\nSpecifically, the issue can be fixed by manually changing VSC\u0026rsquo;s default shell from PowerShell to CMD. Just open the command palette (CTRL+SHIFT+P), search \u0026ldquo;Terminal: Select Default Profile\u0026rdquo; and switch to \u0026ldquo;Command Prompt\u0026rdquo;. Everything should work as expected from now on!\n","permalink":"https://dmsenter89.github.io/post/21-07-vsc-python-fix/","summary":"\u003cp\u003eOne of the editors I use regularly is VS Code. I work a lot with Python, but when installing Anaconda\nusing default settings on a Windows machine already having VSC installed there\u0026rsquo;s a good chance you\u0026rsquo;ll run into\nan issue. When attempting to run Python code straight from VSC you may get an error. This should be fixed\non some newer versions of Anaconda, but I\u0026rsquo;ve needed to do something about it often enough I feel it\u0026rsquo;s\nuseful to save the solution \u003ca href=\"https://stackoverflow.com/users/1072989/janh\"\u003ejanh\u003c/a\u003e posted on\n\u003ca href=\"https://stackoverflow.com/questions/54828713/working-with-anaconda-in-visual-studio-code\"\u003eStackExchange\u003c/a\u003e.\u003c/p\u003e","title":"Making VS Code and Python Play Nice on Windows"},{"content":"I am currently working with a database provided by the North Carolina Department of Public Safety that consists of several fixed-width files. Each of these has an associated codebook that gives the internal variable name, a label of the variable, its data type, as well as the start column and the length of the fields for each column. To import the data sets into SAS, I could copy and paste part of that data into my INPUT and LABEL statements, but that gets tedious pretty fast when dealing with dozens of lines. 
And since I have multiple data sets like that, I didn\u0026rsquo;t really want to do it that way. In this post I show how a simple command-line script can be written to deal with this problem.\nIntroducing AWK Here are the first few lines of one of these files:\nCMDORNUM OFFENDER NC DOC ID NUMBER CHAR 1 7 CMCLBRTH OFFENDER BIRTH DATE DATE 8 10 CMCLSEX OFFENDER GENDER CODE CHAR 18 30 CMCLRACE OFFENDER RACE CODE CHAR 48 30 CMCLHITE OFFENDER HEIGHT (IN INCHES) CHAR 78 2 CMWEIGHT OFFENDER WEIGHT (IN LBS) CHAR 80 3 We can see that the data is tabular and separated by multiple spaces. Linux programs often deal with column data and a tool is available for manipulating column-based data on the command-line: AWK, a program that can be used for complex text manipulation from the command-line. Some useful tutorials on AWK in general are available at grymoire.com and at tutorialspoint.\nFor our purposes, we want to know about the print and printf commands for AWK. To illustrate how this works, make a simple list of three lines with each term separated by a space:\ncat \u0026lt;\u0026lt; EOF \u0026gt; list.txt 1 one apple pie 2 two orange cake 3 three banana shake EOF To print the whole file, you\u0026rsquo;d use the print statement: awk '{print}' list.txt. But I could do that with cat, so what\u0026rsquo;s the point? Well, what if I only want one of the columns? By default, $n refers to the nth column in AWK. So to print only the fruits I could write awk '{print $3}' list.txt.\nMultiple columns can be printed by listing multiple columns separated by a comma: awk '{print $2,$3}' list.txt. Note that if you omit the comma the two columns get concatenated into a single column.\nIf additional formatting is required, we can use the printf command. So to create a hyphenated fruit and food-item column, we could use awk '{printf \u0026quot;%s-%s\\n\u0026quot;, $3, $4}' list.txt. 
Note that we have to indicate the end of the line or else everything will be printed into a single line of text.\nNow we almost have all of the skills to create the label and input statements in SAS! Let\u0026rsquo;s create a comma-delimited list for practice:\ncat \u0026lt;\u0026lt; EOF \u0026gt; list.txt 1,one,apple pie 2,two,orange cake 3,three,banana shake EOF The -F flag is used to tell AWK to use a different column separator. So to print the third column, we\u0026rsquo;d use awk -F ',' '{print $3}' list.txt.\nMaking the SAS statements Now we know everything we need to know about AWK to create the code we want. First we note that our codebook file uses multiple spaces as column separators as opposed to single spaces. If each item were a single word, this wouldn\u0026rsquo;t be a problem. Unfortunately, our second column reads \u0026ldquo;OFFENDER NC DOC ID NUMBER\u0026rdquo; which would be split into five columns by default. So we will need to use the column separator flag as -F '[[:space:]][[:space:]]+'.\nThe LABEL Statement A SAS label has the general form LABEL variable-1=label-1\u0026lt;...variable-n=label-n\u0026gt;;, so for example\nlabel score1=\u0026#34;Grade on April 1 Test\u0026#34; score2=\u0026#34;Grade on May 1 Test\u0026#34;; is a valid label statement. In our file the variable names are given in column 1 and the appropriate labels in column 2. So an AWK script to print the appropriate labels can be written like this:\nawk -F \u0026#39;[[:space:]][[:space:]]+\u0026#39; \u0026#39;{printf \u0026#34;\\t%s=\\\u0026#34;%s\\\u0026#34;\\n\u0026#34;, $1, $2}\u0026#39; FILE.DAT This is what everything looks like given our code:\nThe INPUT Statement The INPUT statement can be made in a similar way; it just requires some minor tweaking as INPUT can be a bit more complex to handle a variety of data, see the documentation. In our case we are dealing with a fixed-width record. 
The fourth column gives the starting column of the data and the fifth gives us the width of that field. The third gives us the data type. The majority of ours are character, so it seems easiest to just have the AWK script print each line as though it were a character together with a SAS comment giving the name and \u0026ldquo;official\u0026rdquo; data type. Then the few lines that need adjustment can be manually adjusted. The corresponding code would look like this:\nawk -F \u0026#39;[[:space:]][[:space:]]+\u0026#39; \u0026#39;{printf \u0026#34;\\t@%s %s $%s. /*%s - %s*/\\n\u0026#34;,$4, $1, $5, $3, $2}\u0026#39; FILE.DAT This is what is returned by our code (highlighted part has been manually edited):\nI hope you all find this useful and that it will save you some typing!\n","permalink":"https://dmsenter89.github.io/post/21-07-awk-for-sas/","summary":"\u003cp\u003eI am currently working with a database provided by the North Carolina Department of Public Safety\nthat consists of several fixed-width files. Each of these has an associated codebook that gives the\ninternal variable name, a label of the variable, its data type, as well as the start column and\nthe length of the fields for each column. To import the data sets into SAS, I could copy and paste\npart of that data into my INPUT and LABEL statements, but that gets tedious pretty fast when dealing\nwith dozens of lines. And since I have multiple data sets like that, I didn\u0026rsquo;t really want to do it that way.\nIn this post I show how a simple command-line script can be written to deal with this problem.\u003c/p\u003e","title":"Making INPUT and LABEL Statements with AWK"},{"content":"I have been using both SAS and Python extensively for a while now. With each having great features, it was very useful to combine my skills in both languages by seamlessly moving between SAS and Python in a single notebook. 
In the video below, fellow SAS intern Ariel Chien and I show how easy it is to connect the SAS and Python kernels using the open-source SASPy package together with SAS OnDemand for Academics. I hope you will also find that this adds to your workflow!\nThe Jupyter notebook from the video can be viewed on GitHub. For installation instructions, check out the SASPy GitHub page. Configuration for SASPy to connect to ODA can be found at this support page. For more information on SAS OnDemand for Academics, click here.\n","permalink":"https://dmsenter89.github.io/post/21-06-youtube-tutorial/","summary":"\u003cp\u003eI have been using both SAS and Python extensively for a while now. With each having great features, it was very useful to combine my\nskills in both languages by seamlessly moving between SAS and Python in\na single notebook. In the video below, fellow SAS intern Ariel Chien and I show how easy it is to connect the SAS and Python kernels using the open-source SASPy package together with SAS OnDemand for Academics.\nI hope you will also find that this adds to your workflow!\u003c/p\u003e","title":"SASPy Video Tutorial"},{"content":"The Census Bureau has updated its population estimates for 2020 with county level data. This means any projects that have had to rely on the 2019 estimates can now switch to the 2020 estimates.\nThis is particularly useful for those of us who have been trying to track the development of COVID-19. The average incidence rates are typically rescaled to new cases per 100,000 people. Previous graphs and maps I have created used the 2019 estimates. I have now updated my code for mapping North Carolina developments to use the 2020 estimates.\nCounty level data for North Carolina using the NYT COVID data set. Date set to June 8, 2021.\nBelow this post is my code for loading the necessary data using SAS. Note that I\u0026rsquo;m using a macro called mystate that can be set to the statecode abbreviation of your choice. 
The conditional County ne 0 is in the code because the county level CSV includes both the county data as well as the totals for each state.\nfilename popdat url \u0026#39;https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/counties/totals/co-est2020-alldata.csv\u0026#39;; data censusdata; infile POPDAT delimiter=\u0026#39;,\u0026#39; MISSOVER DSD lrecl=32767 firstobs=2; informat SUMLEV REGION DIVISION State County best32. STNAME $20. CTYNAME $35. CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010-POPESTIMATE2020 best32.; format SUMLEV REGION DIVISION STATE best32. COUNTY 5. STNAME $20. CTYNAME $35. CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010-POPESTIMATE2020 COMMA12. StateCode $2.; input SUMLEV REGION DIVISION STATE COUNTY STNAME $ CTYNAME $ CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010-POPESTIMATE2020; if (State ne 0) and (State ne 72) then do; FIPS=put(State, Z2.); Statecode=fipstate(FIPS); if Statecode eq \u0026amp;mystate and County ne 0 then output; end; keep STNAME CTYNAME County FIPS Statecode Popestimate2020; run; The media release can be viewed here. The county-level data set can be downloaded at this page.\n","permalink":"https://dmsenter89.github.io/post/21-06-covid-county-incidence/","summary":"\u003cp\u003eThe Census Bureau has updated its population estimates for 2020 with county level data. This means any\nprojects that have had to rely on the 2019 estimates can now switch to the 2020 estimates.\u003c/p\u003e","title":"Census 2020 Population Estimates Updated"},{"content":"All organisms must deal with fluid transport and interaction, whether it be internal, such as lungs moving air for the extraction of oxygen, or external, such as the expansion and contraction of a jellyfish bell for locomotion. Most organisms are highly deformable and their elastic deformations can be used to move fluid, move through fluid, and resist fluid forces. 
A particularly effective numerical method for biological fluid-structure interaction simulations is the immersed boundary (IB) method. An important feature of this method is that the fluid is discretized separately from the boundary interface, meaning that the two meshes do not need to conform with each other. This thesis covers the development of a new software tool for the semi-automated creation of finite difference meshes of complex 2D geometries for use with immersed boundary solvers IB2d and IBAMR, alongside two examples of locomotion - the flight of tiny insects and the metachronal paddling of brine shrimp.\nAs mentioned, an advantage of the IB method is that complex geometries, e.g., internal or external morphology, can easily be handled without the need to generate matching grids for both the fluid and the structure. Consequently, the difficulty of modeling the structure lies often in discretizing the boundary of the complex geometry (morphology). Both commercial and open source mesh generators for finite element methods have long been established; however, the traditional immersed boundary method is based on a finite difference discretization of the structure. In chapter 2, I present a software library called MeshmerizeMe for obtaining finite difference discretizations of boundaries for direct use in the 2D immersed boundary method. This library provides tools for extracting such boundaries as discrete mesh points from digital images. Several examples of how the method can be applied are given to demonstrate the effectiveness of the software, including passing flow through the veins of insect wings, within lymphatic capillaries, and around starfish using open-source immersed boundary software.\nAs an example of insect flight, I present a 3D model of clap and fling. Of the smallest insects filmed in flight, most if not all clap their wings together at the end of the upstroke and fling them apart at the beginning of the downstroke. 
This motion increases the strength of the leading edge vortices generated during the downstroke and augments the lift. At the Reynolds numbers ($Re$) relevant to flight in these insects (roughly $4\u0026lt;Re\u0026lt;40$), the drag produced during the fling is substantial, although this can be reduced through the presence of wing bristles, chordwise wing flexibility, and more complex wingbeat kinematics. It is not clear how flexibility in the spanwise direction of the wings can alter the lift and drag generated. In chapter 3, a hybrid version of the immersed boundary method with finite elements is used to simulate a 3D idealized clap and fling motion across a range of wing flexibilities. I find that spanwise flexibility, in addition to three-dimensional spanwise flow, can reduce the drag forces produced during the fling while maintaining lift, especially at lower $Re$. While the drag required to fling 2D wings apart may be more than an order of magnitude higher than the force required to translate the wings, this effect is significantly reduced in 3D. Similar to previous studies, dimensionless drag increases dramatically for $Re\u0026lt;20$, and only moderate increases in lift are observed. Both lift and drag decrease with increasing wing flexibility, but below some threshold, lift decreases much faster. This study highlights the importance of flexibility in both the chordwise and spanwise directions for low $Re$ insect flight. The results also suggest that there is a large aerodynamic cost if insect wings are too flexible.\nMy second application of locomotion pertains to a 2D model of swimming, specifically the method known as metachronal paddling. This method is used by a variety of organisms to propel themselves through a fluid. This mode of swimming is characterized by an array of appendages that beat out of phase, such as the swimmerets used by long-tailed crustaceans like crayfish and lobster. 
This form of locomotion is typically observed over a range of Reynolds numbers greater than 1 where the flow is dominated by inertia. The majority of experimental, modeling, and numerical work on metachronal paddling has been conducted on the higher Reynolds number regime (order 100). In this chapter, a simplified numerical model of one of the smaller metachronal swimmers, the brine shrimp, is constructed. Brine shrimp are particularly interesting since they swim at Reynolds numbers on the order of 10 and sprout additional paddling appendages as they grow. The immersed boundary method is used to numerically solve the fluid-structure interaction problem of multiple rigid paddles undergoing cycles of power and return strokes with a constant phase difference and spacing that are based on brine shrimp parameters. Using a phase difference of 8%, the volumetric flux and efficiency per paddle as a function of the Reynolds number and the spacing between legs is quantified. I find that the time to reach periodic steady state for adult brine shrimp is large (approx. 150 stroke cycles) and decreases with decreasing Reynolds number. Both efficiency and average flux increase with Reynolds number. In terms of leg spacing, the average flux decreases with increased spacing while the efficiency is maximized for intermediate leg spacing.\n","permalink":"https://dmsenter89.github.io/publication/phd-dissertation/","summary":"\u003cp\u003eAll organisms must deal with fluid transport and interaction, whether it be internal, such as lungs moving air for the extraction of oxygen, or external, such as the expansion and contraction of a jellyfish bell for locomotion. Most organisms are highly deformable and their elastic deformations can be used to move fluid, move through fluid, and resist fluid forces. A particularly effective numerical method for biological fluid-structure interaction simulations is the immersed boundary (IB) method. 
An important feature of this method is that the fluid is discretized separately from the boundary interface, meaning that the two meshes do not need to conform with each other. This thesis covers the development of a new software tool for the semi-automated creation of finite difference meshes of complex 2D geometries for use with immersed boundary solvers IB2d and IBAMR, alongside two examples of locomotion - the flight of tiny insects and the metachronal paddling of brine shrimp.\u003c/p\u003e","title":"Immersed Boundary Simulations and Tools for Studying Insect Flight and Other Applications"},{"content":"","permalink":"https://dmsenter89.github.io/talk/dissertation-defense/","summary":"","title":"Dissertation Defense"},{"content":"Metachronal paddling can be described as the sequential oscillation of appendages whereby adjacent paddles maintain a nearly constant phase difference. This mechanism is widely used in nature, both in locomotion such as swimming in crustaceans and in fluid transport such as the clearance of mucus in the mammalian lung. Aside from the wide range of applications, metachronal paddling can be observed across a wide range of Reynolds number regimes.\nI work on simulating the hydrodynamics of metachronal paddling in brine shrimp (Artemia). Brine shrimp are small aquatic crustaceans that lay dormant eggs and are widely used in aquaculture. Their thoracopods are spaced closely together and beat with a small phase difference. We are interested in the hydrodynamics and efficiency of this swimming pattern, which has not previously been rigorously explored.\n","permalink":"https://dmsenter89.github.io/project/metachronal-paddling/","summary":"Metachronal paddling is a widely used mechanism for fluid transport and locomotion. 
We study the hydrodynamics of metachronal paddling in brine shrimp (\u003cem\u003eArtemia\u003c/em\u003e)","title":"Metachronal Paddling"},{"content":"Git is a widely used version control system that allows users to track their software development in both public and private repositories. It is also increasingly used to store data in text formats, see for example the New York Times COVID-19 data set. This post will briefly demonstrate how to clone and pull updates from a GitHub repository using the git functions that are built into SAS Studio.\nGit functionality has been built into SAS Studio for a little while, so there are actually two slightly different iterations of the git functions. The examples in this post will use the versions compatible with SAS Studio 3.8, which is the current version available at SAS OnDemand for Academics. All git functions use the same prefix. In older versions such as SAS Studio 3.8 the prefix is gitfn_, which is followed by a git command such as \u0026ldquo;clone\u0026rdquo; or \u0026ldquo;pull\u0026rdquo;. In SAS Studio 5, the prefix has been simplified to just git_. Most git functions have the same name between the\ntwo versions, so that the only difference is the prefix. A complete table of the old and new versions of the git functions is available in the documentation.\nWe use the git functions by calling them in an otherwise empty DATA step. In other words, we use the format\ndata _null_; /* use your git functions here */ run; Cloning a Repo To clone a repo from github we use gitfn_clone. It takes two arguments - the URL of the repository of interest and the path to an empty folder. You can have SAS create the folder for you by using OPTIONS DLCREATEDIR. 
The basic syntax for the clone is as follows:\ndata _null_; rc = gitfn_clone ( \u0026#34;\u0026amp;repoURL.\u0026#34;, /* URL to repo */ \u0026#34;\u0026amp;targetDIR.\u0026#34;); /* folder to put repo in */ put rc=; /* equals 0 if successful */ run; It doesn\u0026rsquo;t matter if the URL you use ends in \u0026ldquo;.git\u0026rdquo; or not. In other words, the following two macros would both work the same:\n%LET repoURL=https://github.com/nytimes/covid-19-data; /* works the same as */ %LET repoURL=https://github.com/nytimes/covid-19-data.git; You can also use password based authentication to pull in private repositories:\ndata _null_; rc = gitfn_clone ( \u0026#34;\u0026amp;repoURL.\u0026#34;, \u0026#34;\u0026amp;targetDIR.\u0026#34;, \u0026#34;\u0026amp;githubUSER.\u0026#34;, /* your GitHub username */ \u0026#34;\u0026amp;githubPASSW.\u0026#34;); /* your GitHub password */ put rc=; /* equals 0 if successful */ run; NOTE: GitHub is deprecating password-based authentication; you will need to switch to OAuth authentication or SSH keys if you are not already using them. To access a repository using an SSH key, use the following:\ndata _null_; rc = gitfn_clone( \u0026#34;\u0026amp;repoURL.\u0026#34;, \u0026#34;\u0026amp;targetDIR.\u0026#34;, \u0026#34;\u0026amp;sshUSER.\u0026#34;, \u0026#34;\u0026amp;sshPASSW.\u0026#34;, \u0026#34;\u0026amp;sshPUBkey.\u0026#34;, \u0026#34;\u0026amp;sshPRIVkey.\u0026#34;); put rc=; run; Pull-ing in Updates It is just as easy to pull in updates to a local repository by using gitfn_pull(\u0026quot;\u0026amp;repoDIR.\u0026quot;). This also works with SSH keys for private repositories:\ndata _null_; rc = gitfn_pull( \u0026#34;\u0026amp;repoDIR.\u0026#34;, \u0026#34;\u0026amp;sshUSER.\u0026#34;, \u0026#34;\u0026amp;sshPASSW.\u0026#34;, \u0026#34;\u0026amp;sshPUBkey.\u0026#34;, \u0026#34;\u0026amp;sshPRIVkey.\u0026#34;); run; Other Functions SAS also offers other built-in functions, such as _diff, _status, _push, _commit, and others. 
For a complete list, see the SAS documentation here.\n","permalink":"https://dmsenter89.github.io/post/21-01-git-with-sas-studio/","summary":"\u003cp\u003eGit is a widely used version control system that allows users to track their software\ndevelopment in both public and private repositories. It is also increasingly used to store\ndata in text formats, see for example the \u003ca href=\"https://github.com/nytimes/covid-19-data\"\u003eNew York Times COVID-19 data set\u003c/a\u003e.\nThis post will briefly demonstrate how to clone and pull updates from a GitHub repository\nusing the git functions that are built into SAS Studio.\u003c/p\u003e","title":"Using Git with SAS Studio"},{"content":"A popular beginner machine learning problem is the prediction of housing prices. A frequently used data set for this purpose uses housing prices in California along with some additional data gathered through the 1990 Census. One such data set is available here at Kaggle. Unfortunately, that data set is rather old. And I live in North Carolina, not California! So I figured I might as well create a new housing data set, but this time with more up-to-date information and using North Carolina as the state to be analyzed. One thing that may be interesting about North Carolina as compared to California is the position of major population centers. In California, major population centers are near the beach, while major population centers in North Carolina are in the interior of the state. Both large cities and proximity to the beach tend to correlate with higher housing prices. In California, unlike in North Carolina, both of these go together.\nThis post will describe the Kaggle data set with California housing prices and then walk you through how the relevant data can be acquired from the Census Bureau. I\u0026rsquo;ll also show how to clean the data. 
For those who just want to explore the complete data set, I have made it available for download here .\nThe Source Data Set Acquiring the Census Data Set Census Variables Geography Considerations Acquiring Location Data Data Merge with GEOID Matching Data Cleaning The Source Data Set The geographic unit of the Kaggle data set is the Census block group, which means we will have several thousand data points for our analysis. For a good big-picture overview of Census geography divisions, see this post from the University of Pittsburgh library. The data set\u0026rsquo;s ten columns contain geographic, housing, and Census information that can be broken down as follows:\ngeographic information longitude latitude ocean proximity housing information median age of homes median value of homes total number of rooms in area total number of bedrooms in the area Census information population number of households median income Most of these exist directly in the Census API data that we have covered previously. The ocean proximity variable is a categorical giving approximate distance from the beach. My data set will not include this last categorical variable.\nAcquiring the Census Data Set Census Variables The first, and most time consuming aspect, is to figure out where the data we want is located. We know that the US has a decennial census, so accurate information is available every ten years at every level of geography that the Census Bureau tracks. Since it is currently a census year 2020 and the newest information hasn\u0026rsquo;t been tabulated yet, that means the last census count that is available is from 2010. While this is 20 years more current than the California set from 1990, it still seems a bit outdated. Luckily, since the introduction of the American Community Survey (ACS) we have annually updated information available - but not for every level of geography. 
Only the 5-year ACS average gives us Census block group-level information for the whole state, making it comparable to the Kaggle data set. The most recent of these is the 2018 release.\nI start by creating a data dictionary from the groups and variables pages of the \u0026ldquo;American Community Survey: 1-Year Estimates: Detailed Tables 5-Year\u0026rdquo; data set. Note that median home age is not directly available. Instead, we will use the median year structures were built to calculate the median home age. Our data dictionary also does not include any data for the longitude and latitude of each row. We will get that data separately.\ndata_dictionary = { \u0026#39;B01001_001E\u0026#39; : \u0026#34;population\u0026#34;, \u0026#39;B11001_001E\u0026#39; : \u0026#34;households\u0026#34;, \u0026#39;B19013_001E\u0026#39; : \u0026#34;median_income\u0026#34;, \u0026#39;B25077_001E\u0026#39; : \u0026#34;median_house_value\u0026#34;, \u0026#39;B25035_001E\u0026#39; : \u0026#34;median_year_structure_built\u0026#34;, \u0026#39;B25041_001E\u0026#39; : \u0026#34;total_bedrooms\u0026#34;, \u0026#39;B25017_001E\u0026#39; : \u0026#34;total_rooms\u0026#34;, } Geography Considerations The next step is figuring out exactly what level of geography we want. Our data set goes down to the Census block group level at its most granular. Unfortunately, the Census API won\u0026rsquo;t let us pull the data for all the Census block groups in a state at once. Census tracts on the other hand can be acquired in one go. If we were to take a shortcut and use only tract data, this would be a pretty quick API call build:\nprimary_geo = \u0026#34;tract:*\u0026#34; secondary_geo = \u0026#34;state:37\u0026#34; query = base_URL + \u0026#34;?get=\u0026#34; + \u0026#34;,\u0026#34;.join(data_dictionary.keys()) + f\u0026#34;\u0026amp;for={primary_geo}\u0026amp;in={secondary_geo}\u0026#34; But let\u0026rsquo;s try and do it for the Census block groups instead. 
This will require us to build a sequence of API calls that loops over a larger geographic area, say the different counties in the state, and pulls in the respective Census block group data for that geographic unit. While the FIPS codes for the state counties are sorted alphabetically, they are not contiguous. A full listing of North Carolina county FIPS codes is available from NCSU here. It appears that the county FIPS codes are three digits long, starting at 001 and going up to 199 in increments of 2, meaning only odd numbers are in the county set. So it looks like we will be using range(1,200,2) with zero-padding to create the list of county FIPS codes. So we could use a loop similar to this:\nvars_requested = \u0026#34;,\u0026#34;.join(data_dictionary.keys()) for i in range(1,200,2): geo_request = f\u0026#34;for=block%20group:*\u0026amp;in=state:37%20county:{i:03}\u0026#34; query = base_URL + f\u0026#34;?get={vars_requested}\u0026amp;{geo_request}\u0026#34; While practicing writing the appropriate API call, you may find it useful to give it frequent, quick tests using curl. If you are using Jupyter or IPython, you can use !curl \u0026quot;{query}\u0026quot; to test your API query. Don\u0026rsquo;t forget the quotation marks, since the ampersand has special meaning in the shell. It may be helpful to compare the output of your call at the county or city level with that reported on the Census Quickfacts page, if your variable is listed there. 
This can help make sure you are pulling the data you actually want.\nNow that we have figured out the loop necessary for creation of the API calls, we can put everything together and create a list of Pandas DataFrames which we then concatenate to create our master list.\nimport pandas as pd import requests # create the base-URL host_name = \u0026#34;https://api.census.gov/data\u0026#34; year = \u0026#34;2018\u0026#34; dataset_name = \u0026#34;acs/acs5\u0026#34; base_URL = f\u0026#34;{host_name}/{year}/{dataset_name}\u0026#34; # build the api calls as a list query_vars = base_URL + \u0026#34;?get=\u0026#34; + \u0026#34;,\u0026#34;.join(list(data_dictionary.keys()) + [\u0026#34;NAME\u0026#34;,\u0026#34;GEO_ID\u0026#34;]) api_calls = [query_vars + f\u0026#34;\u0026amp;for=block%20group:*\u0026amp;in=state:37%20county:{i:03}\u0026#34; for i in range(1,200,2) ] # running the API calls will take a moment rjson_list = [requests.get(call).json() for call in api_calls] # create the data frame by concatenation df_list = [pd.DataFrame(data[1:], columns=data[0]) for data in rjson_list] df = pd.concat(df_list, ignore_index=True) # save the raw output to disk df.to_csv(\u0026#34;raw_census.csv\u0026#34;, index=False) And now we have the data set! We do still have to address the issue of our values all being imported as strings as mentioned in my Census API post.\nAcquiring Location Data As mentioned above, we are still missing information regarding the latitude and longitude of the different block groups. The Census Bureau makes a lot of geographically coded data available on its TIGERweb page. You can interact with it both using a REST API and its web-interface. A page with map information exists here.\nDealing with shapefiles and the TIGERweb API can get a little complicated. 
Luckily, I know someone with expertise in GIS and shapefiles so we will be using a CSV file of the geographic data we need courtesy of Summer Faircloth, a GIS intern at the North Carolina Department of Transportation. She downloaded the TIGER/Line Shapefiles for the 2018 ACS Block Groups and Census Tracts and joined the data sets in ArcMap, from where she exported our CSV file, which is now available here.\nWe don\u0026rsquo;t need all of the columns in the CSV file, so we will limit the import to the parts we need with the usecols keyword.\ndf = pd.read_csv(\u0026#34;raw_census.csv\u0026#34;, dtype={}) shapedata = pd.read_csv(\u0026#34;BlockGroup_Tract2018.csv\u0026#34;, dtype={\u0026#34;GEOID\u0026#34;: str}, usecols=[\u0026#39;GEOID\u0026#39;,\u0026#39;NAMELSAD\u0026#39;,\u0026#39;INTPTLAT\u0026#39;,\u0026#39;INTPTLON\u0026#39;,\u0026#39;NAMELSAD_1\u0026#39;] ) shapedata = shapedata.rename(columns={\u0026#39;INTPTLAT\u0026#39; : \u0026#39;latitude\u0026#39;, \u0026#39;INTPTLON\u0026#39; : \u0026#39;longitude\u0026#39; }) Data Merge with GEOID Matching At this stage we have two data frames - the first consists of all the Census information sans the geographic coordinates of the block groups, and the second contains the block groups\u0026rsquo; locations. Both data sets contain a GEOID column that can be used for merging. The GEOID returned by the Census API includes a prefix in addition to the regular FIPS code based GEOID used in the TIGERweb system. For example, \u0026ldquo;1500000US370010204005\u0026rdquo; in the census data set is actually GEOID \u0026ldquo;370010204005\u0026rdquo; for purposes of the TIGERweb data set. 
We\u0026rsquo;ll use a string split to make our GEO_ID variable from the Census API compatible with the FIPS code based GEOID from the TIGERweb service.\ndf[\u0026#34;GEO_ID\u0026#34;] = df[\u0026#34;GEO_ID\u0026#34;].str.split(\u0026#39;US\u0026#39;).str[1] df = df.merge(shapedata, left_on=\u0026#39;GEO_ID\u0026#39;, right_on=\u0026#34;GEOID\u0026#34;) Data Cleaning Now that our data set has been assembled, we can work on cleaning up the merged data set. We have the following tasks left:\nconvert column data types to numeric drop unnecessary columns rename columns handle missing values calculate median age of homes for col in data_dictionary.keys(): if col not in [\u0026#34;NAME\u0026#34;, \u0026#34;GEO_ID\u0026#34;]: df[col] = pd.to_numeric(df[col]) To indicate missing values, the Census API returns a value of \u0026ldquo;-666666666\u0026rdquo; in numeric columns. As all of our variables - except for longitude - ought to be positive, we can use the mask function to convert all negative values to missing. We\u0026rsquo;ll start by filtering out the string columns that are no longer necessary.\n# filter down to our numerical columns keeps = list(data_dictionary.keys()) +[\u0026#34;latitude\u0026#34;, \u0026#34;longitude\u0026#34;] df = df.filter(items=keeps) # replace vals \u0026lt; 0 with missing k = df.loc[:, df.columns != \u0026#39;longitude\u0026#39;] k = k.mask(k \u0026lt; 0) df.loc[:, df.columns != \u0026#39;longitude\u0026#39;] = k Now that the missing values have been handled, we can go ahead and calculate our median home age.\ndf.rename(columns=data_dictionary, inplace=True) df[\u0026#34;housing_median_age\u0026#34;] = 2018 - df[\u0026#34;median_year_structure_built\u0026#34;] df.drop(columns=\u0026#34;median_year_structure_built\u0026#34;, inplace=True) And now we\u0026rsquo;re done! 
We will save our output data set to disk for future analysis in a different post.\ndf.to_csv(\u0026#34;NC_Housing_Prices_2018.csv\u0026#34;, index=False) ","permalink":"https://dmsenter89.github.io/post/20-11-north-carolina-housing/","summary":"\u003cp\u003eA popular beginner machine learning problem is the prediction of housing prices. A frequently used data set for this purpose uses housing prices in California along with some additional data gathered through the 1990 Census. One such data set is available \u003ca href=\"https://www.kaggle.com/camnugent/california-housing-prices\"\u003ehere\u003c/a\u003e at Kaggle. Unfortunately, that data set is rather old. And I live in North Carolina, not California! So I figured I might as well create a new housing data set, but this time with more up-to-date information and using North Carolina as the state to be analyzed. One thing that may be interesting about North Carolina as compared to California is the position of major population centers. In California, major population centers are near the beach, while major population centers in North Carolina are in the interior of the state. Both large cities and proximity to the beach tend to correlate with higher housing prices. In California, unlike in North Carolina, both of these go together.\u003c/p\u003e","title":"North Carolina Housing Data"},{"content":"After reading a news article about teacher pay in the US, I was curious and wanted to look into the source data myself. Unfortunately, the source that was mentioned was a publication by the National Education Association (NEA) which had the data as tables embedded inside a PDF report. As those who know me can attest, I don\u0026rsquo;t like hand-copying data. It is slow and error-prone. Instead, I decided to use the tabula package to extract the information from the PDFs directly into a Pandas dataframe. 
In this post, I will show you how to extract the data and how to clean it up for analysis.\nThe Data Source Loading the Data Cleaning the Data Numeric Conversion Table B-6 The Data Source Several years\u0026rsquo; worth of data are available in PDF form on the NEA website. The technical notes highlight that the NEA did not collect all of its own salary information. Some states\u0026rsquo; information is calculated from the American Community Survey (ACS) done by the Census Bureau - a great resource whose API I have covered in a different post. Each report includes accurate data for the previous school year, as well as estimates for the current school year. As of this post, the newest report is the 2020 report which includes data for the 2018-2019 school year, as well as estimates of the 2019-2020 school year.\nThe 2020 report has the desired teacher salary information in two separate locations. One is in table B-6 on page 26 of the PDF, which shows a ranking of the different states\u0026rsquo; average salary in addition to the average salary:\nA second location is in table E-7 on page 46, which gives salary data for the completed school year as well as different states\u0026rsquo; estimates for the 2019-2020 school year:\nNote that table E-7 lacks the star-annotation marking NEA estimated values. This, and the lack of the ranking column, makes Table E-7 easier to parse. In the main example below, this will be the source of the five years of data. I will however also show how to parse table B-6 at the end of this post for completeness.\nLoading the Data As of October 2020, the NEA site has five years\u0026rsquo; worth of reports online. Unfortunately, these are not labeled consistently for all five years. Similarly, the page numbers differ for each report. Prior to the 2018 report, inconsistent formats were used for the tables, which requires earlier reports to be parsed separately from the newer tables. 
For this reason, I\u0026rsquo;ll make a dictionary for the 2018-2020 reports only, which will simplify the example below.\nreport = { \u0026#39;2020\u0026#39; : { \u0026#39;url\u0026#39; : \u0026#34;https://www.nea.org/sites/default/files/2020-10/2020%20Rankings%20and%20Estimates%20Report.pdf\u0026#34;, \u0026#39;page\u0026#39; : 46, }, \u0026#39;2019\u0026#39; : { \u0026#39;url\u0026#39; : \u0026#34;https://www.nea.org/sites/default/files/2020-06/2019%20Rankings%20and%20Estimates%20Report.pdf\u0026#34;, \u0026#39;page\u0026#39; : 49, }, \u0026#39;2018\u0026#39; : { \u0026#39;url\u0026#39; : \u0026#34;https://www.nea.org/sites/default/files/2020-07/180413-Rankings_And_Estimates_Report_2018.pdf\u0026#34;, \u0026#39;page\u0026#39; : 51, }, } We can now use dictionary comprehension to fill in a dictionary with all the source tables of interest. We will be using the tabula package to extract data from the PDFs. If you don\u0026rsquo;t have it installed, you can use pip install tabula-py to get a copy. The function that reads in a PDF is aptly called read_pdf. Its first argument is a path to the PDF, which can also be a URL. We pass the keyword argument stream=True to use stream-mode extraction, which suits tables without ruling lines between cells, and then name the specific page in each PDF that contains the information we are after. By default, read_pdf returns a list of dataframes, so we just save the first element from the list, which is the table we are interested in.\nNote: if you are using WSL, depending on your settings, you may get the error Exception in thread \u0026quot;main\u0026quot; java.awt.AWTError: Can't connect to X11 window server using 'XXX.XXX.XXX.XXX:0' as the value of the DISPLAY variable. when running read_pdf. This is fixed by having an X11 server running.\nimport tabula import pandas as pd source_df = {year : tabula.read_pdf(report[year][\u0026#39;url\u0026#39;], stream=True, pages=report[year][\u0026#39;page\u0026#39;])[0] for year in report.keys()} And that\u0026rsquo;s it in principle. How cool is that! 
Of course, we still need to clean our data a little bit.\nCleaning the Data Let\u0026rsquo;s take a look at the first and last few entries of the 2020 report:\npd.concat([source_df[\u0026#39;2020\u0026#39;].head(), source_df[\u0026#39;2020\u0026#39;].tail()]) Unnamed: 0 2018-19 2019-20 From 2018-19 to 2019-20 From 2010-11 to 2019-20 (%) 0 State Salary($) Salary($) Change(%) Current Dollar Constant Dollar 1 Alabama 52,009 54,095 4.01 13.16 -2.58 2 Alaska 70,277 70,877 0.85 15.36 -0.69 3 Arizona 50,353 50,381 0.06 8.03 -7.00 4 Arkansas 49,438 49,822 0.78 8.31 -6.75 48 Washington 73,049 72,965 -0.11 37.86 18.69 49 West Virginia 47,681 50,238 5.36 13.51 -2.28 50 Wisconsin 58,277 59,176 1.54 9.17 -6.02 51 Wyoming 58,861 59,014 0.26 5.19 -9.44 52 United States 62,304 63,645 2.15 14.14 -1.73 We see that each column is treated as a string object (which you can confirm by running source_df['2020'].dtypes) and that the first row of data is actually at index 1 due to the fact that the PDF report used a two-row header. This means we can safely drop the first row of every dataframe. We can also drop the last row of every dataframe since that just contains summary data of the US as a whole, which we can easily regenerate as necessary. So row indices 0 and 52 can go for all of our data sets.\nfor df in source_df.values(): df.drop([0, 52], inplace=True) Next up I\u0026rsquo;d like to fix the column names. The first column is clearly the name of the state (except in the case of Washington D.C.), while the next two columns give the years for which the salary information is given. 
Let\u0026rsquo;s rename the second and third columns according to the pattern Salary %YYYY-YY using Python\u0026rsquo;s f-string syntax.\nfor df in source_df.values(): df.rename(columns={ df.columns[0] : \u0026#34;State\u0026#34;, df.columns[1] : f\u0026#34;Salary {str(df.columns[1])}\u0026#34;, df.columns[2] : f\u0026#34;Salary {str(df.columns[2])}\u0026#34;, }, inplace=True) source_df[\u0026#34;2020\u0026#34;].head() # show the result of our edits so far State Salary 2018-19 Salary 2019-20 From 2018-19 to 2019-20 From 2010-11 to 2019-20 (%) 1 Alabama 52,009 54,095 4.01 13.16 -2.58 2 Alaska 70,277 70,877 0.85 15.36 -0.69 3 Arizona 50,353 50,381 0.06 8.03 -7.00 4 Arkansas 49,438 49,822 0.78 8.31 -6.75 5 California 83,059 84,659 1.93 24.74 7.39 Looks like we\u0026rsquo;re almost done! Let\u0026rsquo;s drop the unnecessary columns and check our remaining column names:\nfor year, df in source_df.items(): df.drop(df.columns[3:], axis=1, inplace=True) print(f\u0026#34;{year}:\\t{df.columns}\u0026#34;) 2020:\tIndex(['State', 'Salary 2018-19', 'Salary 2019-20'], dtype='object') 2019:\tIndex(['State', 'Salary 2017-18', 'Salary 2018-19'], dtype='object') 2018:\tIndex(['State', 'Salary 2017', 'Salary 2018'], dtype='object') We can see that the column naming scheme in the 2018 report was different from that of the later reports. To make them all compatible for our merge, we\u0026rsquo;re going to have to do some more editing. Based on the other reports, it appears as though the 2018 report used the calendar year of the end of the school year, while the others utilized a range. This can easily be solved using regex substitution. 
We\u0026rsquo;ll do that now.\nimport re for year, df in source_df.items(): if year != \u0026#34;2018\u0026#34;: df.rename(columns={ df.columns[1] : re.sub(r\u0026#34;\\d{2}-\u0026#34;, \u0026#39;\u0026#39;, df.columns[1]), df.columns[2] : re.sub(r\u0026#34;\\d{2}-\u0026#34;, \u0026#39;\u0026#39;, df.columns[2]), }, inplace=True) # print the output for verification print(f\u0026#34;{year}:\\t{df.columns}\u0026#34;) 2020:\tIndex(['State', 'Salary 2019', 'Salary 2020'], dtype='object') 2019:\tIndex(['State', 'Salary 2018', 'Salary 2019'], dtype='object') 2018:\tIndex(['State', 'Salary 2017', 'Salary 2018'], dtype='object') Now that everything works, we can do our merge to create a single dataframe with the information for all of the school years we have downloaded.\nmerge_df = source_df[\u0026#34;2018\u0026#34;].drop([\u0026#34;Salary 2018\u0026#34;], axis=1).merge( source_df[\u0026#34;2019\u0026#34;].drop([\u0026#34;Salary 2019\u0026#34;], axis=1)).merge( source_df[\u0026#34;2020\u0026#34;]) merge_df.head() State Salary 2017 Salary 2018 Salary 2019 Salary 2020 0 Alabama 50,391 50,568 52,009 54,095 1 Alaska 6 8,138 69,682 70,277 70,877 2 Arizona 4 7,403 48,723 50,353 50,381 3 Arkansas 4 8,304 50,544 49,438 49,822 4 California 7 9,128 80,680 83,059 84,659 Numeric Conversion We\u0026rsquo;re almost done! Notice that we still have not dealt with the fact that every column is still treated as a string. Before we can use the to_numeric function, we still need to take care of two issues:\nThe commas in the numbers. While they are nice for our human eyes, Pandas doesn\u0026rsquo;t like them. In the 2017 salary column, there appears to be extraneous white space after the first digit for some entries. 
Luckily, both of these problems can be remedied with a simple string replacement operation.\nmerge_df.iloc[:,1:] = merge_df.iloc[:,1:].replace(r\u0026#34;[,| ]\u0026#34;, \u0026#39;\u0026#39;, regex=True) for col in merge_df.columns[1:]: merge_df[col] = pd.to_numeric(merge_df[col]) Now we\u0026rsquo;re done! We have created an overview of annual teacher salaries from the 2016-17 school year until 2019-20 extracted from a series of PDFs published by the NEA. We have cleaned up the data and converted everything to numerical values. We can now get summary statistics and do any analysis of interest with this data.\nmerge_df.describe() # summary stats of our numeric columns Salary 2017 Salary 2018 Salary 2019 Salary 2020 count 51.000000 51.000000 51.000000 51.000000 mean 56536.196078 57313.039216 58983.254902 60170.647059 std 9569.444674 9795.914601 10286.843230 10410.259274 min 42925.000000 44926.000000 45105.000000 45192.000000 25% 49985.000000 50451.500000 51100.500000 52441.000000 50% 54308.000000 53815.000000 54935.000000 57091.000000 75% 61038.000000 61853.000000 64393.500000 66366.000000 max 81902.000000 84227.000000 85889.000000 87543.000000 Table B-6 As mentioned above, table B-6 in the 2020 Report presents slightly greater challenges. A lot of the cleaning is similar or identical, so I will not reproduce it in full. Instead, I have loaded a subsetted part of table B-6 and will show how this can be cleaned up as well. 
But first, let\u0026rsquo;s look at the first several entries:\nUnnamed: 0 2017-18 (Revised) 2018-19 0 State Salary($) Rank Salary($) 1 Alabama 50,568 36 52,009 2 Alaska 69,682 7 70,277 3 Arizona 48,315 45 50,353 4 Arkansas 49,096 44 49,438 5 California 80,680 2 83,059 * 6 Colorado 52,695 32 54,935 7 Connecticut 74,517 * 5 76,465 * 8 Delaware 62,422 13 63,662 We can see that there is an additional hurdle compared to the previous tables: the second column now contains data from two columns, both the Salary information as well as a ranking of the salary as it compares to the different states. For a few states, there is additionally a \u0026lsquo;*\u0026rsquo; to denote values that were estimated as opposed to received. We can again use a simple regex replace together with a capture group to parse out only those values that we are interested in, while dropping the extraneous information using the code below.\nb6.iloc[:,1:] = b6.iloc[:,1:].replace(r\u0026#34;([\\d,]+).*\u0026#34;, r\u0026#34;\\1\u0026#34;, regex=True) And now we\u0026rsquo;re back to where we were above before we did the string conversion. This is what it looks like after also dropping the first row and renaming the columns:\nState Salary 2018 Salary 2019 1 Alabama 50,568 52,009 2 Alaska 69,682 70,277 3 Arizona 48,315 50,353 4 Arkansas 49,096 49,438 5 California 80,680 83,059 6 Colorado 52,695 54,935 7 Connecticut 74,517 76,465 8 Delaware 62,422 63,662 From here on out, we can proceed as in the previous example.\n","permalink":"https://dmsenter89.github.io/post/20-10-tabula/","summary":"What do you do when your data table is in PDF format? Let\u0026rsquo;s use tabula-py to extract teacher salary information from PDFs directly into Pandas dataframes. We\u0026rsquo;ll also use some regex to clean up the results.","title":"Teacher Salaries"},{"content":"The Census Bureau makes an incredible amount of data available online. 
In this post, I will summarize how to get access to this data via Python by using the Census Bureau\u0026rsquo;s API. The Census Bureau makes a pretty useful guide available here - I recommend checking it out.\nAPI Basics Building the Base URL Building the Query The \u0026lsquo;Get\u0026rsquo; Variables Location Variables The Complete Call Making the API Request Reading the JSON into Pandas API Basics We can think of an API query as consisting of two main parts: a base URL (also called a root URL) and a query string. These two strings are joined together with the query character \u0026ldquo;?\u0026rdquo; to create an API call. The resulting API call can in theory be copy-and-pasted into the URL bar of your browser, and I recommend this when first playing around with a new API. Seeing the raw text returned in the browser can help you understand the structure of what is being returned. In the case of the Census Bureau\u0026rsquo;s API, it returns a string that essentially looks like a list of lists from a Python perspective. This can easily be turned into a Pandas dataset. Be aware that all values are returned as strings. You\u0026rsquo;ll have to convert number columns to numeric yourself.\nTo get an overview of all available data sets, you can go to the data page, which contains a long list of data sets. This data page is incredibly useful because it gives access to all of the information needed to build a correct API call, including the base URLs of all data sets and the variables available in each.\nA snapshot of two datasets available as part of the 2018 American Community Survey (ACS). Building the Base URL Let\u0026rsquo;s build a sample API call for the 2018 American Community Survey 1-Year Detailed Table. While we could just copy the base URL from the data page, I like to assemble mine manually from its component parts. 
This makes it easier to write a wrapper for the API calls if you plan on scraping the same data from multiple years.\nhost_name = \u0026#34;https://api.census.gov/data\u0026#34; year = \u0026#34;2018\u0026#34; dataset_name = \u0026#34;acs/acs1\u0026#34; base_URL = f\u0026#34;{host_name}/{year}/{dataset_name}\u0026#34; Building the Query Now that we have the base URL, we can work on building the query. For the Census Bureau\u0026rsquo;s API, you will need two components: the variables of interest, which are listed after the get= keyword, and the geography for which you would like the data, which is listed after the for= keyword. For certain subdivisions, like counties, you can specify two levels of geography by adding an in= keyword at the end.\nThe \u0026lsquo;Get\u0026rsquo; Variables Since many of the data sets have a large number of variables in them, it often makes sense to take a look at the \u0026ldquo;groups\u0026rdquo; page first. This page lists variables as groups, giving you a better overview of what data is available. This page is available at {base_URL}/groups.html. A complete list of all variables in the data set is available at {base_URL}/variables.html.\nLet\u0026rsquo;s find some variables. The most basic variable we\u0026rsquo;d expect to find here is total population. We can find this variable in group \u0026ldquo;B01003\u0026rdquo;. The total estimate is in sub-variable \u0026ldquo;001E\u0026rdquo;, meaning that the variable for total population is \u0026ldquo;B01003_001E\u0026rdquo;. Let\u0026rsquo;s also get household income (group \u0026ldquo;B19001\u0026rdquo;) not broken down by race: \u0026ldquo;B19001_001E\u0026rdquo;. There is also median monthly housing cost (group B25105) with variable \u0026ldquo;B25105_001E\u0026rdquo;. 
Since the variable names can be a little difficult to parse, I recommend making a data dictionary as you prepare the list of variables to fetch.\ndata_dictionary = { \u0026#34;B01003_001E\u0026#34; : \u0026#34;Total Population\u0026#34;, \u0026#34;B19001_001E\u0026#34; : \u0026#34;Household Income (12 Month)\u0026#34;, \u0026#34;B25105_001E\u0026#34; : \u0026#34;Median Monthly Housing Cost\u0026#34;, } This way, the list of variables can easily be created from the data dictionary:\nget_vars = \u0026#39;,\u0026#39;.join(data_dictionary.keys()) Location Variables Which geographic variables are available for a particular data set can be found at {base_URL}/geography.html. The Census Bureau uses FIPS codes to reference the different geographies. To find the relevant codes, see here. Delaware, for example, has FIPS code 10 while North Carolina is 37. So to get information for these two states, we\u0026rsquo;d use for=state:10,37. You can also use \u0026lsquo;*\u0026rsquo; as a wildcard. So to get all the states\u0026rsquo; info you\u0026rsquo;d write for=state:*.\nSubdivisions work similarly. To get information for Orange County (FIPS 135) in North Carolina (FIPS 37), you could write for=county:135 with the keyword in=state:37. 
Let\u0026rsquo;s get the information for Orange and Alamance counties in North Carolina.\ncounty_dict = { \u0026#34;001\u0026#34; : \u0026#34;Alamance County\u0026#34;, \u0026#34;135\u0026#34; : \u0026#34;Orange County\u0026#34;, } county_fips = \u0026#39;,\u0026#39;.join(county_dict.keys()) state_dict = {\u0026#34;37\u0026#34; : \u0026#34;North Carolina\u0026#34;} state_fips = \u0026#39;,\u0026#39;.join(state_dict.keys()) query_str = f\u0026#34;get={get_vars}\u0026amp;for=county:{county_fips}\u0026amp;in=state:{state_fips}\u0026#34; The Complete Call The complete API call can now be easily assembled from the previous two pieces:\napi_call = base_URL + \u0026#34;?\u0026#34; + query_str If we copy-and-paste this output into our browser, we can see the result looks as follows:\nThe result of our sample API query. Making the API Request We can make the API request with Python\u0026rsquo;s requests package:\nimport requests r = requests.get(api_call) And that\u0026rsquo;s it! We now have the response we wanted. To interpret the response as JSON, we would call the json method of the response object: r.json(). The result can then be fed into Pandas to generate our data set.\nReading the JSON into Pandas We can use Pandas\u0026rsquo; DataFrame method directly on our data, making sure to specify that the first row consists of column headers.\nimport pandas as pd data = r.json() df = pd.DataFrame(data[1:], columns=data[0]) We can then do any renaming based on the dictionaries we have created previously.\ndf.rename(columns=data_dictionary, inplace=True) df[\u0026#39;county\u0026#39;] = df[\u0026#39;county\u0026#39;].replace(county_dict) df[\u0026#39;state\u0026#39;] = df[\u0026#39;state\u0026#39;].replace(state_dict) The last step is to make sure our numeric columns are interpreted as such. 
Since all of the requested variables are in fact numeric, we can use the dictionary of variables to convert what we need to numeric variables.\nfor col in data_dictionary.values(): df[col] = pd.to_numeric(df[col]) And that\u0026rsquo;s it! We\u0026rsquo;re now ready to work with our data.\n","permalink":"https://dmsenter89.github.io/post/20-08-census-api/","summary":"\u003cp\u003eThe Census Bureau makes an incredible amount of data available online. In this post, I will summarize how to get access to this data via Python by using the Census Bureau\u0026rsquo;s API. The Census Bureau makes a pretty useful guide available \u003ca href=\"https://www.census.gov/data/developers/guidance/api-user-guide.html\"\u003ehere\u003c/a\u003e - I recommend checking it out.\u003c/p\u003e","title":"Accessing Census Data via API"},{"content":"","permalink":"https://dmsenter89.github.io/publication/senter-2020-meshmerizeme/","summary":"","title":"A semi-automated finite difference mesh creation method for use with immersed boundary software IB2d and IBAMR"},{"content":"This workshop covers data acquisition and basic data preparation with a focus on using Python with Jupyter Notebooks. To avoid having to install Python locally during the workshop, we will be utilizing an Azure notebook project. The example files are located here.\nPlease note that the free Azure notebooks will only be available until early October. To continue using Python and Jupyter notebooks, you may want to consider using a local installation. For Windows and Mac users, I recommend using Anaconda. For continued cloud usage, you may consider Cocalc. Please note that you will need a subscription for your Cocalc notebooks to be able to download data from external sources.\nAdditional Links:\nEngauge Digitizer (software to extract data points from graphs). Markdown Cheatsheet. 
","permalink":"https://dmsenter89.github.io/talk/webscraping-tutorial/","summary":"\u003cp\u003eThis workshop covers data acquisition and basic data preparation with a focus on using Python with Jupyter Notebooks. To avoid having to install Python locally during the workshop, we will be utilizing an \u003ca href=\"https://notebooks.azure.com/\"\u003eAzure notebook\u003c/a\u003e project. The example files are located \u003ca href=\"https://notebooks.azure.com/dmsenter/projects/datacollectiontutorial\"\u003ehere\u003c/a\u003e.\u003c/p\u003e","title":"Basics of Web Scraping with Python"},{"content":"Basics of Web Scraping with Python Michael Senter\nGoals for Today Understand what tools and methods are available.\nBe able to create a new project using Python and Jupyter.\nBe able to edit existing code snippets to gather data.\nPython easy to learn, reads like \u0026ldquo;pseudocode\u0026rdquo; widely used in a variety of fields many books, websites, etc. to help you learn print(\u0026#34;Hello, world!\u0026#34;) Data Sources CSV/Excel Downloads COVID Related Data Johns Hopkins Dashboard The Johns Hopkins data is published on GitHub and is updated regularly.\nUsing SAS filename outfile \u0026#34;~/import-data-nyt.sas\u0026#34;; /* download official SAS script to above filename */ proc http url=\u0026#34;https://raw.githubusercontent.com/sassoftware/covid-19-sas/master/Data/import-data-nyt.sas\u0026#34; method=\u0026#34;get\u0026#34; out=outfile; run; /* run the downloaded script */ %include \u0026#34;~/import-data-nyt.sas\u0026#34;; /* state and county level data are now in memory */ ","permalink":"https://dmsenter89.github.io/slides/webscraping-tutorial/","summary":"\u003ch1 id=\"basics-of-web-scraping-with-python\"\u003eBasics of Web Scraping with Python\u003c/h1\u003e\n\u003cp\u003eMichael Senter\u003c/p\u003e\n\u003chr\u003e\n\u003ch2 id=\"goals-for-today\"\u003eGoals for Today\u003c/h2\u003e\n\u003cul\u003e\n\u003cli\u003e\n\u003cp\u003eUnderstand what tools and 
methods are available.\u003c/p\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003cp\u003eBe able to create a new project using Python and Jupyter.\u003c/p\u003e\n\u003c/li\u003e\n\u003cli\u003e\n\u003cp\u003eBe able to edit existing code snippets to gather data.\u003c/p\u003e","title":"Webscraping Tutorial"},{"content":"My website is back up and running! Some incompatibilities between my old site and updates to both Hugo and the Academic Theme have led to some downtime on this page as I didn\u0026rsquo;t have time to look through how to rebuild my site without losing previous content. I\u0026rsquo;m currently in the process of updating everything and will try to bring back some material as well. Stay tuned!\nThis page is currently using the Academic theme from Hugo. Docs and other templates are available at wowchemy.\n","permalink":"https://dmsenter89.github.io/post/porting-forward/","summary":"\u003cp\u003eMy website is back up and running! Some incompatibilities between my old site and updates to both Hugo and the Academic Theme have led to some downtime on this page as I didn\u0026rsquo;t have time to look through how to rebuild my site without losing previous content. I\u0026rsquo;m currently in the process of updating everything and will try to bring back some material as well. Stay tuned!\u003c/p\u003e","title":"Porting Forward"},{"content":"Please join me as I present the work I have done so far in my graduate career and discuss avenues for future study.\n","permalink":"https://dmsenter89.github.io/talk/thesis-proposal/","summary":"\u003cp\u003ePlease join me as I present the work I have done so far in my graduate career and discuss\navenues for future study.\u003c/p\u003e","title":"Thesis Proposal"},{"content":"Insects are ubiquitous throughout the world. Most of us are familiar with winged insects such as butterflies and bees. Insect flight is an interesting topic from a biomechanics perspective. 
Unlike birds, most insects (with some exceptions, such as dragonflies and others) do not have flight muscles attached to their wings. Instead, their flight muscles oscillate their thorax, which in turn makes the wings move. Furthermore, they beat their wings at a very high speed. The aerodynamics of insect flight are also very interesting. Larger insects are able to fly by creating a leading edge vortex. This method does not work in the smallest insect fliers. Such insects include the thrips and chalcid wasps, some of which have wingspans as small as 1 mm. These insects have unusual wing structures, as can be seen in this image:\nThe solid part of the wing is rather small and narrow, with many large bristles projecting from the solid part of the wing. Insects such as thrips do not create a leading edge vortex; instead, they fly using the \u0026ldquo;clap and fling\u0026rdquo; method. This method is common amongst insects that fly in the intermediate Reynolds number regime, $1\\leq \\mathrm{Re} \\leq 100$.\n","permalink":"https://dmsenter89.github.io/project/clap-and-fling/","summary":"Simulating the aerodynamics of the smallest insect fliers.","title":"Clap and Fling"},{"content":"IB2d and IBAMR are two software packages implementing the immersed boundary method (see below). These packages model fluid-structure interaction problems based on user-given parameters and geometry. The manual creation of the initial geometry mesh can be difficult and time-consuming, especially for the complex shapes encountered in biological applications. Oftentimes we have images of the geometry we wish to explore. I am developing software to help automate the creation of such CFD meshes from images for 2D simulations, in a file format suitable for use with IB2d and IBAMR. An initial prototype version is available on GitHub. 
A paper exploring the use of MeshmerizeMe in conjunction with IB2d for simulations is in preparation.\nUsage MeshmerizeMe needs two input files per experimental geometry: an SVG image file with the geometry of interest and an input2d file with the experiment parameters. When you select an SVG for use with MeshmerizeMe, it will automatically look for the input2d file in the same folder. It will then parse the paths, transform them into the correct coordinate system, and appropriately sample the paths based on the size of the Cartesian grid set in the input2d file. The geometry will be exported as a vertex file. This file is readable by both IB2d and IBAMR.\nSVGs were chosen as the image source because they are an open, text-based format, making them very accessible to work with. They are standardized for web use and many tools exist for creating and manipulating SVG images. They can be created from source images such as photographs or scans by means of edge detection tools and by manually tracing the outline of a shape of interest. Consider optimizing the SVG prior to processing to save time.\nAs the current version of MeshmerizeMe only handles a subset of SVG, tools that optimize the SVG files created by your editor are very useful. Examples of such software include SVGO, which also offers a webapp called SVGOMG. Another option is svgcleaner.\nIBM Background One aspect of computational fluid dynamics is the investigation of fluid-structure interactions. One method developed for the study of such interactions is the immersed boundary method (IBM) developed by Peskin (2002). It is well known that fluids can be studied from both a Eulerian and a Lagrangian view. The IBM combines these: the domain of the problem is resolved as a Cartesian grid on which Eulerian equations are solved for fluid velocity and pressure. 
In the case of Newtonian fluids, the incompressible Navier-Stokes equations, consisting of\n$$ \\rho \\left( \\frac{\\partial \\mathbf{u}}{\\partial t} + \\mathbf{u} \\cdot \\nabla \\mathbf{u} \\right) = - \\nabla p + \\mu \\nabla^2 \\mathbf{u} + \\mathbf{f}$$\nand\n$$\\nabla \\cdot \\mathbf{u} = 0,$$\nneed to be solved.\nThe immersed structures are modeled as fibers in the form of parametric curves $\\mathbf{X}(s,t)$, where $s$ is a parameter and $t$ is time. The fiber experiences force distributions $\\mathbf{F}(s,t)$, and we can derive the force the fiber exerts on the fluid from the momentum equation. For the fibers we then solve\n$$\\mathbf{f} = \\int_\\Gamma \\mathbf{F}(s,t)\\,\\delta\\left(\\mathbf{x}-\\mathbf{X}(s,t)\\right)\\,ds$$\nand\n$$\\frac{\\partial \\mathbf{X}}{\\partial t} = \\int_\\Omega \\mathbf{u}(\\mathbf{x},t)\\, \\delta \\left( \\mathbf{x}-\\mathbf{X}(s,t)\\right)\\,d\\mathbf{x}.$$\nHere, $\\Gamma$ is the immersed structure and $\\Omega$ is the fluid domain.\nThe immersed structures are discretized not on a Cartesian grid but on a separate Lagrangian grid on the fiber itself. Of import to CFD software users is that the initial discretization of the immersed structure has to be supplied by the user. While this is not too difficult for simple geometries, the often complex structures encountered in mathematical biology can present a significant time investment. This is the part where MeshmerizeMe comes in handy.\nCharles S Peskin. 2002. \u0026ldquo;The immersed boundary method.\u0026rdquo; Acta numerica 11:479-517.\n","permalink":"https://dmsenter89.github.io/project/meshmerizeme/","summary":"Automatic mesh generation from 2D images for use with immersed boundary solvers.","title":"MeshmerizeMe"},{"content":"","permalink":"https://dmsenter89.github.io/publication/hohenegger-2017-mean/","summary":"","title":"Mean first passage time in a thermally fluctuating viscoelastic fluid"},{"content":" I\u0026rsquo;m D. 
Michael Senter, a Research Statistician Developer at SAS Institute. Mathematical modeling, data analysis, and machine learning have been the major themes of my work, and the through line of my career has been using code to solve computationally complex problems.\nI enjoy mentoring and teaching. My teaching philosophy is based on my own experience — that a passion for math and computer science can be cultivated through active learning and emphasizing small victories.\nI\u0026rsquo;m originally from Nuremberg, Germany, where I attended the Neues Gymnasium Nürnberg, a school following the humanist tradition of education. I now live in Lexington with my family, where I enjoy travelling, collecting books, and reading nerdy webcomics in my spare time.\nInterests Data Cleaning and Preparation Missing Data Methods Big Data and Data Analytics Causal Inference Education PhD in Applied Mathematics, University of North Carolina at Chapel Hill (2021) Graduate Certificate in Bioinformatics and Computational Biology, UNC Chapel Hill (2021) Graduate Certificate in BD2K Data Science, UNC Chapel Hill (2021) BSc in Mathematics, University of Utah (2015) Elsewhere GitHub LinkedIn ORCID Google Scholar Curriculum Vitae ","permalink":"https://dmsenter89.github.io/about/","summary":"About D. Michael Senter","title":"About"}]