Load data #2

Open
wants to merge 2 commits into
base: load-data
88 changes: 87 additions & 1 deletion src/book/chapter_1_examples.clj
@@ -233,5 +233,91 @@
;;Special values encoded as numbers are *dangerous* because if they are not handled properly, they can generate bogus results, like a 99-pound baby.
;;Here we replace these values with `nil` so that they are skipped in further calculations.
;;
;;The last of `map-columns` creates a new column `totalwgt_lb` that combines pounds and ounces into a single quantity, in pounds.
;;The last `map-columns` call creates a new column `totalwgt-lb` that combines pounds and ounces into a single quantity, in pounds.
;;
;;## 1.7 Validation
;;
;;When data is exported from one software environment and imported into another, errors might be introduced. And when you are getting familiar with a new dataset, you might interpret data
;;incorrectly or introduce other misunderstandings. If you take time to validate the data, you can save time later and avoid errors.
;;
;;One way to validate data is to compute basic statistics and compare them with published results. For example, the NSFG codebook includes tables that summarize each variable. Here is the
;;table for `outcome`, which encodes the outcome of each pregnancy:
;;
;;```
;; value label Total
;; 1 LIVE BIRTH 9148
;; 2 INDUCED ABORTION 1862
;; 3 STILLBIRTH 120
;; 4 MISCARRIAGE 1921
;; 5 ECTOPIC PREGNANCY 190
;; 6 CURRENT PREGNANCY 352
;;```
;;
;;We can use the `frequencies` function to count the number of times each value appears. If we select the `outcome` column from the dataset, we can compare its frequencies with the
;;published data:
;;
(->> (:outcome (nsfg/read-fem-preg-dataset))
     (frequencies)
     (sort-by first))
;;
;;The result of `frequencies` is a map from each distinct value to its count. Sorting its entries with `(sort-by first)` orders them by key, so the values appear in order.
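;;
;;As a minimal sketch of the same check on plain Clojure data, consider the snippet below. The `sample` vector and the `expected` counts are made up for illustration and are not taken from the NSFG files. `frequencies` counts each value, `(sort-by first)` orders the pairs by value, and comparing the two maps with `=` stands in for checking against a published table:
;;
;;```clojure
;; (def sample [1 4 1 2 1 4])         ; hypothetical outcome codes
;; (def expected {1 3, 2 1, 4 2})     ; hypothetical "published" totals
;;
;; (def counts (frequencies sample))  ; map of value -> count
;;
;; (sort-by first counts)             ; => ([1 3] [2 1] [4 2])
;; (= expected counts)                ; => true
;;```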
;;
;;Comparing the results with the published table, it looks like the values in `outcome` are correct. Similarly, here is the published table for `birthwgt_lb`:
;;
;;```
;; value label Total
;; . INAPPLICABLE 4449
;; 0-5 UNDER 6 POUNDS 1125
;; 6 6 POUNDS 2223
;; 7 7 POUNDS 3049
;; 8 8 POUNDS 1889
;; 9-95 9 POUNDS OR MORE 799
;;```
;;
;;And here are the frequencies:
;;
(->> (:birthwgt-lb (nsfg/read-fem-preg-dataset))
     (frequencies)
     (sort-by first))
;;
;;The counts for 6, 7, and 8 pounds check out, and if you add up the counts for 0-5 and 9-95, they check out, too. But if you look more closely, you will notice one value that has to be
;;an error: a 51-pound baby!
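;;
;;Adding up the counts by hand is error-prone, so here is a rough sketch of how the grouping could be done in code. The `counts` map is hypothetical (it is not the real NSFG output and contains no `nil` key), `bin` is a helper name introduced only for this example, and the 20-pound cutoff is an arbitrary assumption used to surface implausible values:
;;
;;```clojure
;; (def counts {5 2, 6 3, 7 4, 8 1, 9 2, 51 1})  ; hypothetical value -> count map
;;
;; (defn bin [lb]                                 ; codebook-style bins
;;   (cond (<= 0 lb 5)  "0-5"
;;         (<= 9 lb 95) "9-95"
;;         :else        (str lb)))
;;
;; ;; Sum the counts that fall into the same bin.
;; (def binned
;;   (reduce-kv (fn [acc lb n] (update acc (bin lb) (fnil + 0) n)) {} counts))
;; ;; binned is equal to {"0-5" 2, "6" 3, "7" 4, "8" 1, "9-95" 3}
;;
;; ;; Values above the assumed plausibility cutoff of 20 pounds.
;; (filter #(> % 20) (keys counts))               ; => (51)
;;```
;;
;;Note that the 51 is hidden inside the "9-95" bin, which is why the binned totals can match the codebook even though one value is implausible.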
;;
;;To deal with this error, I added the following logic to `clean-fem-preg`:
;;```
;; (tc/map-columns :birthwgt-lb
;; [:birthwgt-lb]
;; (fn [v]
;; (let [na-vals [51 97 98 99]] ; here if the value is 51 or 97, 98, 99 then it is replaced by nil
;; (if (in? na-vals v) nil v))))
;;```
;;This statement replaces invalid values with `nil`.
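;;
;;Note that `in?` is not part of `clojure.core`, and its definition is not shown here. The sketch below assumes a simple membership test and applies the same replacement rule to a plain sequence instead of a dataset column. `clean-birthwgt-lb` is a name made up for this example:
;;
;;```clojure
;; (defn in? [coll x]                        ; assumed definition of the helper
;;   (boolean (some #(= x %) coll)))
;;
;; (defn clean-birthwgt-lb [v]               ; hypothetical stand-alone version
;;   (if (in? [51 97 98 99] v) nil v))
;;
;; (map clean-birthwgt-lb [7 51 8 99 nil])   ; => (7 nil 8 nil nil)
;;```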
;;
;;## 1.8 Interpretation
;;
;;To work with data effectively, you have to think on two levels at the same time: the level of statistics and the level of context.
;;
;;As an example, let’s look at the sequence of outcomes for a few respondents. Because of the way the data files are organized, we have to do some processing to collect the pregnancy data
;;for each respondent. Here’s a function that does that:
;;
;;```
;; (defn make-preg-map [caseid]
;;   (let [caseid (str caseid)]                           ; convert to a string before comparing, as in src/data/nsfg.clj
;;     (-> (read-fem-preg-dataset)                        ; load dataset
;;         (tc/select-columns [:caseid :outcome])         ; select only caseid and outcome columns
;;         (tc/select-rows (comp #(= caseid %) :caseid))  ; select only rows with needed caseid
;;         :outcome)))                                    ; return the outcome column
;;```
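;;
;;The same lookup can be sketched on plain Clojure maps, without tablecloth. The `rows` vector and the name `outcomes-for` are hypothetical. The example assumes that `:caseid` is stored as a string, which is why the version of `make-preg-map` in `src/data/nsfg.clj` calls `str` on its argument:
;;
;;```clojure
;; (def rows                                  ; hypothetical rows, not NSFG data
;;   [{:caseid "1" :outcome 1}
;;    {:caseid "2" :outcome 4}
;;    {:caseid "2" :outcome 1}])
;;
;; (defn outcomes-for [rows caseid]
;;   (let [caseid (str caseid)]               ; normalize 2 and "2" to "2"
;;     (->> rows
;;          (filter #(= caseid (:caseid %)))  ; keep rows for this respondent
;;          (mapv :outcome))))                ; collect their outcome codes
;;
;; (outcomes-for rows 2)                      ; => [4 1]
;;```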
;;
;;The following call looks up one respondent and returns the outcomes of her pregnancies:
(nsfg/make-preg-map 10229)
;;
;;The outcome code `1` indicates a live birth. Code `4` indicates a miscarriage; that is, a pregnancy that ended spontaneously, usually with no known medical cause.
;;
;;Statistically this respondent is not unusual. Miscarriages are common and there are other respondents who reported as many or more.
;;
;;But remembering the context, this data tells the story of a woman who was pregnant six times, each time ending in miscarriage. Her seventh and most recent pregnancy ended in a live birth.
;;If we consider this data with empathy, it is natural to be moved by the story it tells.
;;
;;Each record in the NSFG dataset represents a person who provided honest answers to many personal and difficult questions. We can use this data to answer statistical questions about family
;;life, reproduction, and health. At the same time, we have an obligation to consider the people represented by the data, and to afford them respect and gratitude.
;;
7 changes: 7 additions & 0 deletions src/data/nsfg.clj
@@ -169,3 +169,10 @@

(catch AssertionError e "Assert data failed" (.getMessage e)))
(prn "All tests passed.")))

(defn make-preg-map [caseid]
  (let [caseid (str caseid)]
    (-> (read-fem-preg-dataset)
        (tc/select-columns [:caseid :outcome])
        (tc/select-rows (comp #(= caseid %) :caseid))
        :outcome)))