Data Exploration for Paragonimiasis

Raw data is stored in CLEAN-ActiveSurvey.xlsx.

File: DataExploration.ipynb

  • Primary goal is to familiarize myself with the Python package pandas.
  • The data set contains 9,355 rows and 42 columns.
  • Nine of the columns hold demographic data (age, address, sex, etc.).
  • The distribution of mean age per village appears bimodal: about half the villages have a mean age near 13, and the other half have a mean age near 50.
  • There are 10 symptoms, and each symptom takes two columns (whether the patient has the symptom, and the duration of the symptom).
  • If the patient answered no to a symptom, the corresponding duration is set to NaN.
  • Duration of fever is missing for 94% of patients with a fever.
  • Breakdown of symptoms and NaN entries:
    Symptom              Presence %   Missing Duration %
    Cough                5.954%       0%
    Expectoration        2.2127%      1.4493%
    Chest Pain           2.6296%      0%
    Fever                99.4976%     94.0052%
    Loss of Appetite     1.5714%      4.0816%
    Blood in Sputum      0.1924%      0%
    Night Sweats         2.0844%      3.0769%
    Weight Loss          0.6734%      42.8571%
    Shortness of Breath  2.79%        1.494%
    Tiredness            5.7082%      1.6854%
  • Column Duration_of_the_symptom is NaN for all but a handful of entries.
  • After setting the duration to zero wherever the symptom is absent and removing the Duration_of_the_symptom column, I dropped every remaining row that contained a NaN entry. This left a much smaller data set of 550 rows.
  • The presence of symptoms in the smaller data set is quite different:
    Symptom              Presence %
    Cough                85.2727%
    Expectoration        30.5455%
    Chest Pain           38.1818%
    Fever                93.8182%
    Loss of Appetite     20.7273%
    Blood in Sputum      1.81812%
    Night Sweats         31.6364%
    Weight Loss          6.5455%
    Shortness of Breath  44.1818%
    Tiredness            76.1818%
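
The cleaning steps in the list above can be sketched roughly as follows. This is a minimal illustration on a toy frame, not the notebook's actual code: the column names `Cough`, `Cough_duration`, `Fever`, `Fever_duration`, the "Yes"/"No" encoding, and the helper `symptom_summary` are all assumptions made for the example. Only `Duration_of_the_symptom` is a column name taken from the notes.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey. Column names and the "Yes"/"No" encoding
# are assumptions; object dtype lets strings and the fill value 0 coexist.
df = pd.DataFrame(
    {
        "Cough": ["Yes", "No", "Yes", "No"],
        "Cough_duration": ["3 days", np.nan, np.nan, np.nan],
        "Fever": ["Yes", "Yes", "No", "Yes"],
        "Fever_duration": [np.nan, "2 days", np.nan, "1 week"],
        "Duration_of_the_symptom": [np.nan, np.nan, np.nan, np.nan],
    },
    dtype=object,
)
symptoms = ["Cough", "Fever"]  # hypothetical subset of the 10 symptoms


def symptom_summary(frame, names):
    """Presence % per symptom, and % of missing durations among patients who have it."""
    rows = []
    for name in names:
        has = frame[name] == "Yes"
        missing = frame.loc[has, f"{name}_duration"].isna()
        rows.append(
            {
                "Symptom": name,
                "Presence %": 100 * has.mean(),
                "Missing Duration %": 100 * missing.mean() if has.any() else np.nan,
            }
        )
    return pd.DataFrame(rows).set_index("Symptom")


summary = symptom_summary(df, symptoms)

# Drop the almost-empty column, zero the duration where the symptom is absent,
# then drop every row that still contains a NaN.
cleaned = df.drop(columns=["Duration_of_the_symptom"])
for name in symptoms:
    col = f"{name}_duration"
    absent = cleaned[name] != "Yes"
    cleaned.loc[absent & cleaned[col].isna(), col] = 0
cleaned = cleaned.dropna()

print(summary)
print(len(cleaned), "rows remain")
```

Filling before dropping matters here: a missing duration for an absent symptom is expected, so only rows where a reported symptom lacks a duration are discarded.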

File: TossFever.ipynb

  • Since the vast majority of NaN entries are for fever duration and 99% of patients had a fever, it may make sense to drop the fever columns. We also drop Duration_of_the_symptom since it is 99% NaN. This leaves 9,209 patients.
  • Durations of symptoms are stored as strings, typically of the form "? days" or "? weeks" (a few entries lack the space). With some simple regex matching we can convert these strings to an integer number of days.
  • Presence of each symptom and summary statistics of its duration (in days) are as follows:
    Symptom              Presence %  Count  Mean       Std        Min  Max
    Cough                5.5598%     512    9.4        4.161824   2    21
    Expectoration        1.8243%     168    11.303571  4.30952    2    14
    Chest Pain           2.2804%     210    9.142857   4.828996   2    14
    Loss of Appetite     1.2379%     114    12.789474  6.323819   4    30
    Blood in Sputum      0.1303%     12     7.25       4.266679   4    14
    Night Sweats         1.8895%     174    5.827586   5.988828   2    30
    Weight Loss          0.3909%     36     22.25      22.598831  1    90
    Shortness of Breath  2.6387%     243    6.481481   4.465046   2    14
    Tiredness            4.8431%     446    8.026906   4.321797   2    30
  • Only three patients have been treated for TB.
  • The most common methods of freshwater crab consumption are cooked and in soups:
    Consumption Method  Count  Percent
    Raw                 0      0%
    Roasted             2      0.0217%
    Smoked              2      0.0217%
    Soup                9154   99.4028%
    Pickled             0      0%
    Cooked              9208   99.9891%
  • The most common frequency of consumption is once a week:
    Frequency                Count  Percent
    Once in a week           8995   97.6762%
    Occasionally             178    1.9329%
    Once in a month          23     0.2498%
    More than once in a week 13     0.1412%
  • The most common duration of consumption is more than 5 years:
    Duration     Count  Percent
    >5 years     9181   99.6959%
    1-3 years    1      0.0109%
    3-11 months  1      0.0109%
    <3 months    26     0.2823%
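
The regex conversion of duration strings mentioned in the list above could look something like the sketch below. It is an illustration under assumptions rather than the notebook's code: the helper name `duration_to_days`, the pattern, and the sample values are hypothetical, and only the "days" and "weeks" units named in the notes are handled (a week is taken as 7 days). Anything else is treated as missing.

```python
import re

import pandas as pd

# Days per unit; only the two units named in the notes are covered.
_UNIT_DAYS = {"day": 1, "week": 7}

# A number, optional whitespace, then "day(s)" or "week(s)".
_PATTERN = re.compile(r"^\s*(\d+)\s*(day|week)s?\s*$", re.IGNORECASE)


def duration_to_days(text):
    """Return the duration in days, or None if the entry cannot be parsed."""
    if not isinstance(text, str):
        return None  # NaN / None entries stay missing
    match = _PATTERN.match(text)
    if match is None:
        return None
    count, unit = match.groups()
    return int(count) * _UNIT_DAYS[unit.lower()]


# Hypothetical sample values, including entries that lack the space.
raw = pd.Series(["3 days", "2 weeks", "1week", "10days", None])
days = pd.Series(
    [duration_to_days(value) for value in raw],
    index=raw.index,
    dtype="float64",  # float so that unparsed entries can be NaN
)

# count / mean / std / min / max, the same kind of columns as in the symptom table.
stats = days.describe()
print(stats[["count", "mean", "std", "min", "max"]])
```

Returning None for unparsed strings, instead of raising, keeps unexpected formats visible as NaN so they can be counted and inspected before any rows are dropped.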