Primary goal is to familiarize myself with the Python package pandas.
Data set contains 9,355 rows and 42 columns
9 of the columns are demographic data (age, address, sex, etc.)
The average age per village in the data set appears bimodal. Around half the villages have a mean age of around 13, whereas the other half have a mean age around 50.
There are 10 symptoms, each symptom takes two columns (has symptom and duration of symptom)
If the patient answered no to a symptom, the corresponding duration is set to NaN.
Duration of fever is missing for 94% of patients with a fever.
Breakdown of symptoms and NaN entries:
Symptom
Presence %
Missing Duration %
Cough
5.954%
0%
Expectoration
2.2127%
1.4493%
Chest Pain
2.6296%
0%
Fever
99.4976%
94.0052%
Loss of Appetite
1.5714%
4.0816%
Blood in Sputum
0.1924%
0%
Night Sweats
2.0844%
3.0769%
Weight Loss
0.6734%
42.8571%
Shortness of Breath
2.79%
1.494%
Tiredness
5.7082%
1.6854%
Column Duration_of_the_symptom is NaN for all but a handful of entries.
After filling duration NaNs when no presence of symptom to zero and removing Duration_of_the_symptom, I dropped the remaining rows with an NaN entry. This resulted in a much smaller data set size of 550.
The presence of symptoms on the smaller data set is quite different:
Symptom
Presence %
Cough
85.2727%
Expectoration
30.5455%
Chest Pain
38.1818%
Fever
93.8182%
Loss of Appetite
20.7273%
Blood in Sputum
1.81812%
Night Sweats
31.6364%
Weight Loss
6.5455%
Shortness of Breath
44.1818%
Tiredness
76.1818%
File:TossFever.ipynb
Since the vast majority of NaN entries are for fever duration and 99% of patients had a fever, it may make sense to drop the fever columns. We also drop Duration_of_the_symptom since that is 99% NaN. This results in 9,209 remaining patients.
Durations of symptoms are stored as strings typically of the form ”? days” or ”? weeks” (a few entries are lacking spaces). With some simple regex matching we can convert these strings to an integer representing the number of days.
Relation to presence and duration of symptom are as follows:
Symptom
Presence %
Count
Mean
Std
Min
Max
Cough
5.5598%
512
9.4
4.161824
2
21
Expectoration
1.8243%
168
11.303571
4.30952
2
14
Chest Pain
2.2804%
210
9.142857
4.828996
2
14
Loss of Appetite
1.2379%
114
12.789474
6.323819
4
30
Blood in Sputum
0.1303%
12
7.25
4.266679
4
14
Night Sweats
1.8895%
174
5.827586
5.988828
2
30
Weight Loss
0.3909%
36
22.25
22.598831
1
90
Shortness of Breath
2.6387%
243
6.481481
4.465046
2
14
Tiredness
4.8431%
446
8.026906
4.321797
2
30
Only three patients have been treated for TB
The most common methods of fresh water crab consumption is cooked and in soups:
Consumption Method
Count
Percent
Raw
0
0%
Roasted
2
0.0217%
Smoked
2
0.0217%
Soup
9154
99.4028%
Pickled
0
0%
Cooked
9208
99.9891%
The most common frequency of consumption is once a week:
Frequency
Count
Percent
Once in a week
8995
97.6762%
Occasionally
178
1.9329%
Once in a month
23
0.2498%
More than once in a week
13
0.1412%
The most common duration of consumption is more than 5 years: