Data Sets
The full data set was created by combining two different data sets: actively collected (9355 individuals) and passively collected (3046 individuals). Columns that only appeared in one data set were added to the other data set then filled with blank entries.
There was an overlap in the values of Serial_No between the two data sets. To remain the values being unique, we add 1,000,000 to each serial in the passive data set. This also allows us to differentiate which data set the individual came from.
The following columns were removed:
- Name_of_the_study_participant
- Date_of_the_interview
- Tube_No_Spot_Fresh_sputum_sample_without_fixative
- Date_Collection_Spot_Fresh_sputum_sample_without_fixative
- Date_sent_to_Lab_Spot_Fresh_sputum_sample_without_fixative
- Date_Collection_Overnight_Fresh_sputum_sample_without_fixative
- Date_sent_to_Lab_Overnight_Fresh_sputum_sample_without_fixative
- Tube_No_Spot_Fresh_sputum_sample_with_preservative
- Date_Collection_Spot_Fresh_sputum_sample_with_preservative
- Date_sent_to_Lab_Spot_Fresh_sputum_sample_with_preser
- Tube_No_Spot_sputum_sample_with_Ethanol_Sol
- Date_Collection_Spot_sputum_sample_with_Ethanol_Soln
- Date_sent_to_Lab_Spot_sputum_sample_with_Ethanol_Soln
- Tube_No_Overnight_Stool_sample
- Date_Collection_of_Overnight_stool_sample
- Date_Sent_to_lab_of_Overnight_stool_sample
- Tube_No_Spot_Whole_Blood_sample
- Date_Collection_Spot_whole_blood_sample
- Scrutinize_Filled_all_questions
- Scrutinize_Sputum_collected
- Scrutinize_Stool_collected
- Scrutinize_Blood_collected
- Field_Investigator_name
- Date_checked_Field_investigator
- Checked_by_name
- Checked_by
- Designation
Statistical Analysis: Continuous Variables
We start by investigating for significant differences in the age, weight and height of individuals who tested ELISA positive/negative. A common approach is to use a two-sample t-test, which determines if there is a statistical difference between the means of two samples.
Running the t-test on these columns results in the following table:
| Age | Height (cm) | Weight (kg) | |
|---|---|---|---|
| Positive Mean | 32.0184 | 149.043 | 44.2147 |
| Negative Mean | 35.9002 | 149.262 | 46.8544 |
| t-stat | -3.13453 | -0.220966 | -2.74948 |
| p-value | 0.00172535 | 0.825122 | 0.00597772 |
The -value represents the probability of receiving the two samples under the assumption that they are independent. Using the golden standard of <0.05, we see that the average age and average weight of ELISA positive individuals is significantly lower than ELISA negative individuals.
Statistical Analysis: Categorical Variables
The majority of the data set is categorical, e.g. sex, marital status, religion, etc. For categorical data, we cannot use a t-test. Instead, we use another common test, the chi-squared two sample test. The chi-squared test provides a -value measuring the significance in the differences between the proportions of responses between two groups.
For example, consider the responses in regards to marital status:
| Positive | Negative | |
|---|---|---|
| Married | 305 | 11574 |
| Unmarried | 21 | 430 |
Running a chi-squared test on this table results in a -value of 0.0103. The -value being less than 0.05 suggests that there is significant differences in the proportions of responses between ELISA positive and ELISA negative individuals. We can see this visually by creating bar-charts:

The red bars represent the percentage of ELISA positive individuals who gave a particular response. Likewise, the blue bars are the percentage of ELISA negative individuals who gave a particular response.
We get a similar result in terms of expectoration () and chest pain (:
| Positive | Negative | |
|---|---|---|
| Had Expectoration | 53 | 2540 |
| No Expectoration | 273 | 9461 |
| Positive | Negative | |
|---|---|---|
| Had Chest Pain | 51 | 2610 |
| No Chest Pain | 275 | 9391 |
Or in bar chart form:


In terms of pathogen factors, we also find a strong significant difference in the consumption of cray fish ().
| Positive | Negative | |
|---|---|---|
| Consumes cray fish | 4 | 15 |
| Does not consume cray fish | 322 | 11989 |

In summary, we found statistically significant differences between ELISA positive and ELISA negative in the following variables:
- Marital status
- Expectoration
- Chest pain
- Consumption of cray fish
The remaining columns had no statistical significant differences. The breakdown of the remaining columns are included at the end of this document.
Future Work
There is a large difference in the sample sizes, 326 ELISA positive and 12075 ELISA negative. The chi-squared test is very sensitive to sample size which may result in false identification of statistically significant differences. More work needs to be done to take into account the large disparity between sample sizes.
Not all columns have been tested. For example, it may be beneficial to test the durations of symptoms with a t-test. Before doing so, some cleanup of the data is necessary as it is incomplete.
Full Breakdown
Host Factors:
Sex_of_the_study_participant p-value: 0.6600516932794341
| Positive | Negative | |
|---|---|---|
| Female | 180 | 6345 |
| Male | 146 | 5654 |
| Transgender | 0 | 5 |
Marital_status_of_the_participant p-value: 0.010338783754939688
| Positive | Negative | |
|---|---|---|
| Married | 305 | 11574 |
| Unmarried | 21 | 430 |
Religion_of_the_participant p-value: 0.8420839699727036
| Positive | Negative | |
|---|---|---|
| Chistianity | 0 | 5 |
| Hindu | 325 | 11941 |
| Islam | 1 | 58 |
Belongs_to_tribal_community p-value: 1.0
| Positive | Negative | |
|---|---|---|
| No | 326 | 12004 |
Educational_qualification_of_the_study_participant p-value: 0.1896305167122306
| Positive | Negative | |
|---|---|---|
| Graduate/Degree/Diploma | 74 | 2558 |
| Illiterate | 90 | 3004 |
| Post-graduate | 72 | 2439 |
| Secondary | 90 | 4003 |
Occupation_of_the_study_participant p-value: 0.07118126521529632
| Positive | Negative | |
|---|---|---|
| Agri. Laborer | 38 | 1222 |
| Construction laborer | 26 | 1050 |
| Forest products | 12 | 483 |
| Housewife | 1 | 42 |
| Livestock | 0 | 14 |
| MNREGA work only | 18 | 455 |
| Others | 1 | 13 |
| Own cultivation | 1 | 129 |
| Own cultivation and laborer | 70 | 1882 |
| Private | 2 | 63 |
| Small business/Petti/Tea shop | 30 | 986 |
| Student | 127 | 5665 |
Which_health_facility_do_you_access_when_you_are_sick p-value: 0.5154638395142579
| Positive | Negative | |
|---|---|---|
| Government | 300 | 10992 |
| Health centers run by NGO | 0 | 44 |
| Private | 25 | 841 |
| Self-treatment | 0 | 39 |
| Traditional healer | 1 | 88 |
Have_you_had_Cough p-value: 0.37789035913987745
| Positive | Negative | |
|---|---|---|
| No | 242 | 8556 |
| Yes | 84 | 3352 |
Have_you_had_Expectoration p-value: 0.037877357849456596
| Positive | Negative | |
|---|---|---|
| No | 273 | 9461 |
| Yes | 53 | 2540 |
Have_you_had_Chest_pain p-value: 0.010026969615090838
| Positive | Negative | |
|---|---|---|
| No | 275 | 9391 |
| Yes | 51 | 2610 |
Have_you_had_fever p-value: 1.0
| Positive | Negative | |
|---|---|---|
| No | 3 | 105 |
| Yes | 323 | 11896 |
Have_you_lost_appetite p-value: 0.15609852522410084
| Positive | Negative | |
|---|---|---|
| No | 300 | 10732 |
| Yes | 26 | 1269 |
Have_you_had_blood_in_sputum p-value: 0.430210821210259
| Positive | Negative | |
|---|---|---|
| No | 326 | 11947 |
| Yes | 0 | 54 |
Have_you_had_night_sweats p-value: 0.6305612868045478
| Positive | Negative | |
|---|---|---|
| No | 282 | 10248 |
| Yes | 44 | 1753 |
Have_you_lost_weight p-value: 1.0
| Positive | Negative | |
|---|---|---|
| No | 315 | 11595 |
| Yes | 10 | 386 |
Have_you_had_shortness_of_breath p-value: 0.22037504598491625
| Positive | Negative | |
|---|---|---|
| No | 267 | 9474 |
| Yes | 59 | 2527 |
Have_you_had_tiredness p-value: 0.8687320660144218
| Positive | Negative | |
|---|---|---|
| No | 252 | 9342 |
| Yes | 74 | 2659 |
Other_symptoms p-value: 1.0
| Positive | Negative | |
|---|---|---|
| No | 325 | 11951 |
| Yes | 1 | 50 |
Have_you_taken_any_treatment_or_actions p-value: 0.6159132422548381
| Positive | Negative | |
|---|---|---|
| No | 312 | 11597 |
| Yes | 0 | 38 |
Have_you_ever_been_treated_for_TB p-value: 0.6160488670397868
| Positive | Negative | |
|---|---|---|
| No | 312 | 11600 |
| Yes Currently on ATT | 0 | 38 |
Pathogen Factors:
Consumption_of_fresh_water_crabs p-value: 1.0
| Positive | Negative | |
|---|---|---|
| No | 0 | 5 |
| Yes | 326 | 11999 |
Duration_of_consumption_of_fresh_water_crabs p-value: 0.6758771337543388
| Positive | Negative | |
|---|---|---|
| 1-3 years | 0 | 3 |
| 3-11 months | 0 | 2 |
| <3 months | 0 | 51 |
| >5 years | 326 | 11948 |
Frequency_of_consumption_of_fresh_water_crabs p-value: 0.22952043999118324
| Positive | Negative | |
|---|---|---|
| More than once in a week | 18 | 511 |
| Occasionally | 37 | 1135 |
| Once in a month | 19 | 961 |
| Once in a week | 252 | 9397 |
Raw_fresh_water_crabs p-value: 1.0
| Positive | Negative | |
|---|---|---|
| No | 315 | 11585 |
| Yes | 11 | 419 |
Roasted_fresh_water_crabs p-value: 0.32006920638339187
| Positive | Negative | |
|---|---|---|
| No | 324 | 11980 |
| Yes | 2 | 24 |
Smoked_fresh_water_crabs p-value: 1.0
| Positive | Negative | |
|---|---|---|
| No | 326 | 12002 |
| Yes | 0 | 2 |
Soup_fresh_water_crabs p-value: 0.3020510534208001
| Positive | Negative | |
|---|---|---|
| No | 5 | 101 |
| Yes | 321 | 11903 |
Pickled_fresh_water_crabs p-value: 1.0
| Positive | Negative | |
|---|---|---|
| No | 326 | 12004 |
Cooked_fresh_water_crabs p-value: 1.0
| Positive | Negative | |
|---|---|---|
| No | 0 | 5 |
| Yes | 260 | 9231 |
Consumption_of_cray_fishes p-value: 1.78890200369268e-05
| Positive | Negative | |
|---|---|---|
| No | 322 | 11989 |
| Yes | 4 | 15 |
Duration_of_consumption_of_cray_fishes p-value: 1.0
| Positive | Negative | |
|---|---|---|
| 1-3 years | 3 | 15 |
Frequency_of_consumption_of_cray_fishes p-value: 1.0
| Positive | Negative | |
|---|---|---|
| Occasionally | 3 | 15 |
Raw_cray_fish p-value: 1.0
| Positive | Negative | |
|---|---|---|
| No | 326 | 12004 |
Roasted_cray_fish p-value: 1.0
| Positive | Negative | |
|---|---|---|
| No | 326 | 12004 |
Smoked_cray_fish p-value: 1.0
| Positive | Negative | |
|---|---|---|
| No | 326 | 12004 |
Soup_cray_fish p-value: 1.0
| Positive | Negative | |
|---|---|---|
| No | 325 | 11985 |
| Yes | 1 | 19 |
Pickled_cray_fish p-value: 1.0
| Positive | Negative | |
|---|---|---|
| No | 326 | 12004 |
Cooked_cray_fish p-value: 1.0
| Positive | Negative | |
|---|---|---|
| No | 326 | 12004 |
Consumption_of_wlid_boar_meat p-value: 0.06804177212239208
| Positive | Negative | |
|---|---|---|
| No | 323 | 11975 |
| Yes | 3 | 29 |
Duration_of_consumption_of_wild_boar_meat p-value: 0.9585895848817743
| Positive | Negative | |
|---|---|---|
| 1-3 years | 0 | 1 |
| 3-11 months | 0 | 1 |
| <3 months | 0 | 1 |
| >5 years | 3 | 29 |
Frequency_of_consumption_of_wild_boar_meat p-value: 1.0
| Positive | Negative | |
|---|---|---|
| Occasionally | 3 | 31 |
| Once in a month | 0 | 1 |
Raw_wild_boar_meat p-value: 1.0
| Positive | Negative | |
|---|---|---|
| No | 260 | 9236 |
Roasted_wild_boar_meat p-value: 1.0
| Positive | Negative | |
|---|---|---|
| No | 260 | 9236 |
Smoked_wild_boar_meat p-value: 1.0
| Positive | Negative | |
|---|---|---|
| No | 260 | 9236 |
Soup_wild_boar_meat p-value: 1.0
| Positive | Negative | |
|---|---|---|
| No | 260 | 9236 |
Pickled_wild_boar_meat p-value: 1.0
| Positive | Negative | |
|---|---|---|
| No | 260 | 9236 |
Cooked_wild_boar_meat p-value: 1.0
| Positive | Negative | |
|---|---|---|
| No | 260 | 9233 |
| Yes | 0 | 3 |
Consumption_of_Rodents_Rats_etc p-value: 1.0
| Positive | Negative | |
|---|---|---|
| No | 260 | 9236 |
Duration_of_consumption_of_rodents p-value: error
| Positive | Negative |
|---|
Frequency_of_consumption_of_rodents p-value: error
| Positive | Negative |
|---|
Raw_rodents p-value: 1.0
| Positive | Negative | |
|---|---|---|
| No | 260 | 9236 |
Roasted_rodents p-value: 1.0
| Positive | Negative | |
|---|---|---|
| No | 260 | 9236 |
Smoked_rodents p-value: 1.0
| Positive | Negative | |
|---|---|---|
| No | 260 | 9236 |
Soup_rodents p-value: 1.0
| Positive | Negative | |
|---|---|---|
| No | 260 | 9236 |
Pickled_rodents p-value: 1.0
| Positive | Negative | |
|---|---|---|
| No | 260 | 9236 |
Cooked_rodents p-value: 1.0
| Positive | Negative | |
|---|---|---|
| No | 260 | 9236 |