Data Sets

The full data set was created by combining two different data sets: actively collected (9355 individuals) and passively collected (3046 individuals). Columns that appeared in only one data set were added to the other and filled with blank entries.
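As a minimal sketch of this merge, assume each data set has been loaded as a list of dictionaries keyed by column name (for example with Python's csv.DictReader). The function name `combine` and the example columns `Only_in_active` and `Only_in_passive` are hypothetical; they do not come from the study data.

```python
def combine(active_rows, passive_rows):
    """Stack two lists of dict rows, blank-filling columns missing from either."""
    all_rows = active_rows + passive_rows
    # Union of column names, keeping first-seen order.
    columns = []
    for row in all_rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    # Every output row carries every column; absent values become "".
    return [{name: row.get(name, "") for name in columns} for row in all_rows]


# Hypothetical one-row data sets, for illustration only.
active = [{"Serial_No": 1, "Only_in_active": "a"}]
passive = [{"Serial_No": 1, "Only_in_passive": "p"}]
full = combine(active, passive)
```

With pandas, `pandas.concat` on two DataFrames gives a similar union of columns, although it fills the gaps with NaN rather than blank strings.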

The values of Serial_No overlapped between the two data sets. To keep the values unique, we added 1,000,000 to each Serial_No in the passive data set. This also lets us tell which data set an individual came from.
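The offset could be applied along these lines before the two data sets are stacked. This sketch assumes every original Serial_No is a non-negative integer below 1,000,000, which the text implies but does not state; the helper names are hypothetical.

```python
PASSIVE_OFFSET = 1_000_000  # offset stated in the text


def tag_passive(rows):
    """Return copies of passive rows with Serial_No shifted by the offset."""
    return [{**row, "Serial_No": row["Serial_No"] + PASSIVE_OFFSET} for row in rows]


def source_of(serial_no):
    """Recover the originating data set from a combined Serial_No.

    Assumes original serial numbers are below PASSIVE_OFFSET.
    """
    return "passive" if serial_no >= PASSIVE_OFFSET else "active"
```

For example, a passive record with Serial_No 42 becomes 1000042, and `source_of(1000042)` returns "passive" while `source_of(42)` returns "active".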

The following columns were removed:

  • Name_of_the_study_participant
  • Date_of_the_interview
  • Tube_No_Spot_Fresh_sputum_sample_without_fixative
  • Date_Collection_Spot_Fresh_sputum_sample_without_fixative
  • Date_sent_to_Lab_Spot_Fresh_sputum_sample_without_fixative
  • Date_Collection_Overnight_Fresh_sputum_sample_without_fixative
  • Date_sent_to_Lab_Overnight_Fresh_sputum_sample_without_fixative
  • Tube_No_Spot_Fresh_sputum_sample_with_preservative
  • Date_Collection_Spot_Fresh_sputum_sample_with_preservative
  • Date_sent_to_Lab_Spot_Fresh_sputum_sample_with_preser
  • Tube_No_Spot_sputum_sample_with_Ethanol_Sol
  • Date_Collection_Spot_sputum_sample_with_Ethanol_Soln
  • Date_sent_to_Lab_Spot_sputum_sample_with_Ethanol_Soln
  • Tube_No_Overnight_Stool_sample
  • Date_Collection_of_Overnight_stool_sample
  • Date_Sent_to_lab_of_Overnight_stool_sample
  • Tube_No_Spot_Whole_Blood_sample
  • Date_Collection_Spot_whole_blood_sample
  • Scrutinize_Filled_all_questions
  • Scrutinize_Sputum_collected
  • Scrutinize_Stool_collected
  • Scrutinize_Blood_collected
  • Field_Investigator_name
  • Date_checked_Field_investigator
  • Checked_by_name
  • Checked_by
  • Designation
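Dropping these columns might look like the following, assuming rows are stored as dictionaries keyed by column name. Only three names from the list are shown for brevity, and `drop_columns` is a hypothetical helper.

```python
def drop_columns(rows, to_drop):
    """Return copies of dict rows without the named columns."""
    to_drop = set(to_drop)
    return [{k: v for k, v in row.items() if k not in to_drop} for row in rows]


# A few of the column names from the list above.
REMOVED = [
    "Name_of_the_study_participant",
    "Date_of_the_interview",
    "Field_Investigator_name",
]
```

Names that are absent from a row are simply ignored, so the same list can be applied to both data sets.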

Statistical Analysis: Continuous Variables

We start by testing for significant differences in age, height, and weight between individuals who tested ELISA positive and those who tested ELISA negative. A common approach is the two-sample t-test, which assesses whether the means of two samples differ by more than would be expected by chance.

Running the t-test on these columns results in the following table:

Statistic      Age         Height (cm)  Weight (kg)
Positive Mean  32.0184     149.043      44.2147
Negative Mean  35.9002     149.262      46.8544
t-stat         -3.13453    -0.220966    -2.74948
p-value        0.00172535  0.825122     0.00597772

The p-value is the probability of observing a difference in means at least as large as the one measured, under the assumption that the two groups share the same population mean. Using the conventional threshold of p < 0.05, we see that the average age and average weight of ELISA positive individuals are significantly lower than those of ELISA negative individuals, while average height does not differ significantly.
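For illustration, a pooled-variance two-sample t-test can be sketched with the Python standard library alone. This is not necessarily the exact variant or software used to produce the table above: the equal-variance assumption and the normal approximation to the p-value are both assumptions of this sketch, and the approximation is only reasonable for large samples such as the ones here. The function name `two_sample_t` is hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def two_sample_t(sample_a, sample_b):
    """Pooled-variance two-sample t statistic with an approximate p-value."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Pooled sample variance; assumes both groups share one population variance.
    pooled = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / (n_a + n_b - 2)
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(pooled * (1 / n_a + 1 / n_b))
    # Two-sided p-value from the normal approximation to the t distribution.
    # Assumption of this sketch: only adequate when both samples are large.
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value
```

As a sanity check, plugging the reported t-statistic for age (-3.13453) into the same normal approximation, `2 * (1 - NormalDist().cdf(3.13453))`, gives roughly 0.0017, in line with the p-value in the table. A statistics library such as SciPy (`scipy.stats.ttest_ind`) computes the p-value from the t distribution itself.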

Statistical Analysis: Categorical Variables

The majority of the data set is categorical, e.g. sex, marital status, and religion. For categorical data, we cannot use a t-test. Instead, we use another common test, the chi-squared two-sample test. The chi-squared test provides a p-value measuring the significance of the differences in the proportions of responses between two groups.
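To make the computation concrete, the following sketch implements the chi-squared test for the special case of a 2x2 table of counts, using only the standard library. It applies Yates' continuity correction by default; that the original analysis used this correction is an assumption, although with it the sketch reproduces the p-value reported for marital status in this document. The function name `chi2_2x2` is hypothetical, and the sketch assumes no row or column total is zero.

```python
import math


def chi2_2x2(table, yates=True):
    """Chi-squared test for a 2x2 table of counts.

    Returns (statistic, p_value). With yates=True the continuity
    correction is applied to each |observed - expected| difference.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            if yates:
                diff = max(diff - 0.5, 0.0)
            stat += diff * diff / expected
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Marital status counts from this document:
# rows Married / Unmarried, columns Positive / Negative.
stat, p = chi2_2x2([[305, 11574], [21, 430]])
print(round(p, 4))  # 0.0103
```

For larger tables the p-value needs the chi-squared distribution with (rows - 1) * (columns - 1) degrees of freedom, which a library routine such as SciPy's `scipy.stats.chi2_contingency` provides.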

For example, consider the responses regarding marital status:

Response   Positive  Negative
Married    305       11574
Unmarried  21        430

Running a chi-squared test on this table results in a p-value of 0.0103. Because this p-value is less than 0.05, the proportions of responses differ significantly between ELISA positive and ELISA negative individuals. We can see this visually by creating bar charts:

The red bars represent the percentage of ELISA positive individuals who gave a particular response. Likewise, the blue bars are the percentage of ELISA negative individuals who gave a particular response.
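The percentages behind such bars can be derived from a table of counts by dividing each cell by its group total. The sketch below does this for the marital status counts reported in this document; the function name `column_percentages` is hypothetical, and no plotting library is assumed.

```python
def column_percentages(table):
    """Convert {response: (positive, negative)} counts to percentages per group."""
    pos_total = sum(pos for pos, _ in table.values())
    neg_total = sum(neg for _, neg in table.values())
    return {
        response: (100 * pos / pos_total, 100 * neg / neg_total)
        for response, (pos, neg) in table.items()
    }


marital = {"Married": (305, 11574), "Unmarried": (21, 430)}
for response, (pos_pct, neg_pct) in column_percentages(marital).items():
    print(f"{response}: {pos_pct:.1f}% of positives, {neg_pct:.1f}% of negatives")
# Married: 93.6% of positives, 96.4% of negatives
# Unmarried: 6.4% of positives, 3.6% of negatives
```

Comparing percentages rather than raw counts keeps the two groups on the same scale despite their very different sizes.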

We get a similar result for expectoration (p = 0.0379) and chest pain (p = 0.0100):

Response           Positive  Negative
Had Expectoration  53        2540
No Expectoration   273       9461

Response        Positive  Negative
Had Chest Pain  51        2610
No Chest Pain   275       9391

Or in bar chart form:

In terms of pathogen factors, we also find a strongly significant difference in the consumption of crayfish (p = 1.79e-05).

Response                   Positive  Negative
Consumes crayfish          4         15
Does not consume crayfish  322       11989

In summary, among the categorical variables, we found statistically significant differences between ELISA positive and ELISA negative individuals in the following:

  • Marital status
  • Expectoration
  • Chest pain
  • Consumption of crayfish

The remaining columns showed no statistically significant differences. A breakdown of every tested column is included at the end of this document.

Future Work

There is a large difference in the sample sizes: 326 ELISA positive and 12075 ELISA negative individuals. The chi-squared test is very sensitive to sample size, which may lead to differences being falsely identified as statistically significant. More work is needed to account for the large disparity between the sample sizes.
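One simple diagnostic that could inform this work is to inspect the expected cell counts under independence, since a common rule of thumb treats the chi-squared approximation as unreliable when any expected count falls below about 5. The sketch below is illustrative only and was not part of the original analysis; it uses the crayfish counts reported in this document, and `expected_counts` is a hypothetical helper.

```python
def expected_counts(table):
    """Expected cell counts under independence for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Crayfish counts from this document: rows Yes / No, columns Positive / Negative.
crayfish = [[4, 15], [322, 11989]]
smallest = min(min(row) for row in expected_counts(crayfish))
print(round(smallest, 2))  # 0.5
```

Here the smallest expected count is well below 5, which suggests the crayfish result in particular deserves a closer look.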

Not all columns have been tested. For example, it may be beneficial to test the durations of symptoms with a t-test. Before doing so, some cleanup of the data is necessary as it is incomplete.

Full Breakdown

Host Factors:

Sex_of_the_study_participant p-value: 0.6600516932794341

Response     Positive  Negative
Female       180       6345
Male         146       5654
Transgender  0         5

Marital_status_of_the_participant p-value: 0.010338783754939688

Response   Positive  Negative
Married    305       11574
Unmarried  21        430

Religion_of_the_participant p-value: 0.8420839699727036

Response      Positive  Negative
Christianity  0         5
Hindu         325       11941
Islam         1         58

Belongs_to_tribal_community p-value: 1.0

Response  Positive  Negative
No        326       12004

Educational_qualification_of_the_study_participant p-value: 0.1896305167122306

Response                 Positive  Negative
Graduate/Degree/Diploma  74        2558
Illiterate               90        3004
Post-graduate            72        2439
Secondary                90        4003

Occupation_of_the_study_participant p-value: 0.07118126521529632

Response                       Positive  Negative
Agri. Laborer                  38        1222
Construction laborer           26        1050
Forest products                12        483
Housewife                      1         42
Livestock                      0         14
MNREGA work only               18        455
Others                         1         13
Own cultivation                1         129
Own cultivation and laborer    70        1882
Private                        2         63
Small business/Petti/Tea shop  30        986
Student                        127       5665

Which_health_facility_do_you_access_when_you_are_sick p-value: 0.5154638395142579

Response                   Positive  Negative
Government                 300       10992
Health centers run by NGO  0         44
Private                    25        841
Self-treatment             0         39
Traditional healer         1         88

Have_you_had_Cough p-value: 0.37789035913987745

Response  Positive  Negative
No        242       8556
Yes       84        3352

Have_you_had_Expectoration p-value: 0.037877357849456596

Response  Positive  Negative
No        273       9461
Yes       53        2540

Have_you_had_Chest_pain p-value: 0.010026969615090838

Response  Positive  Negative
No        275       9391
Yes       51        2610

Have_you_had_fever p-value: 1.0

Response  Positive  Negative
No        3         105
Yes       323       11896

Have_you_lost_appetite p-value: 0.15609852522410084

Response  Positive  Negative
No        300       10732
Yes       26        1269

Have_you_had_blood_in_sputum p-value: 0.430210821210259

Response  Positive  Negative
No        326       11947
Yes       0         54

Have_you_had_night_sweats p-value: 0.6305612868045478

Response  Positive  Negative
No        282       10248
Yes       44        1753

Have_you_lost_weight p-value: 1.0

Response  Positive  Negative
No        315       11595
Yes       10        386

Have_you_had_shortness_of_breath p-value: 0.22037504598491625

Response  Positive  Negative
No        267       9474
Yes       59        2527

Have_you_had_tiredness p-value: 0.8687320660144218

Response  Positive  Negative
No        252       9342
Yes       74        2659

Other_symptoms p-value: 1.0

Response  Positive  Negative
No        325       11951
Yes       1         50

Have_you_taken_any_treatment_or_actions p-value: 0.6159132422548381

Response  Positive  Negative
No        312       11597
Yes       0         38

Have_you_ever_been_treated_for_TB p-value: 0.6160488670397868

Response              Positive  Negative
No                    312       11600
Yes Currently on ATT  0         38

Pathogen Factors:

Consumption_of_fresh_water_crabs p-value: 1.0

Response  Positive  Negative
No        0         5
Yes       326       11999

Duration_of_consumption_of_fresh_water_crabs p-value: 0.6758771337543388

Response     Positive  Negative
1-3 years    0         3
3-11 months  0         2
<3 months    0         51
>5 years     326       11948

Frequency_of_consumption_of_fresh_water_crabs p-value: 0.22952043999118324

Response                  Positive  Negative
More than once in a week  18        511
Occasionally              37        1135
Once in a month           19        961
Once in a week            252       9397

Raw_fresh_water_crabs p-value: 1.0

Response  Positive  Negative
No        315       11585
Yes       11        419

Roasted_fresh_water_crabs p-value: 0.32006920638339187

Response  Positive  Negative
No        324       11980
Yes       2         24

Smoked_fresh_water_crabs p-value: 1.0

Response  Positive  Negative
No        326       12002
Yes       0         2

Soup_fresh_water_crabs p-value: 0.3020510534208001

Response  Positive  Negative
No        5         101
Yes       321       11903

Pickled_fresh_water_crabs p-value: 1.0

Response  Positive  Negative
No        326       12004

Cooked_fresh_water_crabs p-value: 1.0

Response  Positive  Negative
No        0         5
Yes       260       9231

Consumption_of_cray_fishes p-value: 1.78890200369268e-05

Response  Positive  Negative
No        322       11989
Yes       4         15

Duration_of_consumption_of_cray_fishes p-value: 1.0

Response   Positive  Negative
1-3 years  3         15

Frequency_of_consumption_of_cray_fishes p-value: 1.0

Response      Positive  Negative
Occasionally  3         15

Raw_cray_fish p-value: 1.0

Response  Positive  Negative
No        326       12004

Roasted_cray_fish p-value: 1.0

Response  Positive  Negative
No        326       12004

Smoked_cray_fish p-value: 1.0

Response  Positive  Negative
No        326       12004

Soup_cray_fish p-value: 1.0

Response  Positive  Negative
No        325       11985
Yes       1         19

Pickled_cray_fish p-value: 1.0

Response  Positive  Negative
No        326       12004

Cooked_cray_fish p-value: 1.0

Response  Positive  Negative
No        326       12004

Consumption_of_wlid_boar_meat p-value: 0.06804177212239208

Response  Positive  Negative
No        323       11975
Yes       3         29

Duration_of_consumption_of_wild_boar_meat p-value: 0.9585895848817743

Response     Positive  Negative
1-3 years    0         1
3-11 months  0         1
<3 months    0         1
>5 years     3         29

Frequency_of_consumption_of_wild_boar_meat p-value: 1.0

Response         Positive  Negative
Occasionally     3         31
Once in a month  0         1

Raw_wild_boar_meat p-value: 1.0

Response  Positive  Negative
No        260       9236

Roasted_wild_boar_meat p-value: 1.0

Response  Positive  Negative
No        260       9236

Smoked_wild_boar_meat p-value: 1.0

Response  Positive  Negative
No        260       9236

Soup_wild_boar_meat p-value: 1.0

Response  Positive  Negative
No        260       9236

Pickled_wild_boar_meat p-value: 1.0

Response  Positive  Negative
No        260       9236

Cooked_wild_boar_meat p-value: 1.0

Response  Positive  Negative
No        260       9233
Yes       0         3

Consumption_of_Rodents_Rats_etc p-value: 1.0

Response  Positive  Negative
No        260       9236

Duration_of_consumption_of_rodents p-value: error

Response  Positive  Negative

Frequency_of_consumption_of_rodents p-value: error

Response  Positive  Negative

Raw_rodents p-value: 1.0

Response  Positive  Negative
No        260       9236

Roasted_rodents p-value: 1.0

Response  Positive  Negative
No        260       9236

Smoked_rodents p-value: 1.0

Response  Positive  Negative
No        260       9236

Soup_rodents p-value: 1.0

Response  Positive  Negative
No        260       9236

Pickled_rodents p-value: 1.0

Response  Positive  Negative
No        260       9236

Cooked_rodents p-value: 1.0

Response  Positive  Negative
No        260       9236