Paragonimiasis Data Merging

We want to merge the active and passive data sets into one large data set. This is fairly straightforward in pandas with pd.concat. However, some cleanup is necessary first:

  • The column names in the Excel files are not all on row 1 making it harder to import into pandas. Some quick manual editing fixes this issue.
  • The data sets have mostly the same columns, but there are a few columns unique to only one data set:
    • Address_of_the_healthcare_facility
    • Date_Checked
    • Date_Collection_Overnight_Fresh_sputum_sample_with_preservative
    • Date_Collection_Overnight_sputum_sample_with_Ethanol_Soln
    • Date_Collection_Spot_Serum_sample
    • Date_sent_to_Lab_Overnight_Fresh_sputum_sample_with_preservative
    • Date_sent_to_Lab_Overnight_sputum_sample_with_Ethanol_Soln
    • Date_sent_to_lab_Spot_serum_sample
    • Date_sent_to_lab_Spot_whole_blood_sample
    • Healthcare_facility_code
    • Name_of_the_healthcare_facility
    • Result_of_sputum_microscopy
    • Result_of_sputum_smear_microscopy
    • Tube_No_Overnight_Fresh_sputum_sample_with_preservative
    • Tube_No_Overnight_Fresh_sputum_sample_without_fixative
    • Tube_No_Overnight_sputum_sample_with_Ethanol_Soln
    • Tube_No_Spot_Serum_sample
    • Village_code
  • When merging, these columns are set to NaN for the rows that don’t have the column
  • There is an overlap in Serial_No values. To correct this (and help differentiate the collection methodology) we add 1,000,000 to each serial for the passive data set.
  • At this point we can finally merge the data sets with pd.concat.
  • There are 26 columns marked for removal:
    • Date_of_the_interview
    • Tube_No_Spot_Fresh_sputum_sample_without_fixative
    • Date_Collection_Spot_Fresh_sputum_sample_without_fixative
    • Date_sent_to_Lab_Spot_Fresh_sputum_sample_without_fixative
    • Date_Collection_Overnight_Fresh_sputum_sample_without_fixative
    • Date_sent_to_Lab_Overnight_Fresh_sputum_sample_without_fixative
    • Tube_No_Spot_Fresh_sputum_sample_with_preservative
    • Date_Collection_Spot_Fresh_sputum_sample_with_preservative
    • Date_sent_to_Lab_Spot_Fresh_sputum_sample_with_preser
    • Tube_No_Spot_sputum_sample_with_Ethanol_Sol
    • Date_Collection_Spot_sputum_sample_with_Ethanol_Soln
    • Date_sent_to_Lab_Spot_sputum_sample_with_Ethanol_Soln
    • Tube_No_Overnight_Stool_sample
    • Date_Collection_of_Overnight_stool_sample
    • Date_Sent_to_lab_of_Overnight_stool_sample
    • Tube_No_Spot_Whole_Blood_sample
    • Date_Collection_Spot_whole_blood_sample
    • Scrutinize_Filled_all_questions
    • Scrutinize_Sputum_collected
    • Scrutinize_Stool_collected
    • Scrutinize_Blood_collected
    • Field_Investigator_name
    • Date_checked_Field_investigator
    • Checked_by_name
    • Checked_by
    • Designation
  • Some columns instead of being blank have a string such as ‘Nil’. I’ve located three unique (upto letter casing) strings: ‘nil’, ‘nill’, and ‘nan’. These have been replaced with np.nan which results in a blank entry when saving.
  • The final result has been saved as Merged Data.csv.