Paragonimiasis Data Merging
We want to merge the active and passive data sets into one large data set. This is fairly straightforward in pandas with pd.concat. However, some cleanup is necessary first:
- The column names in the Excel files are not all on row 1 making it harder to import into
pandas. Some quick manual editing fixes this issue. - The data sets have mostly the same columns, but there are a few columns unique to only one data set:
- Address_of_the_healthcare_facility
- Date_Checked
- Date_Collection_Overnight_Fresh_sputum_sample_with_preservative
- Date_Collection_Overnight_sputum_sample_with_Ethanol_Soln
- Date_Collection_Spot_Serum_sample
- Date_sent_to_Lab_Overnight_Fresh_sputum_sample_with_preservative
- Date_sent_to_Lab_Overnight_sputum_sample_with_Ethanol_Soln
- Date_sent_to_lab_Spot_serum_sample
- Date_sent_to_lab_Spot_whole_blood_sample
- Healthcare_facility_code
- Name_of_the_healthcare_facility
- Result_of_sputum_microscopy
- Result_of_sputum_smear_microscopy
- Tube_No_Overnight_Fresh_sputum_sample_with_preservative
- Tube_No_Overnight_Fresh_sputum_sample_without_fixative
- Tube_No_Overnight_sputum_sample_with_Ethanol_Soln
- Tube_No_Spot_Serum_sample
- Village_code
- When merging, these columns are set to NaN for the rows that don’t have the column
- There is an overlap in
Serial_Novalues. To correct this (and help differentiate the collection methodology) we add 1,000,000 to each serial for the passive data set. - At this point we can finally merge the data sets with
pd.concat. - There are 26 columns marked for removal:
- Date_of_the_interview
- Tube_No_Spot_Fresh_sputum_sample_without_fixative
- Date_Collection_Spot_Fresh_sputum_sample_without_fixative
- Date_sent_to_Lab_Spot_Fresh_sputum_sample_without_fixative
- Date_Collection_Overnight_Fresh_sputum_sample_without_fixative
- Date_sent_to_Lab_Overnight_Fresh_sputum_sample_without_fixative
- Tube_No_Spot_Fresh_sputum_sample_with_preservative
- Date_Collection_Spot_Fresh_sputum_sample_with_preservative
- Date_sent_to_Lab_Spot_Fresh_sputum_sample_with_preser
- Tube_No_Spot_sputum_sample_with_Ethanol_Sol
- Date_Collection_Spot_sputum_sample_with_Ethanol_Soln
- Date_sent_to_Lab_Spot_sputum_sample_with_Ethanol_Soln
- Tube_No_Overnight_Stool_sample
- Date_Collection_of_Overnight_stool_sample
- Date_Sent_to_lab_of_Overnight_stool_sample
- Tube_No_Spot_Whole_Blood_sample
- Date_Collection_Spot_whole_blood_sample
- Scrutinize_Filled_all_questions
- Scrutinize_Sputum_collected
- Scrutinize_Stool_collected
- Scrutinize_Blood_collected
- Field_Investigator_name
- Date_checked_Field_investigator
- Checked_by_name
- Checked_by
- Designation
- Some columns instead of being blank have a string such as ‘Nil’. I’ve located three unique (upto letter casing) strings: ‘nil’, ‘nill’, and ‘nan’. These have been replaced with
np.nanwhich results in a blank entry when saving. - The final result has been saved as
Merged Data.csv.