Paragonimiasis Data Merging
We want to merge the active and passive data sets into one large data set. This is fairly straightforward in pandas
with pd.concat
. However, some cleanup is necessary first:
- The column names in the Excel files are not all on row 1 making it harder to import into
pandas
. Some quick manual editing fixes this issue. - The data sets have mostly the same columns, but there are a few columns unique to only one data set:
- Address_of_the_healthcare_facility
- Date_Checked
- Date_Collection_Overnight_Fresh_sputum_sample_with_preservative
- Date_Collection_Overnight_sputum_sample_with_Ethanol_Soln
- Date_Collection_Spot_Serum_sample
- Date_sent_to_Lab_Overnight_Fresh_sputum_sample_with_preservative
- Date_sent_to_Lab_Overnight_sputum_sample_with_Ethanol_Soln
- Date_sent_to_lab_Spot_serum_sample
- Date_sent_to_lab_Spot_whole_blood_sample
- Healthcare_facility_code
- Name_of_the_healthcare_facility
- Result_of_sputum_microscopy
- Result_of_sputum_smear_microscopy
- Tube_No_Overnight_Fresh_sputum_sample_with_preservative
- Tube_No_Overnight_Fresh_sputum_sample_without_fixative
- Tube_No_Overnight_sputum_sample_with_Ethanol_Soln
- Tube_No_Spot_Serum_sample
- Village_code
- When merging, these columns are set to NaN for the rows that don’t have the column
- There is an overlap in
Serial_No
values. To correct this (and help differentiate the collection methodology) we add 1,000,000 to each serial for the passive data set. - At this point we can finally merge the data sets with
pd.concat
. - There are 26 columns marked for removal:
- Date_of_the_interview
- Tube_No_Spot_Fresh_sputum_sample_without_fixative
- Date_Collection_Spot_Fresh_sputum_sample_without_fixative
- Date_sent_to_Lab_Spot_Fresh_sputum_sample_without_fixative
- Date_Collection_Overnight_Fresh_sputum_sample_without_fixative
- Date_sent_to_Lab_Overnight_Fresh_sputum_sample_without_fixative
- Tube_No_Spot_Fresh_sputum_sample_with_preservative
- Date_Collection_Spot_Fresh_sputum_sample_with_preservative
- Date_sent_to_Lab_Spot_Fresh_sputum_sample_with_preser
- Tube_No_Spot_sputum_sample_with_Ethanol_Sol
- Date_Collection_Spot_sputum_sample_with_Ethanol_Soln
- Date_sent_to_Lab_Spot_sputum_sample_with_Ethanol_Soln
- Tube_No_Overnight_Stool_sample
- Date_Collection_of_Overnight_stool_sample
- Date_Sent_to_lab_of_Overnight_stool_sample
- Tube_No_Spot_Whole_Blood_sample
- Date_Collection_Spot_whole_blood_sample
- Scrutinize_Filled_all_questions
- Scrutinize_Sputum_collected
- Scrutinize_Stool_collected
- Scrutinize_Blood_collected
- Field_Investigator_name
- Date_checked_Field_investigator
- Checked_by_name
- Checked_by
- Designation
- Some columns instead of being blank have a string such as ‘Nil’. I’ve located three unique (upto letter casing) strings: ‘nil’, ‘nill’, and ‘nan’. These have been replaced with
np.nan
which results in a blank entry when saving. - The final result has been saved as
Merged Data.csv
.