Paragonimiasis Data Merging

We want to merge the active and passive data sets into one large data set. This is fairly straightforward in pandas with pd.concat. However, some cleanup is necessary first:

The column names in the Excel files are not all on row 1 making it harder to import into pandas. Some quick manual editing fixes this issue.
The data sets have mostly the same columns, but there are a few columns unique to only one data set:
- Address_of_the_healthcare_facility
- Date_Checked
- Date_Collection_Overnight_Fresh_sputum_sample_with_preservative
- Date_Collection_Overnight_sputum_sample_with_Ethanol_Soln
- Date_Collection_Spot_Serum_sample
- Date_sent_to_Lab_Overnight_Fresh_sputum_sample_with_preservative
- Date_sent_to_Lab_Overnight_sputum_sample_with_Ethanol_Soln
- Date_sent_to_lab_Spot_serum_sample
- Date_sent_to_lab_Spot_whole_blood_sample
- Healthcare_facility_code
- Name_of_the_healthcare_facility
- Result_of_sputum_microscopy
- Result_of_sputum_smear_microscopy
- Tube_No_Overnight_Fresh_sputum_sample_with_preservative
- Tube_No_Overnight_Fresh_sputum_sample_without_fixative
- Tube_No_Overnight_sputum_sample_with_Ethanol_Soln
- Tube_No_Spot_Serum_sample
- Village_code
When merging, these columns are set to NaN for the rows that don’t have the column
There is an overlap in Serial_No values. To correct this (and help differentiate the collection methodology) we add 1,000,000 to each serial for the passive data set.
At this point we can finally merge the data sets with pd.concat.
There are 26 columns marked for removal:
- Date_of_the_interview
- Tube_No_Spot_Fresh_sputum_sample_without_fixative
- Date_Collection_Spot_Fresh_sputum_sample_without_fixative
- Date_sent_to_Lab_Spot_Fresh_sputum_sample_without_fixative
- Date_Collection_Overnight_Fresh_sputum_sample_without_fixative
- Date_sent_to_Lab_Overnight_Fresh_sputum_sample_without_fixative
- Tube_No_Spot_Fresh_sputum_sample_with_preservative
- Date_Collection_Spot_Fresh_sputum_sample_with_preservative
- Date_sent_to_Lab_Spot_Fresh_sputum_sample_with_preser
- Tube_No_Spot_sputum_sample_with_Ethanol_Sol
- Date_Collection_Spot_sputum_sample_with_Ethanol_Soln
- Date_sent_to_Lab_Spot_sputum_sample_with_Ethanol_Soln
- Tube_No_Overnight_Stool_sample
- Date_Collection_of_Overnight_stool_sample
- Date_Sent_to_lab_of_Overnight_stool_sample
- Tube_No_Spot_Whole_Blood_sample
- Date_Collection_Spot_whole_blood_sample
- Scrutinize_Filled_all_questions
- Scrutinize_Sputum_collected
- Scrutinize_Stool_collected
- Scrutinize_Blood_collected
- Field_Investigator_name
- Date_checked_Field_investigator
- Checked_by_name
- Checked_by
- Designation
Some columns instead of being blank have a string such as ‘Nil’. I’ve located three unique (upto letter casing) strings: ‘nil’, ‘nill’, and ‘nan’. These have been replaced with np.nan which results in a blank entry when saving.
The final result has been saved as Merged Data.csv.

Notes

Explorer

2024-06-03

Paragonimiasis Data Merging

Graph View

Backlinks

Source code