Changes Between Data Sets

Mahantesh sent us a modified copy of the paragonimiasis data sets. I loaded each merged data set (combination of passive and active studies) into pandas and spent some time comparing the differences between the data sets.

The following columns had their names corrected:

  • Taenia_solium †
  • Sputum_test
  • Pinworms †
  • Ascaris_Lumbricoides †
  • Hymenolepis †
  • Strongyloides_larvae †
  • Other_tests
  • Hookworms †
  • Taenia_saginata †
  • Mantoux_test
  • X_Ray_test
  • Trichuris_trichiura †
  • Intestinal_trematodes †

All of the columns marked with † have the sole value of “No” for all patients. The remaining have at most 37 filled entries and 12,363 missing entries.

A single patient (ID of P2019) was removed in the new data set.

I then went through each column and counted the number of changed entries between the two data sets. It appears the column Serial_No has changed which I had been using an index. So instead we make comparison based on the unique patient identifier given by Individual_Particular_ID and make the assumption that these identifiers have not changed.

The code for detecting changes for a specific column is given below:

def detect_changes(col):
    count = 0
    for patient in shared_patients:
        try:
            new_value = df_new.loc[df_new['Individual_Particular_ID'] == patient, col].to_list()[0]
            old_value = df_old.loc[df_old['Individual_Particular_ID'] == patient, col].to_list()[0]
            if new_value != old_value and not np.isnan(new_value):
                count += 1
        except:
            pass
 
    return count

In summary, for each column that the two data sets share, we walk through each patient that are contained in both data sets. Then for each patient we check if their value for the current column has changed. We keep track of the number of changes per column. The resulting counts are given below:

Column Name# Changes
Age_of_the_study_participant1
Weight_of_the_study_participant_in_Kgs1
Height_of_the_study_participant_in_Cms6

Training Results

Below is training results of the original data set:

center

Below is the training results of the new data set:

center

The black bar represents the mean MCC value across the eleven trials.