The duplicate checking process occurs in several stages. First, the master data file keys are loaded. Next, the keys of the file to be added are loaded. The priority of the cast is a function of its source and is stored in the Surface Codes group of the Meds-ascii format. It determines which member of a duplicate pair is kept in the master dataset.
These keys are then sorted by latitude. This simplifies and speeds up duplicate checking: an existing cast is considered only if it lies within 0.05 degrees of latitude and 0.1 degrees of longitude of the new cast.
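The latitude sort and position window described above can be sketched as follows. This is a minimal illustration, not the program's actual code: the CastKey layout, the function names, and the use of a binary search are assumptions, and the window is treated as inclusive. Longitude wrap-around at the date line is ignored for simplicity.

```python
from bisect import bisect_left, bisect_right
from typing import NamedTuple

LAT_WINDOW = 0.05  # degrees of latitude, as stated in the text
LON_WINDOW = 0.1   # degrees of longitude, as stated in the text


class CastKey(NamedTuple):
    # Hypothetical key layout; the real key fields are not given in the text.
    lat: float
    lon: float
    cast_id: str


def build_index(keys):
    """Sort keys by latitude and keep a parallel list of latitudes for bisecting."""
    sorted_keys = sorted(keys, key=lambda k: k.lat)
    lats = [k.lat for k in sorted_keys]
    return sorted_keys, lats


def candidates(index, new_key):
    """Return the master keys that fall inside the position window of new_key."""
    sorted_keys, lats = index
    # Binary search narrows the scan to the latitude band around the new cast.
    lo = bisect_left(lats, new_key.lat - LAT_WINDOW)
    hi = bisect_right(lats, new_key.lat + LAT_WINDOW)
    # Only the casts in that band need a longitude comparison.
    return [k for k in sorted_keys[lo:hi]
            if abs(k.lon - new_key.lon) <= LON_WINDOW]
```

Because the keys are sorted once, each new cast is compared against a narrow latitude band rather than the whole master file.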
There are three ways to determine whether or not a cast is a duplicate of a cast already in the database.
Casts which "fail" these tests are considered unique and are added to either the "add" file or the new version of the master data file, depending on which version of the program is running.
If the "new" cast is considered an exact match, it is not added to the master data file unless it is of higher priority. If the cast already in the master data file is to be rejected, it gets an additional history record with the flag "DU" and all the quality flags for the parameter (usually temperature) are set to '3' (reject). This ensures that the cast will not be used in any further analyses.
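The handling of an exact match can be sketched as below. This is an illustrative outline only: the Cast record, the function names, and the convention that a larger number means higher priority are assumptions, not the actual Meds-ascii layout or the program's code.

```python
from dataclasses import dataclass, field

REJECT = "3"          # quality flag value meaning "reject", as stated in the text
DUPLICATE_FLAG = "DU"  # history flag for a rejected duplicate, as stated in the text


@dataclass
class Cast:
    # Hypothetical in-memory record; the real format has many more fields.
    cast_id: str
    priority: int                 # assumption: larger value = higher priority
    quality_flags: list           # one flag per observation of the parameter
    history: list = field(default_factory=list)


def mark_duplicate(cast):
    """Add a DU history record and set every quality flag to reject."""
    cast.history.append(DUPLICATE_FLAG)
    cast.quality_flags = [REJECT] * len(cast.quality_flags)


def resolve_exact_match(master_cast, new_cast):
    """Return True if the new cast should be added to the master data file."""
    if new_cast.priority > master_cast.priority:
        # The new cast wins: the cast already in the master file is rejected.
        mark_duplicate(master_cast)
        return True
    # Equal or lower priority: the new cast is simply not added.
    return False
```

Keeping the rejected cast in the file with a DU record, rather than deleting it, preserves a trace of the decision while still excluding the cast from later analyses.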
Whenever it seems useful, the program rewriteDA2DAnodupes can be run to create a new data file without the casts that carry the DU flag. In any case, these casts are not used in any further processing.
These routines were initially tested by adding the WOCE datasets to the Mership datasets. Minimal duplicates remained after testing (58 casts or 0.04% of the 14523 casts in the final dataset). 49.2% of the casts in the WOCE datasets were eliminated as true duplicates of profiles already in the master dataset. 4% of the drops added were flagged as near duplicates, and the majority of these (90.1%) were true repeat casts. With only 0.04% of the database being true duplicates, it will probably not be necessary to remove them manually, but each database will be checked as it goes in to ensure that a higher proportion of duplicates does not enter the final master file.
We have also discovered that a significant percentage of the casts in the WOA datasets are true duplicates that have been attributed to different years or months. Because the valid time of these profiles cannot be determined, both must be eliminated from the database. A proportion also have different times or dates within the same year/month, but these errors are not considered as serious as an error in year.
Only the program duplicateselfDA currently checks for identical casts that are nominally years or months apart, which is a good reason to run it after the master data file has been assembled. This was tested using a combined WOA/Mership dataset. All casts that were identified as exact duplicates but with different years or months were checked by hand. Of the 155 casts so identified, 72 were ultimately re-added to the master data file. Most of these had been selected because one very low resolution cast happened to have the same value as a nearby, higher resolution cast. Out of a total of 16000 casts, hand checking 155 for duplication is reasonable.
The file of casts identified as exact duplicates with identical date/time/position/data_types was also examined. In this case, one of the pair (the higher resolution version, if any) had been retained in the database. Of the 3945 casts in this file, 1000 were checked manually to see if the casts were true duplicates. Of the 1000 checked, 5 were casts which should have been retained in the master dataset. Given that one of each matching pair was retained, this error rate is considered insignificant, and we will accept that 0.5% of the casts identified as exact duplicates are not really duplicates; this loss is unavoidable without expensive manual checking and so will be ignored.