Results
We compare the performance of the four methods on manually annotated ground truth data, then apply the best-performing method to a large corpus of Web datasets in order to understand the prevalence of different provenance relationships between those datasets.
We generated a corpus of dataset metadata by crawling the Web to find pages with schema.org metadata indicating that the page contains a dataset. We then limited the corpus to datasets that have persistent, dereferenceable identifiers (i.e., a unique code that permanently identifies a digital object and allows access to it even if the original location or website changes). This corpus includes 2.7 million dataset-metadata entries.
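As a rough illustration of this filtering step, the sketch below extracts schema.org Dataset records embedded as JSON-LD and keeps only those whose identifier looks persistent. The DOI-style pattern used as the persistence criterion, and all function names, are assumptions for illustration, not the paper's actual implementation.

```python
import json
import re

# Illustrative persistence check: treat DOI-like identifiers as persistent and
# dereferenceable. The real corpus may use a broader set of identifier schemes.
DOI_PATTERN = re.compile(r"^(https?://doi\.org/|doi:)?10\.\d{4,9}/\S+$", re.IGNORECASE)

def extract_dataset_records(jsonld_blocks):
    """Yield schema.org Dataset records from a list of JSON-LD strings."""
    for block in jsonld_blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue
        items = data if isinstance(data, list) else [data]
        for item in items:
            if item.get("@type") == "Dataset":
                yield item

def has_persistent_identifier(record):
    """Return True if any identifier on the record matches the DOI-like pattern."""
    identifiers = record.get("identifier", [])
    if isinstance(identifiers, str):
        identifiers = [identifiers]
    return any(DOI_PATTERN.match(str(i)) for i in identifiers)

# Usage: corpus = [r for r in extract_dataset_records(blocks) if has_persistent_identifier(r)]
```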
To generate ground truth for training and evaluation, we manually labeled 2,178 dataset pairs. The labelers had access to all metadata fields for these datasets, including name, description, provider, and temporal and spatial coverage.
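A minimal sketch of the record layout implied by this labeling setup is shown below; the field and class names are illustrative assumptions rather than the paper's schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical layout of the metadata shown to labelers for each dataset in a pair.
@dataclass
class DatasetMetadata:
    name: str
    description: str
    provider: str
    temporal_coverage: Optional[str] = None  # e.g., an ISO 8601 interval
    spatial_coverage: Optional[str] = None   # e.g., a place name or bounding box

@dataclass
class LabeledPair:
    dataset_a: DatasetMetadata
    dataset_b: DatasetMetadata
    relationship: str  # the manually assigned provenance relationship
```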
We compare the performance of the four methods (schema.org, heuristics-based, gradient boosted decision trees (GBDT), and T5) across dataset relationship categories; a detailed breakdown appears in the paper. The ML methods (GBDT and T5) outperform the heuristics-based approach in identifying dataset relationships. GBDT consistently achieves the highest F1 scores across categories, with T5 performing comparably.
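For concreteness, a per-category F1 comparison of this kind could be computed as sketched below. The method names, category labels, and variable names are assumptions for illustration; only the four compared methods come from the text.

```python
from sklearn.metrics import f1_score

def per_category_f1(y_true, method_preds, categories):
    """Return {method: {category: F1}} for multi-class relationship labels."""
    results = {}
    for method, y_pred in method_preds.items():
        scores = f1_score(y_true, y_pred, labels=categories, average=None)
        results[method] = dict(zip(categories, scores))
    return results

# Example usage (labels and predictions are placeholders):
# results = per_category_f1(
#     gold_labels,
#     {"schema.org": p_schema, "heuristics": p_heur, "GBDT": p_gbdt, "T5": p_t5},
#     ["version", "subset", "derived", "unrelated"],
# )
```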