Determining the Similarity of Research Data by Using an Interoperable Metadata Extraction Method




Research Data Management, Metadata, Linked Open Data, Data Similarity, Metadata Similarity, Linked Data


Determining the similarity of research data is not a simple task, because formats differ widely between domains. In particular, many formats are binary, so a raw comparison of the files does not yield good results. This makes it hard to tell accurately how similar two pieces of research work are by comparing their data. Extracted interoperable metadata offers a way to describe data that is independent of the data format. This work therefore uses such extracted interoperable metadata to create a method that determines the similarity of research data based on their metadata. The method utilizes domain knowledge about the extracted metadata and the way they are formulated. A baseline and further methods are created for comparison. The results show that our method outperforms all other methods, especially those that compare the research data itself rather than the metadata. Since the results are promising, we propose further investigations on other datasets and possible use cases.
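
To illustrate why a metadata-level comparison can be independent of the data format, the following minimal sketch compares two metadata descriptions by the overlap of their statements. It is not the method proposed in this work: it assumes, for illustration only, that the extracted metadata is available as subject-predicate-object statements, and it uses plain Jaccard similarity as a stand-in measure. All identifiers and values (`ex:file`, `ex:unit`, `Alice`) are hypothetical.

```python
from typing import Set, Tuple

# A metadata statement as (subject, predicate, object); an assumed representation.
Triple = Tuple[str, str, str]


def jaccard_similarity(a: Set[Triple], b: Set[Triple]) -> float:
    """Share of statements common to both descriptions (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical metadata extracted from two files stored in different formats.
file_a = {
    ("ex:file", "dcterms:format", "application/x-hdf5"),
    ("ex:file", "dcterms:creator", "Alice"),
    ("ex:file", "ex:unit", "kelvin"),
}
file_b = {
    ("ex:file", "dcterms:format", "text/csv"),
    ("ex:file", "dcterms:creator", "Alice"),
    ("ex:file", "ex:unit", "kelvin"),
}

# Two of the four distinct statements are shared.
score = jaccard_similarity(file_a, file_b)
print(score)
```

In this sketch the two descriptions differ only in the format statement, so the score stays high even though a byte-level comparison of an HDF5 file and a CSV file would find little in common.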








How to Cite

Heinrichs, B., & Yazdi, M. A. (2023). Determining the Similarity of Research Data by Using an Interoperable Metadata Extraction Method. Proceedings of the Conference on Research Data Infrastructure, 1.
Received 2023-04-25
Accepted 2023-06-29
Published 2023-09-07