The Aruna Object Storage
A Distributed Multi Cloud Object Storage System for Scientific Data Management
Keywords: data management, storage, cloud-native, multi-cloud, FAIR, data mesh
The exponential growth of scientific data has increased the demand for effective data management and storage solutions. Academic computing infrastructures are often fragmented, which makes it difficult for researchers to apply cloud-native principles and use modern data analysis tools. To address this challenge, a new distributed storage platform called Aruna Object Storage (AOS) was developed. AOS is a cloud-native, scalable, and domain-agnostic object storage system that provides an S3-compatible interface for a variety of data analysis tools such as Apache Spark, TensorFlow, and Pandas. The system uses an underlying distributed NewSQL database to manage detailed information about its resources and can be deployed across multiple data centers for geo-redundancy. AOS is designed to support modern DataOps practices, including the adoption of the FAIR principles. Resources in AOS are organized into Objects, Datasets, Collections, and Projects, which express relations between data objects. These resources can be annotated with key-value pairs, called Labels and Hooks, that carry further information about the data. The system's event-driven architecture makes it easy to automate actions and enforce data validation checks, which improves the accessibility and reproducibility of scientific results. AOS is open source and freely available via https://aruna-storage.org.
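The resource organization and event-driven validation described above can be illustrated with a small in-memory sketch. This is not the AOS API: the `Resource` and `EventBus` classes, the `object_created` event name, the `sha256` label, and the nesting order Project > Collection > Dataset > Object are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Hypothetical model, not the AOS API: names and fields are illustrative only.
@dataclass
class Resource:
    name: str
    kind: str  # "Project", "Collection", "Dataset" or "Object"
    labels: Dict[str, str] = field(default_factory=dict)
    children: List["Resource"] = field(default_factory=list)

    def add(self, child: "Resource") -> "Resource":
        """Attach a child resource and return it for chaining."""
        self.children.append(child)
        return child

    def find_by_label(self, key: str, value: str) -> List["Resource"]:
        """Return this resource and all descendants labelled key=value."""
        hits = [self] if self.labels.get(key) == value else []
        for child in self.children:
            hits.extend(child.find_by_label(key, value))
        return hits


class EventBus:
    """Tiny in-process stand-in for an event-driven notification system."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[Resource], None]]] = {}

    def subscribe(self, event: str, handler: Callable[[Resource], None]) -> None:
        self._handlers.setdefault(event, []).append(handler)

    def emit(self, event: str, resource: Resource) -> None:
        for handler in self._handlers.get(event, []):
            handler(resource)


def require_checksum(resource: Resource) -> None:
    # Example validation check: record whether a checksum label is present.
    resource.labels["validated"] = "true" if "sha256" in resource.labels else "false"


# Build an assumed hierarchy: Project > Collection > Dataset > Object.
project = Resource("genomics", "Project")
collection = project.add(Resource("run-2023", "Collection"))
dataset = collection.add(Resource("sample-a", "Dataset", {"organism": "e-coli"}))
obj = dataset.add(
    Resource("reads.fastq", "Object", {"organism": "e-coli", "sha256": "abc123"})
)

# Run the validation check automatically when an object is created.
bus = EventBus()
bus.subscribe("object_created", require_checksum)
bus.emit("object_created", obj)

# Query the hierarchy by label.
matches = [r.name for r in project.find_by_label("organism", "e-coli")]
```

In this sketch, labels make resources discoverable at any level of the hierarchy, and the subscribed handler stands in for an automated validation step triggered by an event. A real deployment would use the interfaces that AOS itself exposes rather than these placeholder classes.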
Copyright (c) 2023 Marius Dieckmann, Sebastian Beyvers, Jannis Hochmuth, Anna Rehm, Frank Förster, Alexander Goesmann
This work is licensed under a Creative Commons Attribution 4.0 International License.