The Aruna Object Storage

A Distributed Multi Cloud Object Storage System for Scientific Data Management

Authors

DOI:

https://doi.org/10.52825/cordi.v1i.404

Keywords:

data management, storage, cloud-native, multi-cloud, FAIR, data mesh

Abstract

The exponential growth of scientific data has led to an increasing demand for effective data management and storage solutions. Academic computing infrastructures are often fragmented, which makes it difficult for researchers to leverage cloud-native principles and modern data analysis tools. To address this challenge, a new distributed storage platform called Aruna Object Storage (AOS) was developed. AOS is a cloud-native, scalable, and domain-agnostic object storage system that provides an S3-compatible interface for data analysis tools such as Apache Spark, TensorFlow, and Pandas. The system uses an underlying distributed NewSQL database to manage detailed information about its resources and can be deployed across multiple data centers for geo-redundancy. AOS is designed to support modern DataOps practices, including the adoption of FAIR principles. Resources in AOS are organized into Objects, Datasets, Collections, and Projects, which express the relations between data objects. These resources can be further annotated with key-value pairs called Labels and Hooks, which provide additional information about the data. The system's event-driven architecture makes it easy to automate actions and enforce data validation checks, which improves the accessibility and reproducibility of scientific results. AOS is open source and freely available via https://aruna-storage.org.
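
The resource organization described in the abstract can be illustrated with a minimal sketch. This is not the AOS API or its data model: the `Resource` class, its `add` and `find` methods, the nesting order (Project > Collection > Dataset > Object), and all names and label values are assumptions made purely for illustration. It only shows how key-value Labels attached to nested resources could be used to select data.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Resource:
    """A named resource annotated with key-value Labels (hypothetical model)."""
    name: str
    labels: Dict[str, str] = field(default_factory=dict)
    children: List["Resource"] = field(default_factory=list)

    def add(self, child: "Resource") -> "Resource":
        """Attach a child resource and return it for chaining."""
        self.children.append(child)
        return child

    def find(self, key: str, value: str) -> List["Resource"]:
        """Return this resource and all descendants labeled key=value."""
        hits = [self] if self.labels.get(key) == value else []
        for child in self.children:
            hits.extend(child.find(key, value))
        return hits


# Assumed nesting for illustration: Project > Collection > Dataset > Object.
project = Resource("demo-project")
collection = project.add(Resource("run-2023", {"domain": "genomics"}))
dataset = collection.add(Resource("sample-a", {"domain": "genomics"}))
dataset.add(Resource("reads.fastq", {"format": "fastq", "validated": "true"}))
dataset.add(Resource("notes.txt", {"format": "text"}))

# Select every resource whose Labels mark it as validated.
validated = [r.name for r in project.find("validated", "true")]
print(validated)
```

In this sketch a label query walks the whole hierarchy, so the same call works whether it starts from a project, a collection, or a dataset.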


References

- C. L. Borgman, "The conundrum of sharing research data", Journal of the American Society for Information Science and Technology, vol. 63, no. 6, pp. 1059–1078, 2012. DOI: https://doi.org/10.1002/asi.22634

- K. De Smedt, D. Koureas, and P. Wittenburg, "FAIR digital objects for science: From data pieces to actionable knowledge units", Publications, vol. 8, no. 2, p. 21, 2020. DOI: https://doi.org/10.3390/publications8020021

- I. A. Machado, C. Costa, and M. Y. Santos, "Data mesh: Concepts and principles of a paradigm shift in data architectures", Procedia Computer Science, vol. 196, pp. 263–271, 2022. DOI: https://doi.org/10.1016/j.procs.2021.12.013

- S. Salloum, R. Dautov, X. Chen, P. X. Peng, and J. Z. Huang, "Big data analytics on Apache Spark", International Journal of Data Science and Analytics, vol. 1, pp. 145–164, 2016. DOI: https://doi.org/10.1007/s41060-016-0027-9

- P. Di Tommaso, M. Chatzou, E. W. Floden, P. P. Barja, E. Palumbo, and C. Notredame, "Nextflow enables reproducible computational workflows", Nature Biotechnology, vol. 35, no. 4, pp. 316–319, 2017. DOI: https://doi.org/10.1038/nbt.3820

- W. McKinney et al., "Pandas: A foundational Python library for data analysis and statistics", Python for High Performance and Scientific Computing, vol. 14, no. 9, pp. 1–9, 2011.

- M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, et al., "The FAIR guiding principles for scientific data management and stewardship", Scientific Data, vol. 3, no. 1, pp. 1–9, 2016. DOI: https://doi.org/10.1038/sdata.2016.18

- E. Yuan and J. Tong, "Attributed based access control (ABAC) for web services", in IEEE International Conference on Web Services (ICWS’05), IEEE, 2005. DOI: https://doi.org/10.1109/ICWS.2005.25

- K. Indrasiri and D. Kuruppu, "gRPC: up and running: building cloud native applications with Go and Java for Docker and Kubernetes", O’Reilly Media, 2020, ISBN: 9781492058335


Published

Received 2023-04-26
Accepted 2023-06-29
Published 2023-09-07
