Predict COVID-19 Spreading With C-SMOTE




SML, Evolving Data Stream, Concept Drift, Balancing, COVID-19


Data gathered continuously while monitoring the spread of the COVID-19 pandemic form an unbounded flow of data. Accurately forecasting whether infections will increase or decrease has a high impact, but it is challenging because the pandemic spreads and contracts periodically. Technically, the flow of data is said to be imbalanced and subject to concept drift: signs of a decrease are the minority class during spreading periods and become the majority class during contraction periods, and vice versa for signs of an increase. In this paper, we present a case study that applies the Continuous Synthetic Minority Oversampling Technique (C-SMOTE), a novel meta-strategy to pipeline with Streaming Machine Learning (SML) classification algorithms, to forecast the COVID-19 pandemic trend. Benchmarking SML pipelines that use C-SMOTE against state-of-the-art methods on a COVID-19 dataset, we bring statistical evidence that models learned using C-SMOTE perform better.
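The idea of rebalancing a stream whose minority class changes over time can be illustrated with a small sketch. The code below is not the authors' C-SMOTE implementation: it is a simplified stand-in that assumes binary labels 0/1, keeps a fixed-size sliding window instead of an adaptive one, and emits at most one SMOTE-style interpolated sample per arrival. The names `interpolate_minority` and `WindowedRebalancer` are hypothetical and introduced only for illustration.

```python
import random
from collections import deque


def interpolate_minority(points, k, rng):
    """SMOTE-style step: pick a minority point, pick one of its k nearest
    minority neighbours, and return a point on the segment between them."""
    i = rng.randrange(len(points))
    base = points[i]
    others = points[:i] + points[i + 1:]
    # Sort the remaining minority points by squared Euclidean distance to base.
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbour))


class WindowedRebalancer:
    """Hypothetical helper: decides, per arrival, which samples to pass on
    to an online classifier (assumption: labels are 0 or 1)."""

    def __init__(self, window_size=100, k=5, seed=0):
        self.window = deque(maxlen=window_size)  # recent real (x, y) pairs
        self.k = k
        self.rng = random.Random(seed)

    def process(self, x, y):
        """Return the real sample plus, if the window is imbalanced,
        one synthetic sample of the current minority class."""
        x = tuple(x)
        self.window.append((x, y))
        out = [(x, y)]

        counts = {0: 0, 1: 0}
        for _, label in self.window:
            counts[label] += 1

        # The minority class is recomputed on every arrival, so it follows
        # the stream when the class proportions swap.
        minority = min(counts, key=counts.get)
        deficit = counts[1 - minority] - counts[minority]
        points = [p for p, label in self.window if label == minority]

        if deficit > 0 and len(points) >= 2:
            out.append((interpolate_minority(points, self.k, self.rng), minority))
        return out
```

Each pair returned by `process` would then be passed to the incremental training step of whichever SML classifier is in the pipeline. Because the minority label is derived from the window contents, the synthetic samples switch class on their own once a contraction period replaces a spreading period in the recent data.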







How to Cite

Bernardo, A., & Della Valle, E. (2021). Predict COVID-19 Spreading With C-SMOTE. Business Information Systems, 1, 27–38.

Conference Proceedings Volume


Big Data