Text-Aware Predictive Monitoring of Business Processes

The real-time prediction of business processes using historical event data is an important capability of modern business process monitoring systems. Existing process prediction methods are able to exploit the data perspective of recorded events in addition to the control-flow perspective. However, while well-structured numerical or categorical attributes are considered by many prediction techniques, almost no technique is able to utilize text documents written in natural language, which can hold information critical to the prediction task. In this paper, we illustrate the design, implementation, and evaluation of a novel text-aware process prediction model based on Long Short-Term Memory (LSTM) neural networks and natural language models. The proposed model can take categorical, numerical, and textual attributes in event data into account to predict the activity and timestamp of the next event, the outcome, and the cycle time of a running process instance. Experiments show that the text-aware model is able to outperform state-of-the-art process prediction methods on simulated and real-world event logs containing textual data.


Introduction
In recent years, a progressive and rapid tendency to digital transformation has become apparent in most aspects of industrial production, provision of services, science, education, and leisure. This has, in turn, caused the widespread adoption of new technologies to support human activities. A significant number of these technologies specialize in the management of enterprise business processes.
The need for analysis and compliance checking in business processes, combined with an ever-growing availability of historical event data, has stimulated the birth and growth of the scientific discipline of process mining. Process mining enables the discovery of process models from historical execution data, the measurement of compliance between data and a process model, and the enhancement of process models with additional information extracted from complete process cases. We thank the Alexander von Humboldt (AvH) Stiftung for supporting our research interactions.
Advancements in process mining and other branches of data science have also enabled the possibility of adopting prediction techniques, algorithms that train a mathematical model from known data instances and are able to perform accurate estimates of various features of future instances. In the specific context of process mining, predictive monitoring is the task of predicting features of partial process instances, i.e., cases of the process still in execution, on the basis of recorded information regarding complete process instances. Examples of valuable information on partial process instances are the next activity in the process to be executed for the case, the time until the next activity, the completion time of the entire process instance, and the last activity in the case (outcome). If accurately estimated, these case features can guide process owners in making vital decisions, and improve operations within the organization that hosts the process; as a result, accurate predictive monitoring techniques are widely desirable and a precious asset for companies and organizations.
Existing predictive monitoring techniques typically operate at the merging point between process mining and machine learning, and are able to consider not only the control-flow perspective of event data (i.e., the activity, the case identifier, and the timestamp), but also additional data associated with them. However, few prediction techniques are able to exploit attributes in the form of text associated with events and cases. These textual attributes can hold crucial information regarding a case and its status within the workflow of a process. A general framework describing the problem is shown in Figure 1. The aim of this paper is to assess the extent to which textual information can influence predictive monitoring. To this end, we present a novel predictive monitoring approach able to exploit numerical, categorical, and textual attributes associated with events, as well as control-flow information. Our prediction model estimates features of cases in execution by combining a set of techniques for sequential and textual data encoding with predictions from an LSTM neural network, a machine learning technique particularly effective on sequential data such as process traces. Validation through experiments on real-life event logs shows that our approach is effective in extracting additional information from textual data, and outperforms state-of-the-art approaches for predictive monitoring.
The remainder of the paper is structured as follows. Section 2 discusses some recent work related to predictive monitoring. Section 3 presents some preliminary definitions. Section 4 illustrates the details and architecture of our text-aware predictive monitoring technique. Section 5 presents the evaluation of the predictor and the results of the experiments. Section 6 concludes the paper.

Related Work
The intersection of process mining and machine learning is a rich and influential field of research. Among the numerous applications of machine learning in process mining, feature prediction on partial process traces based on historical complete traces (i.e., predictive monitoring) is particularly prominent.
Earlier techniques for prediction in process mining focused on white-box and human-interpretable models, largely drawn from statistics. Many proposals have been put forward to compute an estimate of the cycle time of a process instance, including decision trees [6] and simulation through stochastic Petri nets [13]. Additionally, Teinemaa et al. [16] proposed a process outcome prediction method based on random forests and logistic regression. Van der Aalst et al.
[1] exploit process discovery as a step of the prediction process, obtaining estimations through replay on an annotated transition system; this technique was then extended by Polato et al. [12], who annotate a discovered transition system with an ensemble of naïve Bayes classifiers and support vector regressors, allowing for the data-aware prediction of cycle time and next activity.
The second half of the 2010s saw a sharp turn from ensemble learning to single prediction models, and from white-box to black-box models, specifically recurrent neural networks. This is due to the fact that recurrent neural networks have been shown to be very accurate in learning from sequential data; however, they are not interpretable, and their training efficiency is often lower.
This family of prediction methods employs LSTM neural networks to estimate process instance features. Evermann et al. [7] proposed the use of LSTMs for next activity prediction; Tax et al. [15] trained LSTMs to predict cycle time of process instances. Navarin et al. [9] extended this approach by feeding additional attributes in the LSTM, attaining data-aware prediction. More recently, Park and Song [10] merged system-level information from a process model with a compact trace representation based on deep neural networks to attain performance prediction.
No existing predictive monitoring technique, to the best of our knowledge, incorporates information from free text, recorded as event or trace attribute, with the control-flow perspective of the process into a state-of-the-art LSTM neural network model for predictive monitoring: this motivates the approach we present in this paper.

Preliminaries
Let us first introduce some preliminary definitions and notations.
Definition 1 (Sequence). A sequence of length n ∈ ℕ_0 over a set X is an ordered collection of elements defined by a function σ : {1, . . . , n} → X, which assigns each index an element of X. A sequence of length n is represented explicitly as σ = ⟨x_1, x_2, . . . , x_n⟩ with x_i ∈ X for 1 ≤ i ≤ n. In addition, ⟨⟩ is the empty sequence of length 0. Over the sequence σ we define |σ| = n, σ(i) = x_i, and x ∈ σ ⇔ ∃ 1 ≤ i ≤ n : x = x_i. X* denotes the set of all sequences over X.

Definition 2 (Event, Trace, Event Log, Prefix Log). Let A be the universe of activity labels. Let T be the universe of timestamps, totally ordered and closed under subtraction. Let D_1, . . . , D_m be the domains of m additional attributes. An event is a tuple e = (a, t, d_1, . . . , d_m) with activity label a ∈ A, timestamp t ∈ T, and additional attribute values d_i ∈ D_i; E denotes the universe of events. Over an event e we define the projection functions π_A(e) = a and π_T(e) = t. A trace σ ∈ E* is a finite, non-empty sequence of events; an event log L is a set of traces. The prefix of length k of a trace σ is hd^k(σ) = ⟨σ(1), . . . , σ(k)⟩; the prefix log of L is the set of all prefixes of the traces in L. Additional attributes d_i may be in the form of text, i.e., their domain is the set of sequences D_i = Σ* over a fixed and known alphabet Σ.
Next, let us define the target functions for our predictions:

Definition 3 (Target Functions).
Let σ ∈ E* be a non-empty trace, and let 1 ≤ k ≤ |σ|. The next activity function f_a : E* × ℕ → A ∪ {⊥} returns the activity of the next event, or the artificial activity ⊥ if the given trace is complete: f_a(σ, k) = π_A(σ(k + 1)) if k < |σ|, and f_a(σ, k) = ⊥ if k = |σ|. The next timestamp function f_t : E* × ℕ → T returns the time difference between the next event and the last event in the prefix: f_t(σ, k) = π_T(σ(k + 1)) − π_T(σ(k)) for k < |σ|. The case outcome function f_o : E* → A returns the last activity of the trace: f_o(σ) = π_A(σ(|σ|)). The cycle time function f_c : E* → T returns the total duration of the case, i.e., the time difference between the first and the last event of the trace: f_c(σ) = π_T(σ(|σ|)) − π_T(σ(1)).
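As an illustration, the four target functions can be sketched in plain Python, assuming (as a simplification, not part of the formal model) that a trace is a list of (activity, timestamp) pairs with numeric timestamps and that a hypothetical "<end>" marker serves as the artificial end-of-case activity:

```python
# Sketch of the target functions from Definition 3, assuming a trace is a
# list of (activity, timestamp) events with numeric timestamps.
END = "<end>"  # artificial end-of-case activity (hypothetical marker)

def f_a(trace, k):
    """Next activity after the k-th event (1-based), or END if the trace is complete."""
    return trace[k][0] if k < len(trace) else END

def f_t(trace, k):
    """Time difference between the next event and the last event in the prefix."""
    return trace[k][1] - trace[k - 1][1] if k < len(trace) else 0.0

def f_o(trace):
    """Case outcome: the activity of the last event."""
    return trace[-1][0]

def f_c(trace):
    """Cycle time: time between the first and the last event."""
    return trace[-1][1] - trace[0][1]

trace = [("register", 0.0), ("examine", 4.0), ("discharge", 10.0)]
print(f_a(trace, 1))  # examine
print(f_t(trace, 2))  # 6.0
print(f_o(trace))     # discharge
print(f_c(trace))     # 10.0
```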
The prediction techniques we show include the information contained in textual attributes of events. In order to be readable by a prediction model, the text needs to be processed by a text model. Text models rely on a text corpus, a collection of text fragments called documents. Before computing the text model, the documents in the corpus are preprocessed with a number of normalization steps: conversion to lowercase, tokenization (separation into distinct terms), lemmatization (mapping words with similar meaning, such as "diagnose" and "diagnosed", into a single lemma), and stop word removal (deletion of uninformative parts of speech, such as articles and adverbs). These transformation steps are shown in Table 1.

Table 1. Text normalization steps.

Step  Transformation       Example Document
0     Original             "The patient has been diagnosed with high blood pressure."
1     Lowercase            "the patient has been diagnosed with high blood pressure."
2     Tokenization         "the", "patient", "has", "been", "diagnosed", "with", "high", "blood", "pressure", "."
3     Lemmatization        "the", "patient", "have", "be", "diagnose", "with", "high", "blood", "pressure", "."
4     Stop word filtering  "patient", "diagnose", "high", "blood", "pressure"

In order to represent text in a structured way, we consider four different text models:
Bag of Words (BoW) [5]: a model where, given a vocabulary V, we encode a document with a vector of length |V| whose i-th component is the term frequency (tf), the number of occurrences of the i-th term of the vocabulary in the document, normalized with its inverse document frequency (idf), the inverse of the number of documents that contain the term. This tf-idf score accounts for term specificity and rare terms in the corpus. This model disregards the order between words.
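The normalization steps of Table 1 and the BoW encoding can be sketched in pure Python; the stop word list, the lemma lookup table, and the unsmoothed tf-idf weighting below are toy assumptions for illustration, not the preprocessing used in the actual implementation:

```python
import math
import re
from collections import Counter

# Toy lemma table and stop word list (assumptions for illustration only).
# Stop words are filtered after lemmatization, matching the order in Table 1.
LEMMAS = {"diagnosed": "diagnose", "has": "have", "been": "be"}
STOP_WORDS = {"the", "a", "an", "is", "with", "have", "be"}

def normalize(document):
    """Steps 1-4 of Table 1: lowercase, tokenize, lemmatize, filter stop words."""
    tokens = re.findall(r"[a-z]+", document.lower())
    lemmas = [LEMMAS.get(tok, tok) for tok in tokens]
    return [tok for tok in lemmas if tok not in STOP_WORDS]

def bow_vectors(corpus):
    """Encode each document as a tf-idf vector over the corpus vocabulary."""
    docs = [normalize(d) for d in corpus]
    vocab = sorted({t for doc in docs for t in doc})
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    n = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # tf-idf: term frequency weighted by log of inverse document frequency
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

vocab, vecs = bow_vectors([
    "The patient has been diagnosed with high blood pressure.",
    "The patient has low blood pressure.",
])
```

Note that terms appearing in every document (here "patient", "blood", "pressure") receive an idf of zero, so only discriminative terms contribute to the encoding.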
Bag of N-Grams (BoNG) [5]: this model is a generalization of the BoW model. Instead of single terms, the vocabulary consists of n-tuples of consecutive terms in the corpus. The unigram model (n = 1) is equivalent to the BoW model. For the bigram model (n = 2), the vocabulary consists of pairs of words that appear next to each other in the documents. The documents are encoded with the tf-idf scores of their n-grams. This model is able to account for local word order.
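Extracting the n-grams that form the BoNG vocabulary is a one-line sliding window over the normalized token sequence; a minimal sketch:

```python
def ngrams(tokens, n):
    """Return all n-grams: tuples of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["patient", "diagnose", "high", "blood", "pressure"]
print(ngrams(tokens, 2))
# [('patient', 'diagnose'), ('diagnose', 'high'), ('high', 'blood'), ('blood', 'pressure')]
```

With n = 1 this degenerates to the single-term vocabulary of the BoW model, as noted above.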
Paragraph Vector (Doc2Vec) [8]: in this model, a feedforward neural network is trained to predict one-hot encodings of words from their context, i.e., words that appear before or after the target word in the training documents. An additional vector, of a chosen size and unique for each document, is trained together with the word vectors. When the network converges, the additional vector carries information regarding the words in the corresponding document and their relationship, and is thus a fixed-length representation of the document.
Latent Dirichlet Allocation (LDA) [4]: a generative statistical text model, representing documents as a set of topics, whose size is fixed and specified a priori. Topics are multinomial (i.e., categorical) probability distributions over all words in the vocabulary and are learned by the model in an unsupervised manner. The underlying assumption of the LDA model is that the text documents were created by a statistical process that first samples a topic from a multinomial distribution associated with a document, then samples words from the sampled topics. Using the LDA model, a document is encoded as a vector by its topic distribution: each component indicates the probability that the corresponding topic was chosen to sample a word in the document. LDA does not account for word order.
In the next section, we will describe the use of text models in an architecture that processes a log to obtain a data- and text-aware prediction model.

Prediction Model Architecture
The goal of predictive monitoring is to estimate a target feature of a running process instance based on historical execution data. In order to do so, predictive monitoring algorithms examine partial traces, which are the events related to a process case at a certain point throughout its execution. Obtaining partial traces for an event log is equivalent to computing the set of all prefixes for the traces in the log. Prefix logs will be the basis for training our predictive model.
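Computing the prefix log described above is straightforward; a minimal sketch, representing traces as Python lists:

```python
def prefixes(trace):
    """All non-empty prefixes hd^k(trace) for 1 <= k <= |trace|."""
    return [trace[:k] for k in range(1, len(trace) + 1)]

def prefix_log(log):
    """The prefix log: all prefixes of all traces in the log."""
    return [p for trace in log for p in prefixes(trace)]

log = [["a", "b", "c"], ["a", "c"]]
print(prefix_log(log))
# [['a'], ['a', 'b'], ['a', 'b', 'c'], ['a'], ['a', 'c']]
```

Note that a log with traces of lengths n_1, ..., n_m yields n_1 + ... + n_m partial traces, which is why shorter prefixes are supported by many more training samples in the evaluation.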
In this paper, we specifically address the challenge of managing additional attributes that are textual in nature. In order to account for textual information, we need to define a construction method for fixed-length vectors that encode activity labels, timestamps, and numerical, categorical, and textual attributes.
Given an event e = (a, t, d_1, . . . , d_m), its activity label a is represented by a vector a using one-hot encoding. Given the set of possible activity labels A, an arbitrary but fixed ordering over A is introduced with a bijective index function index_A : A → {1, . . . , |A|}. Using this function, the activity is encoded as a vector of size |A|, where the component at position index_A(π_A(e)) has value 1 and all other components have value 0. The function 1_A : A → {0, 1}^|A| describes the realization of such a one-hot encoding, a = 1_A(π_A(e)), for the activity label of the event e.
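A minimal sketch of this one-hot encoding, using sorting as the arbitrary but fixed ordering over A:

```python
def make_one_hot(labels):
    """Build the index function and one-hot encoder 1_A for a fixed label set A."""
    index = {a: i for i, a in enumerate(sorted(labels))}  # arbitrary but fixed order
    def one_hot(label):
        vec = [0] * len(index)
        vec[index[label]] = 1  # component at index_A(label) set to 1
        return vec
    return one_hot

one_hot = make_one_hot({"register", "examine", "discharge"})
print(one_hot("examine"))  # [0, 1, 0]
```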
In order to capture time-related correlations, a set of time-based features is utilized to encode the timestamp t of the event. We compute a time vector t = (t_1, t_2, t_3, t_4, t_5, t_6) of min-max normalized time features, where t_1 is the time since the previous event, t_2 is the time since the first event of the case, t_3 is the time since the first event of the log, t_4 is the time since midnight, t_5 is the time since the previous Monday, and t_6 is the time since the first of January. The min-max normalization is obtained through the formula x̂ = (x − min(x)) / (max(x) − min(x)), where min(x) is the lowest and max(x) is the highest value for the attribute x.
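The six raw time features and the min-max normalization can be sketched as follows; computing the features in seconds and guarding the degenerate case max(x) = min(x) are our implementation choices:

```python
from datetime import datetime, timedelta

def minmax(x, lo, hi):
    """Min-max normalization x_hat = (x - min) / (max - min); lo and hi
    come from the historical event log when the attribute is unbounded."""
    return (x - lo) / (hi - lo) if hi > lo else 0.0

def time_features(ts, prev_ts, case_start, log_start):
    """Raw time features t1..t6 in seconds, before min-max normalization."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    monday = midnight - timedelta(days=ts.weekday())
    jan_first = midnight.replace(month=1, day=1)
    return [
        (ts - prev_ts).total_seconds(),     # t1: since previous event
        (ts - case_start).total_seconds(),  # t2: since first event of the case
        (ts - log_start).total_seconds(),   # t3: since first event of the log
        (ts - midnight).total_seconds(),    # t4: since midnight
        (ts - monday).total_seconds(),      # t5: since previous Monday
        (ts - jan_first).total_seconds(),   # t6: since the first of January
    ]

ts = datetime(2012, 3, 7, 12, 0)  # a Wednesday at noon
feats = time_features(ts, datetime(2012, 3, 7, 10, 0),
                      datetime(2012, 3, 6, 9, 0), datetime(2012, 1, 2, 0, 0))
```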
Every additional attribute d_i of e is encoded in a vector d_i as follows. The encoding technique depends on the type of the attribute. Categorical attributes are one-hot encoded similarly to the activity label. Numerical attributes are min-max normalized: if the minimum and maximum are not bounded conceptually, the lowest or highest value of the attribute in the historical event log is used for scaling. Finally, if d_i is a textual attribute, it is encoded in a fixed-length vector with one of the four text models presented in Section 3; the documents in the text corpus for the text model consist of all instances of the textual attribute D_i contained in the historical log. This technique allows us to build a complete fixed-length encoding for the event e = (a, t, d_1, . . . , d_m), which we indicate with the tuple of vectors enc(e) = (a, t, d_1, . . . , d_m).
This encoding procedure allows us to build a training set for the prediction of the target functions presented in Section 3 utilizing an LSTM neural network. Figure 2 illustrates the entire encoding architecture, and the fit/predict pipeline for our final LSTM model. The schematic distinguishes between the offline (fitting) phase, where we train the LSTM with encoded historical event data, and the online (real-time prediction) phase, where we utilize the trained model to estimate the four target features on running process instances. Given an event log L, the structure of the training set is based on the partial traces in its prefix log L' = {hd^k(σ) | σ ∈ L ∧ 1 ≤ k ≤ |σ|}. For each σ = ⟨e_1, e_2, . . . , e_n⟩ ∈ L and 1 ≤ k ≤ n, we build an instance of the LSTM training set. The network input x_1, x_2, . . . , x_k is given by the event encodings x_i = enc(e_i) for 1 ≤ i ≤ k. The targets (y_a, y_t, y_o, y_c) are given by y_a = f_a(σ, k), y_t = f_t(σ, k), y_o = f_o(σ), and y_c = f_c(σ). Figure 3 shows the topology of the network. The training utilizes gradient descent and backpropagation through time (BPTT). The loss for a numerical prediction value ŷ and the true value y is the absolute error AE(ŷ, y) = |ŷ − y|, while the loss for categorical prediction values is computed using the categorical cross-entropy error CE(ŷ, y) = −Σ_i y_i · log ŷ_i, where the sum ranges over the possible classes.
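The two loss functions are simple to state directly; a minimal sketch, where the small epsilon guarding log(0) is our numerical safeguard, not part of the formal definition:

```python
import math

def absolute_error(y_hat, y):
    """AE loss for the numerical targets (next timestamp, cycle time)."""
    return abs(y_hat - y)

def cross_entropy(y_hat, y):
    """Categorical cross-entropy for the one-hot targets (next activity, outcome).
    y is a one-hot vector; y_hat is the predicted probability distribution."""
    eps = 1e-12  # numerical safeguard against log(0)
    return -sum(yi * math.log(max(pi, eps)) for yi, pi in zip(y, y_hat))

print(absolute_error(3.5, 5.0))                      # 1.5
print(cross_entropy([0.7, 0.2, 0.1], [1, 0, 0]))     # -log(0.7), about 0.357
```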

Evaluation
The predictive monitoring approach presented in this paper has been implemented for validation, utilizing a Python-based, fully open-source technological stack. PM4Py [3] is a process mining Python tool developed by Fraunhofer FIT; it is used for event log parsing and for its internal event log representation. The neural network framework Tensorflow [2], originally developed by Google, and its API Keras were utilized to implement the final LSTM model. Furthermore, dedicated libraries provide the natural language processing capabilities required to preprocess and normalize text, as well as to build and train the text models.
The text-aware model is compared to two other process prediction methods. First, the pure LSTM approach based on the ideas of Navarin et al. [9] is considered, which only uses the activity, timestamp, and additional non-textual attributes of each event. This approach can be considered the state of the art in predictive monitoring with respect to prediction accuracy. The second baseline is the process model-based prediction method originally presented by van der Aalst et al. [1]. This approach builds an annotated transition system for a log using a sequence, bag, or set abstraction. Each state of the transition system is annotated with measurements of historical traces that can be used to predict target values for unseen traces. During the prediction phase, running traces are mapped to the corresponding state of the transition system, and the measurements of the state are used to compute a prediction. We adopt the improvement of this method described in [14] to apply it to classification tasks and obtain the next activity and outcome predictions. The first 8 events of a trace are considered for the construction of the state space. Experiments with different horizon lengths (1, 2, 4, 16) mostly led to inferior results, and are thus not reported.
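The annotated transition system baseline can be sketched compactly; the sequence abstraction over a fixed horizon and mean aggregation of remaining times below are a simplified sketch of the method (the original also supports bag and set abstractions and other statistics):

```python
from collections import defaultdict

def fit_transition_system(log, horizon=8):
    """Annotate each state (sequence abstraction over the last `horizon`
    activities) with the remaining times observed in historical traces."""
    annotations = defaultdict(list)
    for trace in log:  # trace: list of (activity, timestamp) pairs
        end = trace[-1][1]
        for k in range(1, len(trace) + 1):
            state = tuple(a for a, _ in trace[max(0, k - horizon):k])
            annotations[state].append(end - trace[k - 1][1])
    return annotations

def predict_remaining(annotations, prefix, horizon=8):
    """Predict remaining time as the mean annotation of the matching state;
    returns None for states never observed in the historical log."""
    state = tuple(a for a, _ in prefix[-horizon:])
    times = annotations.get(state)
    return sum(times) / len(times) if times else None

log = [[("a", 0), ("b", 2), ("c", 5)], [("a", 0), ("b", 3), ("c", 4)]]
ann = fit_transition_system(log)
print(predict_remaining(ann, [("a", 0), ("b", 2)]))  # mean of 3 and 1 -> 2.0
```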
We evaluate the two baseline methods against our approach considering all four text models presented here, with varying vector sizes (50, 100, and 500 for BoW and BoNG; 10, 20, and 100 for Doc2Vec (PV) and LDA). The BoNG model is built with bigrams (n = 2). Of the four target functions presented in Section 3, classification tasks (next activity and outcome) are evaluated with a weighted-average class-wise F_1 score; regression tasks (next timestamp and cycle time) are evaluated on Mean Absolute Error (MAE). The first 2/3 of the chronologically ordered traces are used to fit the prediction model to the historical event data; the remaining 1/3 of the traces are used to measure the prediction performance. The process prediction models are evaluated on two real-world event logs, whose general characteristics are given in Table 2. Additionally, snippets of the datasets are shown in Tables 3 and 4. The first log describes the customer journeys of the Employee Insurance Agency commissioned by the Dutch Ministry of Social Affairs and Employment. The log is aggregated from two anonymized data sets provided in the BPI Challenge 2016, containing click data of customers logged into the official website werk.nl and phone call data from their call center.
The second log is generated from the MIMIC-III (Medical Information Mart for Intensive Care) database and contains hospital admission and discharge events of patients in the Beth Israel Deaconess Medical Center between 2001 and 2012.
The results of the experiments are shown in Table 5. The next activity prediction shows an improvement of 2.83% and 4.09% on the two logs, respectively, showing that text can carry information on the next task in the process. While the impact of our method on next timestamp prediction is negligible in the customer journey log, it lowers the absolute error by approximately 11 hours in the hospital admission log. The improvement shown in the outcome prediction is small but present: 1.52% in the customer journey log and 2.11% in the hospital admission log. Finally, the improvement in cycle time prediction is particularly notable in the hospital admission log, where the error decreases by 27.63 hours. In general, compared to the baseline approaches, the text-aware model can improve the predictions on both event logs with at least one parametrization.
In addition, the prediction performance is evaluated per prefix length for each event log. Figure 4 shows the F_1 score and next timestamp MAE for every prefix trace of length 1 ≤ k ≤ 8 on a selection of prediction tasks. Note that the results on shorter traces are supported by a much larger set of traces due to prefix generation. For text-aware models, only the best encoding size is shown.
On the customer journey log, the performance of all models correlates positively with the available prefix length of the trace. All text-aware prediction models surpass the baseline approaches on very short prefix traces of length 3 or shorter, for next activity and outcome prediction: we hypothesize that the cause for this is a combination of higher availability of textual attributes in earlier events in the traces, and the high number of training samples of short lengths, which allow text models to generalize. The next timestamp and cycle time predictions show no difference between text-aware models and the LSTM baseline, although they systematically outperform transition system-based methods.
The hospital admission log is characterized by the alternation of admission and discharge events. Therefore, the prediction accuracy varies between odd and even prefix lengths. The text-aware prediction models generate slightly better predictions on admission events, since only these contain the diagnosis as a text attribute. Regarding the next timestamp prediction, higher errors after discharge events and lower errors after admission events are observed. This can be explained by the short hospital stays compared to the longer time between two hospitalizations.

Conclusion
The prediction of the future course of business processes is a major challenge in business process mining and process monitoring. When natural-language textual artifacts, such as emails or documents, hold critical information, purely control-flow-oriented approaches are limited in delivering accurate predictions.
To overcome these limitations, we propose a text-aware process predictive monitoring approach. Our model encodes process traces of historical process executions to sequences of meaningful event vectors using the control flow, timestamp, textual, and non-textual data attributes of the events. Given an encoded prefix log of historical process executions, an LSTM neural network is trained to predict the activity and timestamp of the next event, and the outcome and cycle time of a running process instance. The proposed concept of text-aware predictive monitoring has been implemented and evaluated on real-world event data. We show that our approach is able to outperform state-of-the-art methods using insights from textual data.
The intersection between the fields of natural language processing and process mining is a promising avenue of research. Besides validating our approach on more datasets, future research also includes the design of a model able to learn text-aware trace and event embeddings, and the adoption of privacy-preserving analysis techniques able to avoid the disclosure of sensitive information contained in textual attributes.