Towards a Guideline Affording Overarching Knowledge Building in Data Analysis Projects

Tight and competitive market situations pose a serious challenge to enterprises in the manufacturing industry domain. Competing in the use of data analytics to enhance products and processes requires additional resources to deal with the complexity. At the same time, the possibilities afforded by digitization and data analysis-based approaches make for a valuable asset. In this paper we suggest a guideline to a systematic course of action for the data-based creation of holistic insight. Building an overlaying corpus of knowledge accelerates learning within specific projects as well as across projects by extending the project-specific view towards an integrated approach.


Introduction
Demand and supply for insights derived from all kinds of accessible data sources in enterprises are higher than ever before as the pressure to keep up with global competitors meets the ever-growing possibilities of data acquisition and exploitation. A plethora of methods and tools is available to deal with and make use of these resources: from sensors to algorithms, from Industrial Internet of Things (IIoT) solutions to programming libraries and software. [1] While all business sectors face this situation equally and therefore must deal with similar challenges, the complexity of the task is particularly high in the manufacturing industry domain. [2] [3] This holds true especially for tasks within data-driven enhancement projects (EP) in the manufacturing industry domain which require a high level of innovation and are conducted in a project-based manner like one-of-a-kind production, research and development (R&D), customer-specific machinery and plant engineering or the design of cyber-physical production systems. [4] First and foremost, conducting successful data analysis projects does not only include the activities directly associated with analyzing data but involves the execution of several elaborate steps as well as strategic measures. To systematically align all relevant aspects affecting the analysis outcome in a wider sense will result in distinct quality improvement. [3] In our research we aim at providing the means to support achieving strategic goals by conducting data analysis projects which systematically connect relevant information fragments on all levels of aggregation from all relevant sources. Therefore, our research is driven by the following research question (RQ):

RQ:
How can a reference model be provided for complex tasks in the industrial domain which provides methodological support for the data-driven construction and utilization of an overlaying corpus of knowledge?
To answer this question, we developed an artifact in the form of a reference model to equip the user with a wide range of methodological support for conducting informed data analyses. The goal of the suggested framework is to not only derive insight about the examined topic of an active data mining project but to preserve and build on the findings exceeding project boundaries. The reference model aims to inspire rigorous and holistic investigation, to provide the means for communication, project management and documentation and to build the foundation for future software applications to support this holistic project-exceeding data mining approach thus also paving the way for an analysis and optimization of the activities undertaken within data mining projects themselves.
Following this approach, this paper is structured as follows: In Section 2, we describe our motivation. We then sum up foundations and basic concepts in Section 3. In Section 4, a set of design principles is derived from the key activities of the sensemaking approach as described by [5] and more specifically by [6]. In fulfillment of the defined design principles, a framework is presented in Section 5 to structure necessary methodological measures and to allocate useful activities within five layers of information aggregation. By presenting the reference model we advocate for a systematic course of action aiming at the creation of holistic insight. Finally, we draw a conclusion and give an outlook for further research in Section 6.

Motivation
The major purpose of the presented long-term design science research project is to elaborate methodological support for data-driven knowledge extraction projects in the manufacturing industry domain. Therefore, our main objective is to help artifact users gain a sophisticated understanding of the principles by which to conduct data-driven knowledge extraction projects, to reduce the associated hurdles for manufacturing companies and to create a basis to address and solve them in the future in a repeatable manner. The application of the presented reference model enables domain experts to derive cumulative knowledge, rather than re-inventing technical concepts and methodological procedures under new labels in every new project setting. [7] Specialists dealing with data analysis projects in the industrial domain need to cover the methodological skillset required in data science and, in addition, to have a deep understanding of the domain fundamentals in order to consider relevant causalities and interactions and to purposefully derive and interpret results according to their context. Hence, throughout all industrial sectors, domain experts successfully gain and apply data analytics knowledge on the one hand, while data analysts engage in various domain contexts on the other. Oftentimes both have to team up with each other and with additional professionals like computer scientists and mathematicians to derive the desired outcome. While tremendous progress is underway in the domain-specific training of data scientists, in the proficient cooperation with them and in the successful realization of data analytics projects, the potential for even better outcomes is huge. [8] [9] The main hurdles are the intricate communication between domain experts and data scientists, the scarcity of human resources for data analytics projects and the lack of domain-specific standardized procedures, which gives the execution of data-driven analyses and the use of their results a singular, hardly repeatable quality.
These shortfalls especially hold true where a limited number of experts must realize data analytics projects alongside competing work tasks, as is the case in small and medium-sized enterprises (SMEs), startups and R&D or planning departments. [3] These hurdles were confirmed by a pre-study in the form of an exploratory study with six qualitative expert interviews, which aimed to identify the challenges that occur while setting up a data-driven knowledge extraction project. The interviews were designed as partially standardized interviews using open to semi-open questions as initial starting points for the conversation and took between 70 and 180 minutes. The complete listing of the formulated questions and results will be provided by the authors upon request. The answers showed that practitioners tend to rely on traditional procedures and experience-based knowledge.
Their understanding of Data Mining (DM) mainly focused on the core analysis activities like the application of algorithms and often underestimated the effort and importance of peripheral aspects like the determination of target-aimed questions, data preparation to produce structured evaluable data sets, conclusive feature engineering and context-sensitive model building. The interviewees expressed their wish for more structure and guidance in data analytics projects, while they found existing standard processes too generic to apply to their domain and not sufficiently considerate of real-life problems like data acquisition, data quality and operational data processing.

Foundation
Pursuing a long-term research project in the field of information systems (IS) aiming at the design of an artifact in the form of a reference model we comply with the design science paradigm stated by [10]. We furthermore adopt the three-cycle view of design science research (DSR) presented in [11] to address the relevance, design and rigor of the developed artifact. Additionally we rely on the steps for DSR research recommended by [12] to apply the paradigm to our research as follows: The problem identification and motivation for our research is constituted by the experience from numerous research projects and a pre-study in the form of expert interviews as described in Section 2. We then derived theory-based research goals and objectives by the definition of design principles as described in Section 4 followed by the design and development of the artifact, the outcome of which is presented in Section 5. While applying the findings in practice the derivation of a context-specific model should then be demonstrated and evaluated within future research. In an iterative manner the insights from an initial implementation within an example scenario should be used to further enhance the artifact and undergo subsequent evaluation phases to then be transferred to the community.
When attempting to represent and reduce reality to fulfill a subjective purpose, like the understandable formulation of complex facts [13] for a class of similar problems, a reference model is provided by introducing a model which is of recommendatory and universal character and allows for the derivation of application-specific models. [14] Consequently, reference models are a generic type of model representing the essence of a common-practice or best-practice view on a class of similar problems, intended for re-use and acting as a blueprint for the derivation of specific models. [15] The addressed application field of the presented reference model comprises tasks in the industrial domain which require a high level of innovation and are conducted in a project-based manner. When attempting to support such tasks there are various user roles and artifacts to take account of, notwithstanding that more than one user role can be fulfilled by one individual. These roles and artifacts are depicted in figure 1.

Figure 1. Addressed users and artifacts
As drawing conclusions by the statistical or algorithms-based study of large amounts of data today is widely established throughout all disciplines, numerous attempts have been made to standardize the data mining process especially in the field of computer science and economic analyses. Such procedure models generally consist of generic steps to structure and guide the planning and execution of DM projects. [20] Prominent standard operating models are subsequently named. Knowledge discovery in databases (KDD) is a description of the central building blocks of the overall multi-step procedure for complex real-world analysis tasks aiming at the discovery of knowledge in large amounts of data. [17] [18] Subsequent approaches like SEMMA and CRISP-DM emerged from the basic concept of KDD. The cross-industry standard process for DM (CRISP-DM) comprises the steps business understanding, data understanding, data preparation, modeling, evaluation and deployment, thus adding a more strategic perspective to the KDD core concept [19] [20]. The sample, explore, modify, model, and assess (SEMMA) methodology was developed by the SAS Institute to methodically organize the functions of its statistical and business intelligence software SAS Enterprise Miner, its constituent phases naming the concept in the form of an acronym. The analytics solutions unified method (ASUM) draws on a combination of agile and traditional implementation principles to achieve set solution goals and therefore complements the defined analysis phases by an additional project management stream to support the organizational realization. [21]
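The phase lists of two of the procedure models named above can be written down as plain data, which makes them easy to compare or to swap into a project plan. The following is a minimal sketch; the dictionary layout and the helper `acronym` are illustrative assumptions, while the phase names are taken from the descriptions above.

```python
# Phase names as listed in the text; the data layout itself is an assumption.
PROCEDURE_MODELS = {
    "CRISP-DM": [
        "business understanding",
        "data understanding",
        "data preparation",
        "modeling",
        "evaluation",
        "deployment",
    ],
    "SEMMA": ["sample", "explore", "modify", "model", "assess"],
}


def acronym(phases):
    """Build an acronym from the first letter of each phase name."""
    return "".join(phase[0].upper() for phase in phases)


# The SEMMA phases spell out the name of the methodology.
print(acronym(PROCEDURE_MODELS["SEMMA"]))  # SEMMA
```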

Design Principles
The concept of sensemaking originated in social psychology and was set in an organizational context by [5]. The approach describes how human beings in a social setting derive understanding of their surroundings by combining various information, creating connections and finally adding their own reasoning to it. The concept is described extensively in [22]. [6] sums up relevant literature and derives five key activities found in previous work as listed in table 1 which constitute the making of sense and thereby act as design goals for the developed reference model.
As the developed framework is supposed to not only support the understanding of facts and the creation of insight but also its utilization for the in-project and project-exceeding enhancement of the target-system, one more key activity is needed to complement the sensemaking key activities. By including the creation and utilization of a knowledge base we want to create a linkage to the field of knowledge management and thereby create the concept of knowledge making. By coining the term, we want to emphasize a creative, intuitive and iterative character of the approach, orienting on human behavior and the cognitive and social processes it originates in.
In DSR the concept of design principles (DP) provides the means to specify prescriptive design knowledge in a way that allows for a precise formulation to describe how the mechanisms of a technology or approach help to achieve particular aims. [23] According to [24] design principles should describe which actions are made possible through the use of an artifact and explain the material properties which make that action possible while naming the boundary conditions under which this description holds true. More precisely [24] suggests the formulation of a DP in the following form: "Provide the system with [material property-in terms of form and function] in order for users to [activity of user/group of users-in terms of action], given that [boundary conditions-user group's characteristics or implementation settings]." Following this suggestion, we formulated design principles for the presented reference model based on the derived knowledge making key activities as shown in table 1.
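The three-part form suggested by [24] can be captured as a small record type, which helps to keep formulated principles uniform. The following is a minimal sketch; the class name `DesignPrinciple`, its field names and the example content are hypothetical and are not taken from table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignPrinciple:
    """One design principle in the three-part form quoted above."""

    material_property: str    # form and function the system is given
    user_activity: str        # action this property makes possible
    boundary_conditions: str  # user group or implementation setting

    def render(self) -> str:
        """Return the principle as a sentence following the template."""
        return (
            f"Provide the system with {self.material_property} "
            f"in order for users to {self.user_activity}, "
            f"given that {self.boundary_conditions}."
        )


# Hypothetical example content, for illustration only.
example_dp = DesignPrinciple(
    material_property="a template at each grid point",
    user_activity="document the methods they applied",
    boundary_conditions="the project is conducted by a small team",
)
print(example_dp.render())
```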

Reference Model
We want to motivate a highly strategic and integrated practice in data-driven enhancement projects (EP) in the manufacturing industry domain and to support this mindset by suggesting a framework to guide the efforts. The development of this reference model is driven by the needs identified in industrial practice and numerous research projects and realized by employing well-researched approaches grounded in established theory. We set up a grid-like structure to assign relevant methodologies to the respective analysis project phases and thereby fulfill the design principles formulated in Section 4. We based our approach on three widely established concepts: standard procedure models, the concept of data aggregation and the field of knowledge management. We attempt to provide the means for the effective combination and domain-specific adaptation of these concepts while additionally overcoming their shortcomings as described in Section 1 and further elaborated in [25] and [3].
We especially want to emphasize the importance of considering the various aggregation levels in which information fragments can occur, as described in table 2. In particular, we call attention to the intense interaction of all five levels of aggregation, which implies the necessity to expand awareness to each of them and to their interrelations within each step of action. More specifically, an integrated consideration and operationalization is needed throughout all project phases, as the strong focus on DM core analysis activities was one of the main hurdles found in the pre-study described in Section 2. The reference model supports practitioners in the inclusion of all aspects, from aggregation level 1, which covers the least connected state of raw data as well as the physical system realization and data acquisition, up to level 5, comprising the overarching management of highly connected complex information constructs. Data aggregation is often depicted in a form similar to the traditional knowledge pyramid, although revised and refined approaches can be found superseding this strictly hierarchical view. [26] Within the scope of our research we adopt the view that information fragments can exist in various states of aggregation. These start from very small pieces of data like a single binary number, but also include states of light aggregation as in protocols or logfiles and states of higher aggregation such as data sets, tables, charts or reports, where data is set into context and conveys statements exceeding its alpha-numerical value. We therefore deem it valid to speak of information when referring to aggregated data. Data aggregation states then stretch to strongly aggregated forms, where aggregated chunks of information further connect to complex constructs representing relations comprising formal logic, thus resembling the processing of insight and thought in the human mind.
We therefore argue that the term information is suitable to describe aggregated forms of data and that highly aggregated information equals knowledge in the daily use of language. In table 2 we transfer this understanding to the manufacturing industry domain, introducing an additional level of analog real-life objects which the relevant data relates to and originates in.
Relevant objects within AL 1 can be controllers, motors, GPS trackers and sensors or transport systems, accompanied by the respective digital counterparts in AL 2 like output data of controllers, performance data of motors, GPS data and other sensor data. Furthermore, AL 2 addresses additional descriptions of the target-system, e.g. conceptual models. Within AL 3 a suitable concept must be chosen to gather, process and contain any relevant information fragments to transfer them to higher levels of aggregation and derive and utilize insight. A suitable concept can be an enterprise-specific analysis framework, an individual adaptation of the DM standard processes described in Section 3 or domain-specific adaptations like the "DMME: Data mining methodology for engineering applications" as presented in [3]. Within AL 3 and the central analysis project phase of the chosen concept resides the core activity constituting the success of the EP: proceeding in an intensely iterative manner and closely observing the relation to any other grid point makes highly context-sensitive feature engineering possible. Within AL 4 the found facts and interrelations are implemented by integrating the derived insight within physical instantiations, instantiations of digital shadows or digital twins, simulation models or visualizations.
The knowledge base constituting AL 5 can take many forms, from the knowledge embodied by an individual through classical SQL databases or ontologies to intelligent agents. Ultimately, the successful utilization of the concept will depend on what the respective knowledge base affords. Although AL 5 constitutes the bottleneck of the implementation, the more suitable its chosen form of instantiation is for the occasion, the more intense its usage in practice will be. Highly formalized approaches and machine-readable implementations allow for complex and potent operations but require high effort to set up and maintain. Depending on the application situation, the manageable effort of a lightweight solution can advance implementation success. We suggest drawing on existing solutions, such as those extensively elaborated for the application of ontologies in the manufacturing domain in [27].
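As an illustration of a lightweight instantiation, a knowledge base can start as a single SQL table that records which method was used at which aggregation level and project phase. The following is a minimal sketch using Python's built-in `sqlite3` module; the table name `finding`, its columns, the helper `methods_used` and all example rows are hypothetical and not part of the reference model.

```python
import sqlite3

# In-memory database for illustration; a file path would make it persistent.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE finding (
        id INTEGER PRIMARY KEY,
        project TEXT NOT NULL,
        aggregation_level INTEGER NOT NULL
            CHECK (aggregation_level BETWEEN 1 AND 5),
        phase TEXT NOT NULL,
        method TEXT NOT NULL,
        note TEXT
    )
    """
)

# Hypothetical entries from two projects.
rows = [
    ("project_a", 2, "P3", "sensor logging", "motor performance data"),
    ("project_a", 3, "P4", "feature engineering", "features per machine cycle"),
    ("project_b", 3, "P4", "feature engineering", "re-used cycle features"),
]
conn.executemany(
    "INSERT INTO finding (project, aggregation_level, phase, method, note) "
    "VALUES (?, ?, ?, ?, ?)",
    rows,
)
conn.commit()


def methods_used(connection, level):
    """List (project, method) pairs documented at one aggregation level."""
    cursor = connection.execute(
        "SELECT DISTINCT project, method FROM finding "
        "WHERE aggregation_level = ? ORDER BY project, method",
        (level,),
    )
    return cursor.fetchall()


# Query across project boundaries for aggregation level 3.
print(methods_used(conn, 3))
```

A query across projects like this one is the kind of project-exceeding reuse a more formalized implementation, for example an ontology, would support with richer relations at higher setup cost.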
Two more aspects are vital to exploit the full potential of data analytics in the industrial domain. The first is to take into account the dimorphic character of the target system, which consists of analog and digital components. The second is to focus on the context-sensitive engineering of conclusive features, as this step constitutes the heart of the project. It is complemented by the choice and application of fitting tools and methods, which is only rendered possible by the aforementioned concepts providing the necessary context. [29] As pointed out by [29] and further elaborated by [30], the concepts described in Section 3 share the common essence of a stepwise description of the data mining project phases along with similar core principles of the activities performed during the respective steps. Attempting to capture the essence of the various data mining procedure models, we derived a generalized version of data mining project phases as can be seen in figure 2. Based on the specification of the analysis project goal in phase 1 (P1), a conceptualization phase follows in phase 2 (P2). The data analysis core activities are performed in phase 3 (P3) and 4 (P4). First, data is collected by setting up the necessary physical infrastructure and accumulating all accessible and presumably relevant information fragments, growing and extending the data pool. Then feature engineering, model building and extraction of relations follow, reducing the data build-up to a set of connected information which can then be deployed. Phase 5 (P5) draws on the preceding phases and can and should be conducted in parallel from the start, as it preserves and makes available the methodological and meta-information of the data analysis project and comprises the supervision of its execution during and after the project.
The phases described above provide the reference model with a basic sequence of actions to perform in a data analysis project and can be replaced by any adequate alternative during instantiation, e.g. a standard process or an enterprise-specific procedure. At the same time, the necessity to consider various aggregation levels of available and derived information fragments applies to all project steps. The aggregation level view in combination with the project phases forms a grid as presented in figure 3 to address the methodological repertoire of each combination of layer and phase, allowing for the mapping of relevant methods accompanied by respective meta-information. At each grid point a template is to be provided to document used methods and their domain-specific application as well as to give an initial information impulse comprising a narrow set of well-established methods along with an extensible list of methods and suitable search terms. If available, sub-methodologies and detailed sub-selection options are included by grouping them in a hierarchical manner beneath the respective method, providing a template for each hierarchical dimension. The basic or initial selection can be realized by pre-defining a default method for each methodological category as well as by giving a minimum viable implementation strategy.
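The grid of aggregation levels and project phases, with a template at each grid point, can be sketched as a simple data structure. The following is a minimal sketch under stated assumptions: the class names `MethodTemplate` and `GridPoint`, their fields, the short labels `AL1` to `AL5` and the example methods are hypothetical, and placing feature engineering at the combination of AL 3 and P4 is one possible reading of the description above.

```python
from dataclasses import dataclass, field
from typing import Optional

AGGREGATION_LEVELS = ("AL1", "AL2", "AL3", "AL4", "AL5")
PROJECT_PHASES = ("P1", "P2", "P3", "P4", "P5")


@dataclass
class MethodTemplate:
    """Documents one method; sub-methods nest hierarchically beneath it."""

    name: str
    domain_notes: str = ""
    sub_methods: list["MethodTemplate"] = field(default_factory=list)


@dataclass
class GridPoint:
    """One combination of aggregation level and project phase."""

    level: str
    phase: str
    default_method: Optional[MethodTemplate] = None
    methods: list[MethodTemplate] = field(default_factory=list)
    search_terms: list[str] = field(default_factory=list)


def build_grid():
    """Create an empty grid with one GridPoint per (level, phase) pair."""
    return {
        (level, phase): GridPoint(level, phase)
        for level in AGGREGATION_LEVELS
        for phase in PROJECT_PHASES
    }


grid = build_grid()

# Hypothetical population of a single grid point.
feature_engineering = MethodTemplate(
    name="feature engineering",
    sub_methods=[MethodTemplate(name="feature selection")],
)
point = grid[("AL3", "P4")]
point.default_method = feature_engineering
point.methods.append(feature_engineering)
point.search_terms.append("context-sensitive feature engineering")

print(len(grid))  # 25
```

Replacing `PROJECT_PHASES` with the phases of another procedure model before calling `build_grid` corresponds to the substitution of phases during instantiation described above.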
Within the iterative solution process a token of current knowledge cycles through the defined project phases, undergoing permanent revision and thus updating the knowledge base. The active token represents an assumption about the current state of the targeted artifact, permanently considering the dimorphic character of the target-system. It is the state of the art for nearly any real-life system to be accompanied by a digital counterpart. From our point of view these two sides of reality, the analog components and the digital descriptions and traces mirroring them, form the targeted system and have to be considered continuously to investigate, analyze and enhance this system. For further details on the real-life system we suggest [30] and [31] on the concepts of digital shadow and digital twin. To realize the iterative procedure based on an assumption token it is advisable to draw on existing approaches like the "Conceptual Model of the Learning-Oriented Knowledge Management System" given in [32].
When applying the reference model, a specific model is derived that is tailored to support the targeted EP. Project phases, components included within the aggregation levels and respective methodological suggestions populating the reference model grid are adapted according to their relevance within the given context. To ensure intuitive applicability for practitioners, the reference model and templates should be provided in the form of visual content accompanied by textual explanations, preferably by means of a software application.
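The idea of an assumption token that is revised while passing through the project phases can be illustrated with a short loop. The following is a minimal sketch, not the mechanism of [32]; the class `KnowledgeToken`, the function `run_cycle`, the list used as a knowledge base and the example findings are all hypothetical.

```python
from dataclasses import dataclass, field

PHASES = ("P1", "P2", "P3", "P4", "P5")


@dataclass
class KnowledgeToken:
    """Current assumption about the target system plus its revision trail."""

    assumption: str
    revision: int = 0
    history: list = field(default_factory=list)

    def revise(self, phase, finding):
        """Archive the old assumption and replace it with the new finding."""
        self.history.append((self.revision, phase, self.assumption))
        self.assumption = finding
        self.revision += 1


def run_cycle(token, findings, knowledge_base):
    """Pass the token through all phases once; record every revision."""
    for phase in PHASES:
        finding = findings.get(phase)
        if finding is not None:
            token.revise(phase, finding)
            knowledge_base.append((phase, token.revision, token.assumption))
    return token


# Hypothetical example: two phases of one iteration produce new findings.
knowledge_base = []
token = KnowledgeToken("vibration level relates to tool wear")
findings = {
    "P3": "vibration data is only reliable above a minimum spindle speed",
    "P4": "vibration above a minimum spindle speed relates to tool wear",
}
run_cycle(token, findings, knowledge_base)
print(token.revision, len(knowledge_base))  # 2 2
```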

Discussion and Outlook
In the presented paper we gave an outline of a framework supporting the systematic data-based creation of insight. The suggested reference model aims at providing the means to accelerate learning within an active data analysis project as well as to build and utilize an overlaying corpus of knowledge exceeding project boundaries. This aim can be addressed by drawing on the sensemaking approach as described by [6] to derive knowledge making key activities. To afford the realization of these activities, design principles were formulated. Following these principles, we set up a grid-like structure to assign relevant methodologies to the respective analysis project phases while considering the possible aggregation levels in which information fragments can occur. The presented reference model offers a guideline for communication, handling and documentation of technological and methodological information, thus providing the means for the construction and utilization of an overarching knowledge base.
First application experience in the support of research projects showed the value of the reference model to promote a more integrated method of operation, but also made obvious how providing the means for intuitive applicability is crucial for the successful implementation of the approach. [4] [36] Future work will be devoted to the demonstration, evaluation and revision of the concept in practice. Additionally, a thorough analysis of existing and common methodological elements will be conducted by the analysis of research publications within leading journals and by assessment of accessible information on their application in practice to develop an appropriate classification and identify any additional elements that should be included. Moreover, having provided the means to document the usage of methods and their specification as well as having examined their classification allows for the construction of a formalized body of knowledge addressing the creation of knowledge itself. Future work will comprise the development of a taxonomy of methodological principles at hand to then be conveyed to an ontology defining logical relations, rules and principles allowing for decision support by typecasting similar EPs and deriving suitable solution approaches. While this paper focused on the motivation and the theoretical grounding of the concept, some consideration should also be given to its compliance with existing standards and tools to accelerate interoperability. The integration with standardized approaches like the Reference Architecture Model Industrie 4.0 (RAMI4.0) or with data management aspects like the data lifecycle approach can create synergies and add a helpful dimension to support the organizational implementation of the suggested method within enterprises. [37]