A Scoping Review of the Digital Transformation Literature using Scientometric Analysis

Digital transformation is a rapidly expanding research field dealing with the increased impact of digital technologies on both business and society. Due to the large number of papers and the semantic ambiguity surrounding the terminology, covering such a broad topic is difficult. To help researchers gain a better understanding of the knowledge structure of the research field, we conduct a scoping review using scientometrics. We searched for publications dealing with digital transformation on both Scopus and Web of Science. We downloaded their bibliometric data and thoroughly merged and cleaned it using lemmatization and stemming. This dataset was analyzed using VOSviewer to create co-author networks and co-word occurrence graphs of the titles, abstracts, and keywords. We also visualized the growth of the research field and retrieved the top conferences and journals based on the number of papers and the number of citations. K-means clustering was performed on the abstracts and keywords to find similar research focuses. These findings highlight the broad scope of the research field, the ambiguity of the terminology, the lack of collaboration, and the absence of research into the impact of digital transformation on society. Moving forward, more research needs to be done to establish the boundaries of digital transformation and to investigate the importance of society in this phenomenon.


Introduction
The world is going through a rapid digital evolution. The increased impact digital technologies have on both business and society has been frequently referred to as digital transformation (DT) in both information systems research and the professional world. DT is mostly defined in a business scope, such as 'the changes in ways of working, roles, and business offering caused by the adoption of digital technologies in an organization or the operating environment of the organization' [1]. In a broader sense, DT can be understood as the changes in all aspects of human life due to digital technology [2], or as 'the continuously increasing interaction between digital technologies, business, and society' [3, p.11]. The term dates back to the year 2000 [4], but only since 2015 has it truly gained traction.
DT research has quickly gained popularity over the past few years. Due to the broad impact DT has on all aspects of society and industry, there is a large number of research topics related to DT. In addition, numerous researchers are linking their work with DT even though the connection is not always clear. Several authors have pointed out that it is not clear what is included in DT and what is not [5]-[7]. Furthermore, there is a lack of consistent theoretical frameworks that can reconcile the literature [8]. This creates a situation in which it is hard to keep track of the research and its boundaries.
Given this outlook, an important scientific activity is to look back and analyze what has been researched so far. A series of literature reviews have already been conducted [6], [9]-[12]. Another method to examine the literature is scientometric analysis [13], which deals with analyzing the bibliometric metadata surrounding scientific publications, e.g. the keywords, publication year, funding, and authors. Emphasis is placed on investigating the advances and structure of the research field by using data science and visualization techniques. Scientometrics is considered a complement to traditional literature reviews [13] and has several advantages. The results can be considered more objective because they are based on the analysis of bibliometric data rather than on the qualitative interpretation of the content of the papers [13]. This method also scales to a large number of papers without slowing down the process. Finally, the visualizations of the literature are easy to understand and can give new insights that are hard to grasp from literature reviews.
Scientometrics has been used in similar information systems topics such as industry 4.0 [14], digital innovation [15], and digital business models [16]. In DT, which can be considered an overarching concept, only a handful of more specific studies have been done. Reis et al. [17] performed a keyword analysis and some quantitative analysis, Schneider and Kokshagina [18] structured the literature based on technologies and their impact, and Hausberg et al. [19] investigated the co-citation graphs and research streams.
The DT research field is especially suited to a scientometric study due to its rapid expansion and size, its broad scope and impact, and the semantic ambiguity surrounding the terminology. Network graphs that can display the entire research field at a glance can help to understand the extent, range, and nature of the phenomenon. For these reasons, we conduct a scoping review using scientometrics [13], [20]. Scoping reviews are 'concerned with contextualizing knowledge in terms of identifying the current state of understanding' [30, p.10].
We contribute to the scientometric research by describing a detailed methodology with particular attention to data merging, cleaning, and manipulation using state-of-the-art natural language processing (NLP) algorithms such as lemmatization and stemming. This methodology can guide future studies on scientometrics.
We contribute to the DT research by highlighting the breadth of the DT literature. We describe the research growth, the most influential outlets, the research hubs, and the structure of the research field. The latter is done using co-occurrence graphs of the titles, abstracts, and keywords. These graphs can help researchers gain a better understanding of the research field without requiring much effort or expertise.
We analyze and debate how these results add to the discussion of DT and provide several discussion points and a research agenda. The discussion points can help the scientific community to move forward in the DT research field.
The paper proceeds as follows: the next section discusses the applied methodology in detail. In section 3, we present our scientometric results, followed by the discussion and the research agenda in section 4. Then, the limitations of this study are discussed in section 5. We end the study with a conclusion in section 6 and a link to access the figures used in this paper online in higher quality in section 7.

Methodology
The research aim of this paper is to conduct a scoping review using scientometrics of the DT literature. To do so, we based our methodology on the recommended workflow for mapping research using bibliometric tools proposed by Zupic and Čater [13] while also being guided by the methodology on how to conduct a scoping review of Arksey and O'Malley [20].
We searched both Scopus and Web of Science (WoS), due to their wide and multidisciplinary coverage, for English conference and journal papers published between the years 2000 and 2020. Several queries were evaluated to find a query that retrieved as many relevant articles as possible without including irrelevant articles. We found that searching for papers with DT as a keyword is a good strategy. This way, the obtained dataset has an extremely low false-positive rate (a false positive being a retrieved article that is not related to DT). However, there are several issues with this search strategy. First, not all journals include keywords. Secondly, there is a higher chance of data quality errors in the keywords indexed by the databases compared to the titles. Thirdly, some papers use different synonyms as keywords to describe DT. Including these synonyms in the search query quickly results in massive datasets. Hence, we decided to compromise by also including papers with DT in the title. This results in a slightly higher false-positive rate, requiring more manual checking, but resolves some of the issues with only searching for keywords while keeping the dataset relatively precise and manageable. The full queries are listed below:
Scopus: AUTHKEY ("digital transformation") OR TITLE ("digital transformation") AND PUBYEAR > 1999 AND (LIMIT-TO (DOCTYPE , "cp") OR LIMIT-TO (DOCTYPE , "ar")) AND (LIMIT-TO (LANGUAGE , "English" ))
WoS: AK="digital transformation" OR TI="digital transformation" AND LANGUAGES: (English) Refined by: DOCUMENT TYPES: ( ARTICLE OR PROCEEDINGS PAPER ) AND TIMESPAN=2000-2020
The queries were executed in December 2020 and retrieved 1985 articles in Scopus (1113 conference papers, 872 journal papers) and 1158 articles in WoS (527 conference papers, 638 journal papers). We downloaded all bibliometric data of these articles in CSV files. These include the authors, document type, citation count, references (only for Scopus), the abstract, title, keywords, and the source title. We then removed all duplicates by scanning for the same titles. In total, 716 duplicates were removed. This means that the overlap of WoS on Scopus is about 58%, or in other words, the Scopus dataset got extended by 22%. In total, this gives us a dataset of 2427 papers with 10 features or variables.
Next, we performed data preprocessing in five steps. In the first step, we merged the two datasets from Scopus and WoS using a custom-built Python script. While this step is essential to get wide coverage of the literature, it is often overlooked in scientometric studies, e.g. [16], [17], [22]. Merging these datasets is not straightforward because the formatting in each one is different. Hence, the script standardized all formatting differences, for example by fixing the punctuation marks between authors' names and initials and by merging the column names.
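The merging and duplicate-removal steps can be illustrated with a minimal sketch. It is not the authors' script: the column names in SCOPUS_COLUMNS and WOS_COLUMNS, the assumed author formats, and the helper names are hypothetical and only show the idea of mapping both exports to one schema and dropping records whose normalized titles collide.

```python
import re

# Hypothetical column mappings; the real export headers may differ.
SCOPUS_COLUMNS = {"Authors": "authors", "Title": "title", "Source title": "source"}
WOS_COLUMNS = {"AU": "authors", "TI": "title", "SO": "source"}


def standardize(record, column_map):
    """Rename columns to a shared schema and normalize author separators."""
    row = {new: record.get(old, "").strip() for old, new in column_map.items()}
    # Assumed formats: "Smith J., Doe A." in one export, "Smith, J; Doe, A" in the other.
    authors = re.split(r";|\.,", row["authors"])
    row["authors"] = "; ".join(
        re.sub(r"[.,]", "", a).strip() for a in authors if a.strip()
    )
    return row


def title_key(title):
    """Case- and punctuation-insensitive key used to detect duplicate titles."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def merge_datasets(scopus_rows, wos_rows):
    """Merge both sources, keeping the first record seen for each title."""
    merged, seen = [], set()
    sources = ((scopus_rows, SCOPUS_COLUMNS), (wos_rows, WOS_COLUMNS))
    for rows, column_map in sources:
        for record in rows:
            row = standardize(record, column_map)
            key = title_key(row["title"])
            if key not in seen:
                seen.add(key)
                merged.append(row)
    return merged
```

Processing the Scopus rows first means that, for a duplicated title, the Scopus record (which also carries the references) is the one that survives.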
Secondly, the papers were inspected by two authors to identify irrelevant papers or data errors. The inclusion criterion to assess relevance was whether the abstract is coherent with changes in one or more aspects of a business, society, or industry due to digital technologies. For example, several papers that discussed computer algorithms to transform data types were excluded. Several papers that were wrongly classified as conference or journal papers were also removed. In total, we removed 32 irrelevant articles and 28 data errors, such as wrongly imported papers or non-English papers, bringing the total to 2367 papers.
In the third step, we performed several manual data cleaning manipulations for the title, abstract, and keywords variables. Missing values and spelling errors that occurred in several papers, such as wrongly exported characters, were fixed. We merged synonyms, such as the fourth industrial revolution and industry 4.0. Additionally, several terms were transformed into their acronyms. For example, 'chief digital officer(s)' was changed into 'CDO'. Other maintained acronyms include SMEs (small and medium enterprises), ICT (information and communication technology), ML (machine learning), and (I)IoT ((Industrial) Internet of Things). Finally, acronym variants were merged, such as SMAC-IT and SMACIT.
In the fourth step, we continued cleaning the words in the title, abstract, and keywords by building a Python-based text processor that utilizes the Natural Language Toolkit (NLTK) package. The processor scans the entire text corpus and creates a dictionary of all acronyms, such as IT, IS, and AI. Next, all words except for the acronyms are changed to lower case. Then, all English stop words and non-alphabetical words that do not occur in the constructed dictionary are removed. We then singularized all words and changed British English into American English using the US2GB dictionary, which includes 1,730 British and American words. We extended the dictionary with several DT-specific terms that were not yet included, such as digitalization, digitizing, and servitization.
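A rough sketch of this fourth step is given below, under clearly stated assumptions: STOP_WORDS and BRITISH_TO_AMERICAN are tiny illustrative stand-ins for the NLTK stop word list and the 1,730-entry dictionary, and singularize is a deliberately naive placeholder rather than the routine used in the study.

```python
import re

# Tiny illustrative stand-ins for the real stop word list and spelling dictionary.
STOP_WORDS = {"the", "of", "and", "in", "a", "an", "for", "to", "is"}
BRITISH_TO_AMERICAN = {"organisation": "organization", "digitalisation": "digitalization"}


def find_acronyms(texts):
    """Collect all-uppercase tokens (e.g. IT, AI) so their case can be preserved."""
    acronyms = set()
    for text in texts:
        acronyms.update(re.findall(r"\b[A-Z]{2,}\b", text))
    return acronyms


def singularize(word):
    """Very rough plural stripping; a real pipeline would use a proper inflector."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith(("ss", "us", "is")) and len(word) > 3:
        return word[:-1]
    return word


def clean(text, acronyms):
    """Lowercase, filter, singularize, and Americanize every non-acronym token."""
    tokens = []
    for raw in text.split():
        token = raw.strip(".,;:()'\"")
        if token in acronyms:
            tokens.append(token)  # acronyms are kept untouched
            continue
        token = token.lower()
        if token in STOP_WORDS or not token.isalpha():
            continue  # drop stop words and non-alphabetical tokens
        token = singularize(token)
        tokens.append(BRITISH_TO_AMERICAN.get(token, token))
    return tokens
```

For instance, calling clean on "The organisations use AI and IT in 2020 strategies." with the acronyms found in that sentence keeps 'AI' and 'IT' in upper case, drops 'The', 'and', 'in', and '2020', and normalizes the remaining words.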
In the fifth step, the Python processor changed each word into the most popular lemma of its stem. A lemma is a canonical form of the processed word. For example, 'transforms', 'transforming', and 'transformed' are all forms of the lemma 'transform'. To do this, the script runs through the corpus several times. In the first iteration, a dictionary was created of all the words, their lemmas, and their total occurrence count. For the lemmas, we used the WordNet lemmatizer. Next, we generated a second dictionary with the stem of each word and its lemma, using the Porter stemmer. The stem is the root of the word, e.g. 'connect' is the stem of 'connects', 'connecting', and 'connected' but also of 'connection'. The lemma in this dictionary is the most popular lemma of that stem in the first dictionary and represents the entire stem group. For example, if the first dictionary has two lemmas of the stem 'strateg', namely 'strategy' and 'strategic', the lemma with the highest occurrence will be chosen as the lemma of the stem 'strateg' for the entire corpus. In the final iteration, the script uses the created dictionary of stems and their lemmas to substitute the words with their respective lemma based on their stem value. Several exceptions were made to prevent cases where the lemma would reduce the meaning of the original word. For example, digitalization was added as an exception so that it cannot be changed into its lemma digital. Transforming the corpus into lemmas based on stem value ensures that every word is an actual word (which is not always true for stems). Moreover, this method reconciles different spelling styles for the same word. Lastly, this technique reduces the number of lemmas in the text while staying close to the original text, e.g. the number of unique words in the titles was reduced from 2545 to 2293.
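The 'most popular lemma per stem' substitution can be sketched as follows. This is an illustration rather than the original script: the function names are hypothetical, and the lemmatizer and stemmer are passed in as plain callables so that the sketch runs without any language resources. In a setting like the one described, NLTK's WordNetLemmatizer().lemmatize and PorterStemmer().stem could be supplied for these two arguments.

```python
from collections import Counter, defaultdict


def build_stem_to_lemma(tokens, lemmatize, stem):
    """Map each stem to the most frequent lemma observed within its stem group."""
    by_stem = defaultdict(Counter)
    for token in tokens:
        # Count how often each lemma occurs among the words sharing this stem.
        by_stem[stem(token)][lemmatize(token)] += 1
    return {s: counts.most_common(1)[0][0] for s, counts in by_stem.items()}


def normalize(tokens, lemmatize, stem, exceptions=frozenset()):
    """Replace every token by the representative lemma of its stem group."""
    mapping = build_stem_to_lemma(tokens, lemmatize, stem)
    return [t if t in exceptions else mapping[stem(t)] for t in tokens]
```

Words listed in exceptions, such as 'digitalization', are returned unchanged, which mirrors the safeguard described above for lemmas that would lose meaning.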
For the analysis, we used Python for data exploration, cleaning, visualization, and clustering. VOSviewer [23] was used to create co-author and co-text network graphs. The clustering was done in three steps. First, we vectorized the data using tf-idf. We then applied principal component analysis to reduce the feature size to the top features to filter out noise. Lastly, we clustered the papers based on abstract and keywords using k-means. For the description of the topics, latent Dirichlet allocation was used. The visualization was done with the Python package Bokeh.

Results
The research field is quickly expanding. In the past five years, the number of publications with DT as a keyword or in the title doubled annually, as shown in Figure 1. Two major factors could explain this growth. First, there is an increased interest in DT from both business and academia. In business, digitally transforming is more crucial than ever. Since the year 2000, digital is 'the main reason just over half of the companies on the Fortune 500 have disappeared' [24]. This translates into more research and interest from the academic world. Secondly, similar terminologies used to describe this phenomenon in the past, such as IT-enabled transformation, digitization, digitalization, or business transformation, have started to consolidate into DT. Therefore, it is likely that this evolution will continue in the coming years. While the growth seems to be slowing down in 2020, one must consider the publication lag and the COVID-19 pandemic as potential factors.
When looking at the publication details, we see that there are many outlets for research in DT. Table 1 gives an overview of the most influential journals and conferences in the literature by the number of publications, citations, and average citations. There are several points worth noticing. First, conferences seem highly important both by the number of papers and citations. They have a higher number of papers but a lower average citation count than journals. Second, the most influential conferences are focused on IS. Research in IS deals with the impact of IT in use by individuals and organizations [25], which fits closely with DT. These conferences usually have a track dedicated to digital transformation and business models (cf. ECIS 2020). When looking at the journals, the publication count is generally lower compared to conferences and compared to other fields. The journals themselves have a wide scope; there is no specific focus on DT itself. This can explain why there are many outlets with a small number of published DT papers. In addition, journals are likely to suffer from publication lag, i.e. the time between submitting and publishing [26]. The average citation count is generally high due to several well-received papers in the journals. As the research field matures, we expect the journal papers to rise further in importance and new journals that focus on DT to emerge.
Figure 2 displays the evolution of the keywords over time to reveal research shifts and focuses. The years 2000 to 2015 are omitted due to the low number of papers. The keyword DT is not included in this graph because its evolution is largely captured by Figure 1. Emerging keywords were colorized for visual clarity. Several things are worth mentioning. The most popular keyword is digitalization. The second most common keyword is industry 4.0. Although its meaning is grounded in manufacturing, industry 4.0 can be considered as the DT of the manufacturing industry [14]. Upcoming keywords highlight the interest in and role of big data and artificial intelligence (AI). Another frequent keyword used in combination with DT is the digital economy, which is the level of development of a social production system when digital technologies are implemented systematically [27]. In total, the ten most popular
In the co-word analysis, words are represented as nodes that are linked when they co-occur in one paper's title, keywords, or abstract. The number of times that happens within one paper is inconsequential because binary counting is used. The size of the nodes and links corresponds to the number of different papers in which they appear. Clusters, determined by relatedness, are tinted in the same color. We performed a co-word analysis on the title, the abstract, and the keywords using VOSviewer. For each variable, the minimum number of occurrences needed for a term to be included was adapted so that each figure contained as much information as possible without being overloaded. In the title and keywords, fewer frequent terms were found than in the abstract. Hence, a minimum of 10 occurrences per term was chosen for the title and keywords and 60 for the abstract. This means that all terms that appear at least that many times are included in the graph, with larger nodes indicating terms that appear more often.
The title co-word analysis is shown in Figure 3. Analyzing the titles can be useful to detect the typical paper focuses or research areas. The network clearly shows that DT research is mostly focused on organizations or industries, particularly on how organizations use and manage technology and innovate, and on how industries are transformed. Several typical title structures can be identified in clockwise order: organization and technology research, supply chain research, literature reviews, health care research, public services, management and strategy, education, enterprise case studies, industry case studies, production and manufacturing research, and finally impact on the economy or sector level.
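The binary counting behind the co-word graphs can be illustrated with a short sketch. It only reproduces the counting logic (one count per paper, plus a minimum-occurrence threshold) and does not replicate VOSviewer's own normalization, layout, or clustering. The function name and the input format, one token list per paper, are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def co_word_network(documents, min_occurrences):
    """Return (node_sizes, link_weights) using binary counting per document."""
    occurrences = Counter()
    for tokens in documents:
        occurrences.update(set(tokens))  # each term counts once per paper
    kept = {t for t, n in occurrences.items() if n >= min_occurrences}

    links = Counter()
    for tokens in documents:
        present = sorted(set(tokens) & kept)
        links.update(combinations(present, 2))  # one link per co-occurring pair
    nodes = {t: occurrences[t] for t in kept}
    return nodes, links
```

Because every paper is reduced to a set of terms before counting, a term repeated many times in one abstract contributes exactly as much as a term mentioned once.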
Similarly, several clusters can be identified in the abstract co-word analysis as shown in Figure 4. Generic words like 'study' or 'paper' were excluded. The abstracts tell us more about the content of the papers than the title. Due to the higher word count in the abstract, and thus word overlap, the research hubs are less distinct as shown by the large clusters. If we zoom in on the clusters, we can identify several research areas. In the green cluster, we identify research about manufacturing, production, and supply chains;
Lastly, the keyword co-occurrence network is shown in Figure 5. The keywords have the benefit of summarizing the entire paper in several words. This graph can also be used to understand domain closeness and relatedness. From the figure, it can be seen that DT has a broad scope with many distinct aspects. The graph shows that digitalization and industry 4.0 are often used in combination with DT, highlighting the overlapping terminologies. Other highly connected keywords include the digital economy, innovation, business model, and technology. Several digital technologies are frequently mentioned, such as big data, AI, cloud computing, blockchain, and (I)IoT. Once again, several research hubs can be identified. In clockwise order, starting from the top, these are: leadership and CIOs; business model innovation (such as in transport and the sharing economy); e-government or digital government and public services; digital ecosystems and platforms; strategy and change management; organizational culture; business agility; cyber-physical systems; transformation in education, production, manufacturing, and SMEs; healthcare; e-learning and higher education; IoT and AI; enterprise architecture (EA), enterprise models, and requirements engineering.
Using VOSviewer, a co-authorship network graph was constructed. This visualization displays the collaborations in the field by showing researchers as nodes and their collaborations (co-authored papers) as links between the nodes. The largest network of co-authors is shown in Figure 6. The different clusters correspond to several universities. The yellow, green, and brown clusters contain researchers from the Danube University Krems (Austria), the purple and cyan clusters contain researchers from the RWTH Aachen University (Germany), the orange cluster contains researchers from the Universidad Nacional del Sur (Argentina), and the blue cluster contains researchers from the Warwick Business School (United Kingdom). These graphs show the research collaboration at a glance.
Surprisingly, the largest co-authorship network is limited to 92 researchers. Furthermore, when we specify a minimum of two papers for an author to be included in the analysis, the network shrinks further to 32 authors. This is a rather unexpected outcome, as collaborations are generally higher in similar fields. This means that most DT research is located in smaller internal research groups. These research groups are plentiful. Our analysis identified more than 100 research teams, many of which have made important contributions. To name a few, the research group with the most publications consists of A. Zimmermann, M. Möhring, D. Jugel, and R. Schmidt of Reutlingen University. Another important group with a high number of citations is from the Ludwig-Maximilians University of Munich, with the researchers T. Hess, A. Horlacher, and S. Chanias.
Figure 6. The largest co-authorship network.
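To make the notion of the 'largest co-authorship network' concrete, the sketch below finds the largest connected group of authors in a co-authorship graph, optionally restricted to authors with a minimum number of papers. The study itself used VOSviewer for this. The function name and the input format, one list of author names per paper, are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def largest_coauthor_network(papers, min_papers=1):
    """Return the largest connected set of authors who meet the paper threshold."""
    paper_counts = Counter(a for authors in papers for a in set(authors))
    eligible = {a for a, n in paper_counts.items() if n >= min_papers}

    # Link every pair of eligible authors who share at least one paper.
    neighbors = defaultdict(set)
    for authors in papers:
        for a, b in combinations(sorted(set(authors) & eligible), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)

    best, seen = set(), set()
    for start in eligible:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first traversal of one connected component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        if len(component) > len(best):
            best = component
    return best
```

Raising min_papers removes occasional co-authors before the components are computed, which is why the largest network can shrink sharply, as observed above.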
One possible question to ask is whether collaboration is low due to different research areas. While the co-word networks above indicate that this is not the case, given the several research hubs in which researchers can collaborate, we further investigate this by applying a k-means clustering algorithm to the papers based on the abstract and keywords. This method groups papers together based on the semantic similarity of their keywords and abstracts. The clustering outcome indicates that the papers retrieved in this study have several common themes in which authors can collaborate. The algorithm produced 10 clusters. We used topic modeling to find the most appropriate keywords for each cluster. In short, the clusters are about:
c1: organization, digital, people, technology, research, human, resource, new, online, communication, analysis, process;
c2: development, digital, technology, russian, social, consumer, energy, common, regional, information, requirement;
c3: technology, industry, company, model, energy, production, manufacture, digitalization, service, development;
c4: public, service, digital, technology, egovernment, change;
c5: architecture, digital, service, framework, business, design, sustainable, compute, technology, institution, achieve;
c6: digital, construction, management, process, implementation, analysis, study, paper, pilot, infrastructure, organization;
c7: business, digital, model, customer, organization, service, management, bank, process;
c8: learn, digital, university, technology, high, student, engineer, future, virtual, derive, image, elearning, traditional;
c9: smart, technology, digital, service, energy, urban, development, management, build, economy, strategy, digitalization, regional;
c10: innovation, firm, SMEs, strategy, digital, capability, research, dynamic, transformation.
Together, these findings suggest that collaboration in the DT research field can be increased.
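The clustering of papers can be illustrated with a deliberately small, dependency-free sketch of tf-idf vectorization followed by k-means. It makes several simplifying assumptions: the principal component analysis and the latent Dirichlet allocation steps mentioned in the methodology are omitted, the idf weighting shown is only one common variant, and the deterministic farthest-point initialization is chosen for reproducibility rather than taken from the study.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one L2-normalized tf-idf vector (as a list) per tokenized document."""
    vocab = sorted({t for doc in documents for t in doc})
    n = len(documents)
    df = Counter(t for doc in documents for t in set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in vocab}  # one common idf variant
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Lloyd's algorithm with a deterministic farthest-point initialization."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Pick the point farthest from all centroids chosen so far.
        centroids.append(
            max(vectors, key=lambda v: min(squared_distance(v, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        # Assignment step: each paper joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(v, centroids[j]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

On a real corpus, a dimensionality reduction step such as the PCA described in the methodology would normally be applied between the two functions to suppress noise from rare terms.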

Discussion and research agenda
The results of this paper further advance our understanding of DT and provide more insight into the current research. We provide several discussion points, based on further reading and the results, for IS researchers who need to play an active role in the research field moving forward. First, multiple authors mention that DT not only deals with changes in the business but also with changes in society [1], [28], [29], people [2], [17], and societal values [8]. However, these aspects are not visible in our results. Society was barely mentioned in the titles in Figure 3, and apart from culture, customer, and collaboration, the abstracts in Figure 4 contain limited links to societal changes. Looking at the keywords in Figure 5, we find collaboration, customer experience, and adoption but no other keywords related to society. If it is true that DT impacts society, people, and values, and we believe it does, then more research needs to be conducted on the societal aspect of DT.
Second, the combination of findings and readings highlights the potential issue of ill-defined terminology. Figure 5 shows that many papers also use digitalization, digitization, or industry 4.0 in combination with DT. In detail, our dataset reveals that 26% of papers include at least one of these 'substitutes'. This is peculiar because their meaning is considerably different. Digitization is about changing analog information into digital data or information [9], whereas digitalization is about the increased use of digital technologies while in essence still doing the same things [1], [30]. Industry 4.0 is a term for the fourth industrial revolution in manufacturing, including smart factories, IoT, robotics, and predictive maintenance [14]. The meaning of DT is still disputed, although it can be agreed upon that DT sketches the bigger picture in which businesses and people are completely changing their ways of working by smartly deploying digital technologies [3].
In addition, this ambiguity is problematic because it creates a situation where the overuse and misuse of DT have weakened its potency [31]. While it is true that DT is a broad concept that contains the changes that technology brings forward in business and society, it does not necessarily follow that DT should be used for all changes that can be classified under this definition. This is a prominent issue for future research. The breadth of the research field shown in Figures 4 and 5 could be too large because of the wrongful inclusion of the term DT. On the other hand, it seems like many definitions of DT are too narrow compared to our results. Given this debate and our first discussion point, we believe that DT is in a unique position and has the academic attention to bring together research in business, technology, and society to study the impact of the increased use of digital technologies. Moving forward, we suggest that researchers and practitioners need continued efforts to keep the DT literature relevant and framed correctly.
Third, the co-authorship network in Figure 6 showed a lack of collaboration in the DT research. This is a rather unexpected outcome given the size of the research field and the number of authors. In comparison with the distinct clusters found in Figure 7 and compared to other fields, the level of collaboration is low. For example, when we perform the same query on Scopus but with Business Process Model and Notation (BPMN) as a keyword instead of DT, we obtain 1,278 papers from which a co-authorship network of 188 authors can be created that clearly shows distinct research hubs. The challenge is now to promote DT research collaboration. This is important for sharing and merging specialized knowledge and expertise, which is the engine behind scientific progress.
Future work could investigate the contribution of the different disciplines and the methodologies used. In addition, the boundaries of DT research with other research fields require more investigation. Acquiring a deeper understanding of what knowledge is missing from this broad research field is another fruitful area for future studies. Furthermore, more research is needed to reconcile the various aspects of DT into a coherent theoretical frame and to promote this construct for framing future DT research. A commonly accepted framework for DT can help to enhance collaboration between researchers by connecting similar research through well-agreed-upon terms. The precise aspects of DT, their scope, and their meaning need to be investigated and demarcated more clearly to serve as a connecting means for both practitioners and researchers. Doing this can be beneficial for increasing the specialization of outlets and researchers, which would be a welcome addition.

Limitations
The scientometric analysis performed in this paper is based on the dataset acquired from Scopus and WoS. Several biases exist related to the data extraction, such as the outlets included in these databases, the data quality, and the fact that different queries could result in different findings. The insights from this paper were mainly based on the assumption that the title, abstract, and keywords provide a correct impression of the paper. Investigating the full text of each paper would provide additional insights.

Conclusion
This research aimed to provide a scoping review of the DT literature using scientometric analysis. We described a detailed methodology on how to thoroughly prepare a dataset for scientometric studies. The results give a general overview of the research field, including the evolution of papers being published each year, the most influential outlets, and the evolution of keywords. Additionally, we created co-word occurrence graphs of the titles, abstracts, and keywords. The data suggest that DT is not well defined and that there exist many different research hubs. They also suggest that the DT research field will continue to expand, which makes fundamental, theoretical work increasingly important. Further work needs to be done on reconciling the literature and providing strict terminology.