Developing a Legal Form Classification and Extraction Approach for Company Entity Matching

Benchmark of Rule-Based and Machine Learning Approaches




Record Linkage, Company Entity Matching, Data Integration, Data Quality, Data Preparation


This paper explores record linkage, a step in the data integration process, and focuses on the entity type company. When integrating company data, the company name is a crucial attribute, and it often includes the legal form. The legal form is not represented concisely or consistently across data sources, which causes considerable data quality problems in the subsequent steps of record linkage. To solve these problems, we classify and extract the legal form from the company name attribute. For this purpose, we iteratively developed four approaches and compared them in a benchmark. The best is a hybrid approach that combines a rule set with a supervised machine learning model. The hybrid approach can be applied to company data sets from research or business, improving data quality for subsequent processing steps such as record linkage. Furthermore, the approach can be adapted to solve the same data quality problems in other attributes.
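To make the idea of a hybrid approach concrete, the following is a minimal sketch and not the implementation benchmarked in the paper. It assumes a small, hypothetical rule set (`LEGAL_FORM_RULES`) of regular expressions for a few legal forms, and a hypothetical `fallback` callable that stands in for a trained supervised classifier, which is consulted only when no rule matches. All names, patterns, and the choice of legal forms are illustrative assumptions.

```python
import re
from typing import Callable, Optional, Tuple

# Hypothetical rule set (an assumption for illustration): canonical legal form
# mapped to a pattern covering some spelling variants. More specific forms come
# first so that "GmbH & Co. KG" is not matched as plain "GmbH".
LEGAL_FORM_RULES = {
    "GmbH & Co. KG": re.compile(r"\bgmbh\s*(?:&|und)\s*co\.?\s*kg\b", re.IGNORECASE),
    "GmbH": re.compile(r"\bg\.?m\.?b\.?h\b\.?", re.IGNORECASE),
    "AG": re.compile(r"\bag\b|\baktiengesellschaft\b", re.IGNORECASE),
    "Ltd.": re.compile(r"\b(?:ltd|limited)\b\.?", re.IGNORECASE),
}


def extract_legal_form(
    name: str,
    fallback: Optional[Callable[[str], Optional[str]]] = None,
) -> Tuple[str, Optional[str]]:
    """Return (company name without legal form, canonical legal form or None).

    Rules are tried first. If none matches, the optional fallback (a stand-in
    for a supervised model) classifies the name, which is returned unchanged.
    """
    for canonical, pattern in LEGAL_FORM_RULES.items():
        match = pattern.search(name)
        if match:
            # Cut the matched legal form out of the name and tidy whitespace.
            remainder = name[: match.start()] + name[match.end():]
            remainder = " ".join(remainder.split()).strip(" ,")
            return remainder, canonical
    if fallback is not None:
        return name.strip(), fallback(name)
    return name.strip(), None


if __name__ == "__main__":
    for raw in ["Müller GmbH & Co. KG", "Beispiel G.m.b.H.", "Acme Limited"]:
        print(extract_legal_form(raw))
```

In this sketch the rules handle the frequent, regular spellings, while the fallback covers names the rules miss. How the paper divides work between its rule set and its model is not stated in the abstract, so this ordering is an assumption.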




