Mapping of ImageNet and Wikidata for Knowledge Graphs Enabled Computer Vision

Knowledge graphs are used as a source of prior knowledge in numerous computer vision tasks. However, such an approach requires a mapping between ground truth data labels and the target knowledge graph. We linked the ILSVRC 2012 dataset (often simply referred to as ImageNet) labels to Wikidata entities. This enables using rich knowledge graph structure and contextual information for several computer vision tasks traditionally benchmarked with ImageNet and its variations. For instance, in few-shot learning classification scenarios with neural networks, this mapping can be leveraged for weight initialisation, which can improve the final performance metrics. We mapped all 1000 ImageNet labels: 461 were already directly linked with the exact match property (P2888), 467 have exact match candidates, and 72 cannot be matched directly. For these 72 labels, we discuss different problem categories stemming from the inability to find an exact match. Semantically close non-exact match candidates are presented as well. The mapping is publicly available at https://github.com/DominikFilipiak/imagenet-to-wikidata-mapping.


Introduction
Thanks to deep learning and convolutional neural networks, the field of computer vision has experienced rapid development in recent years. ImageNet (ILSVRC 2012) is one of the most popular datasets for training and benchmarking classification models in computer vision. Nowadays, an intense effort can be observed in the domain of few-shot [17] or zero-shot [16] learning, which copes with machine learning tasks for which training data is very scarce or even unavailable. More formally, N-way K-shot learning considers a setting in which there are N categories with K samples to learn from (typically K ≤ 20 in few-shot learning). This is substantially harder than the standard setting, as deep learning models usually rely on a large number of samples. One approach to few-shot learning relies on prior knowledge, such as the class label, which can be leveraged to improve task performance. For instance, Chen et al. [4] presented the Knowledge Graph Transfer Network, which uses an adjacency matrix built from knowledge graph correlations in order to create class prototypes for few-shot classification. More generally, knowledge-embedded machine learning systems can use knowledge graphs as a source of information for improving performance metrics on a given task. One of these knowledge graphs is Wikidata [15], a popular collaborative knowledge graph.
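As an illustration, the N-way K-shot setting described above amounts to sampling evaluation episodes. The following is a minimal Python sketch (the function and parameter names are ours, not taken from any specific library):

```python
import random

def sample_episode(labels_to_images, n_way=5, k_shot=1, q_queries=5):
    """Sample an N-way K-shot episode: pick N classes, then K support
    and q query images per class (a standard few-shot evaluation setup)."""
    classes = random.sample(list(labels_to_images), n_way)
    support, query = {}, {}
    for c in classes:
        imgs = random.sample(labels_to_images[c], k_shot + q_queries)
        support[c] = imgs[:k_shot]   # the K labelled samples to learn from
        query[c] = imgs[k_shot:]     # held-out samples to classify
    return support, query
```

A model is then asked to classify the query images given only the K support samples per class.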
Our main research goal is to facilitate general-purpose knowledge-graph-enabled computer vision methods, such as the aforementioned Knowledge Graph Transfer Network. In this paper, we provide a mapping between ImageNet classes and Wikidata entities as the first step towards this goal. Our paper is inspired by and builds on top of the work of Nielsen [12], who first explored the possibility of linking ImageNet WordNet synsets with Wikidata. We also aim at providing detailed explanations for our choices and compare the results with those provided by Nielsen. Our publicly available mapping links the WordNet synsets used as ImageNet labels with Wikidata entities and will be useful for the aforementioned computer vision tasks. Practical usage scenarios include situations in which labelling data is a costly process and the considered classes can be linked to a given graph (that is, few- or zero-shot learning tasks). However, simpler tasks, such as classification, can also use context knowledge stemming from rich knowledge graph structure (in prototype learning [18], for instance).
The remainder of this paper is structured as follows. In the next section, we briefly discuss related work. Then, in the third section, we provide detailed explanations of the mapping process, focusing on the entities which do not have a perfect match candidate. The next section provides some analytics describing the mapping, as well as a comparison with automated matching using a NERD tool, entity-fishing [7]. The paper is concluded with a summary. Most importantly, the mapping is publicly available.

Background and Related Work
To provide a mapping between ILSVRC 2012 and Wikidata, it is necessary to define some related terms first. This requires introducing a few additional topics, such as WordNet, since some concepts in ILSVRC 2012 (such as its structure) are based on it. This section provides a comprehensive overview of these concepts. We also list the existing literature on the same problem of finding this specific mapping. To the best of our knowledge, there have been only two attempts to achieve this goal; both are described below.
WordNet is a large lexical database of numerous (primarily) English words [11]. Nouns and verbs have a hierarchical structure and are grouped into synsets (sets of synonyms). Historically, this database played a significant role in various pre-deep-learning-era artificial intelligence applications (and it is still used today). ImageNet [5] is a large image database which inherited its hierarchical structure from WordNet. At the time of writing, it contains 14,197,122 images and 21,841 WordNet-based synsets, which makes it an important source of ground-truth data for computer vision. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [13] was an annual competition for computer vision researchers. The datasets released each year (subsets of the original ImageNet) form a popular benchmark for various tasks to this day. The one released at ILSVRC 2012 is particularly popular and is still commonly called simply ImageNet. It gained scientific attention due to the winning architecture, AlexNet [9], which greatly helped to popularise deep learning. ImageNet is an extremely versatile dataset: architectures coping with it have usually been successful on different datasets as well [2]. Models trained on ImageNet are widely used for transfer learning purposes [8].
Launched in 2012, Wikidata [15] is a collaborative knowledge graph hosted by the Wikimedia Foundation. It provides a convenient SPARQL endpoint. To date, it is an active project and an important source of information for, e.g., Wikipedia articles. Due to its popularity, size, and ubiquity, Wikidata can be considered one of the most popular and successful knowledge graph instances, along with DBpedia [1] and the Freebase-powered [2] Google Knowledge Graph. Given the recent interest in the ability to leverage external knowledge in computer vision tasks [4], it would therefore be beneficial to map ImageNet classes to the corresponding Wikidata entities. The idea itself is not new, though a full mapping was not available to date. To the best of our knowledge, Nielsen [12] was the first to tackle this problem. He summarised the problems encountered while preparing the mapping and classified them into a few categories: missing synsets on Wikidata, matching with a disambiguation page, discrepancies between ImageNet and WordNet, differences between WordNet and Wikidata concepts with similar names, and multiple semantically similar items in WordNet and Wikidata. Nielsen described his effort in detail, though the full mapping was not published. Independently, Edwards [6] tried to map DBpedia and Wikidata to ImageNet (in a larger sense, not ILSVRC 2012) using various pre-existing mappings and knowledge graph embedding methods, such as TransE [3], though the results of this mapping have not been published either. Contrary to these works, we publish our mapping.

Mapping
This section is devoted to the mapping between the ImageNet dataset and Wikidata. First, we explain our approach to providing such a mapping. Then, we identify and group the key issues that occurred in the mapping process. We also provide more detailed explanations for the most problematic entities.
To prepare the mapping, we follow the approach and convention presented by Nielsen. Namely, we use synset names from WordNet 3.0 (as opposed to, for example, WordNet 3.1). That is, we first check the skos:exactMatch (P2888) property for an existing mapping between Wikidata entities and WordNet synsets. This has to be done for every ImageNet class. For example, for the ImageNet synset n02480855 we search for P2888 equal to http://wordnet-rdf.princeton.edu/wn30/02480855-n using the Wikidata SPARQL endpoint. Listing 1 provides a SPARQL query for this example. As of November 2020, 461 out of 1000 synsets in ImageNet are already linked via the wdt:P2888 property. For the rest, the mapping has to be provided. Unlike Edwards [6], we do not rely on automated methods, since the remaining 539 entities can be checked by hand (although we test one automated method in the next section). Using manual search on Google Search, we found good skos:exactMatch candidates for the next 467 ImageNet classes. These matches can be considered for direct addition to Wikidata, as they reflect exactly the same concept. For the vast majority of cases, a simple heuristic was enough: type the synset name into the search engine, check the first Wikipedia result, evaluate its fitness, and then use its Wikidata item link. Using this method, one can link 928 classes in total (with 467 entities matched by hand).
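A query of the shape described above can be assembled programmatically. The following Python sketch (helper names are ours) converts an ImageNet synset ID into its WordNet 3.0 RDF URI and builds the corresponding P2888 lookup:

```python
def synset_to_wn30_uri(synset_id: str) -> str:
    """Convert an ImageNet synset ID (e.g. 'n02480855') to the
    WordNet 3.0 RDF URI used as the value of P2888 (exact match):
    the part of speech letter moves to the end of the offset."""
    pos, offset = synset_id[0], synset_id[1:]
    return f"http://wordnet-rdf.princeton.edu/wn30/{offset}-{pos}"

def build_p2888_query(synset_id: str) -> str:
    """Build a SPARQL query that finds the Wikidata item whose
    wdt:P2888 value equals the synset's WordNet 3.0 URI."""
    uri = synset_to_wn30_uri(synset_id)
    return f"SELECT ?item WHERE {{ ?item wdt:P2888 <{uri}> . }}"
```

Sending `build_p2888_query("n02480855")` to the Wikidata SPARQL endpoint returns the already-linked item, if one exists.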
Sometimes, two similar concepts were yielded. Such synsets were the subject of a qualitative analysis aimed at providing the best match. Similarly, sometimes there is no good match at all. At this stage, 72 out of 1000 classes remain unmatched. Here, we list our proposals for them. We encountered problems similar to Nielsen's [12], though we propose a different summary of common issues. We categorised them into the following problem categories: hyponymy; animals and their size, age, and sex; ambiguous synsets; and non-exact matches. Each of these represents a different form of trade-off made in order to provide the full mapping. This is not a classification in a strict sense, as some of the cases could be placed in several of the aforementioned groups.
Hyponymy. This is the situation in which the level of granularity of WordNet synsets does not match that of Wikidata. As a consequence, some terms do not have a dedicated entity. Therefore, we performed semantic inclusion, in which we searched for a more general "parent" entity that contains the specific case. Examples include magnetic compass (extended to compass), mitten (glove), or whisky jug (jug). The cases falling into this category are presented in Table 1.
Animals and their size, age, and sex. This set of patterns is actually a subcategory of hyponymy, but these animal-related nouns posed several problems worth distinguishing. The first concerns a situation in which a WordNet synset describes a particular sex of a given animal. This information is often missing on Wikidata, which means that the broader semantic meaning has to be used. For example, drake was mapped to duck, whereas ram, tup to sheep. However, while hen was linked to chicken, for cock, rooster (n01514668) there exists an exact match (Q2216236). Another pattern concerns distinguishing animals of different age and size. For instance, lorikeet in WordNet is defined as "any of various small lories". As this definition is a bit imprecise, we decided to use loriini. In another example, eft (a juvenile newt) was linked to newt. Similarly, there are eastern and western green mambas, but WordNet defines green mamba as "the green phase of the black mamba". The poodle breed has three varieties (toy, miniature, and standard poodle), but Wikidata does not distinguish between them; all were therefore linked to poodle (Q38904). These mappings are summarised in Table 2.
Ambiguous synsets. This is a situation in which a set of synonyms does not necessarily consist of synonyms (at least in terms of Wikidata entities). That is, for a synset containing at least two synonyms, there is more than one possible Wikidata entity. At the same time, a broader term for a semantic extension does not necessarily exist, since these concepts can be mutually exclusive. For instance, for the synset African chameleon, Chamaeleo chamaeleon there exist two candidates on Wikidata: Chamaeleo chamaeleon and Chamaeleo africanus. We chose the first one due to the WordNet definition: "a chameleon found in Africa". Another synset, academic gown, academic robe, judge's robe, contains at least two quite different notions; we have chosen academic dress, as this meaning seems to be dominant in ImageNet.
Harvester, reaper is an imprecise category in ImageNet, since it contains a variety of agricultural tools, not only those suggested by the synset name. Bonnet, poke bonnet has a match in Wikidata's bonnet, though it is worth noticing that ImageNet is focused on babies wearing this specific headwear. The mapping of this category can be found in Table 3.
Non-exact match. Sometimes, however, there is no good exact match for a given synset among Wikidata entities. At the same time, the broader term might be too broad. This leads to unavoidable inaccuracies. For example, for nipple we have chosen its meronym, baby bottle. Interestingly, nipple exists in the Polish Wikidata, though it does not have any properties, which makes it useless for further applications. Other examples involve somewhat similar meanings: tile roof was mapped to roof tile, and steel arch bridge to through arch bridge. Plate rack was linked to dish drying cabinet, though this is not entirely accurate, as ImageNet contains pictures of racks used not for drying but sometimes for displaying dishes. In another example, we map baseball player to Category:baseball players. ImageNet contains photos of different kinds of stemware, not only goblets. Cassette was linked to a more fine-grained synset (audio cassette), as the images present audio cassettes in different settings. Table 4 summarises the mappings falling into this category.
ImageNet itself is not free from errors, since it is biased towards certain skin colours, genders, or ages. This is a great concern for ethical artificial intelligence researchers, since models trained on ImageNet are ubiquitous. There are, however, some ongoing efforts to fix this with a more balanced set of images [19]. Beyer et al. [2] listed numerous problems with ImageNet, such as a single label per image, a restrictive annotation process, or practically duplicate classes. They proposed a set of new, more realistic labels (ReaL) and argued that models trained in such a setting achieve better performance. Even given these drawbacks, ImageNet is still ubiquitous. Naturally, the presented mapping inherits problems present in ImageNet, such as those in which images do not quite show what the synset name suggests. This problem was previously reported by Nielsen [12], who described it as a discrepancy between ImageNet and WordNet. Examples include radiator, which in ImageNet represents a home radiator, whereas the Wikidata definition for the same name describes a somewhat broader notion (for instance, it also includes car radiators). Monitor is a similar example, since it might be any display device, though in ImageNet it is connected mostly to computer displays. Sunscreen, sunblock, sun blocker represents different photos of products and their application on the human body, which look completely different and might be split into two distinct classes.

Analytics
We also check to what extent the process can be automated, as this might be useful for larger subsets of ImageNet (in a broad sense). In this section, we present the results of such an investigation. We also provide a concise analysis of the number of direct properties, which is a crucial feature for the future usage of the mapping in various computer vision settings.
Foppiano and Romary developed entity-fishing [7], a tool for named entity recognition and disambiguation (NERD). This tool can be employed to provide an automatic mapping between ImageNet and Wikidata. We used indexed data built from the Wikidata and Wikipedia dumps from 20.05.2020. For this experiment, each synset is split on commas. For example, the synset barn spider, Araneus cavaticus (n01773549) is split into two synset elements: barn spider and Araneus cavaticus. For each of these elements, the term lookup service from entity-fishing is called, which searches the knowledge base for the given term in order to provide match candidates. Since this service provides a list of entities ranked by conditional probability, we choose the one with the highest value.
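The post-processing described above can be sketched in a few lines of Python. Note that the candidate field names below are illustrative, not entity-fishing's actual response schema:

```python
from typing import Dict, List, Optional

def split_synset(synset_name: str) -> List[str]:
    """Split a synset name on commas into its elements, e.g.
    'barn spider, Araneus cavaticus' -> ['barn spider', 'Araneus cavaticus']."""
    return [part.strip() for part in synset_name.split(",")]

def best_candidate(candidates: List[Dict]) -> Optional[str]:
    """Pick the Wikidata ID with the highest conditional probability
    from a term-lookup response (hypothetical field names)."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c["prob"])["wikidata_id"]
```

Each element of a split synset is then sent to the term lookup service, and the top-ranked candidate is kept.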
We start with the 461 already linked instances, which can be treated as ground-truth data for this experiment. Among them, for 387 (84%) synsets there was at least one correct suggestion (for example, at least one of the elements barn spider and Araneus cavaticus was matched to Q1306991). In particular, 286 (62%) synsets were correctly matched for all their elements (for example, barn spider and Araneus cavaticus were both matched to Q1306991). While these results show that NERD tools can speed up the linking process by narrowing down the number of entities to be searched in some cases, they do not replace manual mapping completely, especially in the more complicated and ambiguous cases mentioned in the previous section. Nevertheless, for the remaining 539 synsets, which were manually linked, an identical NERD experiment was performed and resulted in similar figures. For 448 (83%) synsets, entity-fishing provided the same match for at least one synset element. Similarly, for 342 synsets (63%) the tool yielded the same match for all elements. Although these figures are relatively high, they show that a mapping obtained in such a way may contain some discrepancies, which justifies the process presented in the previous section. Similarly to Nielsen, we also count the number of direct properties available in Wikidata. This is a crucial feature, since it enables leveraging the knowledge graph structure. Listing 2 shows the query used for obtaining the number of properties for Q29022. The query was repeated for each mapped entity. Figure 1 depicts a histogram of direct properties for the 1000 mapped classes. The histogram presents a right-skewed distribution (normal-like after taking the natural logarithm) with a mean of 28.28 (σ = 22.77). Only one entity has zero properties (wall clock).
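A per-entity property count of this kind can be obtained with a query of the following shape. This is a sketch of our own reconstruction, not necessarily identical to the query in Listing 2:

```python
def build_property_count_query(qid: str) -> str:
    """Build a SPARQL query counting the distinct direct (wdt:) properties
    of a Wikidata entity. The wikibase:directClaim predicate links a
    property entity to its truthy-statement (wdt:) predicate, so the
    filter restricts ?p to direct properties only."""
    return (
        "SELECT (COUNT(DISTINCT ?p) AS ?count) WHERE { "
        f"wd:{qid} ?p ?o . "
        "?prop wikibase:directClaim ?p . "
        "}"
    )
```

Running `build_property_count_query("Q29022")` against the Wikidata SPARQL endpoint yields the number of direct properties for that entity.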
In total, 992 Wikidata entities are used in the mapping, as some of them were used several times, like the mentioned poodle. These entities display 626 unique properties in total. The most popular ones are listed in Table 5. From a computer vision perspective, an important subset of these properties includes P373 (Commons category), P910 (topic's main category), and P279 (subclass of), as they directly reflect hyponymy and hypernymy within the knowledge graph. Such information can later be leveraged to detect (dis-)similar nodes in a direct way, for example by using graph path distance in SPARQL for entities sharing common ancestors with respect to a given property. However, SPARQL does not allow counting the number of arbitrary properties between two given entities. Using graph embeddings is a potential workaround for this issue. For example, one can calculate distances using the 200-dimensional pre-trained embeddings provided by the PyTorch-BigGraph library [10]. Another possible direction is leveraging other linked knowledge graphs, such as Freebase (P646), which is linked to the majority of the considered instances.

Conclusions

We mapped all 1000 ILSVRC 2012 labels to Wikidata entities. While 461 of them were already linked, for a further 467 classes we found candidates which match their corresponding synset. Since 72 classes do not have a direct match, we proposed a detailed justification of our choices. We also compared our mapping with one obtained from an automated process. To the best of our knowledge, we are the first to publish a mapping between ImageNet and Wikidata. The mapping is publicly available for use and validation in various computer vision scenarios.
Future work should focus on empirically testing the mapping. Our results are intended to benefit general-purpose computer vision research, since knowledge graphs can be leveraged as a source of contextual information for various tasks, and our analysis showed that the vast majority of the linked entities have a substantial number of direct properties. This fact can be utilised according to the given computer vision task. For example, it may be used to generate low-level entity (label) embeddings and calculate distances between them in order to create the correlation matrix used in the Knowledge Graph Transfer Network [4] for few-shot image classification. This architecture leverages prior knowledge about the semantic similarity of the considered labels (called a correlation matrix in the paper), which is used for creating class prototypes. These prototypes help the classifier learn novel categories with only a few samples available. Correlations might be calculated using simple graph path distance, as well as using more sophisticated low-dimensional knowledge graph embeddings and a distance metric between instances. In this case, this will result in a 1000×1000 matrix, as there are 1000 labels in ImageNet. Embeddings from pre-trained models (such as the aforementioned PyTorch-BigGraph embeddings) might be used for this task.
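A label-correlation matrix of this kind can be sketched as pairwise cosine similarities over entity embeddings. The following is a minimal illustration; the exact correlation measure used in [4] may differ:

```python
import numpy as np

def correlation_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Given an (N, d) array of entity embeddings (e.g. 200-dimensional
    PyTorch-BigGraph vectors for the N mapped labels), return the N x N
    matrix of pairwise cosine similarities."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # avoid division by zero
    return unit @ unit.T
```

For the 1000 mapped ImageNet labels, this yields the 1000×1000 matrix mentioned above.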
Future work might also consider extending the mapping to larger subsets of ImageNet (in a broad sense), such as ImageNet-6K [4], a dataset consisting of 6000 ImageNet categories. Preparation of such a large mapping might require a more systematic and collaboration-oriented approach, which can help to create, verify, and reuse the results [20]. The presented approach can also be used to provide mappings between ImageNet and other knowledge graphs. Another possible application is further mapping to actions, which might be particularly interesting for robotics, where robots would decide which actions to take based on such mappings [14].