Dataset
A controlled vocabulary for research and innovation in the field of Cultural Heritage & Heritage Sciences
- Title
- A controlled vocabulary for research and innovation in the field of Cultural Heritage & Heritage Sciences
- Creator(s)
- Duran-Silva, Nicolau
- Parra-Rojas, César
- Rondelli, Bernardo
- Giocoli, Luca
- Plazas, Adrià
- Grimau, Berta
- Rights
- Creative Commons Attribution Share Alike 4.0 International, Open Access
- Publisher
- Zenodo
- Language
- eng
- Date
- 2021
- Abstract
- This controlled vocabulary of keywords related to the field of Cultural Heritage and Heritage Sciences was built by SIRIS Academic in collaboration with IRPET (the Regional Institute for Economic Planning of Tuscany) and the ISPC (Institute of Heritage Science of CNR), in order to identify culture-related research, development, and innovation activities. The work drew on the advice of domain experts and was ultimately applied to inform regional strategies on Cultural Heritage and research and innovation policy. The vocabulary is meant to retrieve texts (e.g. R&D projects and scientific publications) whose titles and abstracts feature its concepts, on the assumption that such records contribute applications, techniques, or issues to the domain of Cultural Heritage and Heritage Sciences. The classification aims to identify research products in the domain of Cultural Heritage, ranging from documents in some of its “traditional” disciplines to documents emerging from interdisciplinary projects that apply novel areas and technologies to Cultural Heritage. Identifying texts in the domain of Cultural Heritage is a text classification task. Developing a method to decide whether a text is relevant or related to the domain is challenging: defining what Cultural Heritage is and what it includes is complex even for domain experts, in particular because it is a broad field of knowledge and there is no full agreement on where its borders lie. To define the perimeter in this project, many of the available definitions were taken into account.
Because many resources are available in the domain, including thesauruses and taxonomies, building a weakly-supervised controlled vocabulary was considered the best way of retrieving documents in the domain. Since no annotated corpus or dataset of research texts covers the diversity of publications that relate to the cultural domain while stemming from different disciplines, we opted for a rule-based text classification technique, specifically a weakly-supervised controlled vocabulary. As defined by the Getty Research Institute, a controlled vocabulary is an organized arrangement of words and phrases used to index content and/or to retrieve content through browsing or searching. It typically includes preferred and variant terms and has a defined scope or describes a specific domain. The purpose of controlled vocabularies is to organize information and to provide terminology to catalogue and retrieve information. While capturing the richness of variant terms, controlled vocabularies also promote consistency in preferred terms and the assignment of the same terms to similar content (Harpring, 2010). In short, Cultural Heritage is a rather abstractly defined field, and Heritage Science is a particularly “fuzzy” field within it. One of the main limitations of our approach is that a controlled vocabulary never captures all the lexical and linguistic variants of a term, so relevant texts may be missed if no pattern matches them during the search. On the other hand, the controlled vocabulary is built from existing vocabularies and thesauruses in the domain of Cultural Heritage, which are large resources. Not all the concepts in these resources are included directly in the controlled vocabulary, because many would add noise to the classification.
Therefore, automatic weak supervision and human curation of the final controlled vocabulary are fundamental for achieving correct results. The controlled vocabulary builds on four resources:

- The <strong>Art and Architecture Thesaurus (AAT)</strong>: a structured vocabulary with approximately 34,000 concepts, including 131,000 words, descriptions and other information related to art, architecture, decorative arts, archival material and material culture, commonly used for cataloguing and for information retrieval.
- Some cultural heritage categories in <strong>Wikipedia</strong> and <strong>DBpedia</strong>: these categories were used to collect all related articles and subcategories, in order to obtain relevant, similar and specific instances of concepts linked to the domain.
- The <strong>RICHES Taxonomy</strong>: a theoretical framework of related terms and their definitions, referring to the new concepts of the digital era, which aims to define the scope of some digital technologies applied to cultural heritage.
- <strong>Heritage Data - Linked Data Vocabularies for Cultural Heritage</strong>: a dataset that includes several cultural heritage thesauruses and vocabularies and is recognised as a reference point in the United Kingdom in the domain of cultural heritage.

The collection of concepts extracted from these four resources comprised more than 60,000 terms, which were refined as described in the next section.

## Automatic validation of the controlled vocabulary

To refine the collection into a final set of concepts and terms relevant to the domain of Cultural Heritage, a semi-automatic validation was applied to remove irrelevant, overly general, and ambiguous terms. To keep the relevant ones, the specialization index (SI) was calculated for each keyword in the collection.
In this case, the SI is obtained by measuring the fraction of publications containing a keyword within a set of publications in the domain of Cultural Heritage, and normalizing it by the fraction of publications in the open domain containing that keyword. Once the SI has been calculated, all keywords below a certain threshold are removed, and a manual supervision step is applied to remove non-pertinent keywords. The following table shows an example of this validation:

| Keyword | Specialization Index | Automatic threshold | Manual supervision |
|---|---|---|---|
| male | 0.27 | Removed | Accepted |
| 3-d laser scanning | 0.7 | Removed | Accepted |
| 78 rpm records | 20.7 | Accepted | Removed |
| vienna | 3.48 | Accepted | Removed |
| radiocarbon dating | 13.6 | Accepted | Accepted |
| graffiti | 25 | Accepted | Accepted |
| bark painting | 20.7 | Accepted | Accepted |
| pompeii | 16.23 | Accepted | Accepted |

The SI of the final keywords can be used as a probabilistic metric for each keyword. The final list of keywords was manually curated by domain experts.

## Evaluation of the controlled vocabulary

The final controlled vocabulary was evaluated on an external dataset in order to calculate its precision. The evaluation dataset consisted of articles from four journals unequivocally considered to fall within the domain of Cultural Heritage: <em>(1) Journal Of Cultural Heritage, (2) Journal On Computing And Cultural Heritage, (3) Journal Of Cultural Heritage Management And Sustainable Development and (4) Digital Applications In Archaeology And Cultural Heritage.</em> This collection comprised 5,000 articles, taken as the positive set, alongside another 5,000 randomly selected articles from outside the Cultural Heritage domain, taken as the negative set. Applying the Cultural Heritage vocabulary to the evaluation dataset yielded a precision of 95%.
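The two measures used so far, the specialization index and precision, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the publication counts, and the threshold of 1.0 are assumptions made for illustration (the text does not state the threshold actually used), and precision is taken in its usual sense of true positives over all records tagged as positive. The SI values in the filtering example are those listed in the table of this section.

```python
def specialization_index(domain_hits, domain_total, open_hits, open_total):
    """Fraction of domain publications containing a keyword, normalized by
    the fraction of open-domain publications containing it."""
    if domain_total <= 0 or open_total <= 0:
        raise ValueError("corpus sizes must be positive")
    domain_share = domain_hits / domain_total
    open_share = open_hits / open_total
    if open_share == 0:
        # Keyword never appears in the open domain: treat as fully specialized.
        return float("inf")
    return domain_share / open_share


def filter_by_threshold(si_by_keyword, threshold):
    """Keep keywords whose SI is at or above the threshold (automatic step)."""
    return {kw: si for kw, si in si_by_keyword.items() if si >= threshold}


def precision(true_positives, false_positives):
    """Share of records tagged by the vocabulary that truly belong to the domain."""
    tagged = true_positives + false_positives
    if tagged == 0:
        raise ValueError("no record was tagged")
    return true_positives / tagged


# Hypothetical counts: a keyword found in 50 of 1,000 domain publications
# and in 5 of 1,000 open-domain publications.
example_si = specialization_index(50, 1000, 5, 1000)

# SI values taken from the table; the threshold of 1.0 is an assumption.
table_si = {
    "male": 0.27,
    "3-d laser scanning": 0.7,
    "vienna": 3.48,
    "radiocarbon dating": 13.6,
    "graffiti": 25.0,
}
kept = filter_by_threshold(table_si, threshold=1.0)

# Hypothetical evaluation outcome: 95 correct tags and 5 false positives.
example_precision = precision(95, 5)

print(example_si, sorted(kept), example_precision)
```

Only the automatic step is sketched here; the manual supervision that follows it (for instance, rejecting "vienna" despite its high SI) remains a human decision.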
After a set of improvements to the vocabulary, based on exploring the publications not identified in the first test and the false positives, we obtained a precision of 98%. Applying the vocabulary with the probability of each keyword as its weight of belonging to the domain did not improve the results, so the probabilistic approach was discarded.

## Using the vocabulary to classify publications concerning Cultural Heritage

Defining the vocabulary does not, per se, identify research contributions in Cultural Heritage: this is done by matching the terms in the controlled vocabulary against the content of the gathered research records. To carry out this task successfully, a series of pattern matching rules must be defined to capture possible variants of the same concept, such as permutations of the words within the concept and/or the presence of null words to be skipped. We therefore carefully crafted matching rules that take word permutations into account and that allow the words of a concept to appear within a certain distance of one another. The following examples show the tagging process on some abstracts; each publication title is followed by its abstract:

Egocentric visitor localization and artwork detection in cultural sites using synthetic data

Computer vision and machine learning can be used in <strong>cultural heritage to augment the experience of visitors during the exploration of the cultural site</strong>, as well as to assist its management. To achieve such goals, two fundamental tasks should be addressed, i.e., localizing <strong>visitors and recognizing the observed artworks</strong>. Wearable cameras offer a convenient setting to address both tasks through the analysis of images acquired from the visitors’ points of view.
However, the engineering of approaches to address such tasks generally requires large amounts of labeled data. We propose a tool which can be used to collect and automatically label synthetic visual data suitable to study image-based localization and artwork detection. The tool simulates a virtual agent navigating the <strong>3D model of a real cultural site</strong> and automatically captures video frames along with the related ground truth camera poses and semantic masks indicating the position of artworks. We generate a dataset of synthetic images starting from the 3D model of a <strong>museum located in Siracusa</strong>, Italy. The experiments suggest that the proposed tool allows to drastically reduce the effort needed to collect and label data, providing a means to generate large-scale datasets suitable to study localization and <strong>artwork detection in cultural sites</strong>.

Discovering Leonardo with artificial intelligence and holograms: A user study

Cutting-edge visualization and interaction technologies are increasingly used in <strong>museum exhibitions</strong>, providing novel ways to engage visitors and enhance their <strong>cultural experience</strong>. Existing applications are commonly built upon a single technology, focusing on visualization, motion or verbal interaction (e.g., high-resolution projections, gesture interfaces, chatbots). This aspect limits their potential, since museums are highly heterogeneous in terms of visitors profiles and interests, requiring multi-channel, customizable interaction modalities. To this aim, this work describes and evaluates an artificial intelligence powered, interactive holographic stand aimed at describing <strong>Leonardo Da Vinci's art</strong>. This system provides the users with accurate <strong>3D representations of Leonardo's machines</strong>, which can be interactively manipulated through a touchless user interface.
It is also able to dialog with the users in natural language about Leonardo's art, while keeping the context of conversation and interactions. Furthermore, the results of a large user study, carried out during art and tech exhibitions, are presented and discussed. The goal was to assess how users of different ages and interests perceive, understand and explore <strong>cultural objects</strong> when holograms and artificial intelligence are used as instruments of knowledge and analysis.

Hybrid query expansion using lexical resources and word embeddings for sentence retrieval in question answering

Question Answering (QA) systems based on Information Retrieval return precise answers to natural language questions, extracting relevant sentences from document collections. However, questions and sentences cannot be aligned terminologically, generating errors in the sentence retrieval. In order to augment the effectiveness in retrieving relevant sentences from documents, this paper proposes a hybrid Query Expansion (QE) approach, based on lexical resources and word embeddings, for QA systems. In detail, synonyms and hypernyms of relevant terms occurring in the question are first extracted from MultiWordNet and, then, contextualized to the document collection used in the QA system. Finally, the resulting set is ranked and filtered on the basis of wording and sense of the question, by employing a semantic similarity metric built on the top of a Word2Vec model. This latter is locally trained on an extended corpus pertaining the same topic of the documents used in the QA system. This QE approach is implemented into an existing QA system and experimentally evaluated, with respect to different possible configurations and selected baselines, for the <strong>Italian language and in the Cultural Heritage domain</strong>, assessing its effectiveness in retrieving sentences containing proper answers to questions belonging to four different categories.
"3D reconstruction and validation of historical background for immersive VR applications and games: The case study of the Forum of Augustus in Rome" "In the last decades, thanks to the success of the video games industry, the sector of technologies applied to cultural heritage has begun to envisage, in this domain, new possibilities for the <strong>dissemination of heritage and the study of the past </strong>through edutainment models. More recently, experimentation in the field of<strong> virtual archaeology </strong>has led to the development of virtual museums and interactive applications. Among these, the “serious game” segment – the<strong> application of interactive technologies to the cultural heritage domain</strong> – is rapidly growing, also including immersive VR technologies. Applied VR games and applications are characterized by a thorough <strong>historical background and a validated 3D reconstruction</strong>. Indeed, producing such products requires a tailored workflow and large effort in terms of time and professionals involved to guarantee such faithfulness. Drawing on our previous work in the<strong> field of virtual archaeology</strong> and referring to recent experiences related to the deployment of applied VR games on PlayStation VR, we describe and assess a workflow for the production of <strong>historically accurate 3D assets</strong>, targeting interactive, immersive VR products. The workflow is supported by the case study of the <strong>Forum of Augustus </strong>and different output applications, highlighting peculiarities and issues emerging from a multi and interdisciplinary approach. Through this classification process, we identified projects and publications related to heritage, with different levels of relationship and relevance, but mostly relevant to understanding the research competencies in the domain. The resulting research records were reviewed by experts in the domain, given the occurrence of some false positives. 
Among the main strengths of this step, it is worth noting that the vocabulary is broad and not restricted to the field of Heritage Science (that is, to STEM applications in Cultural Heritage), as it takes advantage of a variety of available resources. Moreover, by looking directly at the textual data, instead of using the assigned bibliometric areas, we can better capture interdisciplinary research. The limitations of this approach were presented at the beginning of this document: for example, relevant texts may be missed if no pattern matches them during the search.

## Vocabulary of concepts related to Key Enabling Technologies in the domain of Cultural Heritage and Culture

For the development of this vocabulary, the definition of key enabling technologies in the domain of Cultural Heritage and Culture was based on the report 'Technologies, Cultural Heritage and Culture', published in March 2019 by IRPET. A vocabulary for each Key Enabling Technology (hereafter, KET) was prepared by extracting the relevant concepts, words, technologies and examples from the Platform Report document 'Technologies, Cultural Heritage and Culture', specifically from APPENDIX A. DESCRIPTION OF MAIN TECHNOLOGIES FOR ROADMAP (pp. 45-61). Each vocabulary contains a set of terms divided into subdomains. The KETs have been divided into the following six groups: ICT PHOTONICS, MICRO- AND NANO-ELECTRONICS PLATFORMS NANO AND BIOTECHNOLOGY, ADVANCED MATERIALS PARTICLE ANALYTICAL SYSTEMS The initial keywords extracted from the document were enriched through semantic keyword enrichment, which combines co-occurring keywords and word embeddings (Duran-Silva et al., 2019; Duran-Silva et al., 2021). This second vocabulary has to be used in combination with the Cultural Heritage vocabulary to capture KETs within the domain of cultural