Data | Écoles françaises à l'étranger

The main goal is to propose a strategy for the management of research data shared by the French schools abroad (EFE). This strategy is part of the open science movement and aims to strengthen the visibility of the schools' actors and the data they produce on a national and European scale.

The idea is not to build a single system at the network level but rather to articulate the different solutions deployed (or to be developed) in the schools that allow for the management of data throughout its life cycle in order to structure it in accordance with FAIR principles.

This network-wide strategy is based on:

the adoption and enrichment of shared and standardised repositories (actors, places, chronology, subjects) as well as the provision of persistent identifiers (DOI) at the level of the EFEs;
a clear service grid with the coordinated adoption of tools, services, practices and procedures (storage / backup, authentication, metadata entry, etc.). The elements of this grid are described in detail in the form of best practice guides and technical sheets on the network's website;
the interoperability of the various platforms involved in the life cycle of data within the Schools (planning, archiving, dissemination / exhibition);
a training plan adapted to the needs of the researchers in accordance with the service grids made available by the Schools or the national infrastructures.

The visibility of all the actors and resources of the Schools will be ensured by:

Systematic identification of actors (IdRef, IdHal, Orcid), awareness-raising on the need to deposit publications, and systematic attribution of "EFE" DOIs to public resources;
The articulation between publications and research data (Bulletin archéologique des EFE, DataJournal / DataPapers, etc.)
A wide range of metadata for exposure to national or European infrastructures
- EAD: for finding aids (Calame, France-Archives)
- For data sets
  - Dublin-Core: Isidore, ...
  - DataCite: OpenAire, Ariadne-Plus
  - Cidoc-CRM: OpenArchéo
  - RiC-O
Implementing a set of exchange protocols or APIs (OAI-PMH, Sparql, Rest) allowing access to metadata/data
Supporting researchers in the exhibition / editorialization of their data
- Catalogues
- Digital libraries
- Web GIS

Common repositories

Repositories (in the sense of authority files and controlled vocabularies) and persistent identifiers make it possible to designate and find an object, a document, a person, a place or an organisation in an unambiguous and persistent manner. They constitute the backbone of data management with a view to interoperability and the reuse of data.
Among the most commonly used identifiers in the research world are DOIs for digital resources, idRef, Orcid, IdHal for people, Viaf, Isni, ROR for organisations, Getty, Pactols, Inspire for subjects, PeriodO for chronologies, Geonames, Pleiades for places. This multiplication of "references" can be a problem but there are (most often) alignments between these identifiers.

Identifying the actors, resources (publications, data), and organisations involved in the scientific production of schools and assigning DOIs to all public resources is a prerequisite for the "FAIRisation" of EFE data.
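As an illustration of how identifiers from several of these registries can be handled uniformly, here is a minimal Python sketch that turns a (scheme, identifier) pair into a resolvable URI. The URI patterns follow each registry's public resolver; the `URI_PATTERNS` table and the `to_uri` helper are assumptions made for this example, not part of any EFE tool.

```python
# URI patterns for some of the identifier schemes named above.
# The table keys and the helper name are assumptions for this sketch.
URI_PATTERNS = {
    "doi": "https://doi.org/{id}",
    "orcid": "https://orcid.org/{id}",
    "idref": "https://www.idref.fr/{id}",
    "ror": "https://ror.org/{id}",
    "geonames": "https://sws.geonames.org/{id}/",
    "pleiades": "https://pleiades.stoa.org/places/{id}",
}


def to_uri(scheme: str, identifier: str) -> str:
    """Return the resolvable URI for an identifier in a known scheme."""
    try:
        pattern = URI_PATTERNS[scheme.lower()]
    except KeyError:
        raise ValueError(f"unknown identifier scheme: {scheme}") from None
    return pattern.format(id=identifier.strip())


# 0000-0002-1825-0097 is the sample iD used in ORCID's own documentation.
print(to_uri("orcid", "0000-0002-1825-0097"))
```

Storing the scheme next to the raw identifier, rather than a full URI only, keeps alignments between registries easy to express as pairs of (scheme, identifier) values.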

Actors repository

The attribution (or recovery) of an IdRef identifier (idHal, Orcid) for all the actors of the EFEs will make it possible to ensure the link between data producers, archives, publications, institutions and scientific operations.

This implies, on the one hand, important retrospective work to be carried out at the level of the network or of each school in collaboration with ABES and, on the other hand, the creation of new entries in the ABES reference system. This operation must be carried out in consultation with the school libraries in order to set up procedures and possibly develop tools to simplify (automate) the work.
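One small piece of such automation could be checking identifiers before they are recorded. The sketch below validates the check character of an ORCID iD, which uses the ISO 7064 MOD 11-2 algorithm; the function names are assumptions for this example, and IdRef or idHal identifiers would need their own checks.

```python
import re

# Four groups of four characters; the last character may be "X".
ORCID_RE = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]", re.ASCII)


def orcid_check_character(base15: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for digit in base15:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Return True if the iD is well formed and its check character matches."""
    if not ORCID_RE.fullmatch(orcid):
        return False
    compact = orcid.replace("-", "")
    return orcid_check_character(compact[:15]) == compact[15]


# Sample iD from ORCID's documentation, then the same iD with a wrong last digit.
print(is_valid_orcid("0000-0002-1825-0097"))
print(is_valid_orcid("0000-0002-1825-0098"))
```

A check of this kind only catches typing errors; it does not confirm that the iD belongs to the intended person.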

Institutional repository

The identification of all the institutions involved in scientific operations will make it possible to report on the network of actors/institutions/scientific operations in relation to archives and publications. This identification will be based on the IdRef/Viaf identifiers on the one hand and ROR (Research Organization Registry) on the other.

Locations repository

The identification of the locations of scientific operations will be based on reference systems widely used by the scientific community:

Geonames (collaborative geographic database)
Pleiades (historical geographical information on the ancient world)
IdRef (geographical information managed by Sudoc; the Collex ArchéoAl project aims to enrich and complete the IdRef records for the geographical names of archaeological sites and of the elements of the sites of the five EFEs)
Pactols (for publications under OpenEdition, which benefits from the ArchéoAl project)

An interface within the "scientific operations" platform will simplify locating a mission by letting the user select one or more areas. It will then align the selected areas, automatically or manually, with the Geonames, Pleiades and ArchéoRef reference systems.
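The first step of such an alignment can be sketched as a simple spatial filter: given the area selected by the user, keep the gazetteer entries that fall inside it and offer them as alignment candidates. Everything in this Python sketch is hypothetical (class names, fields, placeholder identifiers and coordinates); a real implementation would query the Geonames or Pleiades services rather than an in-memory list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    scheme: str      # e.g. "geonames" or "pleiades"
    identifier: str  # placeholder values in this sketch
    name: str
    lat: float
    lon: float


@dataclass(frozen=True)
class BoundingBox:
    south: float
    west: float
    north: float
    east: float

    def contains(self, place: Place) -> bool:
        # Simple test; does not handle boxes crossing the antimeridian.
        return (self.south <= place.lat <= self.north
                and self.west <= place.lon <= self.east)


def candidates_in_area(area: BoundingBox, places: list[Place]) -> list[Place]:
    """Return the gazetteer entries located inside the selected area."""
    return [p for p in places if area.contains(p)]


gazetteer = [
    Place("geonames", "ID-A", "Site A", 37.4, 25.3),
    Place("pleiades", "ID-B", "Site B", 30.0, 31.2),
]
selected = BoundingBox(south=37.0, west=25.0, north=38.0, east=26.0)
print([p.name for p in candidates_in_area(selected, gazetteer)])
```

The candidates returned this way would still need confirmation by the user when several entries match, which corresponds to the manual alignment case.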

Chronology repository

Chronologies will be identified using the PeriodO repository. If no existing chronology accurately describes a School's scope, the School can define its own references (cf. EFA: http://n2t.net/ark:/99152/p0gtzvd). Alignment with this repository will, among other things, enable interoperability with the AriadnePlus infrastructure.
In parallel, alignment work with the Pactols repository should be planned to enable interoperability with the OpenEdition platforms or the OpenArcheo platform of the MASA consortium.

Subject repository

The subject repository is based on Pactols. Work has been carried out by the archives and publications departments to obtain a selection of terms relevant to indexing archive data or describing a publication. This selection represents 1,500 terms; a narrower selection will be used to describe a scientific operation.

Convergent service grids and shared procedures

Management of "research" storage spaces

Storage is a central element of the data life cycle. An offer adapted to the needs of researchers and scientific operations must be proposed in order to secure data during production (synchronisation to the school's server, provision of backup hard disks before uploading to the school's server). Solutions are already operational in the five schools with heterogeneous perimeters and uses.

The aim is to standardise the solutions (Nextcloud) and to harmonise practices around this offer. This involves:
An upgrade of the infrastructure (volume)
Communication on the availability of this service and an incentive for researchers and project leaders (in particular on the "scientific operations" platform and the technical data sheets) to use it
A definition of the framework for use, migration procedures and the responsibility of each actor (researcher, IT department, institution)
Implementation of a solution for migrating data to third-party solutions (HN-Box, ShareDoc, host university's warehouse, etc.) when members leave the institution (depending on the institution's policy)
Integration of an archive tree accessible to archive services in order to facilitate the deposit to the latter at the end of the mission.

Good practice guides and data sheets

In order to present the grid of services dedicated to researchers and to assist them in the management of their data, the network portal hosts a set of best practice guides and technical sheets. These sheets list, by type of project, the norms and standards to be respected and the tools available internally or at the level of national or European infrastructures.

They are presented in the form of a summary sheet and a detailed sheet associated with a bibliography, sitography, recommended tools, etc. They also refer to the procedures and specifications sheets local to the schools (storage, legal framework concerning the production and reuse of data, etc.).

Tools to support data collection, production and processing

The diversity of methods for collecting and producing data in the field (Filemaker, Excel, SysLat, etc.) means that it is not possible in the short term to propose a tool (or a set of tools) to facilitate this part of the researcher's work.
The idea is to focus for the moment on securing data, good practices and tools to facilitate the management of associated metadata.

Data security: a dedicated infrastructure

The security of the data produced is a central element of the data management process, and an offer adapted to the needs of researchers and scientific operations must be proposed. The idea is to favour the adoption of storage made available by the School (Nextcloud), even if it means providing 4G connectivity for isolated sites. For sites that cannot be covered by 4G, a backup solution based on hard disks / NAS must be made available with a clearly documented protocol.

Data and metadata: good practice guides and management tools

Practices vary widely, and for the time being there are few or no structured tools to assist researchers in entering metadata, apart from Excel spreadsheets provided by data type and renaming procedures (carried out by the archive service) using generic tools.

Prospective work must be carried out on the development of tools to facilitate/automate data entry in accordance with the metadata and reference systems recommended at the level of the EFEs and to avoid:

Multiple entries;
The multiplication of data entry errors;
Numerous import/export operations;
Loss of data during transfers;
Lack of data traceability;
The absence of associated metadata.

A mock-up (used as a basis for discussions on the desired uses and functionalities) has been developed to facilitate the entry of metadata and the renaming of researchers' files in the trees defined by the archive services. It creates a link between the information in the "scientific operations" platform, the repositories, the researchers' data and the archive platform.
It can be enriched with new functionalities according to the feedback and needs expressed: format conversion, EXIF/IPTC import/export, management of 3D objects, GIS, EAD export, data warehouses (Dataverse, Nakala, etc.), DOIs, etc.
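To make the renaming step concrete, here is a minimal Python sketch that builds a normalised file name from information that would come from the "scientific operations" platform. The naming convention (`operation_year_type_sequence.ext`) and the function names are hypothetical; the actual conventions are those defined by each school's archive service.

```python
import re
import unicodedata
from pathlib import PurePosixPath


def slug(text: str) -> str:
    """Lower-case ASCII slug: accents stripped, other characters turned into '-'."""
    ascii_text = (unicodedata.normalize("NFKD", text)
                  .encode("ascii", "ignore")
                  .decode("ascii"))
    return re.sub(r"[^A-Za-z0-9]+", "-", ascii_text).strip("-").lower()


def archive_name(original: str, operation: str, year: int,
                 doc_type: str, sequence: int) -> str:
    """Build a normalised file name (hypothetical convention for this sketch)."""
    extension = PurePosixPath(original).suffix.lower()
    return f"{slug(operation)}_{year}_{slug(doc_type)}_{sequence:04d}{extension}"


print(archive_name("IMG_0042.JPG", "Opération Test", 2023, "photo", 7))
```

Deriving the name from structured fields, instead of asking researchers to type it, addresses the multiple-entry and data-entry-error problems listed above.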

Researchers are also supported at this stage by a set of good practice guides and technical sheets linked to the schools' service grid, available on the mission platform and the ResEFE portal.

Interoperable platforms

The platforms described below are not a single solution but a set of systems that offer functionalities common to the needs of the Schools. They are interoperable thanks to data exchange protocols and procedures that follow schemas shared at the network level. These platforms cover the key stages of the data life cycle: planning, archiving and dissemination.

Scientific Operations Platform

This platform comes into play at the planning stage of any scientific operation, and allows the context of a scientific operation and all the information relating to the data produced and/or used to be described.
It has a cross-cutting role and serves as a reference base throughout the data life cycle. Its objective is to collect all the information concerning a scientific operation and to centralise it in order to feed the other stages of the data management process (collection/production, storage, archiving, dissemination, publication, reporting) and thus avoid possible duplication.

This information could, for example, be used to:

anticipate the need for dedicated storage or archiving space
allow the publications department to fill in the metadata for OpenEdition automatically (Authors, Subjects, Chronology)
automate the reporting of missions and data sets on HAL
...

It will make it possible, prior to the mission, to provide the operation manager with all the contractual and legal documents and information (good practice guide, technical sheets) necessary for the proper management of research data in relation to the school's service grid.

A data schema and an exchange protocol make it possible to standardise this information at the EFE level. Thus, the "missions" platform of the EFA and the "scientific operations" platform of the IFAO will be interoperable with the "ResEFE" platform without affecting their local specificities.

The information collected is likely to be enriched as the project progresses, and the platform keeps a history of operations.

The platform will store and deliver:

A Data Management Plan template (Docx, ODT, PDF) in accordance with the one proposed by MASA
Mission information in XML format according to a published schema (see https://schema.data.gouv.fr/)
Scoping letter
Commitment letter
Excavation authorisations
...
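As a rough idea of what mission information in XML could look like, the following Python sketch serialises a mission description with the standard library. The element and attribute names are hypothetical and do not reproduce the published schema; the identifiers are placeholders.

```python
import xml.etree.ElementTree as ET


def mission_to_xml(mission: dict) -> str:
    """Serialise a mission description to XML (element names are hypothetical)."""
    root = ET.Element("operation", {"id": mission["id"]})
    ET.SubElement(root, "title").text = mission["title"]
    ET.SubElement(root, "school").text = mission["school"]
    actors = ET.SubElement(root, "actors")
    for actor in mission.get("actors", []):
        node = ET.SubElement(actors, "actor", {"idref": actor["idref"]})
        node.text = actor["name"]
    places = ET.SubElement(root, "places")
    for place in mission.get("places", []):
        node = ET.SubElement(
            places, "place", {"scheme": place["scheme"], "ref": place["ref"]}
        )
        node.text = place["name"]
    return ET.tostring(root, encoding="unicode")


example = {
    "id": "op-0001",
    "title": "Example survey",
    "school": "EFA",
    "actors": [{"name": "Jane Doe", "idref": "PPN-PLACEHOLDER"}],
    "places": [{"name": "Example site", "scheme": "geonames",
                "ref": "ID-PLACEHOLDER"}],
}
print(mission_to_xml(example))
```

Carrying the actor and place identifiers as attributes means the same document can feed the archive, publication and reporting steps without re-entering them.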

At the level of the EFE network, this platform ensures the centralisation and homogenisation of information concerning scientific operations.

Persistent identifiers platform

The attribution of a DOI to a resource is now possible at the level of the EFE network, via a subscription to DataCite (through INIST). The doi.resefe.fr platform allows all the actors of the Schools to create and manage persistent identifiers for their digital resources autonomously. A PHP library has been developed so that other platforms can interface with this solution.
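For readers unfamiliar with DOI registration, the sketch below builds the kind of JSON payload that DataCite's public REST API accepts when creating a DOI. It is written in Python for illustration and is not the API of doi.resefe.fr or of the PHP library mentioned above, which the source does not describe; the prefix, URL and names are placeholders, and no request is sent.

```python
import json


def datacite_payload(prefix: str, url: str, title: str, creators: list[str],
                     publisher: str, year: int,
                     resource_type: str = "Dataset") -> dict:
    """Build a DOI creation payload in the shape used by the DataCite REST API."""
    return {
        "data": {
            "type": "dois",
            "attributes": {
                "prefix": prefix,    # placeholder prefix in this sketch
                "event": "publish",  # asks for a findable (public) DOI
                "url": url,
                "titles": [{"title": title}],
                "creators": [{"name": name} for name in creators],
                "publisher": publisher,
                "publicationYear": year,
                "types": {"resourceTypeGeneral": resource_type},
            },
        }
    }


payload = datacite_payload(
    prefix="10.1234",
    url="https://example.org/records/1",
    title="Example data set",
    creators=["Doe, Jane"],
    publisher="Example School",
    year=2023,
)
print(json.dumps(payload, indent=2))
```

The fields shown (creators, titles, publisher, publication year, resource type) are the ones DataCite requires for a public DOI, which is why actor and institution identification comes first in the strategy.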

Archive platform

The objective of this platform is to collect, manage, preserve and disseminate (with the agreement of the archive services) all the data produced at the end of a scientific operation. The data are heterogeneous (photographs, plans, GIS, photogrammetry, databases) and sometimes require specific metadata.
The platform therefore provides a metadata base common to all types of documents, based on ISAD(G), the international standard for archival description, together with metadata adapted to each type of resource.
An "EFE" DOI is automatically assigned to deposited data as soon as it is published by its producer. The data can then be distributed via a public interface, exported (EAD, Dublin-Core, DataCite, Cidoc-CRM, IIIF, etc.) and harvested (OAI-PMH protocol), which makes it interoperable and reusable according to the associated licence.

This platform is the pivot in the data life cycle. First of all, it guarantees the conservation of the data produced by the Schools and develops the research tools associated with these archives.

In a second phase, it will make it possible to:

Facilitate the articulation between research data and publications:
- Archaeological bulletin of the EFE
- Publication in the other publications of the schools (support on the Onix flows)
- Articles deposited in HAL
Build a reflection on Data-Papers around the mission data in connection with OpenEdition (Equipex 2020 project)
Develop the interaction between GIS and data
Build catalogues or collections of data
Build data editorial portals (Omeka-S, Drupal,...)
Develop digital libraries.

Finally, by providing a wide range of metadata formats, it exposes data to national or European infrastructures: EAD for finding aids (Calame, France-Archives) and, for data sets, Dublin-Core (Isidore), DataCite (OpenAire, Ariadne-Plus) and Cidoc-CRM (OpenArchéo).
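Exposing the same record in several formats amounts to maintaining crosswalks from the common ISAD(G)-based description to each target schema. The Python sketch below shows a simplified crosswalk to Dublin Core; the field names on the ISAD(G) side and the mapping itself are illustrative assumptions, not the mapping used by the archive platform.

```python
# Simplified, illustrative crosswalk from ISAD(G)-style fields to Dublin Core.
ISAD_TO_DC = {
    "reference_code": "identifier",
    "title": "title",
    "dates": "date",
    "creator_name": "creator",
    "extent_and_medium": "format",
    "scope_and_content": "description",
    "conditions_of_access": "rights",
    "language": "language",
}


def to_dublin_core(isad_record: dict[str, str]) -> dict[str, list[str]]:
    """Map an ISAD(G)-style record to Dublin Core; unmapped or empty fields are skipped."""
    dc: dict[str, list[str]] = {}
    for field, value in isad_record.items():
        target = ISAD_TO_DC.get(field)
        if target and value:
            dc.setdefault(target, []).append(value)
    return dc


record = {
    "reference_code": "EX-0001",
    "title": "Example photograph series",
    "dates": "2023",
    "level_of_description": "file",  # no Dublin Core target in this sketch
    "scope_and_content": "",
}
print(to_dublin_core(record))
```

Dublin Core elements are repeatable, which is why each target holds a list of values.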

This platform is based on a development carried out by the French School of Athens (https://archimage.efa.gr). The project is largely completed and can serve as a basis for other schools. However, adaptations are to be expected for each of them, in particular on the question of articulation with third-party warehouses (Nakala, Dataverse, Zenodo, ...).

Dissemination platforms

Dissemination platforms (catalogues, data editorial portals, web GIS, digital libraries) will be developed on top of the archive platform, in order to separate data from presentation. The objective is to progressively build, at the scale of the network, a set of reusable solutions (or building blocks) accessible on an EFE GitHub, together with a pool of identified skills, in order to support researchers in the exposure / editorialization of their data.