The 5th CNRS Open Science Day showcases free software and text mining


The 5th edition of the CNRS Open Science Day, held today, focuses on two subjects: the morning is dedicated to the role of open source software, and the afternoon to the mining and analysis of textual data in the context of open science.

The 5th CNRS Open Science Day (French link) is taking place on Wednesday November 22nd. The event is dedicated to advances in open science and focuses on the role of free software and text or data mining methods. This third pillar of the CNRS Roadmap for Open Science is central to addressing the current challenges of disseminating, protecting and reusing research results, given the advances made with artificial intelligence learning methods.

Amphitheatre with a large audience. On stage, Antoine Petit at the microphone on the right, in front of a screen showing the title of the day: "Open science: free software and text mining"
Antoine Petit, the CNRS Chairman and CEO, opens the 2023 Open Science Day. ©CNRS

This annual event is an opportunity to take stock of a specific open science theme in the presence of the management teams and scientific councils of the CNRS's ten Institutes and also the organisation's main Scientific Board. The day is organised by the CNRS Open Research Data Department (DDOR) and is an integral part of the CNRS's overall strategy: "Along with the Roadmap for Open Science and the 'Research Data' plan, these Open Science Days are a core element that defines our political action each year", explains Sylvie Rousset, the director of the DDOR.

It also represents the ideal opportunity to discuss the institution's pro-open science position and coordinate priority actions. Alain Schuhl, the CNRS's Deputy CEO for Science, reiterates that it is essential "to involve the Institutes in coming developments and provide them with the right materials and tools to build their disciplinary strategy".

Taking annual stock of the situation to plan tangible action for the future

Each annual event focuses on one of the main objectives set out in the CNRS Roadmap for Open Science – scientific publications in 2020, research assessment (French link) in 2021 and research data (French link) in 2022 – and has produced tangible results (see box text). This year, the subject is text and data mining and analysis. "Every year, CNRS researchers work hard on opening up scientific publications and data. Software dedicated to textual data mining is based on this openness, enabling all the potential of open science to be fully exploited", points out Alain Schuhl.

Productive meetings

These working days provide input for the CNRS's political action, and several initiatives have been implemented as a result. For example, the 2020 day on publications drove support for the HAL open archive and the diamond publication model while also raising awareness of the importance of preprints. The research assessment event in 2021 helped position the CNRS as an active member of the CoARA [1] international coalition since its creation in 2022, with Sylvie Rousset elected to the CoARA board. The CNRS had in fact already signed the DORA [2] declaration in July 2018, committing to rethink assessment criteria to make them more qualitative and to better recognise the diversity of researchers' professions. The organisation also took part in developing the European agreement to reform research assessment.

Several new actions were also implemented this year following on from the 2022 Open Science Day. The CNRS has set up its own CNRS Research Data warehouse within the Recherche Data Gouv ecosystem led by the French Ministry of Higher Education and Research (MESR), and a 'CNRS Research Data' (French link) catalogue has been created to identify trusted thematic repositories [3] for archiving, sharing and re-using data. Finally, a CNRS data management plan template has been made available to help scientists plan their research projects from start to finish.

  • [1] The Coalition on Advancing Research Assessment works to reform research assessment at the international level.
  • [2] The San Francisco Declaration on Research Assessment.
  • [3] The 'CNRS Research Data' catalogue is a directory of data repositories and services which the CNRS manages or allocates resources to: https://cat.opidor.fr/index.php/CNRS_Donn%C3%A9es_de_la_Recherche_:_Catalogue_des_entrep%C3%B4ts_et_des_services

The place of open source software and dissemination models

The morning of the event will be devoted more broadly to open source software. These transdisciplinary collaborative resources are accessible to all and help forge links between research teams and society at large. Source code can be inspected and modified, which means such software can be adapted to specific requirements. The choice of a software forge for sharing code is a matter of sovereignty: it guarantees all users, laboratories or institutions control over the data produced, their conservation and the rules for their re-use. All these points will be discussed at the event.

The 'Open Source Software Valorisation' session provides an overview of best practices in licensing, data protection, and software promotion and dissemination. It aims to give scientists the right resources to share their software and to guide those looking for suitable methods for developing such software. Alain Schuhl reiterates that "since 2021, the CNRS has provided €100K per year to support the Software Heritage initiative which is an international open archive for software source codes".

In 2023, CNRS Innovation, the CNRS's technology transfer structure, launched the OPEN programme (French link) to support research teams in promoting open source software developed in the framework of research projects. Representatives of CNRS Innovation will talk about the various technology transfer models supported by the organisation. Sylvie Rousset thinks "this discussion day will provide an opportunity to meet technology transfer stakeholders, raise awareness of our actions and build bridges with CNRS Innovation. This fledgling initiative involving the DGDI, CNRS Innovation and the DDOR will be substantial and important".

The CNRS also invited a representative from the National Research Institute for Agriculture, Food and Environment to mark the occasion. Every year, the two organisations jointly run a national training event on document mining and information extraction for scientific communities in French higher education and research. "Some participants come back every year to update their knowledge and find out about new software at our practical workshops. It's always a good sign when participants want to come back year after year to reinforce their skills", says Sylvie Rousset.

Five people on stage, one of whom is standing at the microphone on the right
The day includes a round table on the promotion of free software. © CNRS

The CNRS is also working to develop a national strategy with its partners and the MESR. In this way, the Open Science Day dovetails with the Ministry's free software conference on November 29th organised to present the current state of production and development of research software and award the Open Science prizes for free software.

The Software Heritage initiative

Software Heritage (French link) collects, preserves and shares all publicly available software in source code form. Its aim is to build a common, shared infrastructure to benefit industry, research, culture and society as a whole.

Text data mining and large language models (LLMs)

The afternoon is devoted to text and data mining. Scientific production accelerates annually and scientists are increasingly favouring digital solutions for browsing scientific literature, extracting information and producing new knowledge. Alain Schuhl considers that text mining will facilitate innovation to explore, share and reuse research results. "The CNRS is ready to support text mining tools and software and democratise their use in all disciplines", he says.

However, conversational robots like ChatGPT (French link) have now been made available to the general public, raising questions about the relevance of the results these tools can provide. These AI assistants have potential for analysing large data volumes and for extracting information via a user interface that looks like a search engine. They also accept queries in the form of sentences, or 'prompts'. The latest robots are the fruit of years of text data mining work on developing large language models (LLMs), which rely on text data that is sufficiently well indexed to train artificial intelligence. In this field, the CNRS is well placed with the Jean Zay supercomputer (see boxed text).

However, proprietary software currently works like a closed 'black box': it does not expose its source code, so users cannot understand how the algorithm works or the nature of the data its results are based on. This runs contrary to open science requirements that data and software should be accessible for reasons of integrity, replicability and transparency. Several presentations giving an overview of text data mining will be followed by a round table focusing on these very issues.

The Jean Zay supercomputer and the 'Bloom' large language model

Jean Zay is the leading national supercomputer for the artificial intelligence community and is used to solve the most complex scientific problems in climate research, astrophysics and biology. It is located on the Saclay plateau, was developed by HPE (French link) and is operated by the CNRS's Institute for Development and Resources in Intensive Scientific Computing (IDRIS). The large language model 'Bloom' (French link) was trained on the Jean Zay supercomputer. 'Bloom' is an artificial intelligence capable of processing text in 46 languages and reproducing its main information.