Julien Breton Ph.D. thesis defense: “Extraction and formalization of regulatory industrial maintenance knowledge from semi-structured corpus data”


On Friday, 27 June, at 1:30 pm Paris time, Ph.D. candidate Julien Breton defended his thesis, “Extraction and formalization of regulatory industrial maintenance knowledge from semi-structured corpus data”. The defense took place at the IRIT research laboratory in Toulouse, France. A summary follows.

Keywords: Legal compliance, Industrial maintenance, Norm extraction, LLM (Large Language Models), Legal formalization, Semantic models

Summary

In various sectors, particularly in industrial maintenance, compliance with legal documents is imperative. Companies have a legal obligation to monitor, maintain, and repair their equipment in accordance with current regulations. However, these regulations are constantly growing in number and evolving. The difficulty of keeping up with legal developments encourages companies to adopt automated methods to analyze documents and ensure compliance. This automation relies primarily on extracting the norms present in textual documents, a task complicated by the multiplicity and density of legal documents.

Norm extraction methods are mainly based on syntactic rule-based systems; more broadly, legal information extraction relies on models such as Bi-LSTM and BERT. However, these methods all share the same limitation: they require considerable amounts of annotated data, and annotation is a laborious, time-consuming process that involves costly experts. This limitation becomes a problem when annotated datasets are scarce, as in legal industrial maintenance.

To address these challenges, our study proposes the use of large language models (LLMs) to overcome data constraints. We have developed a hybrid system in which an LLM generates synthetic data, the data is refined through syntactic filtering, and the result is distilled into a BERT model. This contribution is supported by the introduction of SEMLEG, a semantic model for formalizing legal provisions dedicated to industrial maintenance. Notably, SEMLEG models the legal evolution and lifecycle of provisions, an element that is absent from the scientific literature.
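The three stages described above (LLM generation, syntactic filtering, preparation of training data for a BERT-style model) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the `fake_llm_generate` stub stands in for a real LLM call, the example sentences are invented, the `DEONTIC_MARKERS` pattern is a toy filter, and the `B-NORM`/`I-NORM` tag names are hypothetical. The final distillation step (fine-tuning a BERT model on the filtered data) is left out.

```python
import re
from dataclasses import dataclass

# Toy pattern for normative markers (assumption, not the thesis's actual rules).
DEONTIC_MARKERS = re.compile(
    r"\b(must|shall|is required to|may not|is prohibited from)\b",
    re.IGNORECASE,
)


@dataclass
class SyntheticExample:
    text: str        # sentence produced by the (stubbed) LLM
    obligation: str  # span the LLM claims is the normative marker


def fake_llm_generate(n: int) -> list[SyntheticExample]:
    """Stand-in for an LLM call: returns invented candidates, some of them noisy."""
    pool = [
        SyntheticExample("The operator must inspect pressure vessels at regular intervals.", "must"),
        SyntheticExample("Lifts shall be serviced by a qualified technician.", "shall"),
        SyntheticExample("The boiler room is located on the ground floor.", "is"),  # no norm
        SyntheticExample("Employers must keep a maintenance register.", "should"),  # wrong span
    ]
    return pool[:n]


def syntactic_filter(examples: list[SyntheticExample]) -> list[SyntheticExample]:
    """Keep only examples whose claimed span agrees with the marker found in the text."""
    kept = []
    for ex in examples:
        match = DEONTIC_MARKERS.search(ex.text)
        if match and match.group(0).lower() == ex.obligation.lower():
            kept.append(ex)
    return kept


def to_bio(ex: SyntheticExample) -> list[tuple[str, str]]:
    """Convert a kept example into (token, tag) pairs for a token classifier."""
    marker_tokens = ex.obligation.lower().split()
    tokens = ex.text.split()
    tags = ["O"] * len(tokens)
    for i in range(len(tokens) - len(marker_tokens) + 1):
        window = [t.lower().strip(".,;") for t in tokens[i:i + len(marker_tokens)]]
        if window == marker_tokens:
            tags[i] = "B-NORM"
            for j in range(i + 1, i + len(marker_tokens)):
                tags[j] = "I-NORM"
            break
    return list(zip(tokens, tags))


candidates = fake_llm_generate(4)
training_set = [to_bio(ex) for ex in syntactic_filter(candidates)]
```

In this sketch the filter discards the two noisy candidates, so only sentences whose generated label agrees with a surface pattern reach the training set. A real system would replace the stub with actual LLM output and use a far richer set of syntactic rules.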

Our experiments reveal that the production of synthetic data via LLM significantly reduces the dependence on annotated data while achieving promising performance. Our hybrid approach, guided by the SEMLEG model, approaches the performance of state-of-the-art models such as BERT while reducing the involvement of legal experts.

The jury was composed of:

Examiners

  • Abrouk Lylia – Rapporteur, University of Bourgogne Europe
  • Mulhem Philippe – Rapporteur, CNRS Alpes
  • Calabretto Sylvie – Examiner, INSA Lyon
  • Sallaberry Christian – Examiner, University of Pau et Pays de l’Adour

Supervisors:

  • Chevalier Max – Director, University of Toulouse
  • Trojahn Dos Santos Cassia – Co-Director, University of Grenoble Alpes
  • Billami Mokhtar Boumedyen – Scientific project manager, Berger-Levrault
Julien Breton receives doctorate degree

Academic perspectives

Short-term academic prospects:

  1. Reliability and improvement of the semantic model: it would be interesting to present it to a larger number of experts and get their feedback.
  2. Extend SEMLEG (the semantic model) to other domains: we focused our study on CMMS, but kept a “high-level” perspective to enable domains such as HR, health, etc., to formalize their legal obligations
  3. Improve the LLM component of our hybrid architecture: mainly work aimed at improving the prompt.

Long-term academic prospects:

  1. Enable full interpretation of legal provisions: in a number of cases, legal provisions are deliberately vague. With the rise of knowledge compression in large language models (LLMs), it would be possible to handle these cases while also taking ethical and societal constraints into account.
  2. Create a CMMS dataset in order to evaluate SEMLEG and our hybrid system in this field
  3. Improve the BERT architecture to better represent and understand long-distance dependencies

Industrial perspectives

  1. Complete the process of connecting legal and software with the following three steps

Scientific Publications:

(2024) Empowering CamemBERT Legal Entity Extraction With LLM Bootstrapping: https://link.springer.com/chapter/10.1007/978-3-031-77792-9_6
(2024) Leveraging Semantic Model and LLM for Bootstrapping a Legal Entity Extraction: An Industrial Use Case: https://ebooks.iospress.nl/doi/10.3233/SSW240004
