Julien Breton Ph.D. thesis defense: “Extraction and formalization of regulatory industrial maintenance knowledge from semi-structured corpus data”


On Friday, 27 June, at 1:30 pm Paris time, Ph.D. candidate Julien Breton defended his thesis, “Extraction and formalization of regulatory industrial maintenance knowledge from semi-structured corpus data”. The defense took place at the IRIT research laboratory in Toulouse, France. A summary follows.

Keywords: Legal compliance, Industrial maintenance, Norm extraction, LLM (Large Language Models), Legal formalization, Semantic models

Summary

In various sectors, particularly in industrial maintenance, compliance with legal documents is imperative. Companies have a legal obligation to monitor, maintain, and repair their equipment in accordance with current regulations. However, these regulations keep growing in number and changing. The difficulty of keeping up with legal developments encourages companies to adopt automation methods to analyze documents and ensure compliance. This automation relies primarily on extracting the norms present in textual documents, a task complicated by the multiplicity and density of legal documents.

Norm extraction methods are mainly based on syntactic rule-based systems; more broadly, legal information extraction relies on models such as Bi-LSTM and BERT. However, these methods all encounter the same limitation: they require considerable amounts of annotated data, and annotation is a laborious, time-consuming process that involves costly experts. This limitation becomes a problem when annotated datasets are scarce, as in legal industrial maintenance.

To address these challenges, our study proposes the use of large language models (LLMs) to overcome data constraints. We have developed a hybrid system in which an LLM generates synthetic data, the data are refined through syntactic filtering, and the result is distilled into a BERT model. This contribution is supported by the introduction of SEMLEG, a semantic model for formalizing legal provisions dedicated to industrial maintenance. Notably, SEMLEG models the legal evolution and lifecycle of provisions, an element that is absent from the scientific literature.
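The first two stages of such a hybrid pipeline can be illustrated with a minimal Python sketch. It is not the thesis implementation: the labeler is a keyword stub standing in for an LLM call, the list of deontic markers is a hypothetical filter rule, and all names (Example, stub_llm_label, syntactic_filter, build_training_set) are assumptions made for illustration. The final distillation step, fine-tuning a BERT-style model on the filtered examples, is omitted.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical deontic markers used as the syntactic filter.
# This is an illustrative assumption, not the rule set used in the thesis.
DEONTIC_PATTERN = re.compile(r"\b(must|shall|is required to|may not)\b", re.IGNORECASE)


@dataclass
class Example:
    sentence: str
    label: str  # "norm" or "other"


def stub_llm_label(sentence: str) -> str:
    """Stand-in for an LLM call; a real system would prompt a model here."""
    text = sentence.lower()
    return "norm" if "inspect" in text or "must" in text else "other"


def syntactic_filter(example: Example) -> bool:
    """Keep an example only if its label agrees with the deontic-marker check."""
    has_marker = DEONTIC_PATTERN.search(example.sentence) is not None
    return has_marker if example.label == "norm" else not has_marker


def build_training_set(
    sentences: List[str],
    labeler: Callable[[str], str] = stub_llm_label,
) -> List[Example]:
    """Label sentences synthetically, then drop labels the filter rejects."""
    synthetic = [Example(s, labeler(s)) for s in sentences]
    return [ex for ex in synthetic if syntactic_filter(ex)]


if __name__ == "__main__":
    corpus = [
        "The operator must inspect pressure vessels every 12 months.",
        "Inspection reports are archived by the supplier.",
        "This decree was published in 2019.",
    ]
    for ex in build_training_set(corpus):
        print(ex.label, "|", ex.sentence)
```

In this toy corpus the stub labels the second sentence as a norm because it contains "inspect", and the filter discards it because no deontic marker supports that label. The surviving examples are the kind of data that would then be used to train the smaller model.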

Our experiments reveal that the production of synthetic data via LLM significantly reduces the dependence on annotated data while achieving promising performance. Our hybrid approach, guided by the SEMLEG model, approaches the performance of state-of-the-art models such as BERT while reducing the involvement of legal experts.

The jury was composed of:

Examiners

Abrouk Lylia – Rapporteur, University of Bourgogne Europe
Mulhem Philippe – Rapporteur, CNRS Alpes
Calabretto Sylvie – Examiner, INSA Lyon
Sallaberry Christian – Examiner, University of Pau et Pays de l’Adour

Supervisors:

Chevalier Max – Director, University of Toulouse
Trojahn Dos Santos Cassia – Co-Director, University of Grenoble Alpes
Billami Mokhtar Boumedyen – Scientific project manager, Berger-Levrault

Photo: Julien Breton receives his doctorate degree.

Academic perspectives

Short-term academic prospects:

Reliability and improvement of the semantic model: it would be interesting to present it to a larger number of experts and get their feedback.
Extend SEMLEG (the semantic model) to other domains: we focused our study on CMMS, but kept a “high-level” perspective so that domains such as HR, health, etc., can formalize their legal obligations.
Improve the LLM brick in our hybrid architecture: mainly work aimed at improving the prompt.

Long-term academic prospects:

Enable full interpretation of legal provisions: in a number of cases, legal provisions are deliberately vague. With the rise of knowledge compression in large language models (LLMs), it would be possible to handle these cases while also taking ethical and societal constraints into account.
Create a CMMS dataset in order to evaluate SEMLEG and our hybrid system in this field.
Improve the BERT architecture to better represent and understand text over long distances.

Industrial perspectives

Complete the process of connecting legal and software with the following three steps

Scientific Publications:

(2024) Empowering CamemBERT Legal Entity Extraction With LLM Bootstrapping: https://link.springer.com/chapter/10.1007/978-3-031-77792-9_6
(2024) Leveraging Semantic Model and LLM for Bootstrapping a Legal Entity Extraction: An Industrial Use Case: https://ebooks.iospress.nl/doi/10.3233/SSW240004
