Thursday 26th of September at 10h a.m. Paris time, Camille Gosset, Ph.D. Candidate has defended her thesis named “Methods and Models for the Automated Construction of Knowledge Graphs in the Legal Domain: Application to the Legal and Legal-Practical Resources of Local and Regional Authorities”. Her thesis defense took place at the LIRM (Laboratory of Computer Science, Robotics and Microelectronics) in Montpellier, France. Take a look at the summary below :
Summary
This thesis examines the construction of knowledge graphs from texts, focusing primarily on information extraction from unstructured texts. The objective of this work is to explore various aspects of information extraction from specialized corpora in the legal domain. To this end, we divide this study into two subtasks: terminology extraction and relation extraction.
Terminology extraction aims to automatically identify relevant terms in a given corpus of texts. Subsequently, from these terms, we extract the relations that link them. For relation extraction, two approaches are conceivable : either by determining in advance the types of relations that structure the terms, or by using the context, notably the verbs or other actions that link these terms (in the field of OpenIE). Thus, we address three main sub-problems. We introduce the terminology extraction system InfoGlean KeyTerms, composed of three modules : one for named entity recognition (NER), one for the extraction of relevant terms/text segments (KPE), and a final one for the extraction of legal entities. Expert annotations were provided in addition to this system to constitute a terminological base. After building this terminological base, we implemented two relation extraction systems: Relational Embeddings Model (REM) and GPT Open Relation EXtraction (GOREX). REM identifies typed relations between the extracted terms using the lexical network rezo-JDM. REM represents the relation pairs using a Word2Vec model, then classifies the types of relations. GOREX, on the other hand, exploits the principle of OpenIE by focusing on the verbs or action terms in the local context of the terms. GOREX uses LLMs to perform this task.
The analysis of the results revealed promising research avenues to be explored in future work for all systems. More specifically, the implementation of a hybrid relation extraction system could be an interesting path to explore.
The jury was composed of:
Examiners
- Sandra BRINGAY, University Professor, Université Paul Valéry, LIRMM
- Marianne HUCHARD, University Professor, University of Montpellier, LIRMM
- Didier SCHWAB, University Professor, LIG, UMR 5217
- Cassia TROJAHN, Senior Lecturer (HDR), UT2, IRIT – UMR 5505
Supervisors :
- Mokhtar Boumedyen BILLAMI, Berger-Levrault, co-supervisor
- Mathieu LAFOURCADE, Senior Lecturer (HDR), University of Montpellier, Thesis Director
And after?
Exploiting the knowledge present in raw text is difficult. This is why
knowledge needs to be structured for rapid access.
For Berger-Levrault, publisher of business software for local authorities and industry alike, structuring knowledge has become a major challenge.
Berger-Levrault develops software solutions specializing in Automatic Natural Language Processing (ANLP), such as search engines, response engines, automated legal intelligence and many others. These software solutions therefore need to exploit Berger-Levrault’s knowledge of textual data. Until now, however, they have been limited to the full extent of the raw knowledge. It is therefore necessary to structure this knowledge so that it can be used to enrich applications and thus improve them. This thesis focuses on two aspects in particular: (1) The representation of all knowledge through a “knowledge structure”; and (2) the use of this knowledge to improve the performance of several of its products.