News
- 04/01/2026 “What Matters to an LLM? Behavioral and Computational Evidences from Summarization” has been accepted to the Findings of EACL 2026.
- 01/12/2025 Check out our new preprint "TempPerturb-Eval: On the Joint Effects of Internal Temperature and External Perturbations in RAG Robustness" on arXiv.
- 02/11/2025 Our paper received the Best Evaluation Award at INLG 2025!
|
02/11/2025 Can GPT models Follow Human Summarization Guidelines? A Study for Targeted Communication Goals
@INLG 2025, Hanoi, Vietnam
|
23/05/2024 PSentScore: Evaluating Sentiment Polarity in Dialogue Summarization
@LREC-COLING 2024, Torino, Italia
|
21/07/2020 XAI pour l'évaluation de l'IA (Craft AI internship seminar)
@Craft AI, Paris, France
|
05/2020 - 10/2020 Research Intern in the Department of Artificial Intelligence Evaluation
@Laboratoire national de métrologie et d’essais (LNE), Trappes, France
|
10/2021 - 12/2021
Advanced models of machine learning (exercise classes, M2)
(and 09/2022 - 12/2022)
@ Master INDUSTRIES DE LA LANGUE - Université Grenoble Alpes | GitHub
10/2021 - 12/2021
Automatic text generation (exercise classes, M2)
(and 09/2022 - 12/2022)
@ Master INDUSTRIES DE LA LANGUE - Université Grenoble Alpes | GitHub
|
|
Can GPT models Follow Human Summarization Guidelines? A Study for Targeted Communication Goals
Yongxin Zhou, Fabien Ringeval, François Portet
INLG, 2025
🏆
Best Evaluation Award
🎤
Oral Presentation
This study investigates the ability of GPT models (ChatGPT, GPT-4 and GPT-4o) to generate dialogue summaries that adhere to human guidelines. Our evaluation involved experimenting with various prompts to guide the models in complying with guidelines on two datasets: DialogSum (English social conversations) and DECODA (French call center interactions). Human evaluation, based on summarization guidelines, served as the primary assessment method, complemented by extensive quantitative and qualitative analyses. Our findings reveal a preference for GPT-generated summaries over those from task-specific pre-trained models and reference summaries, highlighting GPT models’ ability to follow human guidelines despite occasionally producing longer outputs and exhibiting divergent lexical and structural alignment with references. The discrepancy between ROUGE, BERTScore, and human evaluation underscores the need for more reliable automatic evaluation metrics.
@inproceedings{zhou-etal-2025-gpt,
title = "Can {GPT} models Follow Human Summarization Guidelines? A Study for Targeted Communication Goals",
author = "Zhou, Yongxin and
Ringeval, Fabien and
Portet, Fran{\c{c}}ois",
editor = "Flek, Lucie and
Narayan, Shashi and
Phương, L{\^e} Hồng and
Pei, Jiahuan",
booktitle = "Proceedings of the 18th International Natural Language Generation Conference",
month = oct,
year = "2025",
address = "Hanoi, Vietnam",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.inlg-main.17/",
pages = "249--273",
abstract = "This study investigates the ability of GPT models (ChatGPT, GPT-4 and GPT-4o) to generate dialogue summaries that adhere to human guidelines. Our evaluation involved experimenting with various prompts to guide the models in complying with guidelines on two datasets: DialogSum (English social conversations) and DECODA (French call center interactions). Human evaluation, based on summarization guidelines, served as the primary assessment method, complemented by extensive quantitative and qualitative analyses. Our findings reveal a preference for GPT-generated summaries over those from task-specific pre-trained models and reference summaries, highlighting GPT models' ability to follow human guidelines despite occasionally producing longer outputs and exhibiting divergent lexical and structural alignment with references. The discrepancy between ROUGE, BERTScore, and human evaluation underscores the need for more reliable automatic evaluation metrics."
}
|
|
PSentScore: Evaluating Sentiment Polarity in Dialogue Summarization
Yongxin Zhou, Fabien Ringeval, François Portet
LREC-COLING, 2024
🎤
Oral Presentation
Automatic dialogue summarization is a well-established task with the goal of distilling the most crucial information from human conversations into concise textual summaries. However, most existing research has predominantly focused on summarizing factual information, neglecting the affective content, which can hold valuable insights for analyzing, monitoring, or facilitating human interactions. In this paper, we introduce and assess a set of measures PSentScore, aimed at quantifying the preservation of affective content in dialogue summaries. Our findings indicate that state-of-the-art summarization models do not preserve well the affective content within their summaries. Moreover, we demonstrate that a careful selection of the training set for dialogue samples can lead to improved preservation of affective content in the generated summaries, albeit with a minor reduction in content-related metrics.
@inproceedings{zhou-etal-2024-psentscore-evaluating,
title = "{PS}ent{S}core: Evaluating Sentiment Polarity in Dialogue Summarization",
author = "Zhou, Yongxin and
Ringeval, Fabien and
Portet, Fran{\c{c}}ois",
editor = "Calzolari, Nicoletta and
Kan, Min-Yen and
Hoste, Veronique and
Lenci, Alessandro and
Sakti, Sakriani and
Xue, Nianwen",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.lrec-main.1163",
pages = "13290--13302",
}
|
|
Jargon: A Suite of Language Models and Evaluation Tasks for French Specialized Domains
Vincent Segonne, Aidan Mannion, Laura Cristina Alonzo Canul, Alexandre Daniel Audibert, Xingyu Liu, Cécile Macaire, Adrien Pupier, Yongxin Zhou, Mathilde Aguiar, Felix E. Herron, Magali Norré, Massih R Amini, Pierrette Bouillon, Iris Eshkol-Taravella, Emmanuelle Esperança-Rodier, Thomas François, Lorraine Goeuriot, Jérôme Goulian, Mathieu Lafourcade, Benjamin Lecouteux, François Portet, Fabien Ringeval, Vincent Vandeghinste, Maximin Coavoux, Marco Dinarelli, Didier Schwab
LREC-COLING, 2024
Pretrained Language Models (PLMs) are the de facto backbone of most state-of-the-art NLP systems. In this paper, we introduce a family of domain-specific pretrained PLMs for French, focusing on three important domains: transcribed speech, medicine, and law. We use a transformer architecture based on efficient methods (LinFormer) to maximise their utility, since these domains often involve processing long documents. We evaluate and compare our models to state-of-the-art models on a diverse set of tasks and datasets, some of which are introduced in this paper. We gather the datasets into a new French-language evaluation benchmark for these three domains. We also compare various training configurations: continued pretraining, pretraining from scratch, as well as single- and multi-domain pretraining. Extensive domain-specific experiments show that it is possible to attain competitive downstream performance even when pre-training with the approximative LinFormer attention mechanism. For full reproducibility, we release the models and pretraining data, as well as contributed datasets.
@inproceedings{segonne-etal-2024-jargon,
title = "Jargon: A Suite of Language Models and Evaluation Tasks for {F}rench Specialized Domains",
author = "Segonne, Vincent and
Mannion, Aidan and
Alonzo Canul, Laura Cristina and
Audibert, Alexandre Daniel and
Liu, Xingyu and
Macaire, C{\'e}cile and
Pupier, Adrien and
Zhou, Yongxin and
Aguiar, Mathilde and
Herron, Felix E. and
Norr{\'e}, Magali and
Amini, Massih R and
Bouillon, Pierrette and
Eshkol-Taravella, Iris and
Esperan{\c{c}}a-Rodier, Emmanuelle and
Fran{\c{c}}ois, Thomas and
Goeuriot, Lorraine and
Goulian, J{\'e}r{\^o}me and
Lafourcade, Mathieu and
Lecouteux, Benjamin and
Portet, Fran{\c{c}}ois and
Ringeval, Fabien and
Vandeghinste, Vincent and
Coavoux, Maximin and
Dinarelli, Marco and
Schwab, Didier",
editor = "Calzolari, Nicoletta and
Kan, Min-Yen and
Hoste, Veronique and
Lenci, Alessandro and
Sakti, Sakriani and
Xue, Nianwen",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.lrec-main.827/",
pages = "9463--9476",
}
|
|
A Survey of Evaluation Methods of Generated Medical Textual Reports
Yongxin Zhou, Fabien Ringeval, François Portet
ACL - ClinicalNLP, 2023
Medical Report Generation (MRG) is a sub-task of Natural Language Generation (NLG) and aims to present information from various sources in textual form and synthesize salient information, with the goal of reducing the time spent by domain experts in writing medical reports and providing support information for decision-making. Given the specificity of the medical domain, the evaluation of automatically generated medical reports is of paramount importance to the validity of these systems. Therefore, in this paper, we focus on the evaluation of automatically generated medical reports from the perspective of automatic and human evaluation. We present evaluation methods for general NLG evaluation and how they have been applied to domain-specific medical tasks. The study shows that MRG evaluation methods are very diverse, and that further work is needed to build shared evaluation methods. The state of the art also emphasizes that such an evaluation must be task specific and include human assessments, requesting the participation of experts in the field.
@inproceedings{zhou-etal-2023-survey,
title = "A Survey of Evaluation Methods of Generated Medical Textual Reports",
author = "Zhou, Yongxin and
Ringeval, Fabien and
Portet, Fran{\c{c}}ois",
booktitle = "Proceedings of the 5th Clinical Natural Language Processing Workshop",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.clinicalnlp-1.48",
doi = "10.18653/v1/2023.clinicalnlp-1.48",
pages = "447--459",
}
|
|
Effectiveness of French Language Models on Abstractive Dialogue Summarization Task
Yongxin Zhou, François Portet, Fabien Ringeval
LREC, 2022
Pre-trained language models have established the state-of-the-art on various natural language processing tasks, including dialogue summarization, which allows the reader to quickly access key information from long conversations in meetings, interviews or phone calls. However, such dialogues are still difficult to handle with current models because the spontaneity of the language involves expressions that are rarely present in the corpora used for pre-training the language models. Moreover, the vast majority of the work accomplished in this field has been focused on English. In this work, we present a study on the summarization of spontaneous oral dialogues in French using several language specific pre-trained models: BARThez, and BelGPT-2, as well as multilingual pre-trained models: mBART, mBARThez, and mT5. Experiments were performed on the DECODA (Call Center) dialogue corpus whose task is to generate abstractive synopses from call center conversations between a caller and one or several agents depending on the situation. Results show that the BARThez models offer the best performance far above the previous state-of-the-art on DECODA. We further discuss the limits of such pre-trained models and the challenges that must be addressed for summarizing spontaneous dialogues.
@InProceedings{zhou-portet-ringeval:2022:LREC,
author = {Zhou, Yongxin and Portet, Fran{\c{c}}ois and Ringeval, Fabien},
title = {Effectiveness of French Language Models on Abstractive Dialogue Summarization Task},
booktitle = {Proceedings of the Language Resources and Evaluation Conference},
month = jun,
year = {2022},
address = {Marseille, France},
publisher = {European Language Resources Association},
pages = {3571--3581},
url = {https://aclanthology.org/2022.lrec-1.382}
}
|
|
THERADIA: Digital Therapies Augmented by Artificial Intelligence
Franck Tarpin-Bernard, Joan Fruitet, Jean-Philippe Vigne, Patrick Constant, Hanna Chainay, Olivier Koenig, Fabien Ringeval, Béatrice Bouchot, Gérard Bailly, François Portet, Sina Alisamir, Yongxin Zhou, Jean Serre, Vincent Delerue, Hippolyte Fournier, Kévin Berenger, Isabella Zsoldos, Olivier Perrotin, Frédéric Elisei, Martin Lenglet, Charles Puaux, Léo Pacheco, Mélodie Fouillen, Didier Ghenassia
AHFE, 2021
Digital plays a key role in the transformation of medicine. Beyond the simple computerisation of healthcare systems, many non-drug treatments are now possible thanks to digital technology. Thus, interactive stimulation exercises can be offered to people suffering from cognitive disorders, such as developmental disorders, neurodegenerative diseases, stroke or traumas. The efficiency of these new treatments, which are still primarily offered face-to-face by therapists, can be greatly improved if patients can pursue them at home. However, patients are left to their own devices which can be problematic. We introduce THERADIA, a 5-year project that aims to develop an empathic virtual agent that accompanies patients while receiving digital therapies at home, and that provides feedback to therapists and caregivers. We detail the architecture of our agent as well as the framework of our Wizard-of-Oz protocol, designed to collect a large corpus of interactions between people and our virtual assistant in order to train our models and improve our dialogues.
@inproceedings{tarpin2021theradia,
title={THERADIA: Digital Therapies Augmented by Artificial Intelligence},
author={Tarpin-Bernard, Franck and Fruitet, Joan and Vigne, Jean-Philippe and Constant, Patrick and Chainay, Hanna and Koenig, Olivier and Ringeval, Fabien and Bouchot, B{\'e}atrice and Bailly, G{\'e}rard and Portet, Fran{\c{c}}ois and others},
booktitle={International Conference on Applied Human Factors and Ergonomics},
pages={478--485},
year={2021},
organization={Springer}
}
|
|
Yongxin Zhou, Matthieu Boussard, Agnes Delaborde
AAMAS - EXTRAAMAS, 2021
|
Explicabilité par Perturbations pour les Systèmes RAG
Yongxin Zhou, Philippe Mulhem, Didier Schwab
DIAG-LLM@CORIA-TALN, 2025
|
Exploration de caractéristiques linguistiques et acoustiques pour la génération automatique de rapports de séances de remédiation cognitive avec un assistant virtuel
Yongxin Zhou, Fabien Ringeval, François Portet
JPC, 2023 - 9èmes Journées de Phonétique Clinique
@article{zhouexploration,
title={Exploration de caract{\'e}ristiques linguistiques et acoustiques pour la g{\'e}n{\'e}ration automatique de rapports de s{\'e}ances de rem{\'e}diation cognitive avec un assistant virtuel},
author={Zhou, Yongxin and Ringeval, Fabien and Portet, Fran{\c{c}}ois},
journal={9{\`e}mes Journ{\'e}es de Phon{\'e}tique Clinique},
pages={117},
year={2023}
}
|
- Conference Reviewer
- ACL Rolling Review, COLING (2025), LREC-COLING (2024), ACII (2023), GEM (2022, 2023, 2025), LREC (2022, 2026)
- Conference Area Chair
- Volunteer
- PhD volunteer at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML PKDD 2022
- Virtual volunteer at the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022
- Volunteer at the Conference and Labs of the Evaluation Forum, CLEF 2024
- Mentorship
- 02/2021 - 07/2021, Supervision of Master Internship on Natural Language Grounding through Dense Video Captioning, Multi3Generation
- Teaching Assistant at the second Advanced Language Processing School, ALPS 2022
- Organizer
- Organisation of social activities at the first Advanced Language Processing School, ALPS 2021
- DIAG-LLM workshop for CORIA-TALN 2025
|
|