Recent Publications
A Causal Framework for Evaluating Deferring Systems
Filippo Palomba, Andrea Pugnana, Jose Manuel Alvarez, and Salvatore Ruggieri.
In Proceedings of The 28th International Conference on Artificial Intelligence and Statistics (AISTATS '25), 2025.
[paper]
[bibtex]
[abstract]
Deferring systems extend supervised Machine Learning (ML) models with the possibility to defer predictions to human experts. However, evaluating the impact of a deferring strategy on system accuracy is still an overlooked area. This paper fills this gap by evaluating deferring systems through a causal lens. We link the potential outcomes framework for causal inference with deferring systems, which allows us to identify the causal impact of the deferring strategy on predictive accuracy. We distinguish two scenarios. In the first one, we have access to both the human and ML model predictions for the deferred instances. Here, we can identify the individual causal effects for deferred instances and their aggregates. In the second one, only human predictions are available for the deferred instances. Here, we can resort to regression discontinuity design to estimate a local causal effect. We evaluate our approach on synthetic and real datasets for seven deferring systems from the literature.
@inproceedings {pmlr-v258-palomba25a,
author = { Palomba, Filippo and Pugnana, Andrea and Alvarez, Jose Manuel and Ruggieri, Salvatore },
title = "A Causal Framework for Evaluating Deferring Systems",
booktitle = "Proceedings of The 28th International Conference on Artificial Intelligence and Statistics",
pages = "2143--2151",
year = "2025",
volume = "258",
series = "AISTATS '25",
month = "03--05 May",
publisher = "PMLR",
pdf = "https://raw.githubusercontent.com/mlresearch/v258/main/assets/palomba25a/palomba25a.pdf",
url = "https://proceedings.mlr.press/v258/palomba25a.html",
}
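The second evaluation scenario above, where only human predictions are available for the deferred instances, hinges on a regression discontinuity design around the deferral threshold. As a rough, self-contained sketch of that idea (not the authors' code: the simulated data, cutoff, and bandwidth below are illustrative assumptions), one can compare local linear fits of correctness just below and just above a confidence cutoff:

# Minimal regression-discontinuity sketch for a confidence-based deferring rule
# (illustrative only; not the paper's implementation).
import numpy as np

rng = np.random.default_rng(0)
n, tau, bw = 5000, 0.5, 0.1                    # sample size, deferral cutoff, bandwidth

conf = rng.uniform(0, 1, n)                    # model confidence (running variable)
deferred = conf < tau                          # defer low-confidence instances to the human
human_acc = 0.72 + 0.10 * conf                 # simulated accuracy curves
model_acc = 0.55 + 0.30 * conf
correct = np.where(deferred,
                   rng.random(n) < human_acc,
                   rng.random(n) < model_acc).astype(float)

def intercept_at_cutoff(x, y, cutoff):
    """Local linear fit y ~ 1 + (x - cutoff); the intercept estimates E[y | x = cutoff]."""
    X = np.column_stack([np.ones_like(x), x - cutoff])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[0]

left = deferred & (conf > tau - bw)            # just below the cutoff: human answers
right = ~deferred & (conf < tau + bw)          # just above the cutoff: model answers
local_effect = (intercept_at_cutoff(conf[left], correct[left], tau)
                - intercept_at_cutoff(conf[right], correct[right], tau))
print(f"estimated local effect of deferring at the cutoff: {local_effect:+.3f}")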
Can LLMs Correct Physicians, Yet? Investigating Effective Interaction Methods in the Medical Domain
Burcu Sayin, Pasquale Minervini, Jacopo Staiano, and Andrea Passerini.
In Proceedings of the 6th Clinical Natural Language Processing Workshop, 2024.
[paper]
[bibtex]
[abstract]
We explore the potential of Large Language Models (LLMs) to assist and potentially correct physicians in medical decision-making tasks. We evaluate several LLMs, including Meditron, Llama2, and Mistral, to analyze the ability of these models to interact effectively with physicians across different scenarios. We consider questions from PubMedQA and several tasks, ranging from binary (yes/no) responses to long answer generation, where the answer of the model is produced after an interaction with a physician. Our findings suggest that prompt design significantly influences the downstream accuracy of LLMs and that LLMs can provide valuable feedback to physicians, challenging incorrect diagnoses and contributing to more accurate decision-making. For example, when the physician is accurate 38% of the time, Mistral can produce the correct answer, improving accuracy up to 74% depending on the prompt being used, while Llama2 and Meditron models exhibit greater sensitivity to prompt choice. Our analysis also uncovers the challenges of ensuring that LLM-generated suggestions are pertinent and useful, emphasizing the need for further research in this area.
@inproceedings {sayin-etal-2024-llms,
author = { Sayin, Burcu and Minervini, Pasquale and Staiano, Jacopo and Passerini, Andrea },
title = "Can {LLM}s Correct Physicians, Yet? Investigating Effective Interaction Methods in the Medical Domain",
booktitle = "Proceedings of the 6th Clinical Natural Language Processing Workshop",
month = "June",
year = "2024",
address = "Mexico City, Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.clinicalnlp-1.19",
doi = "10.18653/v1/2024.clinicalnlp-1.19",
pages = "218--237",
}
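A rough illustration of the interaction setting studied in the paper: the model is shown a PubMedQA-style question together with the physician's tentative answer and asked to confirm or challenge it. The prompt wording, the helper function, and the example question below are hypothetical; the paper compares several prompt designs across Meditron, Llama2, and Mistral.

# Hypothetical prompt construction for the physician-in-the-loop setting
# (the authors' actual prompts differ; this only sketches the interaction shape).
def build_interaction_prompt(question: str, context: str, physician_answer: str) -> str:
    """Ask the LLM to confirm or challenge a physician's yes/no answer."""
    return (
        "You are assisting a physician with a clinical question.\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
        f"The physician's tentative answer is: {physician_answer}.\n"
        "Answer 'yes' or 'no', and briefly explain if you disagree with the physician."
    )

prompt = build_interaction_prompt(
    question="Does the intervention described in the abstract reduce mortality?",  # placeholder
    context="(retrieved PubMedQA abstract goes here)",
    physician_answer="no",
)
print(prompt)  # send to Mistral / Llama2 / Meditron with your preferred inference stack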
Machine learning for microbiologists
Francesco Asnicar, A. Morgan Thomas, Andrea Passerini, Levi Waldron, and Nicola Segata.
In Nature Reviews Microbiology 22(4), 2024.
@article {asnicar2024machine,
author = { Asnicar, Francesco and Thomas, A. Morgan and Passerini, Andrea and Waldron, Levi and Segata, Nicola },
title = "Machine learning for microbiologists",
journal = "Nature Reviews Microbiology",
volume = "22",
number = "4",
pages = "191--205",
year = "2024",
doi = "10.1038/s41579-023-00984-1",
url = "https://doi.org/10.1038/s41579-023-00984-1",
}
Preference Elicitation in Interactive and User-centered Algorithmic Recourse: an Initial Exploration
Seyedehdelaram Esfahani, Giovanni De Toni, Bruno Lepri, Andrea Passerini, Katya Tentori, and Massimo Zancanaro.
In Proceedings of the 32nd ACM Conference on User Modeling, Adaptation and Personalization (UMAP '24), 2024.
[paper]
[bibtex]
[abstract]
Algorithmic Recourse aims to provide actionable explanations, or recourse plans, to overturn potentially unfavourable decisions taken by automated machine learning models. In this paper, we propose an interaction paradigm based on a guided interaction pattern aimed at both eliciting the users’ preferences and heading them toward effective recourse interventions. In a fictional task of money lending, we compare this approach with an exploratory interaction pattern based on a combination of alternative plans and the possibility of freely changing the configurations by the users themselves. Our results suggest that users may recognize that the guided interaction paradigm improves efficiency. However, they also feel less freedom to experiment with “what-if” scenarios. Nevertheless, the time spent on the purely exploratory interface tends to be perceived as a lack of efficiency, which reduces attractiveness, perspicuity, and dependability. Conversely, for the guided interface, more time on the interface seems to increase its attractiveness, perspicuity, and dependability while not impacting the perceived efficiency. This might suggest that this type of interface should combine the two approaches, supporting exploratory behavior while gently pushing toward a guided, effective solution.
@inproceedings {umap2024,
author = { Esfahani, Seyedehdelaram and De Toni, Giovanni and Lepri, Bruno and Passerini, Andrea and Tentori, Katya and Zancanaro, Massimo },
title = "Preference Elicitation in Interactive and User-centered Algorithmic Recourse: an Initial Exploration",
year = "2024",
isbn = "9798400704338",
publisher = "Association for Computing Machinery",
address = "New York, NY, USA",
url = "https://doi.org/10.1145/3627043.3659556",
doi = "10.1145/3627043.3659556",
booktitle = "Proceedings of the 32nd ACM Conference on User Modeling, Adaptation and Personalization",
pages = "249–254",
numpages = "6",
keywords = "Algorithmic Recourse, Counterfactual Examples, Human-centred AI",
location = "Cagliari, Italy",
series = "UMAP '24",
}
A Neuro-Symbolic Benchmark Suite for Concept Quality and Reasoning Shortcuts
Samuele Bortolotti, Emanuele Marconato, Tommaso Carraro, Paolo Morettin, Emile van Krieken, Antonio Vergari, Stefano Teso, and Andrea Passerini.
In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.
[paper]
[bibtex]
[abstract]
[code]
The advent of powerful neural classifiers has increased interest in problems that require both learning and reasoning. These problems are critical for understanding important properties of models, such as trustworthiness, generalization, interpretability, and compliance to safety and structural constraints. However, recent research observed that tasks requiring both learning and reasoning on background knowledge often suffer from reasoning shortcuts (RSs): predictors can solve the downstream reasoning task without associating the correct concepts to the high-dimensional data. To address this issue, we introduce rsbench, a comprehensive benchmark suite designed to systematically evaluate the impact of RSs on models by providing easy access to highly customizable tasks affected by RSs. Furthermore, rsbench implements common metrics for evaluating concept quality and introduces novel formal verification procedures for assessing the presence of RSs in learning tasks. Using rsbench, we highlight that obtaining high quality concepts in both purely neural and neuro-symbolic models is a far-from-solved problem. rsbench is available at: https://unitn-sml.github.io/rsbench.
@inproceedings {bortolotti2024benchmark,
author = { Bortolotti, Samuele and Marconato, Emanuele and Carraro, Tommaso and Morettin, Paolo and van Krieken, Emile and Vergari, Antonio and Teso, Stefano and Passerini, Andrea },
title = "A Neuro-Symbolic Benchmark Suite for Concept Quality and Reasoning Shortcuts",
booktitle = "The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track",
year = "2024",
url = "https://openreview.net/forum?id=5VtI484yVy",
code = "https://github.com/unitn-sml/rsbench-code",
}
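To make the notion of a reasoning shortcut concrete, consider an MNIST-addition-style task in which only the sum of two digit concepts is supervised: several concept assignments explain the same label, so a predictor can fit the task while grounding the wrong concepts. The enumeration below is a generic toy sketch and does not use the rsbench API.

# Toy illustration of a reasoning shortcut under sum-only supervision
# (generic sketch; rsbench provides full, customizable tasks and metrics).
from itertools import product

digits = range(4)                 # toy concept space: digits 0..3
target_sum = 3                    # the only label the predictor is trained on
intended = (1, 2)                 # ground-truth concepts behind this example

consistent = [(a, b) for a, b in product(digits, repeat=2) if a + b == target_sum]
shortcuts = [pair for pair in consistent if pair != intended]

print("pairs satisfying a + b =", target_sum, ":", consistent)
print("shortcut solutions:", shortcuts)
# A model that maps the inputs to (0, 3) or (3, 0) predicts the label correctly
# while attaching the wrong semantics to the concepts -- a reasoning shortcut.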
BEARS Make Neuro-Symbolic Models Aware of their Reasoning Shortcuts
Emanuele Marconato, Samuele Bortolotti, Emile van Krieken, Antonio Vergari, Andrea Passerini, and Stefano Teso.
In The 40th Conference on Uncertainty in Artificial Intelligence, 2024.
[paper]
[bibtex]
[abstract]
[code]
Neuro-Symbolic (NeSy) predictors that conform to symbolic knowledge – encoding, e.g., safety constraints – can be affected by Reasoning Shortcuts (RSs): They learn concepts consistent with the symbolic knowledge by exploiting unintended semantics. RSs compromise reliability and generalization and, as we show in this paper, they are linked to NeSy models being overconfident about the predicted concepts. Unfortunately, the only trustworthy mitigation strategy requires collecting costly dense supervision over the concepts. Rather than attempting to avoid RSs altogether, we propose to ensure NeSy models are aware of the semantic ambiguity of the concepts they learn, thus enabling their users to identify and distrust low-quality concepts. Starting from three simple desiderata, we derive bears (BE Aware of Reasoning Shortcuts), an ensembling technique that calibrates the model’s concept-level confidence without compromising prediction accuracy, thus encouraging NeSy architectures to be uncertain about concepts affected by RSs. We show empirically that bears improves RS-awareness of several state-of-the-art NeSy models, and also facilitates acquiring informative dense annotations for mitigation purposes.
@inproceedings {marconato2024bears,
author = { Marconato, Emanuele and Bortolotti, Samuele and van Krieken, Emile and Vergari, Antonio and Passerini, Andrea and Teso, Stefano },
title = "{BEARS} Make Neuro-Symbolic Models Aware of their Reasoning Shortcuts",
booktitle = "The 40th Conference on Uncertainty in Artificial Intelligence",
year = "2024",
code = "https://github.com/samuelebortolotti/bears",
url = "https://openreview.net/forum?id=pDcM1k7mgZ",
}
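The abstract's core idea, namely making a NeSy model uncertain about concepts affected by reasoning shortcuts, can be pictured with a much-simplified ensemble sketch: average the concept probabilities of several members and report the entropy of the average, which is high exactly where the members disagree. This is an assumed simplification, not the authors' implementation.

# Simplified concept-level ensemble uncertainty (not the official bears code).
import numpy as np

def concept_uncertainty(member_probs: np.ndarray):
    """member_probs: (n_members, n_concepts) Bernoulli probabilities per concept.
    Returns the ensemble-averaged probabilities and their binary entropy (in bits)."""
    p = member_probs.mean(axis=0)
    eps = 1e-12
    entropy = -(p * np.log2(p + eps) + (1 - p) * np.log2(1 - p + eps))
    return p, entropy

# Three members agree on concept 0 but disagree on concept 1 (shortcut-affected).
members = np.array([[0.95, 0.90],
                    [0.97, 0.15],
                    [0.94, 0.55]])
p, h = concept_uncertainty(members)
print("avg concept probs :", p.round(2))   # [0.95 0.53]
print("concept entropies :", h.round(2))   # low for concept 0, close to 1 bit for concept 1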
Unveiling LLMs: The Evolution of Latent Representations in a Dynamic Knowledge Graph
Marco Bronzini, Carlo Nicolini, Bruno Lepri, Jacopo Staiano, and Andrea Passerini.
In First Conference on Language Modeling, 2024.
[paper]
[bibtex]
[abstract]
[code]
Large Language Models (LLMs) demonstrate an impressive capacity to recall a vast range of factual knowledge. However, understanding their underlying reasoning and internal mechanisms in exploiting this knowledge remains a key research area. This work unveils the factual information an LLM represents internally for sentence-level claim verification. We propose an end-to-end framework to decode factual knowledge embedded in token representations from a vector space to a set of ground predicates, showing its layer-wise evolution using a dynamic knowledge graph. Our framework employs activation patching, a vector-level technique that alters a token representation during inference, to extract encoded knowledge. Accordingly, we neither rely on training nor external models. Using factual and common-sense claims from two claim verification datasets, we showcase interpretability analyses at local and global levels. The local analysis highlights entity centrality in LLM reasoning, from claim-related information and multi-hop reasoning to representation errors causing erroneous evaluation. On the other hand, the global analysis reveals trends in the underlying evolution, such as word-based knowledge evolving into claim-related facts. By interpreting semantics from LLM latent representations and enabling graph-related analyses, this work enhances the understanding of the factual knowledge resolution process.
@inproceedings {bronziniunveiling,
author = { Bronzini, Marco and Nicolini, Carlo and Lepri, Bruno and Staiano, Jacopo and Passerini, Andrea },
title = "Unveiling LLMs: The Evolution of Latent Representations in a Dynamic Knowledge Graph",
booktitle = "First Conference on Language Modeling",
year = "2024",
url = "https://openreview.net/forum?id=dWYRjT501w",
code = "https://github.com/Ipazia-AI/latent-explorer",
}
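Activation patching, the vector-level technique mentioned in the abstract, replaces an intermediate representation during a forward pass and observes how the output changes. The PyTorch toy below shows only the mechanics via forward hooks; the layer choice, the LLM internals, and the decoding of representations into ground predicates are specific to the paper's framework (see the latent-explorer repository) and are not reproduced here.

# Generic activation-patching mechanics on a toy network (not the paper's code).
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4)).eval()
x_source, x_target = torch.randn(1, 8), torch.randn(1, 8)

# 1) Cache the intermediate activation produced by the source input.
cache = {}
handle = model[1].register_forward_hook(lambda m, i, o: cache.update(act=o.detach()))
with torch.no_grad():
    model(x_source)
handle.remove()

# 2) Run the target input as-is, then again with the cached activation patched in
#    (returning a value from a forward hook overrides that module's output).
with torch.no_grad():
    out_plain = model(x_target)
handle = model[1].register_forward_hook(lambda m, i, o: cache["act"])
with torch.no_grad():
    out_patched = model(x_target)
handle.remove()

print("output shift caused by the patch:", (out_patched - out_plain).norm().item())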
Glitter or gold? Deriving structured insights from sustainability reports via large language models
Marco Bronzini, Carlo Nicolini, Bruno Lepri, Andrea Passerini, and Jacopo Staiano.
In EPJ Data Science 13(1), 2024.
[paper]
[bibtex]
[abstract]
[code]
Over the last decade, several regulatory bodies have started requiring the disclosure of non-financial information from publicly listed companies, in light of the investors' increasing attention to Environmental, Social, and Governance (ESG) issues. Publicly released information on sustainability practices is often disclosed in diverse, unstructured, and multi-modal documentation. This poses a challenge in efficiently gathering and aligning the data into a unified framework to derive insights related to Corporate Social Responsibility (CSR). Thus, using Information Extraction (IE) methods becomes an intuitive choice for delivering insightful and actionable data to stakeholders. In this study, we employ Large Language Models (LLMs), In-Context Learning, and the Retrieval-Augmented Generation (RAG) paradigm to extract structured insights related to ESG aspects from companies' sustainability reports. We then leverage graph-based representations to conduct statistical analyses concerning the extracted insights. These analyses revealed that ESG criteria cover a wide range of topics, exceeding 500, often beyond those considered in existing categorizations, and are addressed by companies through a variety of initiatives. Moreover, disclosure similarities emerged among companies from the same region or sector, validating ongoing hypotheses in the ESG literature. Lastly, by incorporating additional company attributes into our analyses, we investigated which factors have the greatest impact on companies' ESG ratings, showing that ESG disclosure affects the obtained ratings more than other financial or company data.
@article {bronzini2024glitter,
author = { Bronzini, Marco and Nicolini, Carlo and Lepri, Bruno and Passerini, Andrea and Staiano, Jacopo },
title = "Glitter or gold? Deriving structured insights from sustainability reports via large language models",
journal = "EPJ Data Science",
volume = "13",
number = "1",
pages = "41",
year = "2024",
publisher = "Springer Berlin Heidelberg",
url = "https://epjdatascience.springeropen.com/articles/10.1140/epjds/s13688-024-00481-2",
code = "https://github.com/saturnMars/derivingStructuredInsightsFromSustainabilityReportsViaLargeLanguageModels",
}
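The pipeline sketched in the abstract first asks an LLM to turn retrieved report passages into structured ESG statements and then analyses those statements as a graph. Below is a heavily simplified, assumed version of that flow: an extraction prompt requesting JSON records and a networkx graph built from a hard-coded example response; the prompt, the schema, and the example passage are illustrative, not the authors'.

# Simplified extraction-then-graph sketch (illustrative; the paper's RAG setup,
# prompts, and schema differ).
import json
import networkx as nx

EXTRACTION_PROMPT = (
    "Extract ESG initiatives from the passage below as a JSON list of objects "
    "with keys 'company', 'esg_topic', 'initiative'.\nPassage: {passage}"
)

passage = "ACME Corp cut scope-1 emissions by 12% through a fleet electrification plan."
# In the real pipeline this response would come from an LLM call over retrieved
# report chunks; here it is hard-coded for illustration.
llm_response = json.dumps([
    {"company": "ACME Corp", "esg_topic": "emissions reduction",
     "initiative": "fleet electrification plan"},
])

graph = nx.Graph()
for record in json.loads(llm_response):
    graph.add_edge(record["company"], record["esg_topic"], initiative=record["initiative"])

print(f"nodes={graph.number_of_nodes()}, edges={graph.number_of_edges()}")
print("topics linked to ACME Corp:", list(graph.neighbors("ACME Corp")))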
See complete publication list