Research data extraction with large language models

by Kai Armstrong

15:30 (40 min) in USB 2.022

The dissemination of methods and data from published research has been a long-standing challenge, compounded by the diversity of nomenclature across disciplines. The extraction and consolidation of data are still arduous and often require significant manual effort. A streamlined, automated process to extract and store research information in a central repository is critical for advancing research efficiency and accessibility.

In this talk I will discuss the transformative potential of large language models for data extraction. I will show how these models can be applied to improve the understanding of research papers and the extraction of data cards. I will explore advanced prompt engineering strategies such as chain-of-thought prompting, discuss the use of AI agent teams working synergistically towards a common goal, and examine multimodal approaches (e.g. GPT-4V, GPT-4o) that integrate visual and textual information. Finally, I will discuss the development of a centralised data store.
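
To make the extraction workflow more concrete, here is a minimal sketch of chain-of-thought-style prompting for pulling a structured data card from paper text with the OpenAI Python client. The model name (GPT-4o) comes from the abstract; the field schema, prompt wording, and helper names are illustrative assumptions, not the speaker's actual pipeline.

```python
# Minimal sketch: chain-of-thought prompting for structured data extraction.
# Assumptions: the OpenAI Python client, the "gpt-4o" model named in the
# abstract, and an illustrative field schema -- not the speaker's pipeline.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """You are extracting a data card from a research paper.
Think step by step: first identify the methods used, then the datasets,
then the reported results. After your reasoning, output a JSON object
with the keys "methods", "datasets", and "results".
"""

def extract_data_card(paper_text: str) -> dict:
    """Ask the model to reason step by step, then return the final JSON card."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": paper_text},
        ],
    )
    answer = response.choices[0].message.content
    # The JSON object follows the model's reasoning; keep only that block.
    json_part = answer[answer.find("{"): answer.rfind("}") + 1]
    return json.loads(json_part)

if __name__ == "__main__":
    with open("paper.txt") as f:
        print(extract_data_card(f.read()))
```

In practice such extracted cards could then be written to the central repository discussed in the talk; the schema above is only a placeholder for whatever fields a given discipline needs.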