Collection of Sci-LLMs 

Introduction

We curate extensive resources encompassing models, datasets, and evaluations for Scientific Large Language Models (Sci-LLMs), with a specific focus on the domains of biology and chemistry. We primarily emphasize scientific languages, including textual, molecular, protein, and genomic languages, as well as their multimodal combinations. See Scientific Large Language Models: A Survey on Biological & Chemical Domains for more details! We enumerate the available resources for Sci-LLMs, and we open-source and maintain the related materials on GitHub to make them accessible to newcomers to the field.

Fig. 1: Scopes of Sci-LLMs in this survey.

We provide a systematic review of existing Sci-LLMs, including:

  • Textual Scientific Large Language Models (Text-Sci-LLMs). Text-Sci-LLMs are a category of LLMs that are specifically trained on vast amounts of textual scientific data. These models excel in understanding, generating, and interacting with human language in its written form.
  • Molecular Large Language Models (Mol-LLMs). Mol-LLMs are specialized LLMs trained on molecular data, allowing them to understand and predict chemical properties and behaviors of molecules. This distinctive expertise renders them invaluable tools in drug discovery, materials science, and understanding complex chemical interactions.
  • Protein Large Language Models (Prot-LLMs). Prot-LLMs are trained specifically on protein-related data, including amino acid sequences, protein folding patterns, and other biological data relevant to proteins. Consequently, they possess the capability to accurately predict protein structures, functions, and interactions.
  • Genomic Large Language Models (Gene-LLMs). Gene-LLMs, with a primary focus on genomic data, are trained to understand and predict aspects of genetics and genomics. They can be used to analyze DNA sequences, understand genetic variations, and assist in genetic research endeavors such as identifying genetic markers for diseases or understanding evolutionary biology.
  • Multimodal Scientific Large Language Models (MM-Sci-LLMs). MM-Sci-LLMs are the most advanced and versatile models, capable of processing and integrating multiple types of scientific data. They can handle text, molecules, proteins, and more, which makes them well suited to complex, interdisciplinary research that spans different data types.

Citation
If you find these resources useful, please cite our paper:

@misc{zhang2024scientific,
title={Scientific Large Language Models: A Survey on Biological & Chemical Domains},
author={Qiang Zhang and Keyan Ding and Tianwen Lyv and Xinda Wang and Qingyu Yin and Yiwen Zhang and Jing Yu and Yuhao Wang and Xiaotong Li and Zhuoyi Xiang and Kehua Feng and Xiang Zhuang and Zeyuan Wang and Ming Qin and Mengyao Zhang and Jinlu Zhang and Jiyu Cui and Tao Huang and Pengju Yan and Renjun Xu and Hongyang Chen and Xiaolin Li and Xiaohui Fan and Huabin Xing and Huajun Chen},
year={2024},
eprint={2401.14656},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

Collection of Sci-LLMs

Last updated: Jun 30, 2024

Collection of Sci-Datasets

Last updated: Jun 30, 2024

| Date | Datasets | Paper | Journal | Category | Topic |
|---|---|---|---|---|---|
| 2016.05 | The MIMIC dataset; mimic-code | Data Descriptor: MIMIC-III, a freely accessible critical care database | Scientific Data | Text-Sci-LLMs | |
| 2019.04 | eICU-CRD | The eICU Collaborative Research Database, a freely available multi-center database for critical care research | Scientific Data | Text-Sci-LLMs | |
| 2018.11 | cMedQA2 | Multi-Scale Attentive Interaction Networks for Chinese Medical Question Answer Selection | IEEE Access | Text-Sci-LLMs | |
| | MedDialog-Chinese | MedDialog: Large-scale Medical Dialogue Datasets | EMNLP 2020 | Text-Sci-LLMs | |
| 2023.10 | ChiMed | Qilin-Med: Multi-stage Knowledge Injection Advanced Medical Large Language Model | arXiv | Text-Sci-LLMs | |
| 2023.03 | HealthCareMagic-100k | ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge | arXiv | Text-Sci-LLMs | |
| 2019.01 | MedQuAD | A Question-Entailment Approach to Question Answering | arXiv | Text-Sci-LLMs | |
| 2023.07 | MultiMedQA | Large language models encode clinical knowledge | Nature | Text-Sci-LLMs | |
| 2015.07 | Open-I | Preparing a collection of radiology examinations for distribution and retrieval | JAMIA | Text-Sci-LLMs | |
| 2024.03 | Psych8k | ChatCounselor: A Large Language Models for Mental Health Support | arXiv | Text-Sci-LLMs | |