Collection of Sci-LLMs
Introduction
We curate extensive resources encompassing models, datasets, and evaluations for Scientific Large Language Models (Sci-LLMs), with a specific focus on the domains of biology and chemistry. We primarily emphasize scientific languages, including textual, molecular, protein, and genomic languages, as well as their multimodal combinations. See Scientific Large Language Models: A Survey on Biological & Chemical Domains for more details. We enumerate the available resources for Sci-LLMs, and we open-source and maintain the related materials on GitHub to make them accessible to newcomers to the field.
We provide a systematic review of existing Sci-LLMs, including:
- Textual Scientific Large Language Models (Text-Sci-LLMs). Text-Sci-LLMs are a category of LLMs that are specifically trained on vast amounts of textual scientific data. These models excel in understanding, generating, and interacting with human language in its written form.
- Molecular Large Language Models (Mol-LLMs). Mol-LLMs are specialized LLMs trained on molecular data, allowing them to understand and predict chemical properties and behaviors of molecules. This distinctive expertise renders them invaluable tools in drug discovery, materials science, and understanding complex chemical interactions.
- Protein Large Language Models (Prot-LLMs). Prot-LLMs are trained specifically on protein-related data, including amino acid sequences, protein folding patterns, and other biological data relevant to proteins. Consequently, they possess the capability to accurately predict protein structures, functions, and interactions.
- Genomic Large Language Models (Gene-LLMs). Gene-LLMs, with a primary focus on genomic data, are trained to understand and predict aspects of genetics and genomics. They can be used to analyze DNA sequences, understand genetic variations, and assist in genetic research endeavors such as identifying genetic markers for diseases or understanding evolutionary biology.
- Multimodal Scientific Large Language Models (MM-Sci-LLMs). MM-Sci-LLMs are the most advanced and versatile models, capable of processing and integrating multiple types of scientific data. They can handle text, molecules, proteins, and more, which makes them well suited to interdisciplinary research that spans different data types.
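To make the notion of a "scientific language" concrete, the sketch below shows what a raw input string might look like for each single-modality category, together with two simple ways of splitting such strings into tokens. Everything here is an illustrative assumption: the example strings, the `EXAMPLES` mapping, and the helper names `char_tokens` and `kmer_tokens` are hypothetical and are not taken from the survey or from any model listed on this page. Real Sci-LLMs typically use learned or domain-specific tokenizers.

```python
# Illustrative inputs only; none of these strings come from the datasets on this page.
EXAMPLES = {
    "Text-Sci-LLMs": "Aspirin inhibits cyclooxygenase enzymes.",  # natural-language text
    "Mol-LLMs": "CC(=O)OC1=CC=CC=C1C(=O)O",  # SMILES string for aspirin
    "Prot-LLMs": "MKTAYIAKQRQISFVK",  # amino acids in one-letter codes (made-up fragment)
    "Gene-LLMs": "ATGGCCATTGTAATG",  # DNA nucleotides (made-up fragment)
}


def char_tokens(sequence):
    """Split a sequence into single-character tokens."""
    return list(sequence)


def kmer_tokens(sequence, k=3):
    """Split a sequence into overlapping windows of length k (k-mers)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


if __name__ == "__main__":
    for category, text in EXAMPLES.items():
        print(category, "->", text)
    print(char_tokens(EXAMPLES["Prot-LLMs"])[:5])
    print(kmer_tokens(EXAMPLES["Gene-LLMs"], k=3)[:4])
```

Character-level and overlapping k-mer splitting are shown only because they are easy to read; they stand in for whatever tokenization a given model actually uses.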
Citation
If you find these resources useful, please cite our paper:
@misc{zhang2024scientific,
title={Scientific Large Language Models: A Survey on Biological & Chemical Domains},
author={Qiang Zhang and Keyan Ding and Tianwen Lyv and Xinda Wang and Qingyu Yin and Yiwen Zhang and Jing Yu and Yuhao Wang and Xiaotong Li and Zhuoyi Xiang and Kehua Feng and Xiang Zhuang and Zeyuan Wang and Ming Qin and Mengyao Zhang and Jinlu Zhang and Jiyu Cui and Tao Huang and Pengju Yan and Renjun Xu and Hongyang Chen and Xiaolin Li and Xiaohui Fan and Huabin Xing and Huajun Chen},
year={2024},
eprint={2401.14656},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Collection of Sci-LLMs
Last updated: Jun 30, 2024
Collection of Sci-Datasets
Last updated: Jun 30, 2024
| Date | Datasets | Paper | Journal | Category | Topic |
|---|---|---|---|---|---|
| 2016.05 | The MIMIC dataset, mimic-code | Data Descriptor: MIMIC-III, a freely accessible critical care database | Scientific Data | Text-Sci-LLMs | |
| 2019.04 | eICU-CRD | The eICU Collaborative Research Database, a freely available multi-center database for critical care research | Scientific Data | Text-Sci-LLMs | |
| 2018.11 | cMedQA2 | Multi-Scale Attentive Interaction Networks for Chinese Medical Question Answer Selection | IEEE Access | Text-Sci-LLMs | |
| | MedDialog-Chinese | MedDialog: Large-scale Medical Dialogue Datasets | EMNLP 2020 | Text-Sci-LLMs | |
| 2023.10 | ChiMed | Qilin-Med: Multi-stage Knowledge Injection Advanced Medical Large Language Model | arXiv | Text-Sci-LLMs | |
| 2023.03 | HealthCareMagic-100k | ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge | arXiv | Text-Sci-LLMs | |
| 2019.01 | MedQuAD | A Question-Entailment Approach to Question Answering | arXiv | Text-Sci-LLMs | |
| 2023.07 | MultiMedQA | Large language models encode clinical knowledge | Nature | Text-Sci-LLMs | |
| 2015.07 | Open-I | Preparing a collection of radiology examinations for distribution and retrieval | JAMIA | Text-Sci-LLMs | |
| 2024.03 | Psych8k | ChatCounselor: A Large Language Models for Mental Health Support | arXiv | Text-Sci-LLMs | |