ProteinGym

bioinformatics, biology, deep learning, life sciences, machine learning, protein

Description

ProteinGym is a benchmark suite for assessing the performance of protein fitness prediction and design models. It comprises a large curated collection of more than 200 high-throughput experimental assays (about 3 million mutated sequences), as well as expert clinical annotations of the pathogenicity of mutants in more than 3,000 human genes.

Update Frequency

Quarterly

License

MIT License

Documentation

https://github.com/OATML-Markslab/ProteinGym/blob/main/README.md

Managed By

Harvard Medical School, University of Oxford

Contact

pascal_notin@hms.harvard.edu

How to Cite

ProteinGym was accessed on DATE from https://registry.opendata.aws/proteingym.

Resources on AWS

  • Description: ProteinGym dataset including all substitution/indel mutations from Deep Mutational Scanning (DMS) experiments (DMS_substitutions.parquet / DMS_indels.parquet), and all substitution/indel mutations from clinical variant databases (clinical_substitutions.parquet / clinical_indels.parquet).
    Resource type: S3 Bucket
    Amazon Resource Name (ARN): arn:aws:s3:::proteingym
    AWS Region: us-east-2
    AWS CLI Access (No AWS account required): aws s3 ls --no-sign-request s3://proteingym/
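
For readers without the AWS CLI, the unsigned listing above can be approximated from Python using only the standard library. This is a minimal sketch, not an official client: it assumes the bucket accepts anonymous ListObjectsV2 requests over HTTPS (which the --no-sign-request command suggests) and that the standard virtual-hosted-style S3 endpoint applies for the bucket and region listed above. The helper names listing_url, parse_listing, and list_objects are hypothetical, and no object key layout is assumed.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BUCKET = "proteingym"   # from the ARN listed above
REGION = "us-east-2"    # from the AWS Region listed above
# XML namespace used by S3 REST responses
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def listing_url(prefix="", max_keys=100):
    """Build an unsigned ListObjectsV2 URL (assumes public listing is allowed)."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"https://{BUCKET}.s3.{REGION}.amazonaws.com/?{query}"


def parse_listing(xml_data):
    """Return (key, size_in_bytes) pairs from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_data)
    return [
        (item.findtext(f"{S3_NS}Key"), int(item.findtext(f"{S3_NS}Size")))
        for item in root.iter(f"{S3_NS}Contents")
    ]


def list_objects(prefix="", max_keys=100):
    """Fetch and parse one page of the bucket listing (needs network access)."""
    with urllib.request.urlopen(listing_url(prefix, max_keys)) as resp:
        return parse_listing(resp.read())
```

With network access, list_objects() returns the first page of keys and sizes. This sketch does not follow continuation tokens, so a complete listing would need additional paging logic.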
