digital preservation, free software, open source software, source code
Software Heritage is the largest existing public archive of software source code and accompanying development history. The Software Heritage Graph Dataset is a fully deduplicated Merkle DAG representation of the Software Heritage archive. The dataset links together file content identifiers, source code directories, and Version Control System (VCS) commits tracking evolution over time, up to the full states of VCS repositories as observed by Software Heritage during periodic crawls. The dataset’s contents come from major development forges (including GitHub and GitLab), FOSS distributions (e.g., Debian), and language-specific package managers (e.g., PyPI). Crawling information is also included, with timestamps recording when and where each archived source code artifact was observed in the wild. Author and committer information is anonymized.
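To illustrate what "fully deduplicated Merkle DAG" means in practice, here is a minimal Python sketch. It is a toy model only: the MerkleStore class, its method names, the SHA-1 choice, and the serialization format are all assumptions made for illustration, and they do not reproduce the identifier scheme or schema actually used by Software Heritage. The point is simply that each node is keyed by a hash of its own content plus the identifiers of its children, so identical files, directories, and histories collapse into a single node.

```python
import hashlib


class MerkleStore:
    """Toy content-addressed store (hypothetical, for illustration only)."""

    def __init__(self):
        # Maps node identifier -> (kind, serialized payload).
        self.nodes = {}

    def _put(self, kind, payload):
        # The identifier depends only on the node kind and its payload,
        # so storing the same node twice yields the same identifier.
        node_id = hashlib.sha1(kind.encode() + b"\0" + payload).hexdigest()
        self.nodes.setdefault(node_id, (kind, payload))
        return node_id

    def add_content(self, data):
        return self._put("content", data)

    def add_directory(self, entries):
        # entries maps entry name -> child identifier; sorting makes the
        # serialization independent of insertion order.
        lines = [f"{name} {child}" for name, child in sorted(entries.items())]
        return self._put("directory", "\n".join(lines).encode())

    def add_revision(self, directory_id, parents, message):
        lines = [f"directory {directory_id}"]
        lines += [f"parent {p}" for p in parents]
        lines += ["", message]
        return self._put("revision", "\n".join(lines).encode())


store = MerkleStore()
readme = store.add_content(b"hello\n")
dir_a = store.add_directory({"README": readme})
# The same file and directory, added again as if seen in another repository.
dir_b = store.add_directory({"README": store.add_content(b"hello\n")})
rev1 = store.add_revision(dir_a, [], "initial import")
rev2 = store.add_revision(dir_a, [rev1], "no file changes")
```

In this sketch dir_a and dir_b are the same identifier, and the store ends up with four nodes (one content, one directory, two revisions) even though six insertions were made. The two revisions stay distinct because the second one references the first as its parent.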
Data is updated yearly.
The term "Software Heritage Graph Dataset" designates the internal structure of the Software Heritage archive, and explicitly excludes the file contents. The "Software Heritage Graph Dataset" is distributed under the Creative Commons Attribution 4.0 International license. For terms of use of all other contents found in the S3 buckets, contact datasets@softwareheritage.org. By accessing the dataset, you agree with the Software Heritage Ethical Charter for using the archive data, the terms of use for bulk access, and the Software Heritage principles for large language models.
https://docs.softwareheritage.org/devel/swh-dataset/graph/athena.html
Software Heritage
Software Heritage Graph Dataset was accessed on DATE
from https://registry.opendata.aws/software-heritage.
arn:aws:s3:::softwareheritage
us-east-1
aws s3 ls --no-sign-request s3://softwareheritage/
arn:aws:s3:::softwareheritage-inventory
us-east-1
aws s3 ls --no-sign-request s3://softwareheritage-inventory/
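The two `aws s3 ls --no-sign-request` commands above can also be approximated from Python. The following is a sketch under stated assumptions: it relies on the third-party boto3 library, and the helper names anonymous_s3_client and list_top_level are hypothetical names introduced here, not part of any Software Heritage tooling. Unsigned requests are configured through botocore's UNSIGNED signature setting, which plays the role of `--no-sign-request`.

```python
def anonymous_s3_client(region="us-east-1"):
    """Build an S3 client that sends unsigned (anonymous) requests.

    boto3 is imported lazily so that list_top_level can be used with any
    object exposing the same paginator interface.
    """
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    return boto3.client(
        "s3",
        region_name=region,
        config=Config(signature_version=UNSIGNED),
    )


def list_top_level(client, bucket):
    """Return (prefixes, keys) found at the top level of a bucket."""
    prefixes, keys = [], []
    paginator = client.get_paginator("list_objects_v2")
    # Delimiter="/" groups deeper objects under CommonPrefixes, which is
    # roughly what a non-recursive `aws s3 ls` shows.
    for page in paginator.paginate(Bucket=bucket, Delimiter="/"):
        prefixes += [p["Prefix"] for p in page.get("CommonPrefixes", [])]
        keys += [o["Key"] for o in page.get("Contents", [])]
    return prefixes, keys


# Example usage (requires boto3 and network access):
# client = anonymous_s3_client()
# for bucket in ("softwareheritage", "softwareheritage-inventory"):
#     print(bucket, list_top_level(client, bucket))
```

Passing the client in as a parameter keeps the listing logic separate from the network setup, so the same function works with a signed client or a stub.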