Pseudogenomes for 1135 Arabidopsis thaliana strains were generated by combining reference and variant calls including deletions, with uncalled sites represented as Ns.
The data powering SNPst☆r is published as an Annotated Research Context (ARC): the Arabidopsis thaliana GWAS & Proteotype ARC (export 2026-06-05, 38.2 GB total).
Alternatively, download individual collections directly. All collections are gzipped JSON Lines (one document per line) except the SNP catalog, which is gzipped BSON (produced by mongodump) and must be restored with mongorestore. See the README and METADATA documents for details.
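As a minimal sketch of reading one of the gzipped JSON Lines collections, the helper below streams a file one document at a time using only the Python standard library. The function name `iter_jsonl_gz` and the file name in the usage comment are illustrative assumptions, not part of the published resource.

```python
import gzip
import json
from typing import Iterator


def iter_jsonl_gz(path: str) -> Iterator[dict]:
    """Yield one parsed document per non-empty line of a gzipped JSON Lines file."""
    # Text mode ("rt") decompresses and decodes on the fly, so even the
    # multi-gigabyte collections are never loaded into memory at once.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines instead of failing on them
                yield json.loads(line)


# Usage (hypothetical file name; substitute the collection you downloaded):
#     for doc in iter_jsonl_gz("accession_data.jsonl.gz"):
#         print(doc)
```

Streaming matters mostly for the larger collections; the small ones can simply be collected with `list(iter_jsonl_gz(path))`.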
| Dataset | Documents | Size | Format | sha256 |
|---|---|---|---|---|
| Climate-factor GWAS (raw p-values): Raw GWAS p-values per SNP across ~196 climate factors. Climate-factor keys join to climate_factors_upd for human-readable descriptions. | 3,042,994 | 250 MB | jsonl.gz | ce9831a2c6c5… |
| Climate-factor GWAS (enriched p<1 subset): Enriched subset of GWAS hits with p<1; climate-factor sub-documents carry pval, description and link inline. | 185,450 | 12 MB | jsonl.gz | ccc12893ee6c… |
| Climate-factor descriptions: Canonical descriptions of the ~196 climate factors (ID, Description, Link, Source, Units). | 196 | 6.2 KB | jsonl.gz | c2e256803d60… |
| Climate-factor key → human-id join: Join table mapping machine GWAS keys (actualID) to human-readable ID, Description, Link, Source and Units. | 195 | 7.2 KB | jsonl.gz | 97ba41c0969a… |
Ferrero-Serrano Á, Assmann SM (2019). Nat. Ecol. Evol. 3, 274-285. doi:10.1038/s41559-018-0754-5
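The join between machine GWAS keys and human-readable climate-factor metadata could be sketched as follows. The field names `actualID`, `ID`, `Description` and `Units` come from the dataset descriptions above; the assumption that per-SNP p-values are available as a flat mapping from machine key to p-value is ours and should be checked against the README, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List


def build_factor_lookup(join_docs: Iterable[dict]) -> Dict[str, dict]:
    """Index join-table documents by their machine key (actualID)."""
    return {doc["actualID"]: doc for doc in join_docs}


def annotate_pvalues(pvalues: Dict[str, float],
                     lookup: Dict[str, dict]) -> List[dict]:
    """Attach human-readable metadata to each (machine key, p-value) pair.

    ASSUMPTION: `pvalues` maps machine keys to p-values for a single SNP.
    Keys missing from the lookup are kept with None metadata, not dropped.
    """
    rows = []
    for key, pval in pvalues.items():
        meta = lookup.get(key, {})
        rows.append({
            "key": key,
            "pval": pval,
            "id": meta.get("ID"),
            "description": meta.get("Description"),
            "units": meta.get("Units"),
        })
    # Smallest p-value first, so the strongest associations lead the list.
    return sorted(rows, key=lambda row: row["pval"])
```

Keeping unmatched keys visible is a deliberate choice here: the join table lists 195 documents while the descriptions list 196, so a silent inner join could hide a factor.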
| Dataset | Documents | Size | Format | sha256 |
|---|---|---|---|---|
| AraGWAS significant hits: Curated significant SNP×phenotype associations from the AraGWAS Catalog (snp_id_c, chr, position, score, maf). | 36,399 | 1.3 MB | jsonl.gz | ef77e7a85371… |
Togninalli M et al. (2018). Nucleic Acids Res. 46(D1), D1150-D1156. doi:10.1093/nar/gkx954
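The sha256 column can be used to check a download. The sketch below hashes a file in chunks and compares the result with an expected digest; because the tables here show only the first characters of each digest followed by an ellipsis, it falls back to a prefix comparison in that case, which is a weaker check than comparing the full value. The function names are hypothetical, and where the full digests are published is not stated on this page.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops at the empty bytes object at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected: str) -> bool:
    """Compare a file against a full digest, or a prefix ending in an ellipsis."""
    actual = sha256_of_file(path)
    expected = expected.strip().lower()
    if expected.endswith("…"):
        # Truncated digest as displayed in the tables: prefix match only.
        return actual.startswith(expected.rstrip("…"))
    return actual == expected
```

Chunked reading keeps memory use flat, which matters for the multi-gigabyte files.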
| Dataset | Documents | Size | Format | sha256 |
|---|---|---|---|---|
| SNP catalog: Full SNP catalog used by the website (transcript_id, snp_id, snp_dp, snp_location, ...). Gzipped BSON; restore with mongorestore. Large file (≈11 GB); the complete ARC is the recommended way to obtain it. | 20,010,332 | 11 GB | bson.gz | a0405c535f8e… |
| Haplotypes: Per-transcript haplotypes across accessions (haplotype_id, haplotype_number, num_snps, num_accessions). | 6,348,060 | 685 MB | jsonl.gz | b3ed98041a72… |
| Transcript regions: Transcript genomic regions (transcript_id, chr, strand, start, end). | 52,060 | 219 MB | jsonl.gz | bf148c32fad1… |
| Proteotype haplotypes: Protein-level haplotype groups (transcript_id, protein_haplotype_id, protein_haplotype_number, num_accessions, accessions). | 3,495,265 | 422 MB | jsonl.gz | 488ebe820857… |
| Proteotype structures: AlphaFold/PDB structural data per proteotype (transcript_id, proteotype_id, pdb_structure, filename, file_path). Very large file (≈26 GB); the complete ARC is the recommended way to obtain it. | 356,807 | 26 GB | jsonl.gz | 1fd5cd88c2a9… |
| ThermoMPNN stability predictions: ThermoMPNN ΔΔG stability predictions per protein (uniprot_id, protein_length, predictions, source_file, model_version). | 41,597 | 1.3 GB | jsonl.gz | abba4ef25900… |
Derived from the 1001 Genomes panel — 1001 Genomes Consortium (2016). Cell 166, 481-491. doi:10.1016/j.cell.2016.05.063
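Since the SNP catalog is distributed as gzipped BSON rather than JSON Lines, restoring it requires a running MongoDB instance and the mongorestore tool. The following is a sketch only: it assembles a mongorestore invocation for a single compressed BSON file using the tool's `--gzip`, `--db` and `--collection` options, targeting the default local server. The database name, collection name and file name are hypothetical; the README may prescribe different ones.

```python
import subprocess
from typing import List


def mongorestore_command(bson_gz_path: str, database: str,
                         collection: str) -> List[str]:
    """Build the argument list for restoring one gzipped BSON dump file."""
    return [
        "mongorestore",
        "--gzip",                    # the dump file is gzip-compressed
        "--db", database,            # target database (hypothetical name)
        "--collection", collection,  # target collection (hypothetical name)
        bson_gz_path,
    ]


def restore(bson_gz_path: str, database: str, collection: str) -> None:
    """Run mongorestore and raise CalledProcessError if it fails."""
    subprocess.run(
        mongorestore_command(bson_gz_path, database, collection),
        check=True,
    )


# Usage (all three names are placeholders):
#     restore("snp_catalog.bson.gz", "snpstar", "snps")
```

Passing the arguments as a list rather than a shell string avoids quoting problems with paths that contain spaces.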
| Dataset | Documents | Size | Format | sha256 |
|---|---|---|---|---|
| Accession data: Accession metadata (acc_id, name, cs_number, country, group). | 1,135 | 18.5 KB | jsonl.gz | 5d58eb9473b8… |
| Accession locations: Geolocations per accession (accession_id, name, lat, lng, country). | 7,406 | 87 KB | jsonl.gz | 9aaaed3d36aa… |
| Transcript → UniProt mapping: Mapping between transcripts and UniProt entries (uniprot_id, isoform_id, gene_name). | 48,321 | 778 KB | jsonl.gz | 0ed018b664b0… |
Accession metadata derived from the 1001 Genomes panel: 1001 Genomes Consortium (2016). Cell 166, 481-491. doi:10.1016/j.cell.2016.05.063
The full Arabidopsis thaliana / 1001 Genomes GWAS dataset can be accessed programmatically through the public AraGWAS REST API: https://aragwas.1001genomes.org/api/.
If you use this resource, please cite snpStar. The Collaborative Research Centre CRC 1664 is funded by the Deutsche Forschungsgemeinschaft (DFG).