For a project I have been working on lately, I wanted to download all archaeal gyrA sequences from NCBI and make a BLAST database out of them. I searched up and down the internet for a curl (or wget) solution, but didn’t find anything that did exactly what I wanted. It took me quite a while to piece one together, and my solution is below. Under the solution, I have listed my myriad Google searches in the hope that someone in the future will land on this page and get to a working solution faster than I did!
I found the sequences I was interested in by navigating to NCBI, selecting “Protein” from the database drop-down menu, and searching for “gyra[All Fields] AND archaea[filter]”. I then chose “Send to:” in the top right-hand corner, selected “File”, and changed the format to “Accession List”. This produced a file with each accession I wanted to download on its own line. I then fed that file (archaea-gyra.seq below) into the following while loop:
# Read one accession per line and append its FASTA record to the output file
while read -r acc
do
    curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=${acc}&rettype=fasta&retmode=text" >> archaea-gyra.faa
done < archaea-gyra.seq
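The loop above makes one request per accession, which gets slow for long lists and can run into NCBI's rate limit (about three requests per second without an API key). Since efetch accepts a comma-separated list of IDs, one possible variation is to fetch in batches. The following is only a sketch under my own assumptions: the function names (efetch_url, batch_accessions, fetch_batches), the batch size of 100, and the one-second pause are illustrative choices, not something NCBI prescribes.

```shell
#!/usr/bin/env bash
# Sketch: fetch protein FASTA records in batches rather than one at a time.
# Function names, batch size, and sleep interval are illustrative assumptions.

EFETCH="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Print the efetch URL for a comma-separated list of protein accessions ($1).
efetch_url() {
    printf '%s?db=protein&id=%s&rettype=fasta&retmode=text\n' "$EFETCH" "$1"
}

# Read accessions (one per line) on stdin and print them as comma-joined
# batches, one batch per line. $1 is the batch size (default 100).
batch_accessions() {
    awk -v n="${1:-100}" '
        { sub(/\r$/, "") }          # tolerate Windows line endings
        NF == 0 { next }            # skip blank lines
        {
            # comma inside a batch, newline between batches, nothing at the start
            printf "%s%s", (c % n ? "," : (c ? "\n" : "")), $1
            c++
        }
        END { if (c) print "" }     # terminate the last batch
    '
}

# Read batches on stdin, download each one, and write the FASTA to stdout.
fetch_batches() {
    while IFS= read -r ids
    do
        curl -s "$(efetch_url "$ids")"
        sleep 1                     # pause between requests to stay under the rate limit
    done
}
```

With those functions defined, the download becomes `batch_accessions 100 < archaea-gyra.seq | fetch_batches > archaea-gyra.faa`. It is worth checking that `grep -c '>' archaea-gyra.faa` matches the number of accessions before building the database, which (assuming BLAST+ is installed) can be done with `makeblastdb -in archaea-gyra.faa -dbtype prot -out archaea-gyra`.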
Google queries:
- download ncbi protein sequences wget
- download all ncbi protein seqs for a protein name unix
- programmatically download all ncbi protein seqs for a protein name unix
- download all protein sequences from protein db ncbi
- how to download all protein sequences ncbi
- how to download all genbank protein sequences matching a query