PubChem offers an FTP site. This script downloads all of the gzipped SDfiles to a local directory. There is about 58 GB of zipped data, which expands to about 535 GB when unzipped.
require 'net/ftp'

Net::FTP.open('ftp.ncbi.nlm.nih.gov') do |ftp|
  ftp.passive = true
  ftp.login # no arguments: anonymous login
  ftp.chdir('/pubchem/Compound/CURRENT-Full/SDF')
  files = ftp.list('*')
  total = 0
  # Keep only the gzipped SDfiles from the ls-style listing
  sdf_files = files.select { |f| f.match(/\.sdf\.gz$/) }
  sdf_files.each_with_index do |file, index|
    # The fifth column of each listing line is the size in bytes; the last is the name
    tokens = file.split(/\s+/)
    size = tokens[4].to_i
    total += size
    filename = tokens.last
    puts "Getting [#{filename}] [#{size}] [#{index + 1} of #{sdf_files.count}]"
    ftp.getbinaryfile(filename, filename, 1024)
  end
  # Total bytes listed, followed by the same figure in GiB
  puts "#{total} :: #{total.to_f / (1024 * 1024 * 1024)}"
end
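
With this much data, a run may well be interrupted partway through. The following is a minimal sketch, not part of the original script, of how a re-run could skip files that are already complete on disk. It assumes the server returns Unix "ls -l" style lines (the same assumption the script makes when it reads tokens[4]); the helper names parse_listing_line and needs_download? are hypothetical, and the sample listing line uses made-up values.

```ruby
# Hypothetical helpers; they assume Unix "ls -l" style listing lines,
# where the fifth whitespace-separated column is the size in bytes.

# Split one listing line into the file name and its size in bytes.
def parse_listing_line(line)
  tokens = line.split(/\s+/)
  { name: tokens.last, size: tokens[4].to_i }
end

# A file needs downloading unless a local copy of the same size exists.
def needs_download?(entry, dir = '.')
  path = File.join(dir, entry[:name])
  !(File.exist?(path) && File.size(path) == entry[:size])
end

# Example listing line (made-up values, for illustration only).
line = '-r--r--r--   1 ftp   anonymous   1234 Jan 01 00:00 example.sdf.gz'
entry = parse_listing_line(line)
puts "#{entry[:name]}: #{entry[:size]} bytes, download? #{needs_download?(entry)}"
```

Inside the download loop, the getbinaryfile call could then be guarded with a check such as "next unless needs_download?(entry)". A size match does not prove the contents are intact, so treat it as a cheap first check only.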
You can then unzip these using:
gunzip *.gz
(expect it to take a while)
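
If gunzip is not available, or you would rather stay in Ruby, the standard library's zlib can do the same job. This is a minimal sketch under that assumption; the gunzip_file helper name is hypothetical, and unlike gunzip it leaves the original .gz file in place, so it needs room for both copies.

```ruby
require 'zlib'

# Hypothetical helper: decompress "name.gz" to "name" in 1 MiB chunks,
# so a multi-gigabyte file never has to fit in memory at once.
def gunzip_file(gz_path)
  out_path = gz_path.sub(/\.gz\z/, '')
  Zlib::GzipReader.open(gz_path) do |gz|
    File.open(out_path, 'wb') do |out|
      # read(n) returns nil once the end of the stream is reached
      while (chunk = gz.read(1024 * 1024))
        out.write(chunk)
      end
    end
  end
  out_path
end

# Decompress every gzipped SDfile in the current directory.
Dir.glob('*.sdf.gz').each do |gz_path|
  puts "Unzipping #{gz_path}"
  gunzip_file(gz_path)
end
```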