A better way to scrape crates.io
⚓ Rust 📅 2025-07-27 👤 surdeus

I am required to create an internal mirror of crates.io for a closed environment.
Effectively, I need to stand up an internal mirror that has no connection to the web: purely standalone and self-contained, in what is called an "air-gapped" closed environment. I can literally download something, burn it to a CD or DVD, then "sneakernet" the disc into that area and load it onto the machine.
I've got most of this figured out.
The basic technique I am using is this:
a) Clone the GitHub repository crates.io-index.
b) Loop through it to collect every crate name, version, and yank status (a minimal sketch of this walk follows).
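As a rough sketch of step (b): each file in the index clone is a series of JSON lines, one per published version, carrying "name", "vers", and "yanked" fields. The clone location and helper name below are my own assumptions, not an established tool:

```python
# Sketch: walk a local clone of https://github.com/rust-lang/crates.io-index
# and collect (name, version, yanked) for every published version.
import json
from pathlib import Path

INDEX_DIR = Path("crates.io-index")  # local clone location (assumption)

def iter_versions(index_dir: Path):
    """Yield (name, version, yanked) for every version in the index."""
    for path in index_dir.rglob("*"):
        # Skip directories, git internals, and the index's own config.json.
        if not path.is_file() or ".git" in path.parts or path.name == "config.json":
            continue
        for line in path.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            entry = json.loads(line)  # one JSON object per published version
            yield entry["name"], entry["vers"], entry["yanked"]

if __name__ == "__main__":
    versions = list(iter_versions(INDEX_DIR))
    names = {n for n, _, _ in versions}
    print(f"{len(names)} crate names, {len(versions)} name+version pairs")
```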
That gives roughly 188K crate names, or about 1.6 million unique name + version pairs.
Next, I can use the "api/v1" download endpoint to fetch the .crate files.
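For reference, the per-version endpoint is https://crates.io/api/v1/crates/{name}/{version}/download, which redirects to the actual file. A minimal fetch sketch; the User-Agent string and output layout here are placeholders of mine (crates.io asks automated clients to send an identifying User-Agent):

```python
# Sketch: fetch one .crate file through the api/v1 download endpoint.
import urllib.request
from pathlib import Path

# Placeholder identity string (assumption); put real contact details here.
USER_AGENT = "internal-mirror-builder (you@example.com)"

def download_crate(name: str, version: str, out_dir: Path) -> Path:
    url = f"https://crates.io/api/v1/crates/{name}/{version}/download"
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    out_dir.mkdir(parents=True, exist_ok=True)
    out = out_dir / f"{name}-{version}.crate"
    with urllib.request.urlopen(req) as resp:  # follows the redirect
        out.write_bytes(resp.read())
    return out
```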
So I am about to download 188K crates (if I expand this to all non-yanked versions, the count is 1.6 million).
Yes, I can write a Python tool to scrape these; the for loop doesn't care, it will just keep pulling until it ends.
I can throttle (there are 86,400 seconds in a day, so a limit of 2 requests per second is 172,800 per day: a bit over a day for the basic list, and nine or ten days for all 1.6 million versions), or I can skip the throttling and just go for it. A throttled loop is sketched below.
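Continuing the sketches above (and reusing the hypothetical iter_versions and download_crate helpers), a simple fixed-rate throttle might look like this:

```python
# Sketch: cap the download loop at ~2 requests per second
# (2 * 86,400 s = 172,800 downloads per day).
import time
from pathlib import Path

RATE = 2.0           # requests per second (tunable)
INTERVAL = 1.0 / RATE

for name, vers, yanked in iter_versions(Path("crates.io-index")):
    if yanked:
        continue     # mirror only non-yanked versions
    started = time.monotonic()
    download_crate(name, vers, Path("mirror/crates"))
    # Sleep off whatever remains of this request's time slot.
    time.sleep(max(0.0, INTERVAL - (time.monotonic() - started)))
```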
I would rather be a "good citizen" and not create anything that looks like a DoS attack.
So my ask is this:
Is there a better (preferred) way for me to pull all of these, or do I just throttle to a few requests per second?
I am ready to "release the hounds" on the beast, and would prefer to politely ask first.
Thanks.