A better way to scrape crates.io

⚓ Rust    📅 2025-07-27    👤 surdeus    👁️ 10      



I am required to create an internal mirror of crates.io for a closed environment.

Effectively, I need to stand up an internal mirror that has no connection to the web and is purely standalone and self-contained, in what is called an "air-gapped closed environment". I can literally download something, burn it to a CD (or DVD), then "sneakernet" the disc into that area and load it onto the machine.

I've got most of this figured out.

The basic technique I am using is this:
a) Step 1: clone the GitHub repo crates.io-index
b) Loop through it to get every name, every version, and the yank status (a sketch follows the counts below)

That gives:
188K crate names,
or 1.6 million unique name + version pairs.
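Roughly what I have in mind for the index walk (just a sketch: it assumes a local clone at ./crates.io-index and the standard index layout, where each file is one JSON object per line with name, vers, yanked, and cksum fields):

```python
import json
from pathlib import Path

INDEX_DIR = Path("crates.io-index")  # local clone of the index repo

def iter_index_entries(index_dir):
    """Yield one (name, version, yanked, cksum) tuple per index line."""
    for path in index_dir.rglob("*"):
        # Skip directories, git internals, and the index's own config.json.
        if path.is_dir() or ".git" in path.parts or path.name == "config.json":
            continue
        for line in path.read_text().splitlines():
            if line.strip():
                e = json.loads(line)
                yield e["name"], e["vers"], e["yanked"], e["cksum"]

if __name__ == "__main__":
    names, versions = set(), 0
    for name, vers, yanked, _ in iter_index_entries(INDEX_DIR):
        names.add(name)
        if not yanked:
            versions += 1
    print(f"{len(names)} crate names, {versions} non-yanked name+version pairs")
```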

Next, I can use the "api/v1" download interface to fetch the .crate files.
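A per-file fetch would look something like this (a sketch: the download endpoint redirects to the actual .crate file, and the index's cksum field is the SHA-256 of that file; the User-Agent string is a placeholder to be filled in with real contact info):

```python
import hashlib
import urllib.request
from pathlib import Path

DL = "https://crates.io/api/v1/crates/{name}/{version}/download"
HEADERS = {"User-Agent": "internal-mirror-bot (you@example.com)"}  # placeholder contact

def fetch_crate(name, version, cksum, dest_dir):
    """Download one .crate file and verify it against the index checksum."""
    req = urllib.request.Request(DL.format(name=name, version=version), headers=HEADERS)
    with urllib.request.urlopen(req) as resp:  # follows the redirect to the CDN
        data = resp.read()
    if hashlib.sha256(data).hexdigest() != cksum:
        raise ValueError(f"checksum mismatch for {name}-{version}")
    dest = Path(dest_dir) / f"{name}-{version}.crate"
    dest.write_bytes(data)
    return dest
```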

So, I am about to download 188K crates
(if I expand this to all non-yanked versions, the count is 1.6 million).

Yeah, I can write a Python tool to scrape these;
the for loop doesn't care - it will just continue to pull things until the loop ends.

I can throttle (there are 86,400 seconds in a day, so if I limit to 2 per second, that's 172,800 per day: a bit over a day to download the basic list, but roughly 10 days for all 1.6 million versions) - or I can not throttle the requests and just go for it.
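A throttled driver along those lines, reusing the two sketches above (the 0.5 s interval matches the 2-per-second math, but whatever rate I pick should respect crates.io's published crawler policy; note this walks every non-yanked version, and restricting to the latest version per crate would need an extra grouping pass):

```python
import time

MIN_INTERVAL = 0.5  # seconds between requests, i.e. ~2 requests per second

def mirror_all(index_dir, dest_dir, skip_yanked=True):
    """Fetch every (non-yanked) crate version, rate-limited to MIN_INTERVAL."""
    last = 0.0
    # iter_index_entries and fetch_crate are the sketches shown earlier.
    for name, vers, yanked, cksum in iter_index_entries(index_dir):
        if skip_yanked and yanked:
            continue
        # Sleep just long enough to keep the average rate under the cap.
        wait = MIN_INTERVAL - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)
        last = time.monotonic()
        fetch_crate(name, vers, cksum, dest_dir)
```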

I would rather be a "good citizen" and not create anything like a DoS attack.

So my ask is this:
Is there a better (preferred) way for me to pull all of these,
or do I just throttle the requests to a few per second?

I am ready to "release the hounds" upon the beast, but I'd prefer to politely ask first.

Thanks.


🏷️ Rust_feed