Use Python to Download Multiple Files (or URLs) in Parallel

Get more data in less time

We live in a world of big data. Often, big data is organized as a large collection of small datasets (i.e., one large dataset composed of multiple files). Obtaining these data is often frustrating because of the download (or acquisition) burden. Fortunately, with a little code, there are ways to automate and speed up file download and acquisition.

Automating file downloads can save a lot of time. There are several ways to automate file downloads with Python. The easiest way to download files is using a simple Python loop to iterate through a list of URLs to download. This serial approach can work well with a few small files, but if you are downloading many files or large files, you’ll want to use a parallel approach to maximize your computational resources.

With a parallel file download routine, you can better use your computer’s resources to download multiple files simultaneously, saving you time. This tutorial demonstrates how to develop a generic file download function in Python and apply it to download multiple files with serial and parallel approaches. Aside from the third-party requests package, the code in this tutorial uses only modules from the Python standard library.

Import modules

For this example, we only need the requests and multiprocessing Python modules to download files in parallel. The multiprocessing module is part of the Python standard library, but requests is a third-party package, so you may need to install it first (for example, with pip install requests).

We’ll also import the time module to keep track of how long it takes to download individual files and compare performance between the serial and parallel download routines. The time module is also part of the Python standard library.

import requests
import time
from multiprocessing import cpu_count
from multiprocessing.pool import ThreadPool

Define URLs and filenames

I’ll demonstrate parallel file downloads in Python using gridMET NetCDF files that contain daily precipitation data for the United States.

Here, I specify the URLs to four files in a list. In other applications, you may programmatically generate a list of files to download.

urls = ['https://www.northwestknowledge.net/metdata/data/pr_1979.nc',
        'https://www.northwestknowledge.net/metdata/data/pr_1980.nc',
        'https://www.northwestknowledge.net/metdata/data/pr_1981.nc',
        'https://www.northwestknowledge.net/metdata/data/pr_1982.nc']
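
If you need more than a handful of files, you can generate this list programmatically instead of typing each URL. Here is a minimal sketch, assuming the files follow the pr_<year>.nc naming pattern shown above:

# Build the URL list for a range of years (assumes the pr_<year>.nc pattern above)
base_url = 'https://www.northwestknowledge.net/metdata/data/pr_{}.nc'
urls = [base_url.format(year) for year in range(1979, 1983)]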

Each URL must be associated with its download location. Here, I’m downloading the files to the Windows ‘Downloads’ directory. I’ve hardcoded the filenames in a list for simplicity and transparency. Depending on your application, you may want to write code that parses each input URL and builds the download path for you (a sketch of this follows the list below).

fns = [r'C:\Users\konrad\Downloads\pr_1979.nc',
       r'C:\Users\konrad\Downloads\pr_1980.nc',
       r'C:\Users\konrad\Downloads\pr_1981.nc',
       r'C:\Users\konrad\Downloads\pr_1982.nc']
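
Rather than hardcoding every filename, one option is to build each download path from the last segment of its URL. This is just a sketch using the standard-library pathlib; the download directory is an example and should be changed to suit your system. Note that open() accepts Path objects, so these paths work directly with the download function defined below.

from pathlib import Path

# Example download directory (adjust for your system)
download_dir = Path(r'C:\Users\konrad\Downloads')

# Use the last part of each URL (e.g. 'pr_1979.nc') as the filename
fns = [download_dir / url.split('/')[-1] for url in urls]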

The multiprocessing map methods we’ll use require the parallel function to take only one argument (there are some workarounds, but we won’t get into that here). To download a file we need to pass two pieces of information, a URL and a filename, so we’ll zip the urls and fns lists together to get a list of tuples. Each tuple in the list will contain two elements: a URL and the download filename for that URL. This way we can pass a single argument (the tuple) that contains both pieces of information. Note that in Python 3, zip returns a one-shot iterator, so we wrap it in list() to allow inputs to be reused by both the serial and parallel examples below.

inputs = list(zip(urls, fns))

Function to download a URL

Now that we have specified the URLs to download and their associated filenames, we need a function to download the URLs (download_url).

We’ll pass one argument (args) to download_url. This argument will be an iterable (list or tuple) where the first element is the URL to download (url) and the second element is the filename (fn). The elements are unpacked into variables (url and fn) for readability.

Next, create a try statement in which the URL is retrieved and its contents written to the newly created file. When the file is written, the URL and download time are returned. If an exception occurs, a message is printed.

The download_url function is the meat of our code. It does the actual work of downloading and file creation. We can now use this function to download files in serial (using a loop) and in parallel. Let’s go through those examples.

def download_url(args):
    """Download a single URL to a file; args is a (url, filename) tuple."""
    t0 = time.time()
    url, fn = args[0], args[1]
    try:
        # Retrieve the URL and write its contents to the file
        r = requests.get(url)
        with open(fn, 'wb') as f:
            f.write(r.content)
        return url, time.time() - t0
    except Exception as e:
        print('Exception in download_url():', e)
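
The function above reads the entire response into memory (r.content) before writing it to disk. For very large files, you may prefer a variant that streams the response in chunks; the sketch below is an optional alternative, not part of the original routine:

def download_url_streamed(args):
    # Same interface as download_url, but streams the response to disk
    # in 1 MB chunks instead of holding the whole file in memory.
    t0 = time.time()
    url, fn = args[0], args[1]
    try:
        with requests.get(url, stream=True) as r:
            r.raise_for_status()
            with open(fn, 'wb') as f:
                for chunk in r.iter_content(chunk_size=1024 * 1024):
                    f.write(chunk)
        return url, time.time() - t0
    except Exception as e:
        print('Exception in download_url_streamed():', e)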

Download multiple files with a Python loop

To download the list of URLs to the associated files, loop through the iterable (inputs) that we created, passing each element to download_url. After each download completes, we print the downloaded URL and the time it took to download.

The total time to download all URLs will print after all downloads have been completed.

t0 = time.time()
for i in inputs:
    result = download_url(i)
    print('url:', result[0], 'time:', result[1])
print('Total time:', time.time() - t0)

Output:

url: https://www.northwestknowledge.net/metdata/data/pr_1979.nc time: 16.381176710128784
url: https://www.northwestknowledge.net/metdata/data/pr_1980.nc time: 11.475878953933716
url: https://www.northwestknowledge.net/metdata/data/pr_1981.nc time: 13.059367179870605
url: https://www.northwestknowledge.net/metdata/data/pr_1982.nc time: 12.232381582260132
Total time: 53.15849542617798

It took between 11 and 16 seconds to download the individual files. The total download time was a little less than one minute. Your download times will vary based on your specific network connection.

Let’s compare this serial (loop) approach to the parallel approach below.

Download multiple files in parallel with Python

To start, create a function (download_parallel) to handle the parallel downloads. It will take one argument: an iterable containing URLs and associated filenames (the inputs variable we created earlier).

Next, get the number of CPUs available for processing. This will determine the number of threads to run in parallel; here we use one fewer thread than the CPU count.

Now use the multiprocessing ThreadPool to map the inputs to the download_url function. Here we use the imap_unordered method of ThreadPool, passing it the download_url function and its input arguments (the inputs variable). imap_unordered runs download_url concurrently across the specified number of threads and yields results as soon as each download finishes, not necessarily in the order the inputs were given.

Thus, if we have four files and four threads all files can be downloaded at the same time instead of waiting for one download to finish before the next starts. This can save a considerable amount of processing time.

In the final part of the download_parallel function, the downloaded URLs and the time required to download each URL are printed, along with the total elapsed time.

def download_parallel(args):
    t0 = time.time()
    # Use one fewer thread than the number of available CPUs
    cpus = cpu_count()
    results = ThreadPool(cpus - 1).imap_unordered(download_url, args)
    for result in results:
        print('url:', result[0], 'time (s):', result[1])
    print('Total time:', time.time() - t0)

Once the inputs and download_parallel are defined, the files can be downloaded in parallel with a single line of code.

download_parallel(inputs)

Output:

url: https://www.northwestknowledge.net/metdata/data/pr_1980.nc time (s): 14.641696214675903
url: https://www.northwestknowledge.net/metdata/data/pr_1981.nc time (s): 14.789752960205078
url: https://www.northwestknowledge.net/metdata/data/pr_1979.nc time (s): 15.052601337432861
url: https://www.northwestknowledge.net/metdata/data/pr_1982.nc time (s): 23.287317752838135
Total time: 23.32273244857788

Notice that it took longer to download each individual file with the parallel approach. This may be a result of changing network speed, or of the overhead required to map the downloads to their respective threads. Even though the individual files took longer to download, the parallel method resulted in a greater than 50% decrease in total download time.

You can see how parallel processing can greatly reduce processing time for multiple files. As the number of files increases, you will save much more time by using a parallel download approach.
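
If you prefer not to manage the thread pool directly, the standard library’s concurrent.futures module offers a similar thread-based approach. The sketch below is an alternative to the multiprocessing ThreadPool used in this tutorial, not a replacement for it:

from concurrent.futures import ThreadPoolExecutor

def download_parallel_futures(args):
    # Alternative to download_parallel using concurrent.futures;
    # executor.map yields results in the order of the inputs.
    t0 = time.time()
    with ThreadPoolExecutor(max_workers=cpu_count() - 1) as executor:
        for result in executor.map(download_url, args):
            print('url:', result[0], 'time (s):', result[1])
    print('Total time:', time.time() - t0)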

Conclusion

Automating file downloads in your development and analysis routines can save you a lot of time. As demonstrated by this tutorial, implementing a parallel download routine can greatly decrease file acquisition time when you need many files or large files.

Originally published at https://opensourceoptions.com.
