Downloading zip files from the web and extracting csv files
Problem Description
- Create the directory downloads if it doesn't exist
- Download the files sitting at the specified HTTP URLs
- They should be downloaded into the downloads folder
- Each file is a zip; extract the csv from the zip and delete the zip file
- Split the filename out of the URL, so each file keeps its original name
- Download the files asynchronously
Answer
First, create the directory if it doesn't exist. When you run the code multiple times, the directory and files may already exist.
If you use os.Mkdir(outdir, os.ModePerm)
it does nothing when the directory already exists. But if the downloaded files already exist, the program will create copies. So let us delete the folder and its contents if they exist.
outdir := "./downloads"
// remove the directory and its contents if it already exists
// (os.Stat returns a nil error when the path exists)
if _, err := os.Stat(outdir); err == nil {
    os.RemoveAll(outdir)
}
os.Mkdir(outdir, os.ModePerm)
Let us put all the files to download into a string slice:
durls := []string{
    "https://divvy-tripdata.s3.amazonaws.com/Divvy_Trips_2018_Q4.zip",
    "https://divvy-tripdata.s3.amazonaws.com/Divvy_Trips_2019_Q1.zip",
    "https://divvy-tripdata.s3.amazonaws.com/Divvy_Trips_2019_Q2.zip",
    "https://divvy-tripdata.s3.amazonaws.com/Divvy_Trips_2019_Q3.zip",
    "https://divvy-tripdata.s3.amazonaws.com/Divvy_Trips_2220_Q1.zip",
    "https://divvy-tripdata.s3.amazonaws.com/Divvy_Trips_2019_Q4.zip",
    "https://divvy-tripdata.s3.amazonaws.com/Divvy_Trips_2020_Q1.zip",
}
We will use github.com/cavaliergopher/grab/v3 to download the files from the net. It supports concurrent downloads, batch downloads, and many more features.
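For reference, here is the import block the snippets in this post assume; everything comes from the standard library except grab, and the package main declaration is an assumption for a standalone program:

package main

import (
    "archive/zip"
    "fmt"
    "io"
    "log"
    "net/url"
    "os"

    "github.com/cavaliergopher/grab/v3"
)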
You can download a single file using this code:
client := grab.NewClient()
// "." saves the file into the current directory
req, _ := grab.NewRequest(".", url)
resp := client.Do(req)
But we want to download multiple files concurrently, and grab supports that with DoBatch. Let us use it.
To do that, we first need to construct a batch of requests.
reqs := make([]*grab.Request, 0)
client := grab.NewClient()
for _, durl := range durls {
    // extract the filename from the URL
    u, err := url.ParseRequestURI(durl)
    if err != nil {
        log.Fatal("Error parsing URL: ", err)
    }
    // "/" is already part of u.Path
    fn := outdir + u.Path
    req, err := grab.NewRequest(fn, durl)
    if err != nil {
        log.Fatal("Error while creating new grab request: ", err)
    }
    reqs = append(reqs, req)
}
Now we can issue a batch request to download the files, four at a time:
respch := client.DoBatch(4, reqs...)
Grab uses Go channels to download files concurrently; DoBatch returns a channel of responses. We will process (unzip) each file once its download completes:
for resp := range respch {
    // Err blocks until this transfer is complete and returns any error
    err := resp.Err()
    if err != nil {
        log.Println("Error getting ", resp.Filename, ": ", err)
        continue
    }
    fmt.Printf("Downloaded %s to %s\n", resp.Request.URL(), resp.Filename)
}
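As a sketch, the extraction step can be wired into this same loop, running as soon as each download succeeds; unzipAndRemove is a hypothetical helper name I am assuming here, with a possible implementation shown at the end of this post:

for resp := range respch {
    if err := resp.Err(); err != nil {
        log.Println("Error getting ", resp.Filename, ": ", err)
        continue
    }
    fmt.Printf("Downloaded %s to %s\n", resp.Request.URL(), resp.Filename)
    // extract the csv and delete the zip (see unzipAndRemove below)
    if err := unzipAndRemove(resp.Filename, outdir); err != nil {
        log.Println("Error extracting ", resp.Filename, ": ", err)
    }
}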
Now we need to unzip the archives and extract the csv files.
Each zip file can contain many entries, so we need to loop through them; an entry could also be a directory.
The steps will go like this:
• open zip file
• loop through all files
• if the file is a directory, create it
• if it is a file, copy it
You can open the zip file with r, err := zip.OpenReader(zipfile) (and close it later with defer r.Close())
Let us now process this zip file
for _, f := range r.File {
    rc, _ := f.Open()
    // build the output path for this zip entry
    newFilePath := fmt.Sprintf("%s/%s", outputdir, f.Name)
    if f.FileInfo().IsDir() {
        os.MkdirAll(newFilePath, 0777)
        rc.Close()
        continue
    }
    uncompressedFile, _ := os.Create(newFilePath)
    io.Copy(uncompressedFile, rc)
    uncompressedFile.Close()
    rc.Close()
}
For clarity, I have removed error handling.
If you want to delete the zip file afterwards, you can do so with os.Remove(zipfile)
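Tying the extraction and cleanup together, here is a minimal sketch with the error handling restored; the function name unzipAndRemove is my own choice, not from any library:

// unzipAndRemove extracts every entry of zipfile into outputdir
// and deletes the zip on success.
func unzipAndRemove(zipfile, outputdir string) error {
    r, err := zip.OpenReader(zipfile)
    if err != nil {
        return err
    }
    defer r.Close()
    for _, f := range r.File {
        newFilePath := fmt.Sprintf("%s/%s", outputdir, f.Name)
        if f.FileInfo().IsDir() {
            if err := os.MkdirAll(newFilePath, 0777); err != nil {
                return err
            }
            continue
        }
        rc, err := f.Open()
        if err != nil {
            return err
        }
        uncompressedFile, err := os.Create(newFilePath)
        if err != nil {
            rc.Close()
            return err
        }
        _, err = io.Copy(uncompressedFile, rc)
        uncompressedFile.Close()
        rc.Close()
        if err != nil {
            return err
        }
    }
    return os.Remove(zipfile)
}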
That's it. This is the first of many #dataengineering problems to solve in #golang.
If you have any comments, please reply to this thread from any fediverse application. You can read other Data Engineering Solutions Using Golang & DuckDB too.