Downloading zip files from the web and extracting CSV files

Problem Description

Download a set of zip files from the web and extract the CSV files they contain, using Go.

Answer

First, create the directory if it doesn't exist. When you run the code multiple times, the directory and files may already exist.

If the directory already exists, os.Mkdir(outdir, os.ModePerm) returns an error (which the code below simply ignores), but any files from a previous run would still be inside it and the program would end up creating copies. So let us delete the folder and its files first, if it exists:

outdir := "./downloads"
// delete the directory if it exists; os.Stat returns a nil error when the path exists
if _, err := os.Stat(outdir); err == nil {
	os.RemoveAll(outdir)
}

os.Mkdir(outdir, os.ModePerm)

Let us put all the files to download into a string slice:

durls := []string{
	"https://divvy-tripdata.s3.amazonaws.com/Divvy_Trips_2018_Q4.zip",
	"https://divvy-tripdata.s3.amazonaws.com/Divvy_Trips_2019_Q1.zip",
	"https://divvy-tripdata.s3.amazonaws.com/Divvy_Trips_2019_Q2.zip",
	"https://divvy-tripdata.s3.amazonaws.com/Divvy_Trips_2019_Q3.zip",
	"https://divvy-tripdata.s3.amazonaws.com/Divvy_Trips_2220_Q1.zip",
	"https://divvy-tripdata.s3.amazonaws.com/Divvy_Trips_2019_Q4.zip",
	"https://divvy-tripdata.s3.amazonaws.com/Divvy_Trips_2020_Q1.zip",
}

We will use github.com/cavaliergopher/grab/v3 to download files from the net.

It supports concurrent downloads, batch downloads and many more features.

You can download a single file using this code:

client := grab.NewClient()
req, _ := grab.NewRequest(".", url) // save into the current directory
resp := client.Do(req)             // returns once the transfer has started

But we want to download multiple files concurrently. Grab supports this with DoBatch, so let us use it.

To do that, we need to construct a batch of requests:

reqs := make([]*grab.Request, 0)
client := grab.NewClient()
for _, durl := range durls {
	// extract the filename from the URL
	u, err := url.ParseRequestURI(durl)
	if err != nil {
		log.Fatal("Error parsing URL: ", err)
	}

	// "/" is already part of u.Path
	fn := outdir + u.Path

	req, err := grab.NewRequest(fn, durl)
	if err != nil {
		log.Fatal("Error while creating new grab request: ", err)
	}
	reqs = append(reqs, req)
}

Now we can issue the batch request to download the files:

respch := client.DoBatch(4, reqs...) // up to 4 downloads run concurrently

Grab uses Go channels to download files concurrently; DoBatch returns a channel that delivers each response as its download completes. We will process (unzip) each file once it is finished:

for resp := range respch {
	if err := resp.Err(); err != nil {
		log.Println("Error getting", resp.Filename, ":", err)
		continue
	}
	fmt.Printf("Downloaded %s to %s\n", resp.Request.URL(), resp.Filename)
}
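The loop above only reports results. To actually unzip each file as soon as it finishes, you can hand the completed download to a helper from inside the same loop. A minimal sketch; unzip here is a hypothetical helper whose body is worked out in the next part:

for resp := range respch {
	if err := resp.Err(); err != nil {
		log.Println("Error getting", resp.Filename, ":", err)
		continue
	}
	fmt.Printf("Downloaded %s to %s\n", resp.Request.URL(), resp.Filename)

	// extract the finished archive; unzip is a hypothetical helper
	// sketched later in this post
	if err := unzip(resp.Filename, outdir); err != nil {
		log.Println("Error unzipping", resp.Filename, ":", err)
	}
}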

Now we need to unzip the archives and extract the CSV files.

Each zip file can contain many entries, so we need to loop through them. An entry could be a directory, too.

The steps will go like this:
• open zip file
• loop through all files
• if the file is a directory, create it
• if it is a file, copy it

You can open the zip file with r, err := zip.OpenReader(zipfile).
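For example (a minimal sketch; zipfile is assumed to hold the path of one downloaded archive, such as resp.Filename from the loop above):

r, err := zip.OpenReader(zipfile)
if err != nil {
	log.Fatal("Error opening zip file: ", err)
}
defer r.Close()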

Let us now process this zip file:

for _, f := range r.File {
	rc, _ := f.Open()

	// define the new file path
	newFilePath := fmt.Sprintf("%s/%s", outputdir, f.Name)

	if f.FileInfo().IsDir() {
		os.MkdirAll(newFilePath, 0777)
		rc.Close()
		continue
	}

	uncompressedFile, _ := os.Create(newFilePath)
	io.Copy(uncompressedFile, rc)

	// close inside the loop; defer here would keep every file open
	// until the surrounding function returns
	uncompressedFile.Close()
	rc.Close()
}

For clarity, I have removed error handling.
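If you want the same logic with the error handling put back, here is a minimal sketch wrapped as a standalone helper; the unzip name and signature are my own, not from any library:

// unzip extracts every entry of zipfile into outputdir.
func unzip(zipfile, outputdir string) error {
	r, err := zip.OpenReader(zipfile)
	if err != nil {
		return err
	}
	defer r.Close()

	for _, f := range r.File {
		rc, err := f.Open()
		if err != nil {
			return err
		}
		newFilePath := fmt.Sprintf("%s/%s", outputdir, f.Name)

		if f.FileInfo().IsDir() {
			if err := os.MkdirAll(newFilePath, 0777); err != nil {
				rc.Close()
				return err
			}
			rc.Close()
			continue
		}

		uncompressedFile, err := os.Create(newFilePath)
		if err != nil {
			rc.Close()
			return err
		}
		_, err = io.Copy(uncompressedFile, rc)
		uncompressedFile.Close()
		rc.Close()
		if err != nil {
			return err
		}
	}
	return nil
}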

If you want to delete the zip file after extraction, you can do so with os.Remove(zipfile).
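For example, to clean up every archive under outdir once extraction is done (a sketch; the *.zip glob pattern assumes the flat download layout used above):

zips, err := filepath.Glob(outdir + "/*.zip")
if err != nil {
	log.Fatal("Error listing zip files: ", err)
}
for _, zipfile := range zips {
	if err := os.Remove(zipfile); err != nil {
		log.Println("Error removing", zipfile, ":", err)
	}
}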

That's it. This is the first of many #dataengineering problems to solve in #golang.

If you have any comments, please reply to this thread from any fediverse application. You can read other Data Engineering Solutions Using Golang & DuckDB too.