Web Scraping

Problem Description

Answer

We will use github.com/PuerkitoBio/goquery to parse html. goquery is popular because it is comprehensive and efficient in handling all type of html pages.

response, err := http.Get(server)
if err != nil {
	log.Fatal("Error fetching URL:", err)
}
defer response.Body.Close()

This particular page contains a table with rows and columns.

The html looks like this:

<table>
   <tbody><tr><th><a href="?C=N;O=D">Name</a></th><th><a href="?C=M;O=A">Last modified</a></th><th><a href="?C=S;O=A">Size</a></th><th><a href="?C=D;O=A">Description</a></th></tr>
   <tr><th colspan="4"><hr></th></tr>
<tr><td><a href="/data/local-climatological-data/access/">Parent Directory</a></td><td>&nbsp;</td><td align="right">  - </td><td>&nbsp;</td></tr>
<tr><td><a href="01001099999.csv">01001099999.csv</a></td><td align="right">2024-01-19 09:51  </td><td align="right">4.0M</td><td>&nbsp;</td></tr>
<tr><td><a href="01001499999.csv">01001499999.csv</a></td><td align="right">2024-01-19 10:33  </td><td align="right">2.8M</td><td>&nbsp;</td></tr>
</table>

As per the problem, we are not suppose to give the file name directly but has to loop through to find out the file that was modified (2nd column) on a particular time.

So we need to loop through every row and columns within the row to look for the modified time.

dateToFind := "2024-01-19 09:48"
fileToDownload := ""
// Find all tables in the HTML
doc.Find("table").Each(func(index int, tableHtml *goquery.Selection) {
	// Find all rows in the table
	tableHtml.Find("tr").Each(func(rowIndex int, rowHtml *goquery.Selection) {
		// if we found what we need; let's get out
		if fileToDownload != "" {
			return
		}
		// Initialize slice to store values of each row
		var rowData []string
		// Loop through all columns (td elements) in the row
		rowHtml.Find("td").Each(func(colIndex int, colHtml *goquery.Selection) {
			// Get the text content of the column
			value := strings.TrimSpace(colHtml.Text())
			// Add the value to the rowData slice
			rowData = append(rowData, value)
			// Check if the value matches the particular string
			if value == dateToFind {
				// go array index starts with 0
				fileToDownload = rowData[colIndex-1]
				// Exit the loop or return from this function if needed
				return
			}
		})
	})
})

Now that we have the file name by parsing the html page, we can download the file:

resp, err := http.Get(urlToDownload)
if err != nil {
	log.Fatal("Error downloading file: ", err)
}
defer resp.Body.Close()

out, err := os.Create(fileToDownload)
if err != nil {
	log.Fatal("Error creating file: ", err)
}
defer out.Close()
io.Copy(out, resp.Body)

We have downloaded the csv file. We have to load it up into duckdb and find the max value of a column. If you've not already seen the How to use DuckDB with Golang, this is the time to read it.

We can load the csv file and create a table in a single command of duckdb.

_, err = db.Exec(`CREATE TABLE noaatbl AS FROM '01003099999.csv';`)
if err != nil {
	log.Fatal("Error creating table: ", err)
}

We have to query this table for max of HourlyDryBulbTemperature and then pick all the rows matching that value. It is an aggregation query.

select station, date, HourlyDryBulbTemperature from noaatbl 
where HourlyDryBulbTemperature = 
(select max(HourlyDryBulbTemperature) from noaatbl);

Let us execute this using go.

var (
	station string
	date    time.Time
	temp    int
)

rows, err := db.Query(`
select station, date, HourlyDryBulbTemperature from noaatbl 
where HourlyDryBulbTemperature = 
(select max(HourlyDryBulbTemperature) from noaatbl);
`)
if err != nil {
	log.Fatal(err)
}
defer rows.Close()

for rows.Next() {
	err := rows.Scan(&station, &date, &temp)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(station, date, temp)
}

err = rows.Err()
if err != nil {
	log.Fatal("Error reading rows: ", err)
}

That's it. We parsed the html page, looked for a specific last modified time, downloaded the file, loaded it into duckdb, and got the rows that matched max of a column.

You can view all the Data Engineering Solutions Using Golang & DuckDB. If you have comments, please comment using fediverse app: @jjude@cpn.jjude.com