Web Scraping
Problem Description
- You need to download a file of weather data from a government website. The files are at the following location: https://www.ncei.noaa.gov/data/local-climatological-data/access/2021/
- Analyze the page and look for the file whose Last Modified timestamp is 2022-02-07 14:03. You can't cheat and look up the file name yourself: you must scrape this webpage and find the file name corresponding to this timestamp, 2022-02-07 14:03.
- Once you have found and downloaded the correct file, load it into DuckDB and find the record(s) with the highest HourlyDryBulbTemperature. Print these record(s) to the command line.
Answer
We will use github.com/PuerkitoBio/goquery to parse the HTML. goquery is popular because it offers a comprehensive, jQuery-like API and handles all kinds of HTML pages efficiently.
server := "https://www.ncei.noaa.gov/data/local-climatological-data/access/2021/"

// Fetch the directory listing page
response, err := http.Get(server)
if err != nil {
    log.Fatal("Error fetching URL: ", err)
}
defer response.Body.Close()
This particular page contains a table with rows and columns.
The html looks like this:
<table>
<tbody><tr><th><a href="?C=N;O=D">Name</a></th><th><a href="?C=M;O=A">Last modified</a></th><th><a href="?C=S;O=A">Size</a></th><th><a href="?C=D;O=A">Description</a></th></tr>
<tr><th colspan="4"><hr></th></tr>
<tr><td><a href="/data/local-climatological-data/access/">Parent Directory</a></td><td> </td><td align="right"> - </td><td> </td></tr>
<tr><td><a href="01001099999.csv">01001099999.csv</a></td><td align="right">2024-01-19 09:51 </td><td align="right">4.0M</td><td> </td></tr>
<tr><td><a href="01001499999.csv">01001499999.csv</a></td><td align="right">2024-01-19 10:33 </td><td align="right">2.8M</td><td> </td></tr>
</table>
As per the problem, we are not supposed to hard-code the file name; we have to loop through the rows to find the file that was modified (second column) at the given time. So we loop through every row, and through the columns within each row, looking for the modified time.
dateToFind := "2022-02-07 14:03" // the Last Modified timestamp from the problem
fileToDownload := ""

// Find all tables in the HTML
doc.Find("table").Each(func(index int, tableHtml *goquery.Selection) {
    // Find all rows in the table
    tableHtml.Find("tr").Each(func(rowIndex int, rowHtml *goquery.Selection) {
        // If we already found the file, skip the remaining rows
        if fileToDownload != "" {
            return
        }
        // Collect the cell values of this row
        var rowData []string
        // Loop through all columns (td elements) in the row
        rowHtml.Find("td").Each(func(colIndex int, colHtml *goquery.Selection) {
            // Get the trimmed text content of the column
            value := strings.TrimSpace(colHtml.Text())
            rowData = append(rowData, value)
            // The "Last modified" cell comes right after the file-name
            // cell, so on a match the file name is the previous value
            if value == dateToFind && colIndex > 0 {
                fileToDownload = rowData[colIndex-1]
                return
            }
        })
    })
})
Now that we have the file name from the parsed HTML page, we can download the file:

urlToDownload := server + fileToDownload

resp, err := http.Get(urlToDownload)
if err != nil {
    log.Fatal("Error downloading file: ", err)
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
    log.Fatal("Unexpected status: ", resp.Status)
}

out, err := os.Create(fileToDownload)
if err != nil {
    log.Fatal("Error creating file: ", err)
}
defer out.Close()

if _, err := io.Copy(out, resp.Body); err != nil {
    log.Fatal("Error saving file: ", err)
}
We have downloaded the csv file. Now we have to load it into duckdb and find the max value of a column. If you've not already read How to use DuckDB with Golang, this is the time to do so.
We can load the csv file and create a table in a single duckdb command.
_, err = db.Exec(fmt.Sprintf(`CREATE TABLE noaatbl AS FROM '%s';`, fileToDownload))
if err != nil {
    log.Fatal("Error creating table: ", err)
}
We have to query this table for the maximum HourlyDryBulbTemperature and then pick all the rows matching that value, using an aggregate subquery.
select station, date, HourlyDryBulbTemperature from noaatbl
where HourlyDryBulbTemperature =
(select max(HourlyDryBulbTemperature) from noaatbl);
Let us execute this using Go.

var (
    station string
    date    time.Time
    temp    int
)

rows, err := db.Query(`
    select station, date, HourlyDryBulbTemperature from noaatbl
    where HourlyDryBulbTemperature =
        (select max(HourlyDryBulbTemperature) from noaatbl);
`)
if err != nil {
    log.Fatal(err)
}
defer rows.Close()

for rows.Next() {
    if err := rows.Scan(&station, &date, &temp); err != nil {
        log.Fatal(err)
    }
    fmt.Println(station, date, temp)
}
if err := rows.Err(); err != nil {
    log.Fatal("Error reading rows: ", err)
}
That's it. We parsed the HTML page, looked for a specific last-modified time, downloaded the file, loaded it into DuckDB, and printed the rows that matched the maximum of a column.
You can view all the Data Engineering Solutions Using Golang & DuckDB. If you have comments, please reply via the fediverse: @jjude@cpn.jjude.com