Reading huge xlsx files with Go
Let's look at some ideas for reading xlsx files with a lot of rows using Go. The use case is this: we want to build the next Waze, and we need to read this file of traffic accidents in Medellín, Colombia in order to populate our database with geolocated information.
Installing excelize
Excelize seems like a great candidate for this proof of concept. To install it, just use go get:
go get github.com/360EntSecGroup-Skylar/excelize
Basic Go script
Let's create a basic main.go file that adds the excelize dependency:
// main.go
package main

import (
    "github.com/360EntSecGroup-Skylar/excelize"
)

func main() {
    ...
}
Read a file
We are going to get the .xlsx path from the command line and open it with the excelize OpenFile method:
package main

import (
    "fmt"
    "os"

    "github.com/360EntSecGroup-Skylar/excelize"
)

func main() {
    filename := os.Args[1]
    fmt.Println(filename)
    file, err := excelize.OpenFile(filename)
    if err != nil {
        fmt.Println(err)
        return
    }
    _ = file // we will use file in the next step
}
At this point we are able to compile and test our main.go file:
# compile
go build main.go
# execute
./main /path_of_your_xlsx_file/
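Note that main reads os.Args[1] directly, so running the binary without a path argument will panic with an index out of range error. A minimal guard, placed at the top of main before reading the filename, could look like this:
// main.go
...
// hypothetical guard, not in the original program
if len(os.Args) < 2 {
    fmt.Println("usage: ./main <path_to_xlsx_file>")
    return
}
...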
Get file preview
Now we can add some code to get a preview (the first 4 rows of the file):
// main.go
...
rows, err := file.GetRows("Sheet1")
if err != nil {
    fmt.Println(err)
    return
}
for _, row := range rows[0:4] {
    for _, colCell := range row {
        fmt.Print(colCell, "\t")
    }
    fmt.Println()
}
...
Compiling and executing the main.go file again, we get something like this:
./main ...
X Y OBJECTID RADICADO FECHA HORA DIA PERIODO CLASE DIRECCION DIRECCION_ENC CBML TIPO_GEOCOD GRAVEDAD BARRIO COMUNA DISENO DIA_NOMBRE MES MES_NOMBRE X_MAGNAMED Y_MAGNAMED LONGITUD LATITUD
-75.6058873684679 6.22275163091513 753145 1705201 2020/01/26 00:00:00+00 11:30:00 26 2020 Caida Ocupante CR 83 CL 17 CR 083 017 000 00000 1612 Malla vial HERIDO La Gloria Belén Lote o Predio DOMINGO ....
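A note on memory: GetRows loads the whole sheet into memory at once, which is fine for this dataset but can get heavy for really huge files. excelize also exposes a streaming row iterator; a rough sketch (the exact method signatures may vary between excelize versions) looks like this:
// sketch, not part of the final program: stream rows instead of loading them all
rows, err := file.Rows("Sheet1")
if err != nil {
    fmt.Println(err)
    return
}
for rows.Next() {
    row, err := rows.Columns()
    if err != nil {
        fmt.Println(err)
        continue
    }
    fmt.Println(row)
}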
Reading the file row by row and getting the desired cells
For the scope of this little project we are going to take the following columns from each row: OBJECTID, FECHA (date of the accident), HORA (hour of the accident), LONGITUD and LATITUD. First we are going to create a Go struct (a typed collection of fields) to hold the relevant accident data:
// main.go
package main

import (
    "fmt"
    "os"
    "strconv"
    "time"

    "github.com/360EntSecGroup-Skylar/excelize"
)

type accident struct {
    id        int64
    dateTime  string
    latitude  float64
    longitude float64
}
....
To get cell values while iterating over rows we can use slice indexes for quick access (assuming, of course, that the xlsx rows are uniform and always have the same number of cells), so let's create a new accident for each row:
...
for _, row := range rows[2:6] {
    accidentTime := fmt.Sprintf("%s-%s-%sT%s", row[7], row[18], row[6], row[5])
    accidentID, _ := strconv.ParseInt(row[2], 10, 64)
    latitude, _ := strconv.ParseFloat(row[0], 64)
    longitude, _ := strconv.ParseFloat(row[1], 64)
    acc := accident{id: accidentID, dateTime: accidentTime, latitude: latitude, longitude: longitude}
    fmt.Println(acc)
}
...
Compile and execute; the result should look something like this:
....
{769038 2020-5-29T12:00:00 -75.5997786257898 6.22794491686597}
{766043 2020-6-30T12:00:00 -75.7038158742467 6.22180566397042}
...
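Note that the strconv errors are discarded with the blank identifier to keep the example short. If the file may contain malformed cells, a hedged variant of the OBJECTID parse inside the loop could skip bad rows instead, for example:
...
// sketch: handle the parse error instead of ignoring it
accidentID, err := strconv.ParseInt(row[2], 10, 64)
if err != nil {
    fmt.Printf("skipping row, bad OBJECTID %q: %v\n", row[2], err)
    continue
}
...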
Adding some concurrency
Now that we can read the Excel file row by row, we are going to execute the row processing concurrently. Go provides goroutines to run functions in lightweight threads, so the first step is to extract accident creation into a new function:
// main.go
...
func newAccident(row []string) accident {
    accidentTime := fmt.Sprintf("%s-%s-%sT%s", row[7], row[18], row[6], row[5])
    accidentID, _ := strconv.ParseInt(row[2], 10, 64)
    latitude, _ := strconv.ParseFloat(row[0], 64)
    longitude, _ := strconv.ParseFloat(row[1], 64)
    accident := accident{id: accidentID, dateTime: accidentTime, latitude: latitude, longitude: longitude}
    return accident
}
...
Now we are able to run newAccident asynchronously for each row. To visualize this more easily, add a one-second sleep after each goroutine is launched (otherwise main could exit before the goroutines finish):
// main.go
...
for _, row := range rows[2:6] {
    go newAccident(row)
    time.Sleep(time.Second)
}
...
Waiting for all goroutines to end
Now that row parsing happens in separate goroutines, let's add a mechanism to wait for all of them to finish and to count how many accidents were created. To accomplish this, Go provides sync.WaitGroup. First we are going to create a slice to store all the accident structs:
// main.go
...
func main() {
    var accidents []accident
    ...
The next step is to create a worker function that will:
- Call the newAccident function for a row
- Append the new accident struct to the accidents slice
- Print when the worker starts and finishes
// main.go
// remember to add "sync" to the import block
...
func rowWorker(id int, row []string, accidents *[]accident, waitGroup *sync.WaitGroup) {
    defer waitGroup.Done()
    fmt.Printf("Worker %d starting\n", id)
    *accidents = append(*accidents, newAccident(row))
    fmt.Printf("Worker %d done\n", id)
}
...
Go's defer statement defers the execution of a function until the surrounding function returns; in this case we defer the waitGroup.Done() call (which decrements the wait group counter by one) until the accident struct has been created and appended to the accidents slice. Finally, inside the main() function, we can iterate over all rows and delegate row processing to workers synced by a wait group:
// main.go
...
func main() {
    var waitGroup sync.WaitGroup
    ...
    for index, row := range rows[2:] {
        waitGroup.Add(1)
        go rowWorker(index+1, row, &accidents, &waitGroup)
    }
    waitGroup.Wait()
    fmt.Println(len(accidents))
}
waitGroup.Add(1) adds a positive delta of one to the wait group counter, and waitGroup.Wait() blocks execution until the wait group counter becomes zero; finally we print the size of the accidents slice.
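One caveat: every worker appends to the same accidents slice concurrently, which is a data race in Go. A minimal way to make the append safe, assuming we pass an extra *sync.Mutex down to the worker (this is a sketch, not part of the original program), could look like this:
// main.go (sketch)
...
func rowWorker(id int, row []string, accidents *[]accident, mutex *sync.Mutex, waitGroup *sync.WaitGroup) {
    defer waitGroup.Done()
    acc := newAccident(row)
    mutex.Lock() // serialize access to the shared slice
    *accidents = append(*accidents, acc)
    mutex.Unlock()
}
...
// in main: declare var mutex sync.Mutex and pass &mutex to each worker
go rowWorker(index+1, row, &accidents, &mutex, &waitGroup)
...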
Saving accidents in PostgreSQL
Now we can persist our accident structs in a database; for test purposes we are going to create a sample PostgreSQL database:
createdb new_waze
Now create a new table for accidents:
psql new_waze
CREATE TABLE accidents (
    id serial PRIMARY KEY,
    accident_id TEXT NOT NULL,
    date_time TIMESTAMP NOT NULL,
    latitude DOUBLE PRECISION NOT NULL,
    longitude DOUBLE PRECISION NOT NULL,
    created_at TIMESTAMP NOT NULL
);
To connect to PostgreSQL from Go, let's add the pq driver:
go get github.com/lib/pq
Then, in the main.go file, we can create a function that returns a connection to PostgreSQL:
// main.go
// this requires adding "database/sql", "log" and the blank driver import _ "github.com/lib/pq" to the import block
...
func getConnection() *sql.DB {
    dsn := "postgres://:@127.0.0.1:5432/new_waze?sslmode=disable"
    db, err := sql.Open("postgres", dsn)
    if err != nil {
        log.Fatal(err)
    }
    db.SetMaxOpenConns(32)
    return db
}
...
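Keep in mind that sql.Open only validates its arguments; it does not actually establish a connection. An optional db.Ping() call, placed inside getConnection before returning db, can verify that the database is reachable:
// main.go
...
// optional check, not in the original program
if err := db.Ping(); err != nil {
    log.Fatal(err)
}
...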
The next step is to create a new function to persist each accident struct in the database, so we need to pass it the accident struct and the database connection:
func persistAccident(accident *accident, db *sql.DB) {
    sqlStatement := `
        INSERT INTO accidents (accident_id, date_time, latitude, longitude, created_at)
        VALUES ($1, $2, $3, $4, $5)`
    _, err := db.Exec(sqlStatement, accident.id,
        accident.dateTime, accident.latitude, accident.longitude, time.Now())
    if err != nil {
        fmt.Println(err)
    }
}
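Since this INSERT runs once per row, a possible refinement is to prepare the statement once and reuse it from every worker; *sql.Stmt is safe for concurrent use by multiple goroutines. A sketch under that assumption (not part of the original program):
// main.go (sketch: prepare once, reuse everywhere)
...
stmt, err := db.Prepare(`INSERT INTO accidents (accident_id, date_time, latitude, longitude, created_at)
    VALUES ($1, $2, $3, $4, $5)`)
if err != nil {
    log.Fatal(err)
}
defer stmt.Close()
// then persist a row with:
// stmt.Exec(accident.id, accident.dateTime, accident.latitude, accident.longitude, time.Now())
...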
Finally we need to update our newAccident function in order to persist the accident:
// main.go
func newAccident(row []string, db *sql.DB) accident {
    accidentTime := fmt.Sprintf("%s-%s-%sT%s", row[7], row[18], row[6], row[5])
    accidentID, _ := strconv.ParseInt(row[2], 10, 64)
    latitude, _ := strconv.ParseFloat(row[0], 64)
    longitude, _ := strconv.ParseFloat(row[1], 64)
    accident := accident{id: accidentID, dateTime: accidentTime, latitude: latitude, longitude: longitude}
    persistAccident(&accident, db)
    return accident
}
We also need to update the rowWorker function in order to pass the required arguments to newAccident:
// main.go
func rowWorker(id int, row []string, accidents *[]accident, db *sql.DB, waitGroup *sync.WaitGroup) {
    defer waitGroup.Done()
    fmt.Printf("Worker %d starting\n", id)
    *accidents = append(*accidents, newAccident(row, db))
    fmt.Printf("Worker %d done\n", id)
}
Now it is time to update our main function to:
- Create the db connection
- Set a start time and compute the elapsed time to measure the goroutines' execution time
- Print the execution time
// main.go
func main() {
    var waitGroup sync.WaitGroup
    var accidents []accident
    db := getConnection()
    defer db.Close()
    filename := os.Args[1]
    fmt.Println(filename)
    file, err := excelize.OpenFile(filename)
    if err != nil {
        fmt.Println(err)
        return
    }
    rows, err := file.GetRows("Sheet1")
    if err != nil {
        fmt.Println(err)
        return
    }
    start := time.Now()
    for index, row := range rows[2:] {
        waitGroup.Add(1)
        go rowWorker(index+1, row, &accidents, db, &waitGroup)
    }
    waitGroup.Wait()
    fmt.Println(len(accidents))
    elapsed := time.Since(start)
    log.Printf("Excel processing took %s", elapsed)
}
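Launching one goroutine per row means roughly fifteen thousand goroutines in flight at once. Goroutines are cheap and SetMaxOpenConns(32) already caps the database work, but if that ever becomes a problem, a bounded-concurrency sketch using a buffered channel as a semaphore (again, not part of the original program) could replace the loop above:
// main.go (sketch: limit how many rows are processed at the same time)
...
sem := make(chan struct{}, 64) // at most 64 workers at once
for index, row := range rows[2:] {
    waitGroup.Add(1)
    sem <- struct{}{} // acquire a slot
    go func(id int, r []string) {
        defer func() { <-sem }() // release the slot
        rowWorker(id, r, &accidents, db, &waitGroup)
    }(index+1, row)
}
waitGroup.Wait()
...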
It is time to execute and test our program:
go run main.go <path_to_your_accidents_file>
...
Worker 13865 done
Worker 10856 done
...
2020/08/06 13:29:56 Excel processing took 5.980922785s
As you can see, the program takes around 6 seconds to read 15,177 rows and add them to the database:
psql new_waze
psql (12.3)
Type "help" for help.
new_waze=# select count(*) from accidents;
count
-------
15177
(1 row)
At this point we can say that excelize combined with goroutines can help us a lot to read and process big Excel files in a very small amount of time. The full program code is available here.