Reading huge xlsx files with Go
Let's look at some ideas for reading xlsx files with a lot of rows using Go. The use case is this: we want to build the next Waze, and we need to read this file of traffic accidents in Medellín, Colombia in order to populate our database with geolocated information.
Installing excelize
Excelize seems like a great candidate for this proof of concept. To install it, just use go get:
go get github.com/360EntSecGroup-Skylar/excelize
Basic Go script
Let's create a basic main.go file that adds the excelize dependency:
// main.go
package main

import (
    "github.com/360EntSecGroup-Skylar/excelize"
)

func main() {
    ...
}
Read a file
We are going to get the .xlsx path from the command line and open it with the excelize OpenFile method:
package main

import (
    "fmt"
    "os"

    "github.com/360EntSecGroup-Skylar/excelize"
)

func main() {
    filename := os.Args[1]
    fmt.Println(filename)
    file, err := excelize.OpenFile(filename)
    if err != nil {
        fmt.Println(err)
        return
    }
    _ = file // we will use file in the next step
}
At this point we are able to compile and test our main.go file:
# compile
go build main.go
# execute
./main /path_of_your_xlsx_file/
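Note that main reads os.Args[1] directly, so running the binary without a path argument will panic with an index out of range error. A minimal guard, placed at the top of main before reading the filename, could look like this:
// main.go
...
// hypothetical guard, not in the original program
if len(os.Args) < 2 {
    fmt.Println("usage: ./main <path_to_xlsx_file>")
    return
}
...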
Get file preview
Now we can add some code to get a preview (the first 4 rows of the file):
// main.go
...
rows, err := file.GetRows("Sheet1")
if err != nil {
    fmt.Println(err)
    return
}
for _, row := range rows[0:4] {
    for _, colCell := range row {
        fmt.Print(colCell, "\t")
    }
    fmt.Println()
}
...
Compiling and executing the main.go file again, we get something like this:
./main ...
X Y OBJECTID RADICADO FECHA HORA DIA PERIODO CLASE DIRECCION DIRECCION_ENC CBML TIPO_GEOCOD GRAVEDAD BARRIO COMUNA DISENO DIA_NOMBRE MES MES_NOMBRE X_MAGNAMED Y_MAGNAMED LONGITUD LATITUD
-75.6058873684679 6.22275163091513 753145 1705201 2020/01/26 00:00:00+00 11:30:00 26 2020 Caida Ocupante CR 83 CL 17 CR 083 017 000 00000 1612 Malla vial HERIDO La Gloria Belén Lote o Predio DOMINGO ....
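A note on memory: GetRows loads the whole sheet into memory at once, which is fine for this dataset but can get heavy for really huge files. excelize also exposes a streaming row iterator; a rough sketch (the exact method signatures may vary between excelize versions) looks like this:
// sketch, not part of the final program: stream rows instead of loading them all
rows, err := file.Rows("Sheet1")
if err != nil {
    fmt.Println(err)
    return
}
for rows.Next() {
    row, err := rows.Columns()
    if err != nil {
        fmt.Println(err)
        continue
    }
    fmt.Println(row)
}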
Reading the file row by row and getting the desired cells
For the scope of this little project we are going to take the following columns from each row: OBJECTID, FECHA (date of the accident), HORA (hour of the accident), LONGITUD and LATITUD. First we are going to create a Go struct (a typed collection of fields) to hold the relevant accident data:
// main.go
package main

import (
    "fmt"
    "os"
    "strconv"
    "time"

    "github.com/360EntSecGroup-Skylar/excelize"
)

type accident struct {
    id        int64
    dateTime  string
    latitude  float64
    longitude float64
}
....
To get cell values while iterating over rows we can use slice indexes for quick access (assuming, of course, that the xlsx rows are uniform and always have the same number of cells), so let's create a new accident for each row:
...
for _, row := range rows[2:6] {
    accidentTime := fmt.Sprintf("%s-%s-%sT%s", row[7], row[18], row[6], row[5])
    accidentID, _ := strconv.ParseInt(row[2], 10, 64)
    latitude, _ := strconv.ParseFloat(row[0], 64)
    longitude, _ := strconv.ParseFloat(row[1], 64)
    acc := accident{id: accidentID, dateTime: accidentTime, latitude: latitude, longitude: longitude}
    fmt.Println(acc)
}
...
Compile and execute; the result should look something like this:
....
{769038 2020-5-29T12:00:00 -75.5997786257898 6.22794491686597}
{766043 2020-6-30T12:00:00 -75.7038158742467 6.22180566397042}
...
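Note that the strconv errors are discarded with the blank identifier to keep the example short. If the file may contain malformed cells, a hedged variant of the OBJECTID parse inside the loop could skip bad rows instead, for example:
...
// sketch: handle the parse error instead of ignoring it
accidentID, err := strconv.ParseInt(row[2], 10, 64)
if err != nil {
    fmt.Printf("skipping row, bad OBJECTID %q: %v\n", row[2], err)
    continue
}
...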
Adding some concurrency
Now that we can read the Excel file row by row, we are going to execute the row processing concurrently. Go provides goroutines to run functions in lightweight threads, so the first step is to extract accident creation into a new function:
// main.go
...
func newAccident(row []string) accident {
    accidentTime := fmt.Sprintf("%s-%s-%sT%s", row[7], row[18], row[6], row[5])
    accidentID, _ := strconv.ParseInt(row[2], 10, 64)
    latitude, _ := strconv.ParseFloat(row[0], 64)
    longitude, _ := strconv.ParseFloat(row[1], 64)
    accident := accident{id: accidentID, dateTime: accidentTime, latitude: latitude, longitude: longitude}
    return accident
}
...
Now we are able to run newAccident asynchronously for each row. To visualize this more easily, add a one-second sleep after each goroutine is launched (otherwise main could exit before the goroutines finish):
// main.go
...
for _, row := range rows[2:6] {
    go newAccident(row)
    time.Sleep(time.Second)
}
...
Waiting for all goroutines to end
Now that row parsing happens in separate goroutines, let's add a mechanism to wait for all of them to finish and to count how many accidents were created. To accomplish this, Go provides sync.WaitGroup. First we are going to create a slice to store all the accident structs:
// main.go
...
func main() {
    var accidents []accident
    ...
The next step is to create a worker function that will:
- Call the newAccident function for a row
- Append the new accident struct to the accidents slice
- Print when the worker starts and finishes
// main.go
// remember to add "sync" to the import block
...
func rowWorker(id int, row []string, accidents *[]accident, waitGroup *sync.WaitGroup) {
    defer waitGroup.Done()
    fmt.Printf("Worker %d starting\n", id)
    *accidents = append(*accidents, newAccident(row))
    fmt.Printf("Worker %d done\n", id)
}
...
Go's defer statement defers the execution of a function until the surrounding function returns; in this case we defer the waitGroup.Done() call (which decrements the wait group counter by one) until the accident struct has been created and appended to the accidents slice. Finally, inside the main() function, we can iterate over all rows and delegate row processing to workers synced by a wait group:
// main.go
...
func main() {
    var waitGroup sync.WaitGroup
    ...
    for index, row := range rows[2:] {
        waitGroup.Add(1)
        go rowWorker(index+1, row, &accidents, &waitGroup)
    }
    waitGroup.Wait()
    fmt.Println(len(accidents))
}
waitGroup.Add(1) adds a positive delta of one to the wait group counter, and waitGroup.Wait() blocks execution until the wait group counter becomes zero; finally we print the size of the accidents slice.
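One caveat: every worker appends to the same accidents slice concurrently, which is a data race in Go. A minimal way to make the append safe, assuming we pass an extra *sync.Mutex down to the worker (this is a sketch, not part of the original program), could look like this:
// main.go (sketch)
...
func rowWorker(id int, row []string, accidents *[]accident, mutex *sync.Mutex, waitGroup *sync.WaitGroup) {
    defer waitGroup.Done()
    acc := newAccident(row)
    mutex.Lock() // serialize access to the shared slice
    *accidents = append(*accidents, acc)
    mutex.Unlock()
}
...
// in main: declare var mutex sync.Mutex and pass &mutex to each worker
go rowWorker(index+1, row, &accidents, &mutex, &waitGroup)
...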
Saving accidents in PostgreSQL
Now we can persist our accident structs in a database; for test purposes we are going to create a sample PostgreSQL database:
createdb new_waze
Now create a new table for accidents:
psql new_waze
CREATE TABLE accidents (
    id serial PRIMARY KEY,
    accident_id TEXT NOT NULL,
    date_time TIMESTAMP NOT NULL,
    latitude DOUBLE PRECISION NOT NULL,
    longitude DOUBLE PRECISION NOT NULL,
    created_at TIMESTAMP NOT NULL
);
To connect to PostgreSQL from Go, let's add the pq driver:
go get github.com/lib/pq
Then, in the main.go file, we can create a function that returns a connection to PostgreSQL:
// main.go
// this requires adding "database/sql", "log" and the blank driver import _ "github.com/lib/pq" to the import block
...
func getConnection() *sql.DB {
    dsn := "postgres://:@127.0.0.1:5432/new_waze?sslmode=disable"
    db, err := sql.Open("postgres", dsn)
    if err != nil {
        log.Fatal(err)
    }
    db.SetMaxOpenConns(32)
    return db
}
...
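Keep in mind that sql.Open only validates its arguments; it does not actually establish a connection. An optional db.Ping() call, placed inside getConnection before returning db, can verify that the database is reachable:
// main.go
...
// optional check, not in the original program
if err := db.Ping(); err != nil {
    log.Fatal(err)
}
...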
The next step is to create a new function to persist each accident struct in the database, so we need to pass it the accident struct and the database connection:
func persistAccident(accident *accident, db *sql.DB) {
    sqlStatement := `
        INSERT INTO accidents (accident_id, date_time, latitude, longitude, created_at)
        VALUES ($1, $2, $3, $4, $5)`
    _, err := db.Exec(sqlStatement, accident.id,
        accident.dateTime, accident.latitude, accident.longitude, time.Now())
    if err != nil {
        fmt.Println(err)
    }
}
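Since this INSERT runs once per row, a possible refinement is to prepare the statement once and reuse it from every worker; *sql.Stmt is safe for concurrent use by multiple goroutines. A sketch under that assumption (not part of the original program):
// main.go (sketch: prepare once, reuse everywhere)
...
stmt, err := db.Prepare(`INSERT INTO accidents (accident_id, date_time, latitude, longitude, created_at)
    VALUES ($1, $2, $3, $4, $5)`)
if err != nil {
    log.Fatal(err)
}
defer stmt.Close()
// then persist a row with:
// stmt.Exec(accident.id, accident.dateTime, accident.latitude, accident.longitude, time.Now())
...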
Finally we need to update our newAccident function in order to persist the accident:
// main.go
func newAccident(row []string, db *sql.DB) accident {
    accidentTime := fmt.Sprintf("%s-%s-%sT%s", row[7], row[18], row[6], row[5])
    accidentID, _ := strconv.ParseInt(row[2], 10, 64)
    latitude, _ := strconv.ParseFloat(row[0], 64)
    longitude, _ := strconv.ParseFloat(row[1], 64)
    accident := accident{id: accidentID, dateTime: accidentTime, latitude: latitude, longitude: longitude}
    persistAccident(&accident, db)
    return accident
}
We also need to update the rowWorker function in order to pass the required arguments to newAccident:
// main.go
func rowWorker(id int, row []string, accidents *[]accident, db *sql.DB, waitGroup *sync.WaitGroup) {
    defer waitGroup.Done()
    fmt.Printf("Worker %d starting\n", id)
    *accidents = append(*accidents, newAccident(row, db))
    fmt.Printf("Worker %d done\n", id)
}
Now it is time to update our main function to:
- Create the db connection
- Set a start time and compute the elapsed time to measure the goroutines' execution time
- Print the execution time
// main.go
func main() {
    var waitGroup sync.WaitGroup
    var accidents []accident
    db := getConnection()
    defer db.Close()
    filename := os.Args[1]
    fmt.Println(filename)
    file, err := excelize.OpenFile(filename)
    if err != nil {
        fmt.Println(err)
        return
    }
    rows, err := file.GetRows("Sheet1")
    if err != nil {
        fmt.Println(err)
        return
    }
    start := time.Now()
    for index, row := range rows[2:] {
        waitGroup.Add(1)
        go rowWorker(index+1, row, &accidents, db, &waitGroup)
    }
    waitGroup.Wait()
    fmt.Println(len(accidents))
    elapsed := time.Since(start)
    log.Printf("Excel processing took %s", elapsed)
}
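Launching one goroutine per row means roughly fifteen thousand goroutines in flight at once. Goroutines are cheap and SetMaxOpenConns(32) already caps the database work, but if that ever becomes a problem, a bounded-concurrency sketch using a buffered channel as a semaphore (again, not part of the original program) could replace the loop above:
// main.go (sketch: limit how many rows are processed at the same time)
...
sem := make(chan struct{}, 64) // at most 64 workers at once
for index, row := range rows[2:] {
    waitGroup.Add(1)
    sem <- struct{}{} // acquire a slot
    go func(id int, r []string) {
        defer func() { <-sem }() // release the slot
        rowWorker(id, r, &accidents, db, &waitGroup)
    }(index+1, row)
}
waitGroup.Wait()
...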
It is time to execute and test our program:
go run main.go <path_to_your_accidents_file>
...
Worker 13865 done
Worker 10856 done
...
2020/08/06 13:29:56 Excel processing took 5.980922785s
As you can see, the program takes around 6 seconds to read 15,177 rows and add them to the database:
psql new_waze
psql (12.3)
Type "help" for help.
new_waze=# select count(*) from accidents;
count
-------
15177
(1 row)
At this point we can say that excelize combined with goroutines can help us a lot to read and process big Excel files in a very small amount of time. The full program code is available here.