Reading from Parquet #46

space55 · 2021-01-06T15:29:52Z

Hello,

Are there any plans to support reading a Parquet file into a dataframe? I have a need for this and am evaluating this library to use in an application.

Thanks!

pjebs · 2021-01-07T00:12:17Z

In the past, lots of people have asked for exporting to parquet, which I implemented.
You're the first to ask about importing, but I had put it on my todo list in May.
I won't have time to implement it soon. However, you can submit a PR.

pjebs · 2021-01-07T00:13:38Z

Hmmm. I noticed in my TODO list (#17), there had been 3 thumbs up for that request.

khughitt · 2021-01-17T22:46:30Z

In case it helps, here is some code I wrote to read a parquet file into a DataFrame that you may be able to adapt in the meantime:

package main

import (
	"context"
	"log"
	"runtime"

	dataframe "github.com/rocketlaunchr/dataframe-go"
	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/reader"
)

func loadEntriesParquet(inputParquet string) *dataframe.DataFrame {
	// create local parquet/reader instances
	entriesFr, err := local.NewLocalFileReader(inputParquet)

	if err != nil {
		log.Println("Can't open file", err)
		return nil
	}

	entriesPr, err := reader.NewParquetColumnReader(entriesFr, int64(runtime.NumCPU()))

	if err != nil {
		log.Println("Unable to create parquet reader", err)
		entriesFr.Close()
		return nil
	}

	// determine number of rows in input parquet file
	numRows := int64(entriesPr.GetNumRows())

	// read columns from parquet and use them to construct a DataFrame instance of the
	// same form
	var paths, titles, bodies, accesscounts, accessdates, createddates, deadlines, priorities, archived []interface{}

	paths, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.path", numRows)
	titles, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.title", numRows)
	bodies, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.body", numRows)
	accesscounts, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.accesscount", numRows)
	accessdates, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.lastaccess", numRows)
	createddates, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.datecreated", numRows)
	deadlines, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.deadline", numRows)
	priorities, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.priority", numRows)
	archived, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.archived", numRows)

	entries := dataframe.NewDataFrame(
		dataframe.NewSeriesString("path", nil, paths...),
		dataframe.NewSeriesString("title", nil, titles...),
		dataframe.NewSeriesString("body", nil, bodies...),
		dataframe.NewSeriesInt64("accesscount", nil, accesscounts...),
		dataframe.NewSeriesInt64("lastaccess", nil, accessdates...),
		dataframe.NewSeriesInt64("datecreated", nil, createddates...),
		dataframe.NewSeriesInt64("deadline", nil, deadlines...),
		dataframe.NewSeriesInt64("priority", nil, priorities...),
		dataframe.NewSeriesInt64("archived", nil, archived...),
	)

	entriesPr.ReadStop()
	entriesFr.Close()

	// sort entries by date of creation
	sortKey := []dataframe.SortKey{
		{Key: "datecreated", Desc: true},
	}

	ctx := context.Background()
	entries.Sort(ctx, sortKey)

	return entries
}

A few comments:

I can't claim that this is the most efficient approach, and feedback is welcome, but it should at least do the job.
The function loads a parquet file containing "entries" with an expected format. I left much of the file-specific logic in to show how to handle different variable types.
I also left some logic at the bottom to sort the dataframe once it's been loaded, in case that is useful.

Cheers.

pjebs · 2021-01-19T23:57:28Z

Thanks @khughitt. I need to generalise it so that it works for any parquet data.
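
One possible starting point for generalising, sketched with the standard library only: the column reader in the snippet above hands back each column as []interface{}, so a loader could pick a series type by looking at the dynamic type of the first non-nil value. The helpers columnKind and widenInts are hypothetical names, not part of dataframe-go or parquet-go, and the assumption that values arrive as Go primitives (string, int32, int64, float32, float64, bool) is not confirmed anywhere in this thread.

```go
package main

import "fmt"

// columnKind suggests a series type for a column based on its first
// non-nil value. Hypothetical helper; the returned labels are illustrative.
func columnKind(values []interface{}) string {
	for _, v := range values {
		switch v.(type) {
		case nil:
			continue // missing value, keep looking
		case string:
			return "string"
		case int32, int64:
			return "int64"
		case float32, float64:
			return "float64"
		case bool:
			return "bool"
		default:
			return "generic"
		}
	}
	return "generic" // empty or all-nil column
}

// widenInts converts int32/int64 values to int64 and keeps nil as nil,
// so that one integer series type can hold either width.
func widenInts(values []interface{}) ([]interface{}, error) {
	out := make([]interface{}, len(values))
	for i, v := range values {
		switch x := v.(type) {
		case nil:
			out[i] = nil
		case int32:
			out[i] = int64(x)
		case int64:
			out[i] = x
		default:
			return nil, fmt.Errorf("row %d: unexpected type %T", i, v)
		}
	}
	return out, nil
}

func main() {
	// Stand-in for one column as returned by a parquet column reader.
	col := []interface{}{nil, int32(3), int32(7)}
	fmt.Println(columnKind(col))
	widened, err := widenInts(col)
	fmt.Println(widened, err)
}
```

With something like this, the per-column calls could run in a loop over the file's columns instead of being written out by hand for each field.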

CeciliaCoelho · 2021-02-16T10:16:09Z

@pjebs Did you manage to generalise it? Can't get it to work, getting this error.

pjebs · 2021-02-16T10:20:00Z

@CeciliaCoelho can you show me your code.

I was actually waiting for a response to these Qs: xitongsys/parquet-go#360

CeciliaCoelho · 2021-02-16T10:35:04Z

> @CeciliaCoelho can you show me your code.
>
> I was actually waiting for a response to these Qs: xitongsys/parquet-go#360

Getting a new error now. This was a CSV that I converted to parquet using Python, but I wanted to open and use it in Go for efficiency.
The CSV was like this:

I have this code:

package main

import (
	"context"
	"log"
	"runtime"

	dataframe "github.com/rocketlaunchr/dataframe-go"
	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/reader"
)

func loadEntriesParquet(inputParquet string) *dataframe.DataFrame {
	// create local parquet/reader instances
	entriesFr, err := local.NewLocalFileReader(inputParquet)

	if err != nil {
		log.Println("Can't open file")
	}

	entriesPr, err := reader.NewParquetColumnReader(entriesFr, int64(runtime.NumCPU()))

	if err != nil {
		log.Println("Unable to create parquet reader", err)
	}

	// determine number of rows in input parquet file
	numRows := int64(entriesPr.GetNumRows())

	// read columns from parquet and use them to construct a DataFrame instance of the
	// same form
	var id, name, res, spill, turb, pump []interface{}

	id, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.id", numRows)
	name, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.name", numRows)
	res, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.res", numRows)
	spill, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.spill", numRows)
	turb, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.turb", numRows)
	pump, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.pump", numRows)

	entries := dataframe.NewDataFrame(
		dataframe.NewSeriesString("id", nil, id...),
		dataframe.NewSeriesString("name", nil, name...),
		dataframe.NewSeriesString("res", nil, res...),
		dataframe.NewSeriesInt64("spill", nil, spill...),
		dataframe.NewSeriesInt64("turb", nil, turb...),
		dataframe.NewSeriesInt64("pump", nil, pump...),
	)

	entriesPr.ReadStop()
	entriesFr.Close()

	// sort entries by date of creation
	sortKey := []dataframe.SortKey{
		{Key: "datecreated", Desc: true},
	}

	ctx := context.Background()
	entries.Sort(ctx, sortKey)

	return entries
}

func main() {
	loadEntriesParquet("cascades2.parquet")
}

Now the error I'm getting is this:

pjebs · 2021-02-16T10:43:31Z

The "new" error is because you are trying to sort based on a Series called "datecreated" but you don't have such a field.
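
A small guard against this kind of mismatch, as a minimal sketch: check each sort key against the column names before sorting. The helper missingKeys is a hypothetical name and uses only the standard library; in real code the column list would come from the DataFrame rather than a hard-coded slice.

```go
package main

import "fmt"

// missingKeys returns the sort keys that do not match any column name.
// Hypothetical helper, not part of dataframe-go.
func missingKeys(columns, keys []string) []string {
	known := make(map[string]bool, len(columns))
	for _, c := range columns {
		known[c] = true
	}
	var missing []string
	for _, k := range keys {
		if !known[k] {
			missing = append(missing, k)
		}
	}
	return missing
}

func main() {
	// Column names from the code above; the sort key was copied over
	// from the earlier example and does not exist in this file.
	columns := []string{"id", "name", "res", "spill", "turb", "pump"}
	keys := []string{"datecreated"}

	if m := missingKeys(columns, keys); len(m) > 0 {
		fmt.Println("cannot sort, unknown columns:", m)
		return
	}
	fmt.Println("all sort keys are valid")
}
```

Here the fix is either to sort on a column that exists, such as "id", or to drop the sort step entirely.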

CeciliaCoelho · 2021-02-16T10:46:56Z

> The "new" error is because you are trying to sort based on a Series called "datecreated" but you don't have such a field.

Oh right, didn't notice that. It's running now. Thanks :)
How do I print or access the dataframe? (bet it's a stupid question, sorry I'm a Golang newbie)

pjebs · 2021-02-16T10:48:26Z

The function returns a *dataframe.DataFrame object. You can see examples in the Readme.

However, looking at the code, it doesn't load the data into the DataFrame efficiently. I need to understand that Parquet package better before I can improve the code.

pjebs · 2021-02-17T09:30:26Z

Parquet importing is now supported (experimental): @CeciliaCoelho @khughitt @space55

pjebs closed this as completed Feb 17, 2021
