Reading from Parquet #46

Closed
space55 opened this issue Jan 6, 2021 · 11 comments

Comments

@space55

space55 commented Jan 6, 2021

Hello,

Are there any plans to support reading a Parquet file into a dataframe? I have a need for this and am evaluating this library to use in an application.

Thanks!

@pjebs
Collaborator

pjebs commented Jan 7, 2021

I've had lots of people in the past asking for exporting to Parquet, which I implemented.
You're the first to ask about importing, but I had put it on my todo list in May.
I won't have time to implement it soon. However, you can submit a PR.

@pjebs
Collaborator

pjebs commented Jan 7, 2021

Hmmm. I noticed in my TODO list (#17), there had been 3 thumbs up for that request.

@khughitt

In case it helps, here is some code I wrote to read a parquet file into a DataFrame that you may be able to adapt in the meantime:

package main

import (
	"context"
	"log"
	"runtime"

	dataframe "github.com/rocketlaunchr/dataframe-go"
	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/reader"
)

func loadEntriesParquet(inputParquet string) *dataframe.DataFrame {
	// create local parquet/reader instances
	entriesFr, err := local.NewLocalFileReader(inputParquet)

	if err != nil {
		// without a file handle nothing below can run, so bail out early
		log.Println("Can't open file", err)
		return nil
	}

	entriesPr, err := reader.NewParquetColumnReader(entriesFr, int64(runtime.NumCPU()))

	if err != nil {
		log.Println("Unable to create parquet reader", err)
		entriesFr.Close()
		return nil
	}

	// determine number of rows in input parquet file
	numRows := int64(entriesPr.GetNumRows())

	// read columns from parquet and use them to construct a DataFrame instance of the
	// same form
	var paths, titles, bodies, accesscounts, accessdates, createddates, deadlines, priorities, archived []interface{}

	paths, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.path", numRows)
	titles, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.title", numRows)
	bodies, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.body", numRows)
	accesscounts, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.accesscount", numRows)
	accessdates, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.lastaccess", numRows)
	createddates, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.datecreated", numRows)
	deadlines, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.deadline", numRows)
	priorities, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.priority", numRows)
	archived, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.archived", numRows)

	entries := dataframe.NewDataFrame(
		dataframe.NewSeriesString("path", nil, paths...),
		dataframe.NewSeriesString("title", nil, titles...),
		dataframe.NewSeriesString("body", nil, bodies...),
		dataframe.NewSeriesInt64("accesscount", nil, accesscounts...),
		dataframe.NewSeriesInt64("lastaccess", nil, accessdates...),
		dataframe.NewSeriesInt64("datecreated", nil, createddates...),
		dataframe.NewSeriesInt64("deadline", nil, deadlines...),
		dataframe.NewSeriesInt64("priority", nil, priorities...),
		dataframe.NewSeriesInt64("archived", nil, archived...),
	)

	entriesPr.ReadStop()
	entriesFr.Close()

	// sort entries by date of creation
	sortKey := []dataframe.SortKey{
		{Key: "datecreated", Desc: true},
	}

	ctx := context.Background()
	entries.Sort(ctx, sortKey)

	return entries
}

Few comments:

  1. I can't claim that it is the most efficient approach, and feedback is welcome, but it should at least do the job.
  2. The function loads a parquet dataframe containing "entries", with an expected format. I left a lot of the file-specific logic in there to provide examples of how to handle different variable types.
  3. I also left some logic in the bottom to help sort the dataframe once it's been loaded, in case that is useful.
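On point 2, the column values come back as []interface{}, so it can help to check each element's dynamic type before building a typed series. Below is a minimal sketch using only the standard library. The helper name toInt64s is hypothetical (it is not part of parquet-go or dataframe-go), and the assumption that missing values show up as nil is mine, not something confirmed by either library.

```go
package main

import "fmt"

// toInt64s is a hypothetical helper. It converts a column of untyped values
// into []*int64, keeping a nil pointer wherever the input value is nil
// (assumed here to mean "missing").
func toInt64s(vals []interface{}) ([]*int64, error) {
	out := make([]*int64, len(vals))
	for i, v := range vals {
		switch x := v.(type) {
		case nil:
			// missing value: leave out[i] as a nil pointer
		case int64:
			n := x
			out[i] = &n
		case int32:
			n := int64(x) // widen 32-bit integers
			out[i] = &n
		default:
			// report the offending row instead of silently dropping it
			return nil, fmt.Errorf("row %d: unexpected type %T", i, v)
		}
	}
	return out, nil
}

func main() {
	// stand-in for a column read from a parquet file
	col := []interface{}{int64(3), nil, int32(7)}

	vals, err := toInt64s(col)
	if err != nil {
		fmt.Println("conversion failed:", err)
		return
	}
	for i, p := range vals {
		if p == nil {
			fmt.Printf("row %d: missing\n", i)
			continue
		}
		fmt.Printf("row %d: %d\n", i, *p)
	}
}
```

Returning an error on an unexpected type makes a schema mismatch (for example, a string column passed to an integer series) visible at load time rather than later.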

Cheers.

@pjebs
Collaborator

pjebs commented Jan 19, 2021

Thanks @khughitt. I need to generalise it so that it works for any Parquet data.

@CeciliaCoelho

@pjebs Did you manage to generalise it? Can't get it to work, getting this error.

image

@pjebs
Collaborator

pjebs commented Feb 16, 2021

@CeciliaCoelho can you show me your code?

I was actually waiting for a response to these Qs: xitongsys/parquet-go#360

@CeciliaCoelho

@CeciliaCoelho can you show me your code?

I was actually waiting for a response to these Qs: xitongsys/parquet-go#360

Getting a new error now. This was a CSV that I converted to Parquet using Python, but I wanted to open and use it in Go for efficiency.
The CSV was like this:
image

I have this code:

package main

import (
	"context"
	"log"
	"runtime"

	dataframe "github.com/rocketlaunchr/dataframe-go"
	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/reader"
)

func loadEntriesParquet(inputParquet string) *dataframe.DataFrame {
	// create local parquet/reader instances
	entriesFr, err := local.NewLocalFileReader(inputParquet)

	if err != nil {
		log.Println("Can't open file")
	}

	entriesPr, err := reader.NewParquetColumnReader(entriesFr, int64(runtime.NumCPU()))

	if err != nil {
		log.Println("Unable to create parquet reader", err)
	}

	// determine number of rows in input parquet file
	numRows := int64(entriesPr.GetNumRows())

	// read columns from parquet and use them to construct a DataFrame instance of the
	// same form
	var id, name, res, spill, turb, pump []interface{}

	id, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.id", numRows)
	name, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.name", numRows)
	res, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.res", numRows)
	spill, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.spill", numRows)
	turb, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.turb", numRows)
	pump, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.pump", numRows)

	entries := dataframe.NewDataFrame(
		dataframe.NewSeriesString("id", nil, id...),
		dataframe.NewSeriesString("name", nil, name...),
		dataframe.NewSeriesString("res", nil, res...),
		dataframe.NewSeriesInt64("spill", nil, spill...),
		dataframe.NewSeriesInt64("turb", nil, turb...),
		dataframe.NewSeriesInt64("pump", nil, pump...),
	)

	entriesPr.ReadStop()
	entriesFr.Close()

	// sort entries by date of creation
	sortKey := []dataframe.SortKey{
		{Key: "datecreated", Desc: true},
	}

	ctx := context.Background()
	entries.Sort(ctx, sortKey)

	return entries
}

func main() {
	loadEntriesParquet("cascades2.parquet")
}

Now the error I'm getting is this:
image

@pjebs
Collaborator

pjebs commented Feb 16, 2021

The "new" error is because you are trying to sort based on a Series called "datecreated" but you don't have such a field.
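
One way to catch this earlier is to compare the sort keys against the column names before sorting. This is a rough sketch using only the standard library: missingKeys is a hypothetical helper, not a dataframe-go function, and the column names are hard-coded from the code above rather than read from the DataFrame.

```go
package main

import "fmt"

// missingKeys is a hypothetical helper. It returns every sort key that does
// not match one of the given column names.
func missingKeys(columns, keys []string) []string {
	have := make(map[string]bool, len(columns))
	for _, c := range columns {
		have[c] = true
	}

	var missing []string
	for _, k := range keys {
		if !have[k] {
			missing = append(missing, k)
		}
	}
	return missing
}

func main() {
	// column names taken from the series created in the code above
	columns := []string{"id", "name", "res", "spill", "turb", "pump"}

	// "datecreated" was carried over from the earlier example and is not a column here
	if m := missingKeys(columns, []string{"datecreated"}); len(m) > 0 {
		fmt.Println("cannot sort, unknown series:", m)
		return
	}
	fmt.Println("all sort keys are valid")
}
```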

@CeciliaCoelho

The "new" error is because you are trying to sort based on a Series called "datecreated" but you don't have such a field.

Oh right, didn't notice that. It's running now. Thanks :)
How do I print or access the dataframe? (bet it's a stupid question, sorry I'm a Golang newbie)

@pjebs
Collaborator

pjebs commented Feb 16, 2021

The function returns a *dataframe.DataFrame object. You can see examples of how to use it in the Readme.

However, when I look at the code, it's not efficient at loading the Dataframe with the data. I need to understand that Parquet package better before I can improve the code.

@pjebs
Collaborator

pjebs commented Feb 17, 2021

Parquet importing is now supported (experimental): @CeciliaCoelho @khughitt @space55

@pjebs pjebs closed this as completed Feb 17, 2021