Error to read csv encoding utf-8 with bom and export back to parquet #62
Comments
Can you provide your code too? |
var ctx = context.Background()

func main() {
	//c := config.Config.Server
	//web.Server.SetAddr(c.GetAddress() + ":" + strconv.Itoa(int(c.GetPort())))
	//web.Server.Run()
	exportParquet("export.parquet", readCSV("export.csv"))
}

// readCSV loads the CSV file at filepath into a DataFrame.
func readCSV(filepath string) *dataframe.DataFrame {
	fr, err := os.Open(filepath)
	if err != nil {
		panic(err)
	}
	defer fr.Close()
	df, err := imports.LoadFromCSV(ctx, fr)
	if err != nil {
		panic(err)
	}
	return df
}

// exportParquet writes df to the file named out in Parquet format.
func exportParquet(out string, df *dataframe.DataFrame) {
	f, err := os.Create(out)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	err = exports.ExportToParquet(ctx, f, df)
	if err != nil {
		panic(err)
	}
} |
Your column names are:
I currently have this function implemented for exporting to Parquet:
It was based on this article: https://html.developreference.com/article/11087043/Spark+dataframe+column+naming+conventions+++restrictions
Do you know if Parquet supports Chinese characters? Can you point me to the specs? |
Actually, I think the issue comes from this package: https://github.com/Ompluscator/dynamic-struct It is used to create a struct dynamically. I think Go prohibits using Chinese characters as the first letter of an exported field. I will have to explore it further. |
Yes, it is supported. I think the problem is that the first character is the BOM character. |
I think removing the first character if it's |
How did that character get there? |
By reading a CSV encoded as UTF-8 with BOM?
import pandas
df = pandas.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [1, 2, 3, 4, 5],
})
df.to_csv("withbom.csv", encoding="UTF-8-sig", index=False)  # encoding is UTF-8 with BOM
It causes the same problem.
|
As an experiment, can you add an uppercase English letter at the front of each column name and tell me if it exports? eg. I believe the actual issue is that in Go, a struct field must start with an uppercase English letter in order to be exported. Without it, the field is not exported and the parquet writer package (used for exporting) can't see the field: https://stackoverflow.com/questions/40256161/exported-and-unexported-fields-in-go-language |
I use |
Is |
yes |
As an experiment can you:
|
df := readCSV("/Users/tanyaofei/Desktop/export.csv")
for _, s := range df.Series {
	s.Rename("X" + s.Name())
}
fmt.Println(df)
exportParquet("p.parquet", df)
|
Okay, also call https://pkg.go.dev/strings#TrimSpace |
Also when importing CSV:
That option is available. |
no effect |
no effect either |
|
There are 2 different issues here:
|
This has been a problem for many years: golang/go#5763. |
I believe I have a solution. Does Python always produce See: golang/go#33887 |
no, |
In fact, |
Can you test out this branch: #63 It should solve both problems. |
I will test the branch later, but I am wondering why not name struct fields by index such as |
The |
The error is:
exception recovered: reflect.StructOf: field 0 has invalid name
at ompluscator/[email protected]/builder.go:192
export.csv
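The panic reported above can be reproduced with the standard library alone. This is a minimal sketch, assuming the dynamic struct builder ends up passing the column name to `reflect.StructOf` as the field name; `tryStructOf` is a hypothetical helper that turns the panic into an error.

```go
package main

import (
	"fmt"
	"reflect"
)

// tryStructOf (hypothetical helper) builds a one-field struct type and
// converts any panic raised by reflect.StructOf into an error.
func tryStructOf(fieldName string) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("%v", r)
		}
	}()
	reflect.StructOf([]reflect.StructField{
		{Name: fieldName, Type: reflect.TypeOf(int64(0))},
	})
	return nil
}

func main() {
	// "A" is a valid field name; "\ufeffA" starts with the BOM, which is not a letter.
	for _, n := range []string{"A", "\ufeffA"} {
		fmt.Printf("%q: %v\n", n, tryStructOf(n))
	}
}
```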