BigQuery load fails with JSON array #5542

Open
turb opened this issue Jan 15, 2025 · 9 comments
Labels
bug Something isn't working

Comments

@turb
Contributor

turb commented Jan 15, 2025

When loading data to BigQuery with a JSON column, the load fails when the actual data is a JSON array [ data1, data2 ]:

Caused by: com.fasterxml.jackson.databind.exc.MismatchedInputException: Cannot deserialize value of type `com.google.api.services.bigquery.model.TableRow` from Array value (token `JsonToken.START_ARRAY`)
 at [Source: REDACTED (`StreamReadFeature.INCLUDE_SOURCE_IN_LOCATION` disabled); line: 1, column: 1]
	com.fasterxml.jackson.databind.exc.MismatchedInputException.from(MismatchedInputException.java:59)
	com.fasterxml.jackson.databind.DeserializationContext.reportInputMismatch(DeserializationContext.java:1767)
	com.fasterxml.jackson.databind.DeserializationContext.handleUnexpectedToken(DeserializationContext.java:1541)
	com.fasterxml.jackson.databind.deser.std.StdDeserializer._deserializeFromArray(StdDeserializer.java:222)
	com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(MapDeserializer.java:457)
	com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(MapDeserializer.java:32)
	com.fasterxml.jackson.databind.deser.DefaultDeserializationContext.readRootValue(DefaultDeserializationContext.java:342)
	com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4917)
	com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3860)
	com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3828)
	com.spotify.scio.bigquery.types.package$Json$.parse(package.scala:76)
	@BigQueryType.toTable <== the line with the toTable annotation

It did work with Jackson's JsonNode, but I wonder whether it is possible with TableRow.
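
For reference, a minimal sketch (outside Scio) of the underlying failure: Jackson can bind a JSON object root to TableRow, which is ultimately a java.util.Map, but an array root has no Map shape.

import com.fasterxml.jackson.databind.ObjectMapper
import com.google.api.services.bigquery.model.TableRow

val mapper = new ObjectMapper()
// An object root binds fine, since TableRow implements java.util.Map:
val ok: TableRow = mapper.readValue("""{"a": 1}""", classOf[TableRow])
// An array root throws MismatchedInputException on START_ARRAY,
// matching the stack trace above:
val ko: TableRow = mapper.readValue("""[1, 2]""", classOf[TableRow])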

@RustedBones
Contributor

Indeed, I overlooked the fact that a JSON root can be either:

  • a JSON object (TableRow)
  • a JSON array

@RustedBones added the bug label Jan 15, 2025
@turb
Contributor Author

turb commented Jan 16, 2025

Looking into it, I now remember why I came up with Jackson's JsonNode.

Beam uses an unmaintained BigQuery client. I could not find anything in it dealing with JSON fields, so I assumed a JSON value would be stored directly as a plain JSON element in its representation tree.

Also, the BigQuery docs state:

> BigQuery supports the JSON type even if schema information is not known at the time of ingestion. A field that is declared as JSON type is loaded with the raw JSON values.

Beam does not seem to really care about it.

The thing is, google-api-services-bigquery relies on the google-http-client JsonFactory, which is backed by either Jackson or Gson. I tried Jackson's JsonNode and it worked like a charm.

I don't know what the issue with it was, but maybe a Gson JsonElement could do instead?
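
For context, a minimal sketch of why a Jackson tree node sidesteps the root-type problem: JsonNode covers every kind of JSON root, object and array alike.

import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}

val mapper = new ObjectMapper()
// readTree accepts any well-formed JSON root:
val obj: JsonNode = mapper.readTree("""{"a": 1}""")  // an ObjectNode
val arr: JsonNode = mapper.readTree("""[1, 2]""")    // an ArrayNode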

@RustedBones
Contributor

RustedBones commented Jan 27, 2025

After a long time digging into the JSON type support in TableRow, I think the least bad solution is to handle JSON as a string in TableRow; the user will then have to manually call the following when loading with the JSON format:

row.set("column", Json.parse(row.getJson("column")))

WDYT of that, @turb?
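
For illustration, a sketch of where that call could sit, assuming getJson and Json.parse behave as in the snippet above (the helper name is hypothetical):

import com.google.api.services.bigquery.model.TableRow
import com.spotify.scio.bigquery.types.Json

// Hypothetical fix-up step: re-parse the raw JSON string held in the
// "column" field so the JSON load job sees a JSON value, not a string.
def fixJsonColumn(row: TableRow): TableRow =
  row.set("column", Json.parse(row.getJson("column")))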

@turb
Contributor Author

turb commented Jan 27, 2025

Do you mean #5544 is a dead end? (edit: fixed the number)

Avro can do it, and I suppose it would be transparent for the user, except for the JSON column case?

For JSON, the user should get a hint on how to do that (where to access the row). I wonder whether the JSON columns could be declared (or detected) so that this operation is performed by the library.
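
A hypothetical sketch of that detection idea, assuming the JSON columns can be read off the table schema (jsonColumns is an illustrative helper, not Scio API):

import com.google.api.services.bigquery.model.{TableFieldSchema, TableSchema}
import scala.jdk.CollectionConverters._

// Hypothetical helper: list the JSON-typed columns of a table schema, so
// a library-side step could re-parse exactly those fields before loading.
def jsonColumns(schema: TableSchema): Seq[String] =
  schema.getFields.asScala.collect {
    case f: TableFieldSchema if f.getType == "JSON" => f.getName
  }.toSeq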

@RustedBones
Contributor

RustedBones commented Jan 27, 2025

> except for the JSON column case

Avro handles JSON well too: the field will be a string with the additional property sqlType: JSON, so the translated BQ schema uses JSON.
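
A minimal sketch of the Avro shape described here, assuming the sqlType property name as stated above:

import org.apache.avro.Schema

// A plain Avro string schema carrying the extra property sqlType=JSON;
// the Avro-to-BigQuery schema translation then maps the field to JSON.
val jsonString: Schema = Schema.create(Schema.Type.STRING)
jsonString.addProp("sqlType", "JSON")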

@turb
Contributor Author

turb commented Jan 27, 2025

> handle JSON as a string in TableRow; the user will then have to manually call the following when loading with the JSON format:
> row.set("column", Json.parse(row.getJson("column")))

So if it would use Avro, which handles JSON well, why is this necessary? (I'm not sure I understand what you're saying.)

@RustedBones
Contributor

RustedBones commented Jan 27, 2025

Sorry, I wasn't clear: this snippet is only relevant when using the TableRow API and writing with the JSON format.

@turb
Contributor Author

turb commented Jan 27, 2025

OK, then if using Avro under the hood solves it, I've nothing against it :)

I'll try a load with JSON against #5529.

@turb
Contributor Author

turb commented Jan 27, 2025

> I'll try a load with JSON against #5529.

Worked like a charm!
