-
Notifications
You must be signed in to change notification settings - Fork 112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incorrect data_type
for BigQuery REPEATED
fields
#190
Comments
Thanks for reaching out @Thrasi! TLDRYour proposed solution to handle BigQuery Background contextHere's the context of why it was vendored in the first place: #122 (comment) The aim was to preserve the wide range of Ultimately, we'd be interested in just ripping these definitions altogether and replacing In the meantimeThe definition of The current definition for bigquery is here. The only difference in the definition for bigquery is using The good news is that the bigquery of Recommended codeSo these might do the trick (until we don't care anymore about preserving compatibility in the full {# Vendored from: https://github.com/dbt-labs/dbt-adapters/blob/c7b12aee533184bad391a657d1753539d1dd496a/dbt/include/global_project/macros/relations/column/columns_spec_ddl.sql#L85-L89 #}
{% macro default__format_column(column) -%}
{% set data_type = column.dtype %}
{% set formatted = column.column.lower() ~ " " ~ data_type %}
{{ return({'name': column.name, 'data_type': data_type, 'formatted': formatted}) }}
{%- endmacro -%}
{# Vendored from: https://github.com/dbt-labs/dbt-bigquery/blob/4d255b2f854d21d5d8871bdaa8d7ab47e7e863a3/dbt/include/bigquery/macros/utils/get_columns_spec_ddl.sql#L1-L5 #}
{% macro bigquery__format_column(column) -%}
{% set data_type = column.data_type %}
{% set formatted = column.column.lower() ~ " " ~ data_type %}
{{ return({'name': column.name, 'data_type': data_type, 'formatted': formatted}) }}
{%- endmacro -%} I'd prefer to keep them in the file Would welcome a PR if you are still interested in contributing one! |
data_type
for BigQuery REPEATED
fields
Great! I will add a test for this too. |
Using the vendored code we get "appropriate" select
[1, 2] AS repeated_int,
[
STRUCT(1 as int_field, [STRUCT("a" as string_field)] as nested_repeated_struct),
STRUCT(2 AS int_field, [STRUCT("a" as string_field)] as nested_repeated_struct)
] as repeated_struct generates: version: 2
models:
- name: repeated_nested
description: ""
columns:
- name: repeated_int
data_type: array<int64>
description: ""
- name: repeated_struct
data_type: array<struct<`int_field` int64, `nested_repeated_struct` array<struct<`string_field` string>>>>
description: ""
- name: repeated_struct.int_field
data_type: int64
description: ""
- name: repeated_struct.nested_repeated_struct
data_type: array<struct<`string_field` string>>
description: ""
- name: repeated_struct.nested_repeated_struct.string_field
data_type: string
description: "" which matches: "schema": {
"fields": [
{
"description": "",
"mode": "REPEATED",
"name": "repeated_int",
"type": "INTEGER"
},
{
"description": "",
"fields": [
{
"description": "",
"name": "int_field",
"type": "INTEGER"
},
{
"description": "",
"fields": [
{
"description": "",
"name": "string_field",
"type": "STRING"
}
],
"mode": "REPEATED",
"name": "nested_repeated_struct",
"type": "RECORD"
}
],
"mode": "REPEATED",
"name": "repeated_struct",
"type": "RECORD"
}
]
}, So this is positive. SELECT
*
FROM (
SELECT
CAST(NULL AS ARRAY<int64>) AS repeated_int,
CAST(NULL AS STRUCT<int_field int64, nested_repeated_struct STRUCT<string_field string> string>>> int64, `nested_repeated_struct` ARRAY<STRUCT<`string_field` string>>>>) AS repeated_struct ) AS __dbt_sbq
WHERE
FALSE
AND CURRENT_TIMESTAMP() = CURRENT_TIMESTAMP()
LIMIT
0 The repeated nested column should produce: CAST(NULL AS ARRAY<STRUCT<int_field int64, nested_repeated_struct ARRAY<STRUCT<string_field string>>>>) AS nested_repeated_struct to match the schema. I cannot see the It could be good to add tests enforcing contracts to catch these kind of issues. |
Looking good @Thrasi ! What is the YAML that would work when the model contract is enforced? Per your suggestion about overriding the explicit version: 2
models:
- name: repeated_nested
description: ""
columns:
- name: repeated_int
data_type: array<int64>
description: ""
- name: repeated_struct
data_type: array
description: ""
- name: repeated_struct.int_field
data_type: int64
description: ""
- name: repeated_struct.nested_repeated_struct
data_type: array
description: ""
- name: repeated_struct.nested_repeated_struct.string_field
data_type: string
description: "" |
Yes precisely |
@Thrasi Would adding logic similar to what you suggested earlier handle those cases? See below for a quick shot I took at this, but didn't test out at all. It uses {# Vendored from: https://github.com/dbt-labs/dbt-bigquery/blob/4d255b2f854d21d5d8871bdaa8d7ab47e7e863a3/dbt/include/bigquery/macros/utils/get_columns_spec_ddl.sql#L1-L5 #}
{# But modified to handle https://github.com/dbt-labs/dbt-codegen/issues/190 #}
{% macro bigquery__format_column(column) -%}
{% set data_type = column.data_type %}
{% if column.mode.lower() == "repeated" and column.dtype.lower() == "record" %}
{% set data_type = "array" %}
{% endif %}
{% set formatted = column.column.lower() ~ " " ~ data_type %}
{{ return({'name': column.name, 'data_type': data_type, 'formatted': formatted}) }}
{%- endmacro -%} |
Yes this works, it's same logic but much cleaner and uses |
Describe the bug
BigQuery supports arrays as REPEATED fields
Running generate_model_yaml for a model with a repeated field of a certain datatype will result in a
schema with
data_type: datatype
rather thandata_type: array<datatype>
Running the model again with
contract.enforced=true
will show the error:Repeated records should have
data_type: array
Steps to reproduce
create a model_with_repeated_field.sql:
run it
Expected results
Actual results
Screenshots and log output
Running the model with the following yml:
System information
Which database are you using dbt with?
The output of
dbt --version
:The operating system you're using:
MacOS Sequoia Version 15.1
The output of
python --version
:Python 3.10.15
Additional context
generate_model_yaml gets the data_type using
data_type_format_model
which in turn calls codegen.format_column
format_column is vendored in macros/vendored/dbt_core/format_column.sql
But it doesn't handle the specific case for repeated fields.
I am tempted to create a
default__format_column
and abigquery__format_column
to handle BigQuery specifically:and moving it all into helpers.sql while removing the vendored format_column
Alternatively, bigquery__format_column could return a
'mode': column.mode
fieldand the adapter specific datatype conversion would be taken care of in
bigquery__data_type_format_model
andbigquery__data_type_format_source
Are you interested in contributing the fix?
Yes. I would like some input on whether the proposed solution in Additional context is reasonable.
The text was updated successfully, but these errors were encountered: