Arrays with null values are written as empty tags on the XML file #692

Open
Matew92 opened this issue Oct 24, 2024 · 4 comments

Comments

@Matew92

Matew92 commented Oct 24, 2024

I'm using the library on a nested DataFrame. This is my schema:

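from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# The top-level StructType wrapper is assumed; the original post showed only field "A".
schema = StructType([
    StructField("A", ArrayType(StructType([
        StructField("B", StructType([
            StructField("C", StringType(), True),
            StructField("D", ArrayType(StructType([
                StructField("E", StringType(), True),
                StructField("F", StringType(), True)
            ])), True)
        ]))
    ])))
])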

This is my data:

"A": [{
    "B": {
        "C": "somthing",
        "D": [{
            "E": None,
            "F": None
        }]
    }
}]

What I would expect is something like:

<A>
    <B>
        <C>somthing</C>
    </B>
</A>

But I get:

<A>
    <B>
        <C>somthing</C>
        <D/>
    </B>
</A>

Has anyone else run into this issue? Is there a way to get the behaviour I want? I tried .option("ignoreNullFields", "true"), but I get the same output described above.
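One possible workaround, sketched here in plain Python and not confirmed by the maintainers, is to normalize the data before it reaches the writer, so that an array containing only all-null structs becomes a null field instead. The helper name `collapse_empty` is hypothetical, and the sketch assumes the rows are still available as Python dicts and lists before the DataFrame is created. Whether the writer then omits the tag depends on how it treats null fields, which is not verified here.

```python
def collapse_empty(value):
    """Recursively replace all-null structs and empty arrays with None.

    Keys are kept (set to None) so the result still lines up with the schema.
    """
    if isinstance(value, dict):
        cleaned = {key: collapse_empty(item) for key, item in value.items()}
        # A struct whose fields are all None is treated as null itself.
        if any(item is not None for item in cleaned.values()):
            return cleaned
        return None
    if isinstance(value, list):
        cleaned = [collapse_empty(item) for item in value]
        cleaned = [item for item in cleaned if item is not None]
        # An array with no remaining elements is treated as null.
        return cleaned or None
    # Scalars (including 0 and "") are returned unchanged.
    return value


row = {"A": [{"B": {"C": "somthing", "D": [{"E": None, "F": None}]}}]}
result = collapse_empty(row)
print(result)  # {'A': [{'B': {'C': 'somthing', 'D': None}}]}
```

Note that this deliberately erases the distinction between an empty array and a missing one, so it only fits cases where that distinction does not matter downstream.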

@srowen
Collaborator

srowen commented Oct 24, 2024

I don't know if one is right-er than the other. They are slightly different situations: a child with nothing in it vs. a parent with no children. That said, I don't think the current behavior is strongly motivated; it's just how it happened.

I would probably not change behavior at this point unless it's demonstrably problematic.

@Matew92
Author

Matew92 commented Oct 24, 2024

Hi Srowen,

Thanks for your fast reply. I get the same behaviour with fields (if a field in the DataFrame is null, it is not printed in the XML file), so I was expecting the same for an empty array (or at least an option for it?).

@srowen
Collaborator

srowen commented Oct 24, 2024

I think there's a difference between [] and None which is sort of mirrored here - that's not a missing array, it's an empty array. I think you could argue behavior either way, neither is that much more reasonable. But I would not change behavior that's stood for so long unless it was clearly wrong.

@Matew92
Author

Matew92 commented Oct 24, 2024

Yes, I agree with you that an empty array is different from a None (so indeed, I would not change the default behavior). However, for big data purposes, having an option to print or not print empty nested arrays would be really helpful because it optimizes the size of the XML file.

For example, in my case, I have DataFrames nested 2-3 levels deep, and the result is all these empty array tags in a 100GB file.

The result is something like this for each row:

  <a>
      <b>
          <c/>
          <d/>
      </b>
      <e/>
      <f>
          <g/>
          <h/>
          <i>
              <m/>
              <n/>
          </i>
          <o/>
      </f>
  </a>
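Until such an option exists, the empty tags could in principle be stripped after the file is written. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `strip_empty` is hypothetical. It parses the whole document in memory, so for a file of this size it would have to be applied per row or replaced by a streaming approach.

```python
import xml.etree.ElementTree as ET


def strip_empty(elem):
    """Remove descendants that have no text, no attributes, and no children.

    Children are processed first, so a parent that becomes empty once its
    children are removed is dropped as well.
    """
    for child in list(elem):
        strip_empty(child)
        has_text = bool(child.text and child.text.strip())
        if len(child) == 0 and not has_text and not child.attrib:
            elem.remove(child)
    return elem


doc = "<A><B><C>somthing</C><D/></B></A>"
root = strip_empty(ET.fromstring(doc))
result = ET.tostring(root, encoding="unicode")
print(result)  # <A><B><C>somthing</C></B></A>
```

Applied to a row in which every leaf is empty, this removes everything below the root element, which may or may not be what is wanted.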
