The problem with the case of words for identical names #685

Open
hipp0gryph opened this issue May 21, 2024 · 3 comments

@hipp0gryph

Hello! If I load files that contain elements with identical names but different letter case, I get an error. I would expect either a NULL value or two columns with different letter case in the schema. That seems logical to me.

Code:

from pyspark.sql import SparkSession

# Start a session with the spark-xml package on the classpath.
spark = SparkSession.builder \
    .appName("Read XML") \
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.18.0") \
    .getOrCreate()

# Read every XML file in case_test/, one row per <Root> element.
df = spark.read.format("xml") \
    .option("rowTag", "Root") \
    .option("attributePrefix", "") \
    .option("mode", "PERMISSIVE") \
    .option("charset", "utf-8") \
    .option("inferSchema", False) \
    .option("ignoreNamespace", False) \
    .load("case_test/*.xml")
df.printSchema()

XML file 1 in folder case_test:

<Root>
    <Element>Block for case switch</Element>
</Root>

XML file 2 in folder case_test:

<Root>
    <ElemenT>Block for case switch</ElemenT>
</Root>

Error:

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
<ipython-input-2-b867e6c5fcd7> in <module>
    348     .option("inferSchema", False) \
    349     .option("ignoreNamespace", False) \
--> 350     .load(f"case_test/*.xml")
    351 df.printSchema()
    352 init_new_spark_df_methods()

/usr/local/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
    202         self.options(**options)
    203         if isinstance(path, str):
--> 204             return self._df(self._jreader.load(path))
    205         elif path is not None:
    206             if type(path) != list:

/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
-> 1305             answer, self.gateway_client, self.target_id, self.name)
   1306 
   1307         for temp_arg in temp_args:

/usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    115                 # Hide where the exception came from that shows a non-Pythonic
    116                 # JVM exception message.
--> 117                 raise converted from None
    118             else:
    119                 raise

AnalysisException: Found duplicate column(s) in the data schema: `element`

Thank you in advance!

@srowen
Collaborator

srowen commented May 21, 2024

Hm, I actually don't even know if that's 'correct' behavior or not. Spark is not case sensitive but XML is.
You're welcome to investigate and come up with an argument about what it should do and see if the schema inference can be changed. I just don't want to break any existing behavior over this as it's operated this way forever. But making something work that never worked could be OK.
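
The mismatch described above can be reproduced without Spark. The following is a minimal sketch using only the Python standard library and the two sample files from this issue; `child_tags` and `case_collisions` are hypothetical helper names introduced for illustration, not part of spark-xml. It shows that an XML parser sees two distinct element names, while a case-insensitive comparison folds them into one name, which matches the `element` reported in the error.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The two sample documents from this issue.
XML_1 = "<Root><Element>Block for case switch</Element></Root>"
XML_2 = "<Root><ElemenT>Block for case switch</ElemenT></Root>"


def child_tags(xml_text):
    """Return the tag names of the direct children of the root element."""
    return [child.tag for child in ET.fromstring(xml_text)]


def case_collisions(names):
    """Group names that differ only by letter case (hypothetical helper)."""
    groups = defaultdict(set)
    for name in names:
        groups[name.lower()].add(name)
    # Keep only the groups where more than one spelling was seen.
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


tags = child_tags(XML_1) + child_tags(XML_2)

# Case-sensitive view (XML): two different element names.
distinct_case_sensitive = sorted(set(tags))

# Case-insensitive view: both spellings collapse onto one key.
collisions = case_collisions(tags)

print(distinct_case_sensitive)
print(collisions)
```

Spark also has a `spark.sql.caseSensitive` setting that controls case-sensitive name resolution. Whether enabling it changes the outcome for this reader has not been verified here.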

@hipp0gryph
Author

hipp0gryph commented May 21, 2024

> Hm, I actually don't even know if that's 'correct' behavior or not. Spark is not case sensitive but XML is. You're welcome to investigate and come up with an argument about what it should do and see if the schema inference can be changed. I just don't want to break any existing behavior over this as it's operated this way forever. But making something work that never worked could be OK.

Thank you for the fast answer! In the W3C XML specification (https://www.w3.org/TR/xml/#dt-entref), section 4.3.3 "Character Encoding in Entities" says: "XML processors should match character encoding names in a case-insensitive way and should either interpret an IANA-registered name as the encoding registered at IANA for that name or treat it as unknown (processors are, of course, not required to support all IANA-registered encodings)."

I think the right way is to read entities that differ only in letter case as the same entity.
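
Until the reader itself handles this, one way to get that behaviour is to normalize element names before loading the files. The following is a minimal sketch using only the Python standard library; `lowercase_tags` is a hypothetical helper, not part of spark-xml. It assumes documents without namespaces, since a namespaced tag would have its URI lower-cased as well.

```python
import xml.etree.ElementTree as ET


def lowercase_tags(xml_text):
    """Return the XML with every element name lower-cased.

    Attributes and text are left untouched. Assumes no XML namespaces.
    """
    root = ET.fromstring(xml_text)
    for element in root.iter():  # iter() includes the root element itself
        element.tag = element.tag.lower()
    return ET.tostring(root, encoding="unicode")


# The two sample documents from this issue end up identical.
normalized_1 = lowercase_tags("<Root><Element>Block for case switch</Element></Root>")
normalized_2 = lowercase_tags("<Root><ElemenT>Block for case switch</ElemenT></Root>")

print(normalized_1)
print(normalized_2)
```

Because the root element is lower-cased too, the `rowTag` option would have to be `root` when reading files rewritten this way.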

@hipp0gryph
Author

Although I am also not sure that these really are the same entities.
