WARN Utils: An error occurred while trying to read the S3 bucket lifecycle configuration java.lang.NullPointerException #346

Open
pedromb opened this issue May 10, 2017 · 18 comments · May be fixed by #357

Comments

@pedromb

pedromb commented May 10, 2017

Hello guys, I am getting this warning:

WARN Utils$: An error occurred while trying to read the S3 bucket lifecycle configuration
java.lang.NullPointerException
        at java.lang.String.startsWith(String.java:1385)
        at java.lang.String.startsWith(String.java:1414)
        at com.databricks.spark.redshift.Utils$$anonfun$3.apply(Utils.scala:102)
        at com.databricks.spark.redshift.Utils$$anonfun$3.apply(Utils.scala:98)
        at scala.collection.Iterator$class.exists(Iterator.scala:753)
        at scala.collection.AbstractIterator.exists(Iterator.scala:1157)
        at scala.collection.IterableLike$class.exists(IterableLike.scala:77)
        at scala.collection.AbstractIterable.exists(Iterable.scala:54)
        at com.databricks.spark.redshift.Utils$.checkThatBucketHasObjectLifecycleConfiguration(Utils.scala:98)
        at com.databricks.spark.redshift.RedshiftWriter.saveToRedshift(RedshiftWriter.scala:361)
        at com.databricks.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:106)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:222)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)

I have seen this issue here before, but it still occurs for me.

I do have a lifecycle configuration for my bucket. I've traced this warning to this piece of code:

def checkThatBucketHasObjectLifecycleConfiguration(
      tempDir: String,
      s3Client: AmazonS3Client): Unit = {
    try {
      val s3URI = createS3URI(Utils.fixS3Url(tempDir))
      val bucket = s3URI.getBucket
      assert(bucket != null, "Could not get bucket from S3 URI")
      val key = Option(s3URI.getKey).getOrElse("")
      val hasMatchingBucketLifecycleRule: Boolean = {
        val rules = Option(s3Client.getBucketLifecycleConfiguration(bucket))
          .map(_.getRules.asScala)
          .getOrElse(Seq.empty)
        rules.exists { rule =>
          // Note: this only checks that there is an active rule which matches the temp directory;
          // it does not actually check that the rule will delete the files. This check is still
          // better than nothing, though, and we can always improve it later.
          rule.getStatus == BucketLifecycleConfiguration.ENABLED && key.startsWith(rule.getPrefix)
        }
      }
      if (!hasMatchingBucketLifecycleRule) {
        log.warn(s"The S3 bucket $bucket does not have an object lifecycle configuration to " +
          "ensure cleanup of temporary files. Consider configuring `tempdir` to point to a " +
          "bucket with an object lifecycle policy that automatically deletes files after an " +
          "expiration period. For more information, see " +
          "https://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html")
      }
    } catch {
      case NonFatal(e) =>
        log.warn("An error occurred while trying to read the S3 bucket lifecycle configuration", e)
    }
  }

I believe the exception is thrown by this expression:
key.startsWith(rule.getPrefix)

I checked the AWS SDK documentation: the getPrefix method returns null if the prefix wasn't set using the setPrefix method, so in this case it will always return null.

I have very limited knowledge of the Amazon SDK and Scala, so I'm not really sure about this.
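
For illustration, a null-safe variant of the check might look something like this (just a sketch, not a tested fix; it treats a rule with no prefix as covering the whole bucket):

import com.amazonaws.services.s3.model.BucketLifecycleConfiguration
import scala.collection.JavaConverters._

def hasMatchingLifecycleRule(config: BucketLifecycleConfiguration, key: String): Boolean = {
  val rules = Option(config).map(_.getRules.asScala).getOrElse(Seq.empty)
  rules.exists { rule =>
    // Wrapping getPrefix in Option avoids the NullPointerException; a null
    // prefix is interpreted here as "the rule covers every key in the bucket".
    rule.getStatus == BucketLifecycleConfiguration.ENABLED &&
      Option(rule.getPrefix).forall(key.startsWith)
  }
}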

@dmnava

dmnava commented May 12, 2017

The same here:

17/05/12 13:57:56 WARN redshift.Utils$: An error occurred while trying to read the S3 bucket lifecycle configuration
java.lang.NullPointerException
	at java.lang.String.startsWith(String.java:1405)
	at java.lang.String.startsWith(String.java:1434)
	at com.databricks.spark.redshift.Utils$$anonfun$5.apply(Utils.scala:140)
	at com.databricks.spark.redshift.Utils$$anonfun$5.apply(Utils.scala:136)
	at scala.collection.Iterator$class.exists(Iterator.scala:919)
	at scala.collection.AbstractIterator.exists(Iterator.scala:1336)
	at scala.collection.IterableLike$class.exists(IterableLike.scala:77)
	at scala.collection.AbstractIterable.exists(Iterable.scala:54)
	at com.databricks.spark.redshift.Utils$.checkThatBucketHasObjectLifecycleConfiguration(Utils.scala:136)
	at com.databricks.spark.redshift.RedshiftWriter.saveToRedshift(RedshiftWriter.scala:389)
	at com.databricks.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:108)
	at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:426)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
...
...

I thought it had to do with not setting a bucket prefix when configuring the lifecycle policy, but even after setting one the warning keeps showing (although the operation succeeds).
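
To see what the SDK actually returns for the bucket, a quick check along these lines can help (a rough sketch; the bucket name is a placeholder):

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import scala.collection.JavaConverters._

object LifecycleDebug extends App {
  val bucket = "my-temp-bucket" // placeholder: use the bucket from your tempdir
  val s3 = AmazonS3ClientBuilder.defaultClient()
  // getBucketLifecycleConfiguration returns null if the bucket has no lifecycle configuration at all
  val rules = Option(s3.getBucketLifecycleConfiguration(bucket))
    .map(_.getRules.asScala)
    .getOrElse(Seq.empty)
  rules.foreach { rule =>
    // getPrefix can be null when the rule was created with a filter instead of a top-level prefix
    println(s"id=${rule.getId} status=${rule.getStatus} prefix=${Option(rule.getPrefix)}")
  }
}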

@divnalam

+1

3 similar comments
@Joe29

Joe29 commented Jun 21, 2017

+1

@WajdiF

WajdiF commented Jul 10, 2017

+1

@markdessain

+1

BorePlusPlus added a commit to BorePlusPlus/spark-redshift that referenced this issue Jul 21, 2017
The `getPrefix` method on `Rule` [got deprecated](https://github.com/aws/aws-sdk-java/blob/355424771b951ef0066b19c3eab4b4356e270cf4/aws-java-sdk-s3/src/main/java/com/amazonaws/services/s3/model/BucketLifecycleConfiguration.java#L145-L153).
It seems the response on the wire was also changed, so this method no
longer returns the prefix even on older versions of the AWS SDK (such as the one
used by this project).

I've bumped the AWS SDK dependency versions and implemented the check
using the new visitor pattern. I am not sure it is the nicest Scala code,
but I think it works. Tests still pass.

I believe this fixes databricks#346.
BorePlusPlus linked a pull request Jul 21, 2017 that will close this issue
@BorePlusPlus

I think I have a PR that fixes this (you need to upgrade AWS SDK dependencies). See: #357
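
Roughly, the idea is to read the prefix through the newer SDK's rule filter instead of the deprecated getPrefix. A simplified sketch of that check is below; the class names (LifecycleFilterPredicate, LifecyclePrefixPredicate, LifecycleAndOperator) come from the newer aws-java-sdk-s3, and the PR itself uses the SDK's visitor pattern rather than this pattern match:

import com.amazonaws.services.s3.model.BucketLifecycleConfiguration
import com.amazonaws.services.s3.model.lifecycle.{LifecycleAndOperator, LifecycleFilterPredicate, LifecyclePrefixPredicate}
import scala.collection.JavaConverters._

def ruleMatchesKey(rule: BucketLifecycleConfiguration.Rule, key: String): Boolean = {
  def predicateMatches(p: LifecycleFilterPredicate): Boolean = p match {
    case prefix: LifecyclePrefixPredicate => key.startsWith(prefix.getPrefix)
    case and: LifecycleAndOperator => and.getOperands.asScala.forall(predicateMatches)
    case _ => true // tag-only or unknown predicates: assume the rule applies
  }
  val legacyPrefix = Option(rule.getPrefix) // rules created the old way
  val filterPredicate = Option(rule.getFilter).flatMap(f => Option(f.getPredicate)) // rules created with a filter
  legacyPrefix.map(key.startsWith)
    .orElse(filterPredicate.map(predicateMatches))
    .getOrElse(true) // neither a prefix nor a filter: treat the rule as bucket-wide
}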

@watchingant

Disclaimer: new to Scala.
I am using the jars from http://repo1.maven.org/maven2/com/databricks/ and I am getting this same error when trying to write to Redshift. I am running this from the Spark shell for debugging; every time I do, I get this error, my shell hangs, and the operation never completes.

@mylons

mylons commented Oct 18, 2017

+1 seeing this in pyspark

@gorros

gorros commented Dec 13, 2017

+1

@gorros

gorros commented Dec 14, 2017

It seems to me that this issue has something to do with the fact that the com.amazonaws.aws-java-sdk-s3 dependency is provided. When I run the Spark job locally from the IDE it works, but it does not work on AWS EMR. I think the deployed jar uses the libraries provided by the AWS environment, which are probably newer and conflict with spark-redshift, while locally it uses the older libraries and as a result it works. As a temporary solution, I suggest using the fix by @BorePlusPlus:

    <repositories>
        <repository>
            <id>jitpack.io</id>
            <url>https://jitpack.io</url>
        </repository>
    </repositories>
...
    <dependency>
        <groupId>com.github.BorePlusPlus</groupId>
        <artifactId>spark-redshift_2.11</artifactId>
        <version>bucket-lifecycle-check-upgrade-SNAPSHOT</version>
    </dependency>

@RyanZotti

I agree that this is a super annoying error, since the stack trace is so long. This solution worked for me:

spark.sparkContext.setLogLevel("ERROR")

I got the suggestion from here.

@aymkhalil

+1

@dvelle

dvelle commented Dec 14, 2018

For us it turned out that the file being read was not there, which produced

"An error occurred while trying to read the S3 bucket lifecycle configuration
java.lang.NullPointerException"

and a subsequent

"S3ServiceException:Access Denied,Status 403,Error AccessDenied,"

It would seem we are reading before the file is available - parallel processing woes?

An object that is not found results in a 403 (access denied) rather than a 404 (not found) because different return codes would give an attacker useful information - they would leak whether an object of a given name actually exists. A simple dictionary-style attack could then enumerate all of the objects in someone's bucket. For a similar reason, a login page should never emit "Invalid user" and "Invalid password" for the two authentication failure scenarios; it should always emit "Invalid credentials".

A fix would then be:

Check the regions.

For example: in our case the region visible in the AWS console link was "us-west-2", but the contents were actually hosted in ap-southeast-1.

Check permissions.

By default, permissions are given to the AWS user only. If you use IAM authentication with access keys, you must add permissions for "authenticated users" in S3.

"...If the object you request does not exist, the error Amazon S3 returns depends on whether you also have the s3:ListBucket permission.

If you have the s3:ListBucket permission on the bucket, Amazon S3 will return an HTTP status code 404 ("no such key") error. if you don’t have the s3:ListBucket permission, Amazon S3 will return an HTTP status code 403 ("access denied") error."

Keep your role policy as in the helloV post.
Go to S3. Select your bucket. Click Permissions. Click Bucket Policy.
Try something like this:
{
  "Version": "2012-10-17",
  "Id": "Lambda access bucket policy",
  "Statement": [
    {
      "Sid": "All on objects in bucket lambda",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::AWSACCOUNTID:root"
      },
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::BUCKET-NAME/*"
    },
    {
      "Sid": "All on bucket by lambda",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::AWSACCOUNTID:root"
      },
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::BUCKET-NAME"
    }
  ]
}

Choose the right solution for your architecture; hope this helps.

@gorros

gorros commented Dec 14, 2018

By now, I have implemented multiple Spark applications with this library and the issue does not affect anything.

@1311543

1311543 commented Aug 19, 2019

I solved the problem by inverting the order of these parameters.
Before:

sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_id)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_key)

After:

sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_id)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_key)
sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

@kovacshuni

+1
Any news on how to fix this properly?

@feliperoos

+1 Just saw this happening in Databricks runs (using Spark 3.2.1).

@vnktsh

vnktsh commented Feb 1, 2024

I was able to silence this by setting the logger for this piece of code to ERROR:

import org.apache.log4j.{Level, Logger}

// insert this line after spark session initiation
Logger.getLogger("com.databricks.spark.redshift.Utils$").setLevel(Level.ERROR)
