WARN Utils: An error occurred while trying to read the S3 bucket lifecycle configuration java.lang.NullPointerException #346

Open
pedromb opened this issue May 10, 2017 · 18 comments · May be fixed by #357

Comments

@pedromb

pedromb commented May 10, 2017

Hello guys, I am getting this warning:

WARN Utils$: An error occurred while trying to read the S3 bucket lifecycle configuration
java.lang.NullPointerException
        at java.lang.String.startsWith(String.java:1385)
        at java.lang.String.startsWith(String.java:1414)
        at com.databricks.spark.redshift.Utils$$anonfun$3.apply(Utils.scala:102)
        at com.databricks.spark.redshift.Utils$$anonfun$3.apply(Utils.scala:98)
        at scala.collection.Iterator$class.exists(Iterator.scala:753)
        at scala.collection.AbstractIterator.exists(Iterator.scala:1157)
        at scala.collection.IterableLike$class.exists(IterableLike.scala:77)
        at scala.collection.AbstractIterable.exists(Iterable.scala:54)
        at com.databricks.spark.redshift.Utils$.checkThatBucketHasObjectLifecycleConfiguration(Utils.scala:98)
        at com.databricks.spark.redshift.RedshiftWriter.saveToRedshift(RedshiftWriter.scala:361)
        at com.databricks.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:106)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:222)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)

I have seen this issue here before, but it still occurs for me.

I do have a lifecycle configuration for my bucket. I've traced this warning to this piece of code:

def checkThatBucketHasObjectLifecycleConfiguration(
      tempDir: String,
      s3Client: AmazonS3Client): Unit = {
    try {
      val s3URI = createS3URI(Utils.fixS3Url(tempDir))
      val bucket = s3URI.getBucket
      assert(bucket != null, "Could not get bucket from S3 URI")
      val key = Option(s3URI.getKey).getOrElse("")
      val hasMatchingBucketLifecycleRule: Boolean = {
        val rules = Option(s3Client.getBucketLifecycleConfiguration(bucket))
          .map(_.getRules.asScala)
          .getOrElse(Seq.empty)
        rules.exists { rule =>
          // Note: this only checks that there is an active rule which matches the temp directory;
          // it does not actually check that the rule will delete the files. This check is still
          // better than nothing, though, and we can always improve it later.
          rule.getStatus == BucketLifecycleConfiguration.ENABLED && key.startsWith(rule.getPrefix)
        }
      }
      if (!hasMatchingBucketLifecycleRule) {
        log.warn(s"The S3 bucket $bucket does not have an object lifecycle configuration to " +
          "ensure cleanup of temporary files. Consider configuring `tempdir` to point to a " +
          "bucket with an object lifecycle policy that automatically deletes files after an " +
          "expiration period. For more information, see " +
          "https://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html")
      }
    } catch {
      case NonFatal(e) =>
        log.warn("An error occurred while trying to read the S3 bucket lifecycle configuration", e)
    }
  }

I believe the exception is thrown by this expression:
key.startsWith(rule.getPrefix)

I checked the AWS SDK documentation: the getPrefix method returns null if the prefix wasn't set using the setPrefix method, so in this case it will always return null.

I have very limited knowledge of the Amazon SDK and Scala, so I'm not really sure about this.
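
For illustration, a null-safe variant of the check might look something like this (just a sketch, not a tested fix; it treats a rule with no prefix as covering the whole bucket):

import com.amazonaws.services.s3.model.BucketLifecycleConfiguration
import scala.collection.JavaConverters._

def hasMatchingLifecycleRule(config: BucketLifecycleConfiguration, key: String): Boolean = {
  val rules = Option(config).map(_.getRules.asScala).getOrElse(Seq.empty)
  rules.exists { rule =>
    // Wrapping getPrefix in Option avoids the NullPointerException; a null
    // prefix is interpreted here as "the rule covers every key in the bucket".
    rule.getStatus == BucketLifecycleConfiguration.ENABLED &&
      Option(rule.getPrefix).forall(key.startsWith)
  }
}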

@dmnava

dmnava commented May 12, 2017

The same here:

17/05/12 13:57:56 WARN redshift.Utils$: An error occurred while trying to read the S3 bucket lifecycle configuration
java.lang.NullPointerException
	at java.lang.String.startsWith(String.java:1405)
	at java.lang.String.startsWith(String.java:1434)
	at com.databricks.spark.redshift.Utils$$anonfun$5.apply(Utils.scala:140)
	at com.databricks.spark.redshift.Utils$$anonfun$5.apply(Utils.scala:136)
	at scala.collection.Iterator$class.exists(Iterator.scala:919)
	at scala.collection.AbstractIterator.exists(Iterator.scala:1336)
	at scala.collection.IterableLike$class.exists(IterableLike.scala:77)
	at scala.collection.AbstractIterable.exists(Iterable.scala:54)
	at com.databricks.spark.redshift.Utils$.checkThatBucketHasObjectLifecycleConfiguration(Utils.scala:136)
	at com.databricks.spark.redshift.RedshiftWriter.saveToRedshift(RedshiftWriter.scala:389)
	at com.databricks.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:108)
	at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:426)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
...
...

I thought it had to do with not setting a bucket prefix when configuring the lifecycle policy, but even after setting one the warning keeps showing (although the operation succeeds).
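
To see what the SDK actually returns for the bucket, a quick check along these lines can help (a rough sketch; the bucket name is a placeholder):

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import scala.collection.JavaConverters._

object LifecycleDebug extends App {
  val bucket = "my-temp-bucket" // placeholder: use the bucket from your tempdir
  val s3 = AmazonS3ClientBuilder.defaultClient()
  // getBucketLifecycleConfiguration returns null if the bucket has no lifecycle configuration at all
  val rules = Option(s3.getBucketLifecycleConfiguration(bucket))
    .map(_.getRules.asScala)
    .getOrElse(Seq.empty)
  rules.foreach { rule =>
    // getPrefix can be null when the rule was created with a filter instead of a top-level prefix
    println(s"id=${rule.getId} status=${rule.getStatus} prefix=${Option(rule.getPrefix)}")
  }
}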

@divnalam

+1

3 similar comments
@Joe29

Joe29 commented Jun 21, 2017

+1

@WajdiF

WajdiF commented Jul 10, 2017

+1

@markdessain

+1

BorePlusPlus added a commit to BorePlusPlus/spark-redshift that referenced this issue Jul 21, 2017
The `getPrefix` method on `Rule` [got deprecated](https://github.com/aws/aws-sdk-java/blob/355424771b951ef0066b19c3eab4b4356e270cf4/aws-java-sdk-s3/src/main/java/com/amazonaws/services/s3/model/BucketLifecycleConfiguration.java#L145-L153).
It seems the response on the wire was also changed, so this method no
longer returns the prefix even on older versions of the AWS SDK (such as the one
used by this project).

I've bumped the AWS SDK dependency versions and implemented the check
using the new visitor pattern. I am not sure it is the nicest Scala code,
but I think it works. Tests still pass.

I believe this fixes databricks#346.
BorePlusPlus linked a pull request Jul 21, 2017 that will close this issue
@BorePlusPlus

I think I have a PR that fixes this (you need to upgrade AWS SDK dependencies). See: #357
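
Roughly, the idea is to read the prefix through the newer SDK's rule filter instead of the deprecated getPrefix. A simplified sketch of that check is below; the class names (LifecycleFilterPredicate, LifecyclePrefixPredicate, LifecycleAndOperator) come from the newer aws-java-sdk-s3, and the PR itself uses the SDK's visitor pattern rather than this pattern match:

import com.amazonaws.services.s3.model.BucketLifecycleConfiguration
import com.amazonaws.services.s3.model.lifecycle.{LifecycleAndOperator, LifecycleFilterPredicate, LifecyclePrefixPredicate}
import scala.collection.JavaConverters._

def ruleMatchesKey(rule: BucketLifecycleConfiguration.Rule, key: String): Boolean = {
  def predicateMatches(p: LifecycleFilterPredicate): Boolean = p match {
    case prefix: LifecyclePrefixPredicate => key.startsWith(prefix.getPrefix)
    case and: LifecycleAndOperator => and.getOperands.asScala.forall(predicateMatches)
    case _ => true // tag-only or unknown predicates: assume the rule applies
  }
  val legacyPrefix = Option(rule.getPrefix) // rules created the old way
  val filterPredicate = Option(rule.getFilter).flatMap(f => Option(f.getPredicate)) // rules created with a filter
  legacyPrefix.map(key.startsWith)
    .orElse(filterPredicate.map(predicateMatches))
    .getOrElse(true) // neither a prefix nor a filter: treat the rule as bucket-wide
}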

@watchingant

Disclaimer: new to Scala.
I am using the jars from http://repo1.maven.org/maven2/com/databricks/ and I am getting this same error when trying to write to Redshift. I am running this from the Spark shell for debugging; every time I do, I get this error, my shell hangs, and the operation never completes.

@mylons

mylons commented Oct 18, 2017

+1 seeing this in pyspark

@gorros

gorros commented Dec 13, 2017

+1

@gorros

gorros commented Dec 14, 2017

It seems to me that this issue has something to do with the fact that the com.amazonaws.aws-java-sdk-s3 dependency is provided. When I run the Spark job locally from the IDE it works, but it does not work on AWS EMR. I think the deployed jar uses the libraries provided by the AWS environment, which are probably newer and conflict with spark-redshift, while locally it uses the older libraries and as a result it works. As a temporary solution, I suggest using the fix by @BorePlusPlus:

    <repositories>
        <repository>
            <id>jitpack.io</id>
            <url>https://jitpack.io</url>
        </repository>
    </repositories>
...
    <dependency>
        <groupId>com.github.BorePlusPlus</groupId>
        <artifactId>spark-redshift_2.11</artifactId>
        <version>bucket-lifecycle-check-upgrade-SNAPSHOT</version>
    </dependency>

@RyanZotti

I agree that this is a super annoying error, since the stack trace is so long. This solution worked for me:

spark.sparkContext.setLogLevel("ERROR")

I got the suggestion from here.

@aymkhalil

+1

@dvelle

dvelle commented Dec 14, 2018

For us it turned out that the file being read was not there, which produced

"An error occurred while trying to read the S3 bucket lifecycle configuration
java.lang.NullPointerException"

and a subsequent

"S3ServiceException:Access Denied,Status 403,Error AccessDenied,"

It would seem we are reading before the file is available - parallel processing woes?

An object that is not found results in a 403 (access denied) rather than a 404 (not found) because different return codes would give an attacker useful information - they would leak whether an object of a given name actually exists. A simple dictionary-style attack could then enumerate all of the objects in someone's bucket. For a similar reason, a login page should never emit "Invalid user" and "Invalid password" for the two authentication failure scenarios; it should always emit "Invalid credentials".

A fix would then be:

Check the regions.

For example: in our case the region visible in the AWS console link was "us-west-2", but the contents were actually hosted in ap-southeast-1.

Check permissions.

By default, permissions are given to the AWS user only. If you use IAM authentication with access keys, you must add permissions for "authenticated users" in S3.

"...If the object you request does not exist, the error Amazon S3 returns depends on whether you also have the s3:ListBucket permission.

If you have the s3:ListBucket permission on the bucket, Amazon S3 will return an HTTP status code 404 ("no such key") error. if you don’t have the s3:ListBucket permission, Amazon S3 will return an HTTP status code 403 ("access denied") error."

Keep your role policy as in the helloV post.
Go to S3. Select your bucket. Click Permissions. Click Bucket Policy.
Try something like this:
{
  "Version": "2012-10-17",
  "Id": "Lambda access bucket policy",
  "Statement": [
    {
      "Sid": "All on objects in bucket lambda",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::AWSACCOUNTID:root"
      },
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::BUCKET-NAME/*"
    },
    {
      "Sid": "All on bucket by lambda",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::AWSACCOUNTID:root"
      },
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::BUCKET-NAME"
    }
  ]
}

Choose the right solution for your architecture; hope this helps.

@gorros

gorros commented Dec 14, 2018

By now, I have implemented multiple Spark applications with this library and the issue does not affect anything.

@1311543

1311543 commented Aug 19, 2019

I solved the problem by inverting the order of these parameters.
Before:

sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_id)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_key)

After:

sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_id)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_key)
sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

@kovacshuni

+1
Any news on how to fix this properly?

@feliperoos

+1 Just saw this happening in Databricks runs (using Spark 3.2.1).

@vnktsh

vnktsh commented Feb 1, 2024

I was able to silence this by setting the logger for this piece of code to ERROR:

import org.apache.log4j.{Level, Logger}

// insert this line after spark session initiation
Logger.getLogger("com.databricks.spark.redshift.Utils$").setLevel(Level.ERROR)
