
Reading csv with "spark_write_csv" = spark_write_csv #8

Open
DorisAmoakohene opened this issue Oct 25, 2023 · 4 comments

Comments

@DorisAmoakohene (Owner)

@tdhock

I am trying to write code that writes a CSV with Spark. This is the code I run:
"spark_write_csv" = spark_write_csv(input.df,tempfile(),TRUE)

write.colors <- c(
  "readr::write_csv"="#9970AB",
  "data.table::fwrite"="#D6604D",
  "write_csv_arrow"="#BF812D", 
  "write.csv2" = "#722f37",
  "spark_read_csv"= "pink",
  "utils::write.csv"="deepskyblue")

n.rows <- 100
seconds.limit <- 1

atime.write.vary.cols <- atime::atime(
  N=as.integer(10^seq(2, 6, by=0.5)),
  setup={
    set.seed(1)
    input.vec <- rnorm(n.rows*N)
    input.mat <- matrix(input.vec, n.rows, N)
    input.df <- data.frame(input.mat)
  },
  seconds.limit = seconds.limit,
  "data.table::fwrite"={
    data.table::fwrite(input.df, tempfile(), showProgress = FALSE)
  },
  "write_csv_arrow"={
    arrow::write_csv_arrow(input.df, tempfile())
  },
  "readr::write_csv"={
    readr::write_csv(input.df, tempfile(), progress = FALSE)
  },
  "write.csv2" = {
    write.csv2(input.df, tempfile())
  },
  "spark_write_csv" = spark_write_csv(input.df,tempfile(),TRUE),
  "utils::write.csv"=utils::write.csv(input.df, tempfile()))

I'm getting this error; kindly assist:

Error in UseMethod("spark_write_csv") :
no applicable method for 'spark_write_csv' applied to an object of class "data.frame"

@tdhock commented Oct 25, 2023

This error message indicates that the spark_write_csv function does not work for data.frame inputs. You should look at the man page, ?spark_write_csv, to see what the right input data type is.
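
For concreteness, here is a minimal sketch of the pattern the man page describes, assuming a working local Spark installation: sparklyr's spark_write_csv() dispatches on a Spark DataFrame (tbl_spark), so an R data.frame must first be copied into Spark with copy_to().

library(sparklyr)

sc <- spark_connect(master = "local")

# spark_write_csv() expects a tbl_spark, not a data.frame;
# calling it on input.df directly is what raises the UseMethod error.
input.df <- data.frame(x = 1:3, y = letters[1:3])
spark_df <- copy_to(sc, input.df, name = "spark_df", overwrite = TRUE)

# Note that Spark writes the path as a directory of part files, not a single CSV.
spark_write_csv(spark_df, tempfile(), mode = "overwrite")

spark_disconnect(sc)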

@DorisAmoakohene (Owner, Author)

@tdhock I did what you suggested and added the packages required for running Spark, but I still run into the error
"Error in eval(mc.args$setup, N.env) : object 'sc' not found"

I defined sc as this:

 sc <- spark_connect(master = "local")

and I'm running into this error message:

  • Using Spark: 3.2.0
    Error in force(code) :
    Failed during initialize_connection: org.apache.hadoop.security.KerberosAuthException: failure to login: javax.security.auth.login.LoginException: unable to find LoginModule class: org.apache.hadoop.shaded.com.ibm.security.auth.module.JAASLoginModule
    Log: C:\Users\DORISA~1\AppData\Local\Temp\Rtmp0MTCRB\file2ecc305a8de_spark.log

---- Output Log ----

In addition, I get this error when trying to benchmark the efficiency of polars for writing CSV.

This is the error message:
Error in py_get_attr_impl(x, name, silent) :
AttributeError: module 'polars' has no attribute 'write_csv'
Run reticulate::py_last_error() for details.

This is the whole code I'm running:

library(data.table)
library(readr)
library(arrow)
library(ggplot2)
library(collapse)
library(sparklyr)
library(reticulate)
sc <- spark_connect(master = "local")

write.colors <- c(
  "readr::write_csv"="#9970AB",
  "data.table::fwrite"="#D6604D",
  "write_csv_arrow"="#BF812D", 
  "polars::write_csv"="#33A02C",
  "write_CSV_COllapse" = "#722f37",
  "write_csv_spark"= "pink",
  "write.csv2"= "#1F78B4",
  "utils::write.csv"="deepskyblue")

n.rows <- 100
seconds.limit <- 1

py_polars <- import("polars")

atime.write.vary.cols <- atime::atime(
  N=as.integer(10^seq(2, 6, by=0.5)),
  setup={
    set.seed(1)
    input.vec <- rnorm(n.rows*N)
    input.mat <- matrix(input.vec, n.rows, N)
    input.df <- data.frame(input.mat)
    spark_df <- copy_to(sc, input.mat, name = "spark_df")
    py_df <- py_polars$DataFrame(input.df)
  },
  seconds.limit = seconds.limit,
  "data.table::fwrite"={
    data.table::fwrite(input.df, tempfile(), showProgress = FALSE)
  },
  "write_csv_arrow"={
    arrow::write_csv_arrow(input.df, tempfile())
  },
  "readr::write_csv"={
    readr::write_csv(input.df, tempfile(), progress = FALSE)
  },
  "polars::write_csv" = {
    py_polars$write_csv(py_df, tempfile())
  },
  "write_csv_collapse"={
    write.csv.collapse(input.df, tempfile())
  },
  "write_csv_spark"={
    spark_write_csv(spark_df, tempfile(), mode = "overwrite")
  },
  "write.csv2" = {
    write.csv2(input.df, tempfile())
  },
  "utils::write.csv"= {
    utils::write.csv(input.df, tempfile())
  }
)

@tdhock

@tdhock commented Oct 26, 2023
Your code is trying to use reticulate / Python polars? Please use R polars instead, as described here: https://github.com/pola-rs/r-polars/blob/main/vignettes/userguide.Rmd
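
The immediate cause of the AttributeError is that write_csv is a method on the polars DataFrame object, not a module-level function. Below is a minimal sketch of both routes; the R polars method names are assumptions based on the linked user guide, so verify them against your installed version.

# Via reticulate: call write_csv as a DataFrame method.
library(reticulate)
py_polars <- import("polars")
py_df <- py_polars$DataFrame(data.frame(x = 1:3))
py_df$write_csv(tempfile(fileext = ".csv"))

# Via the R polars package (assumed API, per the linked user guide):
library(polars)
r_df <- pl$DataFrame(data.frame(x = 1:3))
r_df$write_csv(tempfile(fileext = ".csv"))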

@tdhock commented Oct 26, 2023

For Spark, I'm not sure what the issue is, but you should check that the sc object is actually being created.
Also, maybe try the other R package, SparkR? https://learn.microsoft.com/en-us/azure/databricks/sparkr/sparkr-vs-sparklyr
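
A minimal sketch of such a check, assuming sparklyr (connection_is_open() and spark_version() are standard sparklyr helpers):

library(sparklyr)

sc <- tryCatch(
  spark_connect(master = "local"),
  error = function(e) {
    message("Spark connection failed: ", conditionMessage(e))
    NULL
  }
)

if (!is.null(sc) && connection_is_open(sc)) {
  print(spark_version(sc))  # confirms the connection is actually usable
  spark_disconnect(sc)
}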
