
Reading csv with "spark_write_csv" = spark_write_csv #8

Open
DorisAmoakohene opened this issue Oct 25, 2023 · 4 comments

Comments

@DorisAmoakohene (Owner)

@tdhock

I am trying to write code that writes a CSV with Spark. This is the code I run:
"spark_write_csv" = spark_write_csv(input.df,tempfile(),TRUE)

write.colors <- c(
  "readr::write_csv"="#9970AB",
  "data.table::fwrite"="#D6604D",
  "write_csv_arrow"="#BF812D", 
  "write.csv2" = "#722f37",
  "spark_read_csv"= "pink",
  "utils::write.csv"="deepskyblue")

n.rows <- 100
seconds.limit <- 1

atime.write.vary.cols <- atime::atime(
  N=as.integer(10^seq(2, 6, by=0.5)),
  setup={
    set.seed(1)
    input.vec <- rnorm(n.rows*N)
    input.mat <- matrix(input.vec, n.rows, N)
    input.df <- data.frame(input.mat)
  },
  seconds.limit = seconds.limit,
  "data.table::fwrite"={
    data.table::fwrite(input.df, tempfile(), showProgress = FALSE)
  },
  "write_csv_arrow"={
    arrow::write_csv_arrow(input.df, tempfile())
  },
  "readr::write_csv"={
    readr::write_csv(input.df, tempfile(), progress = FALSE)
  },
  "write.csv2" = {
    write.csv2(input.df, tempfile())
  },
  "spark_write_csv" = spark_write_csv(input.df,tempfile(),TRUE),
  "utils::write.csv"=utils::write.csv(input.df, tempfile()))

I'm getting this error; kindly assist:

Error in UseMethod("spark_write_csv") :
no applicable method for 'spark_write_csv' applied to an object of class "data.frame"

@tdhock commented Oct 25, 2023

This error message indicates that the spark_write_csv function does not work for data.frame inputs. You should look at the man page, ?spark_write_csv, to see what the right input data type is.
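
For concreteness, here is a minimal sketch of the pattern the man page describes, assuming a working local Spark installation: sparklyr's spark_write_csv() dispatches on a Spark DataFrame (tbl_spark), so an R data.frame must first be copied into Spark with copy_to().

library(sparklyr)

sc <- spark_connect(master = "local")

# spark_write_csv() expects a tbl_spark, not a data.frame;
# calling it on input.df directly is what raises the UseMethod error.
input.df <- data.frame(x = 1:3, y = letters[1:3])
spark_df <- copy_to(sc, input.df, name = "spark_df", overwrite = TRUE)

# Note that Spark writes the path as a directory of part files, not a single CSV.
spark_write_csv(spark_df, tempfile(), mode = "overwrite")

spark_disconnect(sc)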

@DorisAmoakohene (Owner, Author)

@tdhock I did what you suggested and added the packages required for running Spark, but I still run into the error
"Error in eval(mc.args$setup, N.env) : object 'sc' not found"

I defined sc as this:

 sc <- spark_connect(master = "local")

and I'm running into this error message:

  • Using Spark: 3.2.0
    Error in force(code) :
    Failed during initialize_connection: org.apache.hadoop.security.KerberosAuthException: failure to login: javax.security.auth.login.LoginException: unable to find LoginModule class: org.apache.hadoop.shaded.com.ibm.security.auth.module.JAASLoginModule
    Log: C:\Users\DORISA~1\AppData\Local\Temp\Rtmp0MTCRB\file2ecc305a8de_spark.log

---- Output Log ----

In addition, I get this error when trying to benchmark the efficiency of polars for writing CSV.

This is the error message:
Error in py_get_attr_impl(x, name, silent) :
AttributeError: module 'polars' has no attribute 'write_csv'
Run reticulate::py_last_error() for details.

This is the whole code I'm running:

library(data.table)
library(readr)
library(arrow)
library(ggplot2)
library(collapse)
library(sparklyr)
library(reticulate)
sc <- spark_connect(master = "local")

write.colors <- c(
  "readr::write_csv"="#9970AB",
  "data.table::fwrite"="#D6604D",
  "write_csv_arrow"="#BF812D", 
  "polars::write_csv"="#33A02C",
  "write_CSV_COllapse" = "#722f37",
  "write_csv_spark"= "pink",
  "write.csv2"= "#1F78B4",
  "utils::write.csv"="deepskyblue")

n.rows <- 100
seconds.limit <- 1

py_polars <- import("polars")

atime.write.vary.cols <- atime::atime(
  N=as.integer(10^seq(2, 6, by=0.5)),
  setup={
    set.seed(1)
    input.vec <- rnorm(n.rows*N)
    input.mat <- matrix(input.vec, n.rows, N)
    input.df <- data.frame(input.mat)
    spark_df <- copy_to(sc, input.mat, name = "spark_df")
    py_df <- py_polars$DataFrame(input.df)
  },
  seconds.limit = seconds.limit,
  "data.table::fwrite"={
    data.table::fwrite(input.df, tempfile(), showProgress = FALSE)
  },
  "write_csv_arrow"={
    arrow::write_csv_arrow(input.df, tempfile())
  },
  "readr::write_csv"={
    readr::write_csv(input.df, tempfile(), progress = FALSE)
  },
  "polars::write_csv" = {
    py_polars$write_csv(py_df, tempfile())
  },
  "write_csv_collapse"={
    write.csv.collapse(input.df, tempfile())
  },
  "write_csv_spark"={
    spark_write_csv(spark_df, tempfile(), mode = "overwrite")
  },
  "write.csv2" = {
    write.csv2(input.df, tempfile())
  },
  "utils::write.csv"= {
    utils::write.csv(input.df, tempfile())
  }
)

@tdhock

@tdhock commented Oct 26, 2023
Your code is trying to use reticulate / Python polars? Please use R polars instead, as described here: https://github.com/pola-rs/r-polars/blob/main/vignettes/userguide.Rmd
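
The immediate cause of the AttributeError is that write_csv is a method on the polars DataFrame object, not a module-level function. Below is a minimal sketch of both routes; the R polars method names are assumptions based on the linked user guide, so verify them against your installed version.

# Via reticulate: call write_csv as a DataFrame method.
library(reticulate)
py_polars <- import("polars")
py_df <- py_polars$DataFrame(data.frame(x = 1:3))
py_df$write_csv(tempfile(fileext = ".csv"))

# Via the R polars package (assumed API, per the linked user guide):
library(polars)
r_df <- pl$DataFrame(data.frame(x = 1:3))
r_df$write_csv(tempfile(fileext = ".csv"))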

@tdhock commented Oct 26, 2023

For Spark, I'm not sure what the issue is, but you should check that the sc object is actually being created.
Also, maybe try the other R package, SparkR? https://learn.microsoft.com/en-us/azure/databricks/sparkr/sparkr-vs-sparklyr
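
A minimal sketch of such a check, assuming sparklyr (connection_is_open() and spark_version() are standard sparklyr helpers):

library(sparklyr)

sc <- tryCatch(
  spark_connect(master = "local"),
  error = function(e) {
    message("Spark connection failed: ", conditionMessage(e))
    NULL
  }
)

if (!is.null(sc) && connection_is_open(sc)) {
  print(spark_version(sc))  # confirms the connection is actually usable
  spark_disconnect(sc)
}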
