diff --git a/docs/Function library.md b/docs/Function library.md
index de2aa1a9eb..6ecfbc28c7 100644
--- a/docs/Function library.md
+++ b/docs/Function library.md
@@ -663,7 +663,7 @@ returns the (single) JSON value read from the supplied JSON file. This will also
 
 ## Integration with HDFS and Spark
 
-We support two more functions to read a JSON file from HDFS or send a large sequence to the cluster:
+We support more functions to read JSON, Parquet, CSV, text and ROOT files from various storage layers such as S3 and HDFS, or send a large sequence to the cluster. Supported schemes are: file, s3, s3a, s3n, hdfs, wasb, gs and root.
 
 ### json-file (Rumble specific)
diff --git a/docs/Getting started.md b/docs/Getting started.md
index a503cd0bd2..842fa8c7bc 100644
--- a/docs/Getting started.md
+++ b/docs/Getting started.md
@@ -43,7 +43,7 @@ Create, in the same directory as Rumble, a file data.json and put the following
 
 In a shell, from the directory where the Rumble .jar lies, type, all on one line:
 
-    spark-submit --master local[*] --deploy-mode client spark-rumble-1.4.jar --shell yes
+    spark-submit --master local[*] --deploy-mode client spark-rumble-1.5.jar --shell yes
 
 The Rumble shell appears:
diff --git a/docs/JSONiq.md b/docs/JSONiq.md
index 2440adbe1c..a0ceb22d73 100644
--- a/docs/JSONiq.md
+++ b/docs/JSONiq.md
@@ -42,7 +42,7 @@ return count($z)
 
 ### Expressions pushed down to Spark
 
-Some expressions are pushed down to Spark out of the box. For example, this will work on a large file leveraging the parallelism of Spark:
+Many expressions are pushed down to Spark out of the box. For example, this will work on a large file leveraging the parallelism of Spark:
 
 ```
 count(json-file("file.json")[$$.field eq "foo"].bar[].foo[[1]])
@@ -54,6 +54,8 @@ What is pushed down so far is:
 - aggregation functions such as count
 - JSON navigation expressions: object lookup (as well as keys() call), array lookup, array unboxing, filtering predicates
+- type checking (instance of, treat as)
+- many builtin function calls (head, tail, exists, etc.)
 - predicates on positions, including the use of the context-dependent functions position() and last(), e.g.,
 
 ```
 json-file("file.json")[position() ge 10 and position() le last() - 2]
@@ -61,7 +63,7 @@ json-file("file.json")[position() ge 10 and position() le last() - 2]
 ```
 
 More expressions working on sequences will be pushed down in the future, prioritized based on the feedback we receive.
 
-We also started to push down some expressions to DataFrames and Spark SQL. In particular, keys() pushes down the schema lookup if used on parquet-file() and structured-json-file(). Likewise, count() on these is also pushed down.
+We also started to push down some expressions to DataFrames and Spark SQL (obtained via structured-json-file, csv-file and parquet-file calls). In particular, keys() pushes down the schema lookup if used on parquet-file() and structured-json-file(). Likewise, count(), object lookup, array unboxing and array lookup are also pushed down on DataFrames.
 
 When an expression does not support pushdown, it will materialize automatically. To avoid issues, the materialization is capped by default at 100 items, but this can be changed on the command line with --result-size. A warning is issued if a materialization happened and the sequence was truncated.
@@ -69,9 +71,11 @@ When an expression does not support pushdown, it will materialize automatically.
 
 Prologs with user-defined functions are now supported, but not yet global variables, settings and library modules.
 
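+For example, a query whose prolog declares a user-defined function that is then called from the main query could look like this (an illustrative sketch; the function name and body are arbitrary):
+
+```
+declare function local:greet($name) {
+  { "greeting" : "Hello, " || $name }
+};
+local:greet("world")
+```
+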
-Dynamic functions (aka, function items that can be passed as values and dynamically called) are now supported.
+Global external variables with string values are supported (use "--variable:foo bar" on the command line to assign values to them).
 
-Builtin function calls are type-checked, but user-defined function calls and dynamic calls are not type-checked yet.
+Dynamic functions (aka, function items that can be passed as values and dynamically called) are supported.
+
+All function calls are type-checked.
 
 ### Unsupported
 
 try/catch
@@ -88,6 +92,7 @@ The type system is not quite complete yet, although a lot of progress was made.
 | Type | Status |
 |-------|--------|
 | atomic | supported |
+| anyURI | supported |
 | base64Binary | supported |
 | boolean | supported |
 | byte | not supported |
diff --git a/docs/Run on a cluster.md b/docs/Run on a cluster.md
index 219d031169..b549492368 100644
--- a/docs/Run on a cluster.md
+++ b/docs/Run on a cluster.md
@@ -5,19 +5,19 @@ simply by modifying the command line parameters as documented [here for spark-su
 
 If the Spark cluster is running on yarn, then the --master option must be changed from local[\*] to yarn compared to the getting started guide.
 
-    spark-submit --master yarn --deploy-mode client spark-rumble-1.4.jar --shell yes
+    spark-submit --master yarn --deploy-mode client spark-rumble-1.5.jar --shell yes
 
 You can also adapt the number of executors, etc.
 
     spark-submit --master yarn --deploy-mode client
     --num-executors 30 --executor-cores 3 --executor-memory 10g
-    spark-rumble-1.4.jar --shell yes
+    spark-rumble-1.5.jar --shell yes
 
 The size limit for materialization can also be made higher with --result-size (the default is 100). This affects the number of items displayed on the shell as an answer to a query, as well as any materializations happening within the query when push-down is not supported. Warnings are issued if the cap is reached.
 
     spark-submit --master yarn --deploy-mode client
     --num-executors 30 --executor-cores 3 --executor-memory 10g
-    spark-rumble-1.4.jar
+    spark-rumble-1.5.jar
     --shell yes
     --result-size 10000
 
 ## Creation functions
@@ -58,7 +58,7 @@ Rumble also supports executing a single query from the command line, reading fro
 
     spark-submit --master yarn --deploy-mode client
     --num-executors 30 --executor-cores 3 --executor-memory 10g
-    spark-rumble-1.4.jar
+    spark-rumble-1.5.jar
     --query-path "hdfs:///user/me/query.jq"
     --output-path "hdfs:///user/me/results/output"
     --log-path "hdfs:///user/me/logging/mylog"
@@ -67,7 +67,7 @@ The query path can also be a local, absolute path. It is also possible to omit t
 
     spark-submit --master yarn --deploy-mode client
     --num-executors 30 --executor-cores 3 --executor-memory 10g
-    spark-rumble-1.4.jar
+    spark-rumble-1.5.jar
     --query-path "/home/me/my-local-machine/query.jq"
     --output-path "/user/me/results/output"
     --log-path "hdfs:///user/me/logging/mylog"
diff --git a/docs/install.md b/docs/install.md
index 5523027e25..e2a2abd5b1 100644
--- a/docs/install.md
+++ b/docs/install.md
@@ -58,13 +58,13 @@ Once the ANTLR sources have been generated, you can compile the entire project l
 
     $ mvn clean compile assembly:single
 
-After successful completion, you can check the `target` directory, which should contain the compiled classes as well as the JAR file `spark-rumble-1.4-jar-with-dependencies.jar`.
+After successful completion, you can check the `target` directory, which should contain the compiled classes as well as the JAR file `spark-rumble-1.5.jar`.
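+
+For instance, assuming a standard Maven layout, you can double-check that the JAR was produced by listing the build output (an illustrative command; the exact file name depends on the configured version):
+
+    $ ls target/*.jar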
 
 ## Running locally
 
 The most straightforward way to test whether the above steps were successful is to run the Rumble shell locally, like so:
 
-    $ spark-submit --master local[2] --deploy-mode client target/spark-rumble-1.4-jar-with-dependencies.jar --shell yes
+    $ spark-submit --master local[2] --deploy-mode client target/spark-rumble-1.5.jar --shell yes
 
 The Rumble shell should start:
@@ -113,6 +113,6 @@ This is it. Rumble is set up and ready to go locally. You can now move on to a JSO
 
 You can also try to run the Rumble shell on a cluster if you have one available and configured -- this is done in the same way as any other `spark-submit` command:
 
-    $ spark-submit --master yarn --deploy-mode client --num-executors 40 spark-rumble-1.4.jar
+    $ spark-submit --master yarn --deploy-mode client --num-executors 40 spark-rumble-1.5.jar
 
 More details are provided in the rest of the documentation.
diff --git a/mkdocs.yml b/mkdocs.yml
index 186badd5f9..45376d943b 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -1,4 +1,4 @@
-site_name: Rumble 1.4.0 "Willow Oak" beta
+site_name: Rumble 1.5.0 "Southern Live Oak" beta
 pages:
 - '1. Documentation home': 'index.md'
 - '2. Getting started': 'Getting started.md'
diff --git a/src/main/resources/assets/banner.txt b/src/main/resources/assets/banner.txt
index 6747cc6947..87c9d3b453 100644
--- a/src/main/resources/assets/banner.txt
+++ b/src/main/resources/assets/banner.txt
@@ -1,6 +1,6 @@
     ____                  __    __
    / __ \__  ______ ___  / /_  / /__
   / /_/ / / / / __ `__ \/ __ \/ / _ \   The distributed JSONiq engine
- / _, _/ /_/ / / / / / / /_/ / /  __/   1.4 "Willow Oak" beta
+ / _, _/ /_/ / / / / / / /_/ / /  __/   1.5 "Southern Live Oak" beta
 /_/ |_|\__,_/_/ /_/ /_/_.___/_/\___/