Merge pull request #606 from RumbleDB/Version1.5

Version1.5

ghislainfourny authored Mar 30, 2020
2 parents a87fb05 + d7e225b commit d161bab
Showing 7 changed files with 21 additions and 16 deletions.
2 changes: 1 addition & 1 deletion docs/Function library.md
@@ -663,7 +663,7 @@ returns the (single) JSON value read from the supplied JSON file. This will also

## Integration with HDFS and Spark

-We support two more functions to read a JSON file from HDFS or send a large sequence to the cluster:
+We support more functions to read JSON, Parquet, CSV, text and ROOT files from various storage layers such as S3 and HDFS, or to send a large sequence to the cluster. Supported schemes are: file, s3, s3a, s3n, hdfs, wasb, gs and root.
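For instance, a JSON file on S3 can be read with the same function as a local one; a minimal sketch (the bucket and path are hypothetical):

```
count(json-file("s3://my-bucket/collection/data.json"))
```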

### json-file (Rumble specific)

2 changes: 1 addition & 1 deletion docs/Getting started.md
@@ -43,7 +43,7 @@ Create, in the same directory as Rumble, a file data.json and put the following

In a shell, from the directory where the rumble .jar lies, type, all on one line:

-spark-submit --master local[*] --deploy-mode client spark-rumble-1.4.jar --shell yes
+spark-submit --master local[*] --deploy-mode client spark-rumble-1.5.jar --shell yes
The Rumble shell appears:

13 changes: 9 additions & 4 deletions docs/JSONiq.md
@@ -42,7 +42,7 @@ return count($z)

### Expressions pushed down to Spark

-Some expressions are pushed down to Spark out of the box. For example, this will work on a large file, leveraging the parallelism of Spark:
+Many expressions are pushed down to Spark out of the box. For example, this will work on a large file, leveraging the parallelism of Spark:

```
count(json-file("file.json")[$$.field eq "foo"].bar[].foo[[1]])
```
@@ -54,24 +54,28 @@ What is pushed down so far is:
- aggregation functions such as count
- JSON navigation expressions: object lookup (as well as keys() call), array lookup, array unboxing, filtering predicates
- predicates on positions, including use of the context-dependent functions position() and last(), e.g.,
+- type checking (instance of, treat as)
+- many builtin function calls (head, tail, exists, etc.; see the sketch after the example below)

```
json-file("file.json")[position() ge 10 and position() le last() - 2]
```
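As an illustration of the newly added pushdowns, here is a minimal sketch, reusing the hypothetical file.json from above, that combines an aggregation, a builtin call and a type-checking predicate:

```
count(tail(json-file("file.json"))[$$ instance of object])
```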

More expressions working on sequences will be pushed down in the future, prioritized based on the feedback we receive.

-We also started to push down some expressions to DataFrames and Spark SQL. In particular, keys() pushes down the schema lookup if used on parquet-file() and structured-json-file(). Likewise, count() on these is also pushed down.
+We have also started to push down some expressions to DataFrames and Spark SQL (obtained via structured-json-file, csv-file and parquet-file calls). In particular, keys() pushes down the schema lookup if used on parquet-file() and structured-json-file(). Likewise, count() as well as object lookup, array unboxing and array lookup are also pushed down on DataFrames.

+When an expression does not support pushdown, it will materialize automatically. To avoid issues, the materialization is capped by default at 100 items, but this can be changed on the command line with --result-size. A warning is issued if a materialization happened and the sequence was truncated.
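For example, a schema lookup on a Parquet file stays entirely in the DataFrame world and is not materialized; a sketch with a hypothetical file name:

```
keys(parquet-file("file.parquet"))
```

A count() over the same call is likewise answered by Spark SQL without bringing the items into memory.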

### Unsupported global variables, settings and modules

Prologs with user-defined functions are now supported, but not yet global variables, settings and library modules.

-Dynamic functions (aka, function items that can be passed as values and dynamically called) are now supported.
+Global external variables with string values are supported (use "--variable:foo bar" on the command line to assign values to them).

-Builtin function calls are type-checked, but user-defined function calls and dynamic calls are not type-checked yet.
+Dynamic functions (aka, function items that can be passed as values and dynamically called) are supported.

+All function calls are type-checked.
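As a sketch of how such an external variable could be bound and used (the name foo and value bar are placeholders, and we assume the query references $foo directly):

spark-submit --master local[*] --deploy-mode client spark-rumble-1.5.jar --query-path "query.jq" --variable:foo bar

With query.jq containing the expression $foo, the output would then be the string bar.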

### Unsupported try/catch

@@ -88,6 +92,7 @@ The type system is not quite complete yet, although a lot of progress was made.
| Type | Status |
|-------|--------|
| atomic | supported |
+| anyURI | supported |
| base64Binary | supported |
| boolean | supported |
| byte | not supported |
10 changes: 5 additions & 5 deletions docs/Run on a cluster.md
@@ -5,19 +5,19 @@ simply by modifying the command line parameters as documented [here for spark-su

If the Spark cluster is running on yarn, then the --master option must be changed from local[\*] to yarn compared to the getting started guide.

-spark-submit --master yarn --deploy-mode client spark-rumble-1.4.jar --shell yes
+spark-submit --master yarn --deploy-mode client spark-rumble-1.5.jar --shell yes
You can also adapt the number of executors, etc.

spark-submit --master yarn --deploy-mode client
--num-executors 30 --executor-cores 3 --executor-memory 10g
-spark-rumble-1.4.jar --shell yes
+spark-rumble-1.5.jar --shell yes

The size limit for materialization can also be raised with --result-size (the default is 100). This affects the number of items displayed in the shell as an answer to a query, as well as any materializations happening within the query when push-down is not supported. Warnings are issued if the cap is reached.

spark-submit --master yarn --deploy-mode client
--num-executors 30 --executor-cores 3 --executor-memory 10g
-spark-rumble-1.4.jar
+spark-rumble-1.5.jar
--shell yes --result-size 10000

## Creation functions
@@ -58,7 +58,7 @@ Rumble also supports executing a single query from the command line, reading fro

spark-submit --master yarn --deploy-mode client
--num-executors 30 --executor-cores 3 --executor-memory 10g
-spark-rumble-1.4.jar
+spark-rumble-1.5.jar
--query-path "hdfs:///user/me/query.jq"
--output-path "hdfs:///user/me/results/output"
--log-path "hdfs:///user/me/logging/mylog"
@@ -67,7 +67,7 @@ The query path can also be a local, absolute path. It is also possible to omit t

spark-submit --master yarn --deploy-mode client
--num-executors 30 --executor-cores 3 --executor-memory 10g
-spark-rumble-1.4.jar
+spark-rumble-1.5.jar
--query-path "/home/me/my-local-machine/query.jq"
--output-path "/user/me/results/output"
--log-path "hdfs:///user/me/logging/mylog"
6 changes: 3 additions & 3 deletions docs/install.md
@@ -58,13 +58,13 @@ Once the ANTLR sources have been generated, you can compile the entire project l

$ mvn clean compile assembly:single

-After successful completion, you can check the `target` directory, which should contain the compiled classes as well as the JAR file `spark-rumble-1.4-jar-with-dependencies.jar`.
+After successful completion, you can check the `target` directory, which should contain the compiled classes as well as the JAR file `spark-rumble-1.5.jar`.

## Running locally

The most straightforward way to test whether the above steps were successful is to run the Rumble shell locally, like so:

-$ spark-submit --master local[2] --deploy-mode client target/spark-rumble-1.4-jar-with-dependencies.jar --shell yes
+$ spark-submit --master local[2] --deploy-mode client target/spark-rumble-1.5.jar --shell yes

The Rumble shell should start:

@@ -113,6 +113,6 @@ This is it. Rumble is set and ready to go locally. You can now move on to a JSO

You can also try to run the Rumble shell on a cluster if you have one available and configured -- this is done in the same way as any other `spark-submit` command:

-$ spark-submit --master yarn --deploy-mode client --num-executors 40 spark-rumble-1.4.jar
+$ spark-submit --master yarn --deploy-mode client --num-executors 40 spark-rumble-1.5.jar

More details are provided in the rest of the documentation.
2 changes: 1 addition & 1 deletion mkdocs.yml
@@ -1,4 +1,4 @@
-site_name: Rumble 1.4.0 "Willow Oak" beta
+site_name: Rumble 1.5.0 "Southern Live Oak" beta
pages:
- '1. Documentation home': 'index.md'
- '2. Getting started': 'Getting started.md'
2 changes: 1 addition & 1 deletion src/main/resources/assets/banner.txt
@@ -1,6 +1,6 @@
____ __ __
/ __ \__ ______ ___ / /_ / /__
/ /_/ / / / / __ `__ \/ __ \/ / _ \ The distributed JSONiq engine
-/ _, _/ /_/ / / / / / / /_/ / /  __/ 1.4 "Willow Oak" beta
+/ _, _/ /_/ / / / / / / /_/ / /  __/ 1.5 "Southern Live Oak" beta
/_/ |_|\__,_/_/ /_/ /_/_.___/_/\___/
