State of the Julia climate stack & desired additions #2

Open
gaelforget opened this issue Jan 24, 2020 · 18 comments

@gaelforget
Member

In relation to #1 but somewhat distinct, let's discuss the current stack of packages and functionalities

Topics could include:

  • what exists or seems missing?
  • how mature are different pieces that you know or contribute to?
  • what do you use and really like in relation to JuliaClimate?
  • what do you lack and really want to have / contribute to?
  • ...
@gaelforget
Member Author

In a slightly different context I put a list together that is not unrelated @ PraCTES/MIT-PraCTES#17

@gaelforget changed the title from "State of the climate stack & desired additions" to "State of the Julia climate stack & desired additions" on Jan 24, 2020
@Datseris
Member

I believe I speak for everyone when I say that I desire a stable and well-accepted interface for dimensional data, with a quality matching e.g. Python's xarray. I am currently contributing to and using DimensionalData.jl. It is not mature yet, and also not part of an org yet.

I personally only talk about low-level dimensional data format; a "dedicated format for geo-related data" that e.g. GeoData.jl tries to do is definitely not for me. I always prefer working with basic structs where I decide what is what and can be shared with scientists that use other programming languages (the more basic your data structure, the easier it is to share and communicate). Although I would imagine that something like GeoData.jl, but stable, well accepted and part of an organization would be helpful to many.

Having GeoMakie.jl in an improved state would be awesome. The lead dev @asinghvi17 is an awesome person I've worked with in the past, and I am sure we will work together more on GeoMakie.jl, which I will soon start contributing to. I also plan to make interactive applications based on GeoMakie.jl. Only a few issues hold me back at the moment before I start contributing there, and I am sure they will be resolved fast.

ClimateTools has a lot of convenient things, but I think we could add flexibility on what is the core underlying data representation (and how detached it is from other packages). I have some methods I am currently writing for my project that I would be very happy to contribute, e.g. inter-annual variabilities.

@natgeo-wong
Member

natgeo-wong commented Jan 24, 2020

Also, I feel a lot of attention for the org now is going into efficient data-interface and handling, but I think there should also be discussion on the more applied aspects of what's being created here, such as data and analysis from different missions, and analysis of models that climate scientists use. (tag @briochemc because I've noticed he's created packages for analysis of oceanographic mission data, which is relevant)

I'm less on the data-interface side, and more on coding up things like

  • retrieval of data from satellite missions (think NASA's GPM or TRMM missions) (see ClimateSatellite.jl)
  • analysis of GCM data (e.g. Isca https://github.com/ExeClim/Isca) (see ClimateIsca.jl) (tag @aramirezreyes here for SAM things)
  • retrieval, analysis (and maybe plotting) of reanalysis data such as ERA-Interim and ERA5 (see ClimateERA.jl)

And I usually code with those goals in mind, so a lot of my functions are relatively general (e.g. I don't use the ClimGrid structure, partially because I manipulate the data in the backend as raw data arrays), and store meta-information in textfiles.

I also do random things like GillMatsuno.jl, which allows for the exploration of the solutions to tropical heating on a beta plane (similar to @milankl's ShallowWater.jl), but maybe these packages should eventually be organised separately in another place dedicated to idealised/simple models, or something.

One more thing I need to do is get documentation up for ease of use, but I still haven't figured out how Documenter.jl works - it's on my todo list tho.

@hdrake

hdrake commented Jan 24, 2020

In response to @meggart's comment in the master thread: #1 (comment)

I would not know how to efficiently do this workflow in xarray (fitting a PCA on multivariate time series for every pixel in a global dataset)

This is fairly straightforward with the eofs package, which leverages xarray. I've used it and found it incredibly easy to do EOF analysis on my dataset, despite not knowing what an EOF / PCA was at the beginning of the day.

@Balinus
Member

Balinus commented Jan 24, 2020

In response to @meggart's comment in the master thread: #1 (comment)

I would not know how to efficiently do this workflow in xarray (fitting a PCA on multivariate time series for every pixel in a global dataset)

This is fairly straightforward with the eofs package, which leverages xarray. I've used it and found it incredibly easy to do EOF analysis on my dataset, despite not knowing what an EOF / PCA was at the beginning of the day.

This is interesting to see the pattern: People are developing things on a common data structure relevant for climate studies (which is based on N-dimension arrays). Define a common API for that in Julia and then we can develop things related on those data structures much more easily.

One thing that will show Julia's strength is bringing the modelling itself into the pipeline. In most high-level languages we have something like this (Python example):

  1. Results of a simulation (Fortran) -> 2. xarray (Python) -> 3. xarray compatible library for specific analysis (EOF, Python) -> 4. matplotlib (Python)

where the simulation is usually a model in Fortran, etc... Now, what I see emerging in Julia is a lot of model development (see for example climate-machine). How can we leverage this for science? Quick access to a modular Julia-coded model is certainly something that we should promote. Hence, one strength of Julia is the availability (in the sense that it is Julia source code) and performance of models. I just think this accelerates the work that is done by students and academics in general (I had one PhD colleague who spent months trying to launch a Fortran-based climate model with some custom grid -> debugging was a nightmare). Anyway, those are mostly random thoughts I had today. Perhaps it's not relevant and not something we should aim for at the beginning. Or perhaps that's a good starting point, I don't know.

@hdrake

hdrake commented Jan 25, 2020

It seems to me like our best bet right now is to build tools around something like ESDL.jl (see their Pangeo data example), which seems to have the core functionality of xarray: labelled n-dimensional out-of-memory datasets w/ indexing and operations broadcasting, etc. (not sure how much it can all be distributed).

The underlying infrastructure behind all of this seems a bit unstable, however, since the julia community doesn't seem to have settled on a named-array / named-indexing package and even the ESDL.jl devs seem like they may shift their own array types to depend on a package like DimensionalData.jl. Alternatives are AxisArrays.jl (note: even NetCDF.jl may be deprecated in favor of NCDatasets.jl).

Maybe I'm missing something, I need to dig a bit deeper into these repos (unfortunately ESDL.jl documentation is broken for me).

@hdrake

hdrake commented Jan 25, 2020

Define a common API for that in Julia and then we can develop things related on those data structures much more easily.

Agreed @Balinus, this is key. Right now I don't even know where to start because there are so many different labelled array types that you could build climate tools around, but it is not clear to me which features each have and if any of them have all the features we would want in a package to rally around.

My recent paper https://github.com/hdrake/AbyssalFlow was so much easier because both the GCM and my post-processing are in julia. (Unfortunately, the model itself is only coded to run in serial and very, very far from optimized).

@Balinus
Member

Balinus commented Jan 26, 2020

It seems to me like our best bet right now is to build tools around something like ESDL.jl (see their Pangeo data example), which seems to have the core functionality of xarray: labelled n-dimensional out-of-memory datasets w/ indexing and operations broadcasting, etc. (not sure how much it can all be distributed).

I also think we should look closely at ESDL and try to extend the package further to meet our common needs as most boxes are checked imho.

Agreed @Balinus, this is key. Right now I don't even know where to start because there are so many different labelled array types that you could build climate tools around, but it is not clear to me which features each have and if any of them have all the features we would want in a package to rally around.

Indeed! I'm using AxisArrays in ClimateTools and was quite happy with it. However, not sure it's possible to build out-of-core arrays with labels. How are labels used in ESDL? @meggart

edit - Also found this package ChunkedArrayBase.

@meggart

meggart commented Jan 29, 2020

This is fairly straightforward with the eofs package, which leverages xarray. I've used it and found it incredibly easy to do EOF analysis on my dataset, despite not knowing what an EOF / PCA was at the beginning of the day.

I don't think so. Please look at the example, this is not a standard EOF analysis. Here we fit a new PCA for every single pixel, and the reduced dimension is not time but the different variables. I think the key point in Julia is also that you could simply swap the PCA with a nonlinear dimensionality-reduction (DR) method or anything else. To summarize, what I think makes ESDL.jl attractive is that you have nice mapslices syntax for really arbitrary code; you don't have to rely on someone having already wrapped and vectorized your use case.
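To make the per-pixel idea concrete, here is a rough sketch in plain Python (standard library only, chosen only because xarray is the comparison point in this thread). It is not ESDL.jl code. The names `leading_pc`, `map_pixels` and `explained_fraction` are made up for illustration, the cube layout `cube[lat][lon][time][variable]` is an assumption, and the PCA is reduced to a power iteration for the first principal axis, which needs at least two time steps per pixel.

```python
import math


def leading_pc(series):
    """PCA over the variable axis of one pixel (first component only).

    series: list of T time steps, each a list of V variable values.
    Returns (first principal axis, explained variance fraction).
    """
    t, v = len(series), len(series[0])
    means = [sum(row[j] for row in series) / t for j in range(v)]
    centered = [[row[j] - means[j] for j in range(v)] for row in series]
    # V x V sample covariance between variables; time is the sample axis.
    cov = [[sum(r[i] * r[j] for r in centered) / (t - 1) for j in range(v)]
           for i in range(v)]
    total = sum(cov[i][i] for i in range(v))
    if total == 0.0:
        return [0.0] * v, 0.0
    # Power iteration; the asymmetric start vector is an arbitrary choice.
    vec = [1.0 / (j + 1) for j in range(v)]
    start_norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / start_norm for x in vec]
    for _ in range(200):
        w = [sum(cov[i][j] * vec[j] for j in range(v)) for i in range(v)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        vec = [x / norm for x in w]
    eigval = sum(vec[i] * cov[i][j] * vec[j]
                 for i in range(v) for j in range(v))
    return vec, eigval / total


def map_pixels(func, cube):
    """Apply func to the (time x variable) slice of every pixel."""
    return [[func(pixel) for pixel in row] for row in cube]


def explained_fraction(series):
    """Per-pixel summary: variance fraction carried by the first component."""
    return leading_pc(series)[1]
```

Calling `map_pixels(explained_fraction, cube)` returns one number per pixel. Because `map_pixels` takes any function of a single pixel slice, the PCA can be swapped for something else without touching the loop, which is the point being made about mapslices above.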

@meggart

meggart commented Jan 29, 2020

BTW, sorry for being so slow in replying these days. I am currently putting a lot of effort into DiskArrays.jl, which I hope will eventually give a big improvement in the way disk-mapped arrays can be treated inside the Julia-ecosystem.

After working with ESDL.jl for a while I think that treating climate data should feel as natural as possible. When currently using ESDL.jl one still has the feeling of being inside a framework, so you have separate data types and functions for everything. Simple things like broadcasting syntax, sums/means over dimensions etc. are all possible but suffer from the fact that they need a different syntax than what one would expect from Base Julia. So I really hope that as soon as we have stable DiskArrays, we can wrap them into another package implementing a labelled array and base our computing on these.

Currently ESDL.jl comes with its own labelled array type, but I would be happy to support other implementations of labelled arrays as well. My idea to get there was that we define a set of traits for Dimensional Arrays that just defines empty functions for querying the dimension names and dimension values for every axis.

So processing and plotting packages can query the coordinates of every point through the common interface but don't have to be specific about the actual data type they are operating on. This way, packages like ESDL.jl could operate on a variety of data types, as long as they implement the labelled array interface. I once made a gist to propose such an interface and after some discussion it resulted in this package https://github.com/JuliaGeo/DimensionalArrayTraits.jl which contains a lot of ideas but lacks a clear philosophy. I think any work towards a common interface for labelled array data types might have a huge impact on the interoperability of different packages and approaches inside the community.
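As a rough illustration of the interface idea (not the DimensionalArrayTraits.jl API), the sketch below uses Python's `typing.Protocol` as a stand-in for Julia's empty generic functions. Everything named here is hypothetical: `LabelledArray`, `dim_names`, `dim_values`, the toy `GridField` type and the `describe` consumer.

```python
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class LabelledArray(Protocol):
    """The whole interface: two query functions, no data layout implied."""

    def dim_names(self) -> Sequence[str]: ...

    def dim_values(self, name: str) -> Sequence: ...


class GridField:
    """Hypothetical in-memory field; any other array type could opt in too."""

    def __init__(self, data, lon, lat):
        self.data = data  # nested lists indexed as data[lon][lat]
        self._coords = {"lon": list(lon), "lat": list(lat)}

    def dim_names(self):
        return tuple(self._coords)

    def dim_values(self, name):
        return self._coords[name]


def describe(arr: LabelledArray) -> dict:
    """Generic consumer that only uses the two interface functions."""
    return {name: (arr.dim_values(name)[0], arr.dim_values(name)[-1])
            for name in arr.dim_names()}
```

A plotting or processing function written like `describe` never needs to know whether the data sit in memory, on disk, or in a remote store, as long as the array type answers the two queries.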

@Balinus
Member

Balinus commented Jan 30, 2020

It seems to me like our best bet right now is to build tools around something like ESDL.jl (see their Pangeo data example), which seems to have the core functionality of xarray: labelled n-dimensional out-of-memory datasets w/ indexing and operations broadcasting, etc. (not sure how much it can all be distributed).

Distributed calculations are supported in ESDL! See this thread: https://github.com/esa-esdl/ESDL.jl/issues/170

This is very nice imho. The API is not totally clear in my head, but we have a working example of how we could do a massive big-data analysis through ESDL.

@gaelforget
Member Author

gaelforget commented Jan 30, 2020

Distributed calculations are supported in ESDL! See this thread: esa-esdl/ESDL.jl#170

Cool. Will give it a try.

On a related note, I should mention https://github.com/gaelforget/ClimateTasks.jl (being registered now) which is meant to support distributed tasks (as opposed to the array, nc, etc parts) with a slightly more general but yet topical focus (e.g. to run models or analysis functions). Will expand on this thread soon ... once I have a couple more examples (the included example is an interpolation loop)

@meggart

meggart commented Feb 18, 2020

In case there is interest in a short introduction to the ESDL.jl API, I would be happy to have a call meeting where we could talk through the concepts in ESDL, comparisons to ClimateTasks.jl etc.

@Balinus
Member

Balinus commented Feb 18, 2020

Yes, I would be interested to know more about the details of ESDL and how I can use it for climate analysis. For instance, coming Friday (21st) might be possible for me.

Cheers!

@gaelforget
Member Author

gaelforget commented Mar 4, 2020

In case there is interest in a short introduction to the ESDL.jl API, I would be happy to have a call meeting where we could talk through the concepts in ESDL, comparisons to ClimateTasks.jl etc.

Sorry for the lag in response -- I am still running behind on a few things after coming back from OSM20 ...

Would be great to learn more about ESDL.jl which I have been meaning to try...

Unless the call meeting already happened, maybe next week would be good for all interested?

@Balinus
Member

Balinus commented Mar 9, 2020

Not too late to the show @gaelforget. :)

@meggart

meggart commented Mar 12, 2020

Yes, not too late. Let's try to schedule a telecon on potential use of ESDL, maybe next week? Which time zones are you in? I am in CET and would generally be available either during the day (8am-5pm) or in the evening (after 8:30pm). Once we know which times work for all of us, we could try to fix a date; otherwise feel free to start a doodle or similar.

@Balinus
Member

Balinus commented Mar 20, 2020

I'm in Eastern time: https://www.timeanddate.com/time/zones/et

Now with COVID-19, here in Québec all schools are closed, probably until mid-May. I have 3 small kids and need to work as well! Hence, not sure I'm gonna be free before a couple of weeks.

If the demo goes ahead, I suggest you try to record it. Might be valuable information/tutorial material.
