Programmatically query the available model results #359

willu47 · 2019-04-15T14:47:13Z

Child issue of #350

Query the available model results across various levels in the hierarchy of:

modelrun
timestep <- these are defined in a model run
decision_iteration <- iterations exist or not depending on the decision module and may change from run to run. Note also that the number of iterations per timestep may change.
model_name <- this is a model within a system of systems
output_name <- these are defined in the model_name configuration
dimensions within the output_name <- these are defined in an output's Spec, also in config

Users should be able to fix one or more of the above levels and receive a multi-dimensional array of data that represents the unfixed data.

Suggest the following interface:

>>> from smif import Results
>>> results = Results(path_to_project)
>>> results.list()
['energy_central', 'energy_water_cp_cr']
>>> results.available_results('energy_central')
{model run: energy_central
  - sos model: energy  # <- note that there is only ever one Sos model in a model run, so don't need this nesting
    - sector model: energy_demand
      - output: cost
        - decision 1: 2010
        - decision 2: 2010, 2015
        - decision 3: 2010, 2015, 2020
        - decision 4: 2010, 2020
      - output: water_demand
        - decision 1: 2010
        - decision 2: 2010, 2015
        - decision 3: 2010, 2015, 2020
        - decision 4: 2010, 2020
}
>>> results.as_df(modelrun='energy_central', timestep=[2010, 2015, 2020], output_name=['water_demand'])
<returns correctly formatted pandas.DataFrame>
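
As a rough illustration of "fix some levels, get back whatever is left unfixed", here is a minimal sketch in plain Python. It is not the smif API: the `select` helper, the flat record layout and the values are assumptions made up for this example.

```python
def select(records, **fixed):
    """Return the records whose levels match every fixed value.

    Each keyword names a level; its value is either a single accepted
    value or a list/tuple/set of accepted values.
    """
    def accepted(wanted):
        return wanted if isinstance(wanted, (list, tuple, set)) else [wanted]

    return [
        record for record in records
        if all(record[level] in accepted(wanted) for level, wanted in fixed.items())
    ]


# Hypothetical flat records: one row per (modelrun, timestep, decision, output).
records = [
    {"modelrun": "energy_central", "timestep": 2010, "decision_iteration": 1,
     "output_name": "water_demand", "value": 3.0},
    {"modelrun": "energy_central", "timestep": 2015, "decision_iteration": 2,
     "output_name": "water_demand", "value": 4.0},
    {"modelrun": "energy_central", "timestep": 2015, "decision_iteration": 2,
     "output_name": "cost", "value": 10.0},
    {"modelrun": "energy_water_cp_cr", "timestep": 2010, "decision_iteration": 1,
     "output_name": "water_demand", "value": 2.0},
]

# Fix modelrun and output_name; timestep and decision_iteration stay free.
subset = select(records, modelrun="energy_central", output_name=["water_demand"])
print([(r["timestep"], r["value"]) for r in subset])
```

Levels that are not passed as keywords are left unfixed, so they vary across the returned records.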

tomalrussell · 2019-04-15T15:48:06Z

>>> from smif.data_layer import Results

# pass arguments sufficient to create a Store behind the scenes, which we'll have read-only access to
>>> results = Results({'interface': 'local_csv', 'directory': path_to_project})  

>>> results.list_model_runs()
['energy_central', 'energy_water_cp_cr']

# writing out this example for concreteness - there may be a more convenient data structure shape...
>>> results.available_results('energy_central')  
{
    'model_run': 'energy_central',
    'sos_model': 'energy',
    'models': [
        {
            'name': 'energy_demand',
            'outputs': [
                {
                    'name': 'cost',
                    'decision_timesteps': {
                        1: [2010],
                        2: [2010, 2015]
                    }
                },
                {
                    'name': 'water_demand',
                    'decision_timesteps': {
                        1: [2010],
                        2: [2010, 2015]
                    }
                }
            ]
        }
    ]
}
>>> da = results.read(modelruns=['energy_central'], timesteps=[2010, 2015, 2020], output_names=['water_demand'])
>>> da.as_df()
<returns correctly formatted pandas.DataFrame, including columns for timestep,modelrun,decision>

decide how to handle the decision/timestep interaction

results.read(..., decision_timesteps=[(1, 2010), (3, 2020)])  # explicit
results.read(..., timesteps=[2010])  # implicitly for all decisions at each timestep
results.read(..., decisions=[1,2])  # implicitly for all timesteps at each decision 
results.read(outputs=['water_demand'])  # implicitly for all decision/timesteps available
results.read(..., timesteps=[2010], decisions=[1, 2])  # ??? error? something reasonable?
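
One possible way to resolve these combinations, sketched in plain Python. The function name `resolve_decision_timesteps` and the choice to warn (rather than raise) when both `timesteps` and `decisions` are given are assumptions for illustration, not settled behaviour.

```python
import warnings


def resolve_decision_timesteps(available, timesteps=None, decisions=None,
                               decision_timesteps=None):
    """Pick the (decision, timestep) pairs to read from those available."""
    if decision_timesteps is not None:
        # Explicit pairs cannot be combined with the implicit filters.
        if timesteps is not None or decisions is not None:
            raise ValueError("Pass decision_timesteps on its own")
        missing = [pair for pair in decision_timesteps if pair not in available]
        if missing:
            raise KeyError(f"No results for {missing}")
        return list(decision_timesteps)

    if timesteps is not None and decisions is not None:
        # Assumed policy: return whichever pairs exist, but tell the user.
        warnings.warn("Both timesteps and decisions given; "
                      "returning only the pairs that are available")

    return [
        (decision, timestep) for decision, timestep in available
        if (timesteps is None or timestep in timesteps)
        and (decisions is None or decision in decisions)
    ]


# Hypothetical set of pairs present in a model run.
available = [(1, 2010), (2, 2010), (2, 2015), (3, 2010), (3, 2015), (3, 2020)]

print(resolve_decision_timesteps(available, timesteps=[2010]))
print(resolve_decision_timesteps(available, decisions=[1, 2]))
print(resolve_decision_timesteps(available, decision_timesteps=[(1, 2010), (3, 2020)]))
```

With no filters at all, every available pair is returned, which matches the "implicitly for all decision/timesteps available" case.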

read a single output for a single model run for a single decision iteration/timestep
read a single output for a single model run for multiple decision iteration/timesteps
read a single output for multiple model runs for multiple decision iteration/timesteps

Later, maybe:

read multiple outputs, if they have the same spec dimensions
read multiple outputs, if they have different spec dimensions (fill out coordinates with unknown/NaN where there's a mismatch)
aggregate-on-read along specified dimensions

tomalrussell · 2019-04-15T15:51:23Z

Also, I don't think that output names need to be unique across all models. We may need to specify the (model_name, output_name) to read a result.
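
A small sketch of why the pair matters, using a plain dict. The `water_supply` model and the values are hypothetical, as is the `read_output` helper.

```python
# Two models that both define an output called 'cost'.
results_by_key = {
    ("energy_demand", "cost"): [1.0, 2.0],
    ("water_supply", "cost"): [5.0, 6.0],
}


def read_output(results, model_name, output_name):
    """Look up a result by the (model_name, output_name) pair."""
    return results[(model_name, output_name)]


print(read_output(results_by_key, "energy_demand", "cost"))
print(read_output(results_by_key, "water_supply", "cost"))
```

Keying on `output_name` alone would make the two `cost` outputs collide.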

willu47 · 2019-04-15T16:05:40Z

[thinking out loud] Do decision iterations/timesteps differ across outputs of a model, aside from there being 'no results'?

tomalrussell · 2019-04-15T16:08:29Z

Only if there's some partial output or failure - I'd expect all the outputs to be present for each (decision, timestep) in a successful model run.

tomalrussell · 2019-04-15T16:19:32Z

Here's a straw man alternative - doesn't quite have all the information (i.e. doesn't tell you about those partial outputs), but is much more compact:

>>> results.available_results('energy_central')  
{
    'model_run': 'energy_central',
    'sos_model': 'energy',
    'model_outputs': [
        ('energy_demand', 'cost'),
        ('energy_demand', 'water_demand')
    ],
    'decision_timesteps': [
        (1, 2010),
        (2, 2010),
        (3, 2015)
    ]
}

willu47 · 2019-04-16T09:11:36Z

I think that's better - we could always raise a warning to the user about partial results but use this compact form.
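
A minimal sketch of how the nested form could be collapsed into the compact one while warning about partial results. The function name `compact_available_results` and the rule "an output is partial if it lacks any pair that another output has" are assumptions for illustration.

```python
import warnings


def compact_available_results(nested):
    """Collapse the nested description into the compact form."""
    model_outputs = []
    pairs_per_output = {}
    for model in nested["models"]:
        for output in model["outputs"]:
            key = (model["name"], output["name"])
            model_outputs.append(key)
            pairs_per_output[key] = {
                (decision, timestep)
                for decision, timesteps in output["decision_timesteps"].items()
                for timestep in timesteps
            }

    # Union of every (decision, timestep) pair seen for any output.
    all_pairs = set().union(*pairs_per_output.values())
    partial = sorted(key for key, pairs in pairs_per_output.items()
                     if pairs != all_pairs)
    if partial:
        warnings.warn(f"Partial results for: {partial}")

    return {
        "model_run": nested["model_run"],
        "sos_model": nested["sos_model"],
        "model_outputs": model_outputs,
        "decision_timesteps": sorted(all_pairs),
    }


nested = {
    "model_run": "energy_central",
    "sos_model": "energy",
    "models": [{
        "name": "energy_demand",
        "outputs": [
            {"name": "cost", "decision_timesteps": {1: [2010], 2: [2010, 2015]}},
            {"name": "water_demand", "decision_timesteps": {1: [2010], 2: [2010, 2015]}},
        ],
    }],
}

print(compact_available_results(nested))
```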

tlestang · 2019-04-17T17:13:11Z

The way we handled things is very close to Tom's suggestion, namely

store.get_result_darray(..., timesteps=[2010]) # implicitly for all decisions at each timestep
store.get_result_darray(..., decisions=[1,2]) # implicitly for all timesteps at each decision
store.get_result_darray(..., decision_timesteps=[(1, 2010), (3, 2020)]) # explicit
store.get_result_darray(...) # implicitly for all decision/timesteps available
store.get_result_darray(..., timesteps=[2010], decisions=[1, 2]) # returns data for all available (decision, timestep) pairs, along with a warning
Here ... refers to model_run_name, sector_model_name and output_name, which are just strings at the moment (output is fixed)

The Store.get_result_darray method returns a DataArray describing the output data and metadata, with one extra dimension: the (decision_iteration, timestep) pair. We initially added two dimensions to the DataArray, namely decision_iterations and timesteps, but realized that they are not independent, and that the only thing the data can be indexed with is the (decision_iteration, timestep) pair.
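
To illustrate why the two cannot be treated as independent dimensions, here is a small sketch with a made-up set of available pairs. A dense decision-by-timestep grid would contain cells for which no result exists.

```python
from itertools import product

# Hypothetical (decision_iteration, timestep) pairs present in a model run.
pairs = [(1, 2010), (2, 2010), (2, 2015), (3, 2010), (3, 2015), (3, 2020)]

decisions = sorted({decision for decision, _ in pairs})
timesteps = sorted({timestep for _, timestep in pairs})

# Treating the two as independent axes implies the full Cartesian grid.
grid = set(product(decisions, timesteps))
missing = sorted(grid - set(pairs))

print(len(grid), len(pairs))
print(missing)
```

The cells in `missing` would have to be padded in a two-dimensional layout, whereas a single axis of pairs holds exactly the results that exist.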

willu47 · 2019-04-18T09:28:06Z

Hi @tlestang - this looks like an excellent first go at the problem and provides the functionality we need.

I guess the next step is to think about how we expose this to a user. I think Tom's suggestion of a read-only wrapper around the store seems sensible. We don't want to allow users to edit the results, delete files etc. accidentally while making plots! With this in mind, it would be worth taking a look at DataHandle.

I would imagine that users may want to use an interactive Python environment for analysing results and writing scripts against the data to produce plots. So the Results.available_results() method in Tom's example will be important so that users can find out what data is present (and should build directly on the functionality from #351).
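
A minimal sketch of the read-only wrapper idea. Both classes are hypothetical stand-ins written for this example; neither `InMemoryStore` nor its method names are taken from smif.

```python
class InMemoryStore:
    """Stand-in for a real store (hypothetical, for illustration only)."""

    def __init__(self, data):
        self._data = data  # {model_run: {output_name: value}}

    def list_model_runs(self):
        return sorted(self._data)

    def read_results(self, model_run, output_name):
        return self._data[model_run][output_name]

    def delete_results(self, model_run):
        del self._data[model_run]


class ReadOnlyResults:
    """Expose only the read operations of the wrapped store."""

    def __init__(self, store):
        self._store = store

    def list_model_runs(self):
        return self._store.list_model_runs()

    def read(self, model_run, output_name):
        return self._store.read_results(model_run, output_name)


store = InMemoryStore({"energy_central": {"water_demand": [3.0, 4.0]}})
results = ReadOnlyResults(store)

print(results.list_model_runs())
print(results.read("energy_central", "water_demand"))
print(hasattr(results, "delete_results"))
```

Because the wrapper forwards only the read methods, a user exploring results in an interactive session has no public method that deletes or edits data.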

tlestang · 2019-04-23T16:37:42Z

I've been trying to modify our get_results_darray method in Store so that it can yield the results for multiple outputs in a single DataArray object. The only way I could think of is adding an extra dimension, output_label, that runs from 0 to number_of_queried_outputs-1. However, this DataArray no longer contains the names of the outputs, just their labels.
Can you think of another way to embed more than one output in a single DataArray?
A simpler solution would be to yield a data structure containing several DataArray objects, one for each queried output.

tomalrussell · 2019-04-23T17:07:28Z

Hi @tlestang - good question.

I think we can keep it simple to start: get_results_darray limited to a single output. The other functionality is already useful, and I think (if I read correctly) you're right that the output_label approach is fairly awkward - it would lose metadata, and would rely on dtype being uniform across outputs.

On the other hand, note that xarray has been a big influence on the design of DataArray - they have the concept of a Dataset, which might be worth looking at as a design for a data structure to contain multiple DataArrays - or it might be too 'heavy' a solution or slightly mismatched to our needs for now.

tlestang · 2019-04-24T07:22:16Z

That's great, I always wanted to have a look at xarray! I guess now is the time.
In the meantime, I will finish the multiple-output get_result_darray, so that we have something that works for now and can interface with the Results interface that Fergus wrote yesterday.

willu47 · 2019-04-29T08:18:52Z

Comments regarding 4b886ed

I think modelrun should be a dimension in the Spec (and thus in the index of the dataframe) rather than a key in a dict of model runs
I would prefer to have access to both timestep and decision iteration in separate columns in the dataframe, rather than bundled together. For example, I may want to pick the rows that have the max iteration for each timestep
Ideally, timestep should appear on the outside of the dataframe - it being the slowest moving index, followed by decision iteration, then the dimensions as listed in the Spec of the output
For later - when we are dealing with multiple outputs, it might be nice to have units exposed as another column in the dataframe
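
To make the "max iteration for each timestep" use case concrete, here is a small pandas sketch. The column names and values are invented for illustration; it only shows that separate timestep and decision columns make this selection a one-liner.

```python
import pandas as pd

# Hypothetical results with timestep and decision in separate columns.
df = pd.DataFrame({
    "timestep": [2010, 2010, 2015, 2015, 2015, 2020],
    "decision": [1, 2, 1, 2, 3, 1],
    "water_demand": [3.0, 2.5, 4.0, 3.8, 3.6, 5.0],
})

# For each timestep, keep the row with the highest decision iteration.
last_iteration = df.loc[df.groupby("timestep")["decision"].idxmax()]
last_iteration = last_iteration.reset_index(drop=True)

print(last_iteration)
```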

fcooper8472 · 2019-04-29T14:33:40Z

I'll summarise the discussion from slack.

The main issue with having model_run as a dimension in the Spec is that there may well be different (timestep, decision)-pairs for different model runs. This means that there is no neat way to encode that information as a dimension: you would have to pad the output as necessary so that every (timestep, decision)-pair appears for each model run.

Instead, we can easily present the data in the form of a dataframe with a column for model_run. To do this, the Store.get_results() method can return whichever is the most convenient data structure, and formatting that into an appropriate dataframe format can be delegated to the Results.read() method.
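
A rough sketch of that delegation, assuming the store hands back a dict of per-model-run results keyed by (timestep, decision). The function name `to_dataframe`, the input shape and the values are assumptions for illustration, not the actual Store or Results code.

```python
import pandas as pd


def to_dataframe(results_by_run, output_name):
    """Flatten {model_run: {(timestep, decision): value}} into one dataframe."""
    rows = []
    for model_run, values in results_by_run.items():
        for (timestep, decision), value in values.items():
            rows.append({
                "model_run": model_run,
                "timestep": timestep,
                "decision": decision,
                output_name: value,
            })
    columns = ["model_run", "timestep", "decision", output_name]
    return pd.DataFrame(rows, columns=columns)


# The two model runs have different (timestep, decision) pairs; no padding needed.
results_by_run = {
    "energy_central": {(2010, 1): 3.0, (2015, 2): 4.0},
    "energy_water_cp_cr": {(2010, 1): 2.0},
}

frame = to_dataframe(results_by_run, "water_demand")
print(frame)
```

Since model_run is an ordinary column, each run contributes only the rows it actually has.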

This change is made in ac88b36.

fcooper8472 · 2019-04-29T14:48:36Z

In terms of future work:

Ideally, timestep should appear on the outside of the dataframe - it being the slowest moving index, followed by decision iteration, then the dimensions as listed in the Spec of the output

Not quite sure what you mean here - do you mean literally re-ordering the columns, so that you have ['model_run', 'timestep', 'decision', <whatever dims in spec>, 'output']? If yes, then that should be straightforward.

For later - when we are dealing with multiple outputs, it might be nice to have units exposed as another column in the dataframe

This would be straightforward now, too - handling multiple outputs (presuming the specs are the same for each) would essentially add an additional column to the resulting dataframe. Units (presumably) could be added as an additional column per output.

willu47 · 2019-04-30T14:40:09Z

In terms of future work:

Ideally, timestep should appear on the outside of the dataframe - it being the slowest moving index, followed by decision iteration, then the dimensions as listed in the Spec of the output

Not quite sure what you mean here - do you mean literally re-ordering the columns, so that you have ['model_run', 'timestep', 'decision', <whatever dims in spec>, 'output']? If yes, then that should be straightforward.

For later - when we are dealing with multiple outputs, it might be nice to have units exposed as another column in the dataframe

This would be straightforward now, too - handling multiple outputs (presuming the specs are the same for each) would essentially add an additional column to the resulting dataframe. Units (presumably) could be added as an additional column per output.

Yes, exactly that
Might get a bit messy with multiple outputs, but would be useful for now. The only issue is that units will be the same for all rows in an output, so adding a column seems a bit of a waste of space. An alternative could be a helper method on the Results class which returns the units for an output?
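
A minimal sketch of such a helper, storing each unit once per output instead of once per row. The class name `ResultsUnits`, its constructor argument and the unit strings are hypothetical.

```python
class ResultsUnits:
    """Hold one unit string per output name (illustrative only)."""

    def __init__(self, units_by_output):
        self._units = dict(units_by_output)

    def get_units(self, output_name):
        try:
            return self._units[output_name]
        except KeyError:
            raise KeyError(f"Unknown output: {output_name!r}") from None


# Hypothetical units for the two example outputs.
units = ResultsUnits({"cost": "GBP", "water_demand": "Ml/day"})

print(units.get_units("water_demand"))
```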

fcooper8472 · 2019-05-01T16:12:14Z

@willu47 a quick update:

Columns are now ordered as you suggest
Multiple outputs are now retrievable in a single call to Results.read() (provided the spec coords match)
Results.get_units('name_of_output') now gives you the unit, which is also added to the column headers of each output for ease of reference

Could you give it another try and see if it's working for you?

willu47 · 2019-05-02T15:20:30Z

Hi @fcooper8472 - many thanks. Almost there!

The column ordering is great, and makes the outputs immediately intelligible.
Having the units in the column name is useful, but is a barrier to programmatic plotting of data, so I would suggest removing it for now.
The Results.get_units() method does all we need I think (very handily too)!

fcooper8472 · 2019-05-02T15:41:03Z

Thanks for the feedback - the latest commit on the PR removes the units in column names.

fcooper8472 · 2019-05-07T13:28:22Z

Closed by #367

fcooper8472 mentioned this issue Apr 15, 2019

Parent issue for results API #350

Closed

willu47 assigned fcooper8472 and tlestang Apr 16, 2019

willu47 added the enhancement label Apr 16, 2019

willu47 added this to the Results milestone Apr 16, 2019

willu47 mentioned this issue Apr 16, 2019

I351 list results #358

Merged

fcooper8472 added a commit that referenced this issue Apr 17, 2019

#359 First attempt at data access method

53477fb

fcooper8472 added a commit that referenced this issue Apr 23, 2019

#359 Work towards read-only Results interface

d068681

fcooper8472 added a commit that referenced this issue Apr 26, 2019

#359 Add read() method to Results and add interface tests

4ee685a

fcooper8472 added a commit that referenced this issue Apr 26, 2019

#359 Improve testing of Results()

c29c97a

fcooper8472 added a commit that referenced this issue Apr 29, 2019

#359 Return a dataframe with cols for model run, timestep and decision

ac88b36

fcooper8472 added a commit that referenced this issue Apr 29, 2019

#359 Tidying

0ee60a1

fcooper8472 added a commit that referenced this issue Apr 29, 2019

#359 Add test stub and todo for testing

079152d

fcooper8472 added a commit that referenced this issue May 1, 2019

#359 Add functionality to keep tabs on units

f71b570

fcooper8472 added a commit that referenced this issue May 1, 2019

#359 Reorder columns model_run -> timestep -> decision

844fb8f

tlestang pushed a commit that referenced this issue May 1, 2019

Modify store.get_results to return multiple outputs and check for ava…

92b8a13

…ilability of quieried output as well as dimensionality Issue #359

fcooper8472 added a commit that referenced this issue May 1, 2019

#359 Update wrt multiple outputs on store class

655785c

fcooper8472 added a commit that referenced this issue May 1, 2019

#359 Update Results.read() validation

bf6ceba

fcooper8472 added a commit that referenced this issue May 1, 2019

#359 Tidy Store.get_results()

a14f35b

fcooper8472 added a commit that referenced this issue May 2, 2019

#359 Remove units from column names

7ef8389

willu47 mentioned this issue May 3, 2019

Release smif v1.1 #369

Merged

fcooper8472 added a commit that referenced this issue May 3, 2019

#359 Change dict to OrderedDict to ensure ordered Pandas dataframe

0fe05f8

fcooper8472 added a commit that referenced this issue May 3, 2019

#359 Differentiate between Results instance with or without actual re…

3c8bfbf

…sults

fcooper8472 added a commit that referenced this issue May 3, 2019

#359 Add coverage for multiple model runs

db7cf82

fcooper8472 added a commit that referenced this issue May 7, 2019

#359 Change to OrderedDict for reproducibility between 3.5 and 3.6

b94bce9

fcooper8472 closed this as completed May 7, 2019
