Programmatically query the available model results #359

Closed
willu47 opened this issue Apr 15, 2019 · 19 comments

@willu47
Member

willu47 commented Apr 15, 2019

Child issue of #350

Query the available model results across various levels in the hierarchy of:

  • modelrun
  • timestep <- these are defined in a model run
  • decision_iteration <- iterations exist or not depending on the decision module and may change from run to run. Note also that the number of iterations per timestep may change.
  • model_name <- this is a model within a system of systems
  • output_name <- these are defined in the model_name configuration
  • dimensions within the output_name <- these are defined in an output's Spec, also in config

Users should be able to fix one or more of the above levels and receive a multi-dimensional array of data that represents the unfixed data.

Suggest the following interface:

>>> from smif import Results
>>> results = Results(path_to_project)
>>> results.list()
['energy_central', 'energy_water_cp_cr']
>>> results.available_results('energy_central')
{model run: energy_central
  - sos model: energy  # <- note that there is only ever one Sos model in a model run, so don't need this nesting
    - sector model: energy_demand
      - output: cost
        - decision 1: 2010
        - decision 2: 2010, 2015
        - decision 3: 2010, 2015, 2020
        - decision 4: 2010, 2020
      - output: water_demand
        - decision 1: 2010
        - decision 2: 2010, 2015
        - decision 3: 2010, 2015, 2020
        - decision 4: 2010, 2020
}
>>> results.as_df(modelrun='energy_central', timestep=[2010, 2015, 2020], output_name=['water_demand'])
<returns correctly formatted pandas.DataFrame>
@tomalrussell
Member

>>> from smif.data_layer import Results

# pass arguments sufficient to create a Store behind the scenes, which we'll have read-only access to
>>> results = Results({'interface': 'local_csv', 'directory': path_to_project})  

>>> results.list_model_runs()
['energy_central', 'energy_water_cp_cr']

# writing out this example for concreteness - there may be a more convenient data structure shape...
>>> results.available_results('energy_central')  
{
    'model_run': 'energy_central',
    'sos_model': 'energy',
    'models': [
        {
            'name': 'energy_demand',
            'outputs': [
                {
                    'name': 'cost',
                    'decision_timesteps': {
                        1: [2010],
                        2: [2010, 2015]
                    },
                },
                {
                    'name': 'water_demand',
                    'decision_timesteps': {
                        1: [2010],
                        2: [2010, 2015]
                    }
                }
            ]
        }
    ]
}
>>> da = results.read(modelruns=['energy_central'], timesteps=[2010, 2015, 2020], output_names=['water_demand'])
>>> da.as_df()
<returns correctly formatted pandas.DataFrame, including columns for timestep,modelrun,decision>
  1. decide how to handle the decision/timestep interaction
    results.read(..., decision_timesteps=[(1, 2010), (3, 2020)])  # explicit
    results.read(..., timesteps=[2010])  # implicitly for all decisions at each timestep
    results.read(..., decisions=[1,2])  # implicitly for all timesteps at each decision 
    results.read(outputs=['water_demand'])  # implicitly for all decision/timesteps available
    results.read(..., timesteps=[2010], decisions=[1, 2])  # ??? error? something reasonable?
  2. read a single output for a single model run for a single decision iteration/timestep
  3. read a single output for a single model run for multiple decision iteration/timesteps
  4. read a single output for multiple model runs for multiple decision iteration/timesteps
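One way the decision/timestep options in step 1 could be resolved is sketched below. This is only an illustration: `select_pairs` and its argument names are hypothetical, and raising an error when both `timesteps` and `decisions` are given is just one possible answer to the open question, not a decision made in this thread.

```python
from typing import Iterable, List, Optional, Tuple

Pair = Tuple[int, int]  # (decision_iteration, timestep)


def select_pairs(
    available: Iterable[Pair],
    timesteps: Optional[Iterable[int]] = None,
    decisions: Optional[Iterable[int]] = None,
    decision_timesteps: Optional[Iterable[Pair]] = None,
) -> List[Pair]:
    """Hypothetical helper: turn the query arguments into concrete pairs."""
    known = sorted(set(available))
    if decision_timesteps is not None:
        # Explicit pairs: refuse to mix with the implicit filters.
        if timesteps is not None or decisions is not None:
            raise ValueError("decision_timesteps cannot be combined with other filters")
        wanted = set(decision_timesteps)
        missing = wanted - set(known)
        if missing:
            raise KeyError(f"no results for {sorted(missing)}")
        return sorted(wanted)
    if timesteps is not None and decisions is not None:
        # The ambiguous case; this sketch chooses to refuse (an assumption).
        raise ValueError("pass either timesteps or decisions, not both")
    if timesteps is not None:
        keep = set(timesteps)
        return [(d, t) for d, t in known if t in keep]
    if decisions is not None:
        keep = set(decisions)
        return [(d, t) for d, t in known if d in keep]
    # No filter: everything that is available.
    return known


available = [(1, 2010), (2, 2010), (2, 2015), (3, 2010), (3, 2015), (3, 2020)]
print(select_pairs(available, timesteps=[2010]))
print(select_pairs(available, decisions=[1, 2]))
```

Keeping the resolution step separate from the read itself would make it easy to swap the error for a warning later.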

Later, maybe:

  1. read multiple outputs, if they have the same spec dimensions
  2. read multiple outputs, if they have different spec dimensions (fill out coordinates with unknown/NaN where there's a mismatch)
  3. aggregate-on-read along specified dimensions

@tomalrussell
Member

Also, I don't think that output names need to be unique across all models. We may need to specify the (model_name, output_name) to read a result.
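To illustrate why a bare output name may not be enough, here is a small sketch. The configuration is made up (`water_supply` is a hypothetical second model) and `models_by_output` is not an existing smif function.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical configuration: model name -> names of its outputs
model_outputs: Dict[str, List[str]] = {
    "energy_demand": ["cost", "water_demand"],
    "water_supply": ["cost"],
}


def models_by_output(config: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each output name to every model that declares it."""
    owners = defaultdict(list)
    for model_name, output_names in config.items():
        for output_name in output_names:
            owners[output_name].append(model_name)
    return dict(owners)


owners = models_by_output(model_outputs)
# Output names declared by more than one model cannot identify a result alone.
ambiguous = sorted(name for name, models in owners.items() if len(models) > 1)
# The (model_name, output_name) tuple is always unique.
keys: List[Tuple[str, str]] = [
    (model_name, output_name)
    for model_name, output_names in model_outputs.items()
    for output_name in output_names
]
print(ambiguous)
print(keys)
```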

@willu47
Member Author

willu47 commented Apr 15, 2019

[thinking out loud] Do decision iterations/timesteps differ across outputs of a model, aside from there being 'no results'?

@tomalrussell
Member

Only if there's some partial output or failure - I'd expect all the outputs to be present for each (decision, timestep) in a successful model run.

@tomalrussell
Member

Here's a straw man alternative - doesn't quite have all the information (i.e. doesn't tell you about those partial outputs), but is much more compact:

>>> results.available_results('energy_central')  
{
    'model_run': 'energy_central',
    'sos_model': 'energy',
    'model_outputs': [
        ('energy_demand', 'cost'),
        ('energy_demand', 'water_demand')
    ],
    'decision_timesteps': [
        (1, 2010),
        (2, 2010),
        (3, 2015)
    ]
}

@willu47
Member Author

willu47 commented Apr 16, 2019

I think that's better - we could always raise a warning to the user about partial results but use this compact form.
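A rough sketch of how the compact form plus a warning might fit together. It assumes the available results can be listed as `(model_name, output_name, decision, timestep)` records; that record shape and the function name `compact_available_results` are assumptions for illustration only.

```python
import warnings
from typing import Dict, Iterable, Set, Tuple

# Assumed record shape: (model_name, output_name, decision, timestep)
Record = Tuple[str, str, int, int]


def compact_available_results(
    model_run: str, sos_model: str, records: Iterable[Record]
) -> dict:
    """Build the compact summary and warn about outputs with missing pairs."""
    pairs_by_output: Dict[Tuple[str, str], Set[Tuple[int, int]]] = {}
    for model_name, output_name, decision, timestep in records:
        key = (model_name, output_name)
        pairs_by_output.setdefault(key, set()).add((decision, timestep))

    # Union of every (decision, timestep) pair seen for any output.
    all_pairs: Set[Tuple[int, int]] = set()
    for pairs in pairs_by_output.values():
        all_pairs |= pairs

    for key, pairs in sorted(pairs_by_output.items()):
        missing = all_pairs - pairs
        if missing:
            warnings.warn(f"partial results for {key}: missing {sorted(missing)}")

    return {
        "model_run": model_run,
        "sos_model": sos_model,
        "model_outputs": sorted(pairs_by_output),
        "decision_timesteps": sorted(all_pairs),
    }


records = [
    ("energy_demand", "cost", 1, 2010),
    ("energy_demand", "cost", 2, 2010),
    ("energy_demand", "water_demand", 1, 2010),  # (2, 2010) is missing here
]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    summary = compact_available_results("energy_central", "energy", records)
print(summary)
```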

@tlestang
Contributor

The way we handled things is very close to Tom's suggestion, namely

  1. store.get_result_darray(..., timesteps=[2010]) # implicitly for all decisions at each timestep
  2. store.get_result_darray(..., decisions=[1,2]) # implicitly for all timesteps at each decision
  3. store.get_result_darray(..., decision_timesteps=[(1, 2010), (3, 2020)]) # explicit
  4. store.get_result_darray(...) # implicitly for all decision/timesteps available
  5. store.get_result_darray(..., timesteps=[2010], decisions=[1, 2]) # returns data for all (timestep, decision) pairs available, along with a warning
    Here ... stands for model_run_name, sector_model_name and output_name, which are just strings at the moment (output is fixed)

Store.get_result_darray returns a DataArray describing the output data and metadata, with one extra dimension: the (decision_iteration, timestep) pair. We initially added two dimensions to the DataArray, namely decision_iterations and timesteps, but realized that they are not independent, and that the only thing the data can be indexed with is the (decision_iteration, timestep) pair.
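A small illustration of why the two dimensions are not independent, using the decision/timestep combinations from the example at the top of the issue. The variable names and placeholder values are made up; this is not the DataArray implementation.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]  # (decision_iteration, timestep)

# Pairs taken from the example in the opening comment
pairs: List[Pair] = [
    (1, 2010),
    (2, 2010), (2, 2015),
    (3, 2010), (3, 2015), (3, 2020),
    (4, 2010), (4, 2020),
]
values: List[float] = [float(i) for i in range(len(pairs))]  # placeholder data

# Treating decisions and timesteps as two separate axes implies a full grid...
decisions = sorted({d for d, _ in pairs})
timesteps = sorted({t for _, t in pairs})
dense_cells = [(d, t) for d in decisions for t in timesteps]
# ...in which some cells have no result at all.
empty_cells = [cell for cell in dense_cells if cell not in set(pairs)]

# Indexing by the pair itself needs exactly one slot per available result.
position: Dict[Pair, int] = {pair: i for i, pair in enumerate(pairs)}


def value_at(decision: int, timestep: int) -> float:
    """Look up a value by its (decision, timestep) pair."""
    return values[position[(decision, timestep)]]


print(len(pairs), len(dense_cells), len(empty_cells))
```

With a single pair dimension there is nothing to pad, whereas the two-axis layout would need a fill value for every empty cell.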

@willu47
Member Author

willu47 commented Apr 18, 2019

Hi @tlestang - this looks like an excellent first go at the problem and provides the functionality we need.

I guess the next step is to think about how we expose this to a user? I think Tom's suggestion for a read-only wrapper around the store seems sensible. We don't want to allow users to edit the results, delete files etc. accidentally while making plots! With this in mind, it would be worth taking a look at DataHandle.
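As a minimal sketch of the read-only idea: `InMemoryStore` below is a hypothetical stand-in for the real Store, and all of its method names are assumptions made for the example. The wrapper simply does not forward any write or delete operation, and hands back copies so that callers cannot modify stored data in place.

```python
import copy
from typing import Any, Dict, Hashable, List


class InMemoryStore:
    """Hypothetical stand-in for the real Store (method names are assumptions)."""

    def __init__(self) -> None:
        self._results: Dict[Hashable, Any] = {}

    def write_results(self, key: Hashable, data: Any) -> None:
        self._results[key] = data

    def read_results(self, key: Hashable) -> Any:
        return self._results[key]

    def list_results(self) -> List[Hashable]:
        return sorted(self._results)

    def delete_results(self, key: Hashable) -> None:
        del self._results[key]


class ReadOnlyResults:
    """Expose only listing and reading; writes and deletes are not forwarded."""

    def __init__(self, store: InMemoryStore) -> None:
        self._store = store

    def list(self) -> List[Hashable]:
        return self._store.list_results()

    def read(self, key: Hashable) -> Any:
        # Return a copy so edits by the caller never reach the store.
        return copy.deepcopy(self._store.read_results(key))


store = InMemoryStore()
key = ("energy_central", "water_demand")
store.write_results(key, [1.0, 2.0])

results = ReadOnlyResults(store)
data = results.read(key)
data.append(99.0)  # only the caller's copy changes
print(store.read_results(key))
```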

I would imagine that users may want to use an interactive Python environment for analysing results and writing scripts against the data to produce plots. So the Results.available_results() method in Tom's example will be important so that users can find out what data is present (and should build directly on the functionality from #351).

@tlestang
Contributor

I've been trying to modify our get_results_darray method in Store so that it can yield the results for multiple outputs in a single DataArray object. The only way I could think of is adding an additional dimension, output_label, that runs from 0 to number_of_queried_outputs-1. However, this DataArray no longer contains the names of the outputs, just their labels.
Can you think of another way to embed more than one output in a single DataArray?
A simpler solution would be to yield a data structure containing several DataArray objects, one for each queried output.
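For the simpler solution, a sketch of what such a container might look like. `OutputArray` is a hypothetical stand-in for DataArray that keeps only a name, dimensions, a unit and the values, and the units shown are placeholders.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class OutputArray:
    """Hypothetical stand-in for a DataArray describing one output."""

    name: str
    dims: Tuple[str, ...]
    unit: str
    values: List


def collect_outputs(arrays: Iterable[OutputArray]) -> Dict[str, OutputArray]:
    """Keep one array per output, keyed by name, so no metadata is lost."""
    collection: Dict[str, OutputArray] = {}
    for array in arrays:
        if array.name in collection:
            raise ValueError(f"duplicate output name: {array.name!r}")
        collection[array.name] = array
    return collection


outputs = collect_outputs(
    [
        OutputArray("cost", ("region",), "GBP", [10, 20]),  # integer values
        OutputArray("water_demand", ("region",), "Ml/day", [1.5, 2.5]),  # floats
    ]
)
print(sorted(outputs))
print(outputs["water_demand"].unit)
```

Unlike an integer output_label axis, each entry keeps its own name and unit, and the values do not need a common dtype.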

@tomalrussell
Member

Hi @tlestang - good question.

I think we can keep it simple to start: get_results_darray limited to a single output. The other functionality is already useful, and I think (if I read correctly) you're right that the output_label approach is fairly awkward - it would lose metadata, and would rely on dtype being uniform across outputs.

On the other hand, note that xarray has been a big influence on the design of DataArray - they have the concept of a Dataset, which might be worth looking at as a design for a data structure to contain multiple DataArrays - or might be too 'heavy' a solution or slightly mismatched to our needs for now.

@tlestang
Contributor

That's great, I always wanted to have a look at xarray! I guess now is the time.
In the meantime, I will finish the multiple-output get_result_darray, so that we have something that works for now and that can interface with the Results interface that Fergus wrote yesterday.

@willu47
Member Author

willu47 commented Apr 29, 2019

Comments regarding 4b886ed

  • I think modelrun should be a dimension in the Spec (and thus in the index of the dataframe) rather than a key in a dict of model runs
  • I would prefer to have access to both timestep and decision iteration in separate columns in the dataframe, rather than bundled together. For example, I may want to pick the rows that have the max iteration for each timestep
  • Ideally, timestep should appear on the outside of the dataframe - it being the slowest moving index, followed by decision iteration, then the dimensions as listed in the Spec of the output
  • For later - when we are dealing with multiple outputs, it might be nice to have units exposed as another column in the dataframe
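To make the second point concrete, here is a sketch of the kind of query that separate columns allow. The data is invented, and the column names `timestep` and `decision` are assumptions about how the dataframe might be laid out.

```python
import pandas as pd

# Invented example data with timestep and decision in separate columns
df = pd.DataFrame(
    {
        "timestep": [2010, 2010, 2010, 2015, 2015, 2020],
        "decision": [1, 2, 3, 2, 3, 3],
        "water_demand": [1.0, 1.1, 1.2, 2.0, 2.1, 3.0],
    }
)

# For every row, the highest decision iteration seen for that row's timestep
latest = df.groupby("timestep")["decision"].transform("max")

# Keep only the rows that belong to the last iteration of each timestep
final = df[df["decision"] == latest].reset_index(drop=True)
print(final)
```

Using `transform` rather than picking a single row per group keeps every row of the final iteration, which matters once the output has further dimensions such as region.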

@fcooper8472
Contributor

I'll summarise the discussion from slack.

The main issue with having model_run as a dimension in the Spec is that there may well be different (timestep, decision)-pairs for different model runs. This means that there is no neat way to encode that information as a dimension: you would have to pad the output as necessary so that every (timestep, decision)-pair appears for each model run.

Instead, we can easily present the data in the form of a dataframe with a column for model_run. To do this, the Store.get_results() method can return whichever is the most convenient data structure, and formatting that into an appropriate dataframe format can be delegated to the Results.read() method.

This change is made in ac88b36.
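A sketch of the long-format idea described above, with invented data. It is not the code from the commit; the per-run frames and column names are assumptions used only to show that runs with different (timestep, decision) pairs stack without any padding.

```python
import pandas as pd

# Invented per-run results; the two runs have different (timestep, decision) pairs
per_run = {
    "energy_central": pd.DataFrame(
        {"timestep": [2010, 2015], "decision": [1, 2], "water_demand": [1.0, 2.0]}
    ),
    "energy_water_cp_cr": pd.DataFrame(
        {"timestep": [2010], "decision": [1], "water_demand": [1.5]}
    ),
}

# Tag each frame with its model run, then stack the rows
frames = [frame.assign(model_run=name) for name, frame in per_run.items()]
combined = pd.concat(frames, ignore_index=True)

# Put the identifying columns first, the output last
combined = combined[["model_run", "timestep", "decision", "water_demand"]]
print(combined)
```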

@fcooper8472
Contributor

In terms of future work:

  • Ideally, timestep should appear on the outside of the dataframe - it being the slowest moving index, followed by decision iteration, then the dimensions as listed in the Spec of the output

Not quite sure what you mean here - do you mean literally re-ordering the columns, so that you have ['model_run', 'timestep', 'decision', <whatever dims in spec>, 'output']? If yes, then that should be straightforward.

  • For later - when we are dealing with multiple outputs, it might be nice to have units exposed as another column in the dataframe

This would be straightforward now, too - handling multiple outputs (presuming the specs are the same for each) would essentially add an additional column to the resulting dataframe. Units (presumably) could be added as an additional column per output.

fcooper8472 added a commit that referenced this issue Apr 29, 2019
@willu47
Member Author

willu47 commented Apr 30, 2019

  > In terms of future work:
  >
  > Ideally, timestep should appear on the outside of the dataframe - it being the slowest moving index, followed by decision iteration, then the dimensions as listed in the Spec of the output
  >
  > 1. Not quite sure what you mean here - do you mean literally re-ordering the columns, so that you have ['model_run', 'timestep', 'decision', <whatever dims in spec>, 'output']? If yes, then that should be straightforward.
  >
  > For later - when we are dealing with multiple outputs, it might be nice to have units exposed as another column in the dataframe
  >
  > 2. This would be straightforward now, too - handling multiple outputs (presuming the specs are the same for each) would essentially add an additional column to the resulting dataframe. Units (presumably) could be added as an additional column per output.

  1. Yes, exactly that
  2. Might get a bit messy with multiple outputs, but would be useful for now. The only issue is that units will be the same for all rows in an output, so adding a column seems a bit of a waste of space. An alternative could be a helper method on the Results class which returns the units for an output?
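A sketch of what such a helper could look like. `UnitsLookup` and `axis_label` are hypothetical names, and the units are placeholders rather than values from a real configuration.

```python
from typing import Dict


class UnitsLookup:
    """Hypothetical helper: look up the unit of an output by name."""

    def __init__(self, output_units: Dict[str, str]) -> None:
        # Copy the mapping so later changes by the caller have no effect here
        self._output_units = dict(output_units)

    def get_units(self, output_name: str) -> str:
        try:
            return self._output_units[output_name]
        except KeyError:
            raise KeyError(f"unknown output: {output_name!r}") from None


def axis_label(lookup: UnitsLookup, output_name: str) -> str:
    """Build a plot label without storing the unit in the dataframe."""
    return f"{output_name} ({lookup.get_units(output_name)})"


# Placeholder units for illustration only
lookup = UnitsLookup({"water_demand": "Ml/day", "cost": "GBP"})
label = axis_label(lookup, "water_demand")
print(label)
```

The dataframe keeps plain column names for programmatic access, and the unit is fetched only where a label is needed.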

tlestang pushed a commit that referenced this issue May 1, 2019
…ilability of queried output as well as dimensionality

Issue #359
fcooper8472 added a commit that referenced this issue May 1, 2019
@fcooper8472
Contributor

@willu47 a quick update:

  1. Columns are now ordered as you suggest
  2. Multiple outputs are now retrievable in a single call to Results.read() (provided the spec coords match)
  3. Results.get_units('name_of_output') now gives you the unit, which is also added to the column headers of each output for ease of reference

Could you give it another try and see if it's working for you?

@willu47
Member Author

willu47 commented May 2, 2019

Hi @fcooper8472 - many thanks. Almost there!

  • The column ordering is great, and makes the outputs immediately intelligible.
  • Having the units in the column name is useful, but is a barrier to programmatic plotting of data, so I would suggest removing it for now.
  • The Results.get_units() method does all we need I think (very handily too)!

fcooper8472 added a commit that referenced this issue May 2, 2019
@fcooper8472
Contributor

Thanks for the feedback - the latest commit on the PR removes the units in column names.

@fcooper8472
Contributor

Closed by #367
