Implement a custom allocator #694

K20shores · 2024-12-16T21:20:46Z

Much of the memory in the state is used in the hot path to solve chemistry. In our performance benchmarks, we see a large amount of variance compared to some fortran code which solve the same systems with the same numerical routines. The large variance could be due to locality. The fortran data is all stack data allocated in data segment of the program, while the c++ variables are allocated on the heap which could be from each other.

micm/include/micm/solver/state.hpp

Lines 46 to 63 in cdfd2e7

    
           DenseMatrixPolicy variables_; 
        
           /// @brief Rate paramters particular to user-defined rate constants, may vary in time 
        
           DenseMatrixPolicy custom_rate_parameters_; 
        
           /// @brief The reaction rates, may vary in time 
        
           DenseMatrixPolicy rate_constants_; 
        
           /// @brief Atmospheric conditions, varies in time 
        
           std::vector<Conditions> conditions_; 
        
           /// @brief The jacobian structure, varies for each solve 
        
           SparseMatrixPolicy jacobian_; 
        
           /// @brief Immutable data required for the state 
        
           std::map<std::string, std::size_t> variable_map_; 
        
           std::map<std::string, std::size_t> custom_rate_parameter_map_; 
        
           std::vector<std::string> variable_names_{}; 
        
           SparseMatrixPolicy lower_matrix_; 
        
           SparseMatrixPolicy upper_matrix_; 
        
           std::size_t state_size_; 
        
           std::size_t number_of_grid_cells_; 
        
           std::unique_ptr<TemporaryVariables> temporary_variables_;

Also, our LU decomposition routines keep a list of ids used to index into various arrays at runtime. All of these are heap-allocated and may not be close to each other.

micm/include/micm/solver/lu_decomposition_doolittle.hpp

Lines 47 to 78 in cdfd2e7

    
           std::vector<std::pair<std::size_t, std::size_t>> niLU_; 
        
           /// True when A[i][k] is non-zero for each iteration of the middle (k) loop for the upper 
        
           /// triangular matrix; False otherwise. Used data type char instead of bool because vector<bool> representation 
        
           /// does not support easy retrieval of memory address using data() function. 
        
           std::vector<char> do_aik_; 
        
           /// Index in A.data_ for A[i][k] for each iteration of the middle (k) loop for the upper 
        
           /// triangular matrix when A[i][k] is non-zero 
        
           std::vector<std::size_t> aik_; 
        
           /// Index in U.data_ for U[i][k] for each iteration of the middle (k) loop for the upper 
        
           /// triangular matrix when U[i][k] is non-zero, and the corresponding number of elements 
        
           /// in the inner (j) loop 
        
           std::vector<std::pair<std::size_t, std::size_t>> uik_nkj_; 
        
           /// Index in L.data_ for L[i][j], and in U.data_ for U[j][k] in the upper inner (j) loop 
        
           /// when L[i][j] and U[j][k] are both non-zero. 
        
           std::vector<std::pair<std::size_t, std::size_t>> lij_ujk_; 
        
           /// True when A[k][i] is non-zero for each iteration of the middle (k) loop for the lower 
        
           /// triangular matrix; False otherwise. Used data type char instead of bool because vector<bool> representation 
        
           /// does not suppor easy retrieval of memory address using data() function. 
        
           std::vector<char> do_aki_; 
        
           /// Index in A.data_ for A[k][i] for each iteration of the middle (k) loop for the lower 
        
           /// triangular matrix when A[k][i] is non-zero. 
        
           std::vector<std::size_t> aki_; 
        
           /// Index in L.data_ for L[k][i] for each iteration of the middle (k) loop for the lower 
        
           /// triangular matrix when L[k][i] is non-zero, and the corresponding number of elements 
        
           /// in the inner (j) loop 
        
           std::vector<std::pair<std::size_t, std::size_t>> lki_nkj_; 
        
           /// Index in L.data_ for L[k][j], and in U.data_ for U[j][i] in the lower inner (j) loop 
        
           /// when L[k][j] and U[j][i] are both non-zero. 
        
           std::vector<std::pair<std::size_t, std::size_t>> lkj_uji_; 
        
           /// Index in U.data_ for U[i][i] for each interation in the middle (k) loop for the lower 
        
           /// triangular matrix when L[k][i] is non-zero 
        
           std::vector<std::size_t> uii_;

micm/include/micm/solver/lu_decomposition_mozart.hpp

Lines 59 to 84 in cdfd2e7

    
           std::vector<std::tuple<std::size_t, std::size_t, std::size_t>> lii_nuij_nlij_; 
        
           /// Index in U.data_ and A.data_ for U[i][j] and A[i][j] for each iteration of the inner (j) loop 
        
           /// used to set the initial value for the U matrix 
        
           std::vector<std::pair<std::size_t, std::size_t>> uij_aij_; 
        
           /// Index in L.data_ and A.data_ for L[i][j] and A[i][j] for each iteration of the inner (j) loop 
        
           /// used to set the initial value for the L matrix 
        
           std::vector<std::pair<std::size_t, std::size_t>> lij_aij_; 
        
           /// Index in U.data_ for each non-zero element in U that is zero in A 
        
           std::vector<std::size_t> fill_uij_; 
        
           /// Index in L.data_ for each non-zero element in L that is zero in A 
        
           std::vector<std::size_t> fill_lij_; 
        
           /// Index in U.data_ for U[i][i] and the number of elements in the middle (j) and (k) loops 
        
           /// for each iteration of the outer (i) loop 
        
           std::vector<std::tuple<std::size_t, std::size_t, std::size_t>> uii_nj_nk_; 
        
           /// Index in L.data_ for L[j][i] for each iteration of the middle (j) loop 
        
           /// for the lower triangular matrix 
        
           std::vector<std::size_t> lji_; 
        
           /// Number of elements in the inner (j) loops for each iteration of the middle (k) loop for the 
        
           /// upper and lower triangular matrices, and the index in U.data_ for U[j][k] for each iteration 
        
           std::vector<std::tuple<std::size_t, std::size_t, std::size_t>> nujk_nljk_uik_; 
        
           /// Index in U.data_ for U[j][k] and in L.data_ for L[j][i] 
        
           /// for each iteration of the inner (j) loop for the upper triangular matrix 
        
           std::vector<std::pair<std::size_t, std::size_t>> ujk_lji_; 
        
           /// Index in L.data_ for L[j][k] and in L.data_ for L[j][i] 
        
           /// for each iteration of the inner (j) loop for the lower triangular matrix 
        
           std::vector<std::pair<std::size_t, std::size_t>> ljk_lji_;

Fortunately, for any given state, the size of these data members (in both the state and LU routines) will always be the same size, and, after configuration, we should be able to calculate these sizes.

We should consider using a custom allocator which will place all of the members of state next to each other in memory and all the members of the LU routines near each other in memory.

Acceptance criteria

After configuration, the total number of bytes needed for LU and the state are able to be calculated
A custom allocator template is added which can dole memory out to the state and LU

Ideas

Use TCMalloc?
Use C++ pmr?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement a custom allocator #694

Implement a custom allocator #694

K20shores commented Dec 16, 2024

Implement a custom allocator #694

Implement a custom allocator #694

Comments

K20shores commented Dec 16, 2024