Equally Contributed by: Hanxi Xiao, Aaron Rosen. Music, like language, has inherent semantics: the composition and ordering of notes are tied to the expressiveness of a piece, including its genre, tempo, and instrumentation. Understanding and replicating these semantics is crucial for generating music that is coherent and emotionally resonant. A principal obstacle in music generation is the considerable variation in the Markovian relationships among notes, which fluctuates significantly with a piece's style, including its genre, the number of tracks (from solo instruments to full symphonies), and the historical era in which the music was produced. Transformers, with their attention mechanisms, excel at capturing dependencies and relationships in sequential data, making them well suited to learning music semantics; however, training transformers typically requires substantial computational resources. Here, we leverage a large pre-trained transformer and incorporate Markov models to generalize the pre-trained transformer to unseen music styles and genres for new music generation. Markov models can capture the transition probabilities between musical notes, enabling the generation of musically coherent sequences. We present a music generation model that leverages different neural network architectures to extract informative embeddings and employs Markov models to comprehend and replicate the semantics of music, facilitating the creation of coherent and emotionally resonant musical compositions.
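To make the Markov component concrete, the following minimal sketch (illustrative only; the vocabulary size and toy sequences are placeholders, not our actual data) estimates first-order transition probabilities between token IDs and samples a short continuation from them.

```python
import numpy as np

# Minimal sketch: first-order Markov transition probabilities over token IDs.
# `sequences` stands in for tokenized MIDI pieces; vocab_size is illustrative.
vocab_size = 8
sequences = [[0, 3, 3, 5, 2], [1, 3, 5, 5, 2, 0]]

counts = np.zeros((vocab_size, vocab_size))
for seq in sequences:
    for a, b in zip(seq[:-1], seq[1:]):
        counts[a, b] += 1                       # count token -> token transitions

probs = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)  # row-normalize

rng = np.random.default_rng(0)
token, generated = 3, [3]
for _ in range(10):                             # sample a short continuation
    if probs[token].sum() == 0:                 # dead end: no observed successor
        break
    token = rng.choice(vocab_size, p=probs[token])
    generated.append(int(token))
print(generated)
```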
The following observations motivate our project aims: audio files in MIDI format provide a parseable, rich language for modeling music; transformers can adequately learn contextual relationships in a large corpus of MIDI data but are expensive to train and run; and lightweight graph-based models such as GCNs, paired with probabilistic methods like random walks, can leverage these attention-based relationships efficiently for music generation. We use the embeddings and standard tokenizer trained on the tokenized, high-quality Lakh MIDI (full) dataset, which comprises 176,581 unique MIDI files (Fig. \ref{fig:midi}).
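As a concrete example of the data path, the sketch below tokenizes a folder of MIDI files into integer token-ID sequences. It assumes the miditok package's REMI tokenizer; the directory name is a placeholder and the exact call signatures vary between miditok versions, so this is illustrative rather than the exact configuration used for the Lakh corpus.

```python
from pathlib import Path
from miditok import REMI, TokenizerConfig   # assumption: miditok is installed

# Minimal sketch: turn a folder of MIDI files into integer token-ID sequences.
tokenizer = REMI(TokenizerConfig())              # default REMI vocabulary
sequences = []
for midi_path in Path("midi_corpus").glob("**/*.mid"):    # placeholder directory
    toks = tokenizer(midi_path)                  # TokSequence(s); return type varies by version
    for seq in (toks if isinstance(toks, list) else [toks]):
        sequences.append(seq.ids)                # integer token IDs for downstream models
```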
We aim to leverage the capabilities of a state-of-the-art transformer for the rapid and efficient generation of cross-style music. Our goal is to create musical sequences in styles that are absent from the transformer's training data (Fig. \ref{fig:token_distribution}). To achieve this, we use the transformer's token and positional embeddings and construct a token-neighbor adjacency matrix from data in styles not present in the Lakh dataset, which captures the sequential information of the tokens. We then combine the transformer's refined embeddings with this sequential information from the unseen dataset and feed the combined input into a GCN. This yields embeddings informed both by the transformer's pretrained representations and by the genre- or style-specific structure encoded in the adjacency matrix. Finally, we run a biased random walk with restart (RWR) over the GCN node embeddings to generate new musical sequences. This ensures that newly composed pieces are firmly rooted in the learned embeddings while being distinctly influenced by the stylistic elements of the previously unseen data.
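The token-neighbor adjacency matrix can be built directly from the tokenized out-of-sample sequences. The sketch below is a simplified version of that step; the window size, symmetrization, and self-loops are illustrative choices and not necessarily those used in make_new_token_adj.ipynb.

```python
import numpy as np
import scipy.sparse as sp

def neighbor_adjacency(sequences, vocab_size, window=1):
    """Symmetric token-token adjacency from neighboring positions in each sequence."""
    rows, cols = [], []
    for seq in sequences:
        for i, tok in enumerate(seq):
            for j in range(i + 1, min(i + 1 + window, len(seq))):
                rows.extend([tok, seq[j]])
                cols.extend([seq[j], tok])       # add both directions -> symmetric
    data = np.ones(len(rows), dtype=np.float32)
    adj = sp.coo_matrix((data, (rows, cols)), shape=(vocab_size, vocab_size)).tocsr()
    adj.setdiag(1.0)                             # self-loops, as is usual for GCN inputs
    return adj

# Hypothetical usage with the sequences from the tokenization sketch:
# adj = neighbor_adjacency(sequences, vocab_size=tokenizer.vocab_size)
```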
Our findings reveal a significant amount of relational information embedded within the token embeddings that is largely independent of the corresponding positional embeddings. This observation suggests a low-compute approach for deriving token embeddings that integrate task-specific positional information from out-of-sample data. Constructing the Graph Convolutional Network (GCN) using only the relationships encoded in the node embeddings yielded modest performance (model 1). Performance improved notably when we incorporated positional embeddings into both the GCN node features and the adjacency matrix (model 2) (Fig. \ref{fig:mloss}). The largest improvement came from our third model, which used a task-specific adjacency matrix based on neighbor information from our out-of-sample Contemporary data (model 3) and achieved an AUC above 0.98, suggesting that combining pretrained embeddings with shallow, task-specific embeddings can be an effective strategy for modeling out-of-sample data. This approach leverages the inherent strengths of pretrained embeddings while adapting to the specific requirements of the task at hand through shallow, task-oriented learning (Table \ref{table}).
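One plausible way to fold positional information into the GCN node features (our simplified reading of model 2, not necessarily the exact construction used in the experiments) is to concatenate each token's embedding with the mean positional embedding over the positions at which that token occurs:

```python
import numpy as np

def build_node_features(token_emb, pos_emb, sequences):
    """token_emb: (vocab, d_tok); pos_emb: (max_len, d_pos) from the pretrained transformer.
    Returns (vocab, d_tok + d_pos) node features: each token's embedding concatenated with
    the mean positional embedding over the positions where that token is observed."""
    vocab, d_pos = token_emb.shape[0], pos_emb.shape[1]
    pos_sum = np.zeros((vocab, d_pos), dtype=np.float32)
    counts = np.zeros(vocab, dtype=np.float32)
    for seq in sequences:
        for position, tok in enumerate(seq[: pos_emb.shape[0]]):
            pos_sum[tok] += pos_emb[position]
            counts[tok] += 1
    mean_pos = pos_sum / np.maximum(counts[:, None], 1.0)
    return np.concatenate([token_emb, mean_pos], axis=1)
```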
Furthermore, the GCN could be trained rapidly to predict token relationships in out-of-sample data with high accuracy. We hypothesize that this strategy could be particularly useful as an ad hoc method for fine-tuning transformer token embeddings for specific tasks. The GCN-learned token embeddings, while adapted to the task-specific context, may retain sufficient information from their pretrained origins to act as a viable substitute in task-specific applications, bridging the gap between general pretraining and specialized task requirements. These findings open up new avenues for efficiently adapting complex transformer models to new domains and tasks with minimal computational overhead, while still capitalizing on the rich representational power of pretrained embeddings.
The model pipeline consists of the following steps:
- Tokenize new MIDI files (MIDI files available at https://github.com/asigalov61/Tegridy-MIDI-Dataset)
- Run gpu_extract_embeddings_from_forward.ipynb to extract the positional and token embeddings
- Run make_new_token_adj.ipynb to build the adjacency matrix from token neighbors in the newly tokenized data
- Run GCN.py to train the GCN on the transformer embeddings, then apply RWR to generate new sequences (see the sketch after this list)
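As an illustration of the final steps, the following condensed sketch defines a two-layer GCN over the normalized neighbor adjacency and a biased random walk with restart over the resulting node embeddings. It is a simplified stand-in for GCN.py: the layer sizes, normalization, restart probability, and similarity-based bias are illustrative choices.

```python
import torch
import torch.nn as nn

class GCN(nn.Module):
    """Two-layer graph convolution: A_norm @ relu(A_norm @ X W1) W2 (illustrative)."""
    def __init__(self, in_dim, hid_dim, out_dim):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim)
        self.w2 = nn.Linear(hid_dim, out_dim)

    def forward(self, a_norm, x):
        h = torch.relu(a_norm @ self.w1(x))
        return a_norm @ self.w2(h)

def normalize_adj(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2} of a dense adjacency with self-loops."""
    deg = adj.sum(1)
    d_inv_sqrt = deg.clamp(min=1e-8).pow(-0.5)
    return d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]

def biased_rwr_generate(node_emb, adj, start, steps=64, restart_p=0.1, temp=1.0):
    """Random walk with restart over tokens; transitions mix graph edges with embedding similarity."""
    sim = node_emb @ node_emb.T
    scores = adj * torch.softmax(sim / temp, dim=1) + 1e-8   # bias edges by embedding similarity
    probs = scores / scores.sum(1, keepdim=True)
    seq, current = [start], start
    for _ in range(steps):
        if torch.rand(1).item() < restart_p:                 # restart at the seed token
            current = start
        current = torch.multinomial(probs[current], 1).item()
        seq.append(current)
    return seq

# Hypothetical usage, with node_feats and adj (dense torch tensors) from the earlier sketches:
# model = GCN(node_feats.shape[1], 128, 64)
# ...train model on an edge-prediction objective over adj...
# emb = model(normalize_adj(adj), node_feats).detach()
# new_tokens = biased_rwr_generate(emb, adj, start=seed_token_id)
```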
Please contact us for the full manuscript.