End-to-End Real-World Polyphonic Piano Audio-to-Score Transcription with Hierarchical Decoding
Summary
Paper digest
What problem does the paper attempt to solve? Is this a new problem?
The paper addresses the challenges faced by existing end-to-end piano audio-to-score (A2S) transcription systems, focusing on two main issues: the difficulty of modeling hierarchical musical structures, and the discrepancy between synthetic data and real-world recordings of human performances. It introduces a novel end-to-end A2S model with a hierarchical decoder that transcribes both bar-level and note-level information, aiming to bridge the gap between synthetic data and real-world recordings. Accurately transcribing piano audio into musical scores is not a new problem, but the paper proposes innovative solutions that enhance the transcription process by incorporating hierarchical decoding and multi-task learning strategies.
What scientific hypothesis does this paper seek to validate?
This paper seeks to validate a hypothesis about end-to-end polyphonic piano audio-to-score transcription with hierarchical decoding: that the difficulties existing systems have in modeling hierarchical musical structures, and the disparity between synthetic data and real-world human performance recordings, can be addressed jointly. The study proposes a novel sequence-to-sequence (Seq2Seq) model with a hierarchical decoder that transcribes both bar-level and note-level information, and bridges the gap between synthetic data and human recordings through a two-stage training scheme.
What new ideas, methods, or models does the paper propose? What are the characteristics and advantages compared to previous methods?
The paper proposes several innovative ideas, methods, and models in the field of piano audio-to-score transcription:
- Hierarchical A2S Model: The paper introduces a novel end-to-end A2S model with a hierarchical decoder that transcribes audio into both bar-level information (key and time signatures) and note-level information (note sequence) through multi-task learning.
- Two-Stage Training Scheme: To address the gap between synthetic data and real-world recordings, the paper suggests a two-stage training scheme: pre-training the model on synthetic data from an Expressive Performance Rendering (EPR) system, then fine-tuning it on real-world piano recordings of human performances.
- Score Representation Method: The paper proposes a pre-processing method for **Kern scores that preserves the voicing structure for score reconstruction, especially in scenarios with an unconstrained number of voices.
- Generalization Capabilities: The paper evaluates the model's generalization in both in-distribution and out-of-distribution scenarios, conducting experiments on human recordings to validate the effectiveness of the proposed training scheme.
- Visualization of Key and Time Signature Prediction: The paper visualizes confusion matrices for key and time signature prediction to examine how accurately the model predicts these elements.
- Impact of EPR Composers and Soundfonts: The paper investigates the impact of EPR composers and piano soundfonts on the model's performance during the pre-training stage, highlighting the model's ability to generalize to unseen performance styles and soundfonts.

The paper also highlights several key characteristics and advantages of the proposed method compared to previous approaches:
- Hierarchical A2S Model: The end-to-end A2S model's hierarchical decoder transcribes audio into both bar-level information (key and time signatures) and note-level information (note sequence) through multi-task learning. This structure aligns with the hierarchical nature of musical scores, enabling transcription of score information at different levels.
- Two-Stage Training Scheme: To address the limitations of existing systems trained and evaluated only on synthetic data, the two-stage scheme pre-trains the model on synthetic data from an Expressive Performance Rendering (EPR) system and fine-tunes it on real-world piano recordings of human performances. This bridges the gap between synthetic data and human recordings, enhancing the model's performance on real-world data.
- Score Representation Method: The score representation method for **Kern piano scores serializes scores into tokens while preserving their inherent voicing structure, addressing the challenge of reconstructing musical scores with an unconstrained number of voices.
- Generalization Capabilities: The proposed model demonstrates strong generalization in both in-distribution and out-of-distribution scenarios. Experiments on human recordings validate the effectiveness of the proposed training scheme in improving generalization to unseen performance styles.
- Visualization of Key and Time Signature Prediction: Confusion matrices for key and time signature prediction provide a detailed analysis of the model's accuracy on these elements, highlighting areas for improvement and potential challenges.
- Impact of EPR Composers and Soundfonts: Experiments during the pre-training stage show how EPR composers and piano soundfonts affect performance, emphasizing the model's ability to generalize to unseen performance styles and soundfonts and the influence of diverse performance styles and sound characteristics.
Does any related research exist? Who are the noteworthy researchers in this field? What is the key to the solution mentioned in the paper?
Several related research studies exist in the field of polyphonic piano audio-to-score transcription. Noteworthy researchers in this area include Dasaem Jeong, Taegyun Kwon, Yoojin Kim, Juhan Nam, Stefan Koelsch, Martin Rohrmeier, Renzo Torrecuso, Sebastian Jentschke, and many others. These researchers have contributed to various aspects of music transcription, including modeling expressive piano performance, processing hierarchical syntactic structures in music, and evaluating automatic polyphonic music transcription.
The key to the solution mentioned in the paper "End-to-End Real-World Polyphonic Piano Audio-to-Score Transcription with Hierarchical Decoding" is a novel end-to-end piano audio-to-score transcription model with a hierarchical decoder. This model transcribes audio into both bar-level information (including key and time signatures) and note-level information, such as the note sequence. The solution also includes a two-stage training scheme to bridge the gap between synthetic sound and real-world recordings: the model is pre-trained on synthetic data from an expressive performance rendering system and fine-tuned on real-world piano recordings of human performances. Additionally, a pre-processing method for the **Kern representation serializes piano scores into token sequences while preserving the voicing structure, facilitating the reconstruction of scores with an unconstrained number of voices.
How were the experiments in the paper designed?
The experiments in the paper were designed with a two-stage training scheme:
- Pre-training Stage: The model was pre-trained on synthetic data generated by an Expressive Performance Rendering (EPR) system. This synthetic data captured subtle details of piano performance, such as deviations in note onsets, durations, velocities, and pedal usage.
- Fine-tuning Stage: After pre-training, the model was fine-tuned on real-world piano recordings of human performances. This involved transfer learning with a small set of human performance recordings from the ASAP dataset.

These two stages aimed to bridge the gap between synthetic data and real-world recordings, enhancing the model's performance and generalization capabilities.
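The two-stage scheme can be sketched, independently of any deep-learning framework, as two calls to the same training routine with different data and learning rates. The toy scalar model, squared-error loss, and datasets below are stand-ins chosen so the sketch runs end to end; they are not the paper's actual model or data.

```python
# Minimal sketch of the two-stage training scheme: the same parameters are
# first fitted on (many) synthetic examples, then fine-tuned with a smaller
# learning rate on a (small) set of human-performance examples.

def sgd_stage(weight, dataset, lr, epochs):
    """One training stage: plain gradient descent on squared error."""
    for _ in range(epochs):
        for x, y in dataset:
            pred = weight * x
            grad = 2 * (pred - y) * x      # d/dw of (w*x - y)^2
            weight -= lr * grad
    return weight

synthetic = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # toy "EPR-rendered" set
human     = [(1.0, 2.0), (2.0, 4.0)]               # toy "human recording" set

w = 0.0
w = sgd_stage(w, synthetic, lr=0.05, epochs=50)    # stage 1: pre-training
w = sgd_stage(w, human,     lr=0.01, epochs=20)    # stage 2: fine-tuning
```

The lower fine-tuning learning rate mirrors standard transfer-learning practice: the small human-recorded set adjusts, rather than overwrites, what was learned from the large synthetic set.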
What is the dataset used for quantitative evaluation? Is the code open source?
The dataset used for quantitative evaluation in the study is the MuseSyn dataset, which includes training, validation, and test splits. The code for the project is open source, and the paper points to GitHub repositories of score collections, such as those for Beethoven piano sonatas, Haydn piano sonatas, Mozart piano sonatas, Scarlatti keyboard sonatas, Joplin compositions, and Chopin first editions.
Do the experiments and results in the paper provide good support for the scientific hypotheses that need to be verified? Please analyze.
The experiments and results presented in the paper provide strong support for the scientific hypotheses that needed verification. The paper introduces a novel end-to-end piano audio-to-score transcription model that addresses key challenges in existing work, such as modeling hierarchical musical structures and discrepancies between synthetic and real-world recordings. The experiments include a two-stage training scheme: pre-training on synthetic data from an expressive performance rendering (EPR) system and fine-tuning on human performance recordings. This approach aims to bridge the gap between synthetic and real-world data, which is crucial for the model's performance and generalizability.

Furthermore, the paper proposes a Seq2Seq model with a hierarchical decoder that transcribes both bar-level and note-level information, demonstrating a comprehensive approach to audio-to-score transcription. Experiments on the pre-trained model evaluate its generalization in both in-distribution and out-of-distribution settings. Additionally, the paper reports the first experiment for end-to-end A2S systems on piano recordings of human performance from the ASAP dataset, validating the effectiveness of the proposed training scheme.

Overall, the experiments and results provide robust evidence supporting the scientific hypotheses put forth in the study. The methodology, including the two-stage training scheme and the hierarchical decoder model, effectively addresses the challenges in piano audio-to-score transcription, showcasing the validity and efficacy of the proposed approaches.
What are the contributions of this paper?
The paper on End-to-End Real-World Polyphonic Piano Audio-to-Score Transcription with Hierarchical Decoding makes several key contributions:
- Hierarchical A2S model by multi-task learning: The paper introduces an innovative end-to-end A2S model with a hierarchical decoder that transcribes audio into both bar-level information, including key and time signatures, and note-level information, specifically the note sequence.
- Two-stage training scheme for A2S: To address the gap between synthetic sound and real-world recordings, the paper proposes a two-stage training approach: pre-training the model on synthetic data from an expressive performance rendering (EPR) system, then fine-tuning it on real-world piano recordings of human performances. Experiments on human recordings demonstrate the effectiveness of this training scheme.
- Score representation method for unconstrained voices: The paper introduces a pre-processing method that serializes **Kern piano scores into tokens while preserving their inherent voicing structure, facilitating the reconstruction of scores with an unconstrained number of voices.
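The voice-preserving serialization idea can be illustrated with a small round-trip example: simultaneous events across voices are flattened left to right with explicit voice- and row-separator tokens, so the original column layout is exactly recoverable. The separator tokens `<v>` and `<nl>` below are hypothetical placeholders, not the paper's actual token vocabulary.

```python
# Hypothetical voice-preserving serialization. In **Kern, each voice (spine)
# is a column; here every row of simultaneous events is flattened left to
# right, with "<v>" separating voices and "<nl>" closing the row, so the
# column structure can be rebuilt from the flat token sequence.

def serialize(rows):
    """Flatten a list of rows (one token per voice) into a token sequence."""
    tokens = []
    for row in rows:
        for i, tok in enumerate(row):
            if i > 0:
                tokens.append("<v>")
            tokens.append(tok)
        tokens.append("<nl>")
    return tokens

def deserialize(tokens):
    """Invert serialize(): rebuild rows and their voice columns."""
    rows, row = [], []
    for tok in tokens:
        if tok == "<nl>":
            rows.append(row)
            row = []
        elif tok != "<v>":
            row.append(tok)
    return rows

score = [["4c", "4e"], ["4d", "4f"], ["2e", "2g"]]  # two voices, three rows
tokens = serialize(score)
assert deserialize(tokens) == score                  # round-trip is lossless
```

Because the separators carry the layout, the sequence model can emit an arbitrary number of voices per row without a fixed-voice-count assumption.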
What work can be continued in depth?
To further advance the field of piano audio-to-score transcription, several areas can be explored in depth based on the provided research:
- Enhancing the Modeling of Hierarchical Musical Structures: Future research can focus on developing more sophisticated models that effectively capture the hierarchical nature of musical scores, including elements like notes, keys, and time signatures.
- Real-World Evaluation and Dataset Expansion: There is a need to address the gap in real-world evaluation by expanding datasets for A2S transcription, especially human performance recordings, to improve the generalization capabilities of models.
- Improving Polyphonic Representation: Research can delve deeper into the challenges posed by polyphonic music, particularly scenarios with an unconstrained number of voices, to extend the applicability of A2S systems to complex musical compositions.
- Exploring Expressive Performance Rendering Systems: Further exploration of expressive performance rendering (EPR) systems, which generate human-like MIDI performances, can lead to advancements in capturing emotions and nuances in piano recordings.
- Innovating Training Schemes and Pre-Processing Methods: Continued research can refine training schemes, such as the two-stage approach used in the study, and develop pre-processing methods for score representation that improve transcription accuracy and voicing-structure preservation.