Audio editing with non-rigid text prompts
Francesco Paissan, Zhepei Wang, Mirco Ravanelli, Paris Smaragdis, Cem Subakan
Abstract
In this paper, we explore audio editing with non-rigid text edits. We show that the proposed editing pipeline produces audio edits that remain faithful to the input audio. We explore text prompts that perform addition, style transfer, and inpainting. We show, both quantitatively and qualitatively, that our edits outperform those obtained with AudioLDM, a recently released text-prompted audio generation model. Qualitative inspection of the results shows that the edits produced by our approach remain more faithful to the input audio, preserving the original onsets and offsets of the audio events.
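For context on the baseline mentioned above, AudioLDM can be driven with the same kind of free-form text prompts listed on this page. The sketch below is a minimal, illustrative example of generating a baseline clip with the Hugging Face diffusers implementation of AudioLDM; the checkpoint name and parameter values are assumptions for illustration and are not part of our editing pipeline.

import torch
import soundfile as sf
from diffusers import AudioLDMPipeline

# Load a public AudioLDM checkpoint (checkpoint name is an assumption;
# other AudioLDM checkpoints on the Hugging Face Hub work the same way).
pipe = AudioLDMPipeline.from_pretrained(
    "cvssp/audioldm-s-full-v2", torch_dtype=torch.float16
).to("cuda")

# Generate a short clip from one of the prompts used on this page.
prompt = "A dog barking followed by wind blowing on the microphone."
audio = pipe(prompt, num_inference_steps=50, audio_length_in_s=5.0).audios[0]

# AudioLDM outputs 16 kHz mono audio.
sf.write("audioldm_baseline.wav", audio, samplerate=16000)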
Style transfer edits
Prompt: "Sound of knocking the door."
Prompt: "Sound of gunshots in the background."
Prompt: "A man is giving a speech."
Addition edits
Prompt: "Sirens wailing and gunshots in the background."
Prompt: "A dog barking followed by wind blowing on the microphone."
Prompt: "A church bell ringing with loud baby crying in the background."
Inpainting edits
Prompt: "Dog barking repeatedly."
Prompt: "Sound of church bells."
Prompt: "Sound of ducks quacking."