Audio editing with non-rigid text prompts

Francesco Paissan, Zhepei Wang, Mirco Ravanelli, Paris Smaragdis, Cem Subakan

Abstract

In this paper, we explore audio editing with non-rigid text edits. We show that the proposed editing pipeline creates audio edits that remain faithful to the input audio. We explore text prompts that perform addition, style transfer, and inpainting. We show quantitatively and qualitatively that our edits outperform those of AudioLDM, a recently released text-prompted audio generation model. Qualitative inspection of the results indicates that the edits produced by our approach remain more faithful to the input audio, in that they keep the original onsets and offsets of the audio events.

Note

Our pipeline can work with any text-to-audio (TTA) latent diffusion model.

The edits are specified as free-form text: you produce an edit by describing the desired output in words.

Figure 1: In step 1, we optimize with respect to $e_\text{text}$ to minimize the reconstruction error in Equation (1), where $z = \text{VAEEncoder}(X_\text{inp})$. The resulting optimized text embedding is denoted by $e_\text{opt}$. In step 2, we optimize with respect to the diffusion model parameters to minimize the same reconstruction loss as in step 1. Note that steps 1 and 2 use only the part shown in the green box. In step 3, the text embedding is set to a linear combination of the target embedding and the optimized embedding, $e_\text{text} = \eta e_\text{target} + (1 - \eta) e_\text{opt}$, and the whole pipeline shown in the yellow box is used.
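The embedding mix used in step 3 can be illustrated with a minimal sketch. It is written in plain Python to stay dependency-free, and the function name `interpolate_embeddings` and the use of flat lists of floats for the embeddings are assumptions made for illustration, not part of the released pipeline.

```python
def interpolate_embeddings(e_target, e_opt, eta):
    """Return eta * e_target + (1 - eta) * e_opt, element-wise.

    e_target and e_opt are hypothetical stand-ins for the target and
    optimized text embeddings, given here as flat lists of floats.
    """
    if len(e_target) != len(e_opt):
        raise ValueError("embeddings must have the same length")
    return [eta * t + (1.0 - eta) * o for t, o in zip(e_target, e_opt)]


# Toy two-dimensional embeddings, chosen only to make the mix easy to read.
e_target = [1.0, 0.0]
e_opt = [0.0, 1.0]
mixed = interpolate_embeddings(e_target, e_opt, eta=0.25)
```

With $\eta = 0$ the result is exactly $e_\text{opt}$, the embedding fitted to the input audio, and with $\eta = 1$ it is exactly $e_\text{target}$; values in between trade fidelity to the input against adherence to the edit prompt.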

Style transfer edits

Prompt: "Sound of knocking the door."

Original sample	Ours	AudioLDM

Prompt: "Sound of gunshots in the background."

Original sample	Ours	AudioLDM

Prompt: "A man is giving a speech."

Original sample	Ours	AudioLDM

Addition edits

Prompt: "Sirens wailing and gunshots in the background."

Original sample	Ours	AudioLDM

Prompt: "A dog barking followed by wind blowing on the microphone."

Original sample	Ours	AudioLDM

Prompt: "A church bell ringing with loud baby crying in the background."

Original sample	Ours	AudioLDM

Inpainting edits

Prompt: "Dog barking repeatedly."

Original sample	Ours	AudioLDM

Prompt: "Sound of church bells."

Original sample	Ours	AudioLDM

Prompt: "Sound of ducks quacking."

Original sample	Ours	AudioLDM

In the following section we show the results obtained for the rebuttal of our ICASSP submission. We plot the average spectrogram energy at each time step. Our goal is to measure how well the onsets of the original audio are preserved in the edited audio. For the style transfer task, our method preserves the onsets of the original input better than AudioLDM. To quantify this alignment, we calculate the cosine similarity between the two 1D energy signals (original and edited). The resulting cosine similarity values are shown in the plot titles.
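The alignment measure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the spectrogram is a frequency-by-time grid of magnitudes stored as nested lists, "energy" is taken to be the squared magnitude averaged over frequency bins, and signals of different lengths are truncated to the shorter one. The function names are hypothetical, and the exact definitions behind the plots may differ.

```python
import math


def energy_envelope(spectrogram):
    """Average energy per time step.

    spectrogram[f][t] is assumed to hold the magnitude at frequency bin f
    and time step t; energy is assumed to be the squared magnitude.
    """
    n_freq = len(spectrogram)
    n_time = len(spectrogram[0])
    return [
        sum(spectrogram[f][t] ** 2 for f in range(n_freq)) / n_freq
        for t in range(n_time)
    ]


def cosine_similarity(a, b):
    """Cosine similarity between two 1D signals (truncated to equal length)."""
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for silent signals
    return dot / (norm_a * norm_b)


# Toy spectrograms: 2 frequency bins, 3 time steps.
original = [[1.0, 0.0, 2.0], [1.0, 0.0, 2.0]]
edited = [[2.0, 0.0, 4.0], [2.0, 0.0, 4.0]]  # same onsets, louder
score = cosine_similarity(energy_envelope(original), energy_envelope(edited))
```

Because cosine similarity ignores overall scale, an edit that keeps events at the same time positions but changes their loudness uniformly still scores close to 1, which is the property that makes it a reasonable proxy for onset fidelity.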