Welcome to my blog


This is Beilong Tang from Duke Kunshan University.

This page currently serves as my notes.

Paper

MaskSR

Publication Date: 4 June 2024, link

Distorted Speech Encoder:

using an STFT with a window size of 2048 and a hop size of 512:

import torch

audio = torch.randn(1, 3 * 44100)  # 3 seconds at a 44.1 kHz sample rate
y = torch.stft(audio, 2048, 512, return_complex=True)
print(y.shape)  # [1, F, T] with F = 2048 // 2 + 1 = 1025

w = torch.abs(y)  # magnitude spectrogram, [1, F, T]
w = w.pow(0.3)    # power-law compressed spectrogram (exponent 0.3), [1, F, T]
print(w.shape)

Cosine Scheduler

During training, a masking ratio (between 0 and 1) is applied to the \( 9 \times T \) codebook tokens by sampling \( r \sim \mathcal{U}(0, 1) \).

The cosine schedule function is \( f(r) = \cos(\frac{\pi}{2} r) \), which ensures that \( f(0) = 1 \) and \( f(1) = 0 \).

The cross-entropy loss is computed only at the masked positions.
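
A minimal sketch of this masking step (assuming tokens of shape [9, T] and a dedicated mask token id; the names are illustrative, not from the paper's code):

import math
import torch

def mask_tokens(tokens, mask_token_id):
    """tokens: [9, T] codebook indices. Returns the masked tokens and the boolean mask."""
    r = torch.rand(1).item()                # r ~ U(0, 1)
    ratio = math.cos(math.pi / 2 * r)       # f(r) = cos(pi/2 * r): f(0) = 1, f(1) = 0
    mask = torch.rand_like(tokens, dtype=torch.float) < ratio  # mask roughly f(r) of the positions
    return tokens.masked_fill(mask, mask_token_id), mask

# the loss is then computed only on the masked positions, e.g.
# loss = F.cross_entropy(logits[mask], targets[mask])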

Classifier-free Guidance

The conditional logit \( l_c \) is used to measure \( P(x \mid c) \), where \( c \) is the prompt. In this case, \( c \) is the speech encoder input.

The unconditional logit \( l_u \) is used to measure \( P(x) \), where the model predicts the output without any condition. During inference, we simply replace the entire conditioning input with a null embedding repeated \( T \) times, predict the output, and use that as \( l_u \). The guided logit used for inference is \( l_g = (1 + w) l_c - w l_u \).

During training, for each epoch, we randomly select 10% of the training data and replace its condition with the null embedding.
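
A minimal sketch of the guidance combination at inference (w is the guidance weight; the names are illustrative):

import torch

def cfg_logits(l_c: torch.Tensor, l_u: torch.Tensor, w: float) -> torch.Tensor:
    """l_c: conditional logits, l_u: unconditional logits from the null condition."""
    return (1 + w) * l_c - w * l_u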

Notes

When reading audio, it is better to normalize it before doing other operations (to avoid NaN problems).
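
For example, a simple peak normalization right after loading (a sketch using torchaudio; the file name is just a placeholder):

import torchaudio

audio, sr = torchaudio.load("example.wav")
peak = audio.abs().max()
if peak > 0:
    audio = audio / peak * 0.95  # leave some headroom to avoid clipping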

Linux Command

This page lists commonly used Linux commands.

unzip a file

unzip example.zip -d /home/user/destination

untar .tar.gz

tar -xzvf filename.tar.gz -C /path/to/output/directory

untar .tar

mkdir output
tar -xvf x.tar -C output

Count all files under a folder

find . -type f | wc -l

zip a file
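
For example (assuming the zip utility is installed; -r recurses into directories):

zip -r archive_name.zip /path/to/folder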

tar a file

tar -czvf archive_name.tar.gz /path/to/folder

Untar (extract) a file

tar -xzvf archive_name.tar.gz

Run commands in multiple lines

Example:

sbatch -J ${JOB_NAME} -p ${card} -e ${log_path}/${current_datetime}.err -o ${log_path}/${current_datetime}.out -N ${num_nodes} --cpus-per-task=${cpus_per_task} \
  --gres=dcu:${gpus_per_node} --ntasks-per-node=${ntasks_per_node} \
  --exclusive \

We can use \ at the end of a line to split a long command across multiple lines.

wer /public/home/qinxy/bltang/ml_framework_slurm/exp/dasb/target/wavlm_6_layer/output/wsj0_2mix/trans_output.txt

Restart bluetooth

sudo systemctl restart bluetooth

Noise RIR

To create audio with reverberation:

  • get the RIR and resample it to the audio's sample rate
  • convolve the audio with the RIR

Sample code:

import numpy as np 
import scipy.signal as s
import torchaudio
import torch
import torchaudio.transforms as T
def reverb_rir(frames, rir):
    """
    frames: clean audio numpy array with shape [1, T]
    rir: RIR numpy array with shape [1, T']
    returns: reverberated audio trimmed to the original length, shape [T] (numpy)
    """
    orig_frames_shape = frames.shape
    frames, filter = np.squeeze(frames), np.squeeze(rir)
    frames = s.convolve(frames, filter)
    actlev = np.max(np.abs(frames))
    if actlev > 0.99:
        frames = (frames / actlev) * 0.98
    frames = frames[:orig_frames_shape[1]]
    return frames

rir_impulse = "/home/bltang/work/data/impulse/datasets_fullband/impulse_responses/SLR26/simulated_rirs_48k/largeroom/Room002/Room002-00001.wav"
## 48khz

frame_path = "/home/bltang/work/voicefixer_main/test/clean/SSB00050001.wav"
## 44.1khz 
frame, frame_rate = torchaudio.load(frame_path)

rir, rir_rate = torchaudio.load(rir_impulse)

print(f"loaded audio frame: {frame.shape}, sample rate: {frame_rate}")
print(f"loaded rir: {rir.shape}, sample rate: {rir_rate}")

## downsample the rir to be 44.1khz
resampler = T.Resample(rir_rate, frame_rate, dtype=frame.dtype)
rir = resampler(rir)

frame = frame.numpy()
rir = rir.numpy()

## doing the convolution
output = reverb_rir(frame,rir)

output = torch.from_numpy(output).unsqueeze(0)
torchaudio.save("output.wav",output, frame_rate)

perform clipping:

### perform clipping
clip_factor = 0.1
z = torch.clamp(output,min = output.min() * clip_factor, max = output.max() * clip_factor)
print(z.min())
torchaudio.save("clamp.wav",z, frame_rate)

Pytorch

These are notes and tricks for PyTorch.

to()

Calling to() on a tensor does not move the original tensor to the specified device in place. Instead, it returns a new tensor (a copy of the data) on that device.

If you want the original tensor to be moved to the device, you should do

x = x.to(device)

However, if you want to move a model (an nn.Module) to a device, you can just do

model.to(device)
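
A small sketch of the difference (the device is picked at runtime):

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(2, 3)
x.to(device)       # returns a new tensor; x itself is unchanged
x = x.to(device)   # rebind x to the moved copy

model = nn.Linear(3, 3)
model.to(device)   # nn.Module.to() moves the parameters in place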

DDP

For gradients to flow (and be synchronized) correctly during DDP training, we should only call the wrapped model's forward method, i.e. call model(...), instead of other custom methods, because DDP registers its gradient-reduction hooks around forward.
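
A minimal sketch of the point (the distributed setup is omitted and only hinted at in comments):

import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)

    def forward(self, x):
        return self.backbone(x)

    def encode(self, x):  # a custom method that is not forward
        return self.backbone(x)

# after torch.distributed.init_process_group(...) and moving the model to its device:
# ddp_model = DDP(Net().cuda(), device_ids=[local_rank])
# y = ddp_model(x)                # correct: DDP's gradient sync hooks run
# y = ddp_model.module.encode(x)  # bypasses DDP's forward; gradients may not be reduced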

torch.cuda.set_device()

When you are working with PyTorch and CUDA, setting the environment variable CUDA_VISIBLE_DEVICES defines which GPUs are available to PyTorch. However, it is important to note that this variable takes a comma-separated list of device indices, which PyTorch remaps to logical indices starting from 0.

So if you have, for example, four GPUs in your system (indices 0, 1, 2, 3) and you set CUDA_VISIBLE_DEVICES=2, PyTorch will then only see one GPU, and it will be remapped to index 0 when accessed from PyTorch (not index 2).

Here's how you would set it and use torch.cuda.set_device in Python:

import os
import torch

# Make only the GPU with original index 2 visible to PyTorch
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

# Since CUDA_VISIBLE_DEVICES is set to '2', PyTorch sees this GPU as device '0'
torch.cuda.set_device(0)  # not 2

Remember, torch.cuda.set_device expects the logical device index, which means that, after setting CUDA_VISIBLE_DEVICES, the visible devices will start from zero regardless of their actual hardware indices.

So if you set os.environ['CUDA_VISIBLE_DEVICES'] = '2' and then call torch.cuda.set_device(2), you will likely get a device error, because you have changed what device "2" refers to. After setting CUDA_VISIBLE_DEVICES, the device you are trying to access is at index "0".

You should do the following:


import os
import torch

os.environ['CUDA_VISIBLE_DEVICES'] = '2'  # Now there's only one GPU visible to the process.
torch.cuda.set_device(0)  # This refers to the single visible GPU, which is the original device 2.

Run these snippets before executing other PyTorch code that works with CUDA, as changing CUDA_VISIBLE_DEVICES after CUDA context has been initialized may not have an effect.

Pytorch Index

This post really helps:

here

Let's assume we have a tensor A of shape [B, H, C]. The basic way to do advanced (batch) indexing is:

A[tensor, tensor, ...]

This gathers, at each position, one index from each indexing tensor and uses the resulting tuple as a coordinate into A.

For example:

## creating A
t = torch.tensor([[1, 2], [3, 4]])
w = -t 
###
A = torch.cat((t.unsqueeze(0), w.unsqueeze(0))) # [2, 2, 2]
"""
A:
tensor([[[ 1,  2],
         [ 3,  4]],

        [[-1, -2],
         [-3, -4]]])
"""
batch = torch.LongTensor([[0,0],[1,1]]) # [2,2]
row = torch.LongTensor([[0,1],[1,1]])   # [2,2]
col = torch.LongTensor([[1,0],[1,0]])   # [2,2]
res = A[batch, row, col]
"""
tensor([[ 2,  3],
        [-4, -3]])
"""

In the above code, the index tuples are (0,0,1), (0,1,0), (1,1,1), (1,1,0).

The result is [[A[0,0,1],A[0,1,0]],[A[1,1,1],A[1,1,0]]]

A[tensor]

Assume we use only one tensor as the indexing tensor.

As in the example above, if we index A with just the batch tensor, the result has shape [2, 2, 2, 2]:

A[batch]
"""
torch.Size([2, 2, 2, 2])
"""

The principle is simple. For each element in the index, we take the corresponding element in A.

Therefore, the result will be [[A[0],A[0]], [A[1],A[1]]]. Each element in A has shape [2,2], so the result will be of shape [2,2,2,2].

Load Kmeans

import os
import sys
import pickle
import joblib
import torch
from sklearn.cluster import MiniBatchKMeans

sys.path.append("../")
from models.modules.kmeans import KMeansQuantizer as Kmeans

### load kmeans model
kmeans_ckpt = "/home/bltang/work/test/kmeans_4096_batch_size_16.pkl"

def load_dict(file_path):
    with open(file_path, "rb") as file:
        return pickle.load(file)

res = load_dict(kmeans_ckpt)
# sklearn may emit an InconsistentVersionWarning here if the pickled MiniBatchKMeans
# was saved with a different sklearn version (e.g. 1.5.1 vs 1.3.2); use at your own risk.

for key in res:
    print(key)
# resume
# step
# feature_list
# kmeans

kmeans_model: MiniBatchKMeans = res["kmeans"]
joblib.dump(kmeans_model, "/home/bltang/work/wavlm_kmeans_backend/ckpt/kmeans_gigaspeech_4096.pt")

kmeans_pytorch = Kmeans("/home/bltang/work/wavlm_kmeans_backend/ckpt/kmeans_gigaspeech_4096.pt")

Test

x = torch.randint(0,4096, (2, 30)).long()
print(x)
kmeans_pytorch.emb(x)
tensor([[1418, 1143,  728, 3682, 2837,  299, 2354, 1865,  156,  830,  409,  892,
         3274, 3682,  872,  510, 3785, 3780, 3297, 1187, 1234, 2075,  616, 2053,
         1168,  635,  320, 1091,  738, 2829],
        [ 158,  830, 4059, 3800, 1810, 2425, 3992, 1535, 3935, 1102,  297, 1540,
         2149, 2414, 1222, 3968, 3072, 3858, 1115,  701, 2100, 2951, 1291, 3721,
         1296, 2966,  960, 2327, 2188, 1239]])

The output of kmeans_pytorch.emb(x):

tensor([[[ 3.4845,  0.5973, -0.2941,  ...,  1.8850, -3.3292, -3.1918],
         [-0.9843, -1.8948,  0.1781,  ...,  1.0470, -2.2949, -3.0003],
         [-2.4003, -0.3012,  2.5385,  ...,  0.2260, -0.4540, -0.7819],
         ...,
         [ 1.5334, -1.6515,  0.1410,  ...,  0.8672, -1.2393, -0.5327],
         [ 1.1343, -0.0071,  0.2196,  ...,  1.9066, -3.0079, -2.2519],
         [-0.8302,  0.0227,  2.9222,  ..., -0.1998,  0.5812, -1.3175]],

        [[ 0.5794,  0.3916,  3.1679,  ..., -0.7281, -0.7220, -0.4178],
         [-0.6731, -1.0334,  2.3820,  ...,  2.1921, -1.0713, -0.1829],
         [-1.8742, -3.6283, -0.3614,  ...,  1.6077, -1.6078, -2.0266],
         ...,
         [-0.5319, -0.7900,  2.3659,  ..., -1.6811, -1.6981, -2.6017],
         [ 3.1977, -3.8897,  0.7760,  ..., -1.5524,  2.8854, -2.0698],
         [ 0.0547, -0.4102,  2.9803,  ...,  0.2289, -0.3999, -1.0432]]])

Slurm

To run a single task using 4 GPUs:

#!/bin/bash
#SBATCH -J train_test
#SBATCH -n 32
#SBATCH -N 1
#SBATCH --gres=dcu:4
#SBATCH -p kshdnormal02
python ... # The command to run

Note that there is no srun before the command.

To run 8 tasks on a single node, you can run

#!/bin/bash
#SBATCH -J train_test
#SBATCH --ntasks-per-node=8
#SBATCH -N 1
#SBATCH --cpus-per-task=2
#SBATCH --gres=dcu:4
#SBATCH -p kshdnormal
srun python test.py

Conda

Backup an environment

To back up an environment, my solution is to use conda clone to clone the current environment into a backup environment (here called my_env_backup).

Sample code:

conda create --name my_env_backup --clone my_env

Deep Learning Pytorch Convention

This part describes my usual conventions for deep learning with PyTorch.

Environment and Config

The following simple config class describes my usual environment setup:

from argparse import Namespace
import argparse

# A Namespace whose missing attributes return None and which supports dict-style access.
class AttrDict(Namespace):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def __getattribute__(self, name: str):
        try:
            return super().__getattribute__(name)
        except AttributeError:
            return None

    def __getitem__(self, key):
        return self.__getattribute__(key)

x = argparse.ArgumentParser()
x.add_argument("--x", default = 2)
x.add_argument('--y', default = 3)
args = x.parse_args()


name = AttrDict(**vars(args))

print(name)
print(name.x) # valid
print(name['x']) # valid
print(name['y']) # valid

Slurm

For most model training, I use the Slurm system of Kunshan Super Computing. It has GPUs with 16 GB and 32 GB of memory.

Salloc a server

This is convenient for testing.

# allocate a 16GB server
salloc -p kshdnormal --ntasks-per-node=4 --cpus-per-task=8 --gres=dcu:4 -J inference_bltang --exclusive
# allocate a 32GB server
salloc -p kshdnormal --ntasks-per-node=4 --cpus-per-task=8 --gres=dcu:4 -J inference_bltang --exclusive

Diary

2024 June 6

WavSEFLM

Using WavLM, an SEF network, and a language model for target speaker separation.

The full audio is available here.

The audios with the highest similarity are 4, 16, and 19.

The encoder-decoder loses the information of a male voice: 21, 22, 23 (must train the kmeans and vocoder).

Unclear audio: 6, 17, 21.

The rest of the audios suffer from tone loss.

demo:

Target Speaker Separation