This is Beilong Tang from Duke Kunshan University.
This page currently serves as my note.
Publication Date: 4 June 2024, link
Using STFT with a window size of 2048 and a hop size of 512:
import torch
audio = torch.randn(1, 3 * 44100)  # 3 seconds at a 44.1 kHz sample rate
window = torch.hann_window(2048)   # torch.stft warns if no window is provided
y = torch.stft(audio, n_fft=2048, hop_length=512, window=window, return_complex=True)
print(y.shape)    # [B, F, T] = [1, 1025, T]
w = torch.abs(y)  # magnitude spectrogram, [B, F, T]
w = w.pow(0.3)    # X^0.3 compressed power spectrogram, [B, F, T]
print(w.shape)
During training, a masking ratio (between 0 and 1) is applied to the \( 9 \times T \) codebook by sampling \( r \sim \mathcal{U}(0, 1) \).
The cosine schedule function is \( f(r) = \cos(\frac{\pi}{2} r) \), which ensures that \( f(0) = 1 \) and \( f(1) = 0 \).
The cross-entropy loss is computed on the masked positions.
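A minimal sketch of this masking step, assuming a [9, T] token tensor and hypothetical names (codes, mask_token_id, logits):
import math
import torch
import torch.nn.functional as F
# Hypothetical setup: 9 codebook levels, T = 50 frames, vocabulary size 1024.
codes = torch.randint(0, 1024, (9, 50))    # ground-truth tokens, [9, T]
mask_token_id = 1024                       # assumed special mask token
# Sample r ~ U(0, 1) and map it through the cosine schedule f(r) = cos(pi/2 * r).
r = torch.rand(1).item()
mask_ratio = math.cos(math.pi / 2 * r)     # f(0) = 1, f(1) = 0
# Mask that fraction of positions.
mask = torch.rand(codes.shape) < mask_ratio              # [9, T] boolean mask
masked_codes = codes.masked_fill(mask, mask_token_id)    # model input
# The model predicts logits over the vocabulary for every position;
# the cross-entropy loss is computed only on the masked positions.
logits = torch.randn(9, 50, 1025)          # stand-in for the model output
loss = F.cross_entropy(logits[mask], codes[mask])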
The conditional logit \( l_c \) measures \( P(x \mid c) \), where \( c \) is the prompt; in this case, \( c \) is the speech encoder input.
The unconditional logit \( l_u \) measures \( P(x) \), where the model predicts the output without any prompt. During inference, we simply replace the entire codebook prompt with a null embedding repeated \( T \) times, predict the output, and use that as \( l_u \). The guided logit for inference is defined as \( l_g = (1 + w) l_c - w l_u \).
During training, in each epoch we randomly select 10% of the training data and replace their prompts with the null embedding.
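A small sketch of these two pieces (guided logits at inference, conditioning dropout at training); the guidance weight value and the names here are placeholders, not the paper's code:
import torch
w = 1.5  # guidance weight (assumed value)
def cfg_logits(l_c, l_u, w):
    # l_c: logits conditioned on the speech-encoder prompt c
    # l_u: logits with the prompt replaced by the null embedding repeated T times
    return (1 + w) * l_c - w * l_u
def maybe_drop_prompt(prompt, null_embedding, p_drop=0.1):
    # Training-time conditioning dropout: with probability p_drop, replace the
    # whole prompt (assumed shape [T, D]) with the null embedding (shape [D]),
    # so the model also learns the unconditional distribution.
    if torch.rand(1).item() < p_drop:
        return null_embedding.unsqueeze(0).expand(prompt.shape[0], -1)
    return prompt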
When reading audio, it is better to normalize it before doing other operations (to avoid NaN problems).
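For example, a simple peak normalization right after loading (the path here is a placeholder):
import torchaudio
audio, sr = torchaudio.load("input.wav")  # [channels, T]
peak = audio.abs().max()
if peak > 0:
    audio = audio / peak * 0.95           # leave a little headroom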
This page lists commonly used Linux commands.
unzip example.zip -d /home/user/destination
tar -xzvf filename.tar.gz -C /path/to/output/directory
mkdir output
tar -xvf x.tar -C output
find . -type f | wc -l
tar -czvf archive_name.tar.gz /path/to/folder
tar -xzvf archive_name.tar.gz
Example:
sbatch -J ${JOB_NAME} -p ${card} -e ${log_path}/${current_datetime}.err -o ${log_path}/${current_datetime}.out -N ${num_nodes} --cpus-per-task=${cpus_per_task} \
--gres=dcu:${gpus_per_node} --ntasks-per-node=${ntasks_per_node} \
--exclusive \
We can use \ to separate a long command across multiple lines.
wer /public/home/qinxy/bltang/ml_framework_slurm/exp/dasb/target/wavlm_6_layer/output/wsj0_2mix/trans_output.txt
sudo systemctl status bluetooth
To create an audio file with reverberation:
Sample code:
import numpy as np
import scipy.signal as s
import torchaudio
import torch
import torchaudio.transforms as T
def reverb_rir(frames, rir):
    """
    frames: clean audio numpy array with shape [1, T]
    rir: room impulse response numpy array with shape [1, T']
    returns: reverberated audio (numpy) truncated to the original length, shape [T]
    """
    orig_frames_shape = frames.shape
    frames, rir_filter = np.squeeze(frames), np.squeeze(rir)
    frames = s.convolve(frames, rir_filter)
    # Rescale if the convolution pushed the peak close to clipping.
    actlev = np.max(np.abs(frames))
    if actlev > 0.99:
        frames = (frames / actlev) * 0.98
    frames = frames[:orig_frames_shape[1]]
    return frames
rir_impulse = "/home/bltang/work/data/impulse/datasets_fullband/impulse_responses/SLR26/simulated_rirs_48k/largeroom/Room002/Room002-00001.wav"
## 48khz
frame_path = "/home/bltang/work/voicefixer_main/test/clean/SSB00050001.wav"
## 44.1khz
frame, frame_rate = torchaudio.load(frame_path)
rir, rir_rate = torchaudio.load(rir_impulse)
print(f"loaded audio frame: {frame.shape}, sample rate: {frame_rate}")
print(f"loaded rir: {rir.shape}, sample rate: {rir_rate}")
## resample the RIR from 48 kHz to 44.1 kHz to match the clean audio
resampler = T.Resample(rir_rate, frame_rate, dtype=frame.dtype)
rir = resampler(rir)
frame = frame.numpy()
rir = rir.numpy()
## perform the convolution
output = reverb_rir(frame,rir)
output = torch.from_numpy(output).unsqueeze(0)
torchaudio.save("output.wav",output, frame_rate)
### perform clipping
clip_factor = 0.1
z = torch.clamp(output,min = output.min() * clip_factor, max = output.max() * clip_factor)
print(z.min())
torchaudio.save("clamp.wav",z, frame_rate)
These are my notes and tricks for PyTorch.
When calling to() on a tensor, the original tensor is not moved to the specified device in place; instead, a copy of the tensor is created on that device and returned as a new tensor.
If you want the variable to refer to the tensor on the device, you should do
x = x.to(device)
However, if you want to move a model to a device, you can just do
model.to(device)
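A small example illustrating the difference (assuming a CUDA device is available):
import torch
import torch.nn as nn
device = torch.device("cuda")
x = torch.randn(2, 3)
x.to(device)      # returns a GPU copy; x itself is still on the CPU
x = x.to(device)  # rebind x to the returned GPU tensor
model = nn.Linear(3, 4)
model.to(device)  # nn.Module.to moves the parameters in place, no rebinding needed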
For gradients to flow correctly during training, we should call the model through its forward method (i.e., model(x)) rather than through other custom methods.
When you're working with PyTorch and CUDA, setting the environment variable CUDA_VISIBLE_DEVICES defines which GPUs are available for PyTorch to use. It accepts a comma-separated list of physical device indices, which PyTorch then remaps to logical indices starting from 0.
So if you have, for example, four GPUs in your system (indices 0, 1, 2, 3) and you set CUDA_VISIBLE_DEVICES=2, PyTorch will only see one GPU, and it will be remapped to index 0 when accessed from PyTorch (not index 2).
Here's how you would set it and use torch.cuda.set_device in Python:
import os
import torch
# Make only the GPU with original index 2 available to PyTorch
os.environ["CUDA_VISIBLE_DEVICES"] = "2"
# Since CUDA_VISIBLE_DEVICES is set to '2', PyTorch will see this as device '0'
torch.cuda.set_device(0)  # Not 2
Remember, torch.cuda.set_device expects the logical device index, which means that, after setting CUDA_VISIBLE_DEVICES, the visible devices start from zero regardless of their actual hardware indices.
So if setting os.environ['CUDA_VISIBLE_DEVICES'] = '2' and then calling torch.cuda.set_device(2) results in a device error, that is likely because you have changed what device "2" actually refers to. After setting CUDA_VISIBLE_DEVICES, the device you are trying to access is at index "0".
You should do the following:
import os
import torch
os.environ['CUDA_VISIBLE_DEVICES'] = '2' # Now there's only one GPU visible to the process.
torch.cuda.set_device(0) # This refers to the single visible GPU, which is the original device 2.
Run these snippets before executing other PyTorch code that works with CUDA, as changing CUDA_VISIBLE_DEVICES after the CUDA context has been initialized may not have an effect.
This post really helps:
Let's assume we have a tensor A of shape [B, H, C]. The basic way to index it with multiple index tensors is A[batch, row, col]:
this collects the indices at the same position across the index tensors and uses them together as one index into A.
For example:
## creating A
t = torch.tensor([[1, 2], [3, 4]])
w = -t
###
A = torch.cat((t.unsqueeze(0), w.unsqueeze(0))) # [2, 2, 2]
"""
A:
tensor([[[ 1, 2],
[ 3, 4]],
[[-1, -2],
[-3, -4]]])
"""
batch = torch.LongTensor([[0,0],[1,1]]) # [2,2]
row = torch.LongTensor([[0,1],[1,1]]) # [2,2]
col = torch.LongTensor([[1,0],[1,0]]) # [2,2]
res = A[batch, row, col]
"""
tensor([[ 2, 3],
[-4, -3]])
"""
In the above code, the indices are (0,0,1), (0,1,0), (1,1,1), and (1,1,0).
The result is [[A[0,0,1],A[0,1,0]],[A[1,1,1],A[1,1,0]]]
Now assume we use only one tensor as the indexing tensor.
As in the example above, if we only use the batch tensor to index A, we get a result tensor of shape [2, 2, 2, 2]:
A[batch]
"""
torch.Size([2, 2, 2, 2])
"""
The principle is simple: for each element in the index tensor, we take the corresponding slice of A.
Therefore, the result will be [[A[0], A[0]], [A[1], A[1]]].
Each slice A[i] has shape [2, 2], so the result has shape [2, 2, 2, 2].
import os
import sys
import pickle
import joblib
import torch
from sklearn.cluster import MiniBatchKMeans
sys.path.append("../")
from models.modules.kmeans import KMeansQuantizer as Kmeans
### load kmeans model
kmeans_ckpt = "/home/bltang/work/test/kmeans_4096_batch_size_16.pkl"
def load_dict(file_path):
    with open(file_path, "rb") as file:
        return pickle.load(file)
res = load_dict(kmeans_ckpt)
/home/bltang/.local/lib/python3.8/site-packages/sklearn/base.py:348: InconsistentVersionWarning: Trying to unpickle estimator MiniBatchKMeans from version 1.5.1 when using version 1.3.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
warnings.warn(
for key in res:
    print(key)
resume
step
feature_list
kmeans
kmeans_model:MiniBatchKMeans = res["kmeans"]
joblib.dump(kmeans_model, "/home/bltang/work/wavlm_kmeans_backend/ckpt/kmeans_gigaspeech_4096.pt")
['/home/bltang/work/wavlm_kmeans_backend/ckpt/kmeans_gigaspeech_4096.pt']
kmeans_pytorch = Kmeans("/home/bltang/work/wavlm_kmeans_backend/ckpt/kmeans_gigaspeech_4096.pt")
x = torch.randint(0,4096, (2, 30)).long()
print(x)
kmeans_pytorch.emb(x)
tensor([[1418, 1143, 728, 3682, 2837, 299, 2354, 1865, 156, 830, 409, 892,
3274, 3682, 872, 510, 3785, 3780, 3297, 1187, 1234, 2075, 616, 2053,
1168, 635, 320, 1091, 738, 2829],
[ 158, 830, 4059, 3800, 1810, 2425, 3992, 1535, 3935, 1102, 297, 1540,
2149, 2414, 1222, 3968, 3072, 3858, 1115, 701, 2100, 2951, 1291, 3721,
1296, 2966, 960, 2327, 2188, 1239]])
tensor([[[ 3.4845, 0.5973, -0.2941, ..., 1.8850, -3.3292, -3.1918],
[-0.9843, -1.8948, 0.1781, ..., 1.0470, -2.2949, -3.0003],
[-2.4003, -0.3012, 2.5385, ..., 0.2260, -0.4540, -0.7819],
...,
[ 1.5334, -1.6515, 0.1410, ..., 0.8672, -1.2393, -0.5327],
[ 1.1343, -0.0071, 0.2196, ..., 1.9066, -3.0079, -2.2519],
[-0.8302, 0.0227, 2.9222, ..., -0.1998, 0.5812, -1.3175]],
[[ 0.5794, 0.3916, 3.1679, ..., -0.7281, -0.7220, -0.4178],
[-0.6731, -1.0334, 2.3820, ..., 2.1921, -1.0713, -0.1829],
[-1.8742, -3.6283, -0.3614, ..., 1.6077, -1.6078, -2.0266],
...,
[-0.5319, -0.7900, 2.3659, ..., -1.6811, -1.6981, -2.6017],
[ 3.1977, -3.8897, 0.7760, ..., -1.5524, 2.8854, -2.0698],
[ 0.0547, -0.4102, 2.9803, ..., 0.2289, -0.3999, -1.0432]]])
To run a single task using 4 GPUs:
#!/bin/bash
#SBATCH -J train_test
#SBATCH -n 32
#SBATCH -N 1
#SBATCH --gres=dcu:4
#SBATCH -p kshdnormal02
python ... # The command to run
Note that there is no srun before the command.
To run 8 tasks on a single node, you can run:
#!/bin/bash
#SBATCH -J train_test
#SBATCH --ntasks-per-node=8
#SBATCH -N 1
#SBATCH --cpus-per-task=2
#SBATCH --gres=dcu:4
#SBATCH -p kshdnormal
srun python test.py
To back up an environment, my solution is to use conda clone to clone the current environment into a backup environment.
Sample code:
conda create --name my_env_backup --clone my_env
This part describes my usual conventions for deep learning with PyTorch.
The following simple config describes my usual environment setup:
from argparse import Namespace
import argparse

class AttrDict(Namespace):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def __getattribute__(self, name: str):
        try:
            return super().__getattribute__(name)
        except AttributeError:
            return None

    def __getitem__(self, key):
        return self.__getattribute__(key)

x = argparse.ArgumentParser()
x.add_argument("--x", default=2)
x.add_argument("--y", default=3)
args = x.parse_args()

name = AttrDict(**vars(args))
print(name)
print(name.x)     # valid
print(name["x"])  # valid
print(name["y"])  # valid
For most model training, I use the Slurm system from Kunshan Super Computing, which has GPUs with 16GB and 32GB of memory.
The following commands are convenient for testing.
# allocate a 16GB server
salloc -p kshdnormal --ntasks-per-node=4 --cpus-per-task=8 --gres=dcu:4 -J inference_bltang --exclusive
# allocate a 32GB server
salloc -p kshdnormal --ntasks-per-node=4 --cpus-per-task=8 --gres=dcu:4 -J inference_bltang --exclusive
Using WavLM, SEF Network, and a Language Model for Target Speaker Separation
The full audio is available here.
The audios with the highest similarity are 4, 16, and 19.
The encoder-decoder loses the information of a male voice: 21, 22, 23 (must train kmeans and vocoder).
Unclear audio: 6, 17, 21.
The rest of the audios suffer from tone loss.
demo: