I work at Descript as an AI researcher, developing audio processing technology for the software product. Back in school I studied computer science, mathematics, statistics, and finance. I am originally from Malaysia.
I work across audio-processing projects, mainly speech synthesis. I develop machine learning algorithms and their interfaces with the product backend, and I design evaluation systems to benchmark technology progress. On the product front, I work with product managers to dissect technology issues faced by users.
This professional program is geared toward training students for applied research roles in the software technology industry. My program concentration is Machine Learning; I completed courses and technical projects covering both theory and applications in the field.
Courses completed:
CSC2541 - Scalable and Flexible Models of Uncertainty
CSC2547 - Learning Discrete Latent Structure
CSC2548 - Machine Learning in Computer Vision
My program concentrations span the fields of Mathematics, Statistics, Economics, and Finance.
Courses completed:
MAT357 - Real Analysis
APM466 - Mathematical Finance
ECO326 - Game Theory
STA447 - Stochastic Processes
STA414 - Statistical Methods for Machine Learning
CSC321 - Neural Networks
In this work, we explore the features used by humans and by ConvNets to classify faces. We use Guided Backpropagation to visualize the facial features that most influence a ConvNet's output when identifying specific individuals. We also develop a human intelligence task to find out which facial features humans consider most important, and we examine the differences between the saliency information gathered from humans and from ConvNets.
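The core of Guided Backpropagation is a modified ReLU backward pass: gradients are zeroed both where the forward activation was negative and where the incoming gradient is negative. A minimal PyTorch sketch (the tiny model and `saliency` helper are illustrative, not the project's actual network):

```python
import torch
import torch.nn as nn

class GuidedReLU(torch.autograd.Function):
    """ReLU whose backward pass implements the 'guided' gradient rule."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Keep a gradient only where the input was positive (standard ReLU
        # rule) AND the incoming gradient is positive (the guided rule).
        return grad_out * (x > 0).float() * (grad_out > 0).float()

def saliency(model, image, target_class):
    """Return a guided-backprop saliency map for one input image."""
    image = image.clone().requires_grad_(True)
    logits = model(image)
    logits[0, target_class].backward()
    return image.grad.abs()

# Toy model standing in for a face-identification ConvNet.
class TinyNet(nn.Module):
    def forward(self, x):
        return GuidedReLU.apply(x).flatten(1).sum(dim=1, keepdim=True)

img = torch.randn(1, 1, 4, 4)
sal = saliency(TinyNet(), img, target_class=0)  # same shape as img
```

The resulting map highlights input pixels whose increase would raise the target logit, which is what gets compared against the human saliency annotations.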
Paper
We successfully train GANs to generate high-quality, coherent audio waveforms, and apply our models to speech and music synthesis tasks. Our model is non-autoregressive, has significantly fewer parameters than competing models, and generalizes to unseen speakers for speech synthesis. Our PyTorch implementation runs more than 100x faster than real time on a GTX 1080Ti GPU and more than 2x faster than real time on CPU, without any hardware-specific optimization tricks.
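The speed comes from the non-autoregressive design: a stack of transposed convolutions upsamples a mel-spectrogram to a raw waveform in a single forward pass, with no sample-by-sample recurrence. A hypothetical, heavily simplified generator sketch (layer sizes and upsampling factors are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

class TinyVocoder(nn.Module):
    """Toy non-autoregressive mel-to-waveform generator."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=7, padding=3),
            nn.LeakyReLU(0.2),
            # Each transposed conv upsamples the time axis by its stride.
            nn.ConvTranspose1d(64, 32, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(32, 1, kernel_size=16, stride=8, padding=4),
            nn.Tanh(),  # waveform samples in [-1, 1]
        )

    def forward(self, mel):   # mel: (batch, n_mels, frames)
        return self.net(mel)  # (batch, 1, frames * 64)

mel = torch.randn(1, 80, 10)
wav = TinyVocoder()(mel)  # 10 frames -> 640 waveform samples
```

Because every output sample is produced in parallel, throughput is bounded by convolution cost rather than sequence length, unlike autoregressive vocoders.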
Blog Post Paper Code
In this project we work towards a Text-to-Speech (TTS) model for cloned voices with controllable prosody. Our goal is two-fold: to copy the vocal identity of a new speaker in a time- and data-efficient manner, and to generate artificial speech in the cloned voice with emotion that has not previously been observed in the new speaker's recordings.
Poster
We provide a model for instantaneous-feedback recommendation systems. We model the ratings in the MovieLens dataset using neural network probabilistic factorization models, approximating the posterior distributions of the user and item latent vectors with stochastic variational inference. We then recast the items in the ratings matrix as arms in a multi-armed bandit: the predictive reward distribution of consuming each item depends on the respective posterior latents and is adjusted as new ratings are realized. By applying bandit policy functions, recommendations are provided to each user with a balance between exploring and exploiting their interests.
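The exploration/exploitation step can be sketched with Thompson sampling, one standard bandit policy. In this hypothetical toy, each item keeps a one-dimensional Gaussian posterior over its expected reward (standing in for the latent-vector posteriors from the project); sampling from wide posteriors drives exploration, while high posterior means drive exploitation:

```python
import numpy as np

rng = np.random.default_rng(0)

class ThompsonRecommender:
    """Toy Thompson-sampling policy over per-item Gaussian reward posteriors."""
    def __init__(self, n_items, prior_var=1.0, noise_var=1.0):
        self.mu = np.zeros(n_items)              # posterior means
        self.var = np.full(n_items, prior_var)   # posterior variances
        self.noise_var = noise_var

    def recommend(self):
        # Draw one sample per item posterior, recommend the argmax.
        samples = rng.normal(self.mu, np.sqrt(self.var))
        return int(np.argmax(samples))

    def update(self, item, rating):
        # Conjugate Gaussian update of the chosen item's posterior
        # as a new rating is realized.
        prec_old = 1.0 / self.var[item]
        prec_new = prec_old + 1.0 / self.noise_var
        self.mu[item] = (prec_old * self.mu[item]
                         + rating / self.noise_var) / prec_new
        self.var[item] = 1.0 / prec_new

rec = ThompsonRecommender(n_items=3)
for _ in range(50):
    rec.update(0, 1.0)  # repeated positive ratings shrink item 0's posterior
```

As an item accumulates ratings its posterior narrows around the observed mean, so the policy gradually shifts from exploring it to exploiting (or ignoring) it.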
Report
The purpose of this project is to reproduce the results in the paper "Generating Sequences With Recurrent Neural Networks" by Alex Graves. The paper is an important milestone for the class of variable-length sequence generative models. Its location-based attention mechanism, used to condition handwriting generation on text, is a key engineering innovation that has inspired many modern text-to-speech models.
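The location-based attention in that paper is a mixture of K Gaussian windows over character positions u = 0..U-1; the window centers kappa only ever move forward, which keeps the alignment monotonic along the text. A small NumPy sketch of the window computation (parameter values here are illustrative):

```python
import numpy as np

def gaussian_window(alpha, beta, kappa, U):
    """Mixture-of-Gaussians attention weights over U character positions.

    alpha, beta, kappa: arrays of shape (K,) -- mixture importance,
    width, and center for each of the K window components.
    Returns phi of shape (U,).
    """
    u = np.arange(U)[None, :]  # (1, U) character positions
    phi = alpha[:, None] * np.exp(-beta[:, None] * (kappa[:, None] - u) ** 2)
    return phi.sum(axis=0)

# At each output step the network emits a positive increment for kappa,
# so the window slides monotonically forward over the text.
alpha, beta = np.array([1.0]), np.array([0.5])
kappa = np.array([0.0])
for step in range(3):
    kappa = kappa + np.exp(-1.0)  # exp of a raw output ensures positivity
    phi = gaussian_window(alpha, beta, kappa, U=10)
```

This soft, monotonically advancing window is essentially the same alignment device later adopted by attention-based text-to-speech models.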
Code