Source author record

Taesu Kim

Taesu Kim appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning eess.AS Sound Artificial Intelligence Computation and Language Computer Vision

Catalog footprint

What is connected

5works

6topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

GraLoRA: Granular Low-Rank Adaptation for Parameter-Efficient Fine-Tuning

Low-Rank Adaptation (LoRA) is a popular method for parameter-efficient fine-tuning (PEFT) of generative models, valued for its simplicity and effectiveness. Despite recent enhancements, LoRA still suffers from a fundamental limitation: overfitting when the bottleneck is widened. It performs best at ranks 32-64, yet its accuracy stagnates or declines at higher ranks, still falling short of full fine-tuning (FFT) performance. We identify the root cause as LoRA's structural bottleneck, which introduces gradient entanglement to the unrelated input channels and distorts gradient propagation. To address this, we introduce a novel structure, Granular Low-Rank Adaptation (GraLoRA) that partitions weight matrices into sub-blocks, each with its own low-rank adapter. With negligible computational or storage cost, GraLoRA overcomes LoRA's limitations, effectively increases the representational capacity, and more closely approximates FFT behavior. Experiments on code generation and commonsense reasoning benchmarks show that GraLoRA consistently outperforms LoRA and other baselines, achieving up to +8.5% absolute gain in Pass@1 on HumanEval+. These improvements hold across model sizes and rank settings, making GraLoRA a scalable and robust solution for PEFT. Code, data, and scripts are available at https://github.com/SqueezeBits/GraLoRA.git

preprint2022arXiv

EdiTTS: Score-based Editing for Controllable Text-to-Speech

We present EdiTTS, an off-the-shelf speech editing methodology based on score-based generative modeling for text-to-speech synthesis. EdiTTS allows for targeted, granular editing of audio, both in terms of content and pitch, without the need for any additional training, task-specific optimization, or architectural modifications to the score-based model backbone. Specifically, we apply coarse yet deliberate perturbations in the Gaussian prior space to induce desired behavior from the diffusion model while applying masks and softening kernels to ensure that iterative edits are applied only to the target region. Through listening tests and speech-to-text back transcription, we show that EdiTTS outperforms existing baselines and produces robust samples that satisfy user-imposed requirements.

preprint2022arXiv

GP22: A Car Styling Dataset for Automotive Designers

An automated design data archiving could reduce the time wasted by designers from working creatively and effectively. Though many datasets on classifying, detecting, and instance segmenting on car exterior exist, these large datasets are not relevant for design practices as the primary purpose lies in autonomous driving or vehicle verification. Therefore, we release GP22, composed of car styling features defined by automotive designers. The dataset contains 1480 car side profile images from 37 brands and ten car segments. It also contains annotations of design features that follow the taxonomy of the car exterior design features defined in the eye of the automotive designer. We trained the baseline model using YOLO v5 as the design feature detection model with the dataset. The presented model resulted in an mAP score of 0.995 and a recall of 0.984. Furthermore, exploration of the model performance on sketches and rendering images of the car side profile implies the scalability of the dataset for design purposes.

preprint2022arXiv

Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS

Expressive text-to-speech has shown improved performance in recent years. However, the style control of synthetic speech is often restricted to discrete emotion categories and requires training data recorded by the target speaker in the target style. In many practical situations, users may not have reference speech recorded in target emotion but still be interested in controlling speech style just by typing text description of desired emotional style. In this paper, we propose a text-based interface for emotional style control and cross-speaker style transfer in multi-speaker TTS. We propose the bi-modal style encoder which models the semantic relationship between text description embedding and speech style embedding with a pretrained language model. To further improve cross-speaker style transfer on disjoint, multi-style datasets, we propose the novel style loss. The experimental results show that our model can generate high-quality expressive speech even in unseen style.

preprint2020arXiv

Learning pronunciation from a foreign language in speech synthesis networks

Although there are more than 6,500 languages in the world, the pronunciations of many phonemes sound similar across the languages. When people learn a foreign language, their pronunciation often reflects their native language's characteristics. This motivates us to investigate how the speech synthesis network learns the pronunciation from datasets from different languages. In this study, we are interested in analyzing and taking advantage of multilingual speech synthesis network. First, we train the speech synthesis network bilingually in English and Korean and analyze how the network learns the relations of phoneme pronunciation between the languages. Our experimental result shows that the learned phoneme embedding vectors are located closer if their pronunciations are similar across the languages. Consequently, the trained networks can synthesize the English speakers' Korean speech and vice versa. Using this result, we propose a training framework to utilize information from a different language. To be specific, we pre-train a speech synthesis network using datasets from both high-resource language and low-resource language, then we fine-tune the network using the low-resource language dataset. Finally, we conducted more simulations on 10 different languages to show it is generally extendable to other languages.

Taesu Kim

What is connected

Connect this record

See the researcher in context

Building this map preview

5 published item(s)

GraLoRA: Granular Low-Rank Adaptation for Parameter-Efficient Fine-Tuning

EdiTTS: Score-based Editing for Controllable Text-to-Speech

GP22: A Car Styling Dataset for Automotive Designers

Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS

Learning pronunciation from a foreign language in speech synthesis networks