Source author record

Roberto L. Castro

Roberto L. Castro appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Topics: Machine Learning; Computer Vision; Distributed, Parallel, and Cluster Computing

Catalog footprint

What is connected: 3 works, 3 topics, 4 close collaborators

Preprint, 2026, arXiv

Quartet: Native FP4 Training Can Be Optimal for Large Language Models

Training large language models (LLMs) directly in low precision offers a way to address computational costs by improving both throughput and energy efficiency. For those purposes, NVIDIA's recent Blackwell architecture facilitates very low-precision operations using FP4 variants. Yet, current algorithms for training LLMs in FP4 precision face significant accuracy degradation and often rely on mixed-precision fallbacks. In this paper, we investigate hardware-supported FP4 training and introduce a new approach for accurate, end-to-end FP4 training with all the major computations (i.e., linear layers) in low precision. Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across bit-widths and training setups. Guided by this investigation, we design an "optimal" technique in terms of accuracy-vs-computation, called Quartet. We implement Quartet using optimized CUDA kernels tailored for Blackwell, demonstrating that fully FP4-based training is a competitive alternative to FP16 half-precision and to FP8 training. Our code is available at https://github.com/IST-DASLab/Quartet.
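To give a feel for how coarse a 4-bit floating-point grid is, the sketch below rounds a block of values to the FP4 E2M1 format (1 sign bit, 2 exponent bits, 1 mantissa bit) with a single shared scale. This is plain round-to-nearest with absmax scaling, chosen only for illustration; it is not the Quartet algorithm, whose internals the abstract does not describe. The function name `quantize_block_fp4` and the block layout are hypothetical.

```python
# Representable magnitudes of FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block_fp4(block):
    """Round a list of floats to E2M1 values using one shared absmax scale.

    Illustrative assumption: simple round-to-nearest, not the method from the paper.
    Returns the dequantized values and the scale that was used.
    """
    absmax = max(abs(x) for x in block)
    if absmax == 0.0:
        return [0.0] * len(block), 1.0
    scale = absmax / E2M1_GRID[-1]  # map the largest magnitude onto 6.0
    dequantized = []
    for x in block:
        magnitude = abs(x) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - magnitude))
        dequantized.append((nearest if x >= 0 else -nearest) * scale)
    return dequantized, scale


values, scale = quantize_block_fp4([6.0, -3.0, 1.2, 0.1])
print(values, scale)
```

With only eight magnitudes available, small entries in a block collapse to zero whenever one large entry dominates the scale, which is one reason naive FP4 rounding loses accuracy.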

Preprint, 2020, arXiv

A Hybrid Approach for Tracking Individual Players in Broadcast Match Videos

Tracking people in a video sequence is a challenging task that has been approached from many perspectives. The task becomes even more complicated when the person to track is a player in a broadcast sports event, because of frequent camera movements or switches, total and partial occlusions between players, and blurry frames caused by the video's encoding algorithm. This paper introduces a player tracking solution that is both fast and accurate, making it possible to track a player precisely in real time. The approach combines several models that are executed concurrently on relatively modest hardware, and its accuracy has been validated against hand-labeled broadcast video sequences. In terms of accuracy, the tests show that the area under the curve (AUC) of our approach is around 0.6, which is similar to generic state-of-the-art solutions. As for performance, our proposal can process high-definition video (1920x1080 px) at 80 fps.
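The abstract does not say which curve the AUC refers to. A common choice in single-object tracking benchmarks is the success plot, which shows the fraction of frames whose predicted box overlaps the ground truth by more than a threshold, swept over thresholds from 0 to 1. The sketch below computes that quantity under this assumption; the box format `(x, y, width, height)` and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_auc(predicted, ground_truth, steps=21):
    """Mean success rate over evenly spaced IoU thresholds in [0, 1].

    Assumes the success-plot definition of AUC; the paper may use another.
    """
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    if not overlaps:
        return 0.0
    thresholds = [i / (steps - 1) for i in range(steps)]
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(rates) / len(rates)
```

Under this definition, a score around 0.6 would mean that the success rate averaged over all overlap thresholds is about 60 percent.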

Preprint, 2020, arXiv

Reusing Trained Layers of Convolutional Neural Networks to Shorten Hyperparameters Tuning Time

Hyperparameter tuning is a time-consuming process, particularly when the architecture of the neural network is decided as part of it. For instance, in convolutional neural networks (CNNs), the number and characteristics of the hidden (convolutional) layers may be selected during tuning. This implies that the search involves training all of the candidate network architectures. This paper describes a proposal to reuse the weights of hidden (convolutional) layers across different training runs to shorten this process. The rationale is that if a set of convolutional layers has been trained to solve a given problem, the weights calculated in that training may be useful when a new convolutional layer is added to the network architecture. The idea has been tested on the CIFAR-10 dataset with different CNN architectures with up to 3 convolutional layers and up to 3 fully connected layers. The experiments compare the training time and the validation loss with and without reusing convolutional layers. They confirm that this strategy reduces the training time and even increases the accuracy of the resulting neural network. This finding opens up the possibility of integrating the strategy into existing AutoML methods to reduce the total search time.
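The reuse idea can be sketched without any deep learning framework: when a candidate architecture is initialised, layers that already exist in a previously trained network are copied, and only new or reshaped layers start from random weights. The abstract does not say how layers are matched, so matching by name and shape here is an assumption, and `init_layer` and `build_with_reuse` are hypothetical helpers.

```python
import random


def init_layer(shape, rng):
    """Fresh random weights for a layer, stored as a flat list plus its shape."""
    size = 1
    for dim in shape:
        size *= dim
    return {"shape": shape, "weights": [rng.gauss(0.0, 0.05) for _ in range(size)]}


def build_with_reuse(layer_shapes, trained, rng):
    """Initialise a network, copying trained layers whose name and shape match.

    Assumption: name-and-shape matching; the paper may transfer weights differently.
    Returns the new network and the names of the layers that were reused.
    """
    network, reused = {}, []
    for name, shape in layer_shapes.items():
        old = trained.get(name)
        if old is not None and old["shape"] == shape:
            network[name] = {"shape": shape, "weights": list(old["weights"])}
            reused.append(name)
        else:
            network[name] = init_layer(shape, rng)
    return network, reused


rng = random.Random(0)
# Stand-in for a network that has already been trained with one conv layer.
trained, _ = build_with_reuse({"conv1": (8, 3, 3, 3), "fc": (10, 8)}, {}, rng)
# Candidate architecture that adds a second conv layer and resizes the classifier.
candidate = {"conv1": (8, 3, 3, 3), "conv2": (16, 8, 3, 3), "fc": (10, 16)}
network, reused = build_with_reuse(candidate, trained, rng)
print(reused)
```

Here only `conv1` carries over: `conv2` is new, and `fc` changes shape because its input size depends on the last convolutional layer.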