Researcher profile

Andrew Hines

Andrew Hines contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging | Verification L1 | Unclaimed author
12 works
0 followers
12 topics
4 close collaborators

Published work

12 published items

preprint | 2026 | arXiv

End-to-End Evaluation and Governance of an EHR-Embedded AI Agent for Clinicians

Clinical AI systems require not just point-in-time evaluation but continuous governance: the ongoing practice of monitoring, evaluating, iterating, and re-evaluating performance throughout deployment. We present an end-to-end framework of governance that integrates rubric validation, live deployment feedback, technical performance monitoring, and cost tracking, with controlled experimentation gating system changes before deployment. Applied to Hyperscribe, an EHR-embedded agent that converts ambient audio into structured chart updates, twenty clinicians authored 1,646 validated rubrics across 823 cases. Seven Hyperscribe versions were evaluated through controlled experiments, with median scores improving from 84% to 95%. Analysis of 107 live feedback entries over three months showed feedback composition shifting from 79% error reports and 14% positive observations to 30% errors and 45% positive observations as engineering interventions resolved failures. Median processing time per audio segment was 8.1 seconds with a 99.6% effective completion rate after retry mechanisms absorbed transient model errors. These results demonstrate that continuous, multi-channel governance of deployed clinical AI is both achievable and effective.

preprint | 2022 | arXiv

AQP: An Open Modular Python Platform for Objective Speech and Audio Quality Metrics

Audio quality assessment has been widely researched in the signal processing area. Full-reference objective metrics (e.g., POLQA, ViSQOL) have been developed to estimate the audio quality relying only on human rating experiments. To evaluate the audio quality of novel audio processing techniques, researchers constantly need to compare objective quality metrics. Testing different implementations of the same metric and evaluating new datasets are fundamental and ongoing iterative activities. In this paper, we present AQP - an open-source, node-based, light-weight Python pipeline for audio quality assessment. AQP allows researchers to test and compare objective quality metrics helping to improve robustness, reproducibility and development speed. We introduce the platform, explain the motivations, and illustrate with examples how, using AQP, objective quality metrics can be (i) compared and benchmarked; (ii) prototyped and adapted in a modular fashion; (iii) visualised and checked for errors. The code has been shared on GitHub to encourage adoption and contributions from the community.

preprint | 2022 | arXiv

Exploring the influence of fine-tuning data on wav2vec 2.0 model for blind speech quality prediction

Recent studies have shown how self-supervised models can produce accurate speech quality predictions. Speech representations generated by the pre-trained wav2vec 2.0 model allow constructing robust prediction models using small amounts of annotated data. This opens the possibility of developing strong models in scenarios where labelled data is scarce. It is known that fine-tuning improves the model's performance; however, it is unclear how the data (e.g., language, number of samples) used for fine-tuning influences that performance. In this paper, we explore how using different speech corpora to fine-tune wav2vec 2.0 can influence its performance. We took four speech datasets containing degradations found in common conferencing applications and fine-tuned wav2vec 2.0 targeting different languages and data size scenarios. The fine-tuned models were tested across all four conferencing datasets plus an additional dataset containing synthetic speech and they were compared against three external baseline models. Results showed that fine-tuned models were able to compete with baseline models. Larger fine-tuning data guarantee better performance; meanwhile, diversity in language helped the models deal with specific languages. Further research is needed to evaluate other wav2vec 2.0 models pre-trained with multi-lingual datasets and to develop prediction models that are more resilient to language diversity.

preprint | 2022 | arXiv

Supervised Learning based QoE Prediction of Video Streaming in Future Networks: A Tutorial with Comparative Study

Quality of Experience (QoE) based service management remains key for successful provisioning of multimedia services in next-generation networks such as 5G/6G, which requires proper tools for quality monitoring, prediction and resource management where machine learning (ML) can play a crucial role. In this paper, we provide a tutorial on the development and deployment of the QoE measurement and prediction solutions for video streaming services based on supervised learning ML models. Firstly, we provide a detailed pipeline for developing and deploying supervised learning-based video streaming QoE prediction models which covers several stages including data collection, feature engineering, model optimization and training, testing, prediction and evaluation. Secondly, we discuss the deployment of the ML model for the QoE prediction/measurement in the next generation networks (5G/6G) using network enabling technologies such as Software-Defined Networking (SDN), Network Function Virtualization (NFV) and Mobile Edge Computing (MEC) by proposing a reference architecture. Thirdly, we present a comparative study of the state-of-the-art supervised learning ML models for QoE prediction of video streaming applications based on multiple performance metrics.

preprint | 2022 | arXiv

Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset

Non-reference speech quality models are important for a growing number of applications. The VoiceMOS 2022 challenge provided a dataset of synthetic voice conversion and text-to-speech samples with subjective labels. This study looks at the amount of variance that can be explained in subjective ratings of speech quality from metadata and the distribution imbalances of the dataset. Speech quality models were constructed using wav2vec 2.0 with additional metadata features that included rater groups and system identifiers and obtained competitive metrics including a Spearman rank correlation coefficient (SRCC) of 0.934 and MSE of 0.088 at the system-level, and 0.877 and 0.198 at the utterance-level. Using data and metadata that the test restricted or blinded further improved the metrics. A metadata analysis showed that the system-level metrics do not represent the model's system-level prediction as a result of the wide variation in the number of utterances used for each system on the validation and test datasets. We conclude that, in general, conditions should have enough utterances in the test set to bound the sample mean error, and be relatively balanced in utterance count between systems, otherwise the utterance-level metrics may be more reliable and interpretable.
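The distinction between utterance-level and system-level metrics drawn in this abstract can be illustrated with a small sketch. This is not the authors' evaluation code: the helper names (`average_ranks`, `pearson`, `srcc`, `system_level_srcc`) and the toy ratings are hypothetical, chosen only to show how averaging per system before correlating can hide utterance-level errors when a system has few utterances.

```python
from collections import defaultdict
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def system_level_srcc(records):
    """Average ratings and predictions per system, then correlate the means."""
    by_system = defaultdict(lambda: ([], []))
    for system, rated, predicted in records:
        by_system[system][0].append(rated)
        by_system[system][1].append(predicted)
    rated_means = [mean(r) for r, _ in by_system.values()]
    predicted_means = [mean(p) for _, p in by_system.values()]
    return srcc(rated_means, predicted_means)


# Hypothetical toy data: (system id, subjective rating, model prediction).
# System "B" has a single utterance, so its mean rests on one sample.
records = [
    ("A", 4.0, 3.0), ("A", 3.0, 4.0),
    ("B", 2.0, 1.5),
    ("C", 5.0, 4.5), ("C", 4.5, 5.0),
]

utterance_srcc = srcc([r for _, r, _ in records], [p for _, _, p in records])
system_srcc = system_level_srcc(records)
print(f"utterance-level SRCC: {utterance_srcc:.3f}")
print(f"system-level SRCC:    {system_srcc:.3f}")
```

In this toy data the per-system averages cancel the within-system ranking errors, so the system-level correlation is perfect while the utterance-level correlation is not.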

preprint | 2021 | arXiv

WARP-Q: Quality Prediction For Generative Neural Speech Codecs

Good speech quality has been achieved using waveform matching and parametric reconstruction coders. Recently developed very low bit rate generative codecs can reconstruct high quality wideband speech with bit streams less than 3 kb/s. These codecs use a DNN with parametric input to synthesise high quality speech outputs. Existing objective speech quality models (e.g., POLQA, ViSQOL) do not accurately predict the quality of coded speech from these generative models, underestimating quality due to signal differences not highlighted in subjective listening tests. We present WARP-Q, a full-reference objective speech quality metric that uses dynamic time warping cost for MFCC speech representations. It is robust to small perceptual signal changes. Evaluation using waveform matching, parametric and generative neural vocoder based codecs as well as channel and environmental noise shows that WARP-Q has better correlation and codec quality ranking for novel codecs compared to traditional metrics, in addition to versatility for general quality assessment scenarios.
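As an illustration of the dynamic time warping cost mentioned in this abstract, the sketch below computes a classic accumulated DTW cost between two sequences of feature vectors. It is not the WARP-Q implementation: the function name `dtw_cost` is hypothetical, the Euclidean frame distance is an assumption, and the short 2-D tuples merely stand in for MFCC frames.

```python
import math


def dtw_cost(ref, deg):
    """Accumulated cost of the best DTW alignment between two frame sequences.

    Each sequence is a list of equal-length feature vectors (toy stand-ins
    for MFCC frames). Frame distance is Euclidean, which is an assumption.
    """
    n, m = len(ref), len(deg)
    inf = float("inf")
    # acc[i][j] = minimal cost of aligning ref[:i] with deg[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(ref[i - 1], deg[j - 1])
            # Allowed steps: advance ref only, deg only, or both.
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


# Hypothetical toy sequences.
reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
stretched = [(0.0, 0.0), (1.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
distorted = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (3.0, 1.0)]

print(dtw_cost(reference, stretched))  # timing change only
print(dtw_cost(reference, distorted))  # every frame altered
```

The repeated frame in `stretched` is absorbed by the warping path at no cost, whereas `distorted` accumulates cost on every aligned frame. That tolerance to timing differences is the general property that makes DTW attractive for comparing signals that are not sample-aligned.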

preprint | 2020 | arXiv

Could regulating the creators deliver trustworthy AI?

Is a new regulated profession, such as Artificial Intelligence (AI) Architect, who is responsible and accountable for AI outputs, necessary to ensure trustworthy AI? AI is becoming all pervasive and is often deployed in everyday technologies, devices and services without our knowledge. There has been heightened awareness of AI in recent years, which has brought with it fear. This fear is compounded by the inability to point to a trustworthy source of AI; however, even the term "trustworthy AI" itself is troublesome. Some consider trustworthy AI to be that which complies with relevant laws, while others point to the requirement to comply with ethics and standards (whether in addition to or in isolation of the law). This immediately raises questions of whose ethics and which standards should be applied and whether these are sufficient to produce trustworthy AI in any event.

preprint | 2020 | arXiv

How Crisp is the Crease? A Subjective Study on Web Browsing Perception of Above-The-Fold

Quality of Experience (QoE) for various types of websites has gained significant attention in recent years. In order to design and evaluate websites, a metric that can estimate a user's experienced quality robustly for diverse content is necessary. SpeedIndex (SI) has been widely adopted to estimate perceived web page loading progress. It measures the speed of rendering pixels for the part of the webpage that is visible in the browser window. This is termed Above-The-Fold (ATF). The influence of animated content on the perception of ATF has been less comprehensively explored. In this paper, we present an experimental design and methodology to measure ATF perception for websites with and without animated elements for various page content categories. We found that pages with animated elements caused people to have more varied perceptions of ATF under different network conditions. Animated content also impacts the page load estimation accuracy of SI for websites. We discuss how the difference in the perception of ATF will impact the QoE management of web applications. We explain the necessity of revisiting the visual assessment of ATF to include animated content and improve the robustness of metrics like SI.

preprint | 2020 | arXiv

How deep is your encoder: an analysis of features descriptors for an autoencoder-based audio-visual quality metric

The development of audio-visual quality assessment models poses a number of challenges in order to obtain accurate predictions. One of these challenges is the modelling of the complex interaction that audio and visual stimuli have and how this interaction is interpreted by human users. The No-Reference Audio-Visual Quality Metric Based on a Deep Autoencoder (NAViDAd) deals with this problem from a machine learning perspective. The metric receives two sets of audio and video feature descriptors and produces a low-dimensional set of features used to predict the audio-visual quality. A basic implementation of NAViDAd was able to produce accurate predictions when tested with a range of different audio-visual databases. The current work performs an ablation study on the base architecture of the metric. Several modules are removed or re-trained using different configurations to have a better understanding of the metric functionality. The results presented in this study provided important feedback that allows us to understand the real capacity of the metric's architecture and eventually develop a much better audio-visual quality metric.

preprint | 2020 | arXiv

Speech Quality Factors for Traditional and Neural-Based Low Bit Rate Vocoders

This study compares the performance of different algorithms for coding speech at low bit rates. In addition to widely deployed traditional vocoders, a selection of recently developed generative-model-based coders at different bit rates is contrasted. The coded speech is evaluated for different quality aspects: accuracy of pitch period estimation, the word error rates for automatic speech recognition, and the influence of speaker gender and coding delays. A number of performance metrics of speech samples taken from a publicly available database were compared with subjective scores. Results from subjective quality assessment do not correlate well with existing full reference speech quality metrics. The results provide valuable insights into aspects of the speech signal that will be used to develop a novel metric to accurately predict speech quality from generative-model-based coders.

preprint | 2020 | arXiv

ViSQOL v3: An Open Source Production Ready Objective Speech and Audio Metric

Estimation of perceptual quality in audio and speech is possible using a variety of methods. The combined v3 release of ViSQOL and ViSQOLAudio (for speech and audio, respectively) provides improvements upon previous versions, in terms of both design and usage. As an open source C++ library or binary with permissive licensing, ViSQOL can now be deployed beyond the research context into production usage. The feedback from internal production teams at Google has helped to improve this new release, and serves to show cases where it is most applicable, as well as to highlight limitations. The new model is benchmarked against real-world data for evaluation purposes. The trends and direction of future work are discussed.

preprint | 2020 | arXiv

You Drive Me Crazy! Interactive QoE Assessment for Telepresence Robot Control

Telepresence robots (TPRs) are versatile, remotely controlled vehicles that enable physical presence and human-to-human interaction over a distance. Thanks to improving hardware and dropping price points, TPRs enjoy growing interest in various industries and application domains. Still, a satisfying experience remains key for their acceptance and successful adoption, not only in terms of enabling remote communication with others, but also in terms of managing robot mobility by means of remote navigation. This paper focuses on the latter aspect of remote operation, which has been hitherto neglected. We present the results of an extensive subjective study designed to systematically assess remote navigation Quality of Experience (QoE) in the context of using a TPR live over the Internet. Participants were 'beamed' into a remote office space and asked to perform characteristic TPR remote operation tasks (driving, turning, parking). Visual and control dimensions of their experience were systematically impaired by altering network characteristics (bandwidth, delay and packet loss rate) in a controlled fashion. Our results show that users can differentiate well between visual and navigation/control aspects of their experience. Furthermore, QoE impairment sensitivity varies with the actual task at hand.