Source author record

Waleed A. Yousef

Waleed A. Yousef appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning Applications Cryptography and Security Graphics

Catalog footprint

What is connected

4works

4topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Classifier Calibration: with application to threat scores in cybersecurity

This paper explores the calibration of a classifier output score in binary classification problems. A calibrator is a function that maps the arbitrary classifier score, of a testing observation, onto $[0,1]$ to provide an estimate for the posterior probability of belonging to one of the two classes. Calibration is important for two reasons; first, it provides a meaningful score, that is the posterior probability; second, it puts the scores of different classifiers on the same scale for comparable interpretation. The paper presents three main contributions: (1) Introducing multi-score calibration, when more than one classifier provides a score for a single observation. (2) Introducing the idea that the classifier scores to a calibration process are nothing but features to a classifier, hence proposing expanding the classifier scores to higher dimensions to boost the calibrator's performance. (3) Conducting a massive simulation study, in the order of 24,000 experiments, that incorporates different configurations, in addition to experimenting on two real datasets from the cybersecurity domain. The results show that there is no overall winner among the different calibrators and different configurations. However, general advices for practitioners include the following: the Platt's calibrator~\citep{Platt1999ProbabilisticOutputsForSupport}, a version of the logistic regression that decreases bias for a small sample size, has a very stable and acceptable performance among all experiments; our suggested multi-score calibration provides better performance than single score calibration in the majority of experiments, including the two real datasets. In addition, expanding the scores can help in some experiments.

preprint2022arXiv

JSOL: JavaScript Open-source Library for Grammar of Graphics

In this paper, we introduce the JavaScript Open-source Library (\libname), a high-level grammar for representing data in visualization graphs and plots. \libname~perspective on the grammar of graphics is unique; it provides state-of-art rules for encoding visual primitives that can be used to generate a known scene or to invent a new one. \libname~has ton rules developed specifically for data-munging, mapping, and visualization through many layers, such as algebra, scales, and geometries. Additionally, it has a compiler that incorporates and combines all rules specified by a user and put them in a flow to validate it as a visualization grammar and check its requisites. Users can customize scenes through a pipeline that either puts customized rules or comes with new ones. We evaluated \libname~on a multitude of plots to check rules specification of customizing a specific plot. Although the project is still under development and many enhancements are under construction, this paper describes the first developed version of \libname, circa 2016, where an open-source version of it is available. One immediate practical deployment for JSOl is to be integrated with the open-source version of the Data Visualization Platform (DVP) \citep{Yousef2019DVP-arxiv}

preprint2022arXiv

On The Smoothness of Cross-Validation-Based Estimators Of Classifier Performance

Many versions of cross-validation (CV) exist in the literature; and each version though has different variants. All are used interchangeably by many practitioners; yet, without explanation to the connection or difference among them. This article has three contributions. First, it starts by mathematical formalization of these different versions and variants that estimate the error rate and the Area Under the ROC Curve (AUC) of a classification rule, to show the connection and difference among them. Second, we prove some of their properties and prove that many variants are either redundant or "not smooth". Hence, we suggest to abandon all redundant versions and variants and only keep the leave-one-out, the $K$-fold, and the repeated $K$-fold. We show that the latter is the only among the three versions that is "smooth" and hence looks mathematically like estimating the mean performance of the classification rules. However, empirically, for the known phenomenon of "weak correlation", which we explain mathematically and experimentally, it estimates both conditional and mean performance almost with the same accuracy. Third, we conclude the article with suggesting two research points that may answer the remaining question of whether we can come up with a finalist among the three estimators: (1) a comparative study, that is much more comprehensive than those available in literature and conclude no overall winner, is needed to consider a wide range of distributions, datasets, and classifiers including complex ones obtained via the recent deep learning approach. (2) we sketch the path of deriving a rigorous method for estimating the variance of the only "smooth" version, repeated $K$-fold CV, rather than those ad-hoc methods available in the literature that ignore the covariance structure among the folds of CV.

preprint2022arXiv

UN-AVOIDS: Unsupervised and Nonparametric Approach for Visualizing Outliers and Invariant Detection Scoring

The visualization and detection of anomalies (outliers) are of crucial importance to many fields, particularly cybersecurity. Several approaches have been proposed in these fields, yet to the best of our knowledge, none of them has fulfilled both objectives, simultaneously or cooperatively, in one coherent framework. The visualization methods of these approaches were introduced for explaining the output of a detection algorithm, not for data exploration that facilitates a standalone visual detection. This is our point of departure: UN-AVOIDS, an unsupervised and nonparametric approach for both visualization (a human process) and detection (an algorithmic process) of outliers, that assigns invariant anomalous scores (normalized to $[0,1]$), rather than hard binary-decision. The main aspect of novelty of UN-AVOIDS is that it transforms data into a new space, which is introduced in this paper as neighborhood cumulative density function (NCDF), in which both visualization and detection are carried out. In this space, outliers are remarkably visually distinguishable, and therefore the anomaly scores assigned by the detection algorithm achieved a high area under the ROC curve (AUC). We assessed UN-AVOIDS on both simulated and two recently published cybersecurity datasets, and compared it to three of the most successful anomaly detection methods: LOF, IF, and FABOD. In terms of AUC, UN-AVOIDS was almost an overall winner. The article concludes by providing a preview of new theoretical and practical avenues for UN-AVOIDS. Among them is designing a visualization aided anomaly detection (VAAD), a type of software that aids analysts by providing UN-AVOIDS' detection algorithm (running in a back engine), NCDF visualization space (rendered to plots), along with other conventional methods of visualization in the original feature space, all of which are linked in one interactive environment.

Waleed A. Yousef

What is connected

Connect this record

See the researcher in context

Building this map preview

4 published item(s)

Classifier Calibration: with application to threat scores in cybersecurity

JSOL: JavaScript Open-source Library for Grammar of Graphics

On The Smoothness of Cross-Validation-Based Estimators Of Classifier Performance

UN-AVOIDS: Unsupervised and Nonparametric Approach for Visualizing Outliers and Invariant Detection Scoring