Document worth reading: “A systematic review of fuzzing based on machine learning techniques”

Security vulnerabilities play a vital role in network security systems. Fuzzing is widely used as a vulnerability discovery technique to reduce damage in advance. However, traditional fuzzing techniques face many challenges, such as how to mutate input seed files, how to increase code coverage, and how to effectively bypass verification. Machine learning has been introduced into fuzz testing as a new method to alleviate these challenges. This paper reviews recent research progress in using machine learning for fuzz testing, analyzes how machine learning improves the fuzzing process and results, and sheds light on future work in fuzzing. Firstly, this paper discusses the reasons why machine learning techniques can be used in fuzzing scenarios and identifies six different stages in which machine learning has been used. Then this paper systematically studies machine learning based fuzzing models in terms of the selection of machine learning algorithms, pre-processing methods, datasets, evaluation metrics, and hyperparameter settings. Next, this paper assesses the performance of the machine learning models based on the frequently used evaluation metrics. The results of the evaluation show that machine learning has an acceptable predictive capability for classification in fuzzing. Finally, the capability of discovering vulnerabilities is compared between traditional fuzzing tools and machine learning based fuzzing tools. The results show that the introduction of machine learning can improve the performance of fuzzing. However, there are still some limitations, such as unbalanced training samples and the difficulty of extracting characteristics related to vulnerabilities.


Magister Dixit

“The stunning achievements and advancements made on behalf of humanity within mathematics are paramount. After all, it is thanks to engineers, mathematicians and statisticians that we ever put a man on the moon, that we’ve ever mapped the floor of the ocean, that human eyes have ever been able to see out-of-this-world phenomena the likes of colliding galaxies or the rings of Jupiter. In fact, it’s arguable that mathematics is the very foundation of our physical world. The Fibonacci sequence, while prone to fodder for conspiracy theorists, can be found in every corner of the universe, from spiral galaxies, to sea shells, to the ratios of your facial features.” Tracey Wallace ( September 8, 2014 )

What’s going on on PyPI

Scanning all newly published packages on PyPI, I know that the quality is often quite bad. I try to filter out the worst ones and list here those that might be worth a look, worth following, or that might inspire you in some way.

TorchACG is a PyTorch-based framework for GAN-based ACG applications.

TorchStyle is a PyTorch-based framework for GAN-based Neural Style Transfer.

TPU index is a package for fast similarity search over large collections of high-dimensional vectors on Google Cloud TPUs.

Train and use expectation detector in Robot Framework tests.

Transformations for Linear Model

Turkish Asciifier/Deasciifier Library

Turkish WordNet KeNet

Type stubs for Python machine learning libraries

Typo is the intelligent data quality barrier for enterprise information systems. The Typo tap retrieves results and data from the Typo platform.

Typo is the intelligent data quality barrier for enterprise information systems. The Typo target communicates with Singer taps, and consumes data that conforms to the Singer JSON specification.

Typo is the intelligent data quality barrier for enterprise information systems. The Typo target proxy communicates with Singer taps, consumes data that conforms to the Singer JSON specification, and provides data quality services to data in motion.

Utilities for preprocessing texts

What’s going on on PyPI

Statistical data visualization from the command line

Tensor Decomposition Library

Tensorboard integrations for convolut framework

TensorFlow Estimator (GPU).

Text-to-Text Transformer for Korean QA Task

Integration of some popular transfer learning methods

This package is intended for those who wish to conduct an extreme value analysis. It provides the toolkit necessary to create a threshold model in a simple and efficient way, presenting the main methods for the Peak-Over-Threshold method and for fitting the Generalized Pareto Distribution. To install and use it, go to https://…/thresholdmodeling

This package represents the code used for the publication of the article https://…/1901.00519

This project is a collection of methods which are frequently used in Python data science projects.

Tools for Air Quality Data Analysis.

Tools for managing data flow of groundwater series

Torch_Template – A PyTorch template with commonly used models and tools

Magister Dixit

“A data scientist must possess the knack of being able to ‘identify business value from mathematical models.’ But that vital business value can only materialize if the data scientist also networks with other departments, understands their objectives, is familiar with their data and processes – and can spot the analysis options they provide.” Andreas Schmitz, Alexander Linden ( March 11, 2015 )

If you did not already know

Stop Words
In computing, stop words are words which are filtered out before or after processing of natural language data (text). There is not one definite list of stop words which all tools use and such a filter is not always used. Some tools specifically avoid removing them to support phrase search. Any group of words can be chosen as the stop words for a given purpose. For some search engines, these are some of the most common, short function words, such as the, is, at, which, and on. In this case, stop words can cause problems when searching for phrases that include them, particularly in names such as ‘The Who’, ‘The The’, or ‘Take That’. Other search engines remove some of the most common words (including lexical words, such as ‘want’) from a query in order to improve performance. …
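As a minimal sketch of the filtering described above, the snippet below drops stop words from a whitespace-tokenised string. The function name `remove_stop_words` and the naive tokenisation are assumptions made for illustration; the stop list contains only the five example words from the text, not any tool's actual list.

```python
# Illustrative stop list: only the example words mentioned in the text.
STOP_WORDS = {"the", "is", "at", "which", "on"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [token for token in text.lower().split() if token not in stop_words]


# A band name made up entirely of stop words disappears from the query,
# which is the phrase-search problem the text describes.
filtered_sentence = remove_stop_words("The cat is on the mat")
filtered_band_name = remove_stop_words("The The")
```

Real tools usually tokenise more carefully (punctuation, casing rules) and let the stop list be configured per purpose.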

Approximate Exploration
Although exploration in reinforcement learning is well understood from a theoretical point of view, provably correct methods remain impractical. In this paper we study the interplay between exploration and approximation, what we call approximate exploration. We first provide results when the approximation is explicit, quantifying the performance of an exploration algorithm, MBIE-EB (Strehl and Littman, 2008), when combined with state aggregation. In particular, we show that this allows the agent to trade off between learning speed and quality of the policy learned. We then turn to a successful exploration scheme in practice, pseudo-count based exploration bonuses (Bellemare et al., 2016). We show that choosing a density model implicitly defines an abstraction and that the pseudo-count bonus incentivizes the agent to explore using this abstraction. We find, however, that implicit exploration may result in a mismatch between the approximated value function and exploration bonus, leading to either under- or over-exploration. …

GPU Open Analytics Initiative (GOAI)
Recently, Continuum Analytics, H2O.ai, and MapD announced the formation of the GPU Open Analytics Initiative (GOAI). GOAI, also joined by BlazingDB, Graphistry and the Gunrock project from the University of California, Davis, aims to create open frameworks that allow developers and data scientists to build applications using standard data formats and APIs on GPUs. Bringing standard analytics data formats to GPUs will allow data analytics to be even more efficient, and to take advantage of the high throughput of GPUs. NVIDIA believes this initiative is a key contributor to the continued growth of GPU computing in accelerated analytics. …

StochAstic Recursive grAdient algoritHm (SARAH)
In this paper, we propose a StochAstic Recursive grAdient algoritHm (SARAH), as well as its practical variant SARAH+, as a novel approach to finite-sum minimization problems. Different from vanilla SGD and other modern stochastic methods such as SVRG, S2GD, SAG and SAGA, SARAH admits a simple recursive framework for updating stochastic gradient estimates; compared to SAG/SAGA, SARAH does not require storage of past gradients. The linear convergence rate of SARAH is proven under a strong convexity assumption. We also prove a linear convergence rate (in the strongly convex case) for an inner loop of SARAH, a property that SVRG does not possess. Numerical experiments demonstrate the efficiency of our algorithm. …
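The recursive gradient estimate mentioned above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' reference implementation: it handles a scalar parameter only, restarts each outer loop from the last inner iterate (the published algorithm may choose the outer iterate differently), and the toy least-squares data `a`, `b`, the step size and the loop counts are all made up for the example.

```python
import random


def sarah(grad_i, n, w0, step, outer_iters, inner_iters, seed=0):
    """Minimise (1/n) * sum_i f_i(w) for a scalar w with a SARAH-style update.

    grad_i(i, w) must return the derivative of the i-th component f_i at w.
    """
    rng = random.Random(seed)
    w = w0
    for _ in range(outer_iters):
        # Outer step: one full gradient, followed by a plain gradient step.
        v = sum(grad_i(i, w) for i in range(n)) / n
        w_prev, w = w, w - step * v
        for _ in range(inner_iters):
            i = rng.randrange(n)
            # Recursive estimate: correct the previous estimate v using one
            # sampled component, so no table of past gradients is stored.
            v = grad_i(i, w) - grad_i(i, w_prev) + v
            w_prev, w = w, w - step * v
    return w


# Hypothetical toy problem: f_i(w) = 0.5 * (a[i] * w - b[i]) ** 2
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.1, 5.9, 8.2]
w_hat = sarah(lambda i, w: a[i] * (a[i] * w - b[i]),
              n=len(a), w0=0.0, step=0.02, outer_iters=50, inner_iters=10)
# Closed-form least-squares minimiser, used only to check the result.
w_star = sum(x * y for x, y in zip(a, b)) / sum(x * x for x in a)
```

The inner update differs from SVRG in that the correction term uses the previous iterate rather than a fixed snapshot, which is what makes the estimate recursive.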

Document worth reading: “Explainable Machine Learning for Scientific Insights and Discoveries”

Machine learning methods have been remarkably successful for a wide range of application areas in the extraction of essential information from data. An exciting and relatively recent development is the uptake of machine learning in the natural sciences, where the major goal is to obtain novel scientific insights and discoveries from observational or simulated data. A prerequisite for obtaining a scientific outcome is domain knowledge, which is needed to gain explainability, but also to enhance scientific consistency. In this article we review explainable machine learning in view of applications in the natural sciences and discuss three core elements which we identified as relevant in this context: transparency, interpretability, and explainability. With respect to these core elements, we provide a survey of recent scientific works incorporating machine learning, with particular attention to the way that explainable machine learning is used in their respective application areas.

If you did not already know

HIRO
Hierarchical reinforcement learning (HRL) is a promising approach to extend traditional reinforcement learning (RL) methods to solve more complex tasks. Yet, the majority of current HRL methods require careful task-specific design and on-policy training, making them difficult to apply in real-world scenarios. In this paper, we study how we can develop HRL algorithms that are general, in that they do not make onerous additional assumptions beyond standard RL algorithms, and efficient, in the sense that they can be used with modest numbers of interaction samples, making them suitable for real-world problems such as robotic control. For generality, we develop a scheme where lower-level controllers are supervised with goals that are learned and proposed automatically by the higher-level controllers. To address efficiency, we propose to use off-policy experience for both higher and lower-level training. This poses a considerable challenge, since changes to the lower-level behaviors change the action space for the higher-level policy, and we introduce an off-policy correction to remedy this challenge. This allows us to take advantage of recent advances in off-policy model-free RL to learn both higher- and lower-level policies using substantially fewer environment interactions than on-policy algorithms. We term the resulting HRL agent HIRO and find that it is generally applicable and highly sample-efficient. Our experiments show that HIRO can be used to learn highly complex behaviors for simulated robots, such as pushing objects and utilizing them to reach target locations, learning from only a few million samples, equivalent to a few days of real-time interaction. In comparisons with a number of prior HRL methods, we find that our approach substantially outperforms previous state-of-the-art techniques. …

Augmented Neural ODE
We show that Neural Ordinary Differential Equations (ODEs) learn representations that preserve the topology of the input space and prove that this implies the existence of functions Neural ODEs cannot represent. To address these limitations, we introduce Augmented Neural ODEs which, in addition to being more expressive models, are empirically more stable, generalize better and have a lower computational cost than Neural ODEs. …
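To make the augmentation idea concrete, here is a small sketch with a hand-picked vector field standing in for a learned network (an assumption for illustration only; nothing here is trained). Trajectories of a one-dimensional ODE cannot cross, so no 1-D flow can send 1 to -1 and -1 to 1 at the same time. After appending one zero-valued dimension, a simple rotation achieves exactly that mapping. The names `rotate_field`, `integrate` and `augmented_flow` are hypothetical.

```python
import math


def rotate_field(state):
    """Hand-picked dynamics on the augmented state (h, a): rotation at angular speed pi."""
    h, a = state
    return (-math.pi * a, math.pi * h)


def integrate(field, state, t_end=1.0, steps=10_000):
    """Fixed-step Euler integration of d(state)/dt = field(state) from t = 0 to t_end."""
    dt = t_end / steps
    for _ in range(steps):
        derivative = field(state)
        state = tuple(s + dt * ds for s, ds in zip(state, derivative))
    return state


def augmented_flow(x):
    """Append one zero-valued dimension, integrate, and read back the original coordinate."""
    h, _extra = integrate(rotate_field, (x, 0.0))
    return h


# The two inputs swap sides (up to Euler discretisation error), which a
# flow restricted to the original one-dimensional space could not do.
out_pos = augmented_flow(1.0)
out_neg = augmented_flow(-1.0)
```

In an actual Augmented Neural ODE the vector field would be a trained network and the solver would typically be adaptive; the sketch only shows why the extra dimension adds expressive room.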

Balancing GAN (BAGAN)
Image classification datasets are often imbalanced, a characteristic that negatively affects the accuracy of deep-learning classifiers. In this work we propose balancing GANs (BAGANs) as an augmentation tool to restore balance in imbalanced datasets. This is challenging because the few minority-class images may not be enough to train a GAN. We overcome this issue by including during training all available images of majority and minority classes. The generative model learns useful features from majority classes and uses these to generate images for minority classes. We apply class-conditioning in the latent space to drive the generation process towards a target class. Additionally, we couple GANs with autoencoding techniques to reduce the risk of collapsing toward the generation of few foolish examples. We compare the proposed methodology with state-of-the-art GANs and demonstrate that BAGAN generates images of superior quality when trained with an imbalanced dataset. …

Speaker Diarisation
Speaker diarisation (or diarization) is the process of partitioning an input audio stream into homogeneous segments according to the speaker identity. It can enhance the readability of an automatic speech transcription by structuring the audio stream into speaker turns and, when used together with speaker recognition systems, by providing the speaker’s true identity. It is used to answer the question ‘who spoke when?’ Speaker diarisation is a combination of speaker segmentation and speaker clustering. The first aims at finding speaker change points in an audio stream. The second aims at grouping together speech segments on the basis of speaker characteristics. …

What’s going on on PyPI

Machine Learning Orchestration

Be normal! (It’s a string normalization framework.)

A sub-script language for quick text processing and automation.

A toolkit for tracking energy, carbon, and compute metrics for machine learning (or any other) experiments.

easytoken is an independent open-source natural language processing Python library which implements easytoken to create tokens from both sentences and paragraphs.

Image Super-Resolution Library for PyTorch

Implementation of the Porter Stemmer algorithm (M. F. Porter, 1980)

s-atmech is an independent open-source deep learning Python library which implements an attention mechanism as an RNN (Recurrent Neural Network) layer in an Encoder-Decoder system (only Bahdanau Attention is supported right now).

Some PyTorch utilities for NLP

Some tools for Luigi to cut down the length of your pipelines and work in interactive environments such as Jupyter notebooks.

Spam filtering module with Machine Learning using SVM.

SQL-like language for awkward arrays

Document worth reading: “Automatic Extraction of Personality from Text: Challenges and Opportunities”

In this study, we examined the possibility of extracting personality traits from a text. We created an extensive dataset by having experts annotate personality traits in a large number of texts from multiple online sources. From these annotated texts, we selected a sample and made further annotations, ending up with a large low-reliability dataset and a small high-reliability dataset. We then used the two datasets to train and test several machine learning models to extract personality from text, including a language model. Finally, we evaluated our best models in the wild, on datasets from different domains. Our results show that the models based on the small high-reliability dataset performed better (in terms of R²) than models based on the large low-reliability dataset. Also, the language model based on the small high-reliability dataset performed better than the random baseline. Finally, and more importantly, the results showed our best model did not perform better than the random baseline when tested in the wild. Taken together, our results show that determining personality traits from a text remains a challenge and that no firm conclusions can be made on model performance before testing in the wild.