This project combines two of the recent architectures StackGAN and ProGAN for synthesizing faces from textual descriptions. The project uses Face2Text dataset which contains 400 facial images and textual captions for each of them. The data can be obtained by contacting either the RIVAL group or the authors of the aforementioned paper.
Orbit is a composable framework for orchestrating change processing, tracking, and synchronization across multiple data sources. Orbit is written in Typescript and distributed on npm through the orbit organization. Pre-built distributions are provided in several module formats and ES language levels. Orbit is isomorphic – it can be run both in modern browsers as well as in the Node.js runtime.
I built a scenario for a hybrid machine learning infrastructure leveraging Apache Kafka as scalable central nervous system. The public cloud is used for training analytic models at extreme scale (e.g. using TensorFlow and TPUs on Google Cloud Platform (GCP) via Google ML Engine. The predictions (i.e. model inference) are executed on premise at the edge in a local Kafka infrastructure (e.g. leveraging Kafka Streams or KSQL for streaming analytics). This post focuses on the on premise deployment. I created a Github project with a KSQL UDF for sensor analytics. It leverages the new API features of KSQL to build UDF / UDAF functions easily with Java to do continuous stream processing on incoming events.
Shark-ML is an open-source machine learning library, it’s site tells next: ‘It provides methods for linear and nonlinear optimization, kernel-based learning algorithms, neural networks, and various other machine learning techniques. It serves as a powerful toolbox for real world applications as well as for research. Shark works on Windows, MacOS X, and Linux. It comes with extensive documentation. Shark is licensed under the GNU Lesser General Public License.’ I can confirm that it offers a wide range of machine learning algorithms together with nice documentation, tutorials and samples. So in this article I will show the basic API concepts, details can be easily found in official documentation. In this article I will show how to use this library for solving a classification problem. I’ve used Iris data set in this example, so loading data will depend on it’s format.
Below is an R code for Cox and Stuart Test for Trend Analysis. Simply, copy and paste the code into R workspace and use it. Unlike cox.stuart.test in R package named ‘randtests’, this version of the test does not return a p-value greater than one. This phenomenon occurs when the test statistic, T is half of the number of untied pairs, N.
Understanding how to detect relevant words: TF-IDF, which stands for term frequency – inverse document frequency, is a scoring measure widely used in information retrieval (IR) or summarization. TF-IDF is intended to reflect how relevant a term is in a given document. The intuition behind it is that if a word occurs multiple times in a document, we should boost its relevance as it should be more meaningful than other words that appear fewer times (TF). At the same time, if a word occurs many times in a document but also along many other documents, maybe it is because this word is just a frequent word; not because it was relevant or meaningful (IDF).
At the Insurance Data Science conference, both Eric Novik and Paul-Christian Bürkner emphasised in their talks the value of thinking about the data generating process when building Bayesian statistical models. It is also a key step in Michael Betancourt´s Principled Bayesian Workflow. In this post, I will discuss in more detail how to set priors, and review the prior and posterior parameter distributions, but also the prior predictive distributions with brms (Bürkner (2017)). The prior predictive distribution shows me how the model behaves before I use my data. Thus, I can check if the model describes the data generating process reasonably well, before I go through the lengthy process of fitting the model. As an example, I will get back to the non-linear hierarchical growth curve model (Guszcza (2008)), which is also part of the brms package vignette on non-linear models.
In this exercise, we will try to handle the model that has been over-dispersed using the quasi-Poisson model. Over-dispersion simply means that the variance is greater than the mean. It´s important because it leads to inflation in the models and increases the possibility of Type I errors. We will use a data-set on amphibian road kill (Zuur et al., 2009). It has 17 explanatory variables. We´re going to focus on nine of them using the total number of kills (TOT.N) as the response variable. Please download the data-set here and name it ‘Road.’ Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page. Load the data-set and required package before running the exercise.
Briefly, clustering is the task of grouping together a set of objects in a way that objects in the same cluster are more similar to each other than to objects in other clusters. Similarity is an amount that reflects the strength of a relationship between two data objects. Clustering is mainly used for exploratory data mining. Clustering has manifold usage in many fields such as machine learning, pattern recognition, image analysis, information retrieval, bio-informatics, data compression, and computer graphics. There are many families of clustering techniques, and you may be familiar with the most popular one: K-Means (which belongs to the family of centroid-based clustering). As a quick refresher, K-Means determines k centroids in the data and clusters points by assigning them to the nearest centroid. While K-Means is easy to understand and implement in practice, the algorithm does not take care of outliers, so all points are assigned to a cluster even if they do not belong in any. In the domain of anomaly detection, this causes problems as anomalous points will be assigned to the same cluster as ‘normal’ data points. The anomalous points pull the cluster centroid towards them, making it harder to classify them as anomalous points. This tutorial will cover another type of clustering technique known as density-based clustering specifically DBSCAN (a density-based based clustering technique). Compared to centroid-based clustering like K-Means, density-based clustering works by identifying ‘dense’ clusters of points, allowing it to learn clusters of arbitrary shape and identify outliers in the data.
K Nearest Neighbor(KNN) is a very simple, easy to understand, versatile and one of the topmost machine learning algorithms. KNN used in the variety of applications such as finance, healthcare, political science, handwriting detection, image recognition and video recognition. In Credit ratings, financial institutes will predict the credit rating of customers. In loan disbursement, banking institutes will predict whether the loan is safe or risky. In political science, classifying potential voters in two classes will vote or won´t vote. KNN algorithm used for both classification and regression problems. KNN algorithm based on feature similarity approach.
This article is a quick tutorial for implementing a surveillance system using Object Detection based on deep learning.
We propose a method for simultaneously estimating regression coe cients and clustering response variables in a multivariate regression model, to increase prediction accuracy and give insights into the relationship between response variables. The estimates of the regression coe cients and clusters are found by using a penalized likelihood estimator, which includes a cluster fusion penalty, to shrink the di erence in tted values from responses in the same cluster, and an L1 penalty for simultaneous variable selection and estimation. We propose a two-step algorithm, that iterates between k-means clustering and solving the penalized likelihood function assuming the clusters are known, which has desirable parallel computational properties obtained by using the cluster fusion penalty. If the response variable clusters are known a priori then the algorithm reduces to just solving the penalized likelihood problem. Theoretical results are presented for the penalized least squares case, including asymptotic results allowing for p n. We extend our method to the setting where the responses are binomial variables. We propose a coordinate descent algorithm for the normal likelihood and a proximal gradient descent algorithm for the binomial likelihood, which can easily be extended to other generalized linear model (GLM) settings. Simulations and data examples from business operations and genomics are presented to show the merits of both the least squares and binomial methods.
The scale of modern datasets necessitates the development of efficient distributed optimization methods for machine learning. We present a general-purpose framework for distributed computing environments, CoCoA, that has an efficient communication scheme and is applicable to a wide variety of problems in machine learning and signal processing. We extend the framework to cover general non-strongly-convex regularizers, including L1-regularized problems like lasso, sparse logistic regression, and elastic net regularization, and show how earlier work can be derived as a special case. We provide convergence guarantees for the class of convex regularized loss minimization objectives, leveraging a novel approach in handling non-strongly-convex regularizers and non-smooth loss functions. The resulting framework has markedly improved performance over state-of-the-art methods, as we illustrate with an extensive set of experiments on real distributed datasets.
We consider interactive algorithms in the pool-based setting, and in the stream-based setting. Interactive algorithms observe suggested elements (representing actions or queries), and interactively select some of them and receive responses. Pool-based algorithms can select elements at any order, while stream-based algorithms observe elements in sequence, and can only select elements immediately after observing them. We further consider an intermediate setting, which we term precognitive stream, in which the algorithm knows in advance the identity of all the elements in the sequence, but can select them only in the order of their appearance. For all settings, we assume that the suggested elements are generated independently from some source distribution, and ask what is the stream size required for emulating a pool algorithm with a given pool size, in the stream-based setting and in the precognitive stream setting. We provide algorithms and matching lower bounds for general pool algorithms, and for utility-based pool algorithms. We further derive nearly matching upper and lower bounds on the gap between the two settings for the special case of active learning for binary classification.
It is usual in machine learning theory to assume that the training and testing sets comprise of draws from the same distribution. This is rarely, if ever, true and one must admit the presence of corruption. There are many different types of corruption that can arise and as of yet there is no general means to compare the relative ease of learning in these settings. Such results are necessary if we are to make informed economic decisions regarding the acquisition of data. Here we begin to develop an abstract framework for tackling these problems. We present a generic method for learning from a fixed, known, reconstructible corruption, along with an analyses of its statistical properties. We demonstrate the utility of our framework via concrete novel results in solving supervised learning problems wherein the labels are corrupted, such as learning with noisy labels, semi-supervised learning and learning with partial labels.