This paper is a survey of a large number of informal definitions of “intelligence” that the authors have collected over the years. Naturally, compiling a complete list would be impossible as many definitions of intelligence are buried deep inside articles and books. Nevertheless, the 70-odd definitions presented here are, to the authors’ knowledge, the largest and most well referenced collection there is.
Apache Spark is quickly gaining steam both in the headlines and real-world adoption, mainly because of its ability to process streaming data. With so much data being processed on a daily basis, it has become essential for us to be able to stream and analyze it in real time. In addition, Apache Spark is fast enough to perform exploratory queries without sampling. Many industry experts have provided all the reasons why you should use Spark for Machine Learning?
Twitter is a good ressource to collect data. We can find a few libraries (R or Python) which allow you to build your own dataset with the data generated by Twitter. This tutorial is focus on the preparation of the data and no on the collect. Throughout this analysis we are going to see how to work with the twitter’s data. If You want to play with the same data you can download it here.
Data Science / Analytics is all about finding valuable insights from the given dataset. In short, Finding answers that could help business. In this tutorial, We will see how to get started with Data Analysis in Python. The Python packages that we use in this notebook are: numpy, pandas, matplotlib, and seaborn
Many scientific imaging applications, especially microscopy, can produce terabytes of data per day. These applications can benefit from recent advances in computer vision and deep learning. In our work with biologists on robotic microscopy applications (e.g., to distinguish cellular phenotypes) we’ve learned that assembling high quality image datasets that separate signal from noise is a difficult but important task. We’ve also learned that there are many scientists who may not write code, but who are still excited to utilize deep learning in their image analysis work. A particular challenge we can help address involves dealing with out-of-focus images. Even with the autofocus systems on state-of-the-art microscopes, poor configuration or hardware incompatibility may result in image quality issues. Having an automated way to rate focus quality can enable the detection, troubleshooting and removal of such images.
This infographic series features the speakers from Kaggle’s CareerCon 2018 session, ‘Real Stories from a Panel of Successful Career Switchers’.
We highlight recent developments in machine learning and Deep Learning related to multiscale methods, which analyze data at a variety of scales to capture a wider range of relevant features. We give a general overview of multiscale methods, examine recent successes, and compare with similar approaches.
In the most recent post, I profiled a Metropolis-in-Gibbs sampler for estimating the parameters of a Bayesian logistic regression model. The conclusion was that evaluation of the log-posterior was a significant run time bottleneck. In each iteration, the log-posterior is evaluated twice: once at the current draw, and another at the proposed draw. This post hones in on this issue to show how Rcpp can help get past this bottleneck. For this particular post, my code and results are in this sub-repo. If you’re short on time, TLDR: just by coding the log-posterior in C++ instead of a vectorized R function, we can significantly reduce run time. The R implementation runs about 4-7 times slower. If you’re coding your own samplers, profiling your code and re-writing bottlenecks in Rcpp can be hugely beneficial.
The use of formal statistical methods to analyse quantitative data in data science has increased considerably over the last few years. One such approach, Bayesian Decision Theory (BDT), also known as Bayesian Hypothesis Testing and Bayesian inference, is a fundamental statistical approach that quantifies the tradeoffs between various decisions using distributions and costs that accompany such decisions. In pattern recognition it is used for designing classifiers making the assumption that the problem is posed in probabilistic terms, and that all of the relevant probability values are known. Generally, we don’t have such perfect information but it is a good place to start when studying machine learning, statistical inference, and detection theory in signal processing. BDT also has many applications in science, engineering, and medicine.
The majority of industry and academic numeric predictive projects deal with deterministic or point forecasts of expected values of a random variable given some conditional information. In some cases, these predictions are enough for decision making. However, these predictions don’t say much about the uncertainty of your underlying stochastic process. A common desire of all data scientists is to make predictions for an uncertain future. Clearly then, forecasts should be probabilistic, i.e., they should take the form of probability distributions over future quantities or events. This form of prediction is known as probabilistic forecasting and in the last decade has seen a surge in popularity. Recent evidence of this are the 2014 and 2017 Global Energy Forecasting Competitions (GEFCom). GEFCom2014 focused on producing multiple quantile forecasts for wind, solar, load, and electricity prices, and GEFCom2017 focused on hierarchical rolling probabilistic forecasts of load. More recently the M4 Competition aims to produce point forecasts of 100,000-time series but has also optionally for the first time opened to submitting prediction interval forecasts too.
Fish schools, bird flocks, and bee swarms. These combinations of real-time biological systems can blend knowledge, exploration, and exploitation to unify intelligence and solve problems more efficiently. There’s no centralized control. These simple agents interact locally, within their environment, and new behaviors emerge from the group as a whole. In the world of evolutionary alogirthms one such inspired method is particle swarm optimization (PSO). It is a swarm intelligence based computational technique that can be used to find an approximate solution to a problem by iteratively trying to search candidate solutions (called particles) with regard to a given measure of quality around a global optimum. The movements of the particles are guided by their own best known position in the search-space as well as the entire swarm’s best known position. PSO makes few or no assumptions about the problem being optimized and can search very large spaces of candidate solutions. As a global optimization method PSO does not use the gradient of the problem being optimized, which means PSO does not require that the optimization problem be differentiable as is required by classic optimization methods such as gradient descent. This makes it a widely popular optimizer for many nonconvex or nondifferentiable problems.