Introduction to Genetic Algorithms & their application in data science

A few days back, I started working on a practice problem – Big Mart Sales. After applying some simple models and doing some feature engineering, I landed at 219th position on the leaderboard. Not bad – but I needed something better. So I started searching for optimization techniques that could improve my score. It was during this search that I was introduced to genetic algorithms. After applying a genetic algorithm to the practice problem, I ended up taking a considerable leap on the leaderboard.
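A minimal sketch of the general idea, assuming R’s GA package and a toy dataset (mtcars) rather than the Big Mart Sales data: a binary chromosome switches features on or off, and the fitness function rewards feature subsets whose model predicts well.

    # Sketch of GA-driven feature selection (toy data, not the article's setup).
    library(GA)

    target     <- "mpg"
    predictors <- setdiff(names(mtcars), target)

    # Fitness: negative RMSE of a linear model built on the selected features,
    # so maximizing fitness means minimizing prediction error.
    fitness <- function(bits) {
      if (sum(bits) == 0) return(-1e9)          # heavily penalize the empty feature set
      chosen <- predictors[bits == 1]
      fit    <- lm(reformulate(chosen, target), data = mtcars)
      -sqrt(mean(residuals(fit)^2))
    }

    set.seed(1)
    ga_fit <- ga(type = "binary", fitness = fitness,
                 nBits = length(predictors), popSize = 30, maxiter = 50)

    predictors[ga_fit@solution[1, ] == 1]       # features kept by the GA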


Blockchain and Artificial Intelligence

How future technologies will affect our lives, how many jobs will be taken by robots, and how many will be left for humans is difficult to predict today. At the same time, it seems certain that with machine learning there will be fewer and fewer software development and coding jobs; it may be just another decade before most software development jobs are gone. The future of computing will have extraordinary capabilities, with the factoring of a 3,000-digit number expected to be on the order of 10 to the power of 40 times faster than today. Personal health devices will send data to databases, and glucose levels will be analyzed from tears by contact lenses. AI will form its own team on the fly – a combination of hardware, software, infrastructure and trained machine-learning programs organized to facilitate planning, control, coordination and decision making in an organization – and this process will happen at any time, or many times, for various tasks. Blockchain will transform financial services. Disruptive technologies transform traditional industries.


Recommendation System Algorithms

Today, many companies use big data to make super relevant recommendations and grow revenue. Among the variety of recommendation algorithms, data scientists need to choose the best one according to a business’s limitations and requirements. To simplify this task, my team has prepared an overview of the main existing recommendation system algorithms.
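As a flavour of what such an overview covers, here is a minimal sketch of one classic approach, item-based collaborative filtering with cosine similarity; the ratings matrix below is made up for illustration and is not from the article.

    # Item-based collaborative filtering sketch on a made-up ratings matrix.
    set.seed(1)
    ratings <- matrix(sample(c(NA, 1:5), 60, replace = TRUE),
                      nrow = 10, dimnames = list(paste0("user", 1:10),
                                                 paste0("item", 1:6)))

    # Cosine similarity between items, treating missing ratings as zero.
    R <- ratings
    R[is.na(R)] <- 0
    item_sim <- crossprod(R) / (sqrt(colSums(R^2)) %o% sqrt(colSums(R^2)))

    # Score a user's unseen items as a similarity-weighted average of the
    # ratings that user has already given, then rank them.
    score_items <- function(user) {
      r      <- R[user, ]
      scores <- as.vector(item_sim %*% r) / (colSums(item_sim * (r > 0)) + 1e-9)
      names(scores) <- colnames(R)
      sort(scores[r == 0], decreasing = TRUE)
    }

    score_items("user1")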


Logistic Regression

We review binary logistic regression. In particular, we derive a) the equations needed to fit the model via gradient descent, b) the asymptotic coefficient covariance matrix of the maximum likelihood fit, and c) expressions for confidence intervals on the class membership probability at a test point. We also provide Python code implementing a minimal “LogisticRegressionWithError” class whose “predict_proba” method returns prediction confidence intervals alongside its point estimates.
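For reference, the standard quantities involved look like this (the notation is mine and may differ from the article’s). With $p_i = \sigma(x_i^\top \beta)$ and $\sigma(z) = 1/(1 + e^{-z})$, the log-likelihood and its gradient are
\[
  \ell(\beta) = \sum_{i=1}^{n} \big[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\big],
  \qquad
  \nabla_\beta \ell = X^\top (y - p),
\]
so a gradient-ascent fit updates $\beta \leftarrow \beta + \eta\, X^\top (y - p)$. The maximum likelihood estimate has asymptotic covariance
\[
  \mathrm{Cov}(\hat\beta) \approx (X^\top W X)^{-1},
  \qquad
  W = \mathrm{diag}\big(p_i (1 - p_i)\big),
\]
and a confidence interval for the class probability at a test point $x_*$ follows by pushing $x_*^\top \hat\beta \pm z_{\alpha/2} \sqrt{x_*^\top \mathrm{Cov}(\hat\beta)\, x_*}$ through $\sigma(\cdot)$.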


Starting a data science project: Three things to remember about your data

1. Identify the right questions about your data
2. Don’t wait too long to start collecting data in the right way
3. Don’t expect data science models to perform well on any type of data


Random Forests explained intuitively

The Random Forests algorithm has always fascinated me. I like how this algorithm can be easily explained to anyone without much hassle. One quick example I use very frequently to explain the working of random forests is the way a company holds multiple rounds of interviews to hire a candidate. Let me elaborate. Say you applied for the position of statistical analyst at WalmartLabs. Now, like at most companies, you don’t just have one round of interview; you have multiple rounds of interviews. Each of these interviews is chaired by an independent panel. Each panel assesses the candidate separately and independently. Generally, even the questions asked in these interviews differ from each other. Randomness is important here.
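To make the analogy concrete, here is a minimal sketch with R’s randomForest package; the iris data is just a stand-in, not anything from the article. Each tree is an independent “panel” trained on its own bootstrap sample and a random subset of features, and the forest’s prediction is the majority verdict.

    # Sketch of the "many independent panels" idea with randomForest.
    library(randomForest)

    set.seed(42)
    rf <- randomForest(Species ~ ., data = iris,
                       ntree = 500,   # 500 "panels", each seeing its own bootstrap sample
                       mtry  = 2)     # each split considers only 2 randomly chosen features

    # Each tree votes independently; the forest returns the majority verdict.
    predict(rf, iris[c(1, 51, 101), ])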


Matching, Optimal Transport and Statistical Tests

To explain the “optimal transport” problem, we usually start with Gaspard Monge’s “Mémoire sur la théorie des déblais et des remblais”, which poses the problem of transporting a given distribution of matter (a pile of sand, for instance) into another (an excavation, for instance). The problem is usually stated in terms of distributions, and we seek the “optimal” transport from one distribution to the other. This distributional formulation was given in the 1940s by Leonid Kantorovich.
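The two classical formulations, stated in the usual notation (not quoted from the post): Monge seeks a transport map $T$ pushing $\mu$ forward to $\nu$ at minimal cost,
\[
  \inf_{T \,:\, T_{\#}\mu = \nu} \int c\big(x, T(x)\big)\, d\mu(x),
\]
while Kantorovich relaxes this to an optimization over couplings $\pi$ whose marginals are $\mu$ and $\nu$,
\[
  \inf_{\pi \in \Pi(\mu, \nu)} \int c(x, y)\, d\pi(x, y).
\]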


Scripting for data analysis (with R)

This was a PhD course given in the spring of 2017 at Linköping University. The course was organised by the graduate school Forum scientium and was aimed at people who might be interested in using R for data analysis. The materials grew out of part of a previous PhD course from a couple of years ago, an R tutorial given as part of the Behaviour genetics Master’s course, and the Wright lab computation lunches.


Understanding Overhead Issues in Parallel Computation

In my talk at useR! earlier this month, I emphasized that a major impediment to obtaining good speed from parallelizing an algorithm is systems overhead of various kinds (a small illustration follows the list), including:
• Contention for memory/network.
• Bandwidth limits – CPU/memory, CPU/network, CPU/GPU.
• Cache coherency problems.
• Contention for I/O ports.
• OS and/or R limits on number of sockets (network connections).
• Serialization.
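As a tiny illustration of the point (not from the talk itself), spreading a very cheap task over worker processes with base R’s parallel package often runs slower than the sequential version, because serialization and communication dominate the actual computation:

    # Overhead demo: 10,000 trivial jobs cost more to ship to workers than to run.
    library(parallel)

    cl <- makeCluster(2)

    system.time(lapply(1:10000, function(i) sqrt(i)))           # sequential
    system.time(parLapply(cl, 1:10000, function(i) sqrt(i)))    # parallel, often slower here

    stopCluster(cl)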


Tidy Time Series Analysis, Part 3: The Rolling Correlation

In the third part of a series on Tidy Time Series Analysis, we’ll use the runCor function from TTR to investigate rolling (dynamic) correlations. We’ll again use tidyquant to investigate CRAN downloads. This time we’ll also get some help from the corrr package to investigate correlations over specific timespans, and the cowplot package for multi-plot visualizations. We’ll end by reviewing the changes in rolling correlations to show how to detect events and shifts in trend. If you like what you read, please follow us on social media to stay up to date on the latest Business Science news, events and information! As always, we are interested in both expanding our network of data scientists and seeking new clients interested in applying data science to business and finance. If interested, contact us.
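The core call the post builds on is TTR::runCor, which computes a correlation over a trailing window; the sketch below uses simulated series rather than the CRAN download data.

    # Rolling correlation with TTR::runCor on two simulated series.
    library(TTR)

    set.seed(123)
    x <- cumsum(rnorm(250))
    y <- 0.6 * x + cumsum(rnorm(250))

    roll_cor <- runCor(x, y, n = 30)   # correlation over a trailing 30-observation window

    plot(roll_cor, type = "l",
         xlab = "observation", ylab = "30-period rolling correlation")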


Learn parallel programming in R with these exercises for “foreach”

The foreach package provides a simple looping construct for R: the foreach function, which you may be familiar with from other languages like JavaScript or C#. It’s basically a function-based version of a ‘for’ loop. But what makes foreach useful isn’t the iteration: it’s the way it makes it easy to run those iterations in parallel, and save time on multi-CPU and distributed systems. If you want to get familiar with the foreach function, Parallelizing Loops at Microsoft Docs will introduce you to foreach loops (and the companion concept, iterators), and the various ‘backends’ you can use to make the loops run in parallel without changing your loop code. Then, to make sure you’ve captured the concepts, you can try these 10 parallel computing exercises using foreach, from R-exercises.com. If you get stuck, the solutions are available here. To get started with foreach, you can install the foreach package from CRAN, or simply use any edition of Microsoft R, including Microsoft R Open (all come with foreach preinstalled). The foreach package is maintained by Microsoft under the open-source Apache license.
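A minimal sketch of such a loop, using the doParallel backend (one of several backends you can register):

    # Parallel foreach loop: register a backend, then switch %do% for %dopar%.
    library(foreach)
    library(doParallel)

    registerDoParallel(cores = 2)    # tell %dopar% to use 2 worker processes

    # .combine = c collects the per-iteration results into a single vector.
    result <- foreach(i = 1:8, .combine = c) %dopar% {
      sqrt(i)
    }

    result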


Visualising Similarity: Maps vs. Graphs

The visualization of complex data sets is of essential importance in communicating your data products. Beyond pie charts, histograms, line graphs and other common forms of visual communication begins the realm of data sets that encompass too much information to be easily captured by these simple displays. A typical context that abounds with complexity is found in the areas of text mining, natural language processing, and cognitive computing in general; such complexity is pervasive whenever one attempts to build a visual interface for products like semantic search engines or recommendation engines. For example, statistical models like LDA (Latent Dirichlet Allocation) enable a thorough insight into the similarity structure across textual documents or the vocabulary terms used to describe them. But as the number of pairwise similarities between terms or documents to be presented to the end users increases, the problem of effective data visualization becomes harder. In this and the following blog posts, I wish to present and discuss some of the well-known approaches to visualizing difficult data sets from the viewpoint of pure ergonomics, as well as to suggest some improvements and new ideas here and there.
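To make the “map versus graph” contrast concrete, here is a toy sketch on made-up data (not the author’s pipeline): classical MDS gives a map-like embedding of the documents, while thresholding the same similarity matrix gives a graph.

    # Toy map-vs-graph contrast on a made-up document-by-topic matrix.
    library(igraph)

    set.seed(7)
    docs <- matrix(runif(20 * 5), nrow = 20)        # 20 "documents", 5 topic weights
    sim  <- docs %*% t(docs) /
            (sqrt(rowSums(docs^2)) %o% sqrt(rowSums(docs^2)))   # cosine similarity

    # Map: embed documents in 2D so similar documents land near each other.
    coords <- cmdscale(as.dist(1 - sim), k = 2)
    plot(coords, pch = 19, xlab = "dim 1", ylab = "dim 2")

    # Graph: keep only strong similarities as edges and plot the network.
    adj <- sim * (sim > 0.9)
    diag(adj) <- 0
    g <- graph_from_adjacency_matrix(adj, mode = "undirected", weighted = TRUE)
    plot(g, vertex.size = 6, vertex.label = NA)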