A Complete Tutorial to Learn Data Science with Python from Scratch

It happened a few years back. After working with SAS for more than five years, I decided to move out of my comfort zone. Being a data scientist, my hunt for other useful tools was on! Fortunately, it didn’t take me long to decide: Python was my appetizer. I had always had an inclination towards coding, and this was the time to do what I really loved: code. It turned out that coding was easy! I learned the basics of Python within a week, and since then I have not only explored the language in depth but also helped many others learn it. Python was originally a general-purpose language, but over the years, with strong community support, it gained dedicated libraries for data analysis and predictive modeling. Given the lack of resources on Python for data science, I decided to create this tutorial to help many others learn Python faster. In this tutorial, we will take bite-sized pieces of information about how to use Python for data analysis, chew on them until we are comfortable, and practice them on our own.
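As a taste of that bite-sized style, here is a minimal, self-contained sketch of reading and summarizing tabular data with nothing but the Python standard library (the data and column names are made up for illustration):

```python
import csv
import io
import statistics

# Hypothetical in-memory CSV file; in practice you would open() a real file.
raw = io.StringIO("name,score\nann,3.5\nbob,4.0\ncara,4.5\n")

# Parse rows into dictionaries keyed by the header line.
rows = list(csv.DictReader(raw))

# Extract one column as floats and summarize it.
scores = [float(r["score"]) for r in rows]
print(statistics.mean(scores))  # 4.0
```

Libraries like pandas make this far more convenient, but the plain-stdlib version shows there is no magic involved.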

Yahoo Releases the Largest-ever Machine Learning Dataset for Researchers

Data is the lifeblood of research in machine learning. However, access to truly large-scale datasets is a privilege that has been traditionally reserved for machine learning researchers and data scientists working at large companies – and out of reach for most academic researchers. Research scientists at Yahoo Labs have long enjoyed working on large-scale machine learning problems inspired by consumer-facing products. This has enabled us to advance the thinking in areas such as search ranking, computational advertising, information retrieval, and core machine learning. A key aspect of interest to the external research community has been the application of new algorithms and methodologies to production traffic and to large-scale datasets gathered from real products. Today, we are proud to announce the public release of the largest-ever machine learning dataset to the research community. The dataset stands at a massive ~110B events (13.5TB uncompressed) of anonymized user-news item interaction data, collected by recording the user-news item interactions of about 20M users from February 2015 to May 2015. The Yahoo News Feed dataset is a collection based on a sample of anonymized user interactions on the news feeds of several Yahoo properties, including the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate.

What is Model-Based Machine Learning?

This post is about a different viewpoint called “model-based machine learning”, which tackles these difficulties, can solve problems in a wide range of application domains, and covers most existing machine learning algorithms as special cases. It will also allow us to deal with the uncertainty we encounter in real-world applications in a principled manner.

Making Causal Impact Analysis Easy

If you have ever spent time in the field of marketing analytics, chances are that you have analyzed the causal impact of a new local TV campaign, a major PR event, or the emergence of a new local competitor. From an analytical standpoint, these types of events all have one thing in common: the impact cannot be tracked at the individual customer level, and hence we have to analyze it from a bird’s-eye view using time series analysis at the market level. Data science may be changing at a fast pace, but this is an old-school use case that is still very relevant no matter what industry you’re in.
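To make the bird’s-eye-view idea concrete, here is a deliberately bare-bones Python sketch of the counterfactual logic: fit a linear trend to the pre-event period, project it over the post-event period, and call the average gap the estimated lift. A real analysis (e.g. with Google’s CausalImpact package) would use a Bayesian structural time series model and report uncertainty; the function name and data here are hypothetical.

```python
def lift_estimate(pre, post):
    """Naive counterfactual: project the pre-period linear trend forward
    and return the mean gap between actual and projected post-period values."""
    n = len(pre)
    xs = list(range(n))
    x_bar = sum(xs) / n
    y_bar = sum(pre) / n
    # Closed-form least-squares slope and intercept on the pre-period.
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, pre))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    # Project the trend over the post-period and compare with what happened.
    expected = [intercept + slope * (n + i) for i in range(len(post))]
    return sum(a - e for a, e in zip(post, expected)) / len(post)

# Synthetic example: a perfect unit trend, then a +10 jump after the event.
pre_sales = [10, 11, 12, 13, 14]
post_sales = [25, 26, 27]
print(lift_estimate(pre_sales, post_sales))  # 10.0
```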

What Is Machine Intelligence Vs. Machine Learning Vs. Deep Learning Vs. Artificial Intelligence (AI)?

We are frequently asked how we distinguish our technology from others. The task is made difficult by the fact that there is no agreed-upon vocabulary; everybody uses the above terms (and other associated terms) differently. In addition, the commonly understood meaning of some of these terms has evolved over time: what was meant by AI in 1960 is very different from what is meant today. In our view, there are three major approaches to building smart machines. Let’s call these approaches Classic AI, Simple Neural Networks, and Biological Neural Networks. The rest of this blog post will describe and differentiate these approaches; at the end, we’ll include an example of how each approach might address the same problem. This analysis is intended for a business rather than a technical audience, so we simplify somewhat and beg the indulgence of technical experts who might quibble with the details.

Plausibility vs. probability, prior distributions, and the garden of forking paths

I’ll start off this blog on the first work day of the new year with an important post connecting some ideas we’ve lately been talking a lot about. Someone rolls a die four times and tells you he got the numbers 1, 4, 3, 6. Is this a plausible outcome? Sure. Is it probable? No. The prior probability of this particular sequence is only 1 in 1296. The point is, lots and lots of things are plausible, but they can’t all be probable, because total probability sums to 1.
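The 1-in-1296 figure is just (1/6)⁴, the probability of one specific ordered sequence of four rolls of a fair die; a two-line check in Python:

```python
from fractions import Fraction

# Each of the four independent rolls has probability 1/6 of matching.
p = Fraction(1, 6) ** 4
print(p)  # 1/1296
```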

ggtree supports phylip tree format

Phylip is another widely used tree format; a Phylip file contains taxa sequences together with Newick tree text.
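As a rough illustration of what Newick tree text looks like, here is a minimal, hypothetical Python parser that turns a string such as `(A,(B,C));` into nested lists. It ignores branch lengths and internal-node labels, which real tools like ggtree of course handle:

```python
def parse_newick(s):
    """Parse a bare-bones Newick string into nested lists of leaf names.
    Branch lengths and internal-node labels are not supported."""
    s = s.strip().rstrip(";")
    pos = 0

    def parse():
        nonlocal pos
        if s[pos] == "(":
            pos += 1  # consume '('
            children = [parse()]
            while s[pos] == ",":
                pos += 1  # consume ','
                children.append(parse())
            pos += 1  # consume ')'
            return children
        # Otherwise read a leaf name up to the next structural character.
        start = pos
        while pos < len(s) and s[pos] not in ",()":
            pos += 1
        return s[start:pos]

    return parse()

print(parse_newick("(A,(B,C));"))  # ['A', ['B', 'C']]
```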

Mini AI app using TensorFlow and Shiny

My weekend was full of deep learning and AI programming, so as a milestone I made a simple image-recognition app that:
• Takes an image input uploaded to Shiny UI
• Performs image recognition using TensorFlow
• Plots detected objects and scores in a word cloud
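The original app is written in R with Shiny and TensorFlow; purely as an illustration of the last step, here is a hypothetical Python sketch of how detection scores might be mapped to word-cloud font sizes (the scaling rule and size range are invented for the example):

```python
def wordcloud_sizes(scores, min_size=10, max_size=60):
    """Linearly rescale detection scores to font sizes for a word cloud.
    If all scores are equal, every word gets min_size."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero for uniform scores
    return {word: min_size + (max_size - min_size) * (s - lo) / span
            for word, s in scores.items()}

# Hypothetical classifier output: label -> confidence score.
sizes = wordcloud_sizes({"cat": 0.9, "dog": 0.1, "bird": 0.5})
print(sizes)  # {'cat': 60.0, 'dog': 10.0, 'bird': 35.0}
```

A plotting library (wordcloud in Python, wordcloud2 in R) would then draw each word at its computed size.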

New Data Sources for R

Over the past few months, a number of new CRAN packages have appeared that make it easier for R users to gain access to curated data. Most of these provide interfaces to a RESTful API written by the data publishers, while a few just wrap the data set inside the package. Some of the new packages are simple, one-function wrappers around the API; others offer more features, providing functions to control searches and the format of the returned data. Some of the packages require a user to first obtain login credentials, while others don’t. Here are 17 packages that connect to data sources of all sorts. The list is by no means complete: new packages in this class seem to arrive at CRAN daily.
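The “simple, one-function wrapper” pattern these packages follow can be sketched in a few lines. Everything here (the base URL and parameter names) is hypothetical, and only standard-library calls are used:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def build_query_url(base_url, **params):
    """Assemble the request URL for a hypothetical RESTful data API."""
    return base_url + "?" + urlencode(params)

def fetch(base_url, **params):
    """One-function wrapper: GET the URL and parse the JSON payload."""
    with urlopen(build_query_url(base_url, **params)) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Hypothetical usage (no such endpoint actually exists):
print(build_query_url("https://api.example.com/v1/series",
                      id="GDP", format="json"))
```

More featureful packages layer search controls, pagination, credentials, and tidy output formats on top of exactly this core.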

Power analysis for default Bayesian t-tests

One important benefit of Bayesian statistics is that it can provide relative support for the null hypothesis. When the null hypothesis is true, p-values will forever wander randomly between 0 and 1, but a Bayes factor is consistent (Rouder, Speckman, Sun, Morey, & Iverson, 2009): as the sample size increases, the Bayes factor will tell you which of two hypotheses has received more support. Bayes factors can express relative evidence for the null hypothesis (H0) compared to the alternative hypothesis (H1), referred to as a BF01, or relative evidence for H1 compared to H0 (a BF10). A BF01 > 3 is sometimes referred to as substantial evidence for H0 (see Wagenmakers, Wetzels, Borsboom, & Van Der Maas, 2011), but what this really means is substantial evidence for H0 relative to H1.
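As a sketch of what a “default Bayesian t-test” computes, here is a pure-Python implementation of the one-sample JZS Bayes factor from Rouder et al. (2009), assuming their formula with Cauchy prior scale r = 1 and using simple midpoint quadrature. Treat it as an illustration of the math, not a replacement for a vetted implementation such as the BayesFactor R package:

```python
import math

def jzs_bf10(t, n, grid=20000):
    """JZS Bayes factor BF10 for a one-sample t-test with N observations
    and t-statistic t (Rouder et al., 2009; prior scale r = 1).
    The integral over g in (0, inf) is computed by midpoint quadrature
    after the change of variables g = u / (1 - u), u in (0, 1)."""
    nu = n - 1

    def integrand(g):
        # Marginal likelihood under H1, integrated against the
        # inverse-gamma(1/2, 1/2) prior density on g.
        return ((1 + n * g) ** -0.5
                * (1 + t * t / ((1 + n * g) * nu)) ** (-(nu + 1) / 2)
                * (2 * math.pi) ** -0.5 * g ** -1.5
                * math.exp(-1 / (2 * g)))

    h = 1.0 / grid
    num = 0.0
    for i in range(grid):
        u = (i + 0.5) * h          # midpoint of the i-th subinterval
        g = u / (1 - u)            # map (0, 1) back to (0, inf)
        num += integrand(g) / (1 - u) ** 2 * h  # Jacobian dg/du

    # Likelihood under H0 (effect size exactly zero).
    den = (1 + t * t / nu) ** (-(nu + 1) / 2)
    return num / den

# BF10 < 1 favors H0; BF01 is simply 1 / BF10.
print(jzs_bf10(0.0, 20) < 1)   # True: t = 0 supports the null
print(jzs_bf10(5.0, 50) > 10)  # True: a large t strongly supports H1
```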