R, Python, and SAS: Getting Started with Linear Regression

Consider the linear regression model …

Secure HTTPS Connections for R

Traditionally, the mechanisms for obtaining R and related software have used standard HTTP connections. This isn’t ideal though, as without a secure (HTTPS) connection there is less assurance that you are downloading code from a legitimate source rather than from another server posing as one.
Recently there have been a number of changes that make it easier to use HTTPS for installing R, RStudio, and packages from CRAN:
1. Downloads of R from the main CRAN website now use HTTPS;
2. Downloads of RStudio from our website now use HTTPS; and
3. It is now possible to install packages from CRAN over HTTPS.
There are a number of ways to ensure that installation of packages from CRAN is performed using HTTPS. The most recent version of R (v3.2.2) makes this the default behavior. The most recent version of RStudio (v0.99.473) also attempts to configure secure downloads from CRAN by default (even for older versions of R). Finally, any version of R or RStudio can use secure HTTPS downloads by making some configuration changes as described in the Secure Package Downloads for R article in our Knowledge Base.

11 things you should know as a Data Scientist

1. Hardware – choice of your machine
2. Operating System (OS)
3. Software – general
4. Software – Analytics / Data Science
5. Software – Data Visualization
6. Databases / File storage
6. Cloud services
7. Industry blogs and newsletters
8. Mobile apps
9. Meetups
10. Datasets for practice
11. Communities and Social Media

Eigenstyle: Principal Component Analysis and Fashion

Any set of images can be broken down with Principal Component Analysis. This has been done pretty successfully with faces. Here we’ll take a look at style. Our dataset is 807 pictures of dresses from Amazon. They have a standard image size, but unfortunately do not have a standard model pose (though they tend to be centered in the image similarly). Ideally, our principal components would only be about actual dress style, but here many of them will be concerned with model pose. Despite this, we can still do a lot with this data set.
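The decomposition described above can be sketched in a few lines of Python. This is a minimal illustration, not the post's own code: it assumes NumPy, uses random 8x8 arrays as a stand-in for the dress photos, and the function name principal_components is hypothetical.

```python
import numpy as np


def principal_components(images, n_components):
    """Return (mean, components, scores) for a stack of equally sized images."""
    # Flatten each image into one row vector: shape (n_images, n_pixels).
    X = images.reshape(len(images), -1).astype(float)
    mean = X.mean(axis=0)
    centered = X - mean
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    # Coordinates of every image along each retained direction.
    scores = centered @ components.T
    return mean, components, scores


# Synthetic stand-in for the photos (random pixels, NOT the real dataset).
rng = np.random.default_rng(0)
images = rng.random((20, 8, 8))
mean, components, scores = principal_components(images, n_components=5)

# Approximate the first image from its five component scores.
approx = (mean + scores[0] @ components).reshape(8, 8)
```

Each row of components can be reshaped back to the image size and viewed as a picture. With real photos, that is where pose-related components would show up alongside style-related ones.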

Preserving Validity in Adaptive Data Analysis

From discovering new particles and clinical studies to predicting election results and evaluating credit scores, scientific progress and industrial innovation increasingly rely on statistical data analysis. While incredibly useful, data analysis is also notoriously easy to misuse, even when the analyst has the best of intentions. Problems stemming from such misuse can be costly and contribute to a wider concern about the reproducibility of research findings, most notably in medical research. The issue is hotly debated in the scientific community and has attracted a lot of public attention in recent years.

Cars evolving using genetic algorithm in HTML 5

Ultimate app to find the best Data Science resources

Have you felt at a loss in the jungle of data science resources? Did you try finding a resource only to conclude there are too many of them? Or could you not find one that enables your learning without confusing you further? The problem in learning data science today is not a lack of resources, but an abundance of them! It has been our constant effort to provide you with the best resources, in as simple a manner as possible. To this end, we launched our learning paths – which got a roaring response from all of you. Today, we are pleased to launch the ultimate resource finder – another way to structure all the resources available out there in the wild. This resource finder aims to help you with all the resources you need in your journey to learn data science.

KDD 2015 Best Research Paper Award: “Algorithms for Public-Private Social Networks”

The 21st ACM conference on Knowledge Discovery and Data Mining (KDD’15), a main venue for academic and industry research in data management, information retrieval, data mining and machine learning, was held last week in Sydney, Australia. In the past several years, Google has been actively participating in KDD, with several Googlers presenting work at the conference in the research and industrial tracks. This year Googlers presented 12 papers at KDD, all of which are freely available at the ACM Digital Library.

What’s new in Revolution R Enterprise 7.4.1

In its latest release, Revolution has expanded the platform support of Revolution R Enterprise (RRE) version 7.4. Released August 14, version 7.4.1 extends RRE 7.4 capabilities to the Teradata database, HPC Server cluster, and Windows 10 platforms.
With RRE for Teradata, customers enjoy the advantage of bringing the analytics to the data by having RRE’s high-performance parallelized ScaleR algorithms run in-database on Teradata 14.10 and 15.00 databases rather than incurring the overhead of the traditional extract-and-analyze paradigm.
With RRE for Microsoft HPC Pack 2008 or 2012, these same high-performance ScaleR algorithms can be run in a distributed fashion across HPC-based Windows clusters.
Developers running RRE on laptops and desktops are now able to leverage Microsoft’s latest update to Windows with full RRE support for Windows 10.
Other new features in version 7.4.1 include:
• Default installation on top of Revolution R Open, Revolution’s enhanced distribution of open source R.
• Enhanced support for importing from and exporting to composite CSV and XDF files (experimental).
For RRE users integrating R with enterprise apps, RRE’s included DeployR integration server now supports use of Java 8. For more detailed coverage of what’s new in DeployR 7.4.1, please see this companion blog post or the link below.

A case in which metric data are better analyzed by an ordinal model

Here we consider some data that might have been smoothly distributed over a metric scale, but ended up being concentrated on only a few values. The usual treatment of the data as normally or t-distributed is not appropriate, and instead the data are binned and analyzed as ordinal.

Playing with R, Shiny Dashboard and Google Analytics Data

In this post, I want to share some examples of data visualization I have been playing with recently. As on many other occasions, my field of application is digital analytics data: specifically, data from Google Analytics.

Tutorials at KDD 2015

• VC-Dimension and Rademacher Averages: From Statistical Learning Theory to Sampling Algorithms
• Graph-Based User Behavior Modeling: From Prediction to Fraud Detection
• A New Look at the System, Algorithm and Theory Foundations of Large-Scale Distributed Machine Learning
• Dense subgraph discovery (DSD)
• Automatic Entity Recognition and Typing from Massive Text Corpora: A Phrase and Network Mining Approach
• Big Data Analytics: Optimization and Randomization
• Social Media Anomaly Detection: Challenges and Solutions
• Diffusion in Social and Information Networks: Problems, Models and Machine Learning Methods
• Medical Mining
• Large Scale Distributed Data Science using Apache Spark
• Data-Driven Product Innovation
• Web Personalization and Recommender Systems

Constructing a network of politicians from newspaper data

In the last post, we introduced the rzeit package, an R binding to the Content API at ZEIT Online. This time, we give a little demonstration of what can be done with these media data. The question we ask is the following: Can we use information from newspaper articles to learn about connections between political actors? As actors, we choose members of Angela Merkel’s cabinet—ZEIT Online is a German newspaper website, so they are particularly strong in reporting about German politics. We assume that if pairs of ministers are mentioned in the same article, this represents some form of connectivity between those politicians and/or their departments. Given this information, we might even learn about the centrality or importance of particular ministries within the government. To do so, we will use basic tools of network visualization.
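The co-mention idea can be illustrated without any API access. The sketch below assumes the article texts have already been retrieved as plain strings; the minister names, the texts, and both function names are placeholders invented for illustration, and simple substring matching stands in for whatever name matching the post actually uses.

```python
from collections import Counter
from itertools import combinations


def co_mention_edges(articles, names):
    """Count, for each pair of names, the articles that mention both."""
    edges = Counter()
    for text in articles:
        # Sort so that each pair is always stored in the same order.
        mentioned = sorted(name for name in names if name in text)
        for pair in combinations(mentioned, 2):
            edges[pair] += 1
    return edges


def weighted_degree(edges):
    """Sum the edge weights attached to each node, a simple centrality measure."""
    degree = Counter()
    for (a, b), weight in edges.items():
        degree[a] += weight
        degree[b] += weight
    return degree


# Placeholder names and texts, invented for illustration only.
names = ["Minister A", "Minister B", "Minister C"]
articles = [
    "Minister A met Minister B to discuss the budget.",
    "Minister B and Minister C presented a joint proposal.",
    "Minister A, Minister B and Minister C attended the summit.",
    "Minister C gave a speech.",
]
edges = co_mention_edges(articles, names)
degree = weighted_degree(edges)
```

The edge counts can be passed to any graph library as edge weights for drawing. The weighted degree gives a first, rough indication of which node is most connected.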

10 Significant Visualisation Developments: January to June 2015

1. Tessellations of the Nations
2. Mobile or desktop first?
3. Dear Design Student
4. Mike Bostock’s New Chapter
5. Guardian Graphics
6. Writer’s Block
7. Mike Monteiro IxDA Talk
8. Dear Data
9. Design/Redesign
10. What is code?

A/B Testing with Hierarchical Models in Python

In this post, I discuss a method for A/B testing using Beta-Binomial hierarchical models to correct for a common pitfall when testing multiple hypotheses. I will compare it to the classical method of using Bernoulli models to obtain a p-value, and cover other advantages hierarchical models have over the classical approach. My Python code is available on Domino.
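For orientation, here is a minimal sketch of the classical side of that comparison and of the multiple-testing pitfall. It is not the post's code and does not implement the hierarchical model. It assumes a pooled two-proportion z-test as the classical test and assumes independent tests when computing the chance of at least one false positive. Both function names are hypothetical.

```python
import math


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: both variants share one conversion rate."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("pooled rate is 0 or 1; the z-test is undefined")
    z = (rate_b - rate_a) / std_err
    # erfc(|z| / sqrt(2)) is the two-sided tail area of a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


def family_wise_error(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


# Hypothetical counts: 100/1000 conversions for A, 130/1000 for B.
p_value = two_proportion_p_value(100, 1000, 130, 1000)

# Running 20 independent tests at alpha = 0.05 inflates the error rate.
error_rate = family_wise_error(0.05, 20)
```

With 20 independent tests at the 0.05 level, the chance of at least one spurious "winner" is roughly 64%. This is the kind of inflation that a correction for multiple hypotheses is meant to address.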

Anaconda Data Science Platform for R, Python, or both

Got R, Python, or both? Download conda, the leading package and environment manager for data science, which works with both R and Python packages.