CatBoost: A machine learning library to handle categorical (CAT) data automatically

How many of you have seen this error while building your machine learning models using “sklearn”?

Data Science Simplified Part 6: Model Selection Methods

In the last article of this series, we discussed the multivariate linear regression model. Fernando created a model that estimates the price of a car based on five input parameters.

The State of Graph Databases – Worldwide Adoption and Use Case Characteristics

If your organization is like many, you may be taking a “store everything” approach to data. After all, storage has become more affordable than ever in recent years, and due to the accessibility of cloud-based technology, capacity has become almost limitless. The clear challenge is tapping into those massive volumes of data to derive actionable business value. That requires analytics. However, using a traditional relational database may not be the best approach, because adapting relational databases to answer deeply complex questions can create performance bottlenecks and added maintenance burden for your business.
To gain a real-world perspective on how IT professionals are addressing these challenges, IBM, in partnership with TechValidate, conducted a global survey of 1,365 entrepreneurs and developers about the potential they see for graph databases as well as their current and planned use for this technology. We also queried them about how they are using graph to address problems, the benefits they are realizing, and examined how adoption of this technology differs by company size and industry.
Survey respondents spanned small, medium and large companies in diverse industries across 74 countries. Specifically, large enterprises comprised 38 percent of the responses, with small businesses representing 36 percent, and mid-sized businesses representing 19 percent. The survey population included a wide array of professional roles, including developers, architects, IT managers and business leaders, with the largest percentages being attributed to developers/programmers (44 percent), application/software architects (13 percent) and IT directors and managers (11 percent). Respondents represented a range of industries, with the majority in technical industries: computer services (42 percent) and computer software (22 percent).
This paper provides an overview of graph technology, details the results of the survey—and highlights findings that debunk some of the most popularly held views about graph technology.

RStudio v1.1 Preview: Terminal

Today we’re excited to announce availability of our first Preview Release for RStudio 1.1, a major new release which includes the following new features:
• A Connections tab which makes it easy to connect to, explore, and view data in a variety of databases.
• A Terminal tab which provides fluid shell integration with the IDE, xterm emulation, and even support for full-screen terminal applications.
• An Object Explorer which can navigate deeply nested R data structures and objects.
• A new, modern dark theme and Retina-quality icons throughout.
• Improvements to the RStudio API which add power and flexibility to RStudio add-ins and packages.
• RStudio Server Pro support for floating licensing, notifications, self-service session management, a library of professional ODBC drivers, and more.
• Dozens of other small improvements and bugfixes.

Python script to create dummy variable from text in SQL

Hyperopt tutorial for Optimizing Neural Networks’ Hyperparameters

Hyperopt is a way to search through a hyperparameter space. For example, it can use the Tree-structured Parzen Estimator (TPE) algorithm, which intelligently explores the search space while narrowing down to the estimated best parameters. It is hence a good method for meta-optimizing a neural network, which is itself an optimization problem: tuning a neural network uses gradient descent methods, while tuning the hyperparameters must be done differently because gradient descent cannot be applied. Therefore, Hyperopt can be useful not only for tuning hyperparameters such as the learning rate, but also for tuning more elaborate parameters in a flexible way, such as the number of layers of certain types, the number of neurons in a layer, or even the type of layer to use at a certain place in the network given an array of choices, each with nested tunable hyperparameters. This is an oriented random search, in contrast with a Grid Search, where hyperparameters are pre-established with fixed step increases. Random Search for Hyper-Parameter Optimization (such as what Hyperopt does) has proven to be an effective search technique. The paper about this technique sits among the most cited deep learning papers. To sum up, it is more efficient to search randomly through values and to intelligently narrow the search space than to loop over fixed sets of values for the hyperparameters.
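
The idea of searching randomly and then narrowing the search space can be sketched in plain Python. This is not Hyperopt's API and not the TPE algorithm: it is a simple two-round random search, and the objective function, the search-space layout and every name below are hypothetical stand-ins chosen for illustration.

```python
import math
import random

def objective(params):
    # Hypothetical stand-in for a validation loss (assumption, not from the
    # article): smallest when lr is 0.01 and the network has 3 layers.
    return (math.log10(params["lr"]) + 2) ** 2 + (params["layers"] - 3) ** 2

def sample(space, rng):
    # Draw the learning rate on a log scale and pick a layer count at random.
    low, high = space["log10_lr"]
    return {"lr": 10 ** rng.uniform(low, high),
            "layers": rng.choice(space["layers"])}

def random_search(objective, space, n_trials, rng):
    # Evaluate n_trials random points and keep the one with the lowest loss.
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = sample(space, rng)
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

def narrow(space, best, width=0.5):
    # Shrink the learning-rate range to a window around the best value so far.
    low, high = space["log10_lr"]
    center = math.log10(best["lr"])
    return {"log10_lr": (max(low, center - width), min(high, center + width)),
            "layers": space["layers"]}

def tune(objective, space, n_trials=30, seed=0):
    # Round 1 explores the full space; round 2 samples the narrowed space.
    rng = random.Random(seed)
    coarse = random_search(objective, space, n_trials, rng)
    fine = random_search(objective, narrow(space, coarse[0]), n_trials, rng)
    return min(coarse, fine, key=lambda result: result[1])

space = {"log10_lr": (-5.0, -1.0), "layers": [1, 2, 3, 4, 5]}
best_params, best_loss = tune(objective, space)
print(best_params, best_loss)
```

A grid search would instead loop over a fixed list of learning rates and layer counts. TPE goes further than this sketch by modelling which regions of the space have produced good results so far.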

Harness the Power of Machine Learning in Your Browser with Deeplearn.js

Machine learning (ML) has become an increasingly powerful tool, one that can be applied to a wide variety of areas spanning object recognition, language translation, health and more. However, the development of ML systems is often restricted to those with computational resources and the technical expertise to work with commonly available ML libraries. With PAIR — an initiative to study and redesign human interactions with ML — we want to open machine learning up to as many people as possible. In pursuit of that goal, we are excited to announce deeplearn.js 0.1.0, an open source WebGL-accelerated JavaScript library for machine learning that runs entirely in your browser, with no installations and no backend.

Data Version Control in Analytics DevOps Paradigm

The primary mission of DevOps is to help teams resolve various Tech Ops infrastructure, tools and pipeline issues. On the other hand, as mentioned in the conceptual review by Forbes in November 2016, industrial analytics is no longer going to be driven by data scientists alone. It requires an investment in DevOps skills, practices and supporting technology to move analytics out of the lab and into the business. There are even voices calling on Data Scientists to concentrate on agile methodology and DevOps if they want to retain their jobs in business in the long run.

Making Predictive Models Robust: Holdout vs Cross-Validation

The validation step helps you find the best parameters for your predictive model and prevent overfitting. We examine pros and cons of two popular validation strategies: the hold-out strategy and k-fold.
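
As a rough illustration of how the two strategies partition the data, here is a minimal sketch in plain Python. The function names and defaults are hypothetical, and a real project would more likely use a library's splitters. The hold-out strategy sets aside one fixed test set, while k-fold rotates the test role across k disjoint folds.

```python
import random

def holdout_split(n_samples, test_fraction=0.2, seed=0):
    # Shuffle the sample indices once and reserve a single fixed test set.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]

def k_fold_splits(n_samples, k=5, seed=0):
    # Shuffle once, deal the indices into k folds, then let each fold
    # serve as the test set exactly once.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

train_idx, test_idx = holdout_split(10)
print(len(train_idx), len(test_idx))

for train_idx, test_idx in k_fold_splits(10, k=5):
    print(sorted(test_idx))
```

With hold-out, the model is scored once on data it never saw. With k-fold, every sample is used for testing once, at the cost of training the model k times.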

Parse an Online Table into an R Dataframe – Westgard’s Biological Variation Database

From time to time I have wanted to bring an online table into an R dataframe. While in principle the data can be cut and pasted into Excel, sometimes the table is very large and sometimes the columns get goofed up in the process. Fortunately, there are a number of R tools for accomplishing this. I am just going to show one approach using the rvest package. The rvest package also makes it possible to interact with forms on webpages to request specific material which can then be scraped. I think you will see the potential if you look here. In our (simple) case, we will apply this process to Westgard’s desirable assay specifications as shown on his website. The goal is to parse out the biological variation tables, get them into a dataframe and then write them to csv or xlsx.

Supervised Learning in R: Regression

We are very excited to announce a new (paid) Win-Vector LLC video training course: Supervised Learning in R: Regression, now available on DataCamp.

More string Hacking with Regex and Rebus

For a beginner in R or any language, regular expressions might seem like a daunting task. The Rebus package in R lowers the barrier for common regular expression tasks and is useful for a beginner, or even for advanced users, for most of the common regex skills in a more intuitive yet verbose way. Check out the package and try these exercises to test your knowledge. Load stringr/stringi as well for this set of exercises. I encourage you to do this and this before working on this set.

The Twitter Waterflow Problem

I was recently introduced to the Twitter Waterflow Problem and I decided it was interesting enough to try and complete the challenge in R.
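
The excerpt does not state the problem. As it is commonly posed, you are given a list of wall heights and asked how much water is trapped between the walls after rain. The original post works in R. The sketch below uses Python instead and shows one well-known two-pointer approach, which is not necessarily the author's solution. The function name is hypothetical.

```python
def trapped_water(heights):
    # Walk inward from both ends, tracking the tallest wall seen on each side.
    # The lower side limits the water level, so that side is always the one
    # that advances.
    left, right = 0, len(heights) - 1
    left_max = right_max = 0
    total = 0
    while left < right:
        if heights[left] < heights[right]:
            left_max = max(left_max, heights[left])
            total += left_max - heights[left]
            left += 1
        else:
            right_max = max(right_max, heights[right])
            total += right_max - heights[right]
            right -= 1
    return total

print(trapped_water([2, 5, 1, 2, 3, 4, 7, 7, 6]))  # prints 10
```

In this example the water collects between the wall of height 5 and the first wall of height 7, filling the four lower columns by 4, 3, 2 and 1 units.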