Reproducible Research: Write your Clinical Chemistry paper using R Markdown

This blog post shows you how to write a reproducible article in the field of clinical chemistry using R Markdown. The only things that will change from journal to journal are the reference formatting and perhaps the section numbering. The source code itself is provided so that you can use it as a template.

An Introductory Guide to Regularized Greedy Forests (RGF) with a case study in Python

As a data scientist participating in multiple machine learning competitions, I am always on the lookout for “not-yet-popular” algorithms. The way I define them is that these algorithms by themselves may not end up becoming competition winners, but they bring a different way of making predictions to the table. Why am I interested in these algorithms? The key is to read “by themselves” in the statement above. These algorithms can be used in ensemble models to get an extra edge over the most popular gradient boosting algorithms (XGBoost, LightGBM, etc.). This article talks about one such algorithm called Regularized Greedy Forests (RGF). It performs comparably to (if not better than) boosting algorithms on a large number of datasets. It produces less correlated predictions and does well in ensembles with other tree boosting models.
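To make the ensembling argument concrete, here is a minimal sketch in plain Python. The prediction lists are made-up numbers constructed for illustration, not output from RGF or any boosting library, and `pearson`, `rmse`, and `blend` are hypothetical helper names. The sketch only shows how one might check how correlated two models' errors are and what a simple average of their predictions does to the error.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def rmse(preds, targets):
    """Root mean squared error of predictions against targets."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets))

def blend(preds_a, preds_b, weight_a=0.5):
    """Weighted average of two models' predictions."""
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(preds_a, preds_b)]

# Illustrative (invented) targets and predictions from two hypothetical models.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
model_a = [1.4, 1.7, 3.5, 3.6, 5.3]
model_b = [0.7, 2.4, 2.6, 4.3, 4.8]

# Errors of each model; weakly or negatively correlated errors tend to cancel.
errors_a = [p - t for p, t in zip(model_a, y_true)]
errors_b = [p - t for p, t in zip(model_b, y_true)]
error_correlation = pearson(errors_a, errors_b)

blended = blend(model_a, model_b)

print("error correlation:", round(error_correlation, 3))
print("RMSE model A:", round(rmse(model_a, y_true), 3))
print("RMSE model B:", round(rmse(model_b, y_true), 3))
print("RMSE blend:  ", round(rmse(blended, y_true), 3))
```

The toy data is deliberately chosen so that the two models err in opposite directions; on real data the gain from blending is usually smaller and depends on how correlated the models' errors actually are.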

Data Science Graphs (without the code!)

If you read my blog then you’ll probably realize there are a few data-related topics that I could talk about for days. This post dives into a tool called RAW Graphs, which addresses two of them, outlined below.

Analyzing GitHub Projects on Kaggle using Python and BigQuery

In this kernel, I’ll demonstrate how to use Sohier’s BigQuery helper module to safely query the largest BigQuery dataset we’ve made available on Kaggle, GitHub Repos. The dataset weighs in at 3TB total, so you can see how easy it would be for me to quickly exhaust my 5TB/30-day quota by scanning lots of large tables. The bq_helper module simplifies the common read-only tasks we can do using the BigQuery Python client library on Kaggle and makes it a cinch to manage resource usage. This helper class works only in Kernels because we handle authentication on our side. This means you don’t need to worry about managing credentials or entering credit card information to have fun learning how to work with BigQuery datasets.
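The idea of guarding a scan quota can be sketched in a few lines of plain Python. This is not the bq_helper API: `ScanBudget` and `run_if_safe` are hypothetical names, the size estimate is passed in by hand (in a real kernel it would come from BigQuery), and 1TB is treated as 1000GB for simplicity.

```python
class ScanBudget:
    """Hypothetical tracker for a scan quota, measured in gigabytes."""

    def __init__(self, quota_gb):
        self.quota_gb = quota_gb
        self.used_gb = 0.0

    def remaining_gb(self):
        return self.quota_gb - self.used_gb

    def run_if_safe(self, estimated_gb, run_query, max_gb_scanned=1.0):
        """Call run_query() only if the estimate fits both the per-query cap
        and the remaining quota; otherwise return None without running it."""
        if estimated_gb > max_gb_scanned:
            return None
        if estimated_gb > self.remaining_gb():
            return None
        self.used_gb += estimated_gb
        return run_query()

# Assumed quota from the text: 5TB per 30 days.
budget = ScanBudget(quota_gb=5 * 1000)

# A small query (assumed 0.2GB estimate) passes the 1GB per-query cap.
small = budget.run_if_safe(0.2, lambda: "rows", max_gb_scanned=1.0)

# A scan of the whole 3TB dataset is refused by the same cap.
large = budget.run_if_safe(3000.0, lambda: "rows", max_gb_scanned=1.0)

print(small, large, budget.used_gb)
```

Checking the estimate before running anything is the design choice that matters here: a refused query costs nothing against the quota.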

AI for the Enterprise: The Citizen Data Scientist

Enterprise AI is the new hot topic in technology, especially as the consumer space blossoms with sales and adoption. However, the assumption cannot be that the same approach for the consumer market can be taken directly to the enterprise. Consumers push the expectations of AI for the business to new heights – and if not carefully prepared, solutions will inevitably fail. In fact, we have seen this already, as software vendors – ranging from start-ups to large software conglomerates – watch users struggle to adopt, understand, and ultimately see the value in AI. In the consumer space, customers are generally willing to embrace new technology, take more risks, or even understand the current limitations. However, in a business setting, too much is at stake. For instance, a user cannot trust AI solutions to fully execute or even assist with a potentially multi-million-dollar task related to production, sales, or distribution. Users cannot assume the same risk, with so much more on the line. Therefore, the skills and objectives must be clearly defined, with specific parameters, in addition to constant feedback cycles.

Documenting R packages: roxygen2 vs. direct Rd input

The LaTeX-like syntax of Rd files, combined with the fact that the actual R objects live in a separate place, feels burdensome for many developers. As a consequence, there are a handful of tools aimed at improving the documentation process, one of which is roxygen2. We may say that the R community nowadays is divided between those who use roxygen2 and those who don’t.

Cost-Effective BigQuery with R

Companies using Google BigQuery for production analytics often run into the following problem: the company has a large user hit table that spans many years. Since queries are billed based on the fields accessed, and not on the date ranges queried, every query on the table is billed for all available days, which becomes increasingly wasteful as the table grows.
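Some back-of-the-envelope arithmetic illustrates the billing effect. Every number below (table history, data volume per column, price per terabyte) is an assumption chosen for illustration, not a figure from the post or a current Google price, and `scan_cost_usd` is a hypothetical helper.

```python
def scan_cost_usd(days_scanned, gb_per_day_per_column, columns_selected, usd_per_tb):
    """Estimated cost of a query billed by the volume of the columns it reads."""
    gb_scanned = days_scanned * gb_per_day_per_column * columns_selected
    return gb_scanned / 1000 * usd_per_tb  # treating 1TB as 1000GB

# Assumptions: 5 years of daily hits, 0.5GB per column per day,
# a query that reads 4 columns, and an assumed price of 5 USD per TB.
DAYS_IN_TABLE = 5 * 365
GB_PER_DAY_PER_COLUMN = 0.5
COLUMNS = 4
USD_PER_TB = 5.0

# One big table: a query about last week is still billed for every day.
whole_table = scan_cost_usd(DAYS_IN_TABLE, GB_PER_DAY_PER_COLUMN, COLUMNS, USD_PER_TB)

# If the scan could be limited to the 7 days actually needed.
last_week_only = scan_cost_usd(7, GB_PER_DAY_PER_COLUMN, COLUMNS, USD_PER_TB)

print(f"whole table:    {whole_table:.2f} USD")
print(f"last week only: {last_week_only:.2f} USD")
```

Under these assumptions the gap between the two figures widens with every day of history added to the table, which is the wastefulness the post describes.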

Process discovery from event data: Relating models and logs through abstractions

Event data are collected in logistics, manufacturing, finance, health care, customer relationship management, e-learning, e-government, and many other domains. The events found in these domains typically refer to activities executed by resources at particular times and for a particular case (i.e., process instances). Process mining techniques are able to exploit such data. In this article, we focus on process discovery. However, process mining also includes conformance checking, performance analysis, decision mining, organizational mining, predictions, recommendations, and so on. These techniques help to diagnose problems and improve processes. All process mining techniques involve both event data and process models. Therefore, a typical first step is to automatically learn a control-flow model from the event data. This is very challenging, but in recent years, many powerful discovery techniques have been developed. It is not easy to compare these techniques since they use different representations and make different assumptions. Users often need to resort to trying different algorithms in an ad-hoc manner. Developers of new techniques are often trying to solve specific instances of a more general problem. Therefore, we aim to unify existing approaches by focusing on log and model abstractions. These abstractions link observed and modeled behavior: Concrete behaviors recorded in event logs are related to possible behaviors represented by process models. Hence, such behavioral abstractions provide an “interface” between both of them. We discuss four discovery approaches involving three abstractions and different types of process models (Petri nets, block-structured models, and declarative models). The goal is to provide a comprehensive understanding of process discovery and show how to develop new techniques. Examples illustrate the different approaches and pointers to software are given. 
The discussion of abstractions and process representations also serves to reflect on the gap between the process mining literature and commercial process mining tools. This helps users select an appropriate process discovery technique. Moreover, structuring the role of internal abstractions and representations helps broaden the view and facilitates the creation of new discovery approaches.
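As a small illustration of what a log abstraction can look like, the sketch below computes a directly-follows relation from an event log. The abstract does not name its three abstractions, so choosing directly-follows here is an assumption, as are the toy event log and the function name `directly_follows`.

```python
from collections import Counter

def directly_follows(log):
    """Count how often activity a is immediately followed by activity b
    within the same trace (one trace per case)."""
    counts = Counter()
    for trace in log:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts

# Invented event log: each inner list is the ordered activities of one case.
event_log = [
    ["register", "check", "approve", "pay"],
    ["register", "check", "reject"],
    ["register", "check", "approve", "pay"],
]

dfg = directly_follows(event_log)

for (a, b), n in sorted(dfg.items()):
    print(f"{a} -> {b}: {n}")
```

The resulting counts summarize the observed behavior independently of any particular modeling language, which is what lets an abstraction of this kind act as an interface between a log and a process model.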