Re-sampling: Amazing Results and Applications

This crash course features a new fundamental statistics theorem, arguably even more important than the central limit theorem, along with a new set of statistical rules and recipes. We discuss determining the optimum sample size, the optimum k in k-fold cross-validation, bootstrapping, new re-sampling techniques, simulations, tests of hypotheses, confidence intervals, and statistical inference, using a unified, robust, simple approach with easy formulas, efficient algorithms, and illustrations on complex data. Little statistical knowledge is required to understand and apply the methodology described here, yet it is more advanced, more general, and more applied than the standard literature on the subject. The intended audience is beginners as well as professionals in any field who face data challenges on a daily basis. This article presents statistical science in a different light, hopefully in a style more accessible, intuitive, and exciting than standard textbooks, and in a compact format that nevertheless covers a large chunk of the traditional statistical curriculum and beyond.
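To make the bootstrapping idea concrete, here is a minimal percentile-bootstrap sketch in Python; the sample data and the choice of the mean as the statistic are illustrative, not taken from the article:

```python
import random

def bootstrap_ci(data, stat, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for an arbitrary statistic."""
    rng = random.Random(seed)
    # Re-sample with replacement, compute the statistic each time, then sort
    stats = sorted(
        stat([rng.choice(data) for _ in range(len(data))])
        for _ in range(n_boot)
    )
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

sample = [12, 15, 9, 22, 17, 14, 11, 19, 16, 13]
mean = lambda xs: sum(xs) / len(xs)
low, high = bootstrap_ci(sample, mean)  # 95% CI for the mean
```

The same function works unchanged for medians, trimmed means, or any other statistic, which is the appeal of re-sampling over closed-form formulas.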

Why Use Weight of Evidence?

I have been asked why I spent so much effort developing SAS macros and R functions for monotonic binning in the WoE transformation, given the availability of other cutting-edge data mining algorithms that automatically generate predictions from whatever predictors are fed into the model. Nonetheless, what really distinguishes a good modeler from the rest is how they handle challenging data issues before feeding data into the model, including missing values, outliers, linearity, and predictability, in a scalable way that can be rolled out to hundreds or even thousands of potential model drivers in a production environment. The WoE transformation through monotonic binning provides a convenient way to address each of the aforementioned concerns.
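As a rough illustration of the WoE computation itself (the monotonic binning step is assumed to have already been done, and the bin counts below are made up), the per-bin weight of evidence and the overall information value can be computed as:

```python
import math

def woe_table(bins):
    """bins: list of (n_good, n_bad) counts per bin.
    Returns per-bin WoE values and the total information value (IV)."""
    tot_good = sum(g for g, b in bins)
    tot_bad = sum(b for g, b in bins)
    woes, iv = [], 0.0
    for g, b in bins:
        dist_g = g / tot_good          # share of goods in this bin
        dist_b = b / tot_bad           # share of bads in this bin
        w = math.log(dist_g / dist_b)  # weight of evidence
        iv += (dist_g - dist_b) * w    # IV contribution
        woes.append(w)
    return woes, iv

# Example: three bins of a predictor with (goods, bads) counts
woes, iv = woe_table([(100, 40), (150, 30), (250, 30)])
```

With monotonic binning, the WoE values increase (or decrease) steadily across bins, which is exactly the linearity property the passage refers to.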

Fast food, causality and R packages, part 2

Optimal Control: LQR

As of April 2019, over 400,000 projects have been launched on Kickstarter. With crowdfunding becoming an increasingly popular method of raising capital, I thought it would be interesting to explore the data behind Kickstarter projects and also apply a machine learning model to predict whether or not a project will be successful based on its category and fundraising target.
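A model of this kind can be sketched with a hand-rolled logistic regression; the features, values, and labels below are illustrative toy data, not the actual Kickstarter dataset:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Plain stochastic-gradient-descent logistic regression (bias + weights)."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            err = sigmoid(z) - yi
            w[0] -= lr * err
            for j, xj in enumerate(xi):
                w[j + 1] -= lr * err * xj
    return w

def predict(w, xi):
    """Probability that the project succeeds."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))

# Toy features: [log10(funding goal in $), is_games_category]
X = [[3.0, 1], [3.5, 0], [5.0, 0], [4.8, 1], [3.2, 1], [5.5, 0]]
y = [1, 1, 0, 0, 1, 0]  # 1 = successfully funded
w = fit_logistic(X, y)
```

In practice the category would be one-hot encoded across many levels, but the mechanics are the same.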

Creating Smart Knowledge Base Systems (KBS) using an advanced NLP library

Enterprises, institutions, and other large organizations have built up their knowledge over the years of their existence by recording it in books, journals/articles, documents, etc. Continuous access to this knowledge for their employees, students, and teaching/research communities is essential for sustained operations. KM (Knowledge Management) tools on the market help address this need to a certain extent by creating a knowledge repository of sorts and enabling access to it. KM tools require a document to be tagged (either manually or automatically) in order for it to be easily searchable by users.

Know Your Metrics

This series of articles was designed to explain how to use Python in a simple way to fuel your company’s growth by applying a predictive approach to all your actions. It will be a combination of programming, data analysis, and machine learning.

Visualizing Naive Bayesian Theorem

Bayes’ theorem visualized through Venn, tree, and pie diagrams
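As a numerical companion to the diagrams, Bayes’ theorem for a binary test can be computed directly; the prevalence, sensitivity, and false-positive rate below are illustrative numbers, not from the article:

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' theorem."""
    # Total probability of a positive test (true positives + false positives)
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

# 1% prevalence, 95% sensitivity, 5% false-positive rate
p = posterior(0.01, 0.95, 0.05)
```

Even with a 95%-sensitive test, the posterior here is only about 16%, which is exactly the counter-intuitive area that the Venn, tree, and pie views help explain.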

Reinforcement Learning for Real-World Robotics

Robots are pervasive throughout modern industry. Unlike in most science-fiction works of the previous century, humanoid robots are still not doing our dirty dishes and taking out the trash, nor are Schwarzenegger-looking terminators fighting on the battlefields (at least for now…). But in almost every manufacturing facility, robots are doing the kind of tedious and demanding work that human workers used to do just a few decades ago. Anything that requires repetitive and precise work in an environment that can be carefully controlled and monitored is a good candidate for a robot to replace a human, and today cutting-edge research is fast approaching the possibility of automating tasks that are very hard even though we might consider them tedious (such as driving).

Validation Methods

In this post we will discuss the following concepts, which all aim to evaluate the performance of a classification model: 1. cross-validation of a model, 2. the confusion matrix, 3. the ROC curve, and 4. Cohen’s kappa score.
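As a sketch of two of these concepts, a confusion matrix and Cohen’s kappa can be computed from scratch for binary labels; the toy label vectors below are illustrative:

```python
def confusion_matrix(y_true, y_pred):
    """2x2 counts laid out as [[TN, FP], [FN, TP]] for 0/1 labels."""
    m = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

def cohens_kappa(y_true, y_pred):
    """Agreement between predictions and truth, corrected for chance."""
    n = len(y_true)
    (tn, fp), (fn, tp) = confusion_matrix(y_true, y_pred)
    po = (tn + tp) / n                         # observed agreement (accuracy)
    pe = ((tp + fp) / n) * ((tp + fn) / n) \
       + ((tn + fn) / n) * ((tn + fp) / n)     # agreement expected by chance
    return (po - pe) / (1 - pe)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
```

Kappa of 0 means no better than chance and 1 means perfect agreement, which is why it is a stricter summary than raw accuracy on imbalanced data.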

My First Machine Learning Project: Designing a Hate Speech Detecting Algorithm

Greetings, ladies and gentlemen! I’m Yoo Beyoung Woo, an aspiring amateur data scientist who likes conducting projects involving data analysis and machine learning. In this post, I’m going to share a project that I’ve recently been working on: building a hate-speech-detecting artificial intelligence. I’ve never made an artificial intelligence program before, and since hate speech detection is one of the most basic projects that beginners in machine learning can easily approach, I’ve decided to give it a try! Please enjoy~!

Are we Asking too Much of Algorithms?

Although algorithms will get better at advising on the most difficult of judgement calls, we need to remain realistic about where the boundaries currently lie and accept that humans still have a critical role to play in decision-making.

A gentle guide into Decision Trees with Python

The decision tree algorithm is a supervised learning model used to predict a dependent variable from a series of training variables. Decision tree algorithms can be used for both classification and regression. In this particular project, I am going to illustrate classification of a discrete random variable.
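The core of a decision tree is choosing, at each node, the split that most purifies the child nodes. A minimal sketch of that single step, using Gini impurity on one numeric feature (toy data, not the project’s dataset), might look like:

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 = pure)."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    """Threshold on a single feature minimizing weighted child impurity."""
    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

xs = [1, 2, 3, 10, 11, 12]
ys = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(xs, ys)  # perfectly separable toy data
```

A full tree simply applies this search recursively over all features until the leaves are pure or a depth limit is reached.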

How To Design A Spam Filtering System with Machine Learning Algorithm

In my day-to-day work at Visa as a software developer, email is one of the most important tools for communication, and effective communication depends on good spam filtering. So how does a spam filtering system actually work? Can we design something similar from scratch?
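One classic answer to this design question is a naive Bayes classifier over word counts. A minimal from-scratch sketch, with a tiny made-up training corpus for illustration, might look like:

```python
import math
from collections import Counter

def train(spam_docs, ham_docs):
    """Collect word counts per class; smoothing happens at scoring time."""
    spam_counts = Counter(w for d in spam_docs for w in d.split())
    ham_counts = Counter(w for d in ham_docs for w in d.split())
    vocab = set(spam_counts) | set(ham_counts)
    return spam_counts, ham_counts, vocab, len(spam_docs), len(ham_docs)

def is_spam(msg, model):
    """Compare log-posteriors with Laplace (add-one) smoothing."""
    spam_counts, ham_counts, vocab, n_spam, n_ham = model
    log_spam = math.log(n_spam / (n_spam + n_ham))  # class priors
    log_ham = math.log(n_ham / (n_spam + n_ham))
    v = len(vocab)
    for w in msg.split():
        log_spam += math.log((spam_counts[w] + 1) / (sum(spam_counts.values()) + v))
        log_ham += math.log((ham_counts[w] + 1) / (sum(ham_counts.values()) + v))
    return log_spam > log_ham

spam = ["win money now", "free prize money"]
ham = ["meeting at noon", "project status report"]
model = train(spam, ham)
```

A production filter would add tokenization, TF-IDF or similar weighting, and far more training data, but the probabilistic skeleton is the same.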

General Language Understanding Evaluation (GLUE) Benchmark

The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. GLUE consists of:
• A benchmark of nine sentence- or sentence-pair language understanding tasks built on established existing datasets and selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty,
• A diagnostic dataset designed to evaluate and analyze model performance with respect to a wide range of linguistic phenomena found in natural language, and
• A public leaderboard for tracking performance on the benchmark and a dashboard for visualizing the performance of models on the diagnostic set.
The format of the GLUE benchmark is model-agnostic, so any system capable of processing sentences and sentence pairs and producing corresponding predictions is eligible to participate. The benchmark tasks are selected so as to favor models that share information across tasks using parameter sharing or other transfer learning techniques. The ultimate goal of GLUE is to drive research in the development of general and robust natural language understanding systems.

The almost-5-minute Data Science Pipeline.

‘How can I minimize the development time of a Data Science Pipeline and check all the required task boxes at the same time?’ Now is a great time to work on Data Science problems, since there are many tools that can help you solve them using different approaches and focusing on different goals. Some tools focus on achieving greater model performance, while others help you minimize development time. In this post, we focus on the latter. The end target of a firm is to offer products that maximize the utility of its consumers. By maximizing that utility, the firm can attract users and sustain a profitable business model. Out of all the ideas that will be examined, only a small sample will translate into a successful product; thus, by exploring more ideas over the same time period and using your time wisely, you can create more successful products.