Descriptive Statistics

Statistics is the science of collecting data and analyzing them to infer proportions (sample) that are representative of the population. In other words, statistics is interpreting data in order to make predictions for the population.

The cold start problem: how to build your machine learning portfolio

I’m a physicist who works at a YC startup. Our job is to help new grads get hired into their first machine learning jobs. Some time ago, I wrote about the things you should do to get hired into your first machine learning job. I said in that post that one thing you should do is build a portfolio of your personal machine learning projects. But I left out the part about how to actually to do that, so in this post, I’ll tell you how. [1] Because of what our startup does, I’ve seen hundreds of examples of personal projects that ranged from very good to very bad. Let me tell you about two of the very good ones.

Should Data Scientists Know How To Write Production Code?

Why we – as a data scientist – should care about writing production code in the first place? Because this is where our analysis and models truly add values to our end users. Without the deployment of models after agonizing months (or even years) of models development, models will always remain as models if they don’t bring any benefits to customers or end users. All those long hours of data collection and cleaning, models building and optimization, and presentation are meant to show that your models are able to generate results and insights to reach business objectives. Once you’ve successfully convinced stakeholders (provided your models are robust, analysis makes sense from the business perspective, and the results are able to achieve the business goals), deployment phase will not be too far away, and that is when you need to put models into production by delivering production-level code. To be brutally honest, your boss doesn’t really care what models you used. What he/she cares is simply the RESULTS. So deliver the results and you’re good to go.

Choosing between TensorFlow/Keras, BigQuery ML, and AutoML Natural Language for text classification

Comparing text classification done three ways on Google Cloud Platfor: Google Cloud Platform offers you three¹ ways to carry out machine learning:
• Keras with a TensorFlow backend to build custom, deep learning models that are trained on Cloud ML Engine
• BigQuery ML to build linear models on structured data using just SQL
• Auto ML to train state-of-the-art deep learning models on your data without writing any code
Choose between them based on your skill set, how important additional accuracy is, and how much time/effort you are willing to devote to the problem. Use BigQuery ML for quick experimentation and easy, low-cost machine learning. Once you identify a viable ML problem using BQML, use Auto ML for code-free, state-of-the-art models. Hand-roll your own custom models only for problems where you have lots of data and enough time/effort to devote.

P2P Lending Platform Data Analysis: Exploratory Data Analysis in R – Part 2

In the previous post, we found Prosper adopted a better credit risk metric?-?Prosper Rating, for Prosepr lending platform. Knowing that Prosper Rating is composed of Prosper Score and Credit Bureau Score, we also found the Prosper Score plays a key role which makes the Prosper Rating have more discrimination than Credit Bureau Score itself. In this post, I will investigate the relationship between Prosper Score and other features in this loan data to see how do these features link to Prosper Score, and how do these features can discriminate between completed and high risk loan in this P2P lending platform.

P2P Lending Platform Data Analysis: Exploratory Data Analysis in R – Part 1

Peer-to-peer lending platform industry is thriving in recent years. Thousands of investors are making profit through these platforms; thousands of borrowers are getting money more easily. Although these platforms provide credit score and basic information of borrowers to ensure lending tradings are in a safety environment, there are still thousands of people under risk of losing money. Even Prosper?-?a leading financial peer-to-peer lending platform company, still suffered from credit risk issues before. But after their reconstruction and new credit system launched, the credit risk has been improved. It leads me want to find out stories behind the scenes. Here I will explore this Prosper data set and try to find out some patterns behind borrowers properties, different Rating type, and how they link to default loan or completed loan. Besides, I will also share some ideas abut motivations of variable exploration. After all, in the such a huge data set, if we start from knowing some basic domain knowledge, it’s easier to dig out valuable features among this big data.

Understanding and Coding a ResNet in Keras

ResNet, short for Residual Networks is a classic neural network used as a backbone for many computer vision tasks. This model was the winner of ImageNet challenge in 2015. The fundamental breakthrough with ResNet was it allowed us to train extremely deep neural networks with 150+layers successfully. Prior to ResNet training very deep neural networks was difficult due to the problem of vanishing gradients. AlexNet, the winner of ImageNet 2012 and the model that apparently kick started the focus on deep learning had only 8 convolutional layers, the VGG network had 19 and Inception or GoogleNet had 22 layers and ResNet 152 had 152 layers. In this blog we will code a ResNet-50 that is a smaller version of ResNet 152 and frequently used as a starting point for transfer learning.

How Good is Your Model? – Intro to Resampling Methods

Resampling methods are an indispensable tool in modern statistics. They involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model. This allows us to gain more information that could not be available from fitting the model only once.
Usually, the objective of a data science project is to create a model using training data, and have it make predictions on new data. Hence, the resampling methods allow us to see how the model would perform on data it has not been trained on, without collecting new data!
Two common methods will be discussed in this post:
• cross-validation
• bootstrap

Distributed Data Pre-processing using Dask, Amazon ECS and Python (Part 2)

Using Dask for EDA and Hyperparameters Optimization (HPO)

Principal Component Analysis

One of the common problems in analysis of complex data comes from a large number of variables, which requires a large amount of memory and computation power. This is where Principal Component Analysis (PCA) comes in. It is a technique to reduce the dimension of the feature space by feature extraction. For example, if we have 10 variables, in feature extraction, we create new independent variables by combining the old ten variables. By creating new variables it might seem as if more dimensions are introduced, but we select only a few variables from the newly created variables in the order of importance. Then the number of those selected variables is less than what we started with and that’s how we reduce the dimensionality.

Can automated machine learning outperform handcrafted models?

Automated machine learning (AutoML) can be used to automatically find and train machine learning models. You no longer need to create the model yourself?-?The AutoMl Algorithms will analyze your data and pick the best model automatically. But how good are those models really? Can they be compared to custom models or are they even better? Do we never need to handpick another model again? Let’s find out!

Artificial Intelligence Demystified

As we start the new year, the Tech propaganda machine is already ramping up its next generation of buzzwords, promising paradigm shifts and silver bullets that will make whole industries obsolete, enable huge efficiency gains, and make the world a better place. Blockchain, which used to top keyword search trends and social media posts, suffered a significant decline in interest, partly due to the fact that its initial hype was residual from the Bitcoin bubble. It seems that this year’s buzzword of choice is going to be Artificial Intelligence.

Development of social chatbots

Social chatbots or intelligent dialogue systems have the capability to engage in conversations with humans. Mimicking human conversations and passing Turing test has been the longest running goal of Artificial Intelligence (AI). Even in cinema, many movies like Her, and Iron Man have showcased how chatbot technologies are bringing revolutionary changes to the way humans and machines are interacting with each other. In the early 1970s, there were many attempts to create intelligent dialogue systems, but these systems were designed based on hand-crafted rules. In the year 2016 chatbots were considered ‘The Next Big Thing’. Major IT companies including Google, Microsoft, Facebook, and Amazon released their own version of chatbot platforms. These chatbots were very poor because of performance, design issues, and bad user experience.