Dummy Variable for Examining Structural Instability in Regression: An Alternative to Chow Test

One of the fast-growing economies in the era of globalization is the Ethiopian economy. Among the lower-income countries, it has emerged as one of the rare countries to achieve a double-digit growth rate in Gross Domestic Product (GDP). However, there is a great deal of debate regarding the double-digit growth rate, especially during the recent global recession period. So, it becomes a question of empirical research whether there is a structural change in the relationship between the GDP of Ethiopia and the regressor (time). How do we find out that a structural change has in fact occurred? To answer this question, we consider the GDP of Ethiopia (measured in constant 2010 US$) over the period 1981 to 2015. Like many other countries in the world, Ethiopia adopted a policy of regulated globalization during the early nineties of the last century. So, our aim is to test whether the GDP of Ethiopia has undergone any structural change following the major policy shift brought about by the adoption of the globalization policy. To answer this question, we have two options in statistical and econometric research. The most important classes of tests on structural change are the tests from the generalized fluctuation test framework (Kuan and Hornik, 1995) on the one hand and tests based on F statistics (Hansen, 1992; Andrews, 1993; Andrews and Ploberger, 1994) on the other. The first class includes in particular the CUSUM and MOSUM tests and the fluctuation test, while the Chow and the supF tests belong to the latter. A topic that has gained interest more recently is monitoring structural change, i.e., starting after a history phase (without structural changes) to analyze new observations so that a structural change can be detected as soon after its occurrence as possible.
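
As a rough illustration of the F-statistic approach, the sketch below fits a linear time trend to a pooled sample and to the two sub-periods on either side of a candidate break. With a dummy D equal to 1 after the break, the model y = a + b·t + c·D + d·(D·t) has the same residual sum of squares as the two separate sub-period fits, so the F-test of c = d = 0 coincides with the Chow statistic. The series is made up for illustration and is not Ethiopian GDP data; the break year 1992 and the helper names `fit_line`, `break_f_statistic`, and `make_series` are assumptions, not part of the original article.

```python
def fit_line(t, y):
    """Least-squares fit of y = a + b*t; returns (a, b, residual sum of squares)."""
    n = len(t)
    t_bar = sum(t) / n
    y_bar = sum(y) / n
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    b = sxy / sxx
    a = y_bar - b * t_bar
    rss = sum((yi - a - b * ti) ** 2 for ti, yi in zip(t, y))
    return a, b, rss


def break_f_statistic(t, y, break_index, k=2):
    """F statistic for a break at break_index (k = parameters per regime).

    The unrestricted RSS is the sum of the two sub-period RSS values, which
    equals the RSS of the pooled model with intercept and slope dummies.
    """
    _, _, rss_pooled = fit_line(t, y)
    _, _, rss_before = fit_line(t[:break_index], y[:break_index])
    _, _, rss_after = fit_line(t[break_index:], y[break_index:])
    rss_unrestricted = rss_before + rss_after
    n = len(t)
    return ((rss_pooled - rss_unrestricted) / k) / (rss_unrestricted / (n - 2 * k))


# Hypothetical annual series, 1981-2015, with an assumed break in 1992.
years = list(range(1981, 2016))
break_index = years.index(1992)


def make_series(slope_after):
    """Made-up trend with slope 1 before the break and slope_after afterwards."""
    values = []
    for i in range(len(years)):
        wiggle = 0.5 * ((i % 3) - 1)  # small deterministic disturbance
        if i < break_index:
            level = 10.0 + 1.0 * i
        else:
            level = 10.0 + 1.0 * break_index + slope_after * (i - break_index)
        values.append(level + wiggle)
    return values


f_break = break_f_statistic(years, make_series(3.0), break_index)
f_stable = break_f_statistic(years, make_series(1.0), break_index)
print("F with a slope change:", round(f_break, 2))
print("F without a slope change:", round(f_stable, 2))
```

The statistic is compared with the critical value of an F distribution with k and n - 2k degrees of freedom. A value well above it suggests that the intercept, the slope, or both changed at the candidate break.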

Exploring data with pandas and MapD using Apache Arrow

At MapD, we’ve long been big fans of the PyData stack, and are constantly working on ways for our open source GPU-accelerated analytic SQL engine to play nicely with the terrific tools in the most popular stack that supports open data science. We are founding collaborators of GOAI (the GPU Open Analytics Initiative), working with the awesome folks at Anaconda and H2O.ai, and our friends at NVIDIA. In GOAI, we use Apache Arrow to mediate efficient, high-performance data interchange for analytics and AI workflows. A big reason for doing this is to make MapD itself easily accessible to Python tools. For starters, this means supporting modern Python database interfaces like DBAPI. pymapd (built with help from Continuum) is a pythonic interface to MapD’s SQL engine supporting DBAPI 2.0, and it has some extra goodness in being able to use our in-built Arrow support for both data loading and query result output.

The Line Between Commercial and Industrial Data Science

The purpose, tasks, and required skillsets are dramatically different for data scientists and their work in commercial and industrial environments.

10 Surprising Ways Machine Learning is Being Used Today

1. Predicting whether a criminal defendant is a flight risk.
2. Using Twitter to diagnose psychopathy.
3. Helping cyclists win the Tour de France.
4. Identifying endangered whales.
5. Translating legalese.
6. Preventing money laundering.
7. Figuring out which message board threads will be closed.
8. Predicting hospital wait times.
9. Calculating auction prices.
10. Predicting earthquakes.

How to Improve my ML Algorithm? Lessons from Andrew Ng’s experience

You have worked for weeks on building your machine learning system and the performance is not something you are satisfied with. You think of multiple ways to improve your algorithm’s performance, viz. collect more data, add more hidden units, add more layers, change the network architecture, change the basic algorithm, etc. But which one of these will give the best improvement on your system? You can either try them all, invest a lot of time, and find out what works for you. OR! You can use the following tips from Ng’s experience.

The 10 Deep Learning Methods AI Practitioners Need to Apply

Interest in machine learning has exploded over the past decade. You see machine learning in computer science programs, industry conferences, and the Wall Street Journal almost daily. For all the talk about machine learning, many conflate what it can do with what they wish it could do. Fundamentally, machine learning is using algorithms to extract information from raw data and represent it in some type of model. We use this model to infer things about other data we have not yet modeled.

TensorFlow for Short-Term Stocks Prediction

In this post you will see an application of Convolutional Neural Networks to stock market prediction, using a combination of stock prices with sentiment analysis.

Top Data Science and Machine Learning Methods Used in 2017

The most used methods are Regression, Clustering, Visualization, Decision Trees/Rules, and Random Forests; Deep Learning is used by only 20% of respondents; we also analyze which methods are most ‘industrial’ and most ‘academic’.

Robust Algorithms for Machine Learning

Machine learning is often held out as a magical solution to hard problems that will absolve us mere humans from ever having to actually learn anything. But in reality, for data scientists and machine learning engineers, there are a lot of problems that are much more difficult to deal with than simple object recognition in images, or playing board games with finite rule sets. For the majority of these problems, it pays to have a variety of approaches to help you reduce the noise and anomalies, to focus on something more tractable. One approach is to design more robust algorithms where the testing error is consistent with the training error, or the performance is stable after adding noise to the dataset. The idea of any traditional (non-Bayesian) statistical test is the same: we compute a number (called a ‘statistic’) from the data, and use the known distribution of that number to answer the question, ‘What are the odds of this happening by chance?’ The answer to that question is the p-value.
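
To make the statistic-then-p-value idea concrete, here is a minimal sketch of a permutation test. It is a swapped-in technique, not necessarily the one the article uses: instead of relying on a known closed-form distribution, it approximates the distribution of the statistic under "no difference" by shuffling group labels. The function name `permutation_p_value`, the sample values, and the parameter defaults are all illustrative assumptions.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Approximate two-sided p-value for a difference in group means.

    The statistic is the absolute difference in means. Labels are shuffled
    repeatedly to see how often chance alone yields a statistic at least
    as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    split = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        statistic = abs(mean(pooled[:split]) - mean(pooled[split:]))
        if statistic >= observed:
            hits += 1
    # The +1 terms count the observed arrangement as one of the permutations.
    return (hits + 1) / (n_permutations + 1)


# Made-up measurements for illustration only.
clearly_different = permutation_p_value([1, 2, 3, 4, 5, 6], [10, 11, 12, 13, 14, 15])
identical_groups = permutation_p_value([1, 2, 3, 4], [1, 2, 3, 4])
print("p-value, well-separated groups:", clearly_different)
print("p-value, identical groups:", identical_groups)
```

A small p-value says the observed statistic would rarely arise from shuffling alone. It does not say how large or how important the difference is.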

Monitoring and Improving the Performance of Machine Learning Models

It’s critical to have “humans in the loop” when automating the deployment of machine learning (ML) models. Why? Because models often perform worse over time. This course covers the human-directed safeguards that prevent poorly performing models from deploying into production and the techniques for evaluating models over time. We’ll use ModelDB to capture the appropriate metrics that help you identify poorly performing models. We’ll review the many factors that affect model performance (e.g., changing users and user preferences, stale data, etc.) and the variables that lose predictive power. We’ll explain how to utilize classification and prediction scoring methods such as precision-recall, ROC, and Jaccard similarity. We’ll also show you how ModelDB allows you to track provenance and metrics for model performance and health; how to integrate ModelDB with Spark ML; and how to use the ModelDB APIs to store information when training models in Spark ML. Learners should have basic familiarity with the following: Scala or Python; Hadoop, Spark, or Pandas; SBT or Maven; cloud platforms like Amazon Web Services; Bash, Docker, and REST.
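
For readers unfamiliar with the scoring methods named above, the following sketch computes precision, recall, and Jaccard similarity for a binary classifier from scratch. It is independent of ModelDB and Spark ML, whose APIs are not shown here. The function name `binary_scores` and the label vectors are made up for illustration.

```python
def binary_scores(y_true, y_pred):
    """Precision, recall, and Jaccard similarity for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Jaccard: overlap of predicted and actual positives over their union.
    jaccard = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return {"precision": precision, "recall": recall, "jaccard": jaccard}


# Hypothetical labels: 1 = positive class, 0 = negative class.
actual = [1, 1, 1, 0, 0, 1, 0, 0]
predicted = [1, 0, 1, 1, 0, 1, 0, 0]
scores = binary_scores(actual, predicted)
print(scores)
```

Recomputing these scores on fresh data at regular intervals is one simple way to notice a model whose performance is degrading over time.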

Training and Exporting Machine Learning Models in Spark

Spark ML provides a rich set of tools and models for training, scoring, evaluating, and exporting machine learning models. This video walks you through each step in the process. You’ll explore the basics of Spark’s DataFrames, Transformer, Estimator, Pipeline, and Parameter, and how to utilize the Spark API to create model uniformity and comparability. You’ll learn how to create meaningful models and labels from a raw dataset; train and score a variety of models; target price predictions; compare results using MAE, MSE, and other scores; and employ the SparkML evaluator to automate the parameter-tuning process using cross validation. To complete the lesson, you’ll learn to export and serialize a Spark trained model as PMML (an industry standard for model serialization), so you can deploy in applications outside the Spark cluster environment.
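
The two error scores mentioned above are simple to state. This sketch computes mean absolute error (MAE) and mean squared error (MSE) in plain Python rather than through Spark's evaluators. The function names and the price figures are illustrative assumptions.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted|; in the same units as the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of (actual - predicted)^2; penalizes large misses more heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up prices and predictions for illustration only.
actual_prices = [3.0, 5.0, 2.5, 7.0]
predicted_prices = [2.5, 5.0, 4.0, 8.0]
mae = mean_absolute_error(actual_prices, predicted_prices)
mse = mean_squared_error(actual_prices, predicted_prices)
print("MAE:", mae, "MSE:", mse)
```

Because MSE squares each error, two models with the same MAE can rank differently on MSE when one of them makes a few large misses.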

Deploying Machine Learning Models as Microservices Using Docker

Modern applications running in the cloud often rely on REST-based microservices architectures built on Docker containers. Docker enables your applications to communicate with one another and to compose and scale various components. Data scientists use these techniques to efficiently scale their machine learning models to production applications. This video teaches you how to deploy machine learning models behind a REST API, to serve low-latency requests from applications, without using a Spark cluster. In the process, you’ll learn how to export models trained in Spark ML; how to work with Docker, a convenient way to build, deploy, and ship application code for microservices; and how a model scoring service should support single on-demand predictions and bulk predictions. Learners should have basic familiarity with the following: Scala or Python; Hadoop, Spark, or Pandas; SBT or Maven; cloud platforms like Amazon Web Services; Bash, Docker, and REST.

Deploying Spark ML Pipelines in Production on AWS

Translating a Spark application from running in a local environment to running on a production cluster in the cloud requires several critical steps, including publishing artifacts, installing dependencies, and defining the steps in a pipeline. This video is a hands-on guide through the process of deploying your Spark ML pipelines in production. You’ll learn how to create a pipeline that supports model reproducibility—making your machine learning models more reliable—and how to update your pipeline incrementally as the underlying data change. Learners should have basic familiarity with the following: Scala or Python; Hadoop, Spark, or Pandas; SBT or Maven; Amazon Web Services such as S3, EMR, and EC2; Bash, Docker, and REST.

An Introduction to Machine Learning Models in Production

This course lays out the common architecture, infrastructure, and theoretical considerations for managing an enterprise machine learning (ML) model pipeline. Because automation is the key to effective operations, you’ll learn about open source tools like Spark, Hive, ModelDB, and Docker and how they’re used to bridge the gap between individual models and a reproducible pipeline. You’ll also learn how effective data teams operate; why they use a common process for building, training, deploying, and maintaining ML models; and how they’re able to seamlessly push models into production. The course is designed for the data engineer transitioning to the cloud and for the data scientist ready to use model deployment pipelines that are reproducible and automated. Learners should have basic familiarity with: cloud platforms like Amazon Web Services; Scala or Python; Hadoop, Spark, or Pandas; SBT or Maven; Bash, Docker, and REST.

GPU-accelerated TensorFlow on Kubernetes

Many workflows that utilize TensorFlow need GPUs to efficiently train models on image or video data. Yet, these same workflows typically also involve multi-stage data pre-processing and post-processing, which might not need to run on GPUs. This mix of processing stages, illustrated in Figure 1, results in data science teams running things requiring CPUs in one system while trying to manage GPU resources separately by yelling across the office: “Hey, is anyone using the GPU machine?” A unified methodology is desperately needed for scheduling multi-stage workflows, managing data, and offloading certain portions of the workflows to GPUs.

Pipes in R Tutorial For Beginners

You might have already seen or used the pipe operator when you’re working with packages such as dplyr, magrittr,… But do you know where pipes and the famous %>% operator come from, what they exactly are, or how, when and why you should use them? Can you also come up with some alternatives?

R in the Windows Subsystem for Linux

R has been available for Windows since the very beginning, but if you have a Windows machine and want to use R within a Linux ecosystem, that’s easy to do with the new Fall Creators Update (version 1709). If you need access to the gcc toolchain for building R packages, or simply prefer the bash environment, it’s easy to get things up and running. Once you have things set up, you can launch a bash shell and run R at the terminal like you would in any Linux system. And that’s because this is a Linux system: the Windows Subsystem for Linux is a complete Linux distribution running within Windows. This page provides the details on installing Linux on Windows, but here are the basic steps you need and how to get the latest version of R up and running within it.

Introduction to Skewness

In previous posts here, here, and here, we spent quite a bit of time on portfolio volatility, using the standard deviation of returns as a proxy for volatility. Today we will begin a two-part series on additional statistics that aid our understanding of return dispersion: skewness and kurtosis. Beyond being fancy words and required vocabulary for CFA level 1, these two concepts are both important and fascinating for lovers of returns distributions. For today, we will focus on skewness. Skewness is the degree to which returns are asymmetric around the mean. Since a normal distribution is symmetric around the mean, skewness can be taken as one measure of how returns are not distributed normally. Why does skewness matter? If portfolio returns are right, or positively, skewed, it implies numerous small negative returns and a few large positive returns. If portfolio returns are left, or negatively, skewed, it implies numerous small positive returns and a few large negative returns. The phrase “large negative returns” should trigger Pavlovian sweating for investors, even if it’s preceded by a diminutive modifier like “just a few”. For a portfolio manager, a negatively skewed distribution of returns implies a portfolio at risk of rare but large losses. This makes us nervous and is a bit like saying, “I’m healthy, except for my occasional massive heart attack.” Let’s get to it.
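
The definition above can be checked numerically. This sketch uses the moment-based (population) skewness, the third central moment divided by the second central moment raised to the power 1.5. Statistical packages may apply a small-sample correction, so their output can differ slightly. The return series are made up for illustration, and the function name `skewness` is an assumption.

```python
def skewness(returns):
    """Moment-based skewness: m3 / m2**1.5, using population (1/n) moments."""
    n = len(returns)
    mean = sum(returns) / n
    m2 = sum((r - mean) ** 2 for r in returns) / n  # second central moment
    m3 = sum((r - mean) ** 3 for r in returns) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical monthly returns.
many_small_gains_one_big_loss = [0.01] * 9 + [-0.20]
many_small_losses_one_big_gain = [-0.01] * 9 + [0.20]
symmetric = [-0.02, -0.01, 0.0, 0.01, 0.02]

left_skew = skewness(many_small_gains_one_big_loss)
right_skew = skewness(many_small_losses_one_big_gain)
no_skew = skewness(symmetric)
print("left-skewed sample:", round(left_skew, 3))
print("right-skewed sample:", round(right_skew, 3))
print("symmetric sample:", round(no_skew, 3))
```

The first series matches the "occasional massive heart attack" case: the typical month is slightly positive, but the single large loss makes the skewness negative.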

A minimal Project Tree in R

The main idea was:
• To ensure reproducibility within a stable working directory tree. She proposes the very concise here::here(), but other methods are available, such as the template or the ProjectTemplate packages.
• To avoid wreaking havoc on others’ computers with rm(list = ls())!

Introduction to Computational Linguistics and Dependency Trees in data science

In recent years, the combination of deep learning fundamentals with Natural Language Processing techniques has greatly improved information mining tasks on unstructured text data. The models are now able to recognize natural language and speech at levels comparable to humans. Despite such improvements, discrepancies in the results still exist, as sometimes the information is encoded very deep in the syntax and syntactic structures of the corpus.

Artificial Intelligence and the Move Towards Preventive Healthcare

In this special guest feature, Waqaas Al-Siddiq, Founder and CEO of Biotricity, discusses how AI’s ability to crunch Big Data will play a key role in the healthcare industry’s shift toward preventative care. A physician’s ability to find the relevant data they need to make a diagnosis will be augmented by new AI-enhanced technologies. Waqaas, the founder of Biotricity, is a serial entrepreneur, a former investment advisor and an expert in wireless communication technology. Academically, he was distinguished for his various innovative designs in digital, analog, embedded, and micro-electro-mechanical products. His work was published in various conferences such as IEEE and the National Communication Council. Waqaas has a dual Bachelor’s degree in Computer Engineering and Economics, a Master’s in Computer Engineering from Rochester Institute of Technology, and a Master’s in Business Administration from Henley Business School. He is completing his Doctorate in Business Administration at Henley, with a focus on Transformative Innovations and Billion Dollar Markets.