CompEngine

A self-organizing database of time-series data. This website allows you to upload time-series data and interactively visualize similar data that have been measured by others! Learn more, or jump right in by uploading your own time-series data or exploring our existing collection!


Deep Reinforcement Learning with TensorFlow 2.0

In this tutorial, I will showcase the upcoming TensorFlow 2.0 features through the lens of deep reinforcement learning (DRL) by implementing an advantage actor-critic (A2C) agent to solve the classic CartPole-v0 environment. While the goal is to showcase TensorFlow 2.0, I will do my best to make the DRL aspect approachable as well, including a brief overview of the field. In fact, since the main focus of the 2.0 release is making developers’ lives easier, it’s a great time to get into DRL with TensorFlow – our full agent source is under 150 lines! The code is available as a notebook here and online on Google Colab here.
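As a taste of the eager, Keras-style API the tutorial leans on, here is a minimal sketch of a two-headed actor-critic network for CartPole; the layer sizes and names are my own illustration, not the tutorial's exact code.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

class ActorCritic(tf.keras.Model):
    """Shared trunk with a policy (actor) head and a value (critic) head."""
    def __init__(self, num_actions):
        super().__init__()
        self.hidden = layers.Dense(128, activation='relu')
        self.policy_logits = layers.Dense(num_actions)  # actor head
        self.value_head = layers.Dense(1)               # critic head

    def call(self, inputs):
        x = self.hidden(inputs)
        return self.policy_logits(x), self.value_head(x)

model = ActorCritic(num_actions=2)                      # CartPole-v0 has 2 actions
obs = np.zeros((1, 4), dtype=np.float32)                # CartPole observations are 4-dimensional
logits, value = model(obs)                              # eager execution: no sessions needed
action = tf.random.categorical(logits, num_samples=1)   # sample an action from the policy
```

In the full A2C agent, the advantage (return minus the critic's value estimate) weights the policy-gradient loss, while the critic is regressed onto the observed returns.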


A Guide to Understanding Convolutional Neural Networks (CNNs) using Visualization

‘How did your neural network produce this result?’ This question has sent many data scientists into a tizzy. It’s easy to explain how a simple neural network works, but what happens when you increase the layers 1,000x in a computer vision project? Our clients or end users require interpretability: they want to know how our model got to the final result. We can’t take a pen and paper and explain how a deep neural network works. So how do we shed this ‘black box’ image of neural networks? By visualizing them! The clarity that comes with visualizing the different features of a neural network is unparalleled. This is especially true when we’re dealing with a convolutional neural network (CNN) trained on thousands or even millions of images.
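To give a flavour of what such visualization looks like in practice, here is a minimal sketch that pulls the feature maps out of an intermediate convolutional layer; the VGG16 backbone, the layer name, and the random stand-in image are assumptions for illustration, not the article's code.

```python
import numpy as np
import tensorflow as tf

# Untrained VGG16 used purely as a structural example (weights=None avoids a download).
base = tf.keras.applications.VGG16(weights=None, include_top=False,
                                   input_shape=(224, 224, 3))

# Expose an intermediate convolutional layer's output as a new model.
feature_extractor = tf.keras.Model(inputs=base.input,
                                   outputs=base.get_layer('block3_conv3').output)

image = np.random.rand(1, 224, 224, 3).astype('float32')  # stand-in for a real image
feature_maps = feature_extractor(image)
print(feature_maps.shape)  # (1, 56, 56, 256): each channel can be plotted as a heatmap
```

Plotting a grid of those channels (for instance with matplotlib's imshow) is one of the simplest ways to start peeking inside a CNN.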


The Lifecycle of Data

1. Discovery
2. Data Prep
2.1. Data acquisition: obtaining existing data from outside sources.
2.2. Data entry: creating new data values from data entered within the organization.
2.3. Signal reception: capturing data created by devices.
3. Plan Model (Explore/Transform Data)
• ETL (Extract, Transform & Load) transforms data using a set of business rules, before loading it into a sandbox.
• ELT (Extract, Load, and Transform) loads raw data into the sandbox then transforms the data.
• ETLT (Extract, Transform, Load, Transform) has two levels of transformation. The first transformation is often used to eliminate noise. (A toy sketch contrasting these orderings follows the outline.)
4. Build the Model
4.1. Design the model: identify a suitable model (e.g. a normal distribution). This step can draw on a number of different modeling techniques, including decision trees, regression techniques (such as logistic regression), and neural networks.
4.2. Execute the model: run the model against the data to check that it fits.
5. Communicate Results / Publish Insights
6. Operationalize / Measure Effectiveness
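The toy sketch below (mine, not from the article) contrasts the ETL and ELT orderings listed under step 3, using plain Python lists as a stand-in for the sandbox; the "business rule" is simply dropping negative ages.

```python
raw_rows = [" Alice , 42 ", " Bob , -1 ", " Carol , 37 "]

def transform(rows):
    # Business rule: strip whitespace, parse the age, and drop negative (noisy) records.
    cleaned = [tuple(field.strip() for field in row.split(",")) for row in rows]
    return [(name, int(age)) for name, age in cleaned if int(age) >= 0]

# ETL: transform first, then load the clean rows into the sandbox.
sandbox_etl = list(transform(raw_rows))

# ELT: load the raw rows first, then transform them inside the sandbox.
sandbox_elt = list(raw_rows)
sandbox_elt = transform(sandbox_elt)

print(sandbox_etl == sandbox_elt)  # True: same end state, different ordering
```

ETLT would simply run a light noise-removal pass (the age filter here) before loading and leave the heavier business-rule transformations for after the load.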


Confidence Intervals Without Pain

We propose a simple model-free solution to compute any confidence interval and to extrapolate these intervals beyond the observations available in your data set. In addition, we propose a mechanism to sharpen the confidence intervals, reducing their width by an order of magnitude. The methodology works with any estimator (mean, median, variance, quantile, correlation, and so on), even when the data set violates the classical requirements needed for traditional statistical techniques to work. In particular, our method also applies to observations that are auto-correlated, non-identically distributed, non-normal, and even non-stationary. No statistical knowledge is required to understand, implement, and test our algorithm, nor to interpret the results. Its robustness makes it suitable for black-box, automated machine learning technology. It will appeal to anyone dealing with data on a regular basis, such as data scientists, statisticians, software engineers, economists, quants, physicists, biologists, psychologists, system and business analysts, and industrial engineers.
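The excerpt above does not spell out the proposed algorithm, so purely as a familiar point of comparison, here is a minimal percentile-bootstrap sketch that works for an arbitrary estimator; this is a standard model-free baseline with the usual i.i.d. caveats, not the author's method.

```python
import numpy as np

def bootstrap_ci(data, estimator, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for an arbitrary estimator."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    stats = np.array([estimator(rng.choice(data, size=data.size, replace=True))
                      for _ in range(n_boot)])
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

sample = np.random.default_rng(1).exponential(scale=2.0, size=200)
print(bootstrap_ci(sample, np.median))  # 95% interval for the sample median
```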


Master Data Management (MDM): an Essential Part of Data Strategy

First of all, what is Master Data Management (MDM)? Master data refers to the critical data that is essential to an enterprise’s business and is often used across multiple disciplines and departments. MDM is the establishment and maintenance of an enterprise-level data service that provides accurate, consistent, and complete master data across the enterprise and to all business partners. The concept of Master Data Management originated around 2008, when data warehousing and ERP applications became popular in many organizations. As data volume and the number of databases grew, and with them the number of applications through which users enter and read data, it became increasingly important to ensure that correct master data definitions are used, so that there is a single truth in the data free of discrepancies, duplicates, and out-of-date records. The first example relates to customer information: in a big organization, there can be multiple customer databases populated and managed by different applications or departmental silos, so the same real-world customer may receive multiple direct mails or notifications from the same company. As the data grows, master data comes to include not only customer information but also other key data assets, such as data on prospects, suppliers, panelists, and products. MDM has been a challenge to implement because processes, technology, and tools must all work together to ensure that the master data is coordinated and synchronized across the enterprise.


Deconstructing the Concept of Efficiency

The concept of efficiency is not always easy to grasp. People go to work. They do their job. They follow the pace. What exactly is efficiency in the scheme of things? I think it is important to be able to distinguish how hard different workers are working; it is not really possible to discuss efficiency without that kind of comparison. Improving efficiency is about getting more from a person or process, so to determine whether efficiencies are being gained, it is necessary to know whether production is increasing. If an organization cannot say whether one worker is more productive than another, it has, in fact, little sense of efficiency. Moreover, it might be paying one worker far too much and another much too little. When an organization cannot speak to its use of labour, efficiency sometimes gets addressed instead as a matter of, say, using cheaper toner cartridges, buying a different brand of paper clips, or turning off the lights when the washroom is not in use. The cost of materials should not be confused with matters of efficiency. I will explain my position further in this blog.


GitHub Autocompletion with Machine Learning

As data scientists, one of the fields closest to our hearts is software development; after all, we are avid users of all sorts of packages and frameworks that help us build our models. GitHub is one of the key technologies supporting the software development lifecycle, including keeping track of defects, tasks, stories, commits, and so forth. In a large development organization, there might be a number of teams (i.e., squads) with specific responsibilities: a performance squad, an installer squad, a UX squad, a documentation squad, and so on. This introduces challenges when creating a new work item, as the user may not know which team a task or defect should be assigned to or who should be its owner. But can Machine Learning help? The answer is yes, especially if we have some historical data from a GitHub repository.
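As a rough illustration of the idea (not the article's code), the sketch below routes new issue titles to squads with TF-IDF features and logistic regression; the example titles and squad labels are made up, and a real repository would supply thousands of labelled work items.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Historical work items (title -> owning squad), normally exported from the repository.
titles = ["Installer fails on Windows 10",
          "Dashboard page load takes 8 seconds",
          "Typo in the getting-started guide",
          "Dark mode toggle is misaligned"]
squads = ["installer", "performance", "documentation", "ux"]

router = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
router.fit(titles, squads)

print(router.predict(["Installer crashes on Windows Server"]))  # predicted squad label
```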


A different way to deploy a Python model over Spark

Separate the prediction method from the rest of the Python class, then implement it in Scala.
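A minimal sketch of the Python side of that idea, assuming a toy logistic model: the prediction math lives in a standalone function that takes only plain numbers, which is the piece that would then be ported line-for-line to Scala and applied over a Spark DataFrame. The class and names below are illustrative, not the article's code.

```python
import numpy as np

def predict_one(weights, bias, features):
    """Pure prediction logic, free of class state: straightforward to port to Scala."""
    return 1.0 / (1.0 + np.exp(-(np.dot(weights, features) + bias)))

class LogisticModel:
    """Training, persistence, etc. stay in Python; prediction just delegates."""
    def __init__(self, weights, bias):
        self.weights = np.asarray(weights, dtype=float)
        self.bias = float(bias)

    def predict(self, features):
        return predict_one(self.weights, self.bias, np.asarray(features, dtype=float))

model = LogisticModel(weights=[0.4, -1.2], bias=0.1)
print(model.predict([1.0, 0.5]))  # same number the Scala port must reproduce
```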


What I’ve Learned Working with 12 Machine Learning Startups

• It’s about building products, not about AI
• It’s about the problem, not about the method
• Look for synergies between data and product
• Data first, AI later
• Invest in effective communication
• Quick and dirty isn’t actually that dirty
• When in doubt, show the data
• Build trust


Hosting your ML model on AWS Lambdas + API Gateway Part 1

But now someone wants to use your model. That’s great news, ain’t it? It definitely is, but a whole new problem comes up: you’ve gotta put it somewhere and make it available to other people in the business. I’ll show you how to do it easily and cheaply on AWS.
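To make that concrete, here is a minimal sketch of a Lambda handler serving a pickled scikit-learn model behind an API Gateway proxy integration; the bucket name, object key, and payload shape are placeholders of mine, not the walkthrough's actual setup.

```python
import json
import pickle

import boto3

s3 = boto3.client("s3")

# Load the model once per container (outside the handler) so warm invocations reuse it.
_model_bytes = s3.get_object(Bucket="my-model-bucket", Key="model.pkl")["Body"].read()
model = pickle.loads(_model_bytes)

def lambda_handler(event, context):
    # API Gateway proxy integration delivers the request body as a JSON string.
    features = json.loads(event["body"])["features"]
    prediction = model.predict([features])[0]
    return {"statusCode": 200,
            "body": json.dumps({"prediction": float(prediction)})}
```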


Softmax and Uncertainty

The softmax function is frequently used as the final activation function in neural networks for classification problems. It normalizes an input vector into values between 0 and 1 that sum to one, which invites a probabilistic interpretation. While it is true that the output of the softmax function can be treated as a probability vector, it is easy to fall into the trap of conflating this with statements about confidence (in the statistical sense). In this brief article, I will show how the softmax function can produce misleading outputs in classification problems, and how best to interpret its results. I will demonstrate what can go wrong through a simple example involving the classic MNIST dataset.
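A small numerical illustration of the trap (mine, not the article's MNIST example): merely scaling the logits makes the softmax output look near-certain even though no new evidence has been added.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([1.0, 2.0, 3.0])
print(softmax(logits))        # ~[0.09, 0.24, 0.67]: moderately peaked
print(softmax(10 * logits))   # ~[0.00, 0.00, 1.00]: looks "certain", same ordering of classes
```

The softmax score says which class wins and by how much in logit space, not how statistically confident the model should be, particularly on inputs unlike its training data.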


How do they apply BERT in the clinical domain?

Contextual word embeddings have been shown to dramatically improve NLP model performance, as demonstrated by ELMo (Peters et al., 2018), BERT (Devlin et al., 2018), and GPT-2 (Radford et al., 2019). Many researchers have since fine-tuned BERT on domain-specific data; BioBERT and SciBERT were introduced last time. This story continues that thread with two more pieces of research that fine-tune BERT and apply it in the clinical domain.
It covers Publicly Available Clinical BERT Embeddings (Alsentzer et al., 2019) and ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission (Huang et al., 2019). Rather than going through BERT itself in detail, it focuses on how researchers apply it in the clinical domain; if you want to understand more about BERT, you may visit this story. The following will be covered:
• Building clinical specific BERT resource
• Applications of ClinicalBERT
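For readers who want to experiment alongside the papers, a minimal loading sketch with Hugging Face transformers is below; the checkpoint identifier is the one commonly published for the Alsentzer et al. embeddings and should be treated as an assumption, since this excerpt does not name it.

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "emilyalsentzer/Bio_ClinicalBERT"   # assumed public checkpoint name
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

note = "Patient admitted with shortness of breath and chest pain."
inputs = tokenizer(note, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state   # contextual token embeddings
print(embeddings.shape)                              # (1, num_tokens, 768)
```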


Different Types of Regularization on Neural Networks with PyTorch

I was doing some research on sparsity in neural network frameworks and came across the article ‘Group sparse regularization for deep neural networks’. This paper addresses the challenging task of simultaneously optimizing (i) the weights of a neural network, (ii) the number of neurons in each hidden layer, and (iii) the subset of active input features (i.e., feature selection). The authors essentially propose integrating lasso-style penalties into the network’s regularization framework. There are plenty of posts and extensive research on the definition and importance of regularization; for example, posts [1] and [2] give a quick introduction to the subject.
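As a minimal sketch of the kind of penalty the paper builds on (not the paper's code), the snippet below adds a group-lasso term over the first layer's weight columns in PyTorch, so each group corresponds to one input feature and zeroed groups amount to feature selection; the layer sizes and penalty weight are illustrative.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
criterion = nn.CrossEntropyLoss()
lam = 1e-3                                   # strength of the sparsity penalty

x = torch.randn(32, 20)                      # toy batch: 32 samples, 20 input features
y = torch.randint(0, 2, (32,))

first_layer = net[0]
# One L2 norm per input feature's outgoing weights (one column each), summed: group lasso.
group_penalty = first_layer.weight.norm(p=2, dim=0).sum()

loss = criterion(net(x), y) + lam * group_penalty
loss.backward()                              # gradients now include the sparsity term
```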