I Have a Lot of Data, I Just Don’t Know Where!

If you are reading this, it's probable that you are doing something with your data or want to do something with it. But it's not that easy, right? Most data-centered projects nowadays are, in reality, complex, expensive, require organizational and cultural change, have a long time-to-value period, and are not easy to scale. What's the problem? In this article I'll discuss what, in my opinion, is the biggest problem facing companies that try to use their data, and it has to do with decades of doing things that used to work but no longer do.


Building a Shiny App as a Package

In a previous post, I introduced the {golem} package, which is an opinionated framework for building production-ready Shiny Applications. This framework starts by creating a package skeleton waiting to be filled. But in a world where Shiny Applications are mostly created as a series of files, why bother with a package? That is the question I'll be focusing on in this blog post.


The relationship between Biological and Artificial Intelligence

Intelligence can be defined as a predominantly human ability to accomplish tasks that are generally hard for computers and animals. Artificial Intelligence is a field attempting to accomplish such tasks with computers. AI is becoming increasingly widespread, as are claims of its relationship with Biological Intelligence. Often these claims are made to imply higher chances of a given technology succeeding, working on the assumption that AI systems which mimic the mechanisms of Biological Intelligence should be more successful. In this article I will discuss the similarities and differences between AI and Biological Intelligence, and the extent of our knowledge about the mechanisms of intelligence in biology, especially in humans. I will also explore the validity of the assumption that biomimicry in AI systems aids their advancement, and I will argue that the existing similarity between biological systems and the way Artificial Neural Networks tackle tasks is due to design decisions rather than any inherent similarity of the underlying mechanisms. This article is aimed at people who understand the basics of AI (especially ANNs) and would like to be better able to evaluate the often wild claims about the value of biomimicry in AI.


What’s new in TensorFlow 2.0?

The machine learning library TensorFlow has had a long history of releases, starting with its initial open-source release by the Google Brain team back in November 2015. Initially developed internally under the name DistBelief, TensorFlow quickly rose to become the most widely used machine learning library today. And not without reason.


How to do reproducible research in Ruby with gKnit

The idea of 'literate programming' was first introduced by Donald Knuth in the 1980s (Knuth 1984). The main intention of this approach was to develop software by interspersing macro snippets, traditional source code, and a natural language such as English in a document that could be compiled into executable code and, at the same time, easily read by a human developer. According to Knuth, 'The practitioner of literate programming can be regarded as an essayist, whose main concern is with exposition and excellence of style.' The idea of literate programming evolved into the idea of reproducible research, in which all the data, software code, documentation, graphics, etc. needed to reproduce the research and its reports are included in a single document, or set of documents, that can be distributed to peers and rerun to generate the same output and reports.


Why Git and Git-LFS is not enough to solve the Machine Learning Reproducibility crisis

Some claim the machine learning field is in a crisis due to software tooling that is insufficient to ensure repeatable processes. The crisis is about the difficulty of reproducing results such as machine learning models, and it could be solved with better software tools for machine learning practitioners. The reproducibility issue is so important that the annual NeurIPS conference plans to make it a major topic of discussion at NeurIPS 2019; the 'Call for Papers' announcement has more information: https://…/call-for-papers-689294418f43. The so-called crisis stems from the difficulty of replicating the work of co-workers or fellow scientists, which threatens their ability to build on each other's work, to share it with clients, or to deploy production services. Since machine learning, and other forms of artificial intelligence software, are so widely used across both academic and corporate research, replicability or reproducibility is a critical problem.


How to speed up the training of a sequence model (RNN) using the bucketing technique

Training recurrent neural networks for large-vocabulary, continuous speech recognition tasks is computationally very expensive, and the sequential nature of the recurrent process makes it impossible to parallelize training over input frames. Robust optimization requires working on large batches of utterances, and both training time and recognition performance can vary strongly depending on how the batches are put together.
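As a rough illustration of the bucketing idea (my own sketch, not code from the article), the snippet below groups variable-length sequences into length buckets so that each batch pads only up to its bucket's bound rather than to the global maximum length; the bucket boundaries and batch size are arbitrary assumptions.

```python
import random

def bucket_batches(sequences, bucket_bounds=(10, 20, 40, 80), batch_size=32, pad_value=0):
    """Group variable-length sequences into length buckets and pad within each bucket.

    `bucket_bounds` and `batch_size` are illustrative defaults, not values from the article.
    """
    buckets = {bound: [] for bound in bucket_bounds}
    for seq in sequences:
        for bound in bucket_bounds:
            if len(seq) <= bound:
                buckets[bound].append(seq)
                break
        # Sequences longer than the largest bound are simply dropped here for brevity.

    batches = []
    for bound, bucket in buckets.items():
        random.shuffle(bucket)  # shuffle within the bucket so batches vary between epochs
        for i in range(0, len(bucket), batch_size):
            chunk = bucket[i:i + batch_size]
            # Pad every sequence only up to the bucket bound, not the global maximum.
            padded = [seq + [pad_value] * (bound - len(seq)) for seq in chunk]
            batches.append(padded)
    random.shuffle(batches)
    return batches

# Example: 1,000 toy "utterances" with lengths between 3 and 80 frames.
data = [[1] * random.randint(3, 80) for _ in range(1000)]
for batch in bucket_batches(data)[:3]:
    print(len(batch), "sequences padded to length", len(batch[0]))
```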


Top 10 Big Data Tools for Big Data Dudes

Careful, man, there's a Big Data beverage here. Search engines, machine translation, targeted advertising, and much more are unthinkable now without processing tera- and petabytes of data. And if you want to be part of these movements, you need to master some of the big data analysis tools. Here is the list of frameworks, platforms, solutions, and advanced analytics systems that really tie the room together (in my humble opinion). Let's roll.
1. Hadoop – The Baby Elephant for Big Opportunities
2. Elasticsearch: Average Volume Data King (and a Hadoop killer?)
3. Pentaho Will Turn Big Data into Big Insights
4. Talend: Develop More Quickly with Less Ramp-Up Time
5. Lumify – A Simple and Excellent Tool
6. Skytree: Machine Learning Meets Big Data
7. Presto (SQL Query Engine)
8. RapidMiner: Extract Big Value from Big Data
9. Knime – Another Free Data Mining System
10. R Programming Environment: Last but Not Least Must-Have


Rule Based Detection?

One of the popular insults that security vendors use against competitors nowadays is 'RULE-BASED.' In essence, if you want to insult peers who, in your estimation, don't spout 'AI' and 'ML' often enough, just call them 'rule-based.' Sure, OK, we can all laugh at claims of 'cyber AI' (and we do, often), but what is the layer of reality under this? I suspect there is a spectrum that may be worth thinking about…


Set Analysis: A face off between Venn diagrams and UpSet plots

It's time for me to come clean about something: I think Venn diagrams are fun! Yes, that's right, I like them. They're pretty, they're often funny, and they convey the straightforward overlap between one or two sets fairly easily. Because I like making nerd comedy graphs, I considered sharing with y'all how to create Venn diagrams in R. But I couldn't do that in good conscience without showing an alternative for larger and more complex set analysis. A few weeks ago, when I saw Matthew Hendrickson and Mara Averick's excitement over the UpSetR plot, I knew what I should do. Folks, what you are about to witness is a set analysis face off! We will be pairing off Venn diagrams and UpSet plots in a variety of scenarios for a true battle royale. Winner takes all and gets to claim the prize of set analysis master.


NeuronBlocks – Building Your NLP DNN Models Like Playing Lego

NeuronBlocks is an NLP deep learning modeling toolkit that helps engineers and researchers build end-to-end pipelines for training neural network models on NLP tasks. The main goal of the toolkit is to minimize the development cost of building NLP deep neural network models, covering both the training and inference stages. For more details, please check our paper, NeuronBlocks — Building Your NLP DNN Models Like Playing Lego, at https://…/1904.09535. NeuronBlocks consists of two major components: Block Zoo and Model Zoo.
• In Block Zoo, we provide commonly used neural network components as building blocks for model architecture design.
• In Model Zoo, we provide a suite of NLP models for common NLP tasks, in the form of JSON configuration files.


CenterNet

Detection identifies objects as axis-aligned boxes in an image. Most successful object detectors enumerate a nearly exhaustive list of potential object locations and classify each. This is wasteful, inefficient, and requires additional post-processing. In this paper, we take a different approach. We model an object as a single point — the center point of its bounding box. Our detector uses keypoint estimation to find center points and regresses to all other object properties, such as size, 3D location, orientation, and even pose. Our center point based approach, CenterNet, is end-to-end differentiable, simpler, faster, and more accurate than corresponding bounding box based detectors. CenterNet achieves the best speed-accuracy trade-off on the MS COCO dataset, with 28.1% AP at 142 FPS, 37.4% AP at 52 FPS, and 45.1% AP with multi-scale testing at 1.4 FPS. We use the same approach to estimate 3D bounding box in the KITTI benchmark and human pose on the COCO keypoint dataset. Our method performs competitively with sophisticated multi-stage methods and runs in real-time.
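To make the center-point formulation concrete, here is a minimal, illustrative decoding step written in numpy (not the authors' released code): it treats local maxima of a predicted center heatmap as detections and reads the regressed width and height at each peak. The array shapes and the score threshold are assumptions for the sketch.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def decode_centers(heatmap, size_map, score_thresh=0.3):
    """Turn a center heatmap (H, W) and a size map (H, W, 2) into boxes.

    A pixel is kept if it is a local maximum of the heatmap and its score
    exceeds `score_thresh` (an illustrative value, not from the paper).
    """
    peaks = (heatmap == maximum_filter(heatmap, size=3)) & (heatmap > score_thresh)
    ys, xs = np.nonzero(peaks)
    boxes = []
    for y, x in zip(ys, xs):
        w, h = size_map[y, x]          # regressed width/height at the center pixel
        boxes.append((x - w / 2, y - h / 2, x + w / 2, y + h / 2, heatmap[y, x]))
    return boxes

# Toy example: one synthetic "object" centered at (x=30, y=20) with size 16x10.
H, W = 64, 64
heat = np.zeros((H, W)); heat[20, 30] = 0.9
sizes = np.zeros((H, W, 2)); sizes[20, 30] = (16.0, 10.0)
print(decode_centers(heat, sizes))     # -> one box around (30, 20)
```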


BentoML

BentoML is a Python library for packaging and deploying machine learning models. It provides high-level APIs for defining an ML service and packaging its artifacts, source code, dependencies, and configurations into a production-system-friendly format that is ready for deployment.


The Mathematics of Forward and Back Propagation

Understanding the maths behind forward and back propagation is not very easy. There are some very good, but also very technical, explanations. For example, 'The Matrix Calculus You Need For Deep Learning' by Terence Parr and Jeremy Howard is an excellent resource, but it is still too complex for beginners. I found a much simpler explanation in the ML cheatsheet; the section below is based on that source, and I have tried to simplify the explanation further.
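As a companion to that explanation, here is a minimal numpy sketch (my own illustration, not the cheatsheet's code) of one forward pass and one backward pass through a single-hidden-layer network with a sigmoid activation and squared-error loss; the layer sizes and learning rate are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny network: 3 inputs -> 4 hidden units -> 1 output (sizes are arbitrary).
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.normal(size=(5, 3))            # batch of 5 examples
y = rng.normal(size=(5, 1))            # toy targets

# Forward pass: cache the intermediate values needed for the backward pass.
z1 = x @ W1 + b1
a1 = sigmoid(z1)
y_hat = a1 @ W2 + b2                   # linear output layer
loss = 0.5 * np.mean((y_hat - y) ** 2)

# Backward pass: apply the chain rule layer by layer.
d_yhat = (y_hat - y) / len(x)          # dL/d(y_hat)
dW2 = a1.T @ d_yhat
db2 = d_yhat.sum(axis=0)
d_a1 = d_yhat @ W2.T                   # propagate the gradient into the hidden layer
d_z1 = d_a1 * a1 * (1 - a1)            # sigmoid'(z1) = a1 * (1 - a1)
dW1 = x.T @ d_z1
db1 = d_z1.sum(axis=0)

# One gradient-descent step with an arbitrary learning rate.
lr = 0.1
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
print("loss:", loss)
```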


The Significance and Applications of Covariance Matrix

'The beauty of math is that simple models can do great things.' There is no lack of fancy algorithms and techniques in modern data science. Technology is easy to learn, but also easy to fall behind on. The mathematical underpinnings, however, pay off in the long run. The covariance matrix is one simple and useful mathematical concept that is widely applied in financial engineering and econometrics as well as machine learning. Given its practicality, I decided to summarize some key points and examples from my notepad and put them together into a cohesive story.
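As a quick, self-contained refresher (a toy example of mine, not taken from the article): with observations in rows, the sample covariance matrix is the centered data matrix multiplied by its own transpose and divided by n - 1. The snippet computes it by hand, checks it against numpy, and then applies it to the classic portfolio-variance calculation with assumed weights.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 250 daily returns for 3 assets (purely synthetic numbers).
returns = rng.normal(loc=0.0005, scale=0.01, size=(250, 3))

# Covariance by hand: Sigma = (X - mean)^T (X - mean) / (n - 1), observations in rows.
centered = returns - returns.mean(axis=0)
cov_manual = centered.T @ centered / (len(returns) - 1)

# The same thing via numpy (rowvar=False because each column is a variable).
cov_np = np.cov(returns, rowvar=False)
assert np.allclose(cov_manual, cov_np)

# One classic application: portfolio variance w^T Sigma w for assumed weights.
weights = np.array([0.5, 0.3, 0.2])
portfolio_var = weights @ cov_np @ weights
print("portfolio volatility:", np.sqrt(portfolio_var))
```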


Yet Another Kalman Filter Explanation Article

There are many articles online that explain the Kalman Filter, and many of them are quite good. They use autonomous robots and sensors as motivating examples, have nice visuals, and introduce the model equations and updates in an intuitive way. But every one I've seen stops short of explaining the Kalman gain matrix and exactly where it comes from. One is excellent right up until the point where the author prepares to explain the Kalman gain matrix and instead finishes with 'to be continued…' (that was in 2014). The disappointed readers beg for days for the rest of the answer, but they never get it. So welcome to yet another Kalman Filter explanation article, the distinction being that this one contains a 'friendly' derivation of the updating equations, all the way to the end. It is a Bayesian explanation, but it requires only a cursory understanding of posterior probability, relying on two properties of the multivariate Gaussian rather than specific Bayesian results. As a bonus, this explanation will show how the Kalman Filter is 'optimal' by the very nature of what it is.
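For readers who want to see the updating equations as code before the derivation, here is a minimal numpy sketch of one predict/update cycle for a constant-velocity model tracking position from noisy measurements; the matrices, noise levels, and measurements are arbitrary assumptions of mine, not values from the article.

```python
import numpy as np

# State: [position, velocity]; constant-velocity model with time step dt = 1.
F = np.array([[1.0, 1.0],
              [0.0, 1.0]])        # state transition
H = np.array([[1.0, 0.0]])        # we only measure position
Q = 0.01 * np.eye(2)              # process noise covariance (assumed)
R = np.array([[0.5]])             # measurement noise covariance (assumed)

x = np.array([[0.0], [1.0]])      # initial state estimate
P = np.eye(2)                     # initial estimate covariance

def kalman_step(x, P, z):
    # Predict: propagate the state and its uncertainty through the model.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: the Kalman gain weighs the prediction against the measurement.
    S = H @ P_pred @ H.T + R                 # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)    # correct with the innovation
    P_new = (np.eye(2) - K @ H) @ P_pred
    return x_new, P_new

# Feed in a few noisy position measurements of an object moving at speed 1.
for t, z in enumerate([1.1, 1.9, 3.2, 3.9, 5.1], start=1):
    x, P = kalman_step(x, P, np.array([[z]]))
    print(f"t={t}: position={x[0, 0]:.2f}, velocity={x[1, 0]:.2f}")
```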


Kalman Filter

There are a lot of different articles on the Kalman filter, but it is difficult to find one that explains where all the filtering formulas come from, and I think that without that understanding the subject remains largely impenetrable. In this article I will try to explain everything in a simple way. The Kalman filter is a very powerful tool for filtering different kinds of data. The main idea behind it is that one should use information about the physical process. For example, if you are filtering data from a car's speedometer, then the car's inertia gives you the right to treat a large speed deviation as a measurement error. The Kalman filter is also interesting because, in a certain sense, it is the best possible filter; we will discuss precisely what that means. At the end of the article I will show how the formulas can be simplified.


Deep Learning Book Series 3.1 to 3.3 Probability Mass and Density Functions

This content is part of a series about Chapter 3, on probability, from the Deep Learning Book by Goodfellow, I., Bengio, Y., and Courville, A. (2016). It aims to provide intuitions, drawings, and Python code for the mathematical theory, and it reflects my own understanding of these concepts.
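In the same spirit as the series (though this particular snippet is mine, not the author's), the example below contrasts a probability mass function with a probability density function using scipy: the binomial PMF sums to one over its discrete outcomes, while the normal PDF integrates to one and its value at a point is a density rather than a probability.

```python
import numpy as np
from scipy import stats

# Discrete case: PMF of a Binomial(n=10, p=0.3) random variable.
k = np.arange(0, 11)
pmf = stats.binom.pmf(k, n=10, p=0.3)
print("PMF values sum to:", pmf.sum())          # -> 1.0 (up to float error)

# Continuous case: PDF of a standard normal random variable.
x = np.linspace(-5, 5, 10_001)
pdf = stats.norm.pdf(x)
print("PDF integrates to:", np.trapz(pdf, x))   # -> ~1.0
print("Density at 0:", stats.norm.pdf(0.0))     # ~0.3989, a density, not a probability

# A probability for a continuous variable comes from integrating the density (or from the CDF).
print("P(-1 < X < 1):", stats.norm.cdf(1.0) - stats.norm.cdf(-1.0))   # ~0.6827
```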


The Approximation Power of Neural Networks (with Python codes)

It is a well-known fact that neural networks can approximate the output of any continuous mathematical function, no matter how complicated it might be. Even if the target function f(x) has a pretty complicated shape, the theorems we will discuss shortly guarantee that one can build some neural network that approximates f(x) as accurately as we want. Neural networks, therefore, display a type of universal behavior. One of the reasons neural networks have received so much attention is that, in addition to these rather remarkable universal properties, they possess many powerful algorithms for learning functions.
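One classic way to see why this works, sketched below in numpy (my own illustration and not necessarily the article's construction): approximate a function on [0, 1] by a sum of narrow 'bumps', each built from a pair of steep sigmoids. The result is exactly a one-hidden-layer network with sigmoid units, and adding more bumps drives the error down. The target function, bump count, and steepness are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Target function to approximate on [0, 1] (an arbitrary choice for the demo).
f = lambda x: np.sin(6 * x) + 0.3 * x

# Build a one-hidden-layer "network" by hand: each pair of very steep sigmoids
# forms a bump of height f(center) over one small sub-interval, and the output
# is the sum of the bumps. More bumps -> a better approximation.
def approximate(x, n_bumps=50, steepness=500.0):
    edges = np.linspace(0.0, 1.0, n_bumps + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    out = np.zeros_like(x)
    for left, right, c in zip(edges[:-1], edges[1:], centers):
        bump = sigmoid(steepness * (x - left)) - sigmoid(steepness * (x - right))
        out += f(c) * bump
    return out

x = np.linspace(0.0, 1.0, 1000)
print("max abs error, 50 bumps:", np.max(np.abs(approximate(x) - f(x))))
print("max abs error, 200 bumps:", np.max(np.abs(approximate(x, n_bumps=200, steepness=2000.0) - f(x))))
```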