Linear Discriminant Analysis (LDA) 101, using R

This is really a follow-up to my last article on Principal Component Analysis, so take a look at that if you feel like it: https://…ent-analysis-pca-101-using-r-361f4c53a9ff If not, just keep reading; we’ll tackle a case without PCA first and then follow up with LDA on PCA-‘transformed’ data afterwards.
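As a rough sketch of the idea, and not the article’s own code (the article works in R), here is what “LDA alone” versus “LDA on PCA-transformed data” can look like with scikit-learn on a toy dataset:

```python
# Illustrative only: compare LDA on raw features vs. LDA on PCA-reduced features.
# Dataset and split are my own choices, not taken from the article.
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# LDA directly on the original features
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
print("LDA alone:", lda.score(X_test, y_test))

# LDA on PCA-'transformed' data: reduce to a few components first
pca_lda = make_pipeline(PCA(n_components=5), LinearDiscriminantAnalysis())
pca_lda.fit(X_train, y_train)
print("PCA + LDA:", pca_lda.score(X_test, y_test))
```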


Lazy Neural Networks

For difficult problems, neural networks can sometimes lack robustness: they may fail to make accurate predictions on under-represented examples and edge cases, even when a suitable architecture has been selected. I discuss why shifting attention away from model architecture and towards intelligent data selection strategies and cost function design is often more productive. Before I get into solutions, I think it is important to discuss some overarching themes of deep learning.
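One concrete instance of the cost-function-design idea, offered purely as my own illustration rather than anything from the article, is re-weighting rare classes in the loss; a minimal Keras sketch on toy data:

```python
# Illustrative only: weighting errors on a rare class more heavily is one
# simple form of 'cost function design' for under-represented examples.
import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(20,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Toy imbalanced data: class 1 is rare, so its errors get 10x weight in the loss.
X = np.random.randn(1000, 20)
y = (np.random.rand(1000) < 0.05).astype("float32")
model.fit(X, y, epochs=3, class_weight={0: 1.0, 1: 10.0}, verbose=0)
```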


Keras challenges the Avengers

Sentiment Analysis, also called Opinion Mining, is a useful tool within natural language processing that allows us to identify, quantify, and study subjective information. Because quintillions of bytes of data are produced every day, this technique lets us extract attributes from that data, such as whether the opinion about a subject is positive or negative, which subject is being discussed, and what characteristics the people or entities expressing that opinion have.
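For a sense of what a minimal sentiment classifier looks like in Keras, here is an illustrative sketch on the built-in IMDB reviews dataset; it is not the model the article builds:

```python
# Illustrative Keras sentiment classifier: positive vs. negative movie reviews.
from tensorflow import keras

num_words, maxlen = 10000, 200
(x_train, y_train), (x_test, y_test) = keras.datasets.imdb.load_data(num_words=num_words)
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = keras.preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)

model = keras.Sequential([
    keras.layers.Embedding(num_words, 32),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(1, activation="sigmoid"),  # probability the review is positive
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=2, validation_data=(x_test, y_test))
```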


Kalman Filter: Modelling Time Series Shocks with KFAS in R

When it comes to time series forecasts, conventional models such as ARIMA are often a popular option. While these models can prove to have high degrees of accuracy, they have one major shortcoming – they do not typically account for ‘shocks’, or sudden changes in a time series. Let’s see how we can potentially alleviate this problem using a model known as the Kalman Filter.
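The article works with KFAS in R; as a rough Python analogue of the same idea, here is a local-level state-space model on a synthetic series with an artificial shock, estimated via the Kalman filter in statsmodels:

```python
# Illustrative only: a local-level model whose latent level is estimated with
# the Kalman filter/smoother, on synthetic data with a sudden 'shock' at t=70.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(0, 0.5, 100)) + rng.normal(0, 1, 100)
y[70:] += 10  # sudden level shift the model should adapt to

model = sm.tsa.UnobservedComponents(y, level="local level")
result = model.fit(disp=False)
level = result.smoothed_state[0]  # Kalman-smoothed estimate of the underlying level
print(result.summary())
```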


Introduction to StanfordNLP: An Incredible State-of-the-Art NLP Library for 53 Languages (with Python code)

A common challenge I came across while learning Natural Language Processing (NLP): can we build models for non-English languages? The answer has been no for quite a long time. Each language has its own grammatical patterns and linguistic nuances, and there just aren’t many datasets available in other languages.
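Based on the library’s documented pipeline usage (details may have changed since), processing a non-English sentence looks roughly like this:

```python
# Rough sketch of the stanfordnlp pipeline on a French sentence; the download
# step fetches the pretrained models for that language (one of the 53 supported).
import stanfordnlp

stanfordnlp.download('fr')                    # fetch the French models once
nlp = stanfordnlp.Pipeline(lang='fr')         # tokenize, tag, lemmatize, parse
doc = nlp("Le renard brun saute par-dessus le chien paresseux.")
doc.sentences[0].print_dependencies()         # universal-dependency triples
```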


Introduction to Kotlin-Statistics

Over the past few years, I have been an avid user of Kotlin. But my proclivity for Kotlin is not simply due to language boredom or zeal for JetBrains products (including the great Python IDE called PyCharm). Kotlin is a more pragmatic Scala, or ‘Scala for Dummies’ as I heard someone once describe it. It is revolutionary in that it tries not to be, focusing on practicality and industry rather than academic experimentation. It takes many of the most useful features from programming languages to date (including Java, Groovy, Scala, C#, and Python) and integrates them into a single language.


Introducing the AI Project Canvas

Creating an AI Project always involves answering the same questions: What is the value you’re adding? What data do you need? Who are the customers? What costs and revenue are expected?


Introducing Snorkel

Building high-quality training datasets is one of the most difficult challenges of machine learning solutions in the real world. Disciplines like deep learning have helped us to build more accurate models but, to do so, they require vastly larger volumes of training data. Now, saying that effective machine learning requires a lot of training data is like saying that ‘you need a lot of money to be rich’. It’s true, but it doesn’t make it less painful to get there. In many of the machine learning projects we work on at Invector Labs, our customers spend significantly more time collecting and labeling training datasets than building machine learning models. Last year, we came across a small project created by artificial intelligence (AI) researchers from Stanford University that provides a programming model for the creation of training datasets. Ever since, Snorkel has become a regular component of our machine learning implementations.
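To make the programming-model idea concrete without reproducing Snorkel’s actual API, here is a plain-Python caricature of weak supervision with labeling functions; Snorkel itself learns a generative model over the votes rather than taking a simple majority vote:

```python
# Conceptual illustration only, not Snorkel's API: each labeling function votes
# SPAM / NOT_SPAM / ABSTAIN, and votes are aggregated into a noisy training label.
ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1

def lf_contains_link(text):
    return SPAM if "http" in text else ABSTAIN

def lf_long_message(text):
    return NOT_SPAM if len(text.split()) > 5 else ABSTAIN

def weak_label(text, lfs=(lf_contains_link, lf_long_message)):
    votes = [lf(text) for lf in lfs if lf(text) != ABSTAIN]
    return max(set(votes), key=votes.count) if votes else ABSTAIN

print(weak_label("win money now http://spam.example"))  # -> 1 (SPAM)
```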


Introducing AresDB: Uber’s GPU-Powered Open Source, Real-time Analytics Engine

At Uber, real-time analytics allow us to attain business insights and operational efficiency, enabling us to make data-driven decisions to improve experiences on the Uber platform. For example, our operations team relies on data to monitor market health and spot potential issues on our platform; software powered by machine learning models leverages data to predict rider supply and driver demand; and data scientists use data to improve machine learning models for better forecasting. In the past, we have utilized many third-party database solutions for real-time analytics, but none were able to simultaneously address all of our functional, scalability, performance, cost, and operational requirements. Released in November 2018, AresDB is an open source, real-time analytics engine that leverages an unconventional power source, graphics processing units (GPUs), to enable our analytics to grow at scale. An emerging tool for real-time analytics, GPU technology has advanced significantly over the years, making it a perfect fit for real-time computation and parallel data processing.


Introducing AresDB

Uber has to rank among the greatest contributors to open source data science infrastructure and frameworks. From machine learning frameworks like Horovod or Pyro to time-series infrastructures such as M3, the Uber engineering team has been incredibly active in open sourcing different stacks that are key building blocks of Uber’s data science pipeline. Earlier this week, Uber unveiled yet another super cool technology to enable modern analytics solutions. AresDB is a database and runtime for massively scalable, real-time analytics workloads.


Interdisciplinary Data Science

Data Science is emerging as a disruptive consequence of the digital revolution. Based on the combination of big data availability, sophisticated data analysis techniques, and scalable computing infrastructures, Data Science is rapidly changing the way we do business, socialize, conduct research, and govern society. It is also changing the way scientific research is performed: model-driven approaches are being supplemented with data-driven approaches, and a new paradigm has emerged in which theories and models and the bottom-up discovery of knowledge from data mutually support each other. Experiments and analyses over massive datasets are functional not only to the validation of existing theories and models, but also to the data-driven discovery of patterns emerging from data, which can help scientists design better theories and models, yielding a deeper understanding of the complexity of social, economic, biological, technological, cultural and natural phenomena. Data science is an interdisciplinary and pervasive paradigm aiming to turn data into knowledge, born at the intersection of a diversity of scientific and technological fields: databases and data mining, machine learning and artificial intelligence, complex systems and network science, statistics and statistical physics, information retrieval and text mining, natural language understanding, and applied mathematics. Spectacular advances are occurring in data-driven pattern discovery, in the automated learning of predictive models, and in the analysis of complex networks.


Interactive Data Visualization with Python Using Bokeh

Recently I came across this library, learned a little about it, tried it, of course, and decided to share my thoughts. From the official website: ‘Bokeh is an interactive visualization library that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of versatile graphics, and to extend this capability with high-performance interactivity over very large or streaming datasets. Bokeh can help anyone who would like to quickly and easily create interactive plots, dashboards, and data applications.’ I think it’s pretty clear, but it would be much better to see it in action, wouldn’t it?
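To see it in action, a minimal Bokeh example with toy data (not taken from the article) that writes an interactive HTML plot:

```python
# Minimal Bokeh example: render a line chart to an interactive HTML page.
from bokeh.plotting import figure, output_file, show

x = [1, 2, 3, 4, 5]
y = [6, 7, 2, 4, 5]

output_file("lines.html")  # target HTML file opened by show()
p = figure(title="Simple line example", x_axis_label="x", y_axis_label="y")
p.line(x, y, line_width=2)
show(p)
```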


How to Increase Retention and Revenue in 1,000 Nontrivial Steps

Our goal is to take a data-driven approach to a complex metric that touches many different areas of WordPress.com; devise strategies based on available data; and provide outputs teams can use to guide their strategies and tactics intended to reduce churn [i.e., non-retention]. While we will not be implementing the tactics themselves, such as email marketing or advertising, we will be leading the charge on ensuring that the outputs we generate are implemented in useful ways and that the results of these tactics are measured.