**Local randomness in R**

One approach to using random number generation inside a function without affecting the outer state of the random number generator.

**Extracting Data from PDF File Using Python and R**

Data is key for any analysis in data science, be it inferential analysis, predictive analysis, or prescriptive analysis. The predictive power of a model depends on the quality of the data that was used in building the model. Data comes in different forms such as text, table, image, voice or video. Most often, data that is used for analysis has to be mined, processed and transformed to render it into a form suitable for further analysis. The most common type of dataset used in analysis is clean data stored in a comma-separated values (CSV) table. However, because the Portable Document Format (PDF) is one of the most used file formats, every data scientist should understand how to extract data from a PDF file and transform it into a format such as CSV that can then be used for analysis or model building.

**Particle Filter : A hero in the world of Non-Linearity and Non-Gaussian**

The superiority of particle filter technology in nonlinear and non-Gaussian systems determines its wide range of applications. In addition, the multi-modal processing capability of the particle filter is one of the reasons why it is widely used. Internationally, particle filtering has been applied in various fields.

**Correlation is not causation**

Why the confusion of these concepts has profound implications, from healthcare to business management. In correlated data, a pair of variables are related in that one thing is likely to change when the other does. This relationship might lead us to assume that a change to one thing causes the change in the other. This article clarifies that kind of faulty thinking by explaining correlation, causation, and the bias that often lumps the two together. The human brain simplifies incoming information, so we can make sense of it. Our brains often do that by making assumptions about things based on slight relationships, or bias. But that thinking process isn’t foolproof. An example is when we mistake correlation for causation. Bias can make us conclude that one thing must cause another if both change in the same way at the same time. This article clears up the misconception that correlation equals causation by exploring both of those subjects and the human brain’s tendency toward bias.

**A Multitask Music Model with BERT, Transformer-XL and Seq2Seq**

This is Part III of the ‘Building An A.I. Music Generator’ series. I’ll be covering the basics of Multitask training with Music Models – which we’ll use to do really cool things like harmonization, melody generation, and song remixing. We’ll be building off of Part I and Part II.

**Should you explain your predictions with SHAP or IG?**

Some of the most accurate predictive models today are black box models, meaning it is hard to really understand how they work. To address this problem, techniques have arisen to understand feature importance: for a given prediction, how important is each input feature value to that prediction? Two well-known techniques are SHapley Additive exPlanations (SHAP) and Integrated Gradients (IG). In fact, they each represent a different type of explanation algorithm: a Shapley-value-based algorithm (SHAP) and a gradient-based algorithm (IG). There is a fundamental difference between these two algorithm types. This post describes that difference. First, we need some background. Below, we review Shapley values, Shapley-value-based methods (including SHAP), and gradient-based methods (including IG). Finally, we get back to our central question: When should you use a Shapley-value-based algorithm (like SHAP) versus a gradient-based explanation algorithm (like IG)?
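To make the idea of a Shapley value concrete, here is a minimal sketch that computes exact Shapley values for a tiny cooperative game by averaging each player's marginal contribution over every ordering. The function `toy_value` and the feature names `a`, `b`, `c` are hypothetical and chosen only for illustration; this is not the SHAP library, which relies on approximations because exact enumeration grows factorially with the number of features.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when it joins the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_value(coalition):
    """Hypothetical payoff: 'a' adds 10, 'b' adds 20, together they add 6 more."""
    v = 0.0
    if "a" in coalition:
        v += 10.0
    if "b" in coalition:
        v += 20.0
    if "a" in coalition and "b" in coalition:
        v += 6.0  # interaction term, shared equally between a and b
    return v


phi = shapley_values(["a", "b", "c"], toy_value)
print(phi)
```

In this toy game the interaction bonus is split evenly between `a` and `b`, the unused feature `c` receives zero, and the attributions sum to the value of the full coalition.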

**Introduction to Stream Processing**

Together with blockchain and machine learning, stream processing seems to be one of the hottest topics nowadays. Companies are onboarding modern stream processing tools, service providers are releasing better and more powerful stream processing products, and specialists are in high demand. This article introduces the basics of stream processing. It starts with a rationale for why we need stream processing and how it works under the hood. Then it goes into how to write simple, scalable distributed stream processing applications. All in fewer than 40 lines of code! Since stream processing is a vast topic, this article is focused mostly on the data management part while sophisticated processing is left for another article. To make the article more practical, it discusses AWS Kinesis, a stream processing solution from Amazon, but it also refers to other popular Open Source technologies to present a broader picture.

**3 Approaches To Modernizing Predictive Analytics**

Companies of all sizes and in all industries are developing ways to harness the power of big data for better decision-making. To provide valuable insights and meet expectations, data science teams have long turned to predictive analytics – or using historical data to model a problem and uncover the key factors that generated specific outcomes in the past to make predictions about the future. Predictive analytics has been around for years; however, prior to machine learning, the technology was not easy to adopt or scale in real-time. Machine learning is modernizing predictive analytics, providing data scientists with the ability to augment their efforts with more real-time insights. And thanks to hybrid cloud infrastructure opportunities, it’s now possible to embed and scale predictive analytics in almost any business application quickly and efficiently. The ability to process larger quantities of data in real-time results in more accurate predictions, and therefore, better business decisions. However, modernizing predictive analytics is not without its challenges. Here are a few ways companies can modernize the deployment of their legacy predictive models, and the pros and cons of these popular approaches.

**Object Tracking: Particle Filter with Ease**

Object tracking is one of the primary tasks in computer vision. It is used in a wide range of applications, such as video surveillance, car tracking (distance estimation), and people detection and tracking. Object trackers usually need an initialization step, such as the initial object location, which can be provided manually or automatically using an object detector such as the Viola-Jones detector or fast template matching. There are several major problems related to tracking:

• occlusion

• multiple objects

• scale, illumination, appearance change

• difficult and rapid motions

• …

Although the object tracking problem has been studied for years, it is still not solved, and there are many object trackers, both special-purpose and generic.

The Kalman filter assumes a linear motion model and Gaussian noise and returns only one hypothesis (e.g. the probable position of a tracked object). If the movements are rapid and unpredictable (e.g. a leaf on a tree on a windy day), the Kalman filter is likely to fail. The particle filter returns multiple hypotheses (each particle represents one hypothesis) and thus can deal with non-Gaussian noise and support non-linear models. Besides object tracking, where the state is a position vector (x, y), the state can be anything, e.g., the shape of the model. This article will explain the main idea behind the particle filter and will focus on its practical usage for object tracking along with samples.
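The predict, weight, and resample cycle behind a particle filter can be sketched in a few lines. The example below is a minimal bootstrap particle filter for a one-dimensional position, not the article's implementation: the constant drift of 1.5 per step, the noise levels, and all function names are assumptions made for illustration.

```python
import math
import random


def particle_filter_step(particles, measurement, rng,
                         drift=1.5, motion_std=1.0, meas_std=2.0):
    """One predict / weight / resample cycle of a bootstrap particle filter."""
    # Predict: move every particle with the assumed motion model plus noise.
    predicted = [p + drift + rng.gauss(0.0, motion_std) for p in particles]
    # Weight: Gaussian likelihood of the measurement given each particle.
    weights = [math.exp(-0.5 * ((measurement - p) / meas_std) ** 2)
               for p in predicted]
    if sum(weights) == 0.0:
        # All likelihoods underflowed; fall back to uniform weights.
        weights = [1.0] * len(predicted)
    # Resample: draw particles in proportion to their weights.
    return rng.choices(predicted, weights=weights, k=len(predicted))


def estimate(particles):
    """Point estimate: the mean of the particle cloud."""
    return sum(particles) / len(particles)


rng = random.Random(0)
# Each particle is one hypothesis about the position; start spread out.
particles = [rng.uniform(-50.0, 50.0) for _ in range(1000)]
true_pos = 0.0
for _ in range(30):
    true_pos += 1.5                              # simulated true motion
    z = true_pos + rng.gauss(0.0, 2.0)           # simulated noisy measurement
    particles = particle_filter_step(particles, z, rng)

final_estimate = estimate(particles)
print(f"true position: {true_pos:.2f}, estimate: {final_estimate:.2f}")
```

For image-based tracking the state would be a vector such as (x, y) and the likelihood would compare image patches or color histograms, but the loop keeps the same structure.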

**The Deluge of Spurious Correlations in Big Data**

Very large databases are a major opportunity for science, and data analytics is a remarkable new field of investigation in computer science. The effectiveness of these tools is used to support a “philosophy” against the scientific method as developed throughout history. According to this view, computer-discovered correlations should replace understanding and guide prediction and action. Consequently, there will be no need to give scientific meaning to phenomena, by proposing, say, causal relations, since regularities in very large databases are enough: “with enough data, the numbers speak for themselves”. The “end of science” is proclaimed. Using classical results from ergodic theory, Ramsey theory and algorithmic information theory, we show that this “philosophy” is wrong. For example, we prove that very large databases have to contain arbitrary correlations. These correlations appear only due to the size, not the nature, of data. They can be found in “randomly” generated, large enough databases, which – as we will prove – implies that most correlations are spurious. Too much information tends to behave like very little information. The scientific method can be enriched by computer mining in immense databases, but not replaced by it.
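A small simulation gives some intuition for this claim. It is a simplified statistical illustration, not the paper's proof: the sketch below generates independent Gaussian noise columns and reports the strongest pairwise Pearson correlation as more columns are included. The column counts, sample size, and seed are arbitrary assumptions.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def max_abs_correlation(columns):
    """Strongest absolute correlation among all pairs of columns."""
    return max(abs(pearson(a, b)) for a, b in combinations(columns, 2))


rng = random.Random(42)
n_obs = 20
# 200 columns of pure noise: no column has any real relation to another.
columns = [[rng.gauss(0.0, 1.0) for _ in range(n_obs)] for _ in range(200)]

results = {k: max_abs_correlation(columns[:k]) for k in (5, 50, 200)}
for k, r in results.items():
    print(f"{k:>3} noise columns -> strongest |correlation| = {r:.2f}")
```

Because each larger set contains the smaller one, the strongest correlation can only grow as columns are added, even though every column is noise.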

**The hidden risk of AI and Big Data**

Recent advances in AI have been made possible through access to ‘Big Data’ and cheap computing power. But can it go wrong? Big Data is suddenly everywhere. Where data (and information) was once scarce and difficult to find, we now have a deluge of it. In recent years, the amount of available data has been growing at an exponential pace. This has in turn been made possible by the immense growth in the number of devices recording data, as well as the connectivity between all these devices through the internet of things. Everyone seems to be collecting, analyzing, making money from and celebrating (or fearing) the powers of Big Data. Combined with the power of modern computing, it promises to solve virtually any problem – just by crunching the numbers.

**Striking a Balance between Exploring and Exploiting**

The exploration-exploitation dilemma is faced by our agents while learning to play the game tic-tac-toe [Medium article]. This dilemma is a fundamental problem in reinforcement learning as well as in real life, where we frequently face it when choosing between options. Would you rather:

• pick something you are familiar with in order to maximise the chance of getting what you want

• or pick something you have not tried and possibly learn more, which may (or may not) result in you making better decisions in future

This trade-off determines whether you earn your reward sooner or learn about the environment first and earn your rewards later.
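One common way to balance the two options is an epsilon-greedy rule: with a small probability the agent explores a random option, otherwise it exploits the option with the best estimate so far. The sketch below applies it to a hypothetical three-armed bandit rather than tic-tac-toe; the reward means, the value of epsilon, and the function name are assumptions for illustration, and the article itself may use a different strategy.

```python
import random


def epsilon_greedy_bandit(true_means, epsilon, steps, rng):
    """Run an epsilon-greedy agent on a Gaussian multi-armed bandit."""
    n_arms = len(true_means)
    counts = [0] * n_arms        # how often each arm was pulled
    estimates = [0.0] * n_arms   # running average reward per arm
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                          # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental update of the running mean for the chosen arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, counts, total_reward


rng = random.Random(1)
estimates, counts, total_reward = epsilon_greedy_bandit(
    true_means=[0.2, 0.5, 1.5], epsilon=0.1, steps=5000, rng=rng)
best_arm = max(range(len(counts)), key=lambda a: counts[a])
print("pull counts:", counts)
print("most pulled arm:", best_arm)
```

A larger epsilon learns about the environment faster but gives up more reward along the way; a smaller epsilon earns sooner but risks settling on a worse option.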

**Building Your Data Science Technology Stack**

The tech stack for Data Science teams is misunderstood by companies of all sizes. Oftentimes there is a failure to understand what tooling is necessary for what jobs. Fortunately, most trends in technology result in standardized workflows across industry. As of yet, this seems to have been limited in the Data Science world. There isn’t a clear route to building and deploying an AI product like there is for something like a basic web application. Maybe your AI solution is going to be deployed and provide predictions through a basic web application. This adds an extra layer of complexity, a layer most teams are not prepared to deal with.

**Probability Learning II: How Bayes’ Theorem is applied in Machine Learning**

In the previous post we saw what Bayes’ Theorem is, and went through an easy, intuitive example of how it works. You can find this post here. If you don’t know what Bayes’ Theorem is, and you have not had the pleasure of reading that post yet, I recommend you do, as it will make understanding this present article a lot easier. In this post, we will see the uses of this theorem in Machine Learning. Ready? Let’s go then!

**What is Poisson Distribution?**

Before setting the parameter λ and plugging it into the formula, let’s pause a second and ask a question. Why did Poisson have to invent the Poisson Distribution?

**The five discrete distributions every Statistician should know**

Distributions play an essential role in the life of every Statistician. Now coming from a non-statistical background, distributions always come across as something mystical to me. And the fact is that there are a lot of them. So which ones should I know? And how do I know and understand them? This post is about some of the most used discrete distributions that you need to know along with some intuition and proofs.

1. Bernoulli Distribution

2. Binomial Distribution

3. Geometric Distribution

4. Negative Binomial Distribution

5. Poisson Distribution
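As a quick reference, the probability mass functions of the five distributions listed above can be written with the standard library alone. This is a minimal sketch under stated conventions, which may differ from the post's: the geometric and negative binomial versions here count the number of trials `k` up to and including the first (or `r`-th) success, and all function names are chosen for illustration.

```python
import math


def bernoulli_pmf(k, p):
    """P(X = k) for a single trial with success probability p, k in {0, 1}."""
    return p if k == 1 else 1.0 - p


def binomial_pmf(k, n, p):
    """P(k successes in n independent trials)."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def geometric_pmf(k, p):
    """P(first success occurs on trial k), k >= 1."""
    return (1.0 - p) ** (k - 1) * p


def negative_binomial_pmf(k, r, p):
    """P(r-th success occurs on trial k), k >= r."""
    return math.comb(k - 1, r - 1) * p ** r * (1.0 - p) ** (k - r)


def poisson_pmf(k, lam):
    """P(k events) when events occur at an average rate lam per interval."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# The distributions are related: a binomial with n = 1 is a Bernoulli,
# a negative binomial with r = 1 is a geometric, and a binomial with
# large n and small p is close to a Poisson with lam = n * p.
print(binomial_pmf(1, 1, 0.3), bernoulli_pmf(1, 0.3))
print(negative_binomial_pmf(4, 1, 0.3), geometric_pmf(4, 0.3))
print(binomial_pmf(2, 10000, 0.0003), poisson_pmf(2, 3.0))
```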

**Simple Ways to Improve Your Matplotlib**

Matplotlib’s default properties often yield unappealing plots. Here are several simple ways to spruce up your visualizations.