Visualizations should make the most important features of your data stand out. But too often, what’s important gets lost in the minefield of data. But now you can highlight systematic changes from random noise by adding trend lines to your chart! In Displayr, Visualizations of chart type Column, Bar, Area, Line and Scatter all support trend lines. Trend lines can be linear or non-parametric (cubic spline, Friedman´s super-smoother or LOESS).
Spark´s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). It is a fault-tolerant collection of elements which allows parallel operations upon itself. RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs.
Spark SQL is a part of Apache Spark big data framework designed for processing structured and semi-structured data. It provides a DataFrame API that simplifies and accelerates data manipulations. DataFrame is a special type of object, conceptually similar to a table in relational database. It represents a distributed collection of data organized into named columns. DataFrames can be created from external sources, retrieved with a query from a database, or converted from RDD; the inverse transform is also possible. This abstraction is designed for sampling, filtering, aggregating, and visualizing the data. In this blog post, we’re going to show you how to load a DataFrame and perform basic operations on DataFrames with both API and SQL. We’ll also go through DataFrame to RDD and vice-versa conversions.
The vast possibilities of artificial intelligence are of increasing interest in the field of modern information technologies. One of its most promising and evolving directions is machine learning (ML), which becomes the essential part in various aspects of our life. ML has found successful applications in Natural Languages Processing, Face Recognition, Autonomous Vehicles, Fraud detection, Machine vision and many other fields. Machine learning utilizes the mathematical algorithms that can solve specific tasks in a way analogous to the human brain. Depending on the neural network training method, ML algorithms can be divided into supervised (with labeled data), unsupervised (with unlabeled data), semi-supervised (there are both labeled and unlabeled data in the dataset) and reinforcement (based on reward receiving) learning. Solving the most basic and popular ML tasks, such as classification and regression, is mainly based on supervised learning algorithms. Among the variety of existing ML tools, Spark MLlib is a popular and easy-to-start library which enables training neural networks for solving the problems mentioned above. In this post, we would like to consider classification task. We will classify Iris plants to the 3 categories according to the size of their sepals and petals. The public dataset with Iris classification is available here. To move forward, download the file bezdekIris.data to the working folder.
Spark is a powerful tool which can be applied to solve many interesting problems. Some of them have been discussed in our previous posts. Today we will consider another important application, namely streaming. Streaming data is the data which continuously comes as small records from different sources. There are many use cases for streaming technology such as sensor monitoring in industrial or scientific devices, server logs checking, financial markets monitoring, etc. In this post, we will examine the case with sensors temperature monitoring. For example, we have several sensors (1,2,3,4,…) in our device. Their state is defined by the following parameters: date (dd/mm/year), sensor number, state (1 – stable, 0 – critical), and temperature (degrees Celsius). The data with the sensors state comes in streaming, and we want to analyze it. Streaming data can be loaded from the different sources. As we don´t have the real streaming data source, we should simulate it. For this purpose, we can use Kafka, Flume, and Kinesis, but the simplest streaming data simulator is Netcat.
While models and algorithms garner most of the media coverage, this is a great time to be thinking about building tools in data.
For some advanced uses cases, users might need to mount more than one data and/or outputs volumes. Polyaxon provides a way to mount multiple volumes so that user can choose which volume(s) to mount for a specific job or experiment.
Shogun is an open-source machine learning library that offers a wide range of machine learning algorithms. From my point of view it’s not very popular among professionals, but it have a lot of fans among enthusiasts and students. Library offers unified API for algorithms, so they can be easily managed, it somehow looks like to scikit-learn approach. There is a set of examples which can help you in learning of the library, but holistic documentation is missed.
Data Visualization is a big part of a data scientist´s jobs. In the early stages of a project, you´ll often be doing an Exploratory Data Analysis (EDA) to gain some insights into your data. Creating visualizations really helps make things clearer and easier to understand, especially with larger, high dimensional datasets. Towards the end of your project, it´s important to be able to present your final results in a clear, concise, and compelling manner that your audience, whom are often non-technical clients, can understand. Matplotlib is a popular Python library that can be used to create your Data Visualizations quite easily. However, setting up the data, parameters, figures, and plotting can get quite messy and tedious to do every time you do a new project. In this blog post, we´re going to look at 6 data visualizations and write some quick and easy functions for them with Python´s Matplotlib. In the meantime, here´s a great chart for selecting the right visualization for the job!
Getting an AI startup to scale for an IPO is currently elusive. Several different strategies are being discussed around the industry and here we talk about the horizontal strategy and the increasingly favored vertical strategy.
Natural language processing (NLP) is getting very popular today, which became especially noticeable in the background of the deep learning development. NLP is a field of artificial intelligence aimed at understanding and extracting important information from text and further training based on text data. The main tasks include speech recognition and generation, text analysis, sentiment analysis, machine translation, etc. In the past decades, only experts with appropriate philological education could be engaged in the natural language processing. Besides mathematics and machine learning, they should have been familiar with some key linguistic concepts. Now, we can just use already written NLP libraries. Their main purpose is to simplify the text preprocessing. We can focus on building machine learning models and hyperparameters fine-tuning. There are many tools and libraries designed to solve NLP problems. Today, we want to outline and compare the most popular and helpful natural language processing libraries, based on our experience. You should understand that all the libraries we look at have only partially overlapped tasks. So, sometimes it is hard to compare them directly. We will walk around some features and compare only those libraries, for which this is possible.
Autonomous cars are racing down the highway at speeds exceeding 100 MPH when suddenly a car a half-mile ahead blows out a tire sending dangerous debris across 3 lanes of traffic. Instead of relying upon sending this urgent, time-critical distress information to the world via the cloud, the cars on that particular section of the highway use peer-to-peer, immutable communications to inform all vehicles in the area of the danger so that they can slow down and move to unobstructed lanes (while also sending a message to the nearest highway maintenance robots to remove the debris).