SAX (Symbolic Aggregate approXimation)

SAX is the first symbolic representation for time series that allows for dimensionality reduction and indexing with a lower-bounding distance measure. In classic data mining tasks such as clustering, classification, and indexing, SAX is as good as well-known representations such as the Discrete Wavelet Transform (DWT) and the Discrete Fourier Transform (DFT), while requiring less storage space. In addition, the representation allows researchers to avail of the wealth of data structures and algorithms in bioinformatics or text mining, and also provides solutions to many challenges associated with current data mining tasks. One example is motif discovery, a problem which we defined for time series data. There is great potential for extending and applying the discrete representation to a wide class of data mining tasks.
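The abstract does not spell out the conversion steps, so the following is only a minimal sketch of the pipeline as SAX is commonly described: z-normalize the series, average it over equal-width segments (Piecewise Aggregate Approximation), then map each average to a letter using breakpoints that split the standard normal distribution into equiprobable regions. The function name `sax`, its parameters, and the default alphabet are illustrative assumptions, not taken from the source.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax(series, n_segments, alphabet="abcd"):
    """Return a SAX word for `series` (illustrative sketch, hypothetical API)."""
    if len(series) % n_segments != 0:
        # Simplifying assumption of this sketch: segments have equal width.
        raise ValueError("len(series) must be a multiple of n_segments")

    # 1. Z-normalize so the values are roughly standard normal.
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        normalized = [0.0] * len(series)
    else:
        normalized = [(v - mu) / sigma for v in series]

    # 2. Piecewise Aggregate Approximation: one mean per segment.
    width = len(series) // n_segments
    paa = [mean(normalized[i:i + width]) for i in range(0, len(series), width)]

    # 3. Breakpoints that cut N(0, 1) into len(alphabet) equiprobable regions.
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(j / k) for j in range(1, k)]

    # 4. Map each segment mean to the letter of the region it falls in.
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in paa)


if __name__ == "__main__":
    # Eight points reduced to a four-letter word over a four-letter alphabet.
    print(sax([1, 2, 3, 4, 5, 6, 7, 8], n_segments=4))
```

The word is shorter than the series (dimensionality reduction) and uses a small alphabet (discretization), which is what makes string-oriented data structures and algorithms applicable.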

Self-Regulated Learning: Beliefs, Techniques

Knowing how to manage one's own learning has become increasingly important in recent years, as both the need and the opportunities for individuals to learn on their own outside of formal classroom settings have grown. During that same period, however, research on learning, memory, and metacognitive processes has provided evidence that people often have a faulty mental model of how they learn and remember, making them prone to both misassessing and mismanaging their own learning. After a discussion of what learners need to understand in order to become effective stewards of their own learning, we first review research on what people believe about how they learn and then review research on how people's ongoing assessments of their own learning are influenced by current performance and the subjective sense of fluency. We conclude with a discussion of societal assumptions and attitudes that can be counterproductive in terms of individuals becoming maximally effective learners.

Visualizations for machine learning datasets

The facets project contains two visualizations for understanding and analyzing machine learning datasets: Facets Overview and Facets Dive. The visualizations are implemented as Polymer web components, backed by TypeScript code, and can be easily embedded into Jupyter notebooks or webpages.

Data Science and Machine Learning: Great List of Resources

All you need to know about machine learning, accessible using our data science search engine, covering hundreds of articles and tutorials:
• Deep Learning
• Machine Learning
• Artificial Intelligence
• Blockchain
• Internet of Things
• R Programming Language
• Python for Data Science
• Linear and Logistic Regression
• Neural Networks
• Unsupervised Learning
• Feature Selection
• Outlier Detection
• Classification
• Model Comparison
• Data Science Libraries
• Data Sets
• Tutorials
• Books
• Courses
• One-Picture Summaries
• Algorithms

New Course: Support Vector Machines in R

Learn about our new R course. This course will introduce a powerful classifier, the support vector machine (SVM), using an intuitive, visual approach.

New Course: Interactive Maps with leaflet in R

Learn about our new R course. When you complete it, you’ll be able to create interactive maps using leaflet.

Convex Regression Model

This morning during the lecture on nonlinear regression, I mentioned (very) briefly the case of convex regression. Since I forgot to mention the codes in R, I will publish them here. Assume that $y_i=m(\mathbf{x}_i)+\varepsilon_i$, where $m:\mathbb{R}^d\rightarrow \mathbb{R}$ is some convex function.
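For orientation, one standard way to write the least-squares estimator under this convexity assumption is as a quadratic program. This is a sketch of the textbook formulation, not necessarily what the author's R code implements. The symbols $\theta_i$ (the fitted value at $\mathbf{x}_i$) and $\boldsymbol{\xi}_i$ (a subgradient of the fit at $\mathbf{x}_i$) are introduced here for illustration.

$$
\min_{\theta_1,\dots,\theta_n,\ \boldsymbol{\xi}_1,\dots,\boldsymbol{\xi}_n}\ \sum_{i=1}^n \left(y_i-\theta_i\right)^2
\quad\text{subject to}\quad
\theta_j \ \ge\ \theta_i + \boldsymbol{\xi}_i^\top\left(\mathbf{x}_j-\mathbf{x}_i\right)\quad \text{for all } i,j .
$$

The constraints say that every fitted point lies on or above each supporting hyperplane, which is exactly the condition for the fitted values to be interpolated by a convex function.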

Introduction to Apache Spark

Apache Spark is a cluster computing platform designed to be fast and general-purpose. At its core, Spark is a ‘computational engine’ that is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines, or a computing cluster. On the speed side, Spark extends the popular MapReduce model to efficiently support more types of computations, including interactive queries and stream processing. One of the main features Spark offers for speed is the ability to run computations in memory. On the generality side, Spark is designed to cover a wide range of workloads that previously required separate distributed systems, including batch applications, iterative algorithms, interactive queries, and streaming. Combining these processing types is often necessary in production data analysis pipelines. Spark is designed to be highly accessible, offering simple APIs in Python, Java, Scala, and SQL, and rich built-in libraries. It also integrates closely with other Big Data tools. In particular, Spark can run in Hadoop clusters and access any Hadoop data source, including Cassandra (a NoSQL database management system).

First Class R Support in Binder / Binderhub – Shiny Apps As Well as R-Kernels and RStudio

I notice from the binder-examples/r repo that Binderhub now appears to offer all sorts of R goodness out of the can, if you specify a particular R build.