Oracle Acquires Machine Learning Platform DataScience.com

Oracle announced today that it has acquired DataScience.com, a privately held cloud workspace platform for data science projects and workloads. Financial terms of the deal were not disclosed. In the near term, not much will change for customers of DataScience.com: it will continue to offer the same products and services to partners post-acquisition. But Oracle envisions combining its Cloud Infrastructure service with DataScience.com’s tools for a single, unified machine learning solution. “Every organization is now exploring data science and machine learning as a key way to proactively develop competitive advantage, but the lack of comprehensive tooling and integrated machine learning capabilities can cause these projects to fall short,” Amit Zavery, vice president of Oracle’s Cloud Platform, said in a statement. “With the combination of Oracle and DataScience.com, customers will be able to harness a single data science platform to more effectively leverage machine learning and big data for predictive analysis and improved business results.”


Introduction to Loss Functions

The loss function is the bread and butter of modern Machine Learning; it takes your algorithm from theoretical to practical and transforms neural networks from glorified matrix multiplication into Deep Learning. This post will explain the role of loss functions and how they work, while surveying a few of the most popular of the past decade.
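As a rough illustration of what a loss function does (this is not code from the post), the sketch below scores predictions against targets with two widely used losses: mean squared error for regression and binary cross-entropy for 0/1 classification. The function names and the toy numbers are assumptions made up for this example.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average squared difference; a common regression loss."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood for 0/1 labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so log(0) never occurs
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

# Toy values, invented for illustration only.
mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
bce_good = binary_cross_entropy([1, 0], [0.9, 0.1])  # confident and right
bce_bad = binary_cross_entropy([1, 0], [0.1, 0.9])   # confident and wrong
print(mse, bce_good, bce_bad)
```

In both cases a lower number means the predictions sit closer to the targets, which is the signal a training procedure tries to minimize.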


Deep Learning Scaling is Predictable, Empirically

Deep learning (DL) creates impactful advances following a virtuous recipe: model architecture search, creating large training data sets, and scaling computation. It is widely believed that growing training sets and models should improve accuracy and result in better products. As DL application domains grow, we would like a deeper understanding of the relationships between training set size, computational scale, and model accuracy improvements to advance the state-of-the-art. This paper presents a large scale empirical characterization of generalization error and model size growth as training sets grow. We introduce a methodology for this measurement and test four machine learning domains: machine translation, language modeling, image processing, and speech recognition. Our empirical results show power-law generalization error scaling across a breadth of factors, resulting in power-law exponents—the ‘steepness’ of the learning curve—yet to be explained by theoretical work. Further, model improvements only shift the error but do not appear to affect the power-law exponent. We also show that model size scales sublinearly with data size. These scaling relationships have significant implications on deep learning research, practice, and systems. They can assist model debugging, setting accuracy targets, and decisions about data set growth. They can also guide computing system design and underscore the importance of continued computational scaling.
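To make the idea of a power-law learning curve concrete, here is a minimal sketch, not taken from the paper, that recovers an exponent from (training set size, error) pairs by fitting a straight line in log-log space. It assumes a pure power law of the form error = alpha * size^beta with no additive error floor; the function name `fit_power_law` and the data points are synthetic and invented for illustration, not measurements.

```python
import math

def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(alpha) + beta * log(size)."""
    xs = [math.log(m) for m in sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    beta = sxy / sxx                           # slope = power-law exponent
    alpha = math.exp(y_mean - beta * x_mean)   # intercept, back on the original scale
    return alpha, beta

# Synthetic, noise-free points generated from alpha=2.0, beta=-0.3 (assumed values).
sizes = [1_000, 10_000, 100_000, 1_000_000]
errors = [2.0 * m ** -0.3 for m in sizes]

alpha, beta = fit_power_law(sizes, errors)
print(alpha, beta)
```

Under this simplified model, a better architecture would show up as a smaller alpha (a shifted curve) while beta, the steepness, stays the same, which is the pattern the abstract describes.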


An Introduction to Deep Learning for Tabular Data

There is a powerful technique that is winning Kaggle competitions and is widely used at Google (according to Jeff Dean), Pinterest, and Instacart, yet many people don’t even realize it is possible: the use of deep learning for tabular data, and in particular, the creation of embeddings for categorical variables. Despite what you may have heard, you can use deep learning for the type of data you might keep in a SQL database, a Pandas DataFrame, or an Excel spreadsheet (including time-series data). I will refer to this as tabular data, although it can also be known as relational data, structured data, or other terms (see my Twitter poll and comments for more discussion).
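For readers new to the idea, the sketch below shows only the lookup half of a categorical embedding: each category value maps to a small dense vector, and a row's vectors are concatenated into the input of a network. It is a simplified illustration under stated assumptions, not the article's code. The column names, dimensions, and helper functions are hypothetical, and the vectors are merely randomly initialized here, whereas a real model would learn them during training.

```python
import random

def build_embedding_table(categories, dim, seed=0):
    """Map each distinct category to a randomly initialized vector of length dim."""
    rng = random.Random(seed)
    return {c: [rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for c in sorted(set(categories))}

def embed_row(row, tables):
    """Concatenate the embedding vectors of each categorical column in a row."""
    features = []
    for column, value in row.items():
        features.extend(tables[column][value])
    return features

# Hypothetical tabular rows with two categorical columns.
rows = [
    {"weekday": "Mon", "store": "A"},
    {"weekday": "Tue", "store": "B"},
    {"weekday": "Mon", "store": "B"},
]
tables = {
    "weekday": build_embedding_table([r["weekday"] for r in rows], dim=3),
    "store": build_embedding_table([r["store"] for r in rows], dim=2, seed=1),
}
vectors = [embed_row(r, tables) for r in rows]
print(len(vectors[0]))  # 3 weekday dims + 2 store dims
```

Rows that share a category value share the same vector for that column, so whatever the model learns about "Mon" applies to every Monday row.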


Recommendation System in R

Recommender systems are used to predict the best products to offer to customers. They have become extremely popular in virtually every industry, helping customers find products they’ll like. Most people are familiar with the idea, and nearly everyone is exposed to several forms of personalized offers and recommendations each day (Google search ads being among the biggest sources). Building recommendation systems is part science, part art, and many have become extremely sophisticated. Such a system might seem daunting to the uninitiated, but it’s actually fairly straightforward to get started if you’re using the right tools. This is a post about building recommender systems in R.


Frequencies in Pandas — and a Little R Magic for Python

I’ve got a big digital mouth. Last time, I wrote on frequencies using R, noting cavalierly that I’d done similar development in Python/Pandas. I wasn’t lying, but the pertinent work I dug up from two years ago was less proof and more concept. Of course, R and Python are the two current language leaders for data science computing, while Pandas is to Python as data.table and tidyverse are to R for data management: everything. So I took on the challenge of extending the work I’d started in Pandas to replicate the frequencies functionality I’d developed in R. I was able to demonstrate to my satisfaction how it might be done, but not before running into several pitfalls.
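As a point of reference, a basic frequency table in Pandas can be sketched as below. This is not the author's code, just a minimal example built on `Series.value_counts`; the `color` column and its values are made up for illustration.

```python
import pandas as pd

# Made-up sample data for illustration.
df = pd.DataFrame({"color": ["red", "blue", "red", "green", "red", "blue"]})

# value_counts returns counts per distinct value, sorted in descending order.
counts = df["color"].value_counts()

# Combine counts and percentages into one frequency table.
freq = pd.DataFrame({
    "count": counts,
    "percent": 100 * counts / counts.sum(),
})
print(freq)
```

Passing `normalize=True` to `value_counts` gives proportions directly; computing them from the counts, as here, keeps both columns in one table.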


Smart Compose: Using Neural Networks to Help Write Emails



How to Organize Data Labeling for Machine Learning: Approaches and Tools

If there were a data science hall of fame, it would have a section dedicated to labeling. The labelers’ monument could be Atlas holding that large rock symbolizing their arduous, detail-laden responsibilities. ImageNet, an image database, would deserve its own stele. For nine years, its contributors manually annotated more than 14 million images. Just thinking about it makes you tired. While labeling is not launching a rocket into space, it’s still serious business. Labeling is an indispensable stage of data preprocessing in supervised learning. Historical data with predefined target attributes (values) is used for this model training style. An algorithm can only find target attributes if a human mapped them. Labelers must be extremely attentive because each mistake or inaccuracy negatively affects a dataset’s quality and the overall performance of a predictive model.

How to get a high-quality labeled dataset without getting grey hair

The main challenge is to decide who will be responsible for labeling, how much time it will take, and which tools are best to use. We briefly described labeling in the article about the general structure of a machine learning project. Here we will talk more about labeling approaches, techniques, and tools.


Enterprise Dashboards with R Markdown

We have been living with spreadsheets for so long that most office workers think it is obvious that spreadsheets generated with programs like Microsoft Excel make it easy to understand data and communicate insights. Everyone in a business, from the newest intern to the CEO, has had some experience with spreadsheets. But using Excel as the de facto analytic standard is problematic. Relying exclusively on Excel produces environments where it is almost impossible to organize and maintain efficient operational workflows. In addition to fostering low productivity, organizations risk profits and reputations in an age where insightful analyses and process control translate to a competitive advantage. Most organizations want better control over accessing, distributing, and processing data. You can use the R programming language, along with R Markdown reports and RStudio Connect, to build enterprise dashboards that are robust, secure, and manageable.


Advances in Machine Learning and Data Science – Recent Achievements and Research Directives

• Optimization of Adaptive Resonance Theory Neural Network Using Particle Swarm Optimization Technique
• Accelerating Airline Delay Prediction-Based P-CUDA Computing Environment
• IDPC-XML: Integrated Data Provenance Capture in XML
• Learning to Classify Marathi Questions and Identify Answer Type Using Machine Learning Technique
• A Dynamic Clustering Algorithm for Context Change Detection in Sensor-Based Data Stream System
• Predicting High Blood Pressure Using Decision Tree-Based Algorithm
• Design of Low-Power Area-Efficient Shift Register Using Transmission Gate
• Prediction and Analysis of Liver Patient Data Using Linear Regression Technique
• Image Manipulation Detection Using Harris Corner and ANMS
• Spatial Co-location Pattern Mining Using Delaunay Triangulation
• Review on RBFNN Design Approaches: A Case Study on Diabetes Data
• Keyphrase and Relation Extraction from Scientific Publications
• Mixing and Entrainment Characteristics of Jet Control with Crosswire
• GCV-Based Regularized Extreme Learning Machine for Facial Expression Recognition
• Prediction of Social Dimensions in a Heterogeneous Social Network
• Game Theory-Based Defense Mechanisms of Cyber Warfare
• Challenges Inherent in Building an Intelligent Paradigm for Tumor Detection Using Machine Learning Algorithms
• Segmentation Techniques for Computer-Aided Diagnosis of Glaucoma: A Review
• Performance Analysis of Information Retrieval Models on Word Pair Index Structure
• Fast Fingerprint Retrieval Using Minutiae Neighbor Structure
• Key Leader Analysis in Scientific Collaboration Network Using H-Type Hybrid Measures
• A Graph-Based Method for Clustering of Gene Expression Data with Detection of Functionally Inactive Genes and Noise
• OTAWE-Optimized Topic-Adaptive Word Expansion for Cross Domain Sentiment Classification on Tweets
• DCaP—Data Confidentiality and Privacy in Cloud Computing: Strategies and Challenges
• Design and Development of a Knowledge-Based System for Diagnosing Diseases in Banana Plants
• A Review on Methods Applied on P300-Based Lie Detectors
• Implementation of Spectral Subtraction Using Sub-band Filtering in DSP C6748 Processor for Enhancing Speech Signal
• In-silico Analysis of LncRNA-mRNA Target Prediction
• Energy Aware GSA-Based Load Balancing Method in Cloud Computing Environment
• Relative Performance Evaluation of Ensemble Classification with Feature Reduction in Credit Scoring Datasets
• Family-Based Algorithm for Recovering from Node Failure in WSN
• Classification-Based Clustering Approach with Localized Sensor Nodes in Heterogeneous WSN (CCL)
• Multimodal Biometric Authentication System Using Local Hand Features
• Automatic Semantic Segmentation for Change Detection in Remote Sensing Images
• A Model for Determining Personality by Analyzing Off-line Handwriting
• Wavelength-Convertible Optical Switch Based on Cross-Gain Modulation Effect of SOA
• Contrast Enhancement Algorithm for IR Thermograms Using Optimal Temperature Thresholding and Contrast Stretching
• Data Deduplication and Fine-Grained Auditing on Big Data in Cloud Storage


Probability and Statistics – Cookbook

This cookbook integrates a variety of topics in probability theory and statistics. It is based on literature and in-class material from courses of the statistics department at the University of California, Berkeley, but is also influenced by other sources.