Data science is a buzzword for the application of statistics enabled by advances in computer science.
Data science is the study of the generalizable extraction of knowledge from data, and the key word is science. It builds on techniques and theories from many fields, including signal processing, mathematics, probability models, machine learning, statistical learning, computer programming, data engineering, pattern recognition and learning, visualization, uncertainty modeling, data warehousing, and high-performance computing, with the goal of extracting meaning from data and creating data products. Data science is not restricted to big data, although the growing scale of data makes big data an important aspect of the field.
The need for efficient and scalable numerical linear algebra and machine-learning implementations continues to grow with the increasing importance of big data analytics. Since its introduction, Apache Spark has become an integral tool in this field, with attractive features such as ease of use, interoperability with the Hadoop ecosystem, and fault tolerance. However, it has been shown that numerical linear algebra routines implemented using MPI, a tool for parallel programming commonly used in high-performance computing, can outperform the equivalent Spark routines by an order of magnitude or more. We describe Alchemist, a system for interfacing between Spark and existing MPI libraries that is designed to address this performance gap. The libraries can be called from a Spark application with little effort, and we illustrate how the resulting system leads to efficient and scalable performance on large datasets.
Alchemist: An Apache Spark <=> MPI Interface …
Dual Discriminator Generative Adversarial net (D2GAN)
We propose in this paper a novel approach to tackle the problem of mode collapse encountered in generative adversarial networks (GANs). Our idea is intuitive but proven to be very effective, especially in addressing some key limitations of GANs. In essence, it combines the Kullback-Leibler (KL) and reverse KL divergences into a unified objective function, thereby exploiting the complementary statistical properties of these divergences to effectively diversify the estimated density in capturing multiple modes. We term our method dual discriminator generative adversarial nets (D2GAN). Unlike a GAN, it has two discriminators, and together with a generator it is also analogous to a minimax game: one discriminator gives high scores to samples from the data distribution, whilst the other, conversely, favors data from the generator, and the generator produces data to fool both discriminators. We develop a theoretical analysis to show that, given the maximal discriminators, optimizing the generator of D2GAN reduces to minimizing both the KL and reverse KL divergences between the data distribution and the distribution induced by the data the generator produces, hence effectively avoiding the mode collapse problem. We conduct extensive experiments on synthetic and real-world large-scale datasets (MNIST, CIFAR-10, STL-10, ImageNet), where we have made our best effort to compare D2GAN with the latest state-of-the-art GAN variants in comprehensive qualitative and quantitative evaluations. The experimental results demonstrate the competitive and superior performance of our approach over baselines in generating good-quality and diverse samples, and the capability of our method to scale up to the ImageNet database. …
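The complementary behavior of the KL and reverse KL divergences mentioned in the abstract can be illustrated with a small numerical sketch. This is not the D2GAN training procedure: the four-bin distributions, the helper names `kl` and `combined`, and the weights `alpha` and `beta` are all assumptions chosen for illustration. The sketch only shows that KL(data || model) heavily penalizes a model that drops a mode, while KL(model || data) heavily penalizes a model that spreads mass where the data has almost none, so a weighted sum of the two penalizes both failures.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def combined(p, q, alpha=1.0, beta=1.0):
    """Weighted sum of KL(p || q) and KL(q || p); alpha and beta are
    hypothetical weights, not values taken from the abstract."""
    return alpha * kl(p, q) + beta * kl(q, p)

# Toy 4-bin example (an assumption): the data has two equally likely
# modes in bins 0 and 3, with a tiny mass eps elsewhere to keep logs finite.
eps = 1e-3
p_data = [0.5 - eps, eps, eps, 0.5 - eps]

# A "collapsed" model puts almost all of its mass on a single mode.
p_collapsed = [1 - 3 * eps, eps, eps, eps]

# A "blurry" model covers both modes but leaks mass in between.
p_blurry = [0.25, 0.25, 0.25, 0.25]

for name, p_model in [("collapsed", p_collapsed), ("blurry", p_blurry)]:
    forward = kl(p_data, p_model)   # KL(data || model)
    reverse = kl(p_model, p_data)   # KL(model || data)
    print(f"{name}: forward={forward:.3f} reverse={reverse:.3f} "
          f"combined={combined(p_data, p_model):.3f}")
```

In this toy setting the forward term is the larger one for the collapsed model and the reverse term is the larger one for the blurry model, which is the sense in which the two divergences complement each other.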