Ultimate guide to handle Big Datasets for Machine Learning using Dask (in Python)

Have you ever tried working with a large dataset on a machine with 4GB of RAM? Does it start heating up while doing the simplest of machine learning tasks? This is a common problem data scientists face when working with restricted computational resources. When I started my data science journey using Python, I almost immediately realized that the existing libraries have certain limitations when it comes to handling large datasets. Pandas and NumPy are great libraries, but they are not always computationally efficient, especially when there are gigabytes of data to manipulate. So what can you do to get around this obstacle?

Infographic – A Complete Guide on Getting Started with Deep Learning in Python

You seem to come across the term ‘Deep Learning’ everywhere these days. It’s all-pervasive and seems to be at the heart of all AI-related research. It has even spawned new and never-thought-of-before innovations! But how can you learn it? There are way too many resources out there, spread out in an unstructured and not very beginner-friendly manner. You complete a course on one platform, move to another course on a different platform, and so on. You learn, but not in any logical or sequential manner. That’s a bad idea.

A Complete Guide on Getting Started with Deep Learning in Python

Here our aim is to provide a learning path for all those who are new to deep learning and also for those who want to explore it further. So are you ready to set out on the journey of conquering Deep Learning? Let’s GO!

Explore MapD Streaming Capabilities with IoT Devices

The number of Internet of Things (IoT) devices is growing every day. It’s estimated that there will be 30 billion devices by 2020. IoT devices can range from a complicated A/C thermostat to a small sensor on an assembly line. We’ll show how MapD can work with IoT to visualize live data from sensors, which opens the door for endless applications. To show an example of IoT sensors, we created 4 devices. Each device consists of an ESP8266-based board and a DHT11 temperature/humidity sensor. We also designed and 3D-printed a case and cover, because why not. Each assembly costs around $5, and with the built-in ESP8266, we get WiFi capabilities.

Production ML for Data Scientists: What You Can Do and How to Make It Easy, August 22 Webinar

Learn about MLOps (machine learning operationalization), which breaks down the silos between data science and IT, streamlines deployment and orchestration, and adds advanced functionality.

Optimization 101 for Data Scientists

As a data scientist, you spend a lot of your time helping to make better decisions. You build predictive models to provide improved insights. You might be predicting whether an image is a cat or dog, store sales for the next month, or the likelihood that a part will fail. In this post, I won’t help you with making better predictions, but instead with how to make the best decision. The post strives to give you some background on optimization. It starts with a simple toy example to show you the math behind an optimization calculation. After that, this post tackles a more sophisticated optimization problem: trying to pick the best team for fantasy football. The FanDuel image below shows a very common sort of game that is widely played (ask your in-laws). The optimization strategies in this post were shown to consistently win! Along the way, I will show a few code snippets and provide links to working code in R, Python, and Julia. And if you do win money, feel free to share it 🙂
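To make the idea of picking the best team concrete, here is a minimal Python sketch of a toy selection problem. It is not the method from the post: the players, salaries, projected points, salary cap, and team size are all hypothetical, and the search is plain brute force over every possible team, which only works at toy scale.

```python
from itertools import combinations

# Hypothetical players: (name, salary, projected points). All values are made up.
players = [
    ("A", 9000, 20.0),
    ("B", 7000, 16.0),
    ("C", 6000, 15.0),
    ("D", 5000, 11.0),
    ("E", 4000, 10.0),
]
SALARY_CAP = 18000  # assumed budget constraint
TEAM_SIZE = 3       # assumed roster size


def best_team(players, team_size, cap):
    """Return the team with the most projected points that fits under the cap."""
    feasible = (
        team
        for team in combinations(players, team_size)
        if sum(salary for _, salary, _ in team) <= cap
    )
    return max(feasible, key=lambda team: sum(points for _, _, points in team))


team = best_team(players, TEAM_SIZE, SALARY_CAP)
team_names = [name for name, _, _ in team]
team_points = sum(points for _, _, points in team)
print(team_names, team_points)
```

The objective (total projected points) and the constraint (total salary under the cap) are the two ingredients of any such optimization problem. A realistic roster would need a proper solver rather than enumeration.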

Leave-one-out subset for your ggplot smoother

I have a dashboard at work that plots the number of contracts my office handles over time, and I wanted to add a trendline to show growth. However, a trendline fitted to the entire dataset skews low because our fiscal year just restarted. It took a little trial and error to figure out how to exclude the most recent year’s count so the trendline is more accurate for this purpose, so I thought I’d share my solution in a short post.
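The effect described here can be sketched in a few lines of Python. This is not the author's ggplot solution: the yearly counts are hypothetical and the trendline is a hand-written ordinary least-squares fit, used only to show how an incomplete final year drags the slope down and how leaving it out changes the fit.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical contract counts per fiscal year; the last year has only just started.
years = [2014, 2015, 2016, 2017, 2018]
contracts = [100, 120, 140, 160, 30]

# Trendline over every year, including the incomplete one.
slope_all, intercept_all = fit_line(years, contracts)

# Trendline with the most recent (incomplete) year left out.
slope_trim, intercept_trim = fit_line(years[:-1], contracts[:-1])

print(f"all years: {slope_all:+.1f} contracts/year")
print(f"excluding latest year: {slope_trim:+.1f} contracts/year")
```

All points can still be plotted. Only the data passed to the fit is subset.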

My 4 Key Takeaways on Data Lakes from the Gartner Data and Analytics Summit 2018

As in any hot technology market, people become either enamored or confused by new terms and acronyms: AI/ML, GDPR, IoT, big data, data hubs, and of course, ‘data lakes.’ A few weeks ago at the Gartner Data and Analytics Summit in Grapevine, TX, I attended 11 sessions, three analyst inquiries, and talked with dozens of customers and prospects. In this blog post, I’d like to share with you my key takeaways on data lakes.
1. The ‘data lake’ is a standard design pattern in today’s organizations for dealing with big data.
2. There are no silver bullets – data lakes must be governed like any other data platform.
3. Data lakes are quickly evolving in definition AND capabilities.
4. Organizations are choosing a new analytic/BI standard for their data lake.

A general framework for learning about research designs

Researchers need to select high quality research designs and communicate those designs to readers. Both tasks are difficult. We provide a framework for formally characterizing the analytically relevant features of a research design. In standard applications, the approach to design declaration that we describe requires defining a model of the world (M), an inquiry (I), a data strategy (D), and an answer strategy (A). Declaration of these features in code provides sufficient information for researchers and readers to use Monte Carlo techniques to diagnose properties such as power, bias, external validity, and other ‘diagnosands.’ Declaring a design lays researchers’ assumptions bare. Ex ante design declarations can be used to improve designs and facilitate preregistration, analysis, and reconciliation of intended and actual analyses. Ex post design declarations are also useful for describing, sharing, reanalyzing, and critiquing existing designs. We provide open-source software, DeclareDesign, to implement the proposed approach.
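As a rough illustration of the M, I, D, A idea, the following Python sketch declares a very simple two-arm experiment and diagnoses its bias and power by Monte Carlo simulation. It does not use or mirror the DeclareDesign software. The function names, the constant-effect model, the sample size, and the normal-approximation test are all assumptions made for this example.

```python
import random
import statistics


def model(n, effect, rng):
    """M: potential outcomes for n units under a constant treatment effect."""
    y0 = [rng.gauss(0, 1) for _ in range(n)]
    y1 = [y + effect for y in y0]
    return y0, y1


def inquiry(y0, y1):
    """I: the estimand, here the average treatment effect."""
    return statistics.fmean(b - a for a, b in zip(y0, y1))


def data_strategy(y0, y1, rng):
    """D: assign half the units to treatment at random and reveal one outcome each."""
    n = len(y0)
    treated = set(rng.sample(range(n), n // 2))
    return [(i in treated, y1[i] if i in treated else y0[i]) for i in range(n)]


def answer_strategy(data):
    """A: difference in means, with a two-sided normal-approximation test at 5%."""
    t = [y for z, y in data if z]
    c = [y for z, y in data if not z]
    estimate = statistics.fmean(t) - statistics.fmean(c)
    se = (statistics.variance(t) / len(t) + statistics.variance(c) / len(c)) ** 0.5
    return estimate, abs(estimate / se) > 1.96


def diagnose(n=100, effect=0.5, sims=500, seed=1):
    """Simulate the design repeatedly and summarize two diagnosands."""
    rng = random.Random(seed)
    errors, rejections = [], 0
    for _ in range(sims):
        y0, y1 = model(n, effect, rng)
        estimand = inquiry(y0, y1)
        estimate, significant = answer_strategy(data_strategy(y0, y1, rng))
        errors.append(estimate - estimand)
        rejections += significant
    return {"bias": statistics.fmean(errors), "power": rejections / sims}


diagnosis = diagnose()
print(diagnosis)
```

Because each of the four components is a separate function, changing one assumption (say, the sample size in the data strategy) and rerunning the diagnosis shows how the design's properties respond.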

Proteomics Data Analysis (1/3): Data Acquisition and Cleaning

The analysis of DNA and RNA, the blueprint of life and its carbon copy, has become a staple in the burgeoning field of molecular biology. An emerging and exciting area of study that adds another dimension to our understanding of cellular biology is that of proteomics. The use of mass spectrometry has enabled the identification and quantification of thousands of proteins in a single sample run.

Docker, but for Data

Aneesh Karve, co-founder and CTO of Quilt Data, discussed the inevitability of data being managed like source code. Versions, packages, and compilation are essential elements of software engineering. Why, he asks, would we do data engineering without those elements? Karve delved into Quilt’s open-source efforts to create a cross-platform data registry that stores, tracks, and marshals data to and from memory. He further elaborated on ecosystem technologies like Apache Arrow, Presto DB, Jupyter, and Hive.