How to write a production-level code in Data Science?

The ability to write production-level code is one of the sought-after skills for a data scientist role, whether or not the posting says so explicitly. For a software engineer turned data scientist, this may not sound like a challenging task, as they may have already perfected their skill at developing production-level code and deployed it to production several times. This article is for those who are new to writing production-level code and interested in learning it, such as fresh university graduates or professionals who have moved into data science (or are planning to make the transition). For them, writing production-level code might seem like a formidable task.

‘Kindof’ Big Data in R

Ask for feedback from just about any critic of the R statistical package and you’ll hear two consistent responses: 1) R is a difficult language to learn, and 2) R’s data size limitation to physical RAM consigns it to toy academic applications. Having worked with R for over 15 years, I probably shared those views to some extent early on, but no longer. Yes, R’s a language where array-oriented processing and functional programming reign, but that’s a good thing and pretty much the direction of modern data science languages: think Python with Pandas, NumPy, SciPy, and scikit-learn. As for the memory limitation on data size, that’s much less onerous now than even five years ago. I point this out as I develop a Jupyter Notebook on a nifty ThinkPad with 64 GB RAM and two 1 TB SSDs that I purchased two years ago for $2500. That’s about the annual license maintenance for a single seat of one of R’s commercial competitors.

The Future & Innovations of Artificial Intelligence In Manufacturing Market

The manufacturing industry has always been open to adopting the latest technologies. Industrial robots and drones have been part of the manufacturing industry since the 1960s. The next automation revolution is just around the corner, and the US manufacturing sector is eagerly awaiting these technological advancements. Because AI adoption can keep inventories lean and reduce costs, there is a high possibility that the American manufacturing industry will experience drastic growth. It is said that the manufacturing sector has to prepare quickly for networked factories where design team, supply chain, production line, and quality control are highly integrated into an intelligent engine that provides actionable insights.

RStudio Connect v1.5.14

RStudio Connect v1.5.14 is now available! This release includes support for secure environment variables, customizing email subject lines, and beta support for serving TensorFlow models. This release introduces beta support for SuSE and will be the last version of RStudio Connect to support Ubuntu 12.04 and Internet Explorer 10. Contact RStudio for more information on supported platforms.

Visualizing the Capital Asset Pricing Model

Today, we will move on to visualizing the CAPM beta and explore some ggplot and highcharter functionality, along with the broom package.

Generating codebooks in R

A codebook is a technical document that provides an overview of, and information about, the variables in a dataset. It ensures that the statistician has the complete background information necessary to undertake the analysis, and it documents the data so that the data is well understood and reusable in the future. Here we will show how to create codebooks in R using the dataMaid package. The help pages for the datasets in R packages usually provide thorough information, although the level of detail may vary quite substantially from dataset to dataset. As an example we will consider the iris dataset. Its help page gives decent information, so we will just use it to show how we would create a codebook.
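As a rough illustration of the idea (not the original article's code), the sketch below wraps dataMaid's makeCodebook() in a small helper. The helper name make_iris_codebook is hypothetical, and the sketch assumes that the dataMaid package and a working R Markdown toolchain are installed.

```r
# Hypothetical helper; the name make_iris_codebook is an assumption for illustration.
make_iris_codebook <- function() {
  # Assumption: dataMaid is installed. Stop with a clear message otherwise.
  if (!requireNamespace("dataMaid", quietly = TRUE)) {
    stop("The dataMaid package is required to build the codebook.")
  }
  # makeCodebook() writes an R Markdown codebook for the data frame
  # and renders it using dataMaid's default settings.
  dataMaid::makeCodebook(datasets::iris)
}
```

Calling make_iris_codebook() in an interactive session should produce a codebook document for iris in the working directory.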

Cluster Analysis – Naming Pattern in the last Century

A while back I got interested in (baby) first names. It is interesting to follow how friends name their newborns. There are various naming “strategies”: a) pick a name that was used in previous generations but is now uncommon (e.g. in Germany that would be Oskar), b) pick a biblical name that never ages, such as Johannes or David, or c) use an interesting-sounding “foreign” name (e.g. Alina). I thought it would be interesting to figure out whether we can cluster names and uncover these patterns (if they really exist) in a real dataset. Next, I scraped the last 116 years of data from this German source. Due to Germany’s federal system there is no single consolidated database for birth statistics or naming data. Hence, the scraped data is sampled and provides only a ranking of the top 30 or so given names per year (at least separated for boys and girls).

R Tip: Get Out of the Habit of Calling View() Directly

View() only works correctly in interactive environments, and currently not in RMarkdown contexts. It is better to call a wrapper that safely dispatches to View() or to something else, depending on whether you are in an interactive or non-interactive session. The following code will work interactively, in RMarkdown, or even in a reprex.
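The original tip's code is not reproduced in this excerpt. A minimal sketch of the dispatch idea might look like the following; the function name safe_view is hypothetical, and the use of knitr::kable() as the non-interactive fallback is an assumption (plain print() is used when knitr is not installed).

```r
# Hypothetical wrapper; the name safe_view is an assumption for illustration.
safe_view <- function(x, title = deparse(substitute(x))[1]) {
  if (interactive()) {
    # Interactive session: open the spreadsheet-style data viewer.
    View(x, title)
    invisible(x)
  } else if (requireNamespace("knitr", quietly = TRUE)) {
    # Non-interactive session (RMarkdown, reprex): return a printable table.
    knitr::kable(x)
  } else {
    # Fallback when knitr is unavailable: plain console output.
    print(x)
    invisible(x)
  }
}

safe_view(head(iris))
```

Calling safe_view() instead of View() keeps the same line of code usable in a console session and in a rendered document.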