Hyperparameter tuning is like tuning your guitar. And then magic happens!
Regardless of what we think about the mysterious subject of Probability, we live and breathe in a stochastic environment. From the ever-elusive Quantum Mechanics to our daily life (‘There is a 70% chance it will rain today’, ‘The chance of getting the job done in time is less than 30%’ …) we use it, knowingly or unknowingly. We live in a ‘Chancy, Chancy, Chancy world’. And thus, knowing how to reason about it is one of the most important tools in the arsenal of any person.
Analyzing Text Data in Just Two Lines of Code
Whether you are working on predicting data in an office setting or just competing in a Kaggle competition, it’s important to test out different models to find the best fit for the data you are working with. I recently had the opportunity to compete with some very smart colleagues in a private Kaggle competition predicting faulty water pumps in Tanzania. I ran the following models after doing some data cleaning and I’ll show you the results.
Suppose you are in a new town with neither a map nor GPS, and you need to reach downtown. You can try to assess your current position relative to your destination, as well as the effectiveness (value) of each direction you take. You can think of this as computing the value function. Or you can ask a local, who tells you to go straight, turn left when you see a fountain, and continue until you reach downtown. He gave you a policy to follow. Naturally, in this case, following the given policy is much less complicated than computing the value function on your own.
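To make the analogy concrete, here is a minimal sketch in Python. The town map, the place names and the "minus the number of steps" value definition are all assumptions invented for illustration; they are not taken from the article being summarized. The local's policy is a plain lookup, while the value-based route first has to compute a value for every place.

```python
from collections import deque

# Hypothetical town map: each place lists the places directly reachable from it.
TOWN = {
    "start": ["fountain", "market"],
    "market": ["start"],
    "fountain": ["start", "downtown"],
    "downtown": ["fountain"],
}

# Policy handed over by the local: a direct "where am I -> where next" lookup.
LOCAL_POLICY = {"start": "fountain", "fountain": "downtown"}


def compute_values(town, goal):
    """Value of a place = minus the number of steps to the goal (found by BFS)."""
    values = {goal: 0}
    queue = deque([goal])
    while queue:
        place = queue.popleft()
        for neighbour in town[place]:
            if neighbour not in values:
                values[neighbour] = values[place] - 1
                queue.append(neighbour)
    return values


def follow(step_fn, start, goal):
    """Walk from start to goal, asking step_fn for the next place each time."""
    path = [start]
    while path[-1] != goal:
        path.append(step_fn(path[-1]))
    return path


values = compute_values(TOWN, "downtown")
# Value-based navigation: always move to the neighbour with the highest value.
value_path = follow(lambda p: max(TOWN[p], key=lambda n: values[n]), "start", "downtown")
# Policy-based navigation: just do what the local said.
policy_path = follow(lambda p: LOCAL_POLICY[p], "start", "downtown")

print(value_path)
print(policy_path)
```

Both routes end up identical here; the difference is that the value-based one needed a full pass over the map before taking a single step.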
Data science has reached new levels of complexity and, of course, awesomeness. I’ve been doing this for years now, and what I want is for people to have a clear and easy path to do their job. I’ve been talking about data science and more for a while now, but it’s time to get our hands dirty and code together. This is the beginning of a series of articles about data science with Optimus, Spark and Python.
Let’s start this series by defining what time series are…
How to measure your model’s fairness and decide on the best fairness metrics.
Jupyter Kernel Gateway is a web server that provides headless access to Jupyter kernels.
Kaggle recently published its second annual Machine Learning and Data Science Survey. About 24,000 users across the globe responded to this survey, disclosing much information about their demographics, behaviour, and opinions. It gives a unique peek into the Machine Learning and Data Science industry.
EMA has just completed a groundbreaking (the word is apt here) research project assessing just how IT organizations are seeking to invest in, optimize, integrate and prioritize use cases for what we call ‘Advanced Operations Analytics’ or ‘AOA.’ AOA is our term for ‘big data for IT’ which others have termed ‘operations analytics’ and which EMA initially described as ‘advanced performance analytics’.
The majority of modern companies deal with processes that they want to automate. This need can arise for various reasons, in particular the routine, repetitive and boring nature of manual processes. Another shortcoming is that such processes often require a lot of time and human resources; additionally, office processes are prone to input mistakes. As a result, the staff lose their motivation and the companies lose time and money. Everyone wants to build an effective business, which can be achieved by applying modern automation technologies. One of the most promising technologies in this field is Robotic Process Automation (RPA), which relies on constructing agents that can simulate different types of user activity (mouse clicks, keyboard input, data scraping, etc.) to carry out routine, mostly Windows-based, tasks. RPA has use cases in many areas: finance and banking, insurance, telecommunications, healthcare, retail, government, HR, IT and many others. In this post, one of the simplest and most popular automation tasks – data scraping – will be considered, which can be interesting for business analysts and web developers.
Recently, Graph Neural Networks (GNNs) have gained increasing popularity in various domains, including social networks, knowledge graphs, recommender systems, and even life science. The power of GNNs in modeling the dependencies between nodes in a graph has enabled breakthroughs in research areas related to graph analysis. This article aims to introduce the basics of Graph Neural Networks and two more advanced algorithms, DeepWalk and GraphSage.
Hey guys! Have you ever given a thought to how banks identify fraudulent accounts? Or how you can detect faulty servers in your network? Or how you tackle problems in machine learning where you don’t have enough knowledge about your positive examples? Well, you’ve landed in the right place. Anomaly detection is a technique used to identify unusual patterns, called outliers, that do not conform to expected behaviour. It has many applications in business, from intrusion detection (identifying strange patterns in network traffic that could signal a hack) to system health monitoring (spotting a malignant tumour in an MRI scan), and from fraud detection in credit card transactions to fault detection in operating environments. Ok, Ok, enough of the theory, dude! Show some code, man. Here we GO!!
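As a taste of what "identifying unusual patterns" can look like, here is a minimal sketch using a z-score rule. This is one of the simplest possible detectors and is not necessarily the technique the linked article uses; the latency numbers and the threshold of 2.0 are made up for illustration.

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold=3.0):
    """Return (index, value) pairs whose distance from the mean exceeds
    `threshold` sample standard deviations."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing can stand out
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical server response times in milliseconds; one server looks faulty.
latencies = [102, 98, 101, 99, 100, 103, 97, 100, 500, 101]

# With only ten points a single extreme value inflates the standard deviation,
# so a lower threshold than the customary 3 is used here.
outliers = zscore_outliers(latencies, threshold=2.0)
print(outliers)  # [(8, 500)]
```

The comment about the threshold is the practical catch with this rule: on small samples the outlier itself drags the mean and standard deviation toward it, which is one reason more robust detectors exist.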
Suppose you’re playing a video game. You enter a room with two doors. Behind Door 1 are 100 gold coins, followed by a passageway. Behind Door 2 is 1 gold coin, followed by a second passageway going in a different direction. Once you go through one of the doors, there is no going back. Which of the doors should you choose? If you made your decision based solely on maximizing your immediate reward (or score), then your answer would be Door 1. However, the aim of most video games is not to maximize your score in a single section of the game, but to maximize your score for the entire game. After all, there could be 1000 gold coins at the end of the passageway behind Door 2, for all you know.
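The two-door choice can be written out as a small calculation. The sketch below is in Python; the 1000 coins behind Door 2 come from the "for all you know" scenario above, while the empty passageway behind Door 1 and the discount factor of 0.9 are assumptions added purely for illustration.

```python
def total_return(rewards, gamma=1.0):
    """Sum of rewards over time, each discounted by gamma per step."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))


# Rewards collected step by step behind each door (passageway contents assumed).
doors = {
    "Door 1": [100, 0],    # 100 coins now, nothing at the end of the passageway
    "Door 2": [1, 1000],   # 1 coin now, 1000 coins at the end of the passageway
}

# Greedy choice: look only at the immediate reward.
greedy_choice = max(doors, key=lambda d: doors[d][0])
# Far-sighted choice: look at the (discounted) score for the whole game.
far_sighted_choice = max(doors, key=lambda d: total_return(doors[d], gamma=0.9))

print(greedy_choice)       # Door 1
print(far_sighted_choice)  # Door 2
```

With these assumed numbers the greedy player walks away with 100 coins while the far-sighted one collects 1001, which is the gap between maximizing immediate reward and maximizing return.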
SQL is simpler and more readable than Pandas, which is why many people use it, quite apart from its long legacy. It’s also fast, and it’s the language for talking to databases and extracting data from data warehouses. It’s the stuff of data science at scale! In this article, I’ll walk through my notebook that roughly emulates the core workflow of this scaled-up data science: ETL. ETL stands for Extract, Transform, and Load. While this example is a notebook on my local computer, if the database file(s) were from a source system, extraction would involve moving them into a data warehouse. From there the data would be transformed using SQL queries. The final step would be loading the data into something like Python and Pandas to do machine learning and other cool stuff.
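The three ETL steps described above can be sketched with nothing but Python's built-in `sqlite3` module. The `sales` table, its columns and its rows are hypothetical stand-ins, not data from the notebook; an in-memory database plays the role of the warehouse, and the result is loaded into a plain list rather than Pandas to keep the sketch dependency-free.

```python
import sqlite3


def run_etl(rows):
    """Extract rows into a database, transform them with SQL, load into Python."""
    conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse
    try:
        # Extract: move the raw source rows into the database.
        conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

        # Transform: aggregate with a SQL query.
        cursor = conn.execute(
            "SELECT region, SUM(amount) AS total "
            "FROM sales GROUP BY region ORDER BY region"
        )

        # Load: pull the transformed result into Python for further analysis.
        return cursor.fetchall()
    finally:
        conn.close()


raw_rows = [("north", 10.0), ("south", 5.0), ("north", 2.5)]
totals = run_etl(raw_rows)
print(totals)  # [('north', 12.5), ('south', 5.0)]
```

If Pandas is installed, the load step could instead hand the same query and connection to `pandas.read_sql_query` to get a DataFrame directly.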
In the previous article of the series, we looked at the nature and extent of damage that attackers can inflict if they, too, start leveraging the capabilities of AI/ML, something that is but inevitable. In this part we will shift our attention towards another aspect by taking a closer look at privacy in the context of machine learning. The advancements in ML/AI have thrown a big wrench in the works as far as privacy is concerned. Each new scenario, however compelling from the functionality and utility standpoint, seems to bring in scary new ways to impact and compromise data privacy. Let us examine the many ways privacy surfaces in the context of AI/ML: governance, individual and organizational motivations, technical mechanisms for privacy assurance, regulatory aspects, etc.
With innovation in causal inference methods and a rise in non-experimental data availability, a growing number of prevention researchers and advocates are thinking about causal inference. In this commentary, we discuss the current state of science as it relates to causal inference in prevention research, and reflect on key assumptions of these methods. We review challenges associated with the use of causal inference methodology, as well as considerations for those hoping to integrate causal inference methods into their research. In short, this commentary addresses the key concepts of causal inference and suggests a greater emphasis on thoughtfully designed studies (to avoid the need for strong and potentially untestable assumptions) combined with analyses of sensitivity to those assumptions.
What if there was a way to quantitatively measure whether your machine learning (ML) model reflects specific domain expertise or potential bias? With post-training explanations? On a global level instead of a local level? Would industry people be interested? These are the kinds of questions Been Kim, Senior Research Scientist at Google Brain, posed in the MLConf 2018 talk, ‘Interpretability Beyond Feature Attribution: Testing with Concept Activation Vectors (TCAV)’. The MLConf talk is based on a paper Kim co-authored and the code is available. This Domino Data Science Field Note provides some distilled insights about TCAV, an interpretability method that allows researchers to understand and quantitatively measure the high-level concepts their neural network models are using for prediction, ‘even if the concept was not part of the training’ (Kim Slide 33). TCAV ‘uses directional derivatives to quantify the degree to which a user-defined concept is important to a classification result’ (Kim et al 2018).
Reinforcement learning algorithms rely on carefully engineering environment rewards that are extrinsic to the agent. However, annotating each environment with hand-designed, dense rewards is not scalable, motivating the need for developing reward functions that are intrinsic to the agent. Curiosity is a type of intrinsic reward function which uses prediction error as reward signal. In this paper: (a) We perform the first large-scale study of purely curiosity-driven learning, i.e. without any extrinsic rewards, across 54 standard benchmark environments, including the Atari game suite. Our results show surprisingly good performance, and a high degree of alignment between the intrinsic curiosity objective and the hand-designed extrinsic rewards of many game environments. (b) We investigate the effect of using different feature spaces for computing prediction error and show that random features are sufficient for many popular RL game benchmarks, but learned features appear to generalize better (e.g. to novel game levels in Super Mario Bros.). (c) We demonstrate limitations of the prediction-based rewards in stochastic setups. Game-play videos and code are at https://pathak22.github.io/large-scale-curiosity/.
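The idea of "prediction error as reward signal" can be illustrated with a toy sketch. The code below is not the paper's implementation, which relies on neural networks over learned or random feature embeddings; it assumes a hypothetical tabular forward model and hand-written two-dimensional feature vectors, only to show how the intrinsic reward shrinks as a transition becomes predictable.

```python
def curiosity_reward(predicted_next, actual_next):
    """Intrinsic reward: mean squared error between predicted and actual features."""
    errors = [(p - a) ** 2 for p, a in zip(predicted_next, actual_next)]
    return sum(errors) / len(errors)


class TabularForwardModel:
    """Toy forward model: remembers a running estimate of the next-state
    features for every (state, action) pair it has seen."""

    def __init__(self, feature_dim, learning_rate=0.5):
        self.feature_dim = feature_dim
        self.learning_rate = learning_rate
        self.table = {}

    def predict(self, state, action):
        # Unseen transitions are predicted as an all-zero feature vector.
        return self.table.get((state, action), [0.0] * self.feature_dim)

    def update(self, state, action, actual_next):
        predicted = self.predict(state, action)
        self.table[(state, action)] = [
            p + self.learning_rate * (a - p) for p, a in zip(predicted, actual_next)
        ]


model = TabularForwardModel(feature_dim=2)
next_features = [1.0, 0.0]  # hypothetical features of the state reached

rewards = []
for _ in range(3):  # experience the same transition three times
    prediction = model.predict("room_a", "go_right")
    rewards.append(curiosity_reward(prediction, next_features))
    model.update("room_a", "go_right", next_features)

print(rewards)  # [0.5, 0.125, 0.03125]
```

The reward decays each time the same transition is revisited, so an agent maximizing it is pushed toward transitions it cannot yet predict. A truly random transition would never become predictable, which hints at the limitation in stochastic setups mentioned in point (c).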
The Mining Software Repositories (MSR) field analyzes the rich data available in software repositories to uncover interesting and actionable information about software systems and projects. The goal of this two-day conference is to advance the science and practice of MSR. The 16th International Conference on Mining Software Repositories will be co-located with ICSE 2019 in Montréal, QC, Canada. Software repositories such as source control systems, archived communications between project personnel, and defect tracking systems are used to help manage the progress of software projects. Software practitioners and researchers are recognizing the benefits of mining this information to support the maintenance of software systems, improve software design/reuse, and empirically validate novel ideas and techniques. Research is now proceeding to uncover the ways in which mining these repositories can help to understand software development and software evolution, to support predictions about software development, and to exploit this knowledge in planning future development. The goal of this two-day international conference is to advance the science and practice of software engineering via the analysis of data stored in software repositories.