• Connecting Sharpe ratio and Student t-statistic, and beyond
• Modeling Meaning Associated with Documental Entities: Introducing the Brussels Quantum Approach
• Machine Learning Promoting Extreme Simplification of Spectroscopy Equipment
• Eikos: a Bayesian unfolding method for differential cross-section measurements
• On the Feasibility of FPGA Acceleration of Molecular Dynamics Simulations
• Active Learning for Regression Using Greedy Sampling
• Affect Estimation in 3D Space Using Multi-Task Active Learning for Regression
• Algorithmic No-Cloning Theorem
• Relational dynamic memory networks
• CT Super-resolution GAN Constrained by the Identical, Residual, and Cycle Learning Ensemble(GAN-CIRCLE)
• Self-Organization Scheme for Balanced Routing in Large-Scale Multi-Hop Networks
• Model Reduction with Memory and the Machine Learning of Dynamical Systems
• Unsupervised Learning of Sentence Representations Using Sequence Consistency
• Connectivity-Driven Brain Parcellation via Consensus Clustering
• Effective Unsupervised Author Disambiguation with Relative Frequencies
• Homophonic Quotients of Linguistic Free Groups: German, Korean, and Turkish
• The effective entropy of next/previous larger/smaller value queries
• Snapshot compressed sensing: performance bounds and algorithms
• Multi-Channel Stochastic Variational Inference for the Joint Analysis of Heterogeneous Biomedical Data in Alzheimer’s Disease
• Artin Groups and Iwahori-Hecke algebras over finite fields
• Estimation of natural indirect effects robust to unmeasured confounding and mediator measurement error
• Saturation Games for Odd Cycles
• Unified power system analyses and models using equivalent circuit formulation
• Genome-Wide Association Studies: Information Theoretic Limits of Reliable Learning
• Development of an 8 channel sEMG wireless device based on ADS1299 with Virtual Instrumentation
• Simple versus Optimal Contracts
• Jamming and Tiling in Aggregation of Rectangles
• This Time with Feeling: Learning Expressive Musical Performance
• Pointwise control of the linearized Gear-Grimshaw system
• Learning to Represent Bilingual Dictionaries
• From POS tagging to dependency parsing for biomedical event extraction
• An Implementation, Empirical Evaluation and Proposed Improvement for Bidirectional Splitting Method for Argumentation Frameworks under Stable Semantics
• Learning Multi-touch Conversion Attribution with Dual-attention Mechanisms for Online Advertising
• Ancient-Modern Chinese Translation with a Large Training Dataset
• Sketch for a Theory of Constructs
• Trajectory planning optimization for real-time 6DOF robotic patient motion compensation
• Dropout during inference as a model for neurological degeneration in an image captioning network
• Identification and Bayesian inference for heterogeneous treatment effects under non-ignorable assignment condition
• Zero-sum path-dependent stochastic differential games in weak formulation
• Resilience-based performance modeling and decision optimization for transportation network
• Restricted permutations refined by number of crossings and nestings
• The ActivityNet Large-Scale Activity Recognition Challenge 2018 Summary
• Reciprocity and success in academic careers
• Improved Methods for Moment Restriction Models with Marginally Incompatible Data Combination and an Application to Two-sample Instrumental Variable Estimation
• Magnetic microstructure machine learning analysis
• Statistics of the Eigenvalues of a Noisy Multi-Soliton Pulse
• Criticality of Lagrange Multipliers in Variational Systems
• The Impact of Automatic Pre-annotation in Clinical Note Data Element Extraction – the CLEAN Tool
• Bayesian Bivariate Subgroup Analysis for Risk-Benefit Evaluation
• A Full End-to-End Semantic Role Labeler, Syntax-agnostic Over Syntax-aware?
• Fast RodFIter for Attitude Reconstruction from Inertial Measurements
• Automatically Designing CNN Architectures Using Genetic Algorithm for Image Classification
• Constant overhead quantum fault-tolerance with quantum expander codes
• Learning Discriminative 3D Shape Representations by View Discerning Networks
• Protecting the Grid against IoT Botnets of High-Wattage Devices
• Racial Disparities and Mistrust in End-of-Life Care
• On modelling positive continuous data with spatio-temporal dependence
• A Simple Network of Nodes Moving on the Circle
• Sample size determination in superiority or non-inferiority clinical trials with time-to-event data under exponential, Weibull and Gompertz distributions
• Self-Supervised Model Adaptation for Multimodal Semantic Segmentation
• Fake Sentence Detection as a Training Task for Sentence Encoding
• Fully-Automated Analysis of Body Composition from CT in Cancer Patients Using Convolutional Neural Networks
• Algorithm to Prove Formulas for the Expected Number of Questions in Mastermind Games
• A Stage-wise Decision Framework for Transportation Network Resilience Planning
• Upper and Lower Bounds on Zero-Sum Generalized Schur Numbers
• A Parameterized Complexity View on Description Logic Reasoning
• Neural Importance Sampling
• Spectral norm of a symmetric tensor and its computation
• Mixed-integer bilevel representability
• Ring statistics in 2D-silica: effective temperatures in equilibrium
• Orders-of-magnitude speedup in atmospheric chemistry modeling through neural network-based emulation
• Compound Poisson Noise Sources in Diffusion-based Molecular Communication
• Several classes of minimal linear codes with few weights from weakly regular plateaued functions
• Parallelization does not Accelerate Convex Optimization: Adaptivity Lower Bounds for Non-smooth Convex Minimization
• Backpressure-based Resource allocation for buffer-aided underlay D2D networks
• Approximately uniformly locally finite graphs
• Semi-supervised Skin Lesion Segmentation via Transformation Consistent Self-ensembling Model
• A note on hypergraph colorings
• Robust high dimensional factor models with applications to statistical machine learning
• Partitioning a graph into cycles with a specified number of chords
• The Stochastic Fejér-Monotone Hybrid Steepest Descent Method
• Engineering and Economic Analysis for Electric Vehicle Charging Infrastructure — Placement, Pricing, and Market Design
• Iterative Global Similarity Points : A robust coarse-to-fine integration solution for pairwise 3D point cloud registration
• Mobility edge and Black Hole Horizon
• Addressee and Response Selection for Multilingual Conversation
• Linguistic Relativity and Programming Languages
• Multimodal Language Analysis with Recurrent Multistage Fusion
• Sequence Labeling: A Practical Approach
• Discrete-time Risk-sensitive Mean-field Games
• $PC$-polynomial of graph
• Protocol for an observational study on the effects of playing football in adolescence on mental health in early adulthood
• Fine-grained visual recognition with salient feature detection
• Denoising of 3-D Magnetic Resonance Images Using a Residual Encoder-Decoder Wasserstein Generative Adversarial Network
• Self-Triggered Network Coordination over Noisy Communication Channels
• Unsupervised learning for cross-domain medical image synthesis using deformation invariant cycle consistency networks
• A New Look at $F$-Tests
• An Asymptotically Efficient Metropolis-Hastings Sampler for Bayesian Inference in Large-Scale Educational Measuremen
• Plithogeny, Plithogenic Set, Logic, Probability, and Statistics
• On-Device Federated Learning via Blockchain and its Latency Analysis
• A Fourier View of REINFORCE
• Open-World Stereo Video Matching with Deep RNN
We solved a stylized fact on a long memory process of volatility cluster phenomena by using Minkowski metric for GARCH(1,1) under assumption that price and time can not be separated. We provide a Yang-Mills equation in financial market and anomaly on superspace of time series data as a consequence of the proof from the general relativity theory. We used an original idea in Minkowski spacetime embedded in Kolmogorov space in time series data with behavior of traders.The result of this work is equivalent to the dark volatility or the hidden risk fear field induced by the interaction of the behavior of the trader in the financial market panic when the market crashed.
Domain adversarial learning aligns the feature distributions across the source and target domains in a two-player minimax game. Existing domain adversarial networks generally assume identical label space across different domains. In the presence of big data, there is strong motivation of transferring deep models from existing big domains to unknown small domains. This paper introduces partial domain adaptation as a new domain adaptation scenario, which relaxes the fully shared label space assumption to that the source label space subsumes the target label space. Previous methods typically match the whole source domain to the target domain, which are vulnerable to negative transfer for the partial domain adaptation problem due to the large mismatch between label spaces. We present Partial Adversarial Domain Adaptation (PADA), which simultaneously alleviates negative transfer by down-weighing the data of outlier source classes for training both source classifier and domain adversary, and promotes positive transfer by matching the feature distributions in the shared label space. Experiments show that PADA exceeds state-of-the-art results for partial domain adaptation tasks on several datasets.
In this paper we introduce a new machine learning (ML) model for nonlinear regression called Boosting Smooth Transition Regression Tree (BooST). The main advantage of the BooST is that it estimates the derivatives (partial effects) of very general nonlinear models, providing more interpretation than other tree based models concerning the mapping between the covariates and the dependent variable. We provide some asymptotic theory that shows consistency of the partial derivatives and we present some examples on simulated and empirical data.
We present LemmaTag, a featureless recurrent neural network architecture that jointly generates part-of-speech tags and lemmatizes sentences of languages with complex morphology, using bidirectional RNNs with character-level and word-level embeddings. We demonstrate that both tasks benefit from sharing the encoding part of the network and from using the tagger output as an input to the lemmatizer. We evaluate our model across several morphologically-rich languages, surpassing state-of-the-art accuracy in both part-of-speech tagging and lemmatization in Czech, German, and Arabic.
We present a new ‘grey-box’ approach to anomaly detection in smart manufacturing. The approach is designed for tools run by control systems which execute recipe steps to produce semiconductor wafers. Multiple streaming sensors capture trace data to guide the control systems and for quality control. These control systems are typically PI controllers which can be modeled as an ordinary differential equation (ODE) coupled with a control equation, capturing the physics of the process. The ODE ‘white-box’ models capture physical causal relationships that can be used in simulations to determine how the process will react to changes in control parameters, but they have limited utility for anomaly detection. Many ‘black-box’ approaches exist for anomaly detection in manufacturing, but they typically do not exploit the underlying process control. The proposed ‘grey-box’ approach uses the process-control ODE model to derive a parametric function of sensor data. Bayesian regression is used to fit the parameters of these functions to form characteristic ‘shape signatures’. The probabilistic model provides a natural anomaly score for each wafer, which captures poor control and strange shape signatures. The anomaly score can be deconstructed into its constituent parts in order to identify which parameters are contributing to anomalies. We demonstrate how the anomaly score can be used to monitor complex multi-step manufacturing processes to detect anomalies and changes and show how the shape signatures can provide insight into the underlying sources of process variation that are not readily apparent in the sensor data.
We propose a novel unsupervised keyphrase extraction approach based on outlier detection. Our approach starts by training word embeddings on the target document to capture semantic regularities among the words. It then uses the minimum covariance determinant estimator to model the distribution of non-keyphrase word vectors, under the assumption that these vectors come from the same distribution, indicative of their irrelevance to the semantics expresses by the dimensions of the learned vector representation. Candidate keyphrases are based on words that are outliers of this dominant distribution. Empirical results show that our approach outperforms state-of-the-art unsupervised keyphrase extraction methods.
Attention mechanisms in sequence to sequence models have shown great ability and wonderful performance in various natural language processing (NLP) tasks, such as sentence embedding, text generation, machine translation, machine reading comprehension, etc. Unfortunately, existing attention mechanisms only learn either high-level or low-level features. In this paper, we think that the lack of hierarchical mechanisms is a bottleneck in improving the performance of the attention mechanisms, and propose a novel Hierarchical Attention Mechanism (Ham) based on the weighted sum of different layers of a multi-level attention. Ham achieves a state-of-the-art BLEU score of 0.26 on Chinese poem generation task and a nearly 6.5% averaged improvement compared with the existing machine reading comprehension models such as BIDAF and Match-LSTM. Furthermore, our experiments and theorems reveal that Ham has greater generalization and representation ability than existing attention mechanisms.
Artificial Intelligence (AI) achieved super-human performance in a broad variety of domains. We say that an AI is made Artificially Stupid on a task when some limitations are deliberately introduced to match a human’s ability to do the task. An Artificial General Intelligence (AGI) can be made safer by limiting its computing power and memory, or by introducing Artificial Stupidity on certain tasks. We survey human intellectual limits and give recommendations for which limits to implement in order to build a safe AGI.
In the last decade, a variety of topic models have been proposed for text engineering. However, except Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA), most of existing topic models are seldom applied or considered in industrial scenarios. This phenomenon is caused by the fact that there are very few convenient tools to support these topic models so far. Intimidated by the demanding expertise and labor of designing and implementing parameter inference algorithms, software engineers are prone to simply resort to PLSA/LDA, without considering whether it is proper for their problem at hand or not. In this paper, we propose a configurable topic modeling framework named Familia, in order to bridge the huge gap between academic research fruits and current industrial practice. Familia supports an important line of topic models that are widely applicable in text engineering scenarios. In order to relieve burdens of software engineers without knowledge of Bayesian networks, Familia is able to conduct automatic parameter inference for a variety of topic models. Simply through changing the data organization of Familia, software engineers are able to easily explore a broad spectrum of existing topic models or even design their own topic models, and find the one that best suits the problem at hand. With its superior extendability, Familia has a novel sampling mechanism that strikes balance between effectiveness and efficiency of parameter inference. Furthermore, Familia is essentially a big topic modeling framework that supports parallel parameter inference and distributed parameter storage. The utilities and necessity of Familia are demonstrated in real-life industrial applications. Familia would significantly enlarge software engineers’ arsenal of topic models and pave the way for utilizing highly customized topic models in real-life problems.
A capsule is a collection of neurons which represents different variants of a pattern in the network. The routing scheme ensures only certain capsules which resemble lower counterparts in the higher layer should be activated. However, the computational complexity becomes a bottleneck for scaling up to larger networks, as lower capsules need to correspond to each and every higher capsule. To resolve this limitation, we approximate the routing process with two branches: a master branch which collects primary information from its direct contact in the lower layer and an aide branch that replenishes master based on pattern variants encoded in other lower capsules. Compared with previous iterative and unsupervised routing scheme, these two branches are communicated in a fast, supervised and one-time pass fashion. The complexity and runtime of the model are therefore decreased by a large margin. Motivated by the routing to make higher capsule have agreement with lower capsule, we extend the mechanism as a compensation for the rapid loss of information in nearby layers. We devise a feedback agreement unit to send back higher capsules as feedback. It could be regarded as an additional regularization to the network. The feedback agreement is achieved by comparing the optimal transport divergence between two distributions (lower and higher capsules). Such an add-on witnesses a unanimous gain in both capsule and vanilla networks. Our proposed EncapNet performs favorably better against previous state-of-the-arts on CIFAR10/100, SVHN and a subset of ImageNet.
Knowledge Graph Embedding (KGE) aims to represent entities and relations of knowledge graph in a low-dimensional continuous vector space. Recent works focus on incorporating structural knowledge with additional information, such as entity descriptions, relation paths and so on. However, common used additional information usually contains plenty of noise, which makes it hard to learn valuable representation. In this paper, we propose a new kind of additional information, called entity neighbors, which contain both semantic and topological features about given entity. We then develop a deep memory network model to encode information from neighbors. Employing a gating mechanism, representations of structure and neighbors are integrated into a joint representation. The experimental results show that our model outperforms existing KGE methods utilizing entity descriptions and achieves state-of-the-art metrics on 4 datasets.
In this demo paper, we introduce the DARPA D3M program for automatic machine learning (ML) and JPL’s MARVIN tool that provides an environment to locate, annotate, and execute machine learning primitives for use in ML pipelines. MARVIN is a web-based application and associated back-end interface written in Python that enables composition of ML pipelines from hundreds of primitives from the world of Scikit-Learn, Keras, DL4J and other widely used libraries. MARVIN allows for the creation of Docker containers that run on Kubernetes clusters within DARPA to provide an execution environment for automated machine learning. MARVIN currently contains over 400 datasets and challenge problems from a wide array of ML domains including routine classification and regression to advanced video/image classification and remote sensing.
Context information around words helps in determining their actual meaning, for example ‘networks’ used in contexts of artificial neural networks or biological neuron networks. Generative topic models infer topic-word distributions, taking no or only little context into account. Here, we extend a neural autoregressive topic model to exploit the full context information around words in a document in a language modeling fashion. This results in an improved performance in terms of generalization, interpretability and applicability. We apply our modeling approach to seven data sets from various domains and demonstrate that our approach consistently outperforms stateof-the-art generative topic models. With the learned representations, we show on an average a gain of 9.6% (0.57 Vs 0.52) in precision at retrieval fraction 0.02 and 7.2% (0.582 Vs 0.543) in F1 for text categorization.
In this technical report, we present jLDADMM—an easy-to-use Java toolkit for conventional topic models. jLDADMM is released to provide alternatives for topic modeling on normal or short texts. It provides implementations of the Latent Dirichlet Allocation topic model and the one-topic-per-document Dirichlet Multinomial Mixture model (i.e. mixture of unigrams), using collapsed Gibbs sampling. In addition, jLDADMM supplies a document clustering evaluation to compare topic models. jLDADMM is open-source and available to download at: https://…/jLDADMM
Matrix factorization (MF) discovers latent features from observations, which has shown great promises in the fields of collaborative filtering, data compression, feature extraction, word embedding, etc. While many problem-specific optimization techniques have been proposed, alternating least square (ALS) remains popular due to its general applicability e.g. easy to handle positive-unlabeled inputs, fast convergence and parallelization capability. Current MF implementations are either optimized for a single machine or with a need of a large computer cluster but still are insufficient. This is because a single machine provides limited compute power for large-scale data while multiple machines suffer from the network communication bottleneck. To address the aforementioned challenge, accelerating ALS on graphics processing units (GPUs) is a promising direction. We propose the novel approach in enhancing the MF efficiency via both memory optimization and approximate computing. The former exploits GPU memory hierarchy to increase data reuse, while the later reduces unnecessary computing without hurting the convergence of learning algorithms. Extensive experiments on large-scale datasets show that our solution not only outperforms the competing CPU solutions by a large margin but also has a 2x-4x performance gain compared to the state-of-the-art GPU solutions. Our implementations are open-sourced and publicly available.
We consider the problem of ranking a set of items from pairwise comparisons in the presence of features associated with the items. Recent works have established that samples are needed to rank well when there is no feature information present. However, this might be sub-optimal in the presence of associated features. We introduce a new probabilistic preference model called feature-Bradley-Terry-Luce (f-BTL) model that generalizes the standard BTL model to incorporate feature information. We present a new least squares based algorithm called fBTL-LS which we show requires much lesser than pairs to obtain a good ranking — precisely our new sample complexity bound is of , where denotes the number of `independent items’ of the set, in general . Our analysis is novel and makes use of tools from classical graph matching theory to provide tighter bounds that sheds light on the true complexity of the ranking problem, capturing the item dependencies in terms of their feature representations. This was not possible with earlier matrix completion based tools used for this problem. We also prove an information theoretic lower bound on the required sample complexity for recovering the underlying ranking, which essentially shows the tightness of our proposed algorithms. The efficacy of our proposed algorithms are validated through extensive experimental evaluations on a variety of synthetic and real world datasets.
Current state-of-the-art machine translation systems are based on encoder-decoder architectures, that first encode the input sequence, and then generate an output sequence based on the input encoding. Both are interfaced with an attention mechanism that recombines a fixed encoding of the source tokens based on the decoder state. We propose an alternative approach which instead relies on a single 2D convolutional neural network across both sequences. Each layer of our network re-codes source tokens on the basis of the output sequence produced so far. Attention-like properties are therefore pervasive throughout the network. Our model yields excellent results, outperforming state-of-the-art encoder-decoder systems, while being conceptually simpler and having fewer parameters.
In the traditional framework of spectral learning of stochastic time series models, model parameters are estimated based on trajectories of fully recorded observations. However, real-world time series data often contain missing values, and worse, the distributions of missingness events over time are often not independent of the visible process. Recently, a spectral OOM learning algorithm for time series with missing data was introduced and proved to be consistent, albeit under quite strong conditions. Here we refine the algorithm and prove that the original strong conditions can be very much relaxed. We validate our theoretical findings by numerical experiments, showing that the algorithm can consistently handle missingness patterns whose dynamic interacts with the visible process.
This paper is part of a project on developing an algorithmic theory of brain networks, based on stochastic Spiking Neural Network (SNN) models. Inspired by tasks that seem to be solved in actual brains, we are defining abstract problems to be solved by these networks. In our work so far, we have developed models and algorithms for the Winner-Take-All problem from computational neuroscience [LMP17a,Mus18], and problems of similarity detection and neural coding [LMP17b]. We plan to consider many other problems and networks, including both static networks and networks that learn. This paper is about basic theory for the stochastic SNN model. In particular, we define a simple version of the model. This version assumes that the neurons’ only state is a Boolean, indicating whether the neuron is firing or not. In later work, we plan to develop variants of the model with more elaborate state. We also define an external behavior notion for SNNs, which can be used for stating requirements to be satisfied by the networks. We then define a composition operator for SNNs. We prove that our external behavior notion is ‘compositional’, in the sense that the external behavior of a composed network depends only on the external behaviors of the component networks. We also define a hiding operator that reclassifies some output behavior of an SNN as internal. We give basic results for hiding. Finally, we give a formal definition of a problem to be solved by an SNN, and give basic results showing how composition and hiding of networks affect the problems that they solve. We illustrate our definitions with three examples: building a circuit out of gates, building an ‘Attention’ network out of a ‘Winner-Take-All’ network and a ‘Filter’ network, and a toy example involving combining two networks in a cyclic fashion.
Deep learning models have achieved remarkable success in natural language inference (NLI) tasks. While these models are widely explored, they are hard to interpret and it is often unclear how and why they actually work. In this paper, we take a step toward explaining such deep learning based models through a case study on a popular neural model for NLI. In particular, we propose to interpret the intermediate layers of NLI models by visualizing the saliency of attention and LSTM gating signals. We present several examples for which our methods are able to reveal interesting insights and identify the critical information contributing to the model decisions.
Item recommendation is a personalized ranking task. To this end, many recommender systems optimize models with pairwise ranking objectives, such as the Bayesian Personalized Ranking (BPR). Using matrix Factorization (MF) — the most widely used model in recommendation — as a demonstration, we show that optimizing it with BPR leads to a recommender model that is not robust. In particular, we find that the resultant model is highly vulnerable to adversarial perturbations on its model parameters, which implies the possibly large error in generalization. To enhance the robustness of a recommender model and thus improve its generalization performance, we propose a new optimization framework, namely Adversarial Personalized Ranking (APR). In short, our APR enhances the pairwise ranking method BPR by performing adversarial training. It can be interpreted as playing a minimax game, where the minimization of the BPR objective function meanwhile defends an adversary, which adds adversarial perturbations on model parameters to maximize the BPR objective function. To illustrate how it works, we implement APR on MF by adding adversarial perturbations on the embedding vectors of users and items. Extensive experiments on three public real-world datasets demonstrate the effectiveness of APR — by optimizing MF with APR, it outperforms BPR with a relative improvement of 11.2% on average and achieves state-of-the-art performance for item recommendation. Our implementation is available at: https://…/adversarial_personalized_ranking.
In this work, we contribute a new multi-layer neural network architecture named ONCF to perform collaborative filtering. The idea is to use an outer product to explicitly model the pairwise correlations between the dimensions of the embedding space. In contrast to existing neural recommender models that combine user embedding and item embedding via a simple concatenation or element-wise product, our proposal of using outer product above the embedding layer results in a two-dimensional interaction map that is more expressive and semantically plausible. Above the interaction map obtained by outer product, we propose to employ a convolutional neural network to learn high-order correlations among embedding dimensions. Extensive experiments on two public implicit feedback data demonstrate the effectiveness of our proposed ONCF framework, in particular, the positive effect of using outer product to model the correlations between embedding dimensions in the low level of multi-layer neural recommender model. The experiment codes are available at: https://…/ConvNCF
Neuronal circuits formed in the brain are complex with intricate connection patterns. Such a complexity is also observed in the retina as a relatively simple neuronal circuit. A retinal ganglion cell receives excitatory inputs from neurons in previous layers as driving forces to fire spikes. Analytical methods are required that can decipher these components in a systematic manner. Recently a method termed spike-triggered non-negative matrix factorization (STNMF) has been proposed for this purpose. In this study, we extend the scope of the STNMF method. By using the retinal ganglion cell as a model system, we show that STNMF can detect various biophysical properties of upstream bipolar cells, including spatial receptive fields, temporal filters, and transfer nonlinearity. In addition, we recover synaptic connection strengths from the weight matrix of STNMF. Furthermore, we show that STNMF can separate spikes of a ganglion cell into a few subsets of spikes where each subset is contributed by one presynaptic bipolar cell. Taken together, these results corroborate that STNMF is a useful method for deciphering the structure of neuronal circuits.
Convolutional neural networks (CNNs) have achieved great success on grid-like data such as images, but face tremendous challenges in learning from more generic data such as graphs. In CNNs, the trainable local filters enable the automatic extraction of high-level features. The computation with filters requires a fixed number of ordered units in the receptive fields. However, the number of neighboring units is neither fixed nor are they ordered in generic graphs, thereby hindering the applications of convolutional operations. Here, we address these challenges by proposing the learnable graph convolutional layer (LGCL). LGCL automatically selects a fixed number of neighboring nodes for each feature based on value ranking in order to transform graph data into grid-like structures in 1-D format, thereby enabling the use of regular convolutional operations on generic graphs. To enable model training on large-scale graphs, we propose a sub-graph training method to reduce the excessive memory and computational resource requirements suffered by prior methods on graph convolutions. Our experimental results on node classification tasks in both transductive and inductive learning settings demonstrate that our methods can achieve consistently better performance on the Cora, Citeseer, Pubmed citation network, and protein-protein interaction network datasets. Our results also indicate that the proposed methods using sub-graph training strategy are more efficient as compared to prior approaches.