Unsupervised Tracklet Association Learning (UTAL)
Most existing person re-identification (re-id) methods rely on supervised model learning on per-camera-pair manually labelled pairwise training data. This leads to poor scalability in a practical re-id deployment, due to the lack of exhaustive identity labelling of positive and negative image pairs for every camera-pair. In this work, we present an unsupervised re-id deep learning approach. It is capable of incrementally discovering and exploiting the underlying re-id discriminative information from automatically generated person tracklet data end-to-end. We formulate an Unsupervised Tracklet Association Learning (UTAL) framework. This is by jointly learning within-camera tracklet discrimination and cross-camera tracklet association in order to maximise the discovery of tracklet identity matching both within and across camera views. Extensive experiments demonstrate the superiority of the proposed model over the state-of-the-art unsupervised learning and domain adaptation person re-id methods on eight benchmarking datasets. …
Text-to-Speech-System (TTS)
Speech Synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech computer or speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech. Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output range, but may lack clarity. For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely ‘synthetic’ voice output. The quality of a speech synthesizer is judged by its similarity to the human voice and by its ability to be understood clearly. An intelligible text-to-speech program allows people with visual impairments or reading disabilities to listen to written works on a home computer. Many computer operating systems have included speech synthesizers since the early 1990s. Overview of a typical TTS system Automatic announcement Menu 0:00 A synthetic voice announcing an arriving train in Sweden. Problems playing this file? See media help. Problems playing this file? See media help. A text-to-speech system (or ‘engine’) is composed of two parts: a front-end and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words. This process is often called text normalization, pre-processing, or tokenization. The front-end then assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units, like phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion. Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. The back-end – often referred to as the synthesizer – then converts the symbolic linguistic representation into sound. In certain systems, this part includes the computation of the target prosody (pitch contour, phoneme durations), which is then imposed on the output speech. …
ReKopedia
Very important breakthroughs in data-centric machine learning algorithms led to impressive performance in transactional point applications such as detecting anger in speech, alerts from a Face Recognition system, or EKG interpretation. Non-transactional applications, e.g. medical diagnosis beyond the EKG results, require AI algorithms that integrate deeper and broader knowledge in their problem-solving capabilities, e.g. integrating knowledge about anatomy and physiology of the heart with EKG results and additional patient findings. Similarly, for military aerial interpretation, where knowledge about enemy doctrines on force composition and spread helps immensely in situation assessment beyond image recognition of individual objects. The Double Deep Learning approach advocates integrating data-centric machine self-learning techniques with machine-teaching techniques to leverage the power of both and overcome their corresponding limitations. To take AI to the next level, it is essential that we rebalance the roles of data and knowledge. Data is important but knowledge- deep and commonsense- are equally important. An initiative is proposed to build Wikipedia for Smart Machines, meaning target readers are not human, but rather smart machines. Named ReKopedia, the goal is to develop methodologies, tools, and automatic algorithms to convert humanity knowledge that we all learn in schools, universities and during our professional life into Reusable Knowledge structures that smart machines can use in their inference algorithms. Ideally, ReKopedia would be an open source shared knowledge repository similar to the well-known shared open source software code repositories. Examples in the article are based on- or inspired by- real-life non-transactional AI systems I deployed over decades of AI career that benefit hundreds of millions of people around the globe. …
Unitary Group Convolution (UGConv)
We propose unitary group convolutions (UGConvs), a building block for CNNs which compose a group convolution with unitary transforms in feature space to learn a richer set of representations than group convolution alone. UGConvs generalize two disparate ideas in CNN architecture, channel shuffling (i.e. ShuffleNet) and block-circulant networks (i.e. CirCNN), and provide unifying insights that lead to a deeper understanding of each technique. We experimentally demonstrate that dense unitary transforms can outperform channel shuffling in DNN accuracy. On the other hand, different dense transforms exhibit comparable accuracy performance. Based on these observations we propose HadaNet, a UGConv network using Hadamard transforms. HadaNets achieve similar accuracy to circulant networks with lower computation complexity, and better accuracy than ShuffleNets with the same number of parameters and floating-point multiplies. …
If you did not already know
01 Wednesday Sep 2021
Posted What is ...
in