Learnable
Conclusion

Fully Convolutional Speech Recognition . . . 117
    Model
    Experiments
    Results
    Conclusion

G. Antipov, M. Baccouche, and J. Dugelay, Face aging with conditional generative adversarial networks, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01617351

S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio, Generating sentences from a continuous space, 2015.

A. Brock, T. Lim, J. M. Ritchie, and N. Weston, Neural photo editing with introspective adversarial networks, 2016.

X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever et al., Infogan: Interpretable representation learning by information maximizing generative adversarial nets, Advances in Neural Information Processing Systems, pp.2172-2180, 2016.

H. Edwards and A. Storkey, Censoring representations with an adversary, 2015.

Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle et al., Domain-adversarial training of neural networks, Journal of Machine Learning Research, vol.17, issue.59, pp.1-35, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01624607

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley et al., Generative adversarial nets, Advances in Neural Information Processing Systems, pp.2672-2680, 2014.

G. Hinton, A. Krizhevsky, and S. Wang, Transforming auto-encoders, Artificial Neural Networks and Machine Learning-ICANN 2011, pp.44-51, 2011.

P. Isola, J. Zhu, T. Zhou, and A. A. Efros, Image-to-image translation with conditional adversarial networks, 2016.

D. Kingma and J. Ba, Adam: A method for stochastic optimization, 2014.

T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. B. Tenenbaum, Deep convolutional inverse graphics network, Advances in Neural Information Processing Systems, pp.2539-2547, 2015.

C. Li and M. Wand, Precomputed real-time texture synthesis with markovian generative adversarial networks, European Conference on Computer Vision, pp.702-716. Springer, 2016.

Z. Liu, P. Luo, X. Wang, and X. Tang, Deep learning face attributes in the wild, Proceedings of International Conference on Computer Vision (ICCV), 2015.

G. Louppe, M. Kagan, and K. Cranmer, Learning to pivot with adversarial networks, 2016.

M. F. Mathieu, J. J. Zhao, P. Sprechmann, A. Ramesh, and Y. LeCun, Disentangling factors of variation in deep representation using adversarial training, Advances in Neural Information Processing Systems, pp.5041-5049, 2016.

M. Nilsback and A. Zisserman, Automated flower classification over a large number of classes, Computer Vision, Graphics & Image Processing, pp.722-729, 2008.

G. Perarnau, J. van de Weijer, B. Raducanu, and J. M. Álvarez, Invertible conditional gans for image editing, 2016.

S. Reed, Z. Akata, H. Lee, and B. Schiele, Learning deep representations of fine-grained visual descriptions, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.49-58, 2016.

S. E. Reed, Y. Zhang, Y. Zhang, and H. Lee, Deep visual analogy-making, Advances in Neural Information Processing Systems, pp.1252-1260, 2015.

J. Schmidhuber, Learning factorial codes by predictability minimization, Neural Computation, vol.4, issue.6, pp.863-879, 1992.

Y. Taigman, A. Polyak, and L. Wolf, Unsupervised cross-domain image generation, 2016.

P. Upchurch, J. Gardner, K. Bala, R. Pless, N. Snavely et al., Deep feature interpolation for image content changes, 2016.

L. Wolf, Y. Taigman, and A. Polyak, Unsupervised creation of parameterized avatars, 2017.

X. Yan, J. Yang, K. Sohn, and H. Lee, Attribute2image: Conditional image generation from visual attributes, European Conference on Computer Vision, pp.776-791, 2016.

J. Yang, S. E. Reed, M.-H. Yang, and H. Lee, Weakly-supervised disentangling with recurrent transformations for 3d view synthesis, Advances in Neural Information Processing Systems, pp.1099-1107, 2015.

J. Zhu, T. Park, P. Isola, and A. A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, 2017.

M. Buhrmester, T. Kwang, and S. D. Gosling, Amazon's Mechanical Turk: A new source of inexpensive, yet high-quality, data?, Perspectives on Psychological Science, vol.6, pp.3-5, 2011.

H. Caracalla and A. Roebel, Gradient conversion between time and frequency domains using wirtinger calculus, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01534863

K. Ebcioglu, An expert system for harmonizing four-part chorales, Computer Music Journal, vol.12, issue.3, pp.43-51, 1988.

J. Engel, C. Resnick, A. Roberts, S. Dieleman, D. Eck et al., Neural audio synthesis of musical notes with wavenet autoencoders, 2017.

J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. Dauphin, Convolutional sequence to sequence learning, 2017.

J. L. Goldstein, Auditory nonlinearity, The Journal of the Acoustical Society of America, vol.41, issue.3, pp.676-699, 1967.

D. Griffin and J. Lim, Signal estimation from modified short-time fourier transform, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.32, issue.2, pp.236-243, 1984.

G. Hadjeres, F. Pachet, and F. Nielsen, Deepbach: a steerable model for bach chorales generation, 2016.

A. Haque, M. Guo, and P. Verma, Conditional end-to-end audio transforms, 2018.

D. Herremans, Morpheus: automatic music generation with recurrent pattern constraints and tension profiles, 2016.

G. Hinton, O. Vinyals, and J. Dean, Distilling the knowledge in a neural network, 2015.

S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural computation, vol.9, issue.8, pp.1735-1780, 1997.

F. Itakura, Analysis synthesis telephony based on the maximum likelihood method, The 6th international congress on acoustics, pp.280-292, 1968.

N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande et al., Efficient neural audio synthesis, 2018.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, International Conference on Learning Representations, 2015.

N. A. Macmillan and C. D. Creelman, Detection theory: A user's guide. Psychology Press, 2004.

S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain et al., Samplernn: An unconditional end-to-end neural audio generation model, 2016.

A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals et al., Parallel wavenet: Fast high-fidelity speech synthesis, 2017.

W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang et al., Deep voice 3: Scaling text-to-speech with convolutional sequence learning, Proc. 6th International Conference on Learning Representations, 2018.

F. Ribeiro, D. Florêncio, C. Zhang, and M. Seltzer, Crowdmos: An approach for crowdsourcing mean opinion score studies, Acoustics, Speech and Signal Processing, pp.2416-2419, 2011.

J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner et al., Char2wav: End-to-end speech synthesis, 2017.

S. Sukhbaatar, J. Weston, and R. Fergus, End-to-end memory networks, Advances in neural information processing systems, pp.2440-2448, 2015.

Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, Voice synthesis for in-the-wild speakers via a phonological loop, 2017.

A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals et al., Wavenet: A generative model for raw audio, 2016.

Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss et al., Tacotron: Towards end-to-end speech synthesis, 2017.

R. J. Williams and D. Zipser, Gradient-based learning algorithms for recurrent networks and their computational complexity, Backpropagation: Theory, architectures, and applications, vol.1, pp.433-486, 1995.

Gammatone-based spectrograms, using gammatone filterbanks or Fourier transform weightings.

O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn et al., Convolutional neural networks for speech recognition, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol.22, pp.1533-1545, 2014.

M. Airaksinen, Analysis/synthesis comparison of vocoders utilized in statistical parametric speech synthesis, 2012.

M. J. Alam, P. Ouellet, P. Kenny, and D. O'Shaughnessy, Comparative evaluation of feature normalization techniques for speaker verification, NOLISP, 2011.

D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, G. Diamos et al., Deep speech 2: End-to-end speech recognition in English and Mandarin, 2015.

J. Andén and S. Mallat, Deep Scattering Spectrum, IEEE Transactions on Signal Processing, vol.62, pp.4114-4128, 2014.

M. Artetxe, G. Labaka, E. Agirre, and K. Cho, Unsupervised neural machine translation, 2017.

R. N. Aslin, Some developmental processes in speech perception, Child Phonology: Perception & Production, 1980.

L. Badino, A. Mereta, and L. Rosasco, Discovering discrete subword units with binarized autoencoders and hidden-markov-model encoders, INTERSPEECH, 2015.

D. Bahdanau, K. Cho, and Y. Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, 2014.

J. Baker, The dragon system-an overview, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.23, issue.1, pp.24-29, 1975.

R. Balestriero, R. Cosentino, H. Glotin, and R. G. Baraniuk, Spline filters for end-to-end deep learning, ICML, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01879266

J. Barker, E. Vincent, N. Ma, H. Christensen, and P. Green, The PASCAL CHiME speech separation and recognition challenge, Computer Speech & Language, vol.27, issue.3, pp.621-633, 2013.
URL : https://hal.archives-ouvertes.fr/inria-00584051

J. Barker, R. Marxer, E. Vincent, and S. Watanabe, The third 'CHiME' speech separation and recognition challenge: Dataset, task and baselines, Automatic Speech Recognition and Understanding (ASRU), pp.504-511, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01211376

E. Battenberg, R. Child, A. Coates, C. Fougner, Y. Gaur, M. Kliegl, A. Kumar et al., Reducing bias in production speech models, 2017.

H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, Speeded-up robust features (SURF), Computer Vision and Image Understanding, vol.110, pp.346-359, 2008.

M. D. Bedworth, L. Bottou, J. S. Bridle, F. Fallside, M. Flynn et al., Comparison of neural and conventional classifiers on a speech recognition problem, First IEE International Conference on Artificial Neural Networks, pp.86-89, 1989.

A. Bérard, O. Pietquin, C. Servan, and L. Besacier, Listen and translate: A proof of concept for end-to-end speech-to-text translation, 2016.

C. Bhat, B. Vachhani, and S. K. Kopparapu, Automatic assessment of dysarthria severity level using audio descriptors, ICASSP, pp.5070-5074, 2017.

L. Bottou, F. Soulié, P. Blanchet, and J. Lienard, Experiments with time delay networks and dynamic time warping for speaker independent isolated digits recognition, First European Conference on Speech Communication and Technology, 1989.

H. Bredin, TristouNet: Triplet loss for speaker turn embedding, Acoustics, Speech and Signal Processing (ICASSP), 2016.
URL : https://hal.archives-ouvertes.fr/hal-01830421

J. Bromley, J. W. Bentz, L. Bottou, I. Guyon, Y. LeCun et al., Signature verification using a "Siamese" time delay neural network, International Journal of Pattern Recognition and Artificial Intelligence, vol.7, issue.04, pp.669-688, 1993.

A. Butcher, Australian Aboriginal languages: Consonant-salient phonologies and the 'place-of-articulation imperative', Australian Speech Science and Technology Association, 2003.

R. Caruana, Multitask learning, Learning to learn, pp.95-133, 1998.

W. Chan and I. Lane, Deep recurrent neural networks for acoustic modelling, 2015.

W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, Listen, attend and spell, 2015.

G. Chechik, V. Sharma, U. Shalit, and S. Bengio, Large scale online learning of image similarity through ranking, The Journal of Machine Learning Research, vol.11, pp.1109-1135, 2010.

H. Chen, C. Leung, L. Xie, B. Ma, and H. Li, Parallel inference of dirichlet process gaussian mixture models for unsupervised acoustic modeling: a feasibility study, INTERSPEECH, 2015.

C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen et al., State-of-the-art speech recognition with sequence-to-sequence models, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.4774-4778, 2018.

J. Chorowski and N. Jaitly, Towards better decoding and language model integration in sequence to sequence models, 2016.

J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, Attention-based models for speech recognition, Advances in Neural Information Processing Systems, pp.577-585, 2015.

Y. Chung, W. Weng, S. Tong, and J. Glass, Unsupervised crossmodal alignment of speech and text embedding spaces, 2018.

R. P. Clapham, L. van der Molen, R. J. van Son, M. W. M. van den Brekel, and F. J. M. Hilgers, NKI-CCRT Corpus - Speech intelligibility before and after advanced head and neck cancer treated with concomitant chemoradiotherapy, LREC, 2012.

H. L. Coates, P. S. Morris, A. J. Leach, and S. Couzos, Otitis media in Aboriginal children: tackling a major health problem, The Medical Journal of Australia, vol.177, issue.4, pp.177-178, 2002.

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu et al., Natural language processing (almost) from scratch, Journal of Machine Learning Research, vol.12, pp.2493-2537, 2011.

R. Collobert, C. Puhrsch, and G. Synnaeve, Wav2letter: an end-to-end convnet-based speech recognition system, 2016.

A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. Jégou, Word translation without parallel data, 2017.

Y. Dauphin, A. Fan, M. Auli, and D. Grangier, Language Modeling with Gated Convolutional Networks, ICML, 2017.

S. Davis and P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE transactions on acoustics, speech, and signal processing, vol.28, issue.4, pp.357-366, 1980.

A. Défossez, N. Zeghidour, N. Usunier, L. Bottou, and F. Bach, Sing: Symbol-to-instrument neural generator, Advances in Neural Information Processing Systems, pp.9055-9065, 2018.

N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, Front-end factor analysis for speaker verification, IEEE Transactions on Audio, Speech, and Language Processing, vol.19, issue.4, pp.788-798, 2011.

J. Deng, W. Dong, R. Socher, L. Li, K. Li et al., Imagenet: A large-scale hierarchical image database, Computer Vision and Pattern Recognition, pp.248-255, 2009.

M. Dredze, A. Jansen, G. Coppersmith, and K. W. Church, Nlp on spoken documents without asr, EMNLP, 2010.

E. Dunbar, X. N. Cao, J. Benjumea, J. Karadayi, M. Bernard et al., The zero resource speech challenge 2017, Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, pp.323-330, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01687504

F. Eyben, Real-time speech and music classification by large audio feature space extraction, 2015.

F. Eyben, M. Wöllmer, and B. W. Schuller, Opensmile: the munich versatile and fast open-source audio feature extractor, ACM Multimedia, 2010.

G. Fant, Analysis and synthesis of speech processes. Manual of phonetics, vol.2, pp.173-277, 1968.

G. Fant, Acoustic theory of speech production: with calculations based on X-ray studies of Russian articulations, 1970.

C. Farabet, C. Couprie, L. Najman, and Y. LeCun, Learning hierarchical features for scene labeling, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.35, pp.1915-1929, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00742077

G. Fechner, Elements of psychophysics, 1966.

N. H. Feldman, E. B. Myers, K. S. White, T. L. Griffiths, and J. L. Morgan, Word-level information influences phonetic learning in adults and infants, Cognition, vol.127, issue.3, pp.427-438, 2013.

R. A. Fisher, Statistical methods for research workers, 1925.

J. L. Flanagan, Parametric coding of speech spectra, The Journal of the Acoustical Society of America, vol.68, issue.2, pp.412-419, 1980.

J. Fredes, J. Novoa, S. King, R. M. Stern, and N. Becerra-yoma, Locally normalized filter banks applied to deep neural-network-based robust speech recognition, IEEE Signal Processing Letters, vol.24, pp.377-381, 2017.

K. Fukushima and S. Miyake, Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition, Competition and cooperation in neural nets, pp.267-285, 1982.

J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus et al., TIMIT acoustic-phonetic continuous speech corpus, Linguistic Data Consortium, 1993.

J. Gehring, M. Auli, D. Grangier, and Y. Dauphin, A Convolutional Encoder Model for Neural Machine Translation, ACL, 2017.

J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. Dauphin, Convolutional Sequence to Sequence Learning, ICML, 2017.

P. Ghahremani, V. Manohar, D. Povey, and S. Khudanpur, Acoustic Modelling from the Signal Domain Using CNNs, 2016.

R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, Proceedings of the IEEE conference on computer vision and pattern recognition, pp.580-587, 2014.

A. Graves and N. Jaitly, Towards end-to-end speech recognition with recurrent neural networks, International Conference on Machine Learning, pp.1764-1772, 2014.

A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, Proceedings of the 23rd international conference on Machine learning, pp.369-376, 2006.

A. Graves, A. Mohamed, and G. Hinton, Speech recognition with deep recurrent neural networks, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.6645-6649, 2013.

D. D. Greenwood, The mel scale's disqualifying bias and a consistency of pitch-difference equisections in 1956 with equal cochlear distances and equal frequency ratios, Hearing Research, vol.103, issue.1-2, 1997.

H. Hadian, H. Sameti, D. Povey, and S. Khudanpur, End-to-end Speech Recognition Using Lattice-free MMI, 2018.

K. J. Han, A. Chandrashekaran, J. Kim, and I. Lane, The CAPIO 2017 conversational speech recognition system, 2017.

A. Y. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos et al., Deep Speech: Scaling up end-to-end speech recognition, 2014.

K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, CVPR, 2015.

K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, Proceedings of CVPR, 2016.

F. Hilger and H. Ney, Quantile based histogram equalization for noise robust large vocabulary speech recognition, IEEE Transactions on Audio, Speech, and Language Processing, vol.14, pp.845-854, 2006.

G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed et al., Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups, Signal Processing Magazine, vol.29, issue.6, pp.82-97, 2012.

G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Improving neural networks by preventing co-adaptation of feature detectors, 2012.

J. Hirschberg, S. Benus, J. M. Brenier, F. Enos, S. Friedman et al., Distinguishing deceptive from non-deceptive speech, Ninth European Conference on Speech Communication and Technology, 2005.

Y. Hoshen, R. J. Weiss, and K. W. Wilson, Speech acoustic modeling from raw multichannel waveforms, Acoustics, Speech and Signal Processing, pp.4624-4628, 2015.

P. Hsiao and C. Chen, Effective Attention Mechanism in Dynamic Models for Speech Emotion Recognition, ICASSP, pp.2526-2530, 2018.

D. H. Hubel and T. N. Wiesel, Receptive fields, binocular interaction and functional architecture in the cat's visual cortex, The Journal of Physiology, vol.160, issue.1, pp.106-154, 1962.

M. Huckvale, Exploiting speech knowledge in neural nets for recognition, Speech Communication, vol.9, pp.1-13, 1990.

N. Jaitly and G. E. Hinton, Learning a better representation of speech soundwaves using restricted boltzmann machines, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.5884-5887, 2011.

A. Jansen and B. Van Durme, Efficient spoken term discovery using randomized algorithms, Automatic Speech Recognition and Understanding (ASRU), pp.401-406, 2011.

A. Jansen, E. Dupoux, S. Goldwater, M. Johnson, S. Khudanpur et al., A summary of the 2012 JH CLSP Workshop on zero resource speech technologies and models of early language acquisition, Proceedings of ICASSP 2013, 2013.

F. Jelinek, Continuous speech recognition by statistical methods, Proceedings of the IEEE, vol.64, issue.4, pp.532-556, 1976.

Y. Jia, M. Johnson, W. Macherey, R. J. Weiss, Y. Cao et al., Leveraging weakly supervised data to improve end-to-end speech-to-text translation, 2018.

M. Johnson, S. Lapkin, V. Long, P. Sanchez, H. Suominen et al., A systematic review of speech recognition technology in health care, BMC Medical Informatics and Decision Making, 2014.

B. Juang, S. Levinson, and M. Sondhi, Maximum likelihood estimation for multivariate mixture observations of markov chains (corresp.), IEEE Transactions on Information Theory, vol.32, issue.2, pp.307-309, 1986.

S. H. Kabil, H. Muckenhirn, and M. Magimai-Doss, On learning to identify genders from raw speech signal using cnns, 2018.

H. Kamper, A. Jansen, and S. Goldwater, Fully unsupervised small-vocabulary speech recognition using a segmental bayesian model, INTERSPEECH, 2015.

H. Kamper, W. Wang, and K. Livescu, Deep convolutional acoustic word embeddings using word-pair side information, 2015.

T. G. Kang, K. H. Lee, W. H. Kang et al., DNN-based voice activity detection with local feature shift technique, Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pp.1-4, 2016.

L. G. Kersta, Voiceprint identification, Nature, vol.196, pp.1253-1257, 1962.

H. Kim, M. Hasegawa-Johnson, A. Perlman, J. Gunderson, T. Huang, K. Watkin, and S. Frame, Dysarthric speech database for universal access research, INTERSPEECH, 2008.

J. Kim, N. Kumar, A. Tsiartas, M. Li, and S. S. Narayanan, Automatic intelligibility classification of sentence-level pathological speech, Computer Speech & Language, vol.29, issue.1, pp.132-144, 2015.

M. J. Kim, J. Yoo, and H. Kim, Dysarthric speech recognition using dysarthria-severity-dependent and speaker-adaptive models, Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp.3622-3626, 2013.

S. Kim, T. Hori, and S. Watanabe, Joint CTC-attention based end-to-end speech recognition using multi-task learning, Acoustics, Speech and Signal Processing, pp.4835-4839, 2017.

D. Klakow and J. Peters, Testing the correlation of word error rate and perplexity, Speech Communication, vol.38, issue.1-2, pp.19-28, 2002.

W. Koenig, A new frequency scale for acoustic measurements, Bell Laboratories Record, pp.299-301, 1949.

Z. Kons and O. Toledo-Ronen, Audio event classification using deep neural networks, INTERSPEECH, 2013.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, Imagenet classification with deep convolutional neural networks, NIPS, 2012.

D. S. Kumar, Feature normalisation for robust speech recognition, 2015.

G. Lample, N. Zeghidour, N. Usunier, A. Bordes, L. Denoyer et al., Fader networks: Manipulating images by sliding attributes, NIPS, 2017.
URL : https://hal.archives-ouvertes.fr/hal-02275215

G. Lample, M. Ott, A. Conneau, L. Denoyer, and M. Ranzato, Phrase-based & neural unsupervised machine translation, 2018.

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard et al., Backpropagation applied to handwritten zip code recognition, Neural Computation, vol.1, issue.4, pp.541-551, 1989.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, vol.86, issue.11, pp.2278-2324, 1998.

K. Lee and H. Hon, Speaker-independent phone recognition using hidden markov models, IEEE Trans. Acoustics, Speech, and Signal Processing, vol.37, pp.1641-1648, 1988.

Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, A novel scheme for speaker recognition using a phonetically-aware deep neural network, Acoustics, Speech and Signal Processing, pp.1695-1699, 2014.

S. E. Levinson, L. R. Rabiner, and M. M. Sondhi, An introduction to the application of the theory of probabilistic functions of a markov process to automatic speech recognition, Bell System Technical Journal, vol.62, issue.4, pp.1035-1074, 1983.

M. Lim, D. Lee, H. Park, U. Park, and J. Kim, Audio event classification using deep neural networks, 2016.

M. Lin, Q. Chen, and S. Yan, Network in network, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00737767

P. H. Lindsay and D. A. Norman, Human information processing: An introduction to psychology, 2013.

V. Liptchinsky, G. Synnaeve, and R. Collobert, Letter-Based Speech Recognition with Gated ConvNets. CoRR, abs/1712.09444, 2017.

M. A. Little, P. E. McSharry, E. J. Hunter, J. L. Spielman, and L. O. Ramig, Suitability of dysphonia measurements for telemonitoring of Parkinson's disease, IEEE Transactions on Biomedical Engineering, vol.56, pp.1015-1022, 2009.

C. Liu, J. Trmal, M. Wiesner, C. Harman, and S. Khudanpur, Topic identification for speech without asr, INTERSPEECH, 2017.

H. Liu, Z. Zhu, X. Li, and S. Satheesh, Gram-ctc: Automatic unit selection and target decomposition for sequence labelling, 2017.

S. Lloyd, Least squares quantization in pcm, IEEE transactions on information theory, vol.28, issue.2, pp.129-137, 1982.

V. Lostanlen, J. Salamon, A. Farnsworth, S. Kelling, and J. P. Bello, Birdvox-full-night: A dataset and benchmark for avian flight call detection, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.266-270, 2018.

V. Lostanlen, J. Salamon, M. Cartwright, B. McFee, A. Farnsworth et al., Per-channel energy normalization: Why and how, IEEE Signal Processing Letters, vol.26, pp.39-43, 2019.

D. G. Lowe, Object recognition from local scale-invariant features, Proceedings of the Seventh IEEE International Conference on Computer Vision, vol.2, pp.1150-1157, 1999.

L. Lu, L. Kong, C. Dyer, N. A. Smith, and S. Renals, Segmental recurrent neural networks for end-to-end speech recognition, 2016.

J. G. Lyons and K. K. Paliwal, Effect of compressing the dynamic range of the power spectrum in modulation filtering based speech enhancement, INTERSPEECH, 2008.

R. Maia, T. Toda, H. Zen, Y. Nankaku, and K. Tokuda, An excitation model for hmm-based speech synthesis based on residual modeling, SSW, 2007.

J. Makhoul and L. Cosell, Lpcw: An lpc vocoder with linear predictive spectral warping, IEEE International Conference on ICASSP'76, vol.1, pp.466-469, 1976.

A. Martin, S. Peperkamp, and E. Dupoux, Learning phonemes with a proto-lexicon, Cognitive science, vol.37, issue.1, pp.103-127, 2013.

B. McFee, C. Raffel, D. Liang, D. P. W. Ellis, M. McVicar et al., librosa: Audio and music signal analysis in python, Proceedings of the 14th Python in Science Conference, pp.18-25, 2015.

K. Mengistu and F. Rudzicz, Adapting acoustic and lexical models to dysarthric speech, ICASSP, pp.4924-4927, 2011.

K. Mengistu and F. Rudzicz, Comparing Humans and Automatic Speech Recognition Systems in Recognizing Dysarthric Speech, vol.6657, pp.291-300, 2011.

Y. Miao, M. Gowayyed, and F. Metze, EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding, Automatic Speech Recognition and Understanding Workshop (ASRU), 2015.

T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur, Recurrent neural network based language model, INTERSPEECH, 2010.

J. Millet and N. Zeghidour, Learning to detect dysarthria from raw speech. CoRR, abs/1811.11101, 2018.
URL : https://hal.archives-ouvertes.fr/hal-02274504

J. Ming and F. Smith, Improved phone recognition using bayesian triphone models, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, vol.1, pp.409-412, 1998.

A. Mohamed, G. Dahl, and G. Hinton, Deep belief networks for phone recognition, Nips workshop on deep learning for speech recognition and related applications, vol.1, p.39, 2009.

M. Mohri, F. Pereira, and M. Riley, Weighted finite-state transducers in speech recognition, Computer Speech & Language, vol.16, issue.1, pp.69-88, 2002.

N. Morgan, H. Bourlard, and H. Hermansky, Automatic speech recognition: An auditory perspective, Speech Processing in the Auditory System, pp.309-338. Springer, 2004.

H. Muckenhirn, M. Magimai-doss, and S. Marcel, Towards directly modeling raw speech signal for speaker verification using cnns, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.4884-4888, 2018.

V. Nair and G. E. Hinton, Rectified linear units improve restricted boltzmann machines, Proceedings of the 27th international conference on machine learning (ICML-10), pp.807-814, 2010.

J. Nicholson, K. Takahashi, and R. Nakatsu, Emotion recognition in speech using neural networks, Neural computing & applications, vol.9, issue.4, pp.290-296, 2000.

A. M. Noll, Cepstrum pitch determination, The Journal of the Acoustical Society of America, vol.41, pp.293-309, 1967.

A. M. Noll and M. R. Schroeder, Short-time "cepstrum" pitch detection, The Journal of the Acoustical Society of America, vol.36, issue.5, pp.1030-1030, 1964.

T. Ochiai, S. Watanabe, T. Hori, and J. R. Hershey, Multichannel end-to-end speech recognition, ICML, 2017.

D. K. Oller, P. Niyogi, S. Gray, J. A. Richards, J. Gilkerson et al., Automated vocal analysis of naturalistic recordings from children with autism, language delay, and typical development, Proceedings of the National Academy of Sciences, vol.107, pp.13354-13363, 2010.

M. K. Omar and J. W. Pelecanos, A novel approach to detecting non-native speakers and their native language, IEEE International Conference on Acoustics, Speech and Signal Processing, pp.4398-4401, 2010.

D. O'Shaughnessy, Speech communication: Human and machine, 1987.

F. Pace, F. Benard, H. Glotin, O. Adam, and P. White, Subunit definition and analysis for humpback whale call classification, Applied Acoustics, vol.71, issue.11, pp.1107-1112, 2010.
URL : https://hal.archives-ouvertes.fr/hal-02264967

D. Palaz, R. Collobert, and M. Magimai-doss, End-to-end phoneme sequence recognition using convolutional neural networks, 2013.

D. Palaz, R. Collobert, and M. Magimai-doss, Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks, In INTERSPEECH, 2013.

D. Palaz, M. M. Doss, and R. Collobert, Convolutional neural networks-based continuous speech recognition using raw speech signal, Acoustics, Speech and Signal Processing, pp.4295-4299, 2015.

D. Palaz, G. Synnaeve, and R. Collobert, Jointly Learning to Locate and Classify Words Using Convolutional Networks, 2016.

V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, Librispeech: an ASR corpus based on public domain audio books, Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pp.5206-5210, 2015.

A. S. Park and J. R. Glass, Unsupervised pattern discovery in speech, IEEE Transactions on Audio, Speech, and Language Processing, vol.16, issue.1, pp.186-197, 2008.

D. B. Paul and J. M. Baker, The design for the Wall Street Journal-based CSR corpus, Proceedings of the Workshop on Speech and Natural Language, pp.357-362, 1992.

V. Peddinti, T. Sainath, S. Maymon, B. Ramabhadran, D. Nahamoo et al., Deep scattering spectrum with deep neural networks, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.210-214, 2014.

M. Perez, W. Jin, D. Le, N. Carlozzi, P. Dayalu et al., Classification of huntington disease using acoustic and lexical features, 2018.

F. Perronnin, J. Sánchez, and T. Mensink, Improving the fisher kernel for large-scale image classification, European conference on computer vision, pp.143-156, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00548630

G. Pironkov, S. Dupont, and T. Dutoit, Speaker-aware long short-term memory multi-task learning for speech recognition, Signal Processing Conference, pp.1911-1915, 2016.

M. A. Pitt, L. Dilley, K. Johnson, S. Kiesling, W. Raymond et al., Buckeye Corpus of Conversational Speech, 2007.

H. Pon-barry, Prosodic manifestations of confidence and uncertainty in spoken language, INTERSPEECH, 2008.

D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu et al., Semi-orthogonal low-rank matrix factorization for deep neural networks, In Interspeech, 2018.

L. R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE, vol.77, issue.2, pp.257-286, 1989.

K. Rao, H. Sak, and R. Prabhavalkar, Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp.193-199, 2017.

M. Ravanelli and Y. Bengio, Speaker recognition from raw waveform with SincNet, 2018.

S. Renals, N. Morgan, H. Bourlard, M. Cohen, and H. Franco, Connectionist probability estimators in HMM speech recognition, IEEE Trans. Speech and Audio Processing, vol.2, pp.161-174, 1994.

D. Renshaw, H. Kamper, A. Jansen, and S. Goldwater, A comparison of neural network methods for unsupervised representation learning on the zero resource speech challenge, INTERSPEECH, 2015.

F. Rudzicz, P. van Lieshout, G. Hirst, G. Penn, F. Shein et al., Towards a Comparative Database of Dysarthric Articulation, Proceedings of ISSP, 2008.

F. Rudzicz, A. K. Namasivayam, and T. Wolff, The TORGO database of acoustic and articulatory speech from speakers with dysarthria, Language Resources and Evaluation, vol.46, pp.523-541, 2012.

S. O. Sadjadi, S. Ganapathy, and J. W. Pelecanos, The IBM 2016 speaker recognition system, 2016.

T. N. Sainath and B. Kingsbury, Learning filter banks within a deep neural network framework, ASRU, pp.297-302, 2013.

T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson et al., Learning the speech front-end with raw waveform CLDNNs, Sixteenth Annual Conference of the International Speech Communication Association, 2015.

T. N. Sainath, R. J. Weiss, K. W. Wilson, A. Narayanan, M. Bacchiani et al., Speaker location and microphone spacing invariant acoustic modeling from raw multichannel waveforms, IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp.30-36, 2015.

H. Sakoe and S. Chiba, Dynamic programming algorithm optimization for spoken word recognition, IEEE Transactions on Acoustics, Speech and Signal Processing, vol.26, issue.1, pp.43-49, 1978.

J. Salamon, J. P. Bello, A. Farnsworth, and S. Kelling, Fusing shallow and deep learning for bioacoustic bird species classification, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.141-145, 2017.

T. Salimans and D. P. Kingma, Weight normalization: A simple reparameterization to accelerate training of deep neural networks, Advances in Neural Information Processing Systems, pp.901-909, 2016.

G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas et al., English conversational telephone speech recognition by humans and machines, 2017.

S. Sapir, L. O. Ramig, J. L. Spielman, and C. Fox, Formant centralization ratio: a proposal for a new acoustic measure of dysarthric speech, Journal of speech, language, and hearing research : JSLHR, vol.53, pp.114-139, 2010.

M. Sarma, P. Ghahremani, D. Povey, N. K. Goel, K. K. Sarma et al., Emotion identification from raw speech signals using DNNs, 2018.

T. Schatz, V. Peddinti, F. Bach, A. Jansen, H. Hermansky et al., Evaluating speech features with the minimal-pair abx task: Analysis of the classical mfc/plp pipeline, INTERSPEECH 2013: 14th Annual Conference of the International Speech Communication Association, pp.1-5, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00918599

T. Schatz, ABX-discriminability measures and applications, 2016.
URL : https://hal.archives-ouvertes.fr/tel-01407461

R. Schlüter, I. Bezrukov, H. Wagner, and H. Ney, Gammatone features and feature combination for large vocabulary speech recognition, IEEE International Conference on Acoustics, Speech and Signal Processing -ICASSP '07, vol.4, 2007.

B. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers et al., Paralinguistics in speech and language: State-of-the-art and the challenge, Computer Speech & Language, vol.27, issue.1, pp.4-39, 2013.

B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer et al., The interspeech 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism, Proceedings INTERSPEECH 2013, 14th Annual Conference of the International Speech Communication Association, 2013.

B. W. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. A. Müller, and S. Narayanan, The Interspeech 2010 paralinguistic challenge, INTERSPEECH, 2010.

B. W. Schuller, S. Steidl, A. Batliner, E. Bergelson et al., The Interspeech 2017 computational paralinguistics challenge: Addressee, cold & snoring, INTERSPEECH, 2017.

B. W. Schuller, Y. Zhang, and F. Weninger, Three recent trends in paralinguistics on the way to omniscient machine intelligence, Journal on Multimodal User Interfaces, vol.12, pp.273-283, 2018.

B. Schuller, S. Steidl, and A. Batliner, The Interspeech 2009 Emotion Challenge, Proc. Interspeech, pp.312-315, 2009.

H. Seki, T. Hori, S. Watanabe, J. L. Roux, and J. R. Hershey, A purely end-to-end system for multi-speaker speech recognition, 2018.

R. Sennrich, B. Haddow, and A. Birch, Neural machine translation of rare words with subword units, 2015.

M. Slaney, Auditory Toolbox, Interval Research Corporation, Tech. Rep. 10, 1998.

E. C. Smith and M. S. Lewicki, Efficient auditory coding, Nature, vol.439, issue.7079, pp.978-982, 2006.

S. S. Stevens, On the psychophysical law, Psychological Review, vol.64, pp.153-181, 1957.

S. S. Stevens and J. Volkmann, The relation of pitch to frequency: A revised scale, The American Journal of Psychology, vol.53, issue.3, pp.329-353, 1940.

S. S. Stevens, J. Volkmann, and E. Newman, A scale for the measurement of the psychological magnitude pitch, The Journal of the Acoustical Society of America, vol.8, issue.3, pp.185-190, 1937.

I. Sutskever, J. Martens, G. Dahl, and G. Hinton, On the importance of initialization and momentum in deep learning, International conference on machine learning, pp.1139-1147, 2013.

I. Sutskever, O. Vinyals, and Q. Le, Sequence to sequence learning with neural networks, Advances in neural information processing systems, pp.3104-3112, 2014.

D. Swingley, Contributions of infant word learning to language development, Philosophical transactions of the Royal Society of London. Series B, Biological sciences, vol.364, pp.3617-3649, 2009.

G. Synnaeve and E. Dupoux, Weakly Supervised Multi-Embeddings Learning of Acoustic Models, ICLR, 2014.

G. Synnaeve, T. Schatz, and E. Dupoux, Phonetics Embedding Learning with Side Information, IEEE Spoken Language Technology Workshop, 2014.

Z. Tang, L. Li, and D. Wang, Multi-task recurrent model for speech and speaker recognition, 2016.

J. Tepperman, D. Traum, and S. Narayanan, "Yeah right": Sarcasm recognition for spoken dialogue systems, Ninth International Conference on Spoken Language Processing, 2006.

R. Thiollière, E. Dunbar, G. Synnaeve, M. Versteegh, and E. Dupoux, A Hybrid Dynamic Time Warping-Deep Neural Network Architecture for Unsupervised Acoustic Modeling, Sixteenth Annual Conference of the International Speech Communication Association, 2015.

A. Tjandra, S. Sakti, and S. Nakamura, Attention-based Wav2text with Feature Transfer Learning, 2017.

A. Tjandra, S. Sakti, and S. Nakamura, Sequence-to-Sequence ASR Optimization via Reinforcement Learning, 2017.

S. Toshniwal, N. Tara, R. J. Sainath, B. Weiss, P. Li et al., Multilingual speech recognition with a single end-to-end model, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.4904-4908, 2018.

O. Tosi, H. Oyer, W. Lashbrook, C. Pedrey, J. Nicol et al., Experiment on voice identification, The Journal of the Acoustical Society of America, vol.51, issue.6B, pp.2030-2043, 1972.

L. Tóth, Combining time- and frequency-domain convolution in convolutional neural network-based phone recognition, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.190-194, 2014.

G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou et al., Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network, ICASSP, pp.5200-5204, 2016.

L. Tóth, Phone recognition with hierarchical convolutional deep maxout networks, EURASIP Journal on Audio, Speech, and Music Processing, vol.2015, issue.1, p.25, 2015.

Z. Tüske, P. Golik, R. Schlüter, and H. Ney, Acoustic modeling with deep neural networks using raw time signal for LVCSR, 2014.

D. Ulyanov, A. Vedaldi, and V. Lempitsky, Instance Normalization: The Missing Ingredient for Fast Stylization, 2016.

S. Umesh, L. Cohen, and D. Nelson, Fitting the mel scale, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol.1, pp.217-220, 1999.

A. van den Oord, S. Dieleman, and B. Schrauwen, Deep content-based music recommendation, Advances in Neural Information Processing Systems, pp.2643-2651, 2013.

A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals et al., WaveNet: A generative model for raw audio, 2016.

L. van der Maaten and K. Weinberger, Stochastic triplet embedding, IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pp.1-6, 2012.

T. Véniat, O. Schwander, and L. Denoyer, Stochastic adaptive neural architecture search for keyword spotting, 2018.

M. Versteegh, R. Thiolliere, T. Schatz, X. N. Cao, X. Anguera et al., The zero resource speech challenge, Proc. of Interspeech, 2015.

O. Viikki and K. Laurila, Cepstral domain segmental feature vector normalization for noise robust speech recognition, Speech Communication, vol.25, issue.1-3, pp.133-147, 1998.

O. Viikki, D. Bye, and K. Laurila, A recursive feature vector normalization approach for robust speech recognition in noise, ICASSP, 1998.

A. Viterbi, Error bounds for convolutional codes and an asymptotically optimum decoding algorithm, IEEE transactions on Information Theory, vol.13, issue.2, pp.260-269, 1967.

N. J. de Vries, M. H. Davel, J. Badenhorst, W. D. Basson, F. de Wet et al., A smartphone-based ASR data collection tool for under-resourced languages, Speech Communication, vol.56, pp.119-131, 2014.

A. H. Waibel, T. Hanazawa, G. E. Hinton, K. Shikano, and K. J. Lang, Phoneme recognition using time-delay neural networks, IEEE Trans. Acoustics, Speech, and Signal Processing, vol.37, pp.328-339, 1989.

Y.-H. Wang, H.-Y. Lee, and L.-S. Lee, Segmental audio word2vec: Representing utterances as sequences of vectors with applications in spoken term detection, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.6269-6273, 2018.

Y. Wang, P. Getreuer, T. Hughes, R. F. Lyon et al., Trainable frontend for robust and far-field keyword spotting, ICASSP, pp.5670-5674, 2017.

B. Weiss and F. Burkhardt, Voice attributes affecting likability perception, INTERSPEECH, 2010.

Y. Xian, A. Thompson, Q. Qiu, L. Nolte, D. Nowacek et al., Classification of whale vocalizations using the Weyl transform, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.773-777, 2015.

B. Xiang, U. V. Chaudhari, J. Navrátil, G. N. Ramaswamy et al., Short-time Gaussianization for robust speaker verification, IEEE International Conference on Acoustics, Speech, and Signal Processing, vol.1, 2002.

L. Xie and A. Yuille, Genetic CNN, 2017 IEEE International Conference on Computer Vision (ICCV), pp.1388-1397, 2017.

W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer et al., Achieving human parity in conversational speech recognition, 2016.

B. Xu, N. Wang, T. Chen, and M. Li, Empirical evaluation of rectified activations in convolutional network, 2015.

T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, Mixed excitation for HMM-based speech synthesis, INTERSPEECH, 2001.

S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw et al., The HTK Book, Cambridge University Engineering Department, vol.3, p.175, 2002.

N. Zeghidour, G. Synnaeve, N. Usunier, and E. Dupoux, Joint learning of speaker and phonetic similarities with siamese networks, INTERSPEECH, 2016.

N. Zeghidour, G. Synnaeve, M. Versteegh, and E. Dupoux, A Deep Scattering Spectrum-Deep Siamese Network Pipeline for Unsupervised Acoustic Modeling, ICASSP, 2016.

N. Zeghidour, N. Usunier, I. Kokkinos, T. Schatz, G. Synnaeve et al., Learning Filterbanks from Raw Speech for Phone Recognition, 2017.

N. Zeghidour, N. Usunier, G. Synnaeve, R. Collobert, and E. Dupoux, End-to-End Speech Recognition from the Raw Waveform, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01888739

N. Zeghidour, Q. Xu, V. Liptchinsky, N. Usunier, G. Synnaeve et al., Fully convolutional speech recognition, 2018.

M. D. Zeiler, ADADELTA: An adaptive learning rate method, 2012.

M. D. Zeiler and R. Fergus, Visualizing and understanding convolutional networks, ECCV, 2014.

H. Zen, T. Toda, M. Nakamura, and K. Tokuda, Details of the Nitech HMM-based speech synthesis system for the Blizzard Challenge, IEICE Transactions, pp.90-325, 2005.

A. Zeyer, K. Irie, R. Schlüter, and H. Ney, Improved training of end-to-end attention models for speech recognition, 2018.

Y. Zhang, M. Pezeshki, P. Brakel, S. Zhang, C. Laurent, Y. Bengio et al., Towards end-to-end speech recognition with deep convolutional neural networks, 2017.

Z. Zhang, J. Han, K. Qian, and B. W. Schuller, Evolving learning for analysing mood-related infant vocalisation, 2018.

Y. Zhou, C. Xiong, and R. Socher, Improving End-to-End Speech Recognition with Policy Learning, International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.

B. Zoph and Q. V. Le, Neural architecture search with reinforcement learning, 2016.