L. Bahl, P. Brown, P. de Souza, and R. Mercer, Maximum mutual information estimation of hidden Markov model parameters for speech recognition, ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp.49-52, 1986.
DOI : 10.1109/ICASSP.1986.1169179

J. Barker, E. Vincent, N. Ma, H. Christensen, and P. Green, The PASCAL CHiME speech separation and recognition challenge, Computer Speech & Language, vol.27, issue.3, pp.621-633, 2013.
DOI : 10.1016/j.csl.2012.10.004
URL : https://hal.archives-ouvertes.fr/hal-00646370

J. Bellegarda and D. Nahamoo, Tied mixture continuous parameter modeling for speech recognition, IEEE Transactions on Acoustics, Speech and Signal Processing, vol.38, issue.12, pp.2033-2045, 1990.

J. R. Bellegarda, Statistical language model adaptation: review and perspectives, Speech Communication, vol.42, issue.1, pp.93-108, 2004.
DOI : 10.1016/j.specom.2003.08.002

Y. Bengio and P. Frasconi, Input-output HMMs for sequence processing, IEEE Transactions on Neural Networks, vol.7, issue.5, pp.1231-1249, 1996.
DOI : 10.1109/72.536317

A. Bianne-bernard, M. Fares, L. Likforman-sulem, C. Mokbel, and C. Kermorvant, Variable length and context-dependent HMM letter form models for Arabic handwritten word recognition, Document Recognition and Retrieval XIX, pp.829708-829708, 2012.
DOI : 10.1117/12.912093

J. Bilmes, A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models, 1998.

P. Boersma and D. Weenink, Praat, a system for doing phonetics by computer, Glot International, vol.5, issue.9/10, pp.341-345, 2001.

N. Boulanger-lewandowski, Y. Bengio, and P. Vincent, Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription, 2012.

M. Brand, Voice puppetry, Proceedings of the 26th annual conference on Computer graphics and interactive techniques, SIGGRAPH '99, pp.21-28, 1999.
DOI : 10.1145/311535.311537

C. Busso, Z. Deng, M. Grimm, U. Neumann, and S. Narayanan, Rigid head motion in expressive speech animation: Analysis and synthesis, IEEE Transactions on Audio, Speech, and Language Processing, vol.15, issue.3, pp.1075-1086, 2007.

C. Busso, Z. Deng, U. Neumann, and S. Narayanan, Natural head motion synthesis driven by acoustic prosodic features, Computer Animation and Virtual Worlds, vol.16, issue.3-4, 2005.
DOI : 10.1002/cav.80

C. Cheng, F. Sha, and L. K. Saul, Matrix updates for perceptron training of continuous density hidden Markov models, Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, p.20, 2009.
DOI : 10.1145/1553374.1553394

C. Chiu and S. C. Marsella, How to Train Your Avatar: A Data Driven Approach to Gesture Generation, The 11th International Conference on Intelligent Virtual Agents, 2011.
DOI : 10.1007/978-3-642-23974-8_14

H. Christensen, J. Barker, N. Ma, and P. Green, The CHiME corpus: a resource and a challenge for computational hearing in multisource environments, Proc. Interspeech '10, Makuhari, 2010.

M. Costa, T. Chen, and F. Lavagetto, Visual prosody analysis for realistic motion synthesis of 3D head models, Proc. of ICAV3D'01 - International Conference on Augmented, Virtual Environments and 3D Imaging, pp.343-346, 2001.

X. Cui and Y. Gong, A Study of Variable-Parameter Gaussian Mixture Hidden Markov Modeling for Noisy Speech Recognition, ICASSP '03

X. Cui and Y. Gong, A study of variable-parameter Gaussian mixture hidden Markov modeling for noisy speech recognition, IEEE Transactions on Audio, Speech, and Language Processing, vol.15, issue.4, pp.1366-1376, 2007.
DOI : 10.1109/TASL.2006.889791

A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Series B, vol.39, issue.1, pp.1-38, 1977.

L. Deng, A generalized hidden Markov model with state-conditioned trend functions of time for the speech signal, Signal Processing, vol.27, issue.1, pp.65-78, 1992.
DOI : 10.1016/0165-1684(92)90112-A

L. Deng, M. Aksmanovic, X. Sun, and C. Wu, Speech recognition using hidden Markov models with polynomial regression functions as nonstationary states, IEEE Transactions on Speech and Audio Processing, vol.2, issue.4, pp.507-520, 1994.

T. Do and T. Artières, Conditional random field for tracking user behavior based on his eye's movements, Citeseer. Bibliography, vol.139, p.19, 2005.

T. Do and T. Artières, Conditional random fields for online handwriting recognition, Adv in Neur Inf Proc Sys, pp.1097-1104, 2006.
URL : https://hal.archives-ouvertes.fr/inria-00104207

T. Do and T. Artières, Large margin training for hidden Markov models with partially observed states, Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pp.265-272, 2009.
DOI : 10.1145/1553374.1553408
URL : https://hal.archives-ouvertes.fr/hal-01294610

P. Ekman and W. Friesen, Facial Action Coding System: A Technique for the Measurement of Facial Movement, 1978.

S. Escalera, J. Gonzàlez, X. Baró, M. Reyes, I. Guyon et al., ChaLearn multi-modal gesture recognition 2013: grand challenge and workshop summary, ICMI, pp.365-368, 2013.

G. Fanelli, J. Gall, H. Romsdorfer, T. Weise, and L. V. Gool, A 3-D Audio-Visual Corpus of Affective Communication, IEEE Transactions on Multimedia, vol.12, issue.6, pp.591-598, 2010.
DOI : 10.1109/TMM.2010.2052239

Y. Fujii, K. Yamamoto, and S. Nakagawa, Automatic speech recognition using Hidden Conditional Neural Fields, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.5036-5039, 2011.
DOI : 10.1109/ICASSP.2011.5947488
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.657.1405

K. Fujinaga, M. Nakai, H. Shimodaira, and S. Sagayama, Multiple-regression hidden Markov model, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '01), pp.513-516, 2001.

S. Furui, Speaker-independent isolated word recognition using dynamic features of speech spectrum, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.34, issue.1, pp.52-59, 1986.
DOI : 10.1109/TASSP.1986.1164788

Y. Gong, Speech recognition in noisy environments: A survey, Speech Communication, vol.16, issue.3, pp.261-291, 1995.
DOI : 10.1016/0167-6393(94)00059-J

CMU Graphics Lab, Carnegie Mellon University Motion Capture Database

A. Graves, N. Jaitly, and A. Mohamed, Hybrid speech recognition with Deep Bidirectional LSTM, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pp.273-278, 2013.
DOI : 10.1109/ASRU.2013.6707742

A. Gunawardana, M. Mahajan, A. Acero, and J. C. Platt, Hidden conditional random fields for phone classification, pp.1117-1120, 2005.

J. M. Hammersley and P. Clifford, Markov fields on finite graphs and lattices, 1971.

G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed et al., Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups, IEEE Signal Processing Magazine, vol.29, issue.6, pp.82-97, 2012.
DOI : 10.1109/MSP.2012.2205597

G. Hinton and R. Salakhutdinov, Reducing the Dimensionality of Data with Neural Networks, Science, vol.313, issue.5786, pp.504-507, 2006.
DOI : 10.1126/science.1127647

G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Improving neural networks by preventing co-adaptation of feature detectors, 2012.

S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, Neural Computation, vol.9, issue.8, pp.1735-1780, 1997.
DOI : 10.1162/neco.1997.9.8.1735

G. Hofer and J. Yamagishi, Speech driven head motion synthesis based on a trajectory model, Proc. SIGGRAPH, 2007.

M. Hwang and X. Huang, Shared-distribution hidden Markov models for speech recognition, IEEE Transactions on Speech and Audio Processing, vol.1, issue.4, pp.414-420, 1993.

H. Jaeger, Tutorial on training recurrent neural networks, p.48, 2002.

H. Jiang, Discriminative training of HMMs for automatic speech recognition: A survey, Computer Speech & Language, vol.24, issue.4, pp.589-608, 2010.
DOI : 10.1016/j.csl.2009.08.002

B. Juang and S. Katagiri, Discriminative learning for minimum error classification [pattern recognition], IEEE Transactions on Signal Processing, vol.40, issue.12, pp.3043-3054, 1992.

A. Just and S. Marcel, A comparative study of two state-of-the-art sequence processing techniques for hand gesture recognition, Computer Vision and Image Understanding, vol.113, issue.4, pp.532-543, 2009.
DOI : 10.1016/j.cviu.2008.12.001

J. D. Lafferty, A. McCallum, and F. C. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pp.282-289, 2001.

B. H. Le, X. Ma, and Z. Deng, Live Speech Driven Head-and-Eye Motion Generators, IEEE Transactions on Visualization and Computer Graphics, vol.18, issue.11, pp.1902-1914, 2012.
DOI : 10.1109/TVCG.2012.74

Y. Lecun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang, A tutorial on energy-based learning, Predicting Structured Data, 2006.

K. Lee and H. Hon, Speaker-independent phone recognition using hidden Markov models, IEEE Transactions on Acoustics, Speech and Signal Processing, vol.37, issue.11, pp.1641-1648, 1989.

S. Levine, P. Krähenbühl, S. Thrun, and V. Koltun, Gesture controllers, 2010.
DOI : 10.1145/1778765.1778861

S. Levine, C. Theobalt, and V. Koltun, Real-time prosody-driven synthesis of body language, ACM Trans. Graph, vol.28, issue.5, pp.172:1-172:10, 2009.

Y. Li and H. Shum, Learning dynamic audio-visual mapping with input-output hidden Markov models, IEEE Transactions on Multimedia, vol.8, issue.3, pp.542-549, 2006.

A. Ljolje, High accuracy phone recognition using context clustering and quasi-triphonic models, Computer Speech & Language, vol.8, issue.2, pp.129-151, 1994.
DOI : 10.1006/csla.1994.1006

M. Mahajan, A. Gunawardana, and A. Acero, Training Algorithms for Hidden Conditional Random Fields, 2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings, 2006.
DOI : 10.1109/ICASSP.2006.1660010

S. Mariooryad and C. Busso, Generating human-like behaviors using joint, speech-driven models for conversational agents, IEEE Transactions on Audio, Speech, and Language Processing, vol.20, issue.8, pp.2329-2340, 2012.

E. McDermott, S. Watanabe, and A. Nakamura, Margin-space integration of MPE loss via differencing of MMI functionals for generalized error-weighted discriminative training, INTERSPEECH, pp.224-227, 2009.

P. Mirowski, Time Series Modeling with Hidden Variables and Gradient-based Algorithms, 2011.

P. Mirowski and Y. Lecun, Dynamic Factor Graphs for Time Series Modeling, Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases: Part II, ECML PKDD '09, pp.128-143, 2009.
DOI : 10.1007/978-3-642-04174-7_9

M. Müller, T. Röder, M. Clausen, B. Eberhardt, B. Krüger et al., Documentation mocap database hdm05, 2007.

K. P. Murphy, Y. Weiss, and M. I. Jordan, Loopy belief propagation for approximate inference: An empirical study, Proceedings of Uncertainty in AI, pp.467-475, 1999.

R. Nopsuwanchai, Discriminative training methods and their applications to handwriting recognition, 2005.

I. S. Pandzic and R. Forchheimer, MPEG-4 Facial Animation: The Standard, Implementation and Applications, 2003.
DOI : 10.1002/0470854626

D. Paul, The Lincoln tied-mixture HMM continuous speech recognizer, 1991 International Conference on Acoustics, Speech, and Signal Processing (ICASSP-91), pp.329-332, 1991.

D. Povey and P. Woodland, Minimum phone error and I-smoothing for improved discriminative training, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.105-108, 2002.

A. Quattoni, S. Wang, L. Morency, M. Collins, T. Darrell et al., Hidden-state conditional random fields, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007.
DOI : 10.1109/tpami.2007.1124

L. Rabiner and B. Juang, Fundamentals of speech recognition, 1993.

L. R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE, vol.77, issue.2, pp.257-286, 1989.

S. Reiter, B. Schuller, and G. Rigoll, Hidden Conditional Random Fields for Meeting Segmentation, Multimedia and Expo, 2007 IEEE International Conference on, pp.639-642, 2007.
DOI : 10.1109/ICME.2007.4284731

M. Sargin, Y. Yemez, E. Erzin, and A. Tekalp, Analysis of head gesture and prosody patterns for prosody-driven head-gesture animation, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.30, issue.8, pp.1330-1345, 2008.

P. Senin, Dynamic Time Warping Algorithm Review, 2008.

F. Sha, Large margin training of acoustic models for speech recognition, 2006.

F. Sha and L. K. Saul, Large margin hidden Markov models for automatic speech recognition, pp.1249-1256, 2007.

N. Srivastava, Improving neural networks with dropout (doctoral dissertation, University of Toronto), 2013.

Y. Sung and D. Jurafsky, Hidden Conditional Random Fields for phone recognition, 2009 IEEE Workshop on Automatic Speech Recognition & Understanding, pp.107-112, 2009.
DOI : 10.1109/ASRU.2009.5373329

C. Sutton and A. Mccallum, An introduction to conditional random fields for relational learning, Introduction to Statistical Relational Learning, 2007.

K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, Speech parameter generation algorithms for HMM-based speech synthesis, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100), pp.1315-1318, 2000.
DOI : 10.1109/ICASSP.2000.861820

L. van der Maaten, E. Postma, and H. van den Herik, Dimensionality reduction: A comparative review, 2009.

E. Vincent, J. Barker, S. Watanabe, J. Le-roux, F. Nesta et al., The second ‘chime’ speech separation and recognition challenge: Datasets, tasks and baselines, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.126-130, 2013.
DOI : 10.1109/ICASSP.2013.6637622

A. Vinel, T. M. Do, and T. Artières, Joint Optimization of Hidden Conditional Random Fields and Non Linear Feature Extraction, 2011 International Conference on Document Analysis and Recognition, pp.513-517, 2011.
DOI : 10.1109/ICDAR.2011.109
URL : https://hal.archives-ouvertes.fr/hal-00706021

L. Wan, M. D. Zeiler, S. Zhang, Y. LeCun, and R. Fergus, Regularization of neural networks using DropConnect, ICML (3), volume 28 of JMLR Proceedings, pp.1058-1066, 2013.

J. M. Wang, D. J. Fleet, and A. Hertzmann, Gaussian Process Dynamical Models for Human Motion, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.30, issue.2, 2007.
DOI : 10.1109/TPAMI.2007.1167

F. Wessel, K. Macherey, and H. Ney, A comparison of word graph and n-best list based confidence measures, Proc. EUROSPEECH, pp.315-318, 1999.

A. Wilson and A. Bobick, Parametric hidden Markov models for gesture recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.21, issue.9, pp.884-900, 1999.

J. Xue, Acoustically-driven Talking Face Animations Using Dynamic Bayesian Networks, 2008.
DOI : 10.1109/icme.2006.262743
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.331.9125

S. Young and P. Woodland, State clustering in hidden Markov model-based continuous speech recognition, Computer Speech & Language, vol.8, issue.4, pp.369-383, 1994.
DOI : 10.1006/csla.1994.1019

S. J. Young, G. Evermann, M. J. Gales, T. Hain, D. Kershaw et al., The HTK Book, version 3.4, 2006.

D. Yu and L. Deng, Large-Margin Discriminative Training of Hidden Markov Models for Speech Recognition, International Conference on Semantic Computing (ICSC 2007), pp.429-438, 2007.
DOI : 10.1109/ICSC.2007.11

D. Yu, L. Deng, Y. Gong, and A. Acero, A novel framework and training algorithm for variable-parameter HMMs, IEEE Trans. on Audio, Speech, and Language Processing, vol.17, issue.7, pp.1348-1360, 2009.

H. Zen, K. Tokuda, and T. Kitamura, Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic feature vector sequences, Computer Speech & Language, vol.21, issue.1, pp.153-173, 2007.
DOI : 10.1016/j.csl.2006.01.002