Permits issued by the Chicago Department of Buildings since 2006. Categorical variable: Work Description (cardinality: 430k).

Information about U.S. colleges and schools. Target (regression): Percent Pell Grant

crime data 3 (1.5M). Incidents of crime in the City of Los Angeles since 2010. Target (regression): Victim Age. Categorical variable: Crime Code Description.

Target (multiclass): Product Type Name. Categorical variable: Non Proprietary Name (17k).

employee salaries 5 (9.2k). Salary information for employees of Montgomery County, MD. Target (regression): Current Annual Salary. Categorical variable: Employee Position Title (385).

federal election 6 (3.3M). Campaign finance data for the 2011-2012 US election cycle. Target (regression): Transaction Amount. Categorical variable: Memo Text (17k).

journal influence 7 (3.6k). Scientific journals and the respective influence scores

Target (binary): State. Categorical variable: Category (158)

Inpatient discharges for Medicare beneficiaries for more than 3,000 U.S. hospitals. Target (regression): Average Total Payments. Categorical variable: Medical Procedure (100).

met objects 10 (469k)

Survey on whether people self-identify as Midwesterners. Target (multiclass): Census Region (10 classes). Categorical var

Payments given by healthcare manufacturing companies to medical doctors or hospitals (year 2013)

Public procurement data for the European Economic Area, Switzerland, and Macedonia. Target (regression): Award Value Euro. Categorical variable: CAE Name (29k).

road safety 14 (139k)

traffic violations 15 (1.2M). Traffic information from electronic violations issued in Montgomery County, MD. Target (multiclass): Violation type (4 classes). Categorical variable: Description (11k).

Remuneration and expenses for employees earning over $75,000 per year. Target (regression): Remuneration

Wine reviews scraped from WineEnthusiast. Target (regression): Points. Categorical variable: Description (89k).

(size: 32k). Predict whether income exceeds $50K/yr based on census data. Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Inpatient.html 10

Expert ratings of over 1,700 individual chocolate bars, along with information on their origin and bean variety. Target (multiclass): Bean Type. Categorical variable: Broad Bean Origin

Based on the 1990 California census data. It contains one row per census block group (a block group typically has a population of 600 to 3,000 people). Target (regression): Median House Value. Categorical variable: Ocean Proximity

Anonymized data of dating profiles from OkCupid. Target (regression): Age. Categorical variable: Diet

Contains variables describing residential homes in Ames, Iowa. Target (regression): Sale Price. Categorical variable: MSSubClass (15)

Sale prices for houses in King County

Network intrusion simulations with a variety of descriptors of the attack type. Target (multiclass): Attack Type

D. Achlioptas, Database-friendly random projections: Johnson-Lindenstrauss with binary coins, Journal of Computer and System Sciences, vol.66, issue.4, pp.671-687, 2003.

A. Agresti and M. Kateri, Categorical data analysis, 2011.

D. J. Aldous, Exchangeability and related topics, École d'Été de Probabilités de Saint-Flour XIII-1983, pp.1-198, 1985.

R. Alghamdi and K. Alfalqi, A survey of topic modeling in text mining, Int. J. Adv. Comput. Sci. Appl.(IJACSA), vol.6, issue.1, 2015.

H. Alkharusi, Categorical variables in regression analysis: a comparison of dummy and effect coding, International Journal of Education, vol.4, issue.2, pp.202-210, 2012.

A. Altmann, L. Tolosi, O. Sander, and T. Lengauer, Permutation importance: a corrected feature importance measure, Bioinformatics, vol.26, issue.10, pp.1340-1347, 2010.

S. Amari and H. Nagaoka, Methods of information geometry, 2007.

R. C. Angell, G. E. Freund, and P. Willett, Automatic spelling correction using a trigram similarity measure, Information Processing & Management, vol.19, issue.4, pp.255-261, 1983.

A. Appleby, , 2014.

S. Arora, Y. Liang, and T. Ma, A simple but tough-to-beat baseline for sentence embeddings, 2016.

D. Arthur and S. Vassilvitskii, K-means++: the advantages of careful seeding, Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pp.1027-1035, 2007.

K. J. Berry, P. W. Mielke, and H. K. Iyer, Factorial designs and dummy coding, Perceptual and motor skills, vol.87, issue.3, pp.919-927, 1998.

D. Blei, Probabilistic topic models, Commun. ACM, vol.55, issue.4, pp.77-84, 2012.

D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, Variational inference: a review for statisticians, Journal of the American Statistical Association, vol.112, issue.518, pp.859-877, 2017.

D. M. Blei, A. Y. Ng, and M. I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research, vol.3, pp.993-1022, 2003.

P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, Enriching word vectors with subword information, Transactions of the Association of Computational Linguistics, vol.5, issue.1, pp.135-146, 2017.

N. Bostrom and E. Yudkowsky, The ethics of artificial intelligence, The Cambridge Handbook of Artificial Intelligence, pp.316-334, 2014.

E. Brill and R. C. Moore, An improved error model for noisy channel spelling correction, Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, pp.286-293, 2000.

A. Z. Broder, On the resemblance and containment of documents, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), pp.21-29, 1997.

A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher, Min-wise independent permutations, Journal of Computer and System Sciences, vol.60, issue.3, pp.630-659, 2000.

W. Buntine, Variational extensions to EM and multinomial PCA, ACM SIGIR Conference on Research and Development in Information Retrieval, pp.122-129, 2002.

P. Cerda and G. Varoquaux, Encoding high-cardinality string categorical variables, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02171256

P. Cerda, G. Varoquaux, and B. Kégl, Similarity encoding for learning with dirty categorical variables, Machine Learning, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01806175

M. S. Charikar, Similarity estimation techniques from rounding algorithms, Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, pp.380-388, 2002.

T. Chen and C. Guestrin, XGBoost: a scalable tree boosting system, SIGKDD, pp.785-794, 2016.

L. Chi and X. Zhu, Hashing techniques: a survey and taxonomy, ACM Computing Surveys (CSUR), vol.50, issue.1, p.11, 2017.

P. Christen, A survey of indexing techniques for scalable record linkage and deduplication, IEEE transactions on knowledge and data engineering, vol.24, issue.9, pp.1537-1555, 2012.

P. Christen, Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection, 2012.

P. Cohen, S. G. West, and L. S. Aiken, Applied multiple regression/correlation analysis for the behavioral sciences, 2014.

W. W. Cohen, P. Ravikumar, and S. E. Fienberg, A comparison of string distance metrics for name-matching tasks, In IIWeb, pp.73-78, 2003.

C. Conrad, N. Ali, V. Keselj, and Q. Gao, Elm: an extended logic matching method on record linkage analysis of disparate databases for profiling data mining, 2016 IEEE 18th Conference on Business Informatics (CBI), vol.1, pp.1-6, 2016.

F. J. Damerau, A technique for computer detection and correction of spelling errors, Communications of the ACM, vol.7, issue.3, pp.171-176, 1964.

M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, Locality-sensitive hashing scheme based on p-stable distributions, Proceedings of the twentieth annual symposium on Computational geometry, pp.253-262, 2004.

M. J. Davis, Contrast coding in multiple regression analysis: strengths, weaknesses, and utility of popular coding structures, Journal of Data Science, vol.8, issue.1, pp.61-73, 2010.

B. de Finetti, Theory of probability, vol.5, p.17, 1974.

S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, Indexing by latent semantic analysis, Journal of the American society for information science, vol.41, issue.6, pp.391-407, 1990.

J. Demsar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine learning research, vol.7, p.1, 2006.

D. Dheeru and E. Taniskidou, UCI machine learning repository, 2017.

P. M. Domingos, A few useful things to know about machine learning, Commun. acm, vol.55, issue.10, pp.78-87, 2012.

F. Doshi-velez and B. Kim, Towards a rigorous science of interpretable machine learning, 2017.

W. Duch, K. Grudzinski, and G. Stawski, Symbolic features in neural networks, Proceedings of the 5th Conference on Neural Networks and Their Applications, 2000.

C. Elkan, Deriving tf-idf as a fisher kernel, International Symposium on String Processing and Information Retrieval, pp.295-300, 2005.

A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, Duplicate record detection: a survey, IEEE Transactions on knowledge and data engineering, vol.19, issue.1, pp.1-16, 2007.

E. Eskin, J. Weston, W. S. Noble, and C. S. Leslie, Mismatch string kernels for SVM protein classification, Advances in neural information processing systems, pp.1441-1448, 2003.

I. P. Fellegi and A. B. Sunter, A theory for record linkage, Journal of the American Statistical Association, vol.64, issue.328, pp.1183-1210, 1969.

M. Friedman, The use of ranks to avoid the assumption of normality implicit in the analysis of variance, Journal of the American Statistical Association, vol.32, issue.200, pp.675-701, 1937.

T. Gärtner, A survey of kernels for structured data, ACM SIGKDD Explorations Newsletter, vol.5, issue.1, pp.49-58, 2003.

A. Gionis, P. Indyk, and R. Motwani, Similarity search in high dimensions via hashing, Vldb, vol.99, pp.518-529, 1999.

W. H. Gomaa and A. A. Fahmy, A survey of text similarity approaches, International Journal of Computer Applications, vol.68, issue.13, pp.13-18, 2013.

P. K. Gopalan, L. Charlin, and D. Blei, Content-based recommendations with poisson factorization, Advances in Neural Information Processing Systems, pp.3176-3184, 2014.

P. Gopalan, J. M. Hofman, and D. M. Blei, Scalable recommendation with poisson factorization, 2013.

K. Grabczewski and N. Jankowski, Transformations of symbolic data for continuous data oriented models, Artificial Neural Networks and Neural Information Processing, pp.359-366, 2003.

C. Guo and F. Berkhahn, Entity embeddings of categorical variables, 2016.

N. Halko, P. Martinsson, and J. Tropp, Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions, SIAM review, vol.53, p.217, 2011.

R. W. Hamming, Error detecting and error correcting codes. The Bell system technical journal, vol.29, pp.147-160, 1950.

T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin, The elements of statistical learning: data mining, inference and prediction, The Mathematical Intelligencer, vol.27, issue.2, pp.83-85, 2005.

D. A. Hull, Stemming algorithms: a case study for detailed evaluation, JASIS, vol.47, issue.1, pp.70-84, 1996.

F. Hutter, B. Kegl, R. Caruana, I. Guyon, H. Larochelle et al., Automatic machine learning (automl), ICML Workshop on Resource-Efficient Machine Learning, 2015.
URL : https://hal.archives-ouvertes.fr/in2p3-01171463

F. Hutter, L. Kotthoff, and J. Vanschoren, Automated machine learning-methods, systems, challenges, 2019.

P. Indyk and R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, Proceedings of the thirtieth annual ACM symposium on Theory of computing, pp.604-613, 1998.

T. Jaakkola and D. Haussler, Exploiting generative models in discriminative classifiers, Advances in neural information processing systems, pp.487-493, 1999.

T. S. Jaakkola, M. Diekhans, and D. Haussler, Using the fisher kernel method to detect remote protein homologies, ISMB, vol.99, pp.149-158, 1999.

M. A. Jaro, Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida, Journal of the American Statistical Association, vol.84, issue.406, pp.414-420, 1989.

J. Ji, J. Li, S. Yan, Q. Tian, and B. Zhang, Min-max hash for jaccard similarity, 2013 IEEE 13th International Conference on Data Mining, pp.301-309, 2013.

W. B. Johnson and J. Lindenstrauss, Extensions of lipschitz mappings into a hilbert space, Contemporary mathematics, vol.26, p.1, 1984.

W. Kim, B. Choi, E. Hong, S. Kim, and D. Lee, A taxonomy of dirty data, Data mining and knowledge discovery, vol.7, issue.1, pp.81-99, 2003.

D. Klein, J. Smarr, H. Nguyen, and C. D. Manning, Named entity recognition with character-level models, conference on Natural language learning at HLT-NAACL, p.180, 2003.

G. Kondrak, N-gram similarity and distance, International symposium on string processing and information retrieval, pp.115-126, 2005.

T. K. Landauer, P. W. Foltz, and D. Laham, An introduction to latent semantic analysis, p.259, 1998.

D. D. Lee and H. S. Seung, Algorithms for non-negative matrix factorization, Advances in Neural Information Processing Systems, pp.556-562, 2001.

A. Lefevre, F. Bach, and C. Févotte, Online algorithms for nonnegative matrix factorization with the Itakura-Saito divergence, Applications of Signal Processing to Audio and Acoustics (WASPAA), p.313, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00602050

J. Leskovec, A. Rajaraman, and J. D. Ullman, Mining of massive datasets, 2014.

C. S. Leslie, E. Eskin, A. Cohen, J. Weston, and W. S. Noble, Mismatch string kernels for discriminative protein classification, Bioinformatics, vol.20, issue.4, pp.467-476, 2004.

V. I. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals, Soviet Physics Doklady, vol.10, pp.707-710, 1966.

R. Likert, A technique for the measurement of attitudes. Archives of psychology, 1932.

H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins, Text classification using string kernels, Journal of Machine Learning Research, vol.2, pp.419-444, 2002.

J. B. Lovins, Development of a stemming algorithm, Mech. Translat. & Comp. Linguistics, vol.11, issue.1-2, pp.22-31, 1968.

L. van der Maaten and G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research, vol.9, pp.2579-2605, 2008.

D. Maier, The theory of relational databases, 1983.

D. Micci-barreca, A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems, ACM SIGKDD Explorations Newsletter, vol.3, issue.1, pp.27-32, 2001.

T. Mikolov, K. Chen, G. Corrado, and J. Dean, Efficient estimation of word representations in vector space, ICLR, 2013.

T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch et al., Advances in pre-training distributed word representations, International Conference on Language Resources and Evaluation (LREC), 2018.

P. J. Moreno and R. Rifkin, Using the fisher kernel method for web audio classification, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 00CH37100), vol.4, pp.2417-2420, 2000.

J. L. Myers, A. Well, and R. F. Lorch, Research design and statistical analysis, 2010.

G. Navarro, A guided tour to approximate string matching, ACM computing surveys (CSUR), vol.33, issue.1, pp.31-88, 2001.

P. Nemenyi, Distribution-free multiple comparisons, Biometrics, vol.18, p.263, 1962.

K. E. O'Grady and D. R. Medoff, Categorical variables in multiple regression: some cautions, Multivariate Behavioral Research, vol.23, issue.2, pp.243-260, 1988.

P. Oliveira, F. Rodrigues, and P. R. Henriques, A formal definition of data quality problems, Proceedings of the 2005 International Conference on Information Quality (MIT IQ Conference), 2005.

R. S. Olson, W. La-cava, Z. Mustahsan, A. Varik, and J. H. Moore, Data-driven advice for applying machine learning to bioinformatics problems, 2017.

E. J. Pedhazur and F. N. Kerlinger, Multiple regression in behavioral research, 1973.

J. Pennington, R. Socher, and C. Manning, Glove: global vectors for word representation, EMNLP, pp.1532-1543, 2014.

F. Perronnin and C. Dance, Fisher kernels on visual vocabularies for image categorization, 2007 IEEE conference on computer vision and pattern recognition, pp.1-8, 2007.

A. Podosinnikova, F. Bach, and S. Lacoste-julien, Rethinking lda: moment matching for discrete ica, Advances in Neural Information Processing Systems, pp.514-522, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01225271

L. Prokhorenkova, G. Gusev, A. Vorobev, A. Dorogush, and A. Gulin, Catboost: unbiased boosting with categorical features, Neural Information Processing Systems, p.6639, 2018.

D. Pyle, Data preparation for data mining, 1999.

A. Rahimi and B. Recht, Random features for large-scale kernel machines, Neural Information Processing Systems, p.1177, 2008.

E. Rahm and H. H. Do, Data cleaning: problems and current approaches, IEEE Data Engineering Bulletin, vol.23, p.3, 2000.

S. Ruggieri, D. Pedreschi, and F. Turini, Data mining for discrimination discovery, ACM Transactions on Knowledge Discovery from Data (TKDD), vol.4, issue.2, p.9, 2010.

G. Salton and M. J. Mcgill, Introduction to modern information retrieval, 1983.

S. Sarawagi and A. Bhamidipaty, Interactive deduplication using active learning, Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp.269-278, 2002.

B. Schölkopf and A. J. Smola, Learning with kernels, 1998.

M. Serva and F. Petroni, Indo-European languages tree by Levenshtein distance, EPL (Europhysics Letters), vol.81, issue.6, p.68005, 2008.

C. E. Shannon, A mathematical theory of communication, Bell system technical journal, vol.27, issue.3, pp.379-423, 1948.

M. Steyvers and T. Griffiths, Probabilistic topic models. Handbook of latent semantic analysis, vol.427, pp.424-440, 2007.

E. Ukkonen, Approximate string-matching over suffix trees, Annual Symposium on Combinatorial Pattern Matching, pp.228-242, 1993.

V. Vapnik, The nature of statistical learning theory. Springer science & business media, 2013.

V. N. Vapnik, An overview of statistical learning theory, IEEE transactions on neural networks, vol.10, issue.5, pp.988-999, 1999.

K. R. Varshney and H. Alemzadeh, On the safety of machine learning: cyber-physical systems, decision sciences, and data products, Big data, vol.5, issue.3, pp.246-255, 2017.

A. Vellido, J. D. Martin-guerrero, and P. J. Lisboa, Making machine learning models interpretable, In ESANN, vol.12, pp.163-172, 2012.

N. X. Vinh, J. Epps, and J. Bailey, Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance, Journal of Machine Learning Research, vol.11, pp.2837-2854, 2010.

H. Wang and J. Wang, An effective image representation method using kernel classification, 2014 IEEE 26th international conference on tools with artificial intelligence, pp.853-858, 2014.

J. Wang, H. T. Shen, J. Song, and J. Ji, Hashing for similarity search: a survey, 2014.

K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg, Feature hashing for large scale multitask learning, ICML, p.1113, 2009.

F. Wilcoxon, Individual comparisons by ranking methods, Breakthroughs in statistics, pp.196-202, 1992.

W. E. Winkler, The state of record linkage and current research problems, 1999.

W. E. Winkler, Methods for record linkage and bayesian networks, 2002.

W. E. Winkler, Overview of record linkage and current research directions, Bureau of the Census, 2006.

I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal, Data mining: practical machine learning tools and techniques, 2016.