Permits issued by the Chicago Department of Buildings since 2006. Categorical variable: Work Description (cardinality: 430k).
Information about U.S. colleges and schools. Target (regression): Percent Pell Grant.
crime data (1.5M). Incidents of crime in the City of Los Angeles since 2010. Target (regression): Victim Age. Categorical variable: Crime Code Description.
Target (multiclass): Product Type Name. Categorical variable: Non Proprietary Name (17k).
employee salaries (9.2k). Salary information for employees of Montgomery County, MD. Target (regression): Current Annual Salary. Categorical variable: Employee Position Title (385).
federal election (3.3M). Campaign finance data for the 2011-2012 US election cycle. Target (regression): Transaction Amount. Categorical variable: Memo Text (17k).
journal influence (3.6k). Scientific journals and their respective influence scores.
Target (binary): State. Categorical variable: Category (158).
Inpatient discharges for Medicare beneficiaries for more than 3,000 U.S. hospitals. Target (regression): Average Total Payments. Categorical variable: Medical Procedure (100).
met objects (469k).
Survey on whether people self-identify as Midwesterners. Target (multiclass): Census Region (10 classes). Categorical var
Payments given by healthcare manufacturing companies to medical doctors or hospitals (year 2013).
Public procurement data for the European Economic Area, Switzerland, and Macedonia. Target (regression): Award Value Euro. Categorical variable: CAE Name (29k).
road safety (139k).
traffic violations (1.2M). Traffic information from electronic violations issued in Montgomery County, MD. Target (multiclass): Violation Type (4 classes). Categorical variable: Description (11k).
Remuneration and expenses for employees earning over $75,000 per year. Target (regression): Remuneration.
Wine reviews scraped from WineEnthusiast. Target (regression): Points. Categorical variable: Description (89k).
Size: 32k. Predict whether income exceeds $50K/yr based on census data.
Expert ratings of over 1,700 individual chocolate bars, along with information on their origin and bean variety. Target (multiclass): Bean Type. Categorical variable: Broad Bean Origin.
Based on the 1990 California census data. It contains one row per census block group (a block group typically has a population of 600 to 3,000 people). Target (regression): Median House Value. Categorical variable: Ocean Proximity.
Anonymized data of dating profiles from OkCupid. Target (regression): Age. Categorical variable: Diet.
Contains variables describing residential homes in Ames, Iowa. Target (regression): Sale Price. Categorical variable: MSSubClass (15).
Sale prices for houses in King County
Network intrusion simulations with a variety of descriptors of the attack type. Target (multiclass): Attack Type.