
S. Abiteboul, P. C. Kanellakis, and G. Grahne, On the Representation and Querying of Sets of Possible Worlds, Proceedings of the Association for Computing Machinery Special Interest Group on Management of Data 1987 Annual Conference, p.27, 1987.

S. Abiteboul, R. Hull, and V. Vianu, Foundations of Databases, 1995.

S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy, The Aqua approximate query answering system, ACM SIGMOD Record, vol.28, issue.2, p.90, 1999.

A. A. Afifi and R. M. Elashoff, Missing Observations in Multivariate Statistics I. Review of the Literature, Journal of the American Statistical Association, vol.61, p.24, 1966.

P. D. Allison, Multiple imputation for missing data: A cautionary tale, Sociological Methods & Research, vol.28, p.90, 2000.

B. Babcock, S. Chaudhuri, and G. Das, Dynamic sample selection for approximate query processing, Proceedings of the 2003 ACM SIGMOD international conference on Management of data, p.96, 2003.

J. Barateiro and H. Galhardas, A Survey of Data Quality Tools, Datenbank-Spektrum, vol.14, pp.15-21, 2005.

C. Batini, C. Cappiello, C. Francalanci, and A. Maurino, Methodologies for data quality assessment and improvement, ACM Computing Surveys, vol.41, issue.3, pp.22-24, 2009.

C. Batini and M. Scannapieco, Data and Information Quality -Dimensions, Principles and Techniques. Data-Centric Systems and Applications, p.20, 2016.

P. Bohannon, W. Fan, M. Flaster, and R. Rastogi, A cost-based model and effective heuristic for repairing constraints by value modification, Proceedings of the 2005 ACM SIGMOD international conference on Management of data, p.92, 2005.

M. Bovee, R. P. Srivastava, and B. Mak, A conceptual framework and belief-function approach to assessing overall information quality, Int. J. Intell. Syst, vol.18, p.22, 2003.

D. C. Brabham, Crowdsourcing as a model for problem solving: An introduction and cases, Convergence, vol.14, issue.1, p.91, 2008.

F. Buccafurri, F. Furfaro, D. Sacca, and C. Sirangelo, A Quad-tree Based Multiresolution Approach for Two-dimensional Summary Data, Proceedings of the 15th International Conference on Scientific and Statistical Database Management. SSDBM '03, p.136, 2003.

S. F. Buck, A method of estimation of missing values in multivariate data suitable for use with an electronic computer, Journal of the Royal Statistical Society. Series B (Methodological), p.94, 1960.

M. Buhrmester, T. Kwang, and S. D. Gosling, Amazon's Mechanical Turk: A new source of inexpensive, yet high-quality, data?, Perspectives on Psychological Science, vol.6, p.91, 2011.

L. F. Burgette and J. P. Reiter, Multiple imputation for missing data via sequential regression trees, American Journal of Epidemiology, vol.172, p.95, 2010.

J. Cambronero, J. K. Feser, M. J. Smith, and S. Madden, Query optimization for dynamic imputation, Proceedings of the VLDB Endowment, vol.10, pp.1310-1321, 2017.

Y. Cao, T. Deng, W. Fan, and F. Geerts, On the data complexity of relative information completeness, Inf. Syst, vol.45, p.37, 2014.

J. G. Carbonell, R. S. Michalski, and T. M. Mitchell, An overview of machine learning, Machine Learning, vol.I, p.94, 1983.

K. Chakrabarti, M. Garofalakis, R. Rastogi, and K. Shim, Approximate query processing using wavelets, The VLDB Journal-The International Journal on Very Large Data Bases, vol.10, p.96, 2001.

J. Chen, J. Pan, C. Faloutsos, and S. Papadimitriou, TSum: fast, principled table summarization, Proceedings of the Seventh International Workshop on Data Mining for Online Advertising - ADKDD '13, p.137, 2013.

A. I. Chittilappilly, L. Chen, and S. Amer-Yahia, A survey of general-purpose crowdsourcing techniques, IEEE Transactions on Knowledge and Data Engineering, vol.28, p.91, 2016.

Y. Chung, M. L. Mortensen, C. Binnig, and T. Kraska, Estimating the impact of unknown unknowns on aggregate query results, ACM Transactions on Database Systems (TODS), vol.43, p.97, 2018.

E. F. Codd, A Relational Model of Data for Large Shared Data Banks, Commun. ACM, vol.13, pp.377-387, 1970.

E. F. Codd, Extending the Database Relational Model to Capture More Meaning, ACM Trans. Database Syst, vol.4, issue.4, pp.397-434, 1979.

S. Cohen, W. Nutt, and A. Serebrenik, Algorithms for rewriting aggregate queries using views, Current Issues in Databases and Information Systems, p.148, 2000.

T. Condie, N. Conway, and P. Alvaro, Online aggregation and continuous query support in mapreduce, Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, p.90, 2010.

A. Cuzzocrea and D. Saccà, H-IQTS: A Semantics-aware Histogram for Compressing Categorical OLAP Data, Proceedings of the 2008 International Symposium on Database Engineering & Applications. IDEAS '08, p.136, 2008.

M. Dallachiesa, A. Ebaid, and A. Eldawy, NADEEF: a commodity data cleaning system, Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, vol.104, p.93, 2013.

A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society. Series B (Methodological), p.94, 1977.

W. E. Deming, Quality, productivity, and competitive position, MIT Center for Advanced Engineering, vol.18, 1982.

T. Deng, W. Fan, and F. Geerts, Capturing Missing Tuples and Missing Values, ACM Trans. Database Syst, vol.41, 2016.

J. K. Dixon, Pattern recognition with partly missing data, IEEE Transactions on Systems, Man, and Cybernetics, vol.9, p.95, 1979.

A. R. T. Donders, G. J. M. G. van der Heijden, T. Stijnen, and K. G. M. Moons, A gentle introduction to imputation of missing values, Journal of Clinical Epidemiology, vol.59, issue.10, p.94, 2006.

S. Duan and S. Babu, Processing forecasting queries, Proceedings of the 33rd international conference on Very large data bases. VLDB Endowment, p.95, 2007.

C. Elkan, Independence of Logic Database Queries and Updates, Proceedings of the Ninth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, p.32, 1990.

W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis, Conditional functional dependencies for capturing data inconsistencies, ACM Transactions on Database Systems (TODS), vol.33, p.92, 2008.

W. Fan and F. Geerts, Relative Information Completeness, ACM Trans. Database Syst, vol.35, p.44, 2010.

W. Fan, J. Li, S. Ma, N. Tang, and W. Yu, Towards certain fixes with editing rules and master data, Proceedings of the VLDB Endowment, vol.3, pp.173-184, 2010.

W. Fan and F. Geerts, Foundations of Data Quality Management, Synthesis Lectures on Data Management, vol.36, p.18, 2012.

W. Fan, Data quality: From theory to practice, ACM SIGMOD Record, vol.44, pp.7-18, 2015.

I. P. Fellegi and D. Holt, A systematic approach to automatic edit and imputation, Journal of the American Statistical Association, vol.71, p.90, 1976.

M. N. Garofalakis and P. B. Gibbons, Approximate Query Processing: Taming the TeraBytes, p.96, 2001.

D. A. Garvin, Managing quality: The strategic and competitive edge, vol.18, 1988.

M. Ge and M. Helfert, A Review of Information Quality Research -Develop a Research Agenda, Proceedings of the 12th International Conference on Information Quality, pp.76-91, 2007.

A. Gelman and J. Hill, Data analysis using regression and multilevel/hierarchical models, p.89, 2006.

V. Goasdoué, S. Nugier, D. Duquennoy, and B. Laboisse, An Evaluation Framework For Data Quality Tools, Proceedings of the 12th International Conference on Information Quality, MIT, p.24, 2007.

J. C. Gower, A general coefficient of similarity and some of its properties, Biometrics, p.94, 1971.

J. Grant, Null Values in a Relational Data Base, Inf. Process. Lett, vol.6, p.24, 1977.

T. Gschwandtner, J. Gärtner, W. Aigner, and S. Miksch, A taxonomy of dirty time-oriented data, International Conference on Availability, Reliability, and Security, pp.58-72, 2012.
URL : https://hal.archives-ouvertes.fr/hal-01542440

F. Hannou, B. Amann, and M. Baazizi, Explaining Query Answer Completeness and Correctness With Partition Patterns, Database and Expert Systems Applications -30th International Conference, DEXA 2019, p.13, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02310582

F. Hannou, B. Amann, and M. Baazizi, Exploring and Comparing Table Fragments With Fragment Summaries, p.13, 2019.

F. Hannou, B. Amann, and M. Baazizi, Query-Oriented answer imputation for aggregate queries, Advances in Databases and Information Systems -23rd European Conference, p.13, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02310571

J. M. Hellerstein, M. J. Franklin, and S. Chandrasekaran, Adaptive query processing: Technology in evolution, IEEE Data Eng. Bull, vol.23, p.96, 2000.

J. M. Hellerstein, Quantitative data cleaning for large databases, Europe, p.89, 2008.

T. N. Herzog, F. J. Scheuren, and W. E. Winkler, Data quality and record linkage techniques, 2007.

K.-T. Huang, Y. W. Lee, and R. Y. Wang, Quality information and knowledge, vol.18, 1998.

T. Imielinski and W. Lipski, Incomplete Information in Relational Databases, J. ACM, vol.31, pp.761-791, 1984.

T. Imieliński and W. Lipski, Incomplete information in relational databases, Readings in Artificial Intelligence and Databases, pp.342-360, 1988.

A. Immonen, P. Pääkkönen, and E. Ovaska, Evaluating the Quality of Social Media Data in Big Data Architecture, IEEE Access, vol.3, p.20, 2015.

S. R. Jeffery, M. J. Franklin, and A. Y. Halevy, Pay-as-you-go user feedback for dataspace systems, Proceedings of the 2008 ACM SIGMOD international conference on Management of data, p.90, 2008.

J. M. Jerez, I. Molina, P. J. García-Laencina, et al., Missing data imputation using statistical and machine learning methods in a real breast cancer problem, Artificial Intelligence in Medicine, vol.50, p.94, 2010.

H. Junninen, H. Niska, K. Tuppurainen, J. Ruuskanen, and M. Kolehmainen, Methods for imputation of missing values in air quality data sets, Atmospheric Environment, vol.38, p.94, 2004.

J. M. Juran, Juran on leadership for quality, vol.18, 2003.

C. Kiefer, Assessing the Quality of Unstructured Data: An Initial Overview, Proceedings of the LWDA 2016 Proceedings (LWDA), CEUR Workshop Proceedings, p.20, 2016.

T. Kohonen, Exploration of very large databases by self-organizing maps, Proceedings of International Conference on Neural Networks (ICNN'97), vol.1, p.95, 1997.

A. Koopman and A. Siebes, Characteristic Relational Patterns, Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD '09, p.137, 2009.

L. V. S. Lakshmanan, J. Pei, and J. Han, Quotient cube: How to summarize the semantics of a data cube, Proceedings of the 28th international conference on Very Large Data Bases. VLDB Endowment, p.136, 2002.

K. Lakshminarayan, S. A. Harp, and T. Samad, Imputation of missing data in industrial databases, Applied Intelligence, vol.11, p.89, 1999.

W. Lang, R. V. Nehme, E. Robinson, and J. F. Naughton, Partial results in database systems, Proceedings of the 2014 ACM SIGMOD international conference on Management of data - SIGMOD '14, pp.1275-1286, 2014.


W. Lang, R. V. Nehme, and I. Rae, Database Optimization in the Cloud: Where Costs, Partial Results, and Consumer Choice Meet, CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, p.35, 2015.

L. R. Landerman, K. C. Land, and C. F. Pieper, An empirical evaluation of the predictive mean matching method for imputing missing values, Sociological Methods & Research, vol.26, p.90, 1997.

M. van Leeuwen and J. Vreeken, Mining and Using Sets of Patterns through Compression, Frequent Pattern Mining, p.137, 2014.

D. Lembo, Dealing with inconsistency and incompleteness in data integration, p.26, 2004.

H. Lenz and A. Shoshani, Summarizability in OLAP and statistical data bases, Scientific and Statistical Database Management, 1997. Proceedings., Ninth International Conference on, p.136, 1997.

A. Y. Levy and Y. Sagiv, Queries independent of updates, vol.93, p.145, 1993.

A. Y. Levy, Obtaining complete answers from incomplete databases, VLDB, vol.96, pp.402-412, 1996.

G. Li, J. Wang, Y. Zheng, and M. Franklin, Crowdsourced data management: A survey, IEEE Transactions on Knowledge and Data Engineering, vol.28, pp.2296-2319, 2016.

R. J. A. Little and D. B. Rubin, Statistical analysis with missing data, vol.333, p.90, 2014.

R. J. A. Little and D. B. Rubin, The analysis of social science data with missing values, Sociological Methods & Research, vol.18, pp.292-326, 1989.

R. J. A. Little, A test of missing completely at random for multivariate data with missing values, Journal of the American Statistical Association, vol.83, p.90, 1988.

M. Lo, K. Wu, and P. Yu, Tabsum: A flexible and dynamic table summarization approach, Proceedings. 20th International Conference on, p.137, 2000.

D. Loshin, Master data management, p.36, 2010.

A. N. Mahmood, C. Leckie, R. Islam, and Z. Tari, Hierarchical summarization techniques for network traffic, 2011 6th IEEE Conference on Industrial Electronics and Applications, p.136, 2011.

M. Mampaey and J. Vreeken, Summarizing categorical data by clustering attributes, Data Mining and Knowledge Discovery, vol.26, issue.1, p.137, 2013.

C. Mayfield, J. Neville, and S. Prabhakar, ERACER: a database approach for statistical inference and data cleaning, Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, p.95, 2010.

R. van der Meyden, Logical approaches to incomplete information: A survey, Logics for databases and information systems, vol.28, p.27, 1998.

A. Motro, Integrity = Validity + Completeness, ACM Trans. Database Syst, vol.14, pp.480-502, 1989.

H. Müller and J. Freytag, Problems, methods, and challenges in comprehensive data cleansing, p.19, 2005.

F. Naumann, Quality-driven query answering for integrated information systems, vol.2261, p.22, 2003.

P. Oliveira, F. Rodrigues, and P. R. Henriques, A Formal Definition of Data Quality Problems, Proceedings of the 2005 International Conference on Information Quality (MIT ICIQ Conference), Sponsored by Lockheed, p.19, 2005.

D. Olteanu and M. Schleich, Factorized Databases, SIGMOD Rec, vol.45, p.137, 2016.

H. Park and J. Widom, Crowdfill: collecting structured data from the crowd, Proceedings of the 2014 ACM SIGMOD international conference on Management of data, p.91, 2014.

T. E. Raghunathan, J. M. Lepkowski, J. Van Hoewyk, and P. Solenberger, A multivariate technique for multiply imputing missing values using a sequence of regression models, Survey Methodology, vol.27, p.94, 2001.

E. Rahm and H. H. Do, Data cleaning: Problems and current approaches, IEEE Data Eng. Bull, vol.23, pp.3-13, 2000.

V. Raman and J. M. Hellerstein, Partial results for online query processing, Proceedings of the 2002 ACM SIGMOD international conference on Management of data, p.96, 2002.

G. Raschia and N. Mouaddib, SAINTETIQ: a fuzzy set-based approach to database summarization, Fuzzy sets and systems, vol.129, p.136, 2002.

S. Razniewski and W. Nutt, Completeness of Queries over Incomplete Databases, p.33, 2011.

S. Razniewski, F. Korn, W. Nutt, and D. Srivastava, Identifying the Extent of Completeness of Query Answers over Partially Complete Databases, Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data - SIGMOD '15, pp.561-576, 2015.

S. Razniewski, O. Savkovic, and W. Nutt, Turning The Partial-closed World Assumption Upside Down, p.142, 2016.

T. C. Redman, Data quality for the information age, vol.23, p.22, 1996.

R. Reiter, A sound and sometimes complete query evaluation algorithm for relational databases with null values, J. ACM, vol.33, p.27, 1986.

P. Royston, Multiple imputation of missing values, Stata journal, vol.4, p.90, 2004.

D. B. Rubin, Multiple imputation for nonresponse in surveys, vol.81, p.90, 2004.

D. B. Rubin, Inference and missing data, Biometrika, vol.63, p.25, 1976.

K. S. Candan, H. Cao, Y. Qi, and M. L. Sapino, AlphaSum: size-constrained table summarization using value lattices, Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, p.136, 2009.

L. Sebastian-coleman, Measuring data quality for ongoing improvement: a data quality assessment framework, vol.18, 2012.

J. L. Schafer, Analysis of incomplete multivariate data, Chapman and Hall/CRC, p.94, 1997.

D. Sonntag, Assessing the Quality of Natural Language Text Data, INFORMATIK 2004 - Informatik verbindet, Band 1, Beiträge der 34. Jahrestagung der Gesellschaft für Informatik e.V. (GI), p.20, 2004.

R. Saint-paul, G. Raschia, and N. Mouaddib, General Purpose Database Summarization, Proceedings of the 31st International Conference on Very Large Data Bases. VLDB '05, p.136, 2005.

E. Silva-Ramírez, R. Pino-Mejías, M. López-Coello, and M. Cubiles-de-la-Vega, Missing value imputation on missing completely at random data using multilayer perceptrons, Neural Networks, vol.24, p.94, 2011.

G. Ssali and T. Marwala, Computational intelligence and decision trees for missing data estimation, IEEE International Joint Conference on, p.95, 2008.

M. Stonebraker and L. A. Rowe, The Design of POSTGRES, SIGMOD Rec, vol.15, p.77, 1986.

B. Sundarmurthy, P. Koutris, W. Lang, J. Naughton, and V. Tannen, m-tables: Representing missing data, LIPIcs-Leibniz International Proceedings in Informatics, vol.68, 2017.

I. Taleb, M. A. Serhani, and R. Dssouli, Big Data Quality Assessment Model for Unstructured Data, 2018 International Conference on Innovations in Information Technology (IIT), p.20, 2018.

I.-G. Todoran, L. Lecornu, A. Khenchaf, and J.-M. Le Caillec, A Methodology to Evaluate Important Dimensions of Information Quality in Systems, J. Data and Information Quality, vol.6, issue.2-3, p.20, 2015.

B. Trushkowsky, T. Kraska, M. J. Franklin, and P. Sarkar, Crowdsourced enumeration queries, 2013 IEEE 29th International Conference on Data Engineering (ICDE), p.92, 2013.

S. van Buuren, Flexible imputation of missing data, p.94, 2018.

W. A. Voglozin, G. Raschia, L. Ughetto, and N. Mouaddib, Querying a summary of database, Journal of Intelligent Information Systems, vol.26, issue.1, p.136, 2006.
URL : https://hal.archives-ouvertes.fr/hal-00442749

J. Wang, S. Krishnan, M. J. Franklin, et al., A sample-and-clean framework for fast and accurate query processing on dirty data, Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pp.469-480, 2014.

H. Wang, Z. Qi, R. Shi, J. Li, and H. Gao, COSSET+: Crowdsourced Missing Value Imputation Optimized by Knowledge Base, Journal of Computer Science and Technology, vol.32, p.89, 2017.

Y. Wand and R. Y. Wang, Anchoring Data Quality Dimensions in Ontological Foundations, Commun. ACM, vol.39, p.22, 1996.

R. Y. Wang and D. M. Strong, Beyond Accuracy: What Data Quality Means to Data Consumers, J. of Management Information Systems, vol.12, pp.22-24, 1996.

J. Wijsen, Database repairing using updates, ACM Transactions on Database Systems (TODS), vol.30, p.92, 2005.

P. Woodall, M. A. Oberhofer, and A. Borek, A classification of data quality assessment and improvement methods, p.19, 2014.

M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas, Guided data repair, Proceedings of the VLDB Endowment, vol.4, p.95, 2011.

