max_t(tf) is the maximum frequency of any term in the document, and avg.dl is the average document length measured in number of terms. For ease of reference, we also include the BM25 tf scheme. The k1 and b parameters of BM25 are set to their default values of 1.2 and 0.95, respectively. SMART notation for term frequency variants, p.28, 2000. ,
Short-Text Similarity Measurement Using Word Sense Disambiguation and Synonym Expansion, Lecture Notes in Computer Science, vol.2, issue.2, pp.435-444, 2010. ,
DOI : 10.1162/coli.2006.32.1.13
URL : http://arrow.latrobe.edu.au:8080/http:/www.springerlink.com/content/31541v00731x2755/fulltext.pdf
Multimodal corpora, Corpus Linguistics. An International Handbook, pp.207-225, 2008. ,
URL : https://hal.archives-ouvertes.fr/hprints-00511882
A comparison of extrinsic clustering evaluation metrics based on formal constraints, Information Retrieval, vol.30, issue.4, pp.461-486, 2009. ,
DOI : 10.1007/s10791-008-9066-8
Inter-Coder Agreement for Computational Linguistics, Computational Linguistics, vol.27, issue.1, pp.555-596, 2008. ,
DOI : 10.1037/0033-2909.103.3.374
Modern information retrieval, 1999. ,
A reflective view on text similarity, Proceedings of Recent Advances in Natural Language Processing, pp.515-520, 2011. ,
Monolingual text similarity measures: A comparison of models over wikipedia articles revisions, Proceedings of the 7th International Conference on Natural Language Processing, pp.29-38, 2009. ,
Corpus and Evaluation Measures for Automatic Plagiarism Detection, Proceedings of the Seventh conference on International Language Resources and Evaluation, 2010. ,
Sentence alignment for monolingual comparable corpora, Proceedings of the 2003 conference on Empirical methods in natural language processing, pp.25-32, 2003. ,
DOI : 10.3115/1119355.1119359
Extracting paraphrases from a parallel corpus, Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, ACL '01, pp.50-57, 2001. ,
DOI : 10.3115/1073012.1073020
Information fusion in the context of multi-document summarization, Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, 1999. ,
DOI : 10.3115/1034678.1034760
Topic-based vector space model, Proceedings of the 6th International Conference on Business Information Systems, pp.7-12, 2003. ,
Statistical models for text segmentation, Machine Learning, pp.177-210, 1999. ,
Monolingual comparable corpora and parallel corpora in the search for features of translated language, SYNAPS -A Journal of Professional Communication, 2011. ,
Using Linear Algebra for Intelligent Information Retrieval, SIAM Review, vol.37, issue.4, pp.573-595, 1995. ,
DOI : 10.1137/1037127
FCM: The fuzzy c-means clustering algorithm, Computers & Geosciences, vol.10, issue.2-3, pp.191-203, 1984. ,
DOI : 10.1016/0098-3004(84)90020-7
Topic segmentation with an aspect hidden markov model, pp.343-348, 2001. ,
Latent dirichlet allocation, Journal of Machine Learning Research, vol.3, pp.993-1022, 2003. ,
Improving corpus comparability for bilingual lexicon extraction from comparable corpora, Proceedings of 23rd international conference on computational linguistics, pp.644-652, 2010. ,
URL : https://hal.archives-ouvertes.fr/hal-00953833
CUCWeb, Proceedings of the 2nd International Workshop on Web as Corpus, WAC '06, pp.19-28, 2006. ,
DOI : 10.3115/1628297.1628301
Document retrieval and routing using the inquery system, Proceeding of Third Text Retrieval Conference, pp.29-38, 1994. ,
The importance of proper weighting methods, Workshop on Human Language Technology, pp.349-352, 1993. ,
Unsupervised learning with term clustering for thematic segmentation of texts, RIAO, pp.648-657, 2004. ,
A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization, Text Databases and Document Management: Theory and Practice, 2001. ,
Looking for candidate translational equivalents in specialized, comparable corpora, Proceedings of the 19th international conference on Computational linguistics, 2002. ,
DOI : 10.3115/1071884.1071904
The google similarity distance, IEEE Transactions on Knowledge and Data Engineering, pp.370-383, 2007. ,
Advances in domain independent linear text segmentation, ANLP, pp.26-33, 2000. ,
Latent semantic analysis for text segmentation, Proceedings of Empirical Methods in Natural Language Processing, pp.109-117, 2001. ,
Vectorisation, Okapi et calcul de similarité pour le TAL : pour oublier enfin le TF-IDF, Proceedings of JEP-TALN-RECITAL, pp.85-98 ,
Learning trees and rules with set-valued features, pp.709-716, 1996. ,
Unsupervised models for named entity classification, Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pp.100-110, 1999. ,
Classical approaches to natural language processing, Handbook of Natural Language Processing, 2010. ,
Indexing by latent semantic analysis, Journal of the American Society for Information Science, vol.41, issue.6, pp.391-407, 1990. ,
DOI : 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
Unsupervised construction of large paraphrase corpora, Proceedings of the 20th international conference on Computational Linguistics, COLING '04, pp.350-356, 2004. ,
DOI : 10.3115/1220355.1220406
The most influential paper gerard salton never wrote, Library Trends, 2004. ,
Inductive learning algorithms and representations for text categorization, Proceedings of the seventh international conference on Information and knowledge management, CIKM '98, pp.148-155, 1998. ,
DOI : 10.1145/288627.288651
Improving the retrieval of information from external sources, Behavior Research Methods, Instruments, and Computers, pp.229-236, 1991. ,
DOI : 10.3758/BF03203370
Finding document topics for improving topic segmentation, 2007. ,
Information Retrieval: Data Structures & Algorithms, 1992. ,
A probabilistic learning approach for document indexing, ACM Transactions on Information Systems, vol.9, issue.3, pp.223-248, 1991. ,
DOI : 10.1145/125187.125189
Hierarchical document clustering using frequent itemsets, Proceedings of SIAM International Conference on Data Mining, 2003. ,
A Statistical View on Bilingual Lexicon Extraction: From Parallel Corpora to Non-Parallel Corpora, AMTA, pp.1-17, 1998. ,
DOI : 10.1007/3-540-49478-2_1
K-vec, Proceedings of the 15th conference on Computational linguistics, pp.1096-1102, 1994. ,
DOI : 10.3115/991250.991328
The meter corpus: A corpus for analysing journalistic text reuse, pp.214-223, 2001. ,
Experiments on the Use of Feature Selection and Negative Evidence in Automated Text Categorization, Proceedings of ECDL-00, 4th European Conference on Research and Advanced Technology for Digital Libraries, pp.59-68, 2000. ,
DOI : 10.1007/3-540-45268-0_6
A program for aligning sentences in bilingual corpora, pp.1-8, 1991. ,
Discourse segmentation of multi-party conversation, Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, ACL '03, pp.562-569, 2003. ,
DOI : 10.3115/1075096.1075167
Conceptual spaces, Kognitionswissenschaft, vol.4, issue.4, 2000. ,
DOI : 10.1007/s001970050015
A study of information retrieval weighting schemes for sentiment analysis, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 2010. ,
Seven strictures on similarity. Bobbs-Merrill, 1991. ,
Comparable and translation corpora in cross-linguistic research. design, analysis and applications, Journal of Shanghai Jiaotong University, 2010. ,
Attention, intentions, and the structure of discourse, Computational Linguistics, vol.12, pp.175-204, 1986. ,
Enhancing lexical cohesion measure with confidence measures, semantic relations and language model interpolation for multimedia spoken content topic segmentation, Journal of Computer Speech and Language, pp.90-104, 2012. ,
DOI : 10.1016/j.csl.2011.06.002
URL : https://hal.archives-ouvertes.fr/hal-00645705
Modeling sentences in the latent space, Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pp.864-872, 2012. ,
Cohesion in English. Longman Group Limited, pp.14-47, 1976. ,
The Hungarian method for the assignment problem, Naval Research Logistics Quarterly, vol.3, issue.1-2, pp.83-97, 1955. ,
DOI : 10.1002/nav.3800020109
Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning, pp.13-20, 1999. ,
Simfinder: A flexible clustering tool for summarization, Proceedings of the North American Chapter of the Association for Computational Linguistics: Workshop on Automatic Summarization, pp.41-49, 2001. ,
Texttiling: Segmenting text into multi-paragraph subtopic passages, Computational Linguistics, vol.23, pp.33-64, 1997. ,
Enhancing a statistical machine translation system by using automatically extracted parallel corpus from comparable sources, Proceedings of the LREC 2008 Workshop on Comparable Corpora, 2008. ,
Patterns of Lexis in Text, 1991. ,
Probabilistic latent semantic analysis, UAI, pp.289-296, 1999. ,
WordICA - emergence of linguistic representations for words by independent component analysis, Natural Language Engineering, vol.16, issue.03, pp.277-308, 2010. ,
DOI : 10.1037/0033-295X.114.1.1
Similarity measures for text document clustering, New Zealand Computer Science Research Student Conference, pp.49-56, 2008. ,
Comparing partitions, Journal of Classification, vol.78, issue.1, pp.193-218, 1985. ,
DOI : 10.1007/BF01908075
Machine translation: a concise history, 2007. ,
Survey on independent component analysis, 1999. ,
Semantic text similarity using corpus-based word similarity and string similarity, ACM Transactions on Knowledge Discovery from Data, vol.2, issue.2, pp.55-60, 2008. ,
DOI : 10.1145/1376815.1376819
Applications of corpus-based semantic similarity and word segmentation to database schema matching, The VLDB Journal - The International Journal on Very Large Data Bases, pp.1293-1320, 2008. ,
Two-stage bootstrapping for anaphora resolution, Proceedings of Conference on Computational Linguistics: Posters, pp.507-516, 2012. ,
Mining name translations from comparable corpora by creating bilingual information networks, Proceedings of the 2nd Workshop on Building and Using Comparable Corpora from Parallel to Non-parallel Corpora, BUCC '09, pp.34-37, 2009. ,
DOI : 10.3115/1690339.1690349
Semantic similarity based on corpus statistics and lexical taxonomy, Proceedings of the International Conference on Research in Computational Linguistics, pp.19-33, 1997. ,
Principal Component Analysis, 2002. ,
DOI : 10.1007/978-1-4757-1904-8
A STATISTICAL INTERPRETATION OF TERM SPECIFICITY AND ITS APPLICATION IN RETRIEVAL, Journal of Documentation, vol.28, issue.1, pp.11-21, 1972. ,
DOI : 10.1108/eb026526
Finding Groups in Data: An Introduction to Cluster Analysis, 1990. ,
DOI : 10.1002/9780470316801
Second-Order Cohesion, Computational Intelligence, vol.16, issue.4, pp.511-524, 2000. ,
DOI : 10.1111/0824-7935.00124
Text-translation alignment, pp.121-142, 1991. ,
Introduction to the Special Issue on the Web as Corpus, Association for Computational Linguistics, 2003. ,
DOI : 10.1038/21987
Bulgarian x-language parallel corpus, pp.23-25 ,
On principal component analysis, cosine and Euclidean measures in information retrieval, Information Sciences, vol.177, issue.22, pp.4893-4905, 2007. ,
DOI : 10.1016/j.ins.2007.05.027
Using an evolving thematic clustering in a text segmentation process, 2008. ,
A solution to plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge, Psychological Review, pp.211-240, 1997. ,
Introduction to latent semantic analysis, Discourse Processes, pp.33-34, 1998. ,
Combining local context and WordNet sense similarity for word sense identification, WordNet: An Electronic Lexical Database, pp.265-284, 1998. ,
Automatic sense disambiguation using machine readable dictionaries, Proceedings of the 5th annual international conference on Systems documentation, SIGDOC '86, pp.24-26, 1986. ,
DOI : 10.1145/318723.318728
Evaluating and optimizing autonomous text classification systems, Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval, 1995. ,
Text similarity: an alternative way to search MEDLINE, Bioinformatics, vol.22, issue.18, pp.2298-304, 2006. ,
DOI : 10.1093/bioinformatics/btl388
An information-theoretic definition of similarity, Proceedings of the 15th International Conference on Machine Learning, pp.296-304, 1998. ,
Statistical machine translation, ACM Computing Surveys, vol.40, issue.3, 2008. ,
DOI : 10.1145/1380584.1380586
Incorporating named entity recognition into the speech transcription process, Interspeech, 2013. ,
URL : https://hal.archives-ouvertes.fr/hal-00843211
Learning word vectors for sentiment analysis, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp.142-150, 2011. ,
Clustering Abstracts Instead of Full Texts, Proceedings of the 7th International Conference on Text, Speech, Dialog (TSD), Lecture notes in Artificial Intelligence, pp.129-135, 2004. ,
DOI : 10.1007/978-3-540-30120-2_17
Foundations of Statistical Natural Language Processing, pp.42-52, 1999. ,
Introduction to Information Retrieval, pp.28-29, 2008. ,
DOI : 10.1017/CBO9780511809071
Annotating a parallel monolingual treebank with semantic similarity relations, The Sixth International Workshop on Treebanks and Linguistic Theories, 2007. ,
Discourse cues for broadcast news segmentation, Conference on Computational Linguistics-Association for Computational Linguistics, pp.819-822, 1998. ,
Comparing clusterings - an information based distance, Journal of Multivariate Analysis, vol.98, issue.5, pp.873-895, 2007. ,
DOI : 10.1016/j.jmva.2006.11.013
Meaning-Text Models: A Recent Trend in Soviet Linguistics, Annual Review of Anthropology, vol.10, issue.1, pp.27-62, 1981. ,
DOI : 10.1146/annurev.an.10.100181.000331
Corpus-based and knowledge-based measures of text semantic similarity, AAAI'06, pp.775-780, 2006. ,
The ltg part of speech tagger, pp.50-57, 1997. ,
A short guide to the meaning-text linguistic theory, Journal of Koralex, pp.187-233, 2006. ,
Text Segmentation with Multiple Surface Linguistic Cues, Proceedings of Conference on Computational Linguistics-Association for Computational Linguistics, pp.881-885, 1998. ,
DOI : 10.5715/jnlp.6.3_43
Improved machine translation performance via parallel sentence extraction from comparable corpora, Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.265-272, 2004. ,
Weight functions impact on lsa performance, EuroConference on Recent Advances in Natural Language Processing, pp.187-193, 2001. ,
Towards robust context-sensitive sentence alignment for monolingual corpora, European Chapter of the Association for Computational Linguistics, pp.39-56, 2006. ,
On spectral clustering: Analysis and an algorithm, Advances in Neural Information Processing Systems, pp.849-856, 2001. ,
Feature selection, perceptron learning, and a usability case study for text categorization, Proceedings of SIGIR-97, 20th ACM International Conference on Research and Development in Information Retrieval, pp.67-73, 1997. ,
On weighting clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.28, issue.8, 2006. ,
DOI : 10.1109/TPAMI.2006.168
Intention-based segmentation, Proceedings of the 31st annual meeting on Association for Computational Linguistics, pp.148-155, 1993. ,
DOI : 10.3115/981574.981594
Terms in Context -Studies in Corpus Linguistics, John Benjamins, 1998. ,
An application of latent semantic analysis to word sense discrimination for words with related and unrelated meanings, Proceedings of the Fourth Workshop on Innovative Use of NLP for Building Educational Applications, EdAppsNLP '09, pp.43-46, 2009. ,
DOI : 10.3115/1609843.1609849
Kncr: A short-text narrow-domain subcorpus of medline, TLH 2006. Advances in Computer Science, pp.266-269, 2006. ,
Clustering Abstracts of Scientific Texts Using the Transition Point Technique, International Conference on Intelligent Text Processing and Computational Linguistics, pp.536-546, 2006. ,
DOI : 10.1007/11671299_55
Clustering narrow-domain short texts by using the Kullback-Leibler distance, International Conference on Intelligent Text Processing and Computational Linguistics, pp.26-64, 2007. ,
C4.5: Programs for Machine Learning, 1993. ,
Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association, vol.66, issue.336, pp.846-850, 1971. ,
The life and death of discourse entities: Identifying singleton mentions, Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.627-633, 2013. ,
The NVI clustering evaluation measure, Proceedings of the Thirteenth Conference on Computational Natural Language Learning, CoNLL '09, pp.165-173, 2009. ,
DOI : 10.3115/1596374.1596401
Using information content to evaluate semantic similarity, Proceedings of the 14th International Joint Conference on Artificial Intelligence, pp.448-453, 1995. ,
An automatic method of finding topic boundaries, Association for Computational Linguistics, pp.331-333, 1994. ,
Topic segmentation: Algorithms and applications, 1998. ,
Okapi at trec-3, pp.109-126, 1996. ,
Comparing clusterings - an information based distance, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp.410-420, 2007. ,
MATHEMATICS AND INFORMATION RETRIEVAL, Journal of Documentation, vol.35, issue.1, pp.1-29, 1979. ,
DOI : 10.1108/eb026671
Term-weighting approaches in automatic text retrieval, Information Processing and Management, pp.513-523, 1988. ,
DOI : 10.1016/0306-4573(88)90021-0
Introduction to modern information retrieval, 1986. ,
A vector space model for automatic indexing, Communications of the ACM, vol.18, issue.11, pp.613-620, 1975. ,
DOI : 10.1145/361219.361220
Enhanced Topic-based Vector Space Model for semantics-aware spam filtering, Proceedings of Expert Systems With Applications, pp.437-444 ,
DOI : 10.1016/j.eswa.2011.07.034
A comparison of classifiers and document representations for the routing problem, Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '95, pp.229-237, 1995. ,
DOI : 10.1145/215206.215365
Machine learning in automated text categorization, ACM Computing Surveys, vol.34, issue.1, pp.1-47, 2002. ,
DOI : 10.1145/505282.505283
Eagles preliminary recommendations on corpus typology, 1996. ,
Pivoted document length normalization, Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '96, pp.21-29, 1996. ,
DOI : 10.1145/243199.243206
Cluster Dissection and Analysis: Theory, Fortran Programs, Examples, 1985. ,
Segmenting broadcast news streams using lexical chains, Proceedings of 1st Starting AI Researchers Symposium, pp.145-154, 2002. ,
A generalized vector space model for text retrieval based on semantic relatedness, Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop on, EACL '09, pp.70-78, 2009. ,
DOI : 10.3115/1609179.1609188
Mining the web for synonyms: Pmi-ir versus lsa on toefl, Proceedings of Twelfth European Conference on Machine Learning, pp.491-502, 2001. ,
Features of similarity, Psychological Review, pp.327-352, 1977. ,
A statistical model for domain-independent text segmentation, Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2001. ,
Evaluation of parallel text alignment systems, pp.369-388, 2000. ,
DOI : 10.1007/978-94-017-2535-4_19
Multiple document summarization using principal component analysis incorporating semantic vector space model, Association for Computational Linguistics and Chinese Language Processing, pp.141-156, 2008. ,
A tutorial on spectral clustering, Statistics and Computing, vol.17, pp.395-416, 2007. ,
Identifying event descriptions using co-training with online news summaries, Proceedings of the 5th International Joint Conference on Natural Language Processing, 2011. ,
Geometry and meaning, Center for the Study of Language and Information, 2004. ,
A neural network approach to topic spotting, Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval, 1995. ,
Parameters driving effectiveness of automated essay scoring with lsa, Proceedings of the 9th CAA, 2005. ,
On modeling of information retrieval concepts in vector spaces, ACM Transactions on Database Systems, vol.12, issue.2, pp.299-321, 1987. ,
DOI : 10.1145/22952.22957
Verbs semantics and lexical selection, Proceedings of the 32nd annual meeting on Association for Computational Linguistics, pp.133-138, 1994. ,
DOI : 10.3115/981732.981751
Segmentation of expository texts by hierarchical agglomerative clustering. CoRR, 1997. ,
A hidden Markov model approach to text segmentation and event tracking, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181), pp.333-336, 1998. ,
DOI : 10.1109/ICASSP.1998.674435
An improved model of dotplotting for text segmentation, Journal of Chinese Language and Computing, pp.27-40, 2006. ,
Human Behaviour and the Principle of Least Effort: an Introduction to Human Ecology, 1949. ,