A partial list of publications by Preslav Nakov and co-authors.

  2013
  1. Semantic Relations Between Nominals. Vivi Nastase, Preslav Nakov, Diarmuid Ó Séaghdha, Stan Szpakowicz.
    In Synthesis Lectures on Human Language Technologies 6 (1), 1-119. Morgan & Claypool Publishers.
    [the book on Amazon]
    The BOOK!

    Abstract. People make sense of a text by identifying the semantic relations which connect the entities or concepts described by that text. A system which aspires to human-like performance must also be equipped to identify, and learn from, semantic relations in the texts it processes. Understanding even a simple sentence such as "Opportunity and Curiosity find similar rocks on Mars" requires recognizing relations (rocks are located on Mars, signalled by the word on) and drawing on already known relations (Opportunity and Curiosity are instances of the class of Mars rovers). A language-understanding system should be able to find such relations in documents and progressively build a knowledge base or even an ontology. Resources of this kind assist continuous learning and other advanced language-processing tasks such as text summarization, question answering and machine translation. The book discusses the recognition in text of semantic relations which capture interactions between base noun phrases. After a brief historical background, we introduce a range of relation inventories of varying granularity, which have been proposed by computational linguists. There is also variation in the scale at which systems operate, from snippets all the way to the whole Web, and in the techniques of recognizing relations in texts, from full supervision through weak or distant supervision to self-supervised or completely unsupervised methods. A discussion of supervised learning covers available datasets, feature sets which describe relation instances, and successful algorithms. An overview of weakly supervised and unsupervised learning zooms in on the acquisition of relations from large corpora with hardly any annotated data. We show how bootstrapping from seed examples or patterns scales up to very large text collections on the Web. We also present machine learning techniques in which data redundancy and variability lead to fast and reliable relation extraction. Table of Contents: Introduction / Relations between Nominals, Relations between Concepts / Extracting Semantic Relations with Supervision / Extracting Semantic Relations with Little or No Supervision / Conclusion

  2. Semantic Interpretation of Noun Compounds Using Verbal and Other Paraphrases. Preslav Nakov and Marti Hearst.
    ACM Transactions on Speech and Language Processing. 10(3):13:1-13:51, July 2013.
    [DOI] [Local copy] as allowed by ACM author self-archiving policy

    Abstract. We study the problem of semantic interpretation of noun compounds such as bee honey, malaria mosquito, apple cake, and stem cell. In particular, we explore the potential of using predicates that make explicit the hidden relation that holds between the nouns that form the noun compound. For example, mosquito that carries malaria is a paraphrase of the compound malaria mosquito in which the verb explicitly states the semantic relation between the two nouns. We study the utility of using such paraphrasing verbs, with associated weights, to build a representation of the semantics of a noun compound, e.g., malaria mosquito can be represented as follows: carry (23), spread (16), cause (12), transmit (9), etc. We also explore the potential of using multiple paraphrasing verbs as features for predicting abstract semantic relations such as CAUSE, and we demonstrate that using explicit paraphrases can help improve statistical machine translation.
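
    Code sketch (Python). A minimal illustration of the paraphrasing-verb representation described above: each compound becomes a sparse vector of weighted verbs, which can then be compared across compounds, e.g., with cosine similarity. The malaria mosquito weights are the ones quoted in the abstract; the dengue mosquito vector and the similarity function are hypothetical additions for illustration.

      from math import sqrt

      # Weighted paraphrasing-verb vectors; malaria_mosquito uses the weights
      # from the abstract, dengue_mosquito is made up for the example.
      malaria_mosquito = {"carry": 23, "spread": 16, "cause": 12, "transmit": 9}
      dengue_mosquito = {"carry": 15, "transmit": 11, "spread": 6}

      def cosine(u, v):
          """Cosine similarity between two sparse verb-frequency vectors."""
          dot = sum(w * v[verb] for verb, w in u.items() if verb in v)
          nu = sqrt(sum(w * w for w in u.values()))
          nv = sqrt(sum(w * w for w in v.values()))
          return dot / (nu * nv) if nu and nv else 0.0

      # A high score suggests the two compounds encode the same hidden relation.
      print(round(cosine(malaria_mosquito, dengue_mosquito), 3))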

  3. On the Interpretation of Noun Compounds: Syntax, Semantics, and Entailment. Preslav Nakov.
    Natural Language Engineering. 19(3):291-330, July 2013. Cambridge University Press.
    [DOI] [Local copy] as allowed by Cambridge University Press self-archiving policy

    Abstract. We discuss the problem of interpreting noun compounds such as colon cancer tumor suppressor protein, which pose major challenges for the automatic interpretation of English written text. We present an overview of the more general process of compounding and of noun compounds in particular, as well as of their syntax and semantics from both theoretical and computational linguistics viewpoints, with an emphasis on the latter. Our main focus is on computational approaches to the syntax and semantics of noun compounds: we describe the problems, present the challenges, and discuss the most important lines of research. We also show how understanding noun compound syntax and semantics could help solve textual entailment problems, which would be potentially useful for a number of NLP applications, and which we believe to be an important direction for future research.

  4. On the semantics of noun compounds. Stan Szpakowicz, Francis Bond, Preslav Nakov, Su Nam Kim.
    Natural Language Engineering. 19(3):289-290, July 2013. Cambridge University Press.
    [DOI] [Local copy] as allowed by Cambridge University Press self-archiving policy

  5. A Tale about PRO and Monsters. Preslav Nakov, Francisco Guzman and Stephan Vogel.
    In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL'13). pp. 12-17. August 4-9, 2013, Sofia, Bulgaria.
    [pdf] [local pdf] [bibtex] [slides ppt] [slides pdf]

    Abstract. While experimenting with tuning on long sentences, we made an unexpected discovery: that PRO falls victim to monsters - overly long negative examples with very low BLEU+1 scores, which are unsuitable for learning and can cause testing BLEU to drop by several points absolute. We propose several effective ways to address the problem, using length- and BLEU+1-based cut-offs, outlier filters, stochastic sampling, and random acceptance. The best of these fixes not only slay and protect against monsters, but also yield higher stability for PRO as well as improved test-time BLEU scores. Thus, we recommend them to anybody using PRO, monster believer or not.
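
    Code sketch (Python). A rough picture of PRO-style pair sampling with the kind of "monster" filtering proposed above: candidates that are overly long or score very low on BLEU+1 are removed from the pool before pairs are sampled. The thresholds and the scoring interface are illustrative, not the paper's exact settings.

      import random

      def pro_sample_pairs(nbest, bleu1, ref_len, n_samples=5000, keep=50,
                           min_bleu1=0.05, max_len_ratio=2.0, min_diff=0.05):
          """nbest: hypothesis strings; bleu1: sentence-level BLEU+1 scorer."""
          # Monster filter: drop overly long, very low-scoring candidates.
          pool = [h for h in nbest
                  if bleu1(h) >= min_bleu1
                  and len(h.split()) <= max_len_ratio * ref_len]
          pairs = []
          for _ in range(n_samples):
              a, b = random.sample(pool, 2)
              if abs(bleu1(a) - bleu1(b)) >= min_diff:  # PRO's sampling step
                  pairs.append((a, b) if bleu1(a) > bleu1(b) else (b, a))
          # Keep the pairs with the largest score differential for training.
          pairs.sort(key=lambda p: bleu1(p[0]) - bleu1(p[1]), reverse=True)
          return pairs[:keep]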

  6. A non-IID Framework for Collaborative Filtering with Restricted Boltzmann Machines. Kostadin Georgiev, Preslav Nakov.
    In Proceedings of the International Conference on Machine Learning (ICML'13). June 17-19, 2013, Atlanta, GA, USA.
    [preprint pdf] [slides ppt] [slides pdf] [poster ppt] [poster pdf]

    Abstract. We propose a framework for collaborative filtering based on Restricted Boltzmann Machines (RBM), which extends previous RBM-based approaches in several important directions. First, while previous RBM research has focused on modeling the correlation between item ratings, we model both user-user and item-item correlations in a unified hybrid non-IID framework. We further use real values in the visible layer as opposed to multinomial variables, thus taking advantage of the natural order between user-item ratings. Finally, we explore the potential of combining the original training data with data generated by the RBM-based model itself in a bootstrapping fashion. The evaluation on two MovieLens datasets (with 100K and 1M user-item ratings, respectively), shows that our RBM model rivals the best previously-proposed approaches.
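
    Code sketch (Python). A toy contrastive-divergence (CD-1) update for an RBM whose visible layer holds real values, the modelling choice the abstract highlights; unit-variance Gaussian visible units are assumed, and the unified user-item framework and the bootstrapping step are not sketched.

      import numpy as np

      rng = np.random.default_rng(0)

      def sigmoid(x):
          return 1.0 / (1.0 + np.exp(-x))

      def cd1_update(V, W, a, b, lr=0.01):
          """One CD-1 step; V holds real-valued ratings (examples x visible)."""
          h_prob = sigmoid(V @ W + b)                 # positive phase
          h_samp = (rng.random(h_prob.shape) < h_prob) * 1.0
          v_recon = h_samp @ W.T + a                  # Gaussian mean reconstruction
          h_recon = sigmoid(v_recon @ W + b)          # negative phase
          W += lr * (V.T @ h_prob - v_recon.T @ h_recon) / len(V)
          a += lr * (V - v_recon).mean(axis=0)
          b += lr * (h_prob - h_recon).mean(axis=0)

      # Toy run: 4 users x 6 items, ratings rescaled to [0, 1].
      V = rng.random((4, 6))
      W = 0.01 * rng.standard_normal((6, 8))
      a, b = np.zeros(6), np.zeros(8)
      for _ in range(100):
          cd1_update(V, W, a, b)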

  7. Analyzing the Use of Character-Level Translation with Sparse and Noisy Datasets. Jörg Tiedemann, Preslav Nakov.
    In Proceedings of RANLP (RANLP'13). September, 2013, Hissar, Bulgaria.
    [preprint pdf] [slides ppt] [slides pdf]

    Abstract. This paper provides an analysis of character-level machine translation models used in pivot-based translation when applied to sparse and noisy datasets, such as crowdsourced movie subtitles. In our experiments, we find that such character-level models cut the number of untranslated words by over 40% and are especially competitive (improvements of 2-3 BLEU points) in the case of limited training data. We explore the impact of character alignment, phrase table filtering, bitext size and the choice of pivot language on translation quality. We further compare cascaded translation models to the use of synthetic training data via multiple pivots, and we find that the latter works significantly better. Finally, we demonstrate that neither word- nor character-BLEU correlate perfectly with human judgments, due to BLEU's sensitivity to length.
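
    Code sketch (Python). Character-level translation models are typically trained on bitexts rewritten so that every character is a token, with an explicit marker for word boundaries; a minimal version of that preprocessing (the marker symbol is an arbitrary choice, not necessarily the paper's):

      SPACE = "_"  # stand-in token for the original word boundary

      def to_char_level(sentence):
          """'no hay pan' -> 'n o _ h a y _ p a n'"""
          return " ".join(SPACE if c == " " else c for c in sentence)

      def from_char_level(decoded):
          """Invert the transformation after decoding."""
          return "".join(" " if t == SPACE else t for t in decoded.split())

      assert from_char_level(to_char_level("no hay pan")) == "no hay pan"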

  8. Parameter Optimization for Statistical Machine Translation: It Pays to Learn from Hard Examples. Preslav Nakov, Fahad Al Obaidli, Francisco Guzman and Stephan Vogel.
    In Proceedings of RANLP (RANLP'13). September, 2013, Hissar, Bulgaria.
    [preprint pdf] [poster ppt] [poster pdf]

    Abstract. Research on statistical machine translation has focused on particular translation directions, typically with English as the target language, e.g., from Arabic to English. When we reverse the translation direction, the multiple reference translations turn into multiple possible inputs, which offers both challenges and opportunities. We propose and evaluate several strategies for making use of these multiple inputs: (a) select one of the datasets, (b) select the best input for each sentence, and (c) synthesize an input for each sentence by fusing the available inputs. Surprisingly, we find out that it is best to tune on the hardest available input, not on the one that yields the highest BLEU score. This finding has implications on how to pick good translators and how to select useful data for parameter optimization in SMT.
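
    Code sketch (Python). The headline finding reduced to a one-liner: among several candidate tuning inputs, prefer the hardest one, i.e., the one on which a baseline system scores lowest. The score table is made up for illustration.

      def pick_tuning_input(candidates, baseline_bleu):
          """Strategy from the abstract: tune on the hardest available input."""
          return min(candidates, key=baseline_bleu)

      scores = {"translator_A": 31.2, "translator_B": 27.9, "translator_C": 29.4}
      print(pick_tuning_input(scores, scores.get))  # -> translator_B (hardest)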

  9. QCRI at IWSLT 2013: Experiments in Arabic-English and English-Arabic Spoken Language Translation. Hassan Sajjad, Francisco Guzman, Preslav Nakov, Ahmed Abdelali, Kenton Murray, Fahad Al Obaidli, Stephan Vogel.
    In Proceedings of IWSLT (IWSLT'13). December 5-6, 2013, Heidelberg, Germany.
    [preprint pdf]

    Abstract. We describe the Arabic-English and English-Arabic statistical machine translation systems developed by the Qatar Computing Research Institute for the IWSLT'2013 evaluation campaign on spoken language translation. We used one phrase-based and two hierarchical decoders, exploring various settings thereof. We further experimented with three domain adaptation methods, and with various Arabic word segmentation schemes. Combining the output of several systems yielded a gain of up to 3.4 BLEU points over the baseline. Here we also describe a specialized normalization scheme for evaluating Arabic output, which was adopted for the IWSLT'2013 evaluation campaign.

  10. SemEval-2013 Task 2: Sentiment Analysis in Twitter. Preslav Nakov, Sara Rosenthal, Zornitsa Kozareva, Veselin Stoyanov, Alan Ritter, Theresa Wilson.
    In Proceedings of the Second Joint Conference on Lexical and Computational Semantics (*SEM'13), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval'13). pp. 312-320. June 14-15, 2013, Atlanta, GA, USA.
    [pdf] [local pdf] [bibtex] [slides ppt] [slides pdf]

    Abstract. In recent years, sentiment analysis in social media has attracted a lot of research interest and has been used for a number of applications. Unfortunately, research has been hindered by the lack of suitable datasets, complicating the comparison between approaches. To address this issue, we have proposed SemEval-2013 Task 2: Sentiment Analysis in Twitter, which included two subtasks: A, an expression-level subtask, and B, a message-level subtask. We used crowdsourcing on Amazon Mechanical Turk to label a large Twitter training dataset along with additional test sets of Twitter and SMS messages for both subtasks. All datasets used in the evaluation are released to the research community. The task attracted significant interest and a total of 149 submissions from 44 teams. The best-performing team achieved an F1 of 88.9% and 69% for subtasks A and B, respectively.

  11. SemEval-2013 Task 4: Free Paraphrases of Noun Compounds. Iris Hendrickx, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Stan Szpakowicz, Tony Veale.
    In Proceedings of the Second Joint Conference on Lexical and Computational Semantics (*SEM'13), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval'13). pp. 138-143. June 14-15, 2013, Atlanta, GA, USA.
    [pdf] [local pdf] [bibtex] [slides pdf]

    Abstract. In this paper, we describe SemEval-2013 Task 4: the definition, the data, the evaluation and the results. The task is to capture some of the meaning of English noun compounds via paraphrasing. Given a two-word noun compound, the participating system is asked to produce an explicitly ranked list of its free-form paraphrases. The list is automatically compared and evaluated against a similarly ranked list of paraphrases proposed by human annotators, recruited and managed through Amazon's Mechanical Turk. The comparison of raw paraphrases is sensitive to syntactic and morphological variation. The "gold" ranking is based on the relative popularity of paraphrases among annotators. To make the ranking more reliable, highly similar paraphrases are grouped, so as to downplay superficial differences in syntax and morphology. Three systems participated in the task. They all beat a simple baseline on one of the two evaluation measures, but not on both measures. This shows that the task is difficult.

  2012
  1. Improving Statistical Machine Translation for a Resource-Poor Language Using Related Resource-Rich Languages. Preslav Nakov, Hwee Tou Ng. Journal of Artificial Intelligence Research (JAIR), vol.44, pp. 179-222, May 2012.
    [JAIR site] [DOI] [pdf] [local pdf] [bibtex]

    Abstract. We propose a novel language-independent approach for improving machine translation for resource-poor languages by exploiting their similarity to resource-rich ones. More precisely, we improve the translation from a resource-poor source language X1 into a resource-rich language Y given a bi-text containing a limited number of parallel sentences for X1-Y and a larger bi-text for X2-Y for some resource-rich language X2 that is closely related to X1. This is achieved by taking advantage of the opportunities that vocabulary overlap and similarities between the languages X1 and X2 in spelling, word order, and syntax offer: (1) we improve the word alignments for the resource-poor language, (2) we further augment it with additional translation options, and (3) we take care of potential spelling differences through appropriate transliteration. The evaluation for Indonesian->English using Malay and for Spanish->English using Portuguese and pretending Spanish is resource-poor shows an absolute gain of up to 1.35 and 3.37 BLEU points, respectively, which is an improvement over the best rivaling approaches, while using much less additional data. Overall, our method cuts the amount of necessary "real" training data by a factor of 2–5.

  2. Do Peers See More in a Paper than its Authors? Anna Divoli, Preslav Nakov, Marti Hearst. Advances in Bioinformatics (ABI) Volume 2012, Article ID 750214, doi:10.1155/2012/750214.
    [journal site] [pdf] [local pdf]

    Abstract. Recent years have shown a gradual shift in the content of biomedical publications that is freely accessible: from titles and abstracts to full text. This has enabled new forms of automatic text analysis, and has given rise to some interesting inter-related questions: How informative is the abstract compared to the full text? What important information in the full text document does not appear in the abstract? What should a summary of the full text document contain that is not already in the abstract? What are the differences in the way authors and peers see an article? We answer these questions using an underexplored information source: the sentences containing the citations to a target article, or citances. In particular, we compare the information content of the abstract of a biomedical journal article to the information in citances, i.e., we contrast the important points about an article as judged by its authors vs. as seen by peer researchers over the years after its publication. We thus use citances as an indirect way to look for important information in the full text; the idea is that any information that is not mentioned in the abstract but is important enough to be referred to in citances should be coming from the full text.
    Focusing on the area of molecular interactions, we perform a small-scale detailed manual analysis and we find that the set of all citances citing a given target article not only cover most information (entities, functions, experimental methods, and other biological concepts) found in the abstract of that article, but also contain 20% more concepts, mainly related to experimental procedures. We further present a detailed summary of the differences across different information types, and we examine the effects other citations and time have on the content of citances. Moreover, we perform large-scale fully automatic comparison of citances and abstracts, finding a very similar trend. Finally, we propose a new way to leverage peer expertise in assigning relationships among entities and concepts by utilizing citances and formal resources such as MeSH.

  3. Source Language Adaptation for Resource-Poor Machine Translation. Pidong Wang, Preslav Nakov, Hwee Tou Ng. In Proceedings of EMNLP-CoNLL (EMNLP-CoNLL'12). 2012, Jeju, Korea.
    [pdf] [local pdf] [bibtex] [slides ppt] [slides pdf]

    Abstract. We propose a novel language-independent approach for improving machine translation from a resource-poor language to X by adapting a large bi-text for a related resource-rich language and X (the same target language). We assume a small bi-text for the resource-poor language to X pair, which we use to learn word-level and phrase-level paraphrases and cross-lingual morphological variants between the resource-rich and the resource-poor language; we then adapt the former to get closer to the latter. Our experiments for Indonesian/Malay–English translation show that using the large adapted resource-rich bi-text yields 6.7 BLEU points of improvement over the unadapted one and 2.6 BLEU points over the original small bi-text. Moreover, combining the small bi-text with the adapted bi-text outperforms the corresponding combinations with the unadapted bi-text by 1.5-3 BLEU points. We also demonstrate applicability to other languages and domains.

  4. Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages. Preslav Nakov, Jörg Tiedemann. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL'12). 2012, Jeju, Korea.
    [pdf] [local pdf] [bibtex]

    Abstract. We propose several techniques for improving statistical machine translation between closely-related languages with scarce resources. We use character-level translation trained on n-gram-character-aligned bitexts and tuned using word-level BLEU, which we further augment with character-based transliteration at the word level and combine with a word-level translation model. The evaluation on Macedonian-Bulgarian movie subtitles shows an improvement of 2.84 BLEU points over a phrase-based word-level baseline.

  5. Optimizing for Sentence-Level BLEU+1 Yields Short Translations. Preslav Nakov, Francisco Guzman, Stephan Vogel. In Proceedings of the 24th International Conference on Computational Linguistics (COLING'12). 2012, Mumbai, India.
    [pdf] [local pdf] [bibtex]

    Abstract. We study a problem with pairwise ranking optimization (PRO): that it tends to yield too short translations. We find that this is partially due to the inadequate smoothing in PRO's BLEU+1, which boosts the precision component of BLEU but leaves the brevity penalty unchanged, thus destroying the balance between the two, compared to BLEU. It is also partially due to PRO optimizing for a sentence-level score without a global view on the overall length, which introduces a bias towards short translations; we show that letting PRO optimize a corpus-level BLEU yields a perfect length. Finally, we find some residual bias due to the interaction of PRO with BLEU+1: such a bias does not exist for a version of MIRA with sentence-level BLEU+1. We propose several ways to fix the length problem of PRO, including smoothing the brevity penalty, scaling the effective reference length, grounding the precision component, and unclipping the brevity penalty, which yield sizable improvements in test BLEU on two Arabic-English datasets: IWSLT (+0.65) and NIST (+0.37).
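
    Code sketch (Python). Sentence-level BLEU+1 adds one to the n-gram precision counts for n > 1 but leaves the brevity penalty untouched, the imbalance described above. The smooth_bp flag grows the effective reference length by one, a simple brevity-penalty smoothing in the spirit of (but not identical to) the paper's fixes.

      import math
      from collections import Counter

      def ngrams(tokens, n):
          return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

      def bleu_plus1(hyp, ref, max_n=4, smooth_bp=False):
          hyp_t, ref_t = hyp.split(), ref.split()
          log_prec = 0.0
          for n in range(1, max_n + 1):
              h, r = ngrams(hyp_t, n), ngrams(ref_t, n)
              match = sum(min(c, r[g]) for g, c in h.items())
              add = 0 if n == 1 else 1  # the "+1" smoothing of the precision
              num, den = match + add, sum(h.values()) + add
              if num == 0 or den == 0:
                  return 0.0
              log_prec += math.log(num / den) / max_n
          ref_len = len(ref_t) + (1 if smooth_bp else 0)
          log_bp = min(0.0, 1.0 - ref_len / len(hyp_t))  # brevity penalty
          return math.exp(log_bp + log_prec)

      # Short hypotheses are penalized more once the brevity penalty is smoothed.
      print(bleu_plus1("the cat sat", "the cat sat on the mat"))
      print(bleu_plus1("the cat sat", "the cat sat on the mat", smooth_bp=True))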

  6. Feature-rich Part-of-speech Tagging for Morphologically Complex Languages: Application to Bulgarian. Georgi Georgiev, Valentin Zhikov, Kiril Simov, Petya Osenova, and Preslav Nakov. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL'12). pp. 492-502. April 23-27, 2012, Avignon, France.
    [pdf] [local pdf] [bibtex]

    Abstract. We present experiments with part-of-speech tagging for Bulgarian, a Slavic language with rich inflectional and derivational morphology. Unlike most previous work, which has used a small number of grammatical categories, we work with 680 morpho-syntactic tags. We combine a large morphological lexicon with prior linguistic knowledge and guided learning from a POS-annotated corpus, achieving accuracy of 97.98%, which is a significant improvement over the state-of-the-art for Bulgarian.

  7. QCRI at WMT12: Experiments in Spanish-English and German-English Machine Translation of News Text. Francisco Guzman, Preslav Nakov, Ahmed Thabet, Stephan Vogel. In Proceedings of the Seventh Workshop on Statistical Machine Translation (WMT'12), co-located with NAACL'12, pp. 298-303, June 7-8, 2012, Montreal, Quebec, Canada.
    [pdf] [local pdf] [bibtex]

    Abstract. We describe the systems developed by the team of the Qatar Computing Research Institute for the WMT12 Shared Translation Task. We used a phrase-based statistical machine translation model with several non-standard settings, most notably tuning data selection and phrase table combination. The evaluation results show that we rank second in BLEU and TER for Spanish-English, and in the top tier for German-English.

  2011

  1. Translating from Morphologically Complex Languages: A Paraphrase-Based Approach. Preslav Nakov and Hwee Tou Ng. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL'11). pp. 1298-1307. June 20-22, 2011, Portland, Oregon, USA.
    (acceptance rate: 116/634=18% for full papers)
    [pdf] [bibtex] [slides ppt] [slides pdf]

    Abstract. We propose a novel approach to translating from a morphologically complex language. Unlike previous research, which has targeted word inflections and concatenations, we focus on the pairwise relationship between morphologically related words, which we treat as potential paraphrases and handle using paraphrasing techniques at the word, phrase, and sentence level. An important advantage of this framework is that it can cope with derivational morphology, which has so far remained largely beyond the capabilities of statistical machine translation systems. Our experiments translating from Malay, whose morphology is mostly derivational, into English show significant improvements over rivaling approaches based on five automatic evaluation measures (for 320,000 sentence pairs; 9.5 million English word tokens).

  2. Large-Scale Noun Compound Interpretation Using Bootstrapping and the Web as a Corpus. Su Nam Kim and Preslav Nakov. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP'11). pp. 648-658. July 27-29, 2011, Edinburgh, Scotland, UK.
    (acceptance rate: 95/628=15% for oral presentations, 149/628=24% for full papers)
    [pdf] [bibtex] [slides]

    Abstract. Responding to the need for semantic lexical resources in natural language processing applications, we examine methods to acquire noun compounds (NCs), e.g., "orange juice", together with suitable fine-grained semantic interpretations, e.g., "squeezed from", which are directly usable as paraphrases. We employ bootstrapping and web statistics, and utilize the relationship between NCs and paraphrasing patterns to jointly extract NCs and such patterns in multiple alternating iterations. In evaluation, we found that having one compound noun fixed yields both a higher number of semantically interpreted NCs and improved accuracy due to stronger semantic restrictions.
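
    Code sketch (Python). The alternating bootstrapping loop in schematic form: known compounds vote for paraphrasing patterns, and harvested patterns then license new compounds. The (head, pattern, modifier) triples, which the paper would obtain from web statistics, are toy data here.

      def bootstrap(seed_compounds, triples, iterations=3):
          """triples: iterable of (n1, pattern, n2), e.g. mined from snippets."""
          compounds, patterns = set(seed_compounds), set()
          for _ in range(iterations):
              # Compounds -> patterns: what links the nouns of known NCs?
              patterns |= {p for n1, p, n2 in triples if (n1, n2) in compounds}
              # Patterns -> compounds: which noun pairs match known patterns?
              compounds |= {(n1, n2) for n1, p, n2 in triples if p in patterns}
          return compounds, patterns

      triples = [("juice", "squeezed from", "orange"),
                 ("juice", "squeezed from", "apple"),
                 ("cake", "made with", "apple")]
      print(bootstrap({("juice", "orange")}, triples))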

  3. Combining Relational and Attributional Similarity for Semantic Relation Classification. Preslav Nakov and Zornitsa Kozareva. In Proceedings of the 8th Conference on Recent Advances in Natural Language Processing (RANLP'11). pp. 323-330. September 12-14, 2011, Borovets, Bulgaria.
    (acceptance rate: 30/180=17% for full papers, oral presentation)
    *** Young Researcher Award ***
    [pdf] [bibtex] [slides ppt] [slides pdf]

    Abstract. We combine relational and attributional similarity for the task of identifying instances of semantic relations, such as PRODUCT-PRODUCER and ORIGIN-ENTITY, between nominals in text. We use no pre-existing lexical resources, thus simulating a realistic real-world situation, where the coverage of any such resource is limited. Instead, we mine the Web to automatically extract patterns (verbs, prepositions and coordinating conjunctions) expressing the relationship between the relation arguments, as well as hypernyms and co-hyponyms of the arguments, which we use in instance-based classifiers. The evaluation on the dataset of SemEval-1 Task 4 shows an improvement over the state-of-the-art for the case where using manually annotated WordNet senses is not allowed.
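
    Code sketch (Python). The instance-based flavour of the approach, assuming the pattern mining has already happened: each nominal pair is a bag of linking patterns (verbs, prepositions, conjunctions), and a test pair takes the majority label of its most similar training pairs. The Dice-style similarity and the value of k are illustrative.

      from collections import Counter

      def dice(u, v):
          """Overlap between two bags (Counters) of linking patterns."""
          return 2 * sum((u & v).values()) / max(1, sum(u.values()) + sum(v.values()))

      def knn_label(query, train, k=3):
          """train: list of (pattern_bag, relation_label) pairs."""
          top = sorted(train, key=lambda ex: dice(query, ex[0]), reverse=True)[:k]
          return Counter(label for _, label in top).most_common(1)[0][0]

      train = [(Counter({"produces": 5, "makes": 3}), "PRODUCT-PRODUCER"),
               (Counter({"comes from": 4, "of": 2}), "ORIGIN-ENTITY"),
               (Counter({"manufactures": 2, "makes": 1}), "PRODUCT-PRODUCER")]
      print(knn_label(Counter({"makes": 2, "produces": 1}), train, k=1))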

  4. Building a Named Entity Recognizer in Three Days: Application to Disease Name Recognition in Bulgarian Epicrises. Georgi Georgiev, Valentin Zhikov, Borislav Popov, and Preslav Nakov. In Proceedings of the RANLP'2011 Workshop on Biomedical Natural Language Processing (BiomedicalNLP'11). pp. 27-34. September 15, 2011, Borovets, Bulgaria.
    [pdf] [bibtex]

    Abstract. We describe experiments with building a recognizer for disease names in Bulgarian clinical epicrises, where both the language and the domain are different from those in mainstream research, which has focused on PubMed articles in English. We show that using a general framework such as GATE and an appropriate pragmatic methodology can yield a significant speed-up of the manual annotation: we achieve F1=0.81 in just three days. This is the first step towards our ultimate goal: named entity normalization with respect to ICD-10.

  5. Proceedings of the ACL 2011 Workshop on Relational Models of Semantics. Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Stan Szpakowicz (editors). June 23, 2011, Portland, Oregon, USA. Association for Computational Linguistics. 91 pages. ISBN 978-1-932432-98-5.
    [pdf] [bibtex] [RELMS'11 website]


  6. Proceedings of the RANLP 2011 Workshop on Information Extraction and Knowledge Acquisition. Preslav Nakov, Zornitsa Kozareva, Kuzman Ganchev, Jerry Hobbs (editors). September 16, 2011, Hissar, Bulgaria. 47 pages.
    [pdf] [bibtex] [IEKA'11 website]


  7. Proceedings of Symposium on Learning Language Models from Multilingual Corpora. Dimitar Kazakov, Preslav Nakov, Ahmad R. Shahid (editors). AISB'11 Convention, University of York, York, UK, April 4-7, 2011. Society for the Study of Artificial Intelligence and the Simulation of Behaviour. 23 pages. ISBN 978-1-908187-05-5.
    [Symposium website]


  2010

  1. A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages. Minh-Thang Luong, Preslav Nakov and Min-Yen Kan. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP'10), pp. 148-157, MIT, Massachusetts, USA, October 9-11.
    (acceptance rate: 14% for oral presentation, 25% for full papers)
    [pdf] [bibtex] [slides]

    Abstract. We propose a language-independent approach for improving statistical machine translation for morphologically rich languages using a hybrid morpheme-word representation where the basic unit of translation is the morpheme, but word boundaries are respected at all stages of the translation process. Our model extends the classic phrase-based model by means of (1) word boundary-aware morpheme-level phrase extraction, (2) minimum error-rate training for a morpheme-level translation model using word-level BLEU, and (3) joint scoring with morpheme- and word-level language models. Further improvements are achieved by combining our model with the classic one. The evaluation on English to Finnish using Europarl (714K sentence pairs; 15.5M English words) shows statistically significant improvements over the classic model based on BLEU and human judgments.

  2. Morphological Analysis for Resource-Poor Machine Translation. Ming-Feng Tsai, Preslav Nakov, and Hwee Tou Ng. National University of Singapore, Department of Computer Science, Technical Report TR22/10, December 2010.
    [pdf]

    Abstract. In statistical machine translation, word-to-word probabilities are usually difficult to estimate because of the problem of data sparseness, especially for resource-poor languages. Furthermore, this problem would become more serious for translation from morphologically complex languages such as Malay or Indonesian to morphologically simple ones such as English, since we need to be able to translate word forms in many different morphological variants. This paper conducts a morphological analysis for such resource-poor and morphologically rich machine translation: one is Malay-English machine translation; another is Indonesian-English. Specifically, we use morphological analysis to modify the unknown words of morphologically complex languages, and explore the effect of using the modified input on translation quality with varying number of training sentences. In our experiments, a number of trials were carried out to assess the performance of the proposed approach. The experimental results show that our proposed method can improve translation quality when the rate of unknown words is higher than 20%, and the improvement gradually increases as the unknown word rate increases.
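
    Code sketch (Python). The core idea, naively: when a source word is unknown to the translation model, strip Malay/Indonesian-style affixes until a known form appears. The affix lists are a small illustrative subset; a real system would rely on a proper morphological analyser.

      PREFIXES = ("meng", "mem", "men", "me", "ber", "ter", "di", "pe")
      SUFFIXES = ("kan", "nya", "an", "i")

      def simplify_unknown(word, vocab):
          """Replace an out-of-vocabulary word with a known morphological base."""
          if word in vocab:
              return word
          for p in PREFIXES:
              if word.startswith(p) and word[len(p):] in vocab:
                  return word[len(p):]
          for s in SUFFIXES:
              if word.endswith(s) and word[:-len(s)] in vocab:
                  return word[:-len(s)]
          return word  # leave it to the decoder's unknown-word handling

      print(simplify_unknown("makanan", {"makan"}))  # -> makan ('to eat')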

  3. SemEval-2010 Task 8: Multi-Way Classification of Semantic Relations Between Pairs of Nominals. Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, Stan Szpakowicz. Proceedings of the 5th International Workshop on Semantic Evaluation (SemEval-2), pp. 33-38, Uppsala, Sweden, July 15-16, 2010.
    [pdf] [bibtex] [slides] [task 8 website]

    Abstract. SemEval-2 Task 8 focuses on Multi-way classification of semantic relations between pairs of nominals. The task was designed to compare different approaches to semantic relation classification and to provide a standard testbed for future research. This paper defines the task, describes the training and test data and the process of their creation, lists the participating systems (10 teams, 28 runs), and discusses their results.

  4. SemEval-2 Task 9: The Interpretation of Noun Compounds Using Paraphrasing Verbs and Prepositions. Cristina Butnariu, Su Nam Kim, Preslav Nakov, Diarmuid Ó Séaghdha, Stan Szpakowicz, Tony Veale. Proceedings of the 5th International Workshop on Semantic Evaluation (SemEval-2), pp. 39-44, Uppsala, Sweden, July 15-16, 2010.
    [pdf] [bibtex] [slides] [task 9 website]

    Abstract. Previous research has shown that the meaning of many noun-noun compounds N1 N2 can be approximated reasonably well by paraphrasing clauses of the form "N2 that ... N1", where ... stands for a verb with or without a preposition. For example, malaria mosquito is a "mosquito that carries malaria". Evaluating the quality of such paraphrases is the theme of Task 9 at SemEval-2010. This paper describes some background, the task definition, the process of data collection and the task results. We also venture a few general conclusions before the participating teams present their systems at the SemEval-2010 workshop. There were 5 teams who submitted 7 systems.

  5. Proceedings of the Sixth Asia Information Retrieval Societies Conference (AIRS 2010), Lecture Notes in Computer Science (LNCS), Volume 6458. Pu-Jen Cheng, Min-Yen Kan, Wai Lam, and Preslav Nakov (editors). Taiwan, December 1-3, 2010, Springer, 650 pages.
    [book website] [AIRS'2010 website]

  6. Proceedings of the 2010 Workshop on Multiword Expressions: from Theory to Applications. Eric Laporte, Preslav Nakov, Carlos Ramisch, Aline Villavicencio (editors). Beijing, China, August 28, 2010, Association for Computational Linguistics, 101 pages.
    [pdf] [bibtex] [MWE'2010 website]

  2009

    1. Classification of Semantic Relations between Nominals. Roxana Girju, Preslav Nakov, Vivi Nastase, Stan Szpakowicz, Peter Turney, Deniz Yuret. Journal of Language Resources and Evaluation (LRE). 43(2):105-121. June, 2009.

      Abstract. The NLP community has shown a renewed interest in deeper semantic analyses, among them automatic recognition of semantic relations in text. We present the development and evaluation of a semantic analysis task: automatic recognition of relations between pairs of nominals in a sentence. The task was part of SemEval-2007, the fourth edition of the semantic evaluation event previously known as SensEval. Apart from the observations we have made, the long-lasting effect of this task may be a framework for comparing approaches to the task. We introduce the problem of recognizing relations between nominals, and in particular the process of drafting and refining the definitions of the semantic relations. We show how we created the training and test data, list and briefly describe the 15 participating systems, discuss the results, and conclude with the lessons learned in the course of this exercise.

    2. Improved Statistical Machine Translation for Resource-Poor Languages Using Related Resource-Rich Languages. Preslav Nakov and Hwee Tou Ng. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP'09). pp. 1358-1367, Singapore, August, 2009.
      (263/475=34% acceptance rate for full papers)
      [pdf] [bibtex] [slides ppt] [slides pdf]

      Abstract. We propose a novel language-independent approach for improving statistical machine translation for resource-poor languages by exploiting their similarity to resource-rich ones. More precisely, we improve the translation from a resource-poor source language X1 into a resource-rich language Y given a bi-text containing a limited number of parallel sentences for X1-Y and a larger bi-text for X2-Y for some resource-rich language X2 that is closely related to X1. The evaluation for Indonesian->English (using Malay) and Spanish->English (using Portuguese and pretending Spanish is resource-poor) shows an absolute gain of up to 1.35 and 3.37 Bleu points, respectively, which is an improvement over the rivaling approaches, while using much less additional data.

    3. Unsupervised Extraction of False Friends from Parallel Bi-Texts Using the Web as a Corpus. Svetlin Nakov, Preslav Nakov, and Elena Paskaleva. Proceedings of the Recent Advances in Natural Language Processing (RANLP'09), pp. 292-298, Borovetz, Bulgaria, September 14-16, 2009.
      [pdf] [bibtex] [slides ppt] [slides pdf]

      Abstract. False friends are pairs of words in two languages that are perceived as similar, but have different meanings, e.g., Gift in German means poison in English. In this paper, we present several unsupervised algorithms for acquiring such pairs from a sentence-aligned bi-text. First, we try different ways of exploiting simple statistics about monolingual word occurrences and cross-lingual word cooccurrences in the bi-text. Second, using methods from statistical machine translation, we induce word alignments in an unsupervised way, from which we estimate lexical translation probabilities, which we use to measure cross-lingual semantic similarity. Third, we experiment with a semantic similarity measure that uses the Web as a corpus to extract local contexts from text snippets returned by a search engine, and a bilingual glossary of known word translation pairs, used as "bridges". Finally, all measures are combined and applied to the task of identifying likely false friends. The evaluation for Russian and Bulgarian shows a significant improvement over previously-proposed algorithms.
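
      Code sketch (Python). One of the signal combinations described above, in miniature: a word pair is a false-friend candidate when its orthographic similarity is high but its alignment-based translation probability is low. The similarity measure, the 0.7 cut-off, and the toy probability are all illustrative, and the example pair is transliterated to Latin script.

        from difflib import SequenceMatcher

        def false_friend_score(w1, w2, trans_prob):
            """High orthographic similarity + low translation probability."""
            ortho = SequenceMatcher(None, w1, w2).ratio()
            if ortho < 0.7:  # not perceived as similar: not a candidate
                return 0.0
            return ortho * (1.0 - trans_prob)

        # Bulgarian 'gora' (forest) vs. Russian 'gora' (mountain): identical
        # spelling, but the two words are almost never aligned in a bi-text.
        print(false_friend_score("gora", "gora", trans_prob=0.01))  # ~0.99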

    4. Feature-Rich Named Entity Recognition for Bulgarian Using Conditional Random Fields. Georgi Georgiev, Preslav Nakov, Kuzman Ganchev, Petya Osenova, Kiril Simov. Proceedings of the Recent Advances in Natural Language Processing (RANLP'09), Borovetz, Bulgaria, September 14-16, 2009.
      [pdf] [bibtex]

      Abstract. The paper presents a feature-rich approach to the automatic recognition and categorization of named entities (persons, organizations, locations, and miscellaneous) in news text for Bulgarian. We combine well-established features used for other languages with language-specific lexical, syntactic and morphological information. In particular, we make use of the rich tagset annotation of the BulTreeBank (680 morpho-syntactic tags), from which we derive suitable task-specific tagsets (local and nonlocal). We further add domain-specific gazetteers and additional unlabeled data, achieving F1=89.4%, which is comparable to the state-of-the-art results for English.

    5. Language-Independent Sentiment Analysis Using Subjectivity and Positional Information. Veselin Raychev and Preslav Nakov. Proceedings of the Recent Advances in Natural Language Processing (RANLP'09), Borovetz, Bulgaria, September 14-16, 2009.
      [pdf] [bibtex]

      Abstract. We describe a novel language-independent approach to the task of determining the polarity, positive or negative, of the author's opinion on a specific topic in natural language text. In particular, weights are assigned to attributes, individual words or word bi-grams, based on their position and on their likelihood of being subjective. The subjectivity of each attribute is estimated in a two-step process, where first the probability of being subjective is calculated for each sentence containing the attribute, and then these probabilities are used to alter the attribute's weights for polarity classification. The evaluation results on a standard dataset of movie reviews show 89.85% classification accuracy, which rivals the best previously published results for this dataset for systems that use no additional linguistic information nor external resources.
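
      Code sketch (Python). The two signals combined above, reduced to their simplest form: each attribute's weight is scaled by the estimated subjectivity of the sentence it occurs in and by that sentence's position in the review. The linear position weight and the assumed sentence-level subjectivity estimator are illustrative.

        def weighted_attributes(sentences, subj_prob):
            """sentences: list of strings; subj_prob: P(sentence is subjective)."""
            weights = {}
            n = len(sentences)
            for i, sent in enumerate(sentences):
                position = (i + 1) / n  # later sentences weigh more
                w = subj_prob(sent) * position
                for token in sent.lower().split():
                    weights[token] = weights.get(token, 0.0) + w
            return weights

        review = ["The plot concerns a heist.", "I loved every minute."]
        print(weighted_attributes(review, lambda s: 0.9 if "loved" in s else 0.2))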

    6. NUS at WMT09: Domain Adaptation Experiments for English-Spanish Machine Translation of News Commentary Text. Preslav Nakov and Hwee Tou Ng. In Proceedings of the Fourth Workshop on Statistical Machine Translation (WMT'09), in conjunction with EACL'09. pp. 75-79, Athens, Greece, March, 2009.
      [pdf] [bibtex]

      Abstract. We describe the system developed by the team of the National University of Singapore for English to Spanish machine translation of News Commentary text for the WMT09 Shared Translation Task. Our approach is based on domain adaptation, combining a small in-domain News Commentary bi-text and a large out-of-domain one from the Europarl corpus, from which we built and combined two separate phrase tables. We further combined two language models (in-domain and out-of-domain), and we experimented with cognates, improved tokenization and recasing, achieving the highest lowercased NIST score of 6.963 and the second best lowercased Bleu score of 24.91% for training without using additional external data for English-to-Spanish translation at the shared task.

    7. SemEval-2010 Task 8: Multi-Way Classification of Semantic Relations Between Pairs of Nominals. Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, Stan Szpakowicz. Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009), in conjunction with NAACL-HLT 2009. pp. 94-99, Boulder, Colorado, June, 2009.
      [pdf] [bibtex]

      Abstract. We present a brief overview of the main challenges in the extraction of semantic relations from English text, and discuss the shortcomings of previous data sets and shared tasks. This leads us to introduce a new task, which will be part of SemEval-2010: multi-way classification of mutually exclusive semantic relations between pairs of common nominals. The task is designed to compare different approaches to the problem and to provide a standard testbed for future research, which can benefit many applications in Natural Language Processing.

    8. SemEval-2010 Task 9: The Interpretation of Noun Compounds Using Paraphrasing Verbs and Prepositions. Cristina Butnariu, Su Nam Kim, Preslav Nakov, Diarmuid Ó Séaghdha, Stan Szpakowicz, Tony Veale. Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009), in conjunction with NAACL-HLT 2009. pp. 100-105, Boulder, Colorado, June, 2009.
      [pdf] [bibtex]

      Abstract. We present a brief overview of the main challenges in understanding the semantics of noun compounds and consider some known methods. We introduce a new task to be part of SemEval-2010: the interpretation of noun compounds using paraphrasing verbs and prepositions. The task is meant to provide a standard testbed for future research on noun compound semantics. It should also promote paraphrase-based approaches to the problem, which can benefit many NLP applications.

    9. Tunable Domain-Independent Event Extraction in the MIRA Framework. Georgi Georgiev, Kuzman Ganchev, Vassil Momchev, Deyan Peychev, Preslav Nakov, Angus Roberts. Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task (BioNLP'09), in conjunction with NAACL-HLT 2009. pp. 95-98, Boulder, Colorado, June 4-5, 2009.
      [pdf] [bibtex]

      Abstract. We describe the system of the PIKB team for BioNLP'09 Shared Task 1, which targets tunable domain-independent event extraction. Our approach is based on a three-stage classification: (1) trigger word tagging, (2) simple event extraction, and (3) complex event extraction. We use the MIRA framework for all three stages, which allows us to trade precision for increased recall by appropriately changing the loss function during training. We report results for three systems focusing on recall (R = 28.88%), precision (P = 65.58%), and F1-measure (F1 = 33.57%), respectively.

    10. A Joint Model for Normalizing Gene and Organism Mentions in Text. Georgi Georgiev, Preslav Nakov, Kuzman Ganchev, Deyan Peychev, Vassil Momchev. Proceedings of the Workshop on Biomedical Information Extraction, in conjunction with RANLP'09, Borovetz, Bulgaria, September 18, 2009.
      [pdf] [bibtex]

      Abstract. The aim of gene mention normalization is to propose an appropriate canonical name, or an identifier from a popular database, for a gene or a gene product mentioned in a given piece of text. The task has attracted a lot of research attention for several organisms under the assumption that both the mention boundaries and the target organism are known. Here we extend the task to also recognizing whether the gene mention is valid and to finding the organism it is from. We solve this extended task using a joint model for gene and organism name normalization which allows for instances from different organisms to share features, thus achieving sizable performance gains with different learning methods: Naive Bayes, Maximum Entropy, Perceptron and MIRA, as well as averaged versions of the last two. The evaluation results for our joint classifier show an F1 score of over 97%, which proves the potential of the approach.

    11. Cross-lingual Adaptation as a Baseline: Adapting Maximum Entropy Models to Bulgarian. Georgi Georgiev, Preslav Nakov, Petya Osenova and Kiril Simov. Proceedings of the Workshop on Adaptation of Language Resources and Technology to New Domains, in conjunction with RANLP'09, Borovetz, Bulgaria, September 17, 2009.
      [pdf] [bibtex]

      Abstract. We describe our efforts in adapting five basic natural language processing components to Bulgarian: sentence splitter, tokenizer, part-of-speech tagger, chunker, and syntactic parser. The components were originally developed for English within OpenNLP, an open source maximum entropy based machine learning toolkit, and were retrained based on manually annotated training data from the BulTreeBank. The evaluation results show an F1 score of 92.54% for the sentence splitter, 98.49% for the tokenizer, 94.43% for the part-of-speech tagger, 84.60% for the chunker, and 77.56% for the syntactic parser, which should be interpreted as baseline for Bulgarian.

    12. A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words. Svetlin Nakov, Elena Paskaleva, Preslav Nakov. Proceedings of the Workshop on Evaluation of Resources and Tools for Central and Eastern European Languages, in conjunction with RANLP'09, Borovetz, Bulgaria, September 17, 2009.
      [pdf] [bibtex]

      Abstract. We propose a novel knowledge-rich approach to measuring the similarity between a pair of words. The algorithm is tailored to Bulgarian and Russian and takes into account the orthographic and the phonetic correspondences between the two Slavic languages: it combines lemmatization, hand-crafted transformation rules, and weighted Levenshtein distance. The experimental results show an 11-pt interpolated average precision of 90.58%, which represents a sizeable improvement over two classic rivaling approaches.
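
      Code sketch (Python). The shape of the algorithm: orthographic normalisation via hand-crafted correspondence rules, followed by a Levenshtein distance with weighted substitution costs (here, vowel-for-vowel substitutions are cheap). The rule and the costs are Latin-script stand-ins for the Bulgarian/Russian Cyrillic ones, and lemmatization is omitted.

        def normalize(word, rules):
            """Apply hand-crafted orthographic correspondence rules."""
            for src, tgt in rules:
                word = word.replace(src, tgt)
            return word

        def weighted_levenshtein(a, b, indel=1.0, sub=1.0, vowel_sub=0.5):
            vowels = set("aeiouy")
            d = [[j * indel for j in range(len(b) + 1)]]
            for i, ca in enumerate(a, 1):
                row = [i * indel]
                for j, cb in enumerate(b, 1):
                    if ca == cb:
                        cost = 0.0
                    elif ca in vowels and cb in vowels:
                        cost = vowel_sub  # phonetically minor edit
                    else:
                        cost = sub
                    row.append(min(d[i - 1][j] + indel,      # deletion
                                   row[j - 1] + indel,       # insertion
                                   d[i - 1][j - 1] + cost))  # substitution
                d.append(row)
            return d[-1][-1]

        RULES = [("ya", "e")]  # illustrative ya <-> e correspondence
        # Bulgarian 'lyato' vs. Russian 'leto' ('summer'): distance 0 after rules.
        print(weighted_levenshtein(normalize("lyato", RULES), normalize("leto", RULES)))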

    13. The NUS Machine Translation System for IWSLT 2009. Preslav Nakov, Chang Liu, Wei Lu, Hwee Tou Ng. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT'09), December 1-2, 2009. Tokyo, Japan.
      [pdf] [bibtex] [slides]

      Abstract. We describe the system developed by the team of the National University of Singapore for the Chinese-English BTEC task of the IWSLT 2009 evaluation campaign. We adopted a state-of-the-art phrase-based statistical machine translation approach and focused on experiments with different Chinese word segmentation standards. In our official submission, we trained a separate system for each segmenter and we combined the outputs in a subsequent re-ranking step. Given the small size of the training data, we further re-trained the system on the development data after tuning. The evaluation results show that both strategies yield sizeable and consistent improvements in translation quality.

    14. Contemporary Statistical Machine Translation: a Brief Overview. Preslav Nakov. Journal of Automatics and Informatics. Vol. 4, 2009. (in Bulgarian)

    15. Statistical Machine Translation: Problems and Approaches. Preslav Nakov. In: Computational Linguistics: History, Problems, Perspectives. Vol. 1. Anabela. Sofia, 2009. (in Bulgarian)

    16. Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications. Dimitra Anastasiou, Chikara Hashimoto, Preslav Nakov, and Su Nam Kim (editors). Workshop held in conjunction with ACL-IJCNLP 2009. Singapore, August 6, 2009, Association for Computational Linguistics, 81 pages.
      [bibtex]



 Publications at the Linguistic Modelling Department, IPP-BAS
  2008

    1. Solving Relational Similarity Problems Using the Web as a Corpus. Preslav Nakov and Marti Hearst. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL'08). pp. 452-460, Columbus, OH, USA. 2008.

      Abstract. We present a simple linguistically-motivated method for characterizing the semantic relations that hold between two nouns. The approach leverages the vast size of the Web in order to build lexically-specific features. The main idea is to look for verbs, prepositions, and coordinating conjunctions that can help make explicit the hidden relations between the target nouns. Using these features in instance-based classifiers, we demonstrate state-of-the-art results on various relational similarity problems, including mapping noun-modifier pairs to abstract relations like TIME, LOCATION and CONTAINER, characterizing noun-noun compounds in terms of abstract linguistic predicates like CAUSE, USE, and FROM, classifying the relations between nominals in context, and solving SAT verbal analogy problems. In essence, the approach puts together some existing ideas, showing that they apply generally to various semantic tasks, finding that verbs are especially useful features.

    2. Improved Statistical Machine Translation Using Monolingual Paraphrases. Preslav Nakov. In M. Ghallab, C. Spyropoulos, N. Fakotakis, N. Avouris (eds): Frontiers in Artificial Intelligence and Applications, Volume 178, ECAI 2008 - 18th European Conference on Artificial Intelligence, pp. 338-342, Patras, Greece, 2008.

      Abstract. We propose a novel monolingual sentence paraphrasing method for augmenting the training data for statistical machine translation systems "for free" - by creating it from data that is already available rather than having to create more aligned data. Starting with a syntactic tree, we recursively generate new sentence variants where noun compounds are paraphrased using suitable prepositions, and vice-versa - preposition-containing noun phrases are turned into noun compounds. The evaluation shows an improvement equivalent to 33%-50% of that of doubling the amount of training data.

    3. Noun Compound Interpretation Using Paraphrasing Verbs: Feasibility Study. Preslav Nakov. In Proceedings of the 13th International Conference on Artificial Intelligence: Methodology, Systems, Applications (AIMSA'08), pp. 103-117, Varna, Bulgaria, 2008.

      Abstract. The paper addresses an important challenge for the automatic processing of English written text: understanding noun compounds' semantics. Following Downing (1977), we define noun compounds as sequences of nouns acting as a single noun, e.g., bee honey, apple cake, stem cell, etc. In our view, they are best characterised by the set of all possible paraphrasing verbs that can connect the target nouns, with associated weights, e.g., malaria mosquito can be represented as follows: carry (23), spread (16), cause (12), transmit (9), etc. These verbs are directly usable as paraphrases, and using multiple of them simultaneously yields an appealing fine-grained semantic representation. In the present paper, we describe the process of constructing such representations for 250 noun-noun compounds previously proposed in the linguistic literature by Levi (1978). In particular, using human subjects recruited through Amazon Mechanical Turk Web Service, we create a valuable manually-annotated resource for noun compound interpretation, which we make publicly available with the hope to inspire further research in paraphrase-based noun compound interpretation. We further perform a number of experiments, including a comparison to automatically generated weight vectors, in order to assess the dataset quality and the feasibility of the idea of using paraphrasing verbs to characterise noun compounds' semantics; the results are quite promising.

    4. Paraphrasing Verbs for Noun Compound Interpretation. Preslav Nakov. In Proceedings of the Workshop on Multiword Expressions (MWE'08), in conjunction with the Language Resources and Evaluation conference, pp. 46-49, Marrakech, Morocco, 2008.

      Abstract. An important challenge for the automatic analysis of English written text is the abundance of noun compounds: sequences of nouns acting as a single noun. In our view, their semantics is best characterized by the set of all possible paraphrasing verbs, with associated weights, e.g., malaria mosquito is carry (23), spread (16), cause (12), transmit (9), etc. Using Amazon's Mechanical Turk, we collect paraphrasing verbs for 250 noun-noun compounds previously proposed in the linguistic literature, thus creating a valuable resource for noun compound interpretation. Using these verbs, we further construct a dataset of pairs of sentences representing a special kind of textual entailment task, where a binary decision is to be made about whether an expression involving a verb and two nouns can be transformed into a noun compound, while preserving the sentence meaning.

    5. Improving English-Spanish Statistical Machine Translation: Experiments with Domain Adaptation, Sentence-Level Paraphrasing, Tokenization, and Recasing. Preslav Nakov. In Proceedings of the Third Workshop on Statistical Machine Translation (WMT'08), in conjunction with ACL'2008, pp. 147-150.

      Abstract. We describe the experiments of the UC Berkeley team on improving English-Spanish machine translation of news text, as part of the WMT'08 Shared Translation Task. We experiment with domain adaptation, combining a small in-domain news bi-text and a large out-of-domain one from the Europarl corpus, building two separate phrase translation models and two separate language models. We further add a third phrase translation model trained on a version of the news bi-text augmented with monolingual sentence-level syntactic paraphrases on the source-language side, and we combine all models in a log-linear model using minimum error rate training. Finally, we experiment with different tokenization and recasing rules, achieving 35.09% Bleu score on the WMT'07 news test data when translating from English to Spanish, which is a sizable improvement over the highest Bleu score achieved on that dataset at WMT'07: 33.10% (in fact, by our system). On the WMT'08 English to Spanish news translation, we achieve 21.92%, which makes our team the second best on Bleu score.

    6. Overview of BioCreative II Gene Mention Recognition. Larry Smith, Lorraine K. Tanabe, Rie Johnson née Ando, Cheng-Ju Kuo, I-Fang Chung, Chun-Nan Hsu, Yu-Shi Lin, Roman Klinger, Christoph M. Friedrich, Kuzman Ganchev, Manabu Torii, Hongfang Liu, Barry Haddow, Craig A. Struble, Richard J. Povinelli, Andreas Vlachos, William A. Baumgartner Jr., Lawrence Hunter, Bob Carpenter, Richard Tzong-Han Tsai, Hong-Jie Dai, Feng Liu, Yifei Chen, Chengjie Sun, Sophia Katrenko, Pieter Adriaans, Christian Blaschke, Rafael Torres Perez, Mariana Neves, Preslav Nakov, Anna Divoli, Manuel Maña, Jacinto Mata-Vazquez, and W. John Wilbur. In Genome Biology, 9(Suppl 2):S2, 2008.



 Publications at UC Berkeley
  2007

    1. Ph.D. thesis: Using the Web as an Implicit Training Set: Application to Noun Compound Syntax and Semantics. Preslav Ivanov Nakov. EECS Department, University of California, Berkeley. Technical Report No. UCB/EECS-2007-173. December 20, 2007.

      Abstract. An important characteristic of English written text is the abundance of noun compounds - sequences of nouns acting as a single noun, e.g., colon cancer tumor suppressor protein. While eventually mastered by domain experts, their interpretation poses a major challenge for automated analysis. Understanding noun compounds' syntax and semantics is important for many natural language applications, including question answering, machine translation, information retrieval, and information extraction. For example, a question answering system might need to know whether "protein acting as a tumor suppressor" is an acceptable paraphrase of the noun compound tumor suppressor protein, and an information extraction system might need to decide if the terms neck vein thrombosis and neck thrombosis can possibly co-refer when used in the same document. Similarly, a phrase-based machine translation system facing the unknown phrase WTO Geneva headquarters, could benefit from being able to paraphrase it as Geneva headquarters of the WTO or WTO headquarters located in Geneva. Given a query like migraine treatment, an information retrieval system could use paraphrasing verbs like relieve and prevent for page ranking and query refinement. I address the problem of noun compounds syntax by means of novel, highly accurate unsupervised and lightly supervised algorithms using the Web as a corpus and search engines as interfaces to that corpus. Traditionally the Web has been viewed as a source of page hit counts, used as an estimate for n-gram word frequencies. I extend this approach by introducing novel surface features and paraphrases, which yield state-of-the-art results for the task of noun compound bracketing. I also show how these kinds of features can be applied to other structural ambiguity problems, like prepositional phrase attachment and noun phrase coordination. I address noun compound semantics by automatically generating paraphrasing verbs and prepositions that make explicit the hidden semantic relations between the nouns in a noun compound. I also demonstrate how these paraphrasing verbs can be used to solve various relational similarity problems, and how paraphrasing noun compounds can improve machine translation.
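
      Code sketch (Python). The classical bracketing baseline the thesis builds on: decide the structure of a three-noun compound by comparing web-derived association scores for the competing analyses. The count interface, standing in for page-hit counts, and the toy numbers are hypothetical; the thesis's novel surface features, paraphrases, and statistical scores are not shown.

        def bracket(n1, n2, n3, count):
            """Dependency model: does n1 modify n2, or n3?"""
            if count(n1, n2) >= count(n1, n3):
                return f"[[{n1} {n2}] {n3}]"  # left bracketing
            return f"[{n1} [{n2} {n3}]]"      # right bracketing

        hits = {("liver", "cell"): 1000, ("liver", "antibody"): 50}  # toy counts
        print(bracket("liver", "cell", "antibody",
                      lambda a, b: hits.get((a, b), 0)))  # [[liver cell] antibody]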

    2. BioText Search Engine: beyond abstract search., Marti A. Hearst, Anna Divoli, Harendra Guturu, Alex Ksikes, Preslav Nakov, Michael A. Wooldridge, and Jerry Ye. In Bioinformatics 23(16):2196-2197, 2007.

      Abstract. The BioText Search Engine is a freely available Web-based application that provides biologists with new ways to access the scientific literature. One novel feature is the ability to search and browse article figures and their captions. A grid view juxtaposes many different figures associated with the same keywords, providing new insight into the literature. An abstract/title search and list view shows at a glance many of the figures associated with each article. The interface is carefully designed according to usability principles and techniques. The search engine is a work in progress, and more functionality will be added over time.

    3. SemEval-2007 Task 04: Classification of Semantic Relations between Nominals., Roxana Girju, Preslav Nakov, Vivi Nastase, Stan Szpakowicz, Peter Turney, Deniz Yuret. In Proceedings of SemEval-2007 Workshop co-located with ACL-2007, pp. 13-18, Prague, June 23-24, 2007.

      Abstract. The NLP community has shown a renewed interest in deeper semantic analyses, among them automatic recognition of relations between pairs of words in a text. We present an evaluation task designed to provide a framework for comparing different approaches to classifying semantic relations between nominals in a sentence. This is part of SemEval, the 4th edition of the semantic evaluation event previously known as SensEval. We define the task, describe the training/test data and their creation, list the participating systems and discuss their results. There were 14 teams who submitted 15 systems.

    4. UCB: System Description for SemEval Task #4., Preslav Nakov and Marti Hearst. in Proceedings of SemEval-2007 Workshop co-located with ACL-2007, pp. 366-369, Prague, June 23-24, 2007.

      Abstract. The UC Berkeley team participated in the SemEval 2007 Task #4, with an approach that leverages the vast size of the Web in order to build lexically-specific features. The idea is to determine which verbs, prepositions, and conjunctions are used in sentences containing a target word pair, and to compare those to features extracted for other word pairs in order to determine which are most similar. By combining these Web features with words from the sentence context, our team was able to achieve the best results for systems of category C, and close to the best results for systems of category A.

    5. UCB System Description for the WMT 2007 Shared Task., Preslav Nakov and Marti Hearst. in Proceedings of the Second Workshop on Statistical Machine Translation co-located with ACL-2007, pp. 212-215, Prague, June 23, 2007.

      Abstract. For the WMT 2007 shared task, the UC Berkeley team employed three techniques of interest. First, we used monolingual syntactic paraphrases to provide syntactic variety to the source training set sentences. Second, we trained two language models: a small in-domain model and a large out-of-domain model. Finally, we made use of results from prior research that shows that cognate pairs can improve word alignments. We contributed runs translating English to Spanish, French, and German using various combinations of these techniques.

    6. BioText Report for the Second BioCreAtIvE Challenge., Nakov, P., and Divoli A. in Proceedings of BioCreAtIvE II Workshop, pp. 297-306, Madrid, Spain, April 23-25, 2007.

      Abstract. This report describes the BioText team participation in the Second BioCreAtIvE Challenge. We focused on the Interaction-Article (IAS) and the Interaction-Pair (IPS) Sub-Tasks, which ask for the identification of protein interaction information in abstracts, and the extraction of interacting protein pairs from full text documents, respectively. We identified and normalized protein names and then used an ensemble of Naive Bayes classifiers in order to decide whether protein interaction information is present in a given abstract (for IAS) or whether a pair of co-occurring genes interact (for IPS). Since the recognition and normalization of genes and proteins were critical components of our approach, we participated in the Gene Mention (GM) and Gene Normalization (GN) tasks as well, in order to evaluate the performance of these components in isolation. For these tasks we used a previously developed in-house tool, based on database-derived gazetteers and approximate string matching, which we augmented with document-centered ambiguity resolution, but did not train or tune on the training data for GN and GM.
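
      As a rough illustration of the classification set-up (an ensemble of Naive Bayes classifiers voting on whether an abstract reports a protein interaction), here is a sketch using scikit-learn. The feature views and toy data are hypothetical stand-ins, not the team's actual features or corpus.

          from sklearn.ensemble import VotingClassifier
          from sklearn.feature_extraction.text import CountVectorizer
          from sklearn.naive_bayes import MultinomialNB
          from sklearn.pipeline import make_pipeline

          # Hypothetical ensemble: Naive Bayes classifiers over different
          # feature views of the abstract, combined by majority vote.
          ensemble = VotingClassifier([
              ("words", make_pipeline(CountVectorizer(), MultinomialNB())),
              ("bigrams", make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                                        MultinomialNB())),
              ("chars", make_pipeline(CountVectorizer(analyzer="char_wb",
                                                      ngram_range=(3, 4)),
                                      MultinomialNB())),
          ], voting="hard")

          abstracts = ["protein A binds protein B", "the yeast genome was sequenced",
                       "A interacts with B in vivo", "expression profiling of tissues"]
          has_interaction = [1, 0, 1, 0]
          ensemble.fit(abstracts, has_interaction)
          print(ensemble.predict(["B binds A in vitro"]))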

    7. Improved Word Alignments Using the Web as a Corpus, Preslav Nakov, Svetlin Nakov and Elena Paskaleva. In Proceedings of RANLP'2007, pp. 400-405, Borovetz, Bulgaria, September 27-29, 2007.

      Abstract. We propose a novel method for improving word alignments in a parallel sentence-aligned bilingual corpus based on the idea that if two words are translations of each other then so should be many words in their local contexts. The idea is formalised using the Web as a corpus, a glossary of known word translations (dynamically augmented from the Web using bootstrapping), the vector space model, linguistically motivated weighted minimum edit distance, competitive linking, and the IBM models. Evaluation results on a Bulgarian-Russian corpus show a sizable improvement both in word alignment and in translation quality.
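
      Of the components listed above, competitive linking is the easiest to show compactly: repeatedly link the highest-scoring unlinked (source, target) pair until no candidates remain. Below is a Python sketch under a toy similarity function; the paper's actual score combines Web-based context similarity, a glossary, and a linguistically weighted edit distance.

          def competitive_linking(src, tgt, score):
              """Greedy one-to-one word alignment by competitive linking:
              always link the best-scoring still-unlinked pair."""
              candidates = sorted(
                  ((score(s, t), i, j)
                   for i, s in enumerate(src) for j, t in enumerate(tgt)),
                  reverse=True)
              linked_src, linked_tgt, links = set(), set(), []
              for sc, i, j in candidates:
                  if i not in linked_src and j not in linked_tgt:
                      links.append((src[i], tgt[j], sc))
                      linked_src.add(i)
                      linked_tgt.add(j)
              return links

          # Toy similarity: character-set overlap, a stand-in for the
          # combined similarity score described in the abstract.
          sim = lambda s, t: len(set(s) & set(t)) / len(set(s) | set(t))
          print(competitive_linking(["kniga", "nova"], ["novaja", "kniga"], sim))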

    8. Cognate or False Friend? Ask the Web!, Svetlin Nakov, Preslav Nakov and Elena Paskaleva. In Proceedings of the Workshop on Acquisition and Management of Multilingual Lexicons, held in conjunction with RANLP'2007, pp. 55-62, Borovetz, Bulgaria, September 30, 2007.

      Abstract. We propose a novel unsupervised semantic method for distinguishing cognates from false friends. The basic intuition is that if two words are cognates, then most of the words in their respective local contexts should be translations of each other. The idea is formalised using the Web as a corpus, a glossary of known word translations used as cross-linguistic "bridges", and the vector space model. Unlike traditional orthographic similarity measures, our method can easily handle words with identical spelling. The evaluation on 200 Bulgarian-Russian word pairs shows this is a very promising approach.
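
      A minimal sketch of the core computation, assuming the local contexts have already been harvested: map the first word's context words into the second language through the glossary "bridge", then take the cosine between the resulting frequency vectors. The glossary and contexts below are toy stand-ins, not the paper's data.

          import math
          from collections import Counter

          def cosine(u, v):
              dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
              nu = math.sqrt(sum(x * x for x in u.values()))
              nv = math.sqrt(sum(x * x for x in v.values()))
              return dot / (nu * nv) if nu and nv else 0.0

          def context_similarity(contexts_l1, contexts_l2, glossary):
              """High cosine between bridged context vectors suggests
              cognates; low cosine suggests false friends."""
              v1 = Counter(glossary[w]
                           for ctx in contexts_l1 for w in ctx if w in glossary)
              v2 = Counter(w for ctx in contexts_l2 for w in ctx)
              return cosine(v1, v2)

          glossary = {"voda": "water", "pie": "drinks"}   # hypothetical bridge
          ctx_bg = [["voda", "pie"], ["voda"]]
          ctx_ru = [["water", "drinks"], ["water", "cold"]]
          print(context_similarity(ctx_bg, ctx_ru, glossary))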

    9. Extracting Translation Lexicons from Bilingual Corpora: Application to South-Slavonic Languages, Preslav Nakov, Veno Pacovski and Elena Paskaleva. In Proceedings of the Workshop on Common Natural Language Processing Paradigm For Balkan Languages, held in conjunction with RANLP'2007, pp. 23-31, Borovetz, Bulgaria, September 26, 2007.

      Abstract. The paper presents a novel approach for automatic translation lexicon extraction from a parallel sentence-aligned corpus. This is a five-step process, which includes cognate extraction, word alignment, phrase extraction, statistical phrase filtering, and linguistic phrase filtering. Unlike other approaches whose objective is to extract word or phrase pairs to be used in machine translation, we try to induce meaningful linguistic units (pairs of words or phrases) that could potentially be included as entries in a bilingual dictionary. Structural and content analysis of the extracted phrases of length up to seven words shows that over 90% of them are correctly translated, which suggests that this is a very promising approach.

      2006

    1. BioText Team Report for the TREC 2006 Genomics Track., Anna Divoli, Marti A. Hearst, Preslav I. Nakov, Ariel Schwartz, Alex Ksikes. Proceedings of TREC 2006, Gaithersburg, MD, 2006.

      Abstract. The paper reports on the work conducted by the BioText team at UC Berkeley for the TREC 2006 Genomics track. Our approach had three main focal points: First, based on our successful results in the TREC 2003 Genomics track [1], we emphasized gene name recall. Second, given the structured nature of the Generic Topic Types (GTTs), we attempted to design queries that covered every part of the topics, including synonym expansion. Third, inspired by having access to the full text of documents, we experimented with identifying and weighting information depending on which section (Introduction, Results, etc.) it appeared in. Our emphasis on covering the different pieces of the query may have helped with the aspects ranking portion of the task, as we performed best on that evaluation measure. We submitted three runs: Biotext1, BiotextWeb, and Biotext3. All runs were fully automatic. The Biotext1 run performed best, achieving MAP scores of .24 on aspects, .35 on documents, and .035 on passages.

    2. Using Verbs to Characterize Noun-Noun Relations., Nakov, P., and Hearst, M. in Proceedings of AIMSA 2006, pp. 233-244, Varna, Bulgaria, September 2006.

      Abstract. We present a novel, simple, unsupervised method for characterizing the semantic relations that hold between nouns in noun-noun compounds. The main idea is to discover predicates that make explicit the hidden relations between the nouns. This is accomplished by writing Web search engine queries that restate the noun compound as a relative clause containing a wildcard character to be filled in with a verb. A comparison to results from the literature suggests that this is a promising approach.
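
      The pattern-matching step lends itself to a compact sketch: restate the compound "noun1 noun2" as a relative clause with a wildcard and tally whatever fills the wildcard slot. Here the Web query is simulated over a handful of stand-in snippets, and no lemmatization of the extracted verbs is attempted.

          import re
          from collections import Counter

          def paraphrasing_verbs(noun1, noun2, snippets):
              """Tally candidate predicates for "noun1 noun2" by matching
              the relative-clause pattern "noun2 that <verb> noun1"."""
              pattern = re.compile(
                  rf"\b{noun2} that (\w+) {noun1}\b", re.IGNORECASE)
              return Counter(m.group(1).lower()
                             for s in snippets for m in pattern.finditer(s))

          snippets = [                       # stand-ins for search result snippets
              "a mosquito that carries malaria is dangerous",
              "the mosquito that spreads malaria is common",
              "any mosquito that carries malaria",
          ]
          print(paraphrasing_verbs("malaria", "mosquito", snippets))
          # Counter({'carries': 2, 'spreads': 1})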

      2005

    1. Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing., Nakov, P., and Hearst, M. in Proceedings of CoNLL-2005, Ninth Conference on Computational Natural Language Learning, pp. 17-24, Ann Arbor, MI, June 2005.

      Abstract. In order to achieve the long-range goal of semantic interpretation of noun compounds, it is often necessary to first determine their syntactic structure. This paper describes an unsupervised method for noun compound bracketing which extracts statistics from Web search engines using a Chi^2 measure, a new set of surface features, and paraphrases. On a gold standard, the system achieves results of 89.34% (baseline 66.80%), which is a sizable improvement over the state of the art (80.70%).
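
      As a rough sketch of the statistical core (the surface features and paraphrases are harder to compress), here is a chi-squared bracketing decision in the adjacency-model style: compare the association of (w1, w2) against that of (w2, w3), with contingency cells standing in for Web-derived counts. The cell values below are made up for illustration.

          def chi2(a, b, c, d):
              """Chi-squared score for a 2x2 contingency table:
              a = #(x, y), b = #(x, not y), c = #(not x, y), d = #(not x, not y)."""
              n = a + b + c + d
              num = n * (a * d - b * c) ** 2
              den = (a + b) * (c + d) * (a + c) * (b + d)
              return num / den if den else 0.0

          def bracket(counts):
              """Left bracketing if (w1, w2) is more strongly associated
              than (w2, w3), else right bracketing."""
              left = chi2(*counts["w1w2"])
              right = chi2(*counts["w2w3"])
              return "[[w1 w2] w3]" if left > right else "[w1 [w2 w3]]"

          counts = {"w1w2": (500, 1500, 800, 97200),   # toy count cells
                    "w2w3": (40, 1960, 1260, 96740)}
          print(bracket(counts))   # -> [[w1 w2] w3]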

    2. Using the Web as an Implicit Training Set: Application to Structural Ambiguity Resolution., Nakov, P., and Hearst, M. In Proceedings of HLT-NAACL'05, pp. 835-842, Vancouver, 2005.

      Abstract. Recent work has shown that very large corpora can act as training data for NLP algorithms even without explicit labels. In this paper we show how the use of surface features and paraphrases in queries against search engines can be used to infer labels for structural ambiguity resolution tasks. Using unsupervised algorithms, we achieve 84% precision on PP-attachment and 80% on noun compound coordination.

    3. A Study of Using Search Engine Page Hits as a Proxy for n-gram Frequencies., Nakov, P., and Hearst, M. In Proceedings of RANLP'05, Borovets, Bulgaria, 2005.

      Abstract. The idea of using the Web as a corpus for linguistic research is getting increasingly popular. Most often this means using Web search engine page hit counts as estimates for n-gram frequencies. While the results so far have been very encouraging, some researchers worry about what appears to be the instability of these estimates. Using a particular NLP task, we compare the variability in the n-gram counts across different search engines as well as for the same search engine across time, finding that, although there are measurable differences, they are not statistically significant for the task examined.

    4. Scaling Up BioNLP: Application of a Text Annotation Architecture to Noun Compound Bracketing., Nakov, P., Schwartz, A., Wolf, B., and Hearst, M. In ACL/ISMB BioLINK SIG: Linking Literature, Information and Knowledge for Biology, Detroit, MI, June 2005.

      Abstract. We describe the use of the Layered Query Language and architecture to acquire statistics for natural language processing applications. We illustrate the system's use on the problem of noun compound bracketing using MEDLINE.

    5. Supporting Annotation Layers for Natural Language Processing., Nakov, P., Schwartz, A., Wolf, B., and Hearst, M. In the ACL 2005 Poster/Demo Track, pp. 65-68, Ann Arbor, MI, June 2005.

      Abstract. We demonstrate a system for flexible querying against text that has been annotated with the results of NLP processing. The system supports self-overlapping and parallel layers, integration of syntactic and ontological hierarchies, flexibility in the format of returned results, and tight integration with SQL. We present a query language and its use on examples taken from the NLP literature.

      2004

    1. BioText Team Experiments for the TREC 2004 Genomics Track., Nakov, P., Schwartz, A., Wolf, B., and Hearst, M. Proceedings of TREC 2004, Gaithersburg, MD, 2004.


    2. Citances: Citation Sentences for Semantic Analysis of Bioscience Text., Nakov P., A. Schwartz, M. Hearst. Workshop on Search and Discovery in Bioinformatics at SIGIR'04, Sheffield, UK, July 2004.

      Abstract. We propose the use of the text of the sentences surrounding citations as an important tool for semantic interpretation of bioscience text. We hypothesize several different uses of citation sentences (which we call citances), including the creation of training and testing data for semantic analysis (especially for entity and relation recognition), synonym set creation, database curation, document summarization, and information retrieval generally. We illustrate some of these ideas, showing that citations to one document in particular align well with what a hand-built curator extracted. We also show preliminary results on the problem of normalizing the different ways that the same concepts are expressed within a set of citances, using and improving on existing techniques in automatic paraphrase generation.

    3. BioText Team Report for the TREC 2004 Genomics Track. Preslav Nakov, Ariel Schwartz, Emilia Stoica, Marti Hearst. In Proceedings of the Text REtrieval Conference (TREC'04), Gaithersburg, MD, USA, 2004.

      Abstract. The BioText group participated in the two main tasks of the TREC 2004 Genomics track. Our approach to the ad hoc task was similar to the one used in the 2003 Genomics track, but due to the lack of training data, we did not achieve the high scores of the previous year. The most novel aspect of our submission for the categorization task centers around our method for assigning Gene Ontology (GO) codes to articles marked for curation. This approach compares the text surrounding a target gene to text that has been found to be associated with GO codes assigned to homologous genes for organisms with genomes similar to mice (namely, humans and rats). We applied the same method to GO codes that have been assigned to MGI entries in years prior to the test set. In addition, we filtered out proposed GO codes based on their previously observed likelihood to co-occur with one another.

      2003

    1. Category-based Pseudowords, Preslav Nakov and Marti Hearst, in the Companion Volume of the Proceedings of HLT-NAACL'03, pp. 67-69, Edmonton, Canada, May 2003.

      Abstract. A pseudoword is a composite comprised of two or more words chosen at random; the individual occurrences of the original words within a text are replaced by their conflation. Pseudowords are a useful mechanism for evaluating the impact of word sense ambiguity in many NLP applications. However, the standard method for constructing pseudowords has some drawbacks. Because the constituent words are chosen at random, the word contexts that surround pseudowords do not necessarily reflect the contexts that real ambiguous words occur in. This in turn leads to an optimistic upper bound on algorithm performance. To address these drawbacks, we propose the use of lexical categories to create more realistic pseudowords, and evaluate the results of different variations of this idea against the standard approach.
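
      For concreteness, the basic conflation step looks roughly like this in Python; the category-based variant proposed in the paper would differ only in how the constituent words are chosen.

          def make_pseudoword(tokens, constituents):
              """Replace every occurrence of the constituent words with
              their conflation, yielding a pseudoword corpus for
              evaluating sense-disambiguation algorithms."""
              pseudo = "_".join(constituents)
              members = set(constituents)
              return [pseudo if t in members else t for t in tokens]

          text = "the banana was ripe and the door was open".split()
          print(make_pseudoword(text, ("banana", "door")))
          # ['the', 'banana_door', 'was', 'ripe', 'and', 'the', 'banana_door', 'was', 'open']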

    2. BioText Team Report for the TREC 2003 Genomics Track, Gaurav Bhalotia, Preslav Nakov, Ariel S. Schwartz, Marti A. Hearst. In Proceedings of TREC'03, pp. 612-621, Gaithersburg, MD, USA, 2003.

      Abstract. The BioText project team participated in both tasks of the TREC 2003 genomics track. Key to our approach in the primary task was the use of an organism-name recognition module, a module for recognizing gene name variants, and MeSH descriptors. Text classification improved the results slightly. In the secondary task, the key insight was casting it as a classification problem of choosing between the title and the last sentence of the abstract, although MeSH descriptors helped somewhat in this task as well. These approaches yielded results within the top three groups in both tasks.




     Other publications (not at UC Berkeley)

    1. M.Sc. thesis: Nakov P. Recognition and Morphological Classification of Unknown Words for German. Sofia University, Faculty of Mathematics and Informatics, Department of Information Technologies. Sofia. July, 2001.

      Abstract. A system for recognition and morphological classification of unknown words for German is described. The system takes raw text as input and outputs a list of the unknown nouns together with hypotheses about their possible morphological class and stem. The morphological classes used uniquely identify the word's gender and the inflection endings it takes when it changes by case and number. The system exploits both global information (ending-guessing rules, maximum likelihood estimations, word frequency statistics) and local information (surrounding context), as well as morphological properties (compounding, inflection, affixes) and external knowledge (specially designed lexicons, German grammar information, etc.). The problem is solved as a sequence of subtasks including: identification of unknown words, noun identification, recognition and grouping of inflected forms of the same word (they must share the same stem), compound splitting, morphological stem analysis, stem hypotheses for each group of inflected forms, and, finally, production of a ranked list of hypotheses about the possible morphological class for each group of words. The system is a kind of tool for lexical acquisition: it identifies unknown words in a raw text, derives some of their properties, and classifies them. Only nouns are currently considered, but the approach can be successfully applied to other parts of speech as well as to other inflexional languages.

    2. Robust Ending Guessing Rules with Application to Slavonic Languages. Preslav Nakov, Elena Paskaleva. In Proceedings of the 3rd workshop on RObust Methods in Analysis of Natural Language Data (ROMAND), an International Workshop in Association with COLING'04, pp. 76-85, Geneva, August 29, 2004.

      Abstract. The paper studies the automatic extraction of diagnostic word endings for Slavonic languages, aimed at determining some grammatical, morphological and semantic properties of the underlying word. In particular, ending-guessing rules are learned from a large morphological dictionary of Bulgarian in order to predict POS, gender, number, article and semantics. A simple exact high-accuracy algorithm is developed and compared to an approximate one, which uses a scoring function previously proposed by Mikheev for POS guessing.
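
      A simplified sketch of rule learning from a dictionary: collect tag statistics for every word ending and keep the majority tag with a reliability score. Mikheev's actual scoring function additionally smooths the estimate with a confidence interval, which is omitted here; the lexicon below is a toy stand-in.

          from collections import defaultdict

          def learn_ending_rules(dictionary, max_len=4, min_support=3):
              # Collect tag counts for every word ending of length 1..max_len.
              by_ending = defaultdict(lambda: defaultdict(int))
              for word, tag in dictionary.items():
                  for k in range(1, min(max_len, len(word) - 1) + 1):
                      by_ending[word[-k:]][tag] += 1
              # Keep the majority tag per ending, with a simple
              # reliability score freq(best_tag | ending) / freq(ending).
              rules = {}
              for ending, tags in by_ending.items():
                  total = sum(tags.values())
                  if total >= min_support:
                      best, freq = max(tags.items(), key=lambda kv: kv[1])
                      rules[ending] = (best, freq / total)
              return rules

          lexicon = {"zeitung": "N-fem", "meinung": "N-fem",
                     "hoffnung": "N-fem", "sprung": "N-masc"}
          print(learn_ending_rules(lexicon)["ung"])   # ('N-fem', 0.75)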

    3. Towards Deeper Understanding and Personalisation in CALL. Galia Angelova, Albena Strupchanska, Ognyan Kalaydjiev, Milena Yankova, Svetla Boytcheva, Irena Vitanova, Preslav Nakov. In Proceedings of "eLearning for Computational Linguistics and Computational Linguistics for eLearning", an International Workshop in Association with COLING'04, pp. 45-52, Geneva, August 28, 2004.

      Abstract. We consider in depth the semantic analysis in learning systems, as well as some information retrieval techniques applied for measuring document similarity in eLearning. These results were obtained in a CALL project, which ended with an extensive user evaluation. After several years spent in the development of CALL modules and prototypes, we think that much closer cooperation with real teaching experts is necessary in order to find the proper learning niches and suitable wrappings of the language technologies, which could give birth to useful eLearning solutions.

    4. Non-Parametric SPAM Filtering based on kNN and LSA. Preslav Nakov, Panayot Dobrikov. In Proceedings of the 33rd National Spring Conference of the Bulgarian Mathematicians Union, Borovets, Bulgaria, April 1-4, 2004.

      Abstract. The paper proposes a non-parametric approach to the filtering of unsolicited commercial e-mail messages, also known as spam. The text of each email message is represented as an LSA vector, which is then fed into a kNN classifier. The method shows high accuracy on a collection of recent personal email messages. Tests on the standard LINGSPAM collection achieve an accuracy of over 99.65%, which is an improvement over the best published results to date.
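
      The pipeline is easy to re-create with modern tools (the paper itself predates scikit-learn, so this is a sketch of the idea, not the original implementation): project weighted term vectors into a low-rank LSA space via truncated SVD, then classify by nearest neighbours. The toy messages below are stand-ins.

          from sklearn.pipeline import make_pipeline
          from sklearn.feature_extraction.text import TfidfVectorizer
          from sklearn.decomposition import TruncatedSVD
          from sklearn.neighbors import KNeighborsClassifier

          emails = ["cheap meds buy now", "meeting agenda attached",
                    "win money now", "project deadline next week"]
          labels = ["spam", "ham", "spam", "ham"]

          model = make_pipeline(
              TfidfVectorizer(),             # term weighting (one possible choice)
              TruncatedSVD(n_components=2),  # LSA projection of the message
              KNeighborsClassifier(n_neighbors=1),  # kNN vote: spam vs ham
          )
          model.fit(emails, labels)
          print(model.predict(["buy cheap now"]))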

    5. Towards Deeper Understanding of LSA Performance. Nakov P., E. Valchanova, G. Angelova. In Proceedings of Recent Advances in Natural Language Processing (RANLP'03). pp. 311-318. Borovetz, Bulgaria, September 10-12, 2003.

      Abstract. The paper presents ongoing work towards a deeper understanding of the factors influencing the performance of Latent Semantic Analysis (LSA). Unlike previous attempts that concentrate on problems such as the weighting of matrix elements, the selection of the space dimensionality, the similarity measure, etc., we primarily study the impact of another, often neglected, but fundamental element of LSA (and of any text processing technique): the definition of "word". For this purpose, a balanced corpus of Bulgarian newspaper texts was carefully created to allow for in-depth observations of LSA performance, and a series of experiments was performed in order to understand and compare (with respect to the task of text categorisation) six possible inputs with different levels of linguistic quality, including: the graphemic form as met in the text, stem, lemma, phrase, lemma&phrase, and part-of-speech annotation. In addition to LSA, we made comparisons to the standard vector-space model, without any dimensionality reduction. The results show that while the linguistic processing has a substantial influence on LSA performance, the traditional factors are even more important, and therefore we could not prove that linguistic pre-processing substantially improves text categorisation.

    6. Guessing Morphological Classes of Unknown German Nouns. Nakov P., Bonev Y., G. Angelova, E. Gius, W. von Hahn. In Proceedings of Recent Advances in Natural Language Processing (RANLP'03). pp. 319-326. Borovetz, Bulgaria, September 10-12, 2003.

      Abstract. A system for recognition and morphological classification of unknown German words is described. Given raw text, it outputs a list of the unknown nouns together with hypotheses about their possible stems and morphological class(es). The system exploits both global and local information, as well as morphological properties and external linguistic knowledge sources. It learns and applies ending-guessing rules similar to the ones originally proposed for POS guessing. The paper presents the system design and implementation and discusses its performance through extensive evaluation. Similar ideas for ending-guessing rules have been applied to Bulgarian as well, but the performance there is worse due to the difficulties of noun recognition as well as to the highly inflexional morphology with numerous ambiguous endings.

    7. BulStem: Design and Evaluation of Inflectional Stemmer for Bulgarian. Nakov P. In Proceedings of Workshop on Balkan Language Resources and Tools (1st Balkan Conference in Informatics), Thessaloniki, Greece, November, 2003.

      Abstract. The paper starts with an overview of some important approaches to stemming for English and other languages. Then, the design, implementation and evaluation of the BulStem inflectional stemmer for Bulgarian are presented. The problem is addressed from a machine-learning perspective using a large morphological dictionary. A detailed automatic evaluation in terms of under-stemming, over-stemming and coverage is provided. In addition, the effect of stemming and of BulStem's parameter settings is demonstrated on a particular task: text categorisation using kNN+LSA.
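
      At inference time such a dictionary-learned stemmer reduces to longest-suffix matching against a rule table. A minimal Python sketch with hypothetical rules (not the real BulStem table):

          def stem(word, rules):
              """Strip the longest matching inflectional ending, replacing
              it as prescribed by the rule table. Real BulStem rules are
              learned from a morphological dictionary with frequency
              thresholds that trade off under- vs over-stemming."""
              for k in range(len(word) - 1, 0, -1):  # prefer the longest ending
                  suffix = word[-k:]
                  if suffix in rules:
                      return word[:-k] + rules[suffix]
              return word

          rules = {"ите": "а", "ата": "а", "и": "а"}   # hypothetical Bulgarian endings
          print(stem("книгите", rules))                 # -> "книга"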

    8. ArtsSemNet: from Bilingual Dictionary to Bilingual Semantic Network. Atanassova I, Nakov S., Nakov P. In Proceedings of Workshop on Balkan Language Resources and Tools (1st Balkan Conference in Informatics), Thessaloniki, Greece, November, 2003.

      Abstract. The paper presents two bilingual lexicographical resources for the terminology of fine arts: the ArtsDict electronic dictionary and the ArtsSemNet semantic network, and describes the process of transforming the former into the latter. ArtsDict combines a broad range of information sources and is currently the most complete dictionary of fine arts terminology for both Bulgarian and Russian, not only among electronic dictionaries but in general. It contains 2,900 Bulgarian and 2,644 Russian terms, each annotated with complete dictionary definitions. These are further augmented with various terminological relations (polysemy, synonymy, homonymy, antonymy and hyponymy) and organised into a bilingual semantic network similar to WordNet. In addition, a specialised hypertext browser is implemented in order to enable intuitive querying and navigation through the network.

    9. Adaptivity in Web-Based CALL. Angelova G., S. Boytcheva, O. Kalaydjiev, S. Trausan-Matu, P. Nakov and A. Strupchanska. In Proceedings of the 15th European Conference on Artificial Intelligence (ECAI'02), pp. 445-449. Lyon, France. July 21-26 2002.

      Abstract. This paper presents the design, implementation and some original features of a Web-based learning environment - STyLE (Scientific Terminology Learning Environment). STyLE supports adaptive learning of English terminology with a target user group of non-native speakers. It attempts to improve Computer-Aided Language Learning (CALL) by intelligent integration of Natural Language Processing (NLP) and personalised Information Retrieval (IR) into a single coherent system.

    10. Automatic Recognition and Morphological Classification of Unknown German Nouns. Nakov, P., G. Angelova, W. von Hahn. FBI-HH-B-243/02, Bericht 243, Fachbereich Informatik, Universitaet Hamburg, September 2002.

      Abstract. A system for recognition and morphological classification of unknown words for German is described. The MorphoClass system takes raw text as input and outputs a list of the unknown nouns together with hypotheses about their morphological class and stem. The morphological classes used uniquely identify the word's gender and the inflection endings it takes for changes in case and number. MorphoClass exploits both global information (ending-guessing rules, maximum likelihood estimations, word frequency statistics) and local information (adjacent context), as well as morphological properties (compounding, inflection, affixes) and external linguistic knowledge (specially designed lexicons, German grammar information, etc.). The task is solved by a sequence of subtasks including: unknown word identification, noun identification, recognition and grouping of inflected forms of the same word (they must share the same stem), compound splitting, morphological stem analysis, stem hypotheses for each group of inflected forms, and, finally, production of a ranked list of hypotheses about a possible morphological class for each group of words. MorphoClass is a kind of tool for lexical acquisition: it identifies unknown words in a raw text, derives their properties, and classifies them. Currently, only nouns are processed, but the approach can be successfully applied to other parts of speech (especially when the PoS of the unknown word is already determined) as well as to other inflexional languages. (The work was performed in 2001 as a scientific project within the BIS-21 "Center of Excellence" project, ICA1-2000-70016, and was supported by the cooperation between Hamburg University and Sofia University "St. Kl. Ohridski".)

    11. Latent Semantic Analysis for Notional Structures Investigation. Nakov P., S. Terzieva. In Proceedings of the Annual Congress of the European Society for Philosophy and Psychology (ESPP'02). Lyon, France, July 10-13, 2002.

      Abstract. Research on the effects of study is hindered by the limitations of the techniques and methods for registering, measuring and assessing the actually formed knowledge. The problem has been addressed using latent semantic analysis for the comparison and assessment of scientific texts and of knowledge expressed in the form of free verbal statements. Education at higher schools has the specific objective of developing knowledge and experience, both of which have two fundamental dimensions: the first is expertise training in a well-defined occupational or disciplinary domain, and the second is learning strategies and skills for being an effective learner. Various approaches for stimulating deep learning, transferring into practice the achievements of cognitive psychology, have been developed during the last decade. Here we present a study of the cognitive activity of university students and of its results in the dimension of declarative knowledge. In practice, a comparative analysis is made between the input system of notions from the learning texts and the mental structures formed by the students. The research includes a sequence of actions and procedures for: facilitating the formation of stable concept structures (preparation of the learning materials, their content, structure and visual presentation, organisation of learning, etc.); obtaining feedback on the retention of knowledge of a certain number of key notions; and assessing the manifested knowledge. The data used are verbal: learning texts, linguistic descriptions of the notions contained in them, all rendered in an open format by the observed subjects in response to indirect questions. The nature of the processed material (input stimuli and retained knowledge) motivated the application of Latent Semantic Analysis (LSA) as a research method. This statistical technique permitted the formation of a model of the semantic connections between the studied notions and a general representation of the results.

    12. Latent Semantic Analysis for German literature investigation. Nakov P. In Proceedings of the 7th Fuzzy Days'01, International Conference on Computational Intelligence. B. Reusch (Ed.): LNCS 2206. pp. 834-841. Dortmund, Germany. October 1-3, 2001.

      Abstract. The paper presents the results of experiments on the use of LSA for the analysis of textual data. The method is explained in brief, and special attention is paid to its potential for the comparison and investigation of German literary texts. Two hypotheses are tested: 1) texts by the same author are similar and can be distinguished from those by a different author; 2) prose and poetry can be told apart automatically.

    13. Weight functions impact on LSA performance. Nakov P., Popova A., Mateev P. In Proceedings of the EuroConference Recent Advances in Natural Language Processing (RANLP'01). pp. 187-193. Tzigov Chark, Bulgaria, September 5-7, 2001.

      Abstract. This paper presents experimental results on the use of LSA for the analysis of English literature texts. Several preliminary transformations of the frequency term-document matrix with different weight functions are tested on the basis of control subsets. Additional clustering based on a correlation matrix is applied in order to reveal the latent structure. The algorithm creates a shaded form matrix via singular values and vectors. The results are interpreted as a measure of the quality of the transformations and are compared against tests on the control sets.
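
      One standard member of the weight-function family studied in this line of work is log-entropy weighting, shown below as a NumPy sketch; it is given as a representative example, not necessarily one of the paper's exact transformations.

          import numpy as np

          def log_entropy(tdm):
              """Log-entropy weighting of a term-document count matrix:
              w_ij = log(1 + f_ij) * (1 + sum_j p_ij log p_ij / log n),
              with p_ij = f_ij / sum_j f_ij and n documents."""
              f = np.asarray(tdm, dtype=float)
              n = f.shape[1]
              row_sums = f.sum(axis=1, keepdims=True)
              p = np.divide(f, row_sums, out=np.zeros_like(f), where=row_sums > 0)
              with np.errstate(divide="ignore", invalid="ignore"):
                  plogp = np.where(p > 0, p * np.log(p), 0.0)
              global_w = 1.0 + plogp.sum(axis=1) / np.log(n)  # per-term entropy weight
              return np.log1p(f) * global_w[:, None]

          counts = np.array([[3, 0, 1], [1, 1, 1]])  # toy term-document counts
          print(log_entropy(counts))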

    14. Research in the Notional Structures of the Declarative Memory of Students using Latent Semantic Analysis. Terzieva S., Nakov P. Bulgarian Journal of Psychology. vol. 1-2/2001. Sofia, Bulgaria. 2001. (in Bulgarian).

      Abstract. Education at higher schools has the specific objective of developing knowledge and experience, both of which have two fundamental dimensions: the first is expertise training in a well-defined occupational or disciplinary domain, and the second is learning strategies and skills for being an effective learner. Various approaches for stimulating deep learning, transferring into practice the achievements of cognitive psychology, have been developed over the past decade. Here we present a study of the cognitive activity of university students and of its results in the dimension of declarative knowledge. In practice, a comparative analysis is made between the input system of notions from the learning texts and the mental structures formed by the students. The research includes a sequence of actions and procedures for: facilitating the formation of stable concept structures (preparation of the learning materials, their content, structure and visual presentation, organization of learning, etc.); obtaining feedback on the retention of knowledge of a certain number of key notions; and assessing the manifested knowledge. The data used are verbal: learning texts, linguistic descriptions of the notions contained in them, all rendered in an open format by the observed subjects in response to indirect questions. The nature of the processed material (input stimuli and retained knowledge) motivated the application of Latent Semantic Analysis (LSA) as a research method. This statistical technique permitted the formation of a model of the semantic connections between the studied notions in the output space, against whose background the individual achievements are assessed and the results are represented in general form.

    15. Investigating the Degree of Adequacy of the Relations in the Concept Structure of Students using the Method of Latent Semantic Analysis. Terzieva S., Nakov P., Handjieva S. In Proceedings of the Bulgarian Computer Science Conference on Computer Systems and Technologies (CompSysTech'01). Sofia, Bulgaria. 2001.

      Abstract. Research on the effects of study is hindered by the limitations of the techniques and methods for registering, measuring and assessing the actually formed knowledge, i.e., information represented in memory with the appropriate correlations among its units. The problem has been addressed by the use of latent semantic analysis for the comparison and assessment of scientific texts and of knowledge expressed by the students in the form of free verbal statements.

    16. Getting Better Results with Latent Semantic Indexing. Nakov P. In Proceedings of the Student Presentations at the European Summer School in Logic, Language and Information (ESSLLI'00). pp. 156-166. Birmingham, UK. August 2000.

      Abstract. The paper presents an overview of some important factors influencing the quality of the results obtained when using Latent Semantic Indexing. The factors are separated into five major groups and analyzed both separately and as a whole. A new class of extended Boolean operations such as OR, AND and NOT (AND-NOT) and their combinations is proposed and evaluated on a corpus of religious and sacred texts.

    17. Web Personalization Using Extended Boolean Operations with Latent Semantic Indexing. Nakov P. In Lecture Notes in Artificial Intelligence - 1904 (Springer). Artificial Intelligence: Methodology, Systems and Applications. 9th International Conference (AIMSA'00), pp. 189-198. Varna, Bulgaria, September 2000.

      Abstract. The paper discusses the potential of extended Boolean operations for personalized information delivery on the Internet, based on semantic vector representation models. The final goal is the design of an e-commerce portal that tracks users' clickstream activity and purchase history in order to offer them personalized information. The emphasis is put on the introduction of dynamic composite user profiles constructed by means of extended Boolean operations. The basic binary Boolean operations such as OR, AND and NOT (AND-NOT) and their combinations have been introduced and implemented in a variety of ways. An evaluation is presented based on the classic Latent Semantic Indexing method for information retrieval, using a text corpus of religious and sacred texts.
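
      A small sketch of what extended Boolean operations over semantic vectors can look like, assuming profile items and documents already live in an LSI space: OR as a normalized centroid, and AND-NOT as removal of the unwanted component. These are plausible vector realizations of the operators, not necessarily the exact combinations evaluated in the paper.

          import numpy as np

          def unit(v):
              n = np.linalg.norm(v)
              return v / n if n else v

          def vec_or(*vs):
              """OR as the normalized centroid of the operand vectors."""
              return unit(np.sum(vs, axis=0))

          def vec_not(a, b):
              """a AND-NOT b: remove a's projection onto b, so the result
              is orthogonal to (i.e., carries no trace of) b."""
              b = unit(b)
              return unit(a - np.dot(a, b) * b)

          profile = vec_or(np.array([1.0, 0.2, 0.0]),   # toy LSI vectors for
                           np.array([0.8, 0.0, 0.3]))   # two items a user liked
          print(vec_not(profile, np.array([0.0, 1.0, 0.0])))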

    18. Latent Semantic Analysis of Textual Data. Nakov P. In Proceedings of the International Conference on Computer Systems and Technologies (CompSysTech'00). pp. V.3-1-V.3-5. Sofia, Bulgaria. June 2000.

      Abstract. The paper presents an overview of the usage of LSA for the analysis of textual data. The mathematical apparatus is explained in brief, and special attention is paid to the key parameters that influence the quality of the results obtained. The potential of LSA is demonstrated on a selected corpus of religious and sacred texts. The results of an experimental application of LSA for educational purposes are also presented.