Higher Attestation Commission (ВАК) journals

  1. Frolov D.S. Annotated suffix tree as a way of text representation for information retrieval in text collections. Business Informatics, 2015, no. 4 (34), pp. 63–70. DOI:10.17323/1998-0663.2015.4.63.70. [PDF] (indexed by Web of Science)
  2. Dmitry Frolov. Using Annotated Suffix Trees for Fuzzy Full Text Search, in: Communications in Computer and Information Science. Information Retrieval. 10th Russian Summer School, RuSSIR 2016, Saratov, Russia, August 22-26, 2016, Revised Selected Papers. Springer, 2016. In press. (indexed by Scopus and Web of Science)

Graduate works

  • 2012 - bachelor's graduate work "Mathematical Methods of Mobile Devices Power Consumption Analysis", Petrozavodsk State University (PetrSU); scientific supervisor: O. Yu. Bogoyavlenskaya, Ph.D., associate professor.

    Abstract & Results:

    Power consumption is one of the main constraints on the development of portable devices. Typically, a battery is the only energy source of a stand-alone device, and all architectural components and processes draw on it. For example, a camera, a GPS receiver and a network module are among the most energy-consuming components. Several models of power consumption exist, including mathematical and physical ones. In this work we analyze energy consumption at the application level.

    The goals of the work are to monitor the battery charge in order to estimate the time remaining until full discharge, and to model the energy consumption process.

    We investigated the following devices: Nokia N900 (running Maemo 5, which is based on the Linux kernel) and Nokia N8 (Symbian OS). All energy consumption measurements were carried out using dedicated Python scripts run in daemon mode. We collected 11 log files containing more than 50 thousand records; each record holds a battery level value. The workload on the device was varied, and each log file covers the period from full charge to full discharge. After processing the results we obtained the empirical distribution function of the amount of energy spent per unit of time, as well as the empirical probability density. We then selected several candidate distributions to approximate the empirical distribution function and checked each of them with goodness-of-fit tests. Depending on the device load, the energy consumption process follows one of the following distributions: lognormal, Erlang, or a linear combination of Poisson distributions.
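
    The fitting step described above can be illustrated with a minimal sketch. It is not the original measurement or analysis code: the data are synthetic, the function names are hypothetical, and the Kolmogorov-Smirnov distance is used here only as one possible goodness-of-fit measure (the source does not say which tests were applied).

```python
import math
import random


def ecdf(sample):
    """Return the sorted sample and the empirical CDF value at each point."""
    xs = sorted(sample)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]


def fit_lognormal(sample):
    """Estimate lognormal parameters (mu, sigma) from the logs of the sample."""
    logs = [math.log(x) for x in sample]
    mu = sum(logs) / len(logs)
    var = sum((v - mu) ** 2 for v in logs) / len(logs)
    return mu, math.sqrt(var)


def lognormal_cdf(x, mu, sigma):
    """CDF of the lognormal distribution for x > 0."""
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF and a model CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The ECDF jumps from i/n to (i+1)/n at x, so check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Synthetic stand-in for "energy spent per unit of time" (assumption).
rng = random.Random(0)
drops = [rng.lognormvariate(0.0, 0.5) for _ in range(1000)]

mu, sigma = fit_lognormal(drops)
d = ks_statistic(drops, lambda x: lognormal_cdf(x, mu, sigma))
print("mu=%.3f sigma=%.3f KS distance=%.4f" % (mu, sigma, d))
```

    A small distance suggests that the candidate distribution is plausible. Because the parameters are estimated from the same sample, the standard Kolmogorov-Smirnov critical values do not apply directly, so the sketch reports only the distance and makes no accept/reject decision.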

    We considered two approaches to modeling the process: Markov processes and ARMA (autoregressive moving-average) models. The Markov-based model provides high accuracy, but it consumes a considerable amount of energy and processing resources. ARMA is a simpler model that is easy to implement and use; however, it does not provide high accuracy.
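
    To give a feel for the simpler of the two approaches, the sketch below fits AR(1), the most basic special case of ARMA (no moving-average part), to a series of per-interval energy drops and uses it to estimate the number of intervals until the battery is empty. The model order, the function names and the sample numbers are assumptions made for illustration; the source does not specify the ARMA orders or the estimation procedure actually used.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1] + noise."""
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_curr = sum(curr) / n
    cov = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
    var = sum((p - mean_prev) ** 2 for p in prev)
    phi = cov / var
    c = mean_curr - phi * mean_prev
    return c, phi


def forecast(c, phi, last, steps):
    """Iterate the fitted recurrence to predict the next `steps` values."""
    out = []
    x = last
    for _ in range(steps):
        x = c + phi * x
        out.append(x)
    return out


def steps_until_empty(charge, c, phi, last, max_steps=100000):
    """Count forecast intervals until the remaining charge reaches zero.

    Returns None if the charge is not exhausted within max_steps.
    """
    x = last
    for step in range(1, max_steps + 1):
        x = c + phi * x
        charge -= x
        if charge <= 0:
            return step
    return None


# Hypothetical noise-free series of energy drops per interval.
series = [10.0]
for _ in range(19):
    series.append(2.0 + 0.5 * series[-1])

c, phi = fit_ar1(series)
print("c=%.3f phi=%.3f" % (c, phi))
print("intervals left:", steps_until_empty(60.0, c, phi, series[-1]))
```

    The fit needs only a few sums over the log, which is in line with the remark above that ARMA-type models are cheap to program and run.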

    An application implementing these models has been developed. It runs on the Android operating system, which is based on the Linux kernel, and was tested on a Samsung Galaxy II (Android versions 2.3 and 4.0).

  • 2014 - master's graduate work "Distributed System for Document Storage and Clustering", National Research University "Higher School of Economics" (NRU HSE); scientific supervisor: B. G. Mirkin, D.Sc., professor.

    Abstract & Results:

    The work is devoted to the design and experimental testing of a distributed system for processing collections of text documents.

    The work implements clustering and document retrieval methods based on Latent Dirichlet Allocation (LDA) and Probabilistic Latent Semantic Analysis (PLSA). These methods make it possible to cluster the documents in a collection and to extract topics from it; they are also used for document retrieval. In addition, a new document retrieval method based on annotated suffix trees (AST) was designed and implemented. It includes a preliminary processing step for document collections, namely inverted indexing of fragments, which relies on selecting particular features for index construction.
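
    For readers unfamiliar with annotated suffix trees, the following is a minimal sketch of the general idea, not the implementation described above. It builds an uncompressed suffix trie in which every node is annotated with the frequency of the corresponding fragment, and scores a query by averaging conditional frequencies along the matched paths of all its suffixes. The class and function names are hypothetical, and the scoring formula is one commonly described variant that is assumed here for illustration.

```python
class Node:
    """Trie node annotated with the frequency of the fragment ending here."""

    def __init__(self):
        self.children = {}
        self.freq = 0


def build_ast(fragments):
    """Insert every suffix of every fragment, counting visits per node."""
    root = Node()
    for frag in fragments:
        for start in range(len(frag)):
            node = root
            for ch in frag[start:]:
                node = node.children.setdefault(ch, Node())
                node.freq += 1
    # The root frequency is defined as the sum over its children.
    root.freq = sum(child.freq for child in root.children.values())
    return root


def score(root, query):
    """Average, over all suffixes of the query, of the mean conditional
    frequency f(child) / f(parent) along the longest matching path."""
    if not query:
        return 0.0
    total = 0.0
    for start in range(len(query)):
        node = root
        matched = 0
        acc = 0.0
        for ch in query[start:]:
            child = node.children.get(ch)
            if child is None:
                break
            acc += child.freq / node.freq
            matched += 1
            node = child
        if matched:
            total += acc / matched
    return total / len(query)


# Two toy "documents", each represented by its words as fragments.
docs = {
    "doc_ast": ["annotated", "suffix", "tree"],
    "doc_lda": ["latent", "dirichlet", "allocation"],
}
trees = {name: build_ast(words) for name, words in docs.items()}

# A query with typos still shares many character fragments with doc_ast.
for name, tree in trees.items():
    print(name, round(score(tree, "sufix"), 3), round(score(tree, "tre"), 3))
```

    Because matching is done on character fragments and not on whole words, a misspelled or truncated query can still receive a non-zero score, which is the property that makes the approach attractive for inexact queries.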

    An experimental comparison of the implemented methods was carried out using the XML product catalogs of the "Ozon" web store, which are freely available for download. The experimental results show that the author's modifications can improve document retrieval quality, especially for incomplete search queries or queries containing errors. Distributed processing reduces the time spent on processing collections, and preliminary processing of document collections reduces query execution time. This makes it possible to use annotated suffix trees (AST) as a powerful method for document retrieval.

    The final result of the work is a distributed software system that implements LDA-, PLSA- and AST-based methods for document retrieval.

  • 2014 - present: Ph.D. student at National Research University "Higher School of Economics" (NRU HSE); scientific supervisor: B. G. Mirkin, D.Sc., professor. Thesis topic: "Aggregated Text Representation for Information Retrieval in Collections of Text Documents".

    Abstract & Future plans (work in progress):

    Automatic natural language text processing covers a wide range of problems, such as machine translation, information retrieval and semantic text analysis. Because of the great diversity of natural languages and the enormous complexity of language constructs, these problems are difficult to solve. As the field of applied information technology expands, processing sets of text documents (so-called "collections") may be considered one of the most important problems in information retrieval. It includes document clustering, text classification and various problems of retrieving documents relevant to a given query. Aggregated text representation is a fairly modern approach based on ideas from theoretical computer science. These ideas are applicable to individual texts (annotated suffix trees) as well as to whole collections.

    We have already shown the applicability of annotated suffix tree techniques to text retrieval problems. We developed a search engine based on the AST method and compared it with other popular text aggregation techniques, PLSA and LDA. We conclude that the AST method has significant advantages, especially for inaccurate queries. On the other hand, the AST method is slow, especially on large collections of documents. We also developed a method for fuzzy full-text search and conducted experiments with it.

    Our future plans include developing effective document retrieval methods based on AST text aggregation techniques. This includes the following subtasks:

    1. improvement of the AST-based method for document retrieval with the help of both index-based principles and approaches using no indexing;
    2. implementation of the developed methods; adaptation of the methods for distributed document storing and computation;
    3. adaptation of the methods to the cases of dynamically changing document collections;
    4. conducting experimental computations for comparison of the developed methods.