What is IBM Many Aspects Document Summarization Tool?
IBM® Many Aspects Document Summarization Tool is a document summarization system that ingests a text document and automatically highlights a set of sentences that are expected to cover the different aspects of the document's content. The user decides the number of sentences to be included in the summary. These sentences are picked using the following two criteria:
- Coverage: The sentences should span a large portion of the spectrum of the document's subject matter.
- Orthogonality: Each sentence should capture a different aspect of the document's content. That is, the sentences in the summary should be as orthogonal to each other as possible.
The demand for summaries with high coverage and high orthogonality is amplified by today's Web 2.0 applications. For example, in the online comments and discussions that follow blogs, videos, and news articles, it is desirable to have a summary that highlights the different angles of these comments, because each comment often has a different focus. With IBM Many Aspects Document Summarization Tool, you can get a concise yet comprehensive overview of the document without spending much time drilling down into the details.
For comparative analysis and exploratory flexibility, the system also includes other off-the-shelf text summarization methods, such as k-median clustering and singular value decomposition (SVD). Thus the system allows you to explore the content of the input document in many different ways.
How does it work?
The core algorithm is based on a novel combinatorial formulation for the document summarization problem and a greedy search strategy; this formulation captures both coverage and orthogonality requirements. You simply load a plain-text document and give the length of the summary; the system will automatically identify and highlight the most important sentences.
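To make the idea of a greedy search concrete, the following is a minimal sketch, not the formulation used by the tool itself (that is described in the paper cited at the end of this page). It assumes a simplified objective: repeatedly pick the sentence that adds the most words not yet covered by the summary. Coverage grows with each pick, and a sentence that mostly repeats an earlier pick gains little, which is a rough stand-in for orthogonality. The function names and the naive sentence splitter are illustrative assumptions.

```python
import re

def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def words(sentence):
    """Lowercased set of alphabetic tokens in a sentence."""
    return set(re.findall(r"[a-z']+", sentence.lower()))

def greedy_summary(text, k):
    """Pick up to k sentences, each adding the most not-yet-covered words."""
    sentences = split_sentences(text)
    word_sets = [words(s) for s in sentences]
    covered = set()   # words already represented in the summary
    chosen = []       # indices of selected sentences
    for _ in range(min(k, len(sentences))):
        best, best_gain = None, 0
        for i, ws in enumerate(word_sets):
            if i in chosen:
                continue
            gain = len(ws - covered)  # new words this sentence would add
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:  # no remaining sentence adds anything new
            break
        chosen.append(best)
        covered |= word_sets[best]
    # Return the selected sentences in document order.
    return [sentences[i] for i in sorted(chosen)]
```

In this sketch, a sentence that only restates an already selected one has zero gain and is never chosen, so asking for more sentences than the document has distinct content simply returns a shorter summary.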
The entire system consists of the following three modules:
- The user interface module is responsible for interactive user operations and visualization, including the graphical user interface (GUI) and highlighter components.
- The parser module provides support for parsing and filtering the text, including linguistic parser components.
- The algorithm pool contains different document summarization algorithms.
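The division of labor among the three modules can be sketched as follows. This is an illustrative outline only, not the tool's actual code: the names `parse`, `ALGORITHM_POOL`, `lead_k`, `highlight`, and `summarize` are hypothetical, the parser is a naive punctuation-based splitter, and the single algorithm in the pool is a placeholder baseline that returns the first k sentences.

```python
import re

# Parser module (sketch): split text into sentences, dropping empty fragments.
def parse(text):
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Algorithm pool (sketch): name -> function(sentences, k) returning chosen indices.
def lead_k(sentences, k):
    """Placeholder baseline (assumption): select the first k sentences."""
    return list(range(min(k, len(sentences))))

ALGORITHM_POOL = {"lead": lead_k}

# User interface module, highlighter part (sketch): mark the chosen sentences.
def highlight(sentences, chosen):
    picked = set(chosen)
    return " ".join(
        f"[[{s}]]" if i in picked else s for i, s in enumerate(sentences)
    )

def summarize(text, k, method="lead"):
    """Run the pipeline: parse, select with the named algorithm, highlight."""
    sentences = parse(text)
    chosen = ALGORITHM_POOL[method](sentences, k)
    return highlight(sentences, chosen)
```

Because every algorithm in the pool shares one signature, another method can be added to the dictionary and compared on the same input without touching the parser or the highlighter.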
Theoretically, for a document with m sentences and n unique words, the total running time of the algorithm to produce a summary of k sentences is O(k*m*n), which for a fixed k is linear in the size of the input. By contrast, SVD, a very popular approach to text summarization, has a complexity of O(m^2*n + m*n^2 + k*m*n).
In practice, the algorithm takes under a couple of seconds to compute a ten-sentence summary of a 57 KB document (approximately 566 sentences), whereas SVD takes approximately four minutes for the same task.
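The two complexity expressions can be compared directly by plugging in numbers. The sketch below treats the expressions inside the O(...) bounds as raw operation counts, which ignores constant factors and is therefore not a prediction of wall-clock time. The values m = 566 and k = 10 come from the example above; the vocabulary size n = 2000 is a made-up assumption, because the number of unique words in that document is not stated.

```python
def greedy_ops(m, n, k):
    """Term inside O(k*m*n) for the greedy algorithm."""
    return k * m * n

def svd_ops(m, n, k):
    """Terms inside O(m^2*n + m*n^2 + k*m*n) for the SVD approach."""
    return m**2 * n + m * n**2 + k * m * n

def cost_ratio(m, n, k):
    """SVD count divided by greedy count; simplifies to (m + n) / k + 1."""
    return svd_ops(m, n, k) / greedy_ops(m, n, k)

# m and k from the timing example; n is a hypothetical vocabulary size.
print(cost_ratio(m=566, n=2000, k=10))
```

Dividing the two expressions gives (m + n)/k + 1, so the gap widens as the document gets longer or its vocabulary larger, and narrows as more summary sentences are requested.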
For further information, see "ManyAspects: A System for Highlighting Diverse Concepts in Documents," by Kun Liu, Evimaria Terzi, and Tyrone Grandison, published in the proceedings of the International Conference on Very Large Data Bases (VLDB), Auckland, New Zealand, August 2008.