
IBM Many Aspects Document Summarization Tool

A tool that highlights various angles of a document's content.


Date Posted: July 1, 2008

What is IBM Many Aspects Document Summarization Tool?

IBM® Many Aspects Document Summarization Tool is a document summarization system that ingests a text document and automatically highlights a set of sentences that are expected to cover the different aspects of the document's content. The user decides the number of sentences to be included in the summary. These sentences are picked using the following two criteria:

  • Coverage: The sentences should span a large portion of the spectrum of the document's subject matter.
  • Orthogonality: Each sentence should capture a different aspect of the document's content. That is, the sentences in the summary should be as orthogonal to each other as possible.
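To make the two criteria concrete, here is a minimal sketch that scores a candidate summary. The definitions are assumptions chosen for illustration, not the tool's actual measures: coverage is taken as the fraction of the document's unique words that appear in the summary, and orthogonality as one minus the average pairwise Jaccard overlap between summary sentences. The function names are hypothetical.

```python
import re
from itertools import combinations


def words(sentence):
    """Return the set of lowercase word tokens in a sentence."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def coverage(summary, document):
    """Assumed measure: share of the document's unique words found in the summary."""
    doc_vocab = set().union(*(words(s) for s in document))
    if not doc_vocab:
        return 0.0
    summary_vocab = set().union(*(words(s) for s in summary))
    return len(summary_vocab & doc_vocab) / len(doc_vocab)


def orthogonality(summary):
    """Assumed measure: 1 minus the mean pairwise Jaccard overlap of the sentences."""
    pairs = list(combinations(summary, 2))
    if not pairs:
        return 1.0  # a single sentence cannot overlap with anything
    total = 0.0
    for a, b in pairs:
        wa, wb = words(a), words(b)
        union = wa | wb
        total += len(wa & wb) / len(union) if union else 0.0
    return 1.0 - total / len(pairs)
```

Under these assumed measures, a summary made of two sentences about the same event scores lower on both coverage and orthogonality than a summary that pairs one of them with a sentence on an unrelated topic.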

The demand for a summary with high coverage and high orthogonality is amplified by today's Web 2.0 applications. For example, in the online comments and discussions that follow blogs, videos, and news articles, it is desirable to have a summary that highlights the different angles of these comments, because each comment often has a different focus. With IBM Many Aspects Document Summarization Tool, you can get a concise yet comprehensive overview of the document without having to spend a lot of time drilling down into the details.

For comparative analysis and exploratory flexibility, the system also includes other off-the-shelf text summarization methods, such as k-median clustering and singular value decomposition (SVD). Thus the system allows you to explore the content of the input document in many different ways.

How does it work?

The core algorithm is based on a novel combinatorial formulation for the document summarization problem and a greedy search strategy; this formulation captures both coverage and orthogonality requirements. You simply load a plain-text document and give the length of the summary; the system will automatically identify and highlight the most important sentences.
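The general idea of a greedy search can be sketched as follows. This is not the published formulation: it substitutes a simple stand-in objective (pick, at each step, the sentence that adds the most words not yet covered), and the names `tokenize` and `greedy_summary` are hypothetical. Favoring new words rewards coverage and, as a side effect, penalizes sentences that repeat what is already selected.

```python
import re


def tokenize(sentence):
    """Return the set of lowercase word tokens in a sentence."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def greedy_summary(sentences, k):
    """Return the indices of up to k sentences chosen greedily.

    Each round picks the sentence that contributes the most words
    that no previously chosen sentence has covered.
    """
    word_sets = [tokenize(s) for s in sentences]
    covered = set()
    chosen = []
    for _ in range(min(k, len(sentences))):
        best_index, best_gain = None, 0
        for i, ws in enumerate(word_sets):
            if i in chosen:
                continue
            gain = len(ws - covered)
            if gain > best_gain:
                best_index, best_gain = i, gain
        if best_index is None:  # no remaining sentence adds a new word
            break
        chosen.append(best_index)
        covered |= word_sets[best_index]
    return sorted(chosen)  # report in document order
```

For a short document with two sentences about one topic and one sentence about another, asking for k = 2 selects one sentence per topic, because the second sentence on the first topic adds few new words. Each of the k rounds in this sketch scans all of the sentences once.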

The entire system consists of the following three modules:

  • The user interface module is responsible for user interactive operations and visualization, including the graphical user interface (GUI) and highlighter components.
  • The parser module provides support for parsing and filtering the text, including linguistic parser components.
  • The algorithm pool contains different document summarization algorithms.
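The way the three modules fit together can be pictured with a small sketch. Everything here is a hypothetical stand-in: the sentence splitter is a naive rule rather than a linguistic parser, the two algorithms in the pool are placeholders, and highlighting is done with text markers instead of a GUI.

```python
import re


def parse(text):
    """Parser stand-in: split plain text into sentences with a naive rule."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def first_k(sentences, k):
    """Placeholder algorithm: choose the first k sentences."""
    return list(range(min(k, len(sentences))))


def longest_k(sentences, k):
    """Placeholder algorithm: choose the k longest sentences."""
    ranked = sorted(range(len(sentences)), key=lambda i: -len(sentences[i]))
    return sorted(ranked[:k])


# Algorithm pool stand-in: name -> function(sentences, k) -> chosen indices.
ALGORITHM_POOL = {"first": first_k, "longest": longest_k}


def highlight(sentences, chosen):
    """User interface stand-in: wrap the chosen sentences in ** markers."""
    marked = set(chosen)
    return " ".join(
        f"**{s}**" if i in marked else s for i, s in enumerate(sentences)
    )


def summarize(text, k, algorithm="first"):
    """Run the pipeline: parse, select with a pooled algorithm, highlight."""
    sentences = parse(text)
    chosen = ALGORITHM_POOL[algorithm](sentences, k)
    return highlight(sentences, chosen)
```

Keeping the algorithms behind a common signature is what would let a user switch methods on the same parsed input and compare the highlighted results.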

In theory, for a document with m sentences and n unique words and a summary of k sentences, the total running time of the algorithm is O(k*m*n), which is linear in the size of the input. By contrast, the complexity of SVD, a very popular approach to text summarization, is O(m^2*n + m*n^2 + k*m*n).

The algorithm takes less than a couple of seconds to compute a ten-sentence summary of a 57 KB document (approximately 566 sentences), whereas SVD takes approximately four minutes for the same task.
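To get a rough feel for the gap between the two complexity bounds, the sketch below plugs the example's figures (m = 566 sentences, k = 10) into both expressions. The vocabulary size n is not given in the text, so the value used here is purely an assumption, and the function names are hypothetical. The expressions ignore constant factors, so the result is only an order-of-magnitude indication and not a prediction of the measured timings.

```python
def greedy_cost(m, n, k):
    """Term count from the O(k*m*n) bound."""
    return k * m * n


def svd_cost(m, n, k):
    """Term count from the O(m^2*n + m*n^2 + k*m*n) bound."""
    return m**2 * n + m * n**2 + k * m * n


m, k = 566, 10   # sentences and summary length from the example
n = 2000         # ASSUMED number of unique words; not stated in the text

ratio = svd_cost(m, n, k) / greedy_cost(m, n, k)
print(f"SVD bound / greedy bound = {ratio:.1f}")
```

Algebraically the ratio simplifies to m/k + n/k + 1, so it grows with both the number of sentences and the vocabulary size when the summary length is held fixed.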

For further information, please see "ManyAspects: A System for Highlighting Diverse Concepts in Documents," by Kun Liu, Evimaria Terzi, and Tyrone Grandison, published at the International Conference on Very Large Data Bases (VLDB), Auckland, New Zealand, in August 2008.


About the technology author(s):

Kun Liu, Ph.D., joined the IBM Almaden Research Center as a postdoctoral researcher in January 2007. He is working with the Intelligent Information Systems team on healthcare informatics and privacy-preserving social network analysis. Dr. Liu earned his Ph.D. in computer science from the University of Maryland, Baltimore County.

Evimaria Terzi, Ph.D., has been a researcher at IBM Almaden since June 2007. She obtained her Ph.D. from the University of Helsinki, Finland, in January 2007; her M.Sc. from Purdue University in 2002; and her B.Sc. from the University of Thessaloniki, Greece, in 2000. Her research interests are in the area of algorithmic data mining with applications to social-network analysis, information retrieval, and databases.

Tyrone Grandison, Ph.D., leads the Intelligent Information Systems in the Healthcare Informatics group at the IBM Almaden Research Center. Dr. Grandison received his B.Sc. and M.Sc. from the University of the West Indies, Jamaica, and his Ph.D. from the Imperial College of Science, Technology and Medicine, University of London, United Kingdom. He is a senior member of both the Association for Computing Machinery (ACM) and the Institute of Electrical and Electronics Engineers (IEEE).


IBM is a trademark of IBM Corporation in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.

View screenshots:
IBM Many Aspects Document Summarization Tool highlighting a three-sentence summary of a document containing two distinct topics: BlackBerry's system failure and Google's acquisition of DoubleClick.


Related technologies

For platform(s):
Java

For topics:
analysis, data management, data mining, Java technology


 
