Deep Web Data Extraction

China (SANEPR.com) November 6, 2008 -- The unabated growth of the Web has resulted in a situation in which more information is available to more people than ever before in human history. Along with this unprecedented growth has come the inevitable problem of information overload. To counteract this information overload, users typically rely on search engines (like Google and AllTheWeb) or on manually created categorization hierarchies (like Yahoo! and the Open Directory Project). Though excellent for accessing Web pages on the so-called "crawlable" web, these approaches overlook a much more massive and high-quality resource: the Deep Web.

The Deep Web (or Hidden Web) comprises all information that resides in autonomous databases behind portals and information providers' web front-ends. Web pages in the Deep Web are dynamically generated in response to a query submitted through a web site's search form, and they often contain rich content. A recent study has estimated the size of the Deep Web to be more than 500 billion pages, whereas the size of the "crawlable" web is only 1% of the Deep Web (i.e., less than 5 billion pages). Even those web sites with some static links that are "crawlable" by a search engine often have much more information available only through a query interface. Unlocking this vast Deep Web content presents a major research challenge.

In analogy to search engines over the "crawlable" web, we argue that one way to unlock the Deep Web is to employ a fully automated approach to extracting, indexing, and searching the query-related information-rich regions from dynamic web pages. For this miniproject, we focus on the first of these: extracting data from the Deep Web.

Extracting the interesting information from a Deep Web site requires several things, including scalable and robust methods for analyzing the dynamic web pages of a given web site, discovering and locating the query-related information-rich content regions, and extracting itemized objects within each region. By full automation, we mean that the extraction algorithms should be designed independently of the presentation features or specific content of the web pages, such as the specific ways in which the query-related information is laid out or the specific locations where the navigational links and advertisement information are placed in the web pages.
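As a rough illustration of locating a query-related content region without relying on a site's layout, the sketch below compares two result pages returned by the same site for different queries. It treats the text that differs between them as query-related and reports the deepest DOM path that encloses all of it. The differencing heuristic, the positional path labels, the sample pages, and the names PathTextCollector, collect_text, and query_dependent_region are all assumptions made for this example; this is not a method described in this release.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathTextCollector(HTMLParser):
    """Collect visible text keyed by a positional path such as html[1]/body[1]/div[2]."""

    def __init__(self):
        super().__init__()
        self.stack = []           # labels of the currently open elements
        self.child_counts = [{}]  # one child-tag counter per open element (plus the root)
        self.text_by_path = {}

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        counts = self.child_counts[-1]
        counts[tag] = counts.get(tag, 0) + 1
        self.stack.append(f"{tag}[{counts[tag]}]")
        self.child_counts.append({})

    def handle_endtag(self, tag):
        open_tags = [label.split("[")[0] for label in self.stack]
        if tag not in open_tags:
            return  # stray closing tag: ignore it
        # Pop back to the matching open tag, tolerating unclosed children.
        while self.stack:
            label = self.stack.pop()
            self.child_counts.pop()
            if label.split("[")[0] == tag:
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            path = "/".join(self.stack)
            previous = self.text_by_path.get(path, "")
            self.text_by_path[path] = (previous + " " + text).strip()


def collect_text(html):
    """Parse an HTML string and return a {path: text} mapping."""
    collector = PathTextCollector()
    collector.feed(html)
    collector.close()
    return collector.text_by_path


def query_dependent_region(page_a, page_b):
    """Return the deepest path enclosing every text node that differs between two pages."""
    a, b = collect_text(page_a), collect_text(page_b)
    changed = [p.split("/") for p in a.keys() | b.keys() if a.get(p) != b.get(p)]
    prefix = []
    for parts in zip(*changed):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return "/".join(prefix)


# Hypothetical result pages for two different queries against the same site.
TEMPLATE = """<html><body>
<div><a href="/">Home</a> <a href="/about">About</a></div>
<div><ul>{items}</ul></div>
<div>Advertisement</div>
</body></html>"""
PAGE_A = TEMPLATE.format(items="<li>Elvis Presley</li><li>Elvis Costello</li>")
PAGE_B = TEMPLATE.format(items="<li>Michael Jackson</li><li>Janet Jackson</li>")

print(query_dependent_region(PAGE_A, PAGE_B))
```

For these two sample pages, the navigation links and the advertisement are identical and drop out, so the reported region is the result list, html[1]/body[1]/div[2]/ul[1]. A real site would need more than two sample pages and some tolerance for rotating advertisements, which this sketch does not attempt.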

There are many possible 7001-miniprojects. Feel free to talk to either of us for more details. Here are a few possibilities to consider:

1. Develop a Web-based demo for clustering pages of a similar type from a single Deep Web source. For example, AllMusic produces three types of pages in response to a user query: a direct match page (e.g., for Elvis Presley), a list of links to match pages (e.g., a list of all artists named Jackson), and a page with no matches. As a first step toward extracting the relevant data from each page, you may develop techniques to separate the pages that contain query matches from the pages that contain no matches, and perhaps rank each group based on some metric of quality.

2. Design a system for extracting interesting data from a collection of pages from a Deep Web source. You might define a set of regular expressions that can identify dates, prices, or names. Develop a small program that converts a page into a type structure. For example, given a DOM model of a web page, identify all of the types that you have defined, and replace the string tokens with XML tags identifying the types. Replace all non-type tokens with a generic type, and return the tree as a full type structure. Alternatively, you may suggest your own approach for extracting data.

3. Develop a system to recognize names in a page. Given a list of names and a web page, identify possible matches in the page. Based on the structure of the page and the distribution of recognized names, identify strings that may also be names based on their location in the DOM tree hierarchy representing the page.

4. Write a survey paper about current approaches for understanding and analyzing the Deep Web. Be sure to include many of your own comments on the viability of the approaches you review.

5. Or, feel free to suggest a miniproject of your own.
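To make the "type structure" idea from option 2 more concrete, here is a minimal sketch, assuming the page is available as well-formed XML so that Python's xml.etree.ElementTree can stand in for the DOM. The three regular expressions are deliberately naive placeholders, and the labels date, price, name, and text, as well as the function names type_tokens and to_type_structure, are assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET

# Illustrative patterns only; real dates, prices, and names need far more care.
TYPE_PATTERNS = {
    "date": r"\d{1,2}/\d{1,2}/\d{4}",
    "price": r"\$\d+(?:\.\d{2})?",
    "name": r"[A-Z][a-z]+ [A-Z][a-z]+",
}
# One combined regex with a named group per type.
TOKEN_RE = re.compile(
    "|".join(f"(?P<{label}>{pattern})" for label, pattern in TYPE_PATTERNS.items())
)
GENERIC = "text"  # generic type for everything that matches no pattern


def type_tokens(text):
    """Return the sequence of type labels found in a string, left to right."""
    labels = []
    position = 0
    for match in TOKEN_RE.finditer(text):
        if text[position:match.start()].strip():
            labels.append(GENERIC)
        labels.append(match.lastgroup)  # name of the group that matched
        position = match.end()
    if text[position:].strip():
        labels.append(GENERIC)
    return labels


def to_type_structure(element):
    """Copy the tree, replacing every text run with empty elements named after its types."""
    typed = ET.Element(element.tag)
    for label in type_tokens(element.text or ""):
        ET.SubElement(typed, label)
    for child in element:
        typed.append(to_type_structure(child))
        # Text that follows a child belongs to the parent in ElementTree.
        for label in type_tokens(child.tail or ""):
            ET.SubElement(typed, label)
    return typed


# Hypothetical page fragment.
page = ET.fromstring(
    "<div><p>Elvis Presley released it on 03/23/1956 for $3.98</p>"
    "<p>No price listed</p></div>"
)
print(ET.tostring(to_type_structure(page), encoding="unicode"))
```

In this fragment the first paragraph reduces to the sequence name, text, date, text, price, and the second to a single generic text token. Pages that reduce to the same type structure are likely to share a template, which is one way the output could feed the clustering task in option 1.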
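For option 3, one simple way to use DOM location is to note which tag paths hold the names that are already known and then treat other strings found at those same paths as candidates. The sketch below assumes well-formed markup, exact case-insensitive matching, and a hypothetical min_hits threshold. The function names walk and candidate_names are invented for this example, and text that follows a child element is ignored for brevity.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def walk(element, path=""):
    """Yield (tag_path, text) for every element that has leading text of its own."""
    here = f"{path}/{element.tag}"
    text = (element.text or "").strip()
    if text:
        yield here, text
    for child in element:
        yield from walk(child, here)


def candidate_names(root, known_names, min_hits=2):
    """Return unknown strings that share a tag path with at least min_hits known names."""
    known = {name.lower() for name in known_names}
    entries = list(walk(root))
    hits = Counter(path for path, text in entries if text.lower() in known)
    name_paths = {path for path, count in hits.items() if count >= min_hits}
    return [
        text
        for path, text in entries
        if path in name_paths and text.lower() not in known
    ]


# Hypothetical list-of-matches page.
page = ET.fromstring(
    "<html><body><h1>Artists named Jackson</h1><ul>"
    "<li><a>Michael Jackson</a><span>Pop</span></li>"
    "<li><a>Janet Jackson</a><span>Soul</span></li>"
    "<li><a>Mahalia Jackson</a><span>Gospel</span></li>"
    "</ul></body></html>"
)
print(candidate_names(page, ["Michael Jackson", "Janet Jackson"]))
```

Here the two known names both sit at /html/body/ul/li/a, so the third string at that path, Mahalia Jackson, is proposed as a name, while the genre labels in the span elements are not. A fuller solution would weigh the share of known names at each path rather than apply a fixed threshold.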

Background: Knowledge of Java or Python would be helpful. Some knowledge of information retrieval and machine learning may be useful but is not required.

Deliverables: You should submit a report that clearly describes what you have learned and what you have accomplished. The report should include useful references. You should also provide any source code you may have written to validate your ideas.

Evaluation: You will be graded on the novelty and quality of your report and implementation.

......

Contact Information
leiyu
Knowlesys Software Inc.
86-0755-86032826
Website: http://www.knowlesys.com
Email: knowlesys@126.com

