Week 1 (June 4 – June 8) Report

Task 1: Introduction

Hello. My name is Suppawong Tuarob, and I am a second-year PhD student in Computer Science and Engineering at Penn State. My advisors are Dr. Prasenjit Mitra and Dr. C. Lee Giles. My research covers information retrieval, information extraction, machine learning, and social networks. Notable projects that my group at Penn State is working on include CiteseerX (a digital library for scholarly documents), ChemXSeer (a search engine for chemical information), and RefSeer (a citation recommendation system).

My work at Penn State involves three separate projects, briefly described below:

1. Algorithm Search

This project involves identifying and extracting algorithm representations and their metadata from documents, making them searchable, and discovering semantic relatedness among algorithms. I am also studying how algorithms influence each other over time.

2. Document Structure Analysis

I am also interested in document structure analysis. I implemented a document segmentation tool for scholarly documents that breaks a document down into sections. My next step is to (re)build the hierarchy of sections and recognize the semantics (or rather, the intention) of each section in a document.

3. Profile Similarity in Social Networks

I employ the hierarchy of topics extracted from Wikipedia to compute the similarity between two entities in a social network.
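To make the idea concrete, here is a minimal sketch of one way a topic hierarchy could feed a similarity score. It is not the method used in the project: the toy hierarchy `PARENT`, the helper names, and the choice of Jaccard overlap on ancestor-expanded topic sets are all assumptions made for illustration.

```python
# Toy topic hierarchy (child -> parent). Hypothetical data, not extracted from Wikipedia.
PARENT = {
    "machine learning": "computer science",
    "information retrieval": "computer science",
    "ecology": "biology",
    "computer science": "science",
    "biology": "science",
}


def with_ancestors(topics):
    """Return the given topics plus every ancestor reachable through PARENT."""
    expanded = set()
    for topic in topics:
        # Walk up the hierarchy until the root or an already-visited topic.
        while topic is not None and topic not in expanded:
            expanded.add(topic)
            topic = PARENT.get(topic)
    return expanded


def hierarchy_similarity(topics_a, topics_b):
    """Jaccard similarity of the ancestor-expanded topic sets (assumed measure)."""
    a, b = with_ancestors(topics_a), with_ancestors(topics_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical profiles, each described by a set of topics.
alice = {"machine learning"}
bob = {"information retrieval"}
carol = {"ecology"}

print(hierarchy_similarity(alice, bob))
print(hierarchy_similarity(alice, carol))
```

In this toy setup, two profiles with no topic in common can still score above zero when their topics share an ancestor, which is the main reason to bring a hierarchy into the comparison.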

Feel free to browse my website: http://www.personal.psu.edu/szt5115/

Task 2: Visiting ORNL

This week I visited ORNL on June 7-8 to meet with my mentors. The first day involved a series of presentations to familiarize me with DataONE, ONEMercury, and other related systems such as ORNL DAAC, Dryad, and Hive.

Task 3: Obtain Accesses to Necessary Resources

I have obtained an ORNL account, which allows me to log into the ORNL system. However, I still cannot access the repository where the index files of ORNL DAAC reside. I also still need to obtain the Dryad and KNB data sets.

Task 4: Rough Outline of the Problem

Following the discussions during my visit, I would like to outline the problem that I will be working on for the rest of the internship. ONEMercury is a search engine for metadata of environment-related data. The system currently harvests metadata from three sources: ORNL DAAC (http://daac.ornl.gov/), Dryad (http://datadryad.org/), and KNB. A recurring problem when harvesting metadata from different sources is that the records vary in quality, leading to a mixture of very rich (well annotated) and very poor (poorly annotated) metadata records. Since ONEMercury is a text-based search engine, the poorly annotated records contain few terms for a query to match and are therefore unlikely to appear in the search results.

Hence, the problem that I will focus on is finding ways to enrich the poorly annotated metadata records to improve the coverage of the search. To attack this problem, I would like to frame it as a tag recommendation problem. This way, we can try multiple techniques and evaluate them efficiently within the tight time frame.
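As a rough illustration of the tag recommendation framing, the sketch below suggests tags for a sparse record by borrowing them from the most textually similar well-annotated records. This is only one simple baseline, not a chosen approach. The record layout (`text`, `tags`), the sample records, and the function names are assumptions invented for this example and do not reflect the actual ONEMercury metadata schema.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap between two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0


def recommend_tags(description, annotated, k=2, top_n=3):
    """Suggest tags by similarity-weighted voting over the k nearest annotated records."""
    query = tokens(description)
    scored = [(jaccard(query, tokens(rec["text"])), rec) for rec in annotated]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter()
    for score, rec in scored[:k]:
        if score > 0:  # ignore neighbours with no textual overlap at all
            for tag in rec["tags"]:
                votes[tag] += score
    return [tag for tag, _ in votes.most_common(top_n)]


# Hypothetical well-annotated records (not real ORNL DAAC, Dryad, or KNB data).
ANNOTATED = [
    {"text": "soil carbon measurements in boreal forest plots",
     "tags": ["soil", "carbon", "forest"]},
    {"text": "forest canopy carbon flux tower data",
     "tags": ["carbon", "forest", "flux"]},
    {"text": "bird migration counts along the coast",
     "tags": ["birds", "migration"]},
]

# A poorly annotated record: a short description and no tags.
print(recommend_tags("carbon data from forest plots", ANNOTATED))
```

Framed this way, evaluation can be automated: hide the existing tags of well-annotated records, recommend tags from the remaining text, and measure how many of the hidden tags are recovered.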

Next Week:

I will look into the literature that addresses similar problems, explore the approaches that have been tried, and discuss whether they can be applied to our problem. I will also define the problem more concretely and come up with a rough evaluation plan.
