StructuredDataFromWikiPages
What (summary)
Extract the data that users have entered onto wiki pages and turn it into structured data for easier manipulation.
We need to extract:
- Contact info
- Address
- Phone #
Why this is important
We are moving towards using more highly-structured data, but need to leverage the large quantity of data users have entered onto our site.
DoneDone
- Easily identify which data has been added to or changed on a wiki page by human edits. (A standard diff may work? See the sketch after this list.)
- Apply heuristics (section placement, regular expressions, machine learning, something else) to determine whether a piece of data belongs in a structured field.
- If all human-added data can be extracted, indicate that the entire wiki page should be deleted.
- If any human-added data remains that can't be identified and extracted, return wikitext containing only the unidentified human data, with all bot-created data removed.
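A minimal sketch of the identification step, assuming each page's original bot scrape and current revision are available as plain wikitext strings. It uses Python's difflib to test the "standard diff" idea; the function name and inputs are illustrative, not an existing API.

```python
import difflib

def human_added_lines(bot_text: str, current_text: str) -> list[str]:
    """Return lines present in the current revision but absent from the bot scrape."""
    diff = difflib.unified_diff(
        bot_text.splitlines(),
        current_text.splitlines(),
        lineterm="",
    )
    # Keep only added lines, skipping the "+++" file header.
    return [line[1:] for line in diff
            if line.startswith("+") and not line.startswith("+++")]
```

Note that a plain line diff misses human edits made inside a bot-written line, so a word-level pass (e.g. difflib.SequenceMatcher) may be needed for changed lines.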
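For the classification and cleanup items, a hypothetical sketch: regex heuristics tag each human-added line as a known field, and the page is flagged for deletion only when every line was classified. The patterns and the process() helper are assumptions, not existing code; real heuristics would also consider which section a line sits in.

```python
import re

# Hypothetical patterns for the fields we need to extract.
FIELD_PATTERNS = {
    "phone": re.compile(r"\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "address": re.compile(r"\d+\s+\w+.*\b(St|Ave|Rd|Blvd|Dr)\b\.?", re.I),
}

def classify(line):
    """Return (field, value) for the first matching pattern, else None."""
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(line)
        if match:
            return field, match.group()
    return None

def process(added_lines):
    """Split human-added lines into extracted fields and leftover wikitext."""
    extracted, leftovers = [], []
    for line in added_lines:
        hit = classify(line)
        if hit:
            extracted.append(hit)
        else:
            leftovers.append(line)
    return {
        "extracted": extracted,
        "leftover_wikitext": "\n".join(leftovers),
        # Delete the page only if no human-added data is left behind.
        "delete_page": not leftovers,
    }
```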
Steps to get to DoneDone
- Build many test cases: pick many random human-edited pages.
  - Pull out revision histories (or at least diffs to compare to the original bot scrape); a fetch sketch follows this list.
  - Identify and extract human-edited data yourself. Great fun!
- Make the test cases pass. (In the order below? A skeleton test also follows this list.)
  - First identify all human-edited data.
  - Then classify and extract said data.
  - Then determine if a page should be deleted and, if not, which data should be left behind.
- Throw a wild and crazy party
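For pulling revision histories, assuming the site runs MediaWiki with the standard api.php endpoint enabled, something like this could collect a page's revisions; the endpoint URL is a placeholder.

```python
import json
import urllib.parse
import urllib.request

API = "https://example.org/w/api.php"  # placeholder; point at the real wiki

def fetch_revisions(title, limit=50):
    """Fetch up to `limit` revisions (newest first) of one wiki page."""
    params = urllib.parse.urlencode({
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "user|timestamp|content",
        "rvlimit": limit,
        "format": "json",
    })
    with urllib.request.urlopen(f"{API}?{params}") as resp:
        data = json.load(resp)
    page = next(iter(data["query"]["pages"].values()))
    return page.get("revisions", [])
```

The oldest revision should be (or closely match) the original bot scrape, which gives the baseline for the diff step.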
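And a skeleton for the test cases themselves, using unittest with one tiny hand-built fixture. The imported module name is hypothetical (wherever the sketches above end up living), and the fixture text is invented for illustration.

```python
import unittest

# Hypothetical module containing the sketches above.
from extractor import human_added_lines, process

BOT_TEXT = "Acme Soup Kitchen\n123 Main St.\n"
HUMAN_TEXT = "Acme Soup Kitchen\n123 Main St.\nCall 555-123-4567\n"

class TestExtraction(unittest.TestCase):
    def test_identify_human_edits(self):
        self.assertEqual(human_added_lines(BOT_TEXT, HUMAN_TEXT),
                         ["Call 555-123-4567"])

    def test_classify_and_extract(self):
        result = process(human_added_lines(BOT_TEXT, HUMAN_TEXT))
        self.assertEqual(result["extracted"], [("phone", "555-123-4567")])

    def test_delete_decision(self):
        result = process(human_added_lines(BOT_TEXT, HUMAN_TEXT))
        self.assertTrue(result["delete_page"])

if __name__ == "__main__":
    unittest.main()
```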