StructuredDataFromWikiPages
What (summary)
Extract the data that users have entered onto wiki pages and turn it into structured data for easier manipulation.
We need to extract:
- Contact info
- Address
- Phone #
Why this is important
We are moving towards using more highly-structured data, but need to leverage the large quantity of data users have entered onto our site.
DoneDone
- Easily identify which data has been added to or changed on a wiki page by human edits. (A standard diff may work? See the sketch after this list.)
- Apply heuristics (section placement, regular expressions, machine learning, something else) to determine whether a piece of data belongs in a structured field.
- If all human-added data can be extracted, indicate that the entire wiki page should be deleted.
- If any human-added data remains that can't be identified and extracted, return wikitext containing only the unidentified human data, with all bot-created data removed.
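A minimal sketch of the identification step, assuming each page's original bot scrape and current revision are available as plain wikitext strings. It uses Python's difflib to test the "standard diff" idea; the function name and inputs are illustrative, not an existing API.

```python
import difflib

def human_added_lines(bot_text: str, current_text: str) -> list[str]:
    """Return lines present in the current revision but absent from the bot scrape."""
    diff = difflib.unified_diff(
        bot_text.splitlines(),
        current_text.splitlines(),
        lineterm="",
    )
    # Keep only added lines, skipping the "+++" file header.
    return [line[1:] for line in diff
            if line.startswith("+") and not line.startswith("+++")]
```

Note that a plain line diff misses human edits made inside a bot-written line, so a word-level pass (e.g. difflib.SequenceMatcher) may be needed for changed lines.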
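For the classification and cleanup items, a hypothetical sketch: regex heuristics tag each human-added line as a known field, and the page is flagged for deletion only when every line was classified. The patterns and the process() helper are assumptions, not existing code; real heuristics would also consider which section a line sits in.

```python
import re

# Hypothetical patterns for the fields we need to extract.
FIELD_PATTERNS = {
    "phone": re.compile(r"\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "address": re.compile(r"\d+\s+\w+.*\b(St|Ave|Rd|Blvd|Dr)\b\.?", re.I),
}

def classify(line):
    """Return (field, value) for the first matching pattern, else None."""
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(line)
        if match:
            return field, match.group()
    return None

def process(added_lines):
    """Split human-added lines into extracted fields and leftover wikitext."""
    extracted, leftovers = [], []
    for line in added_lines:
        hit = classify(line)
        if hit:
            extracted.append(hit)
        else:
            leftovers.append(line)
    return {
        "extracted": extracted,
        "leftover_wikitext": "\n".join(leftovers),
        # Delete the page only if no human-added data is left behind.
        "delete_page": not leftovers,
    }
```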
Steps to get to DoneDone
- Build many test cases: pick many random human-edited pages.
  - Pull out revision histories (or at least diffs to compare to the original bot scrape); a fetch sketch follows this list.
  - Identify and extract human-edited data yourself. Great fun!
- Make the test cases pass. (In the order below? A skeleton test also follows this list.)
  - First identify all human-edited data.
  - Then classify and extract said data.
  - Then determine if a page should be deleted and, if not, which data should be left behind.
- Throw a wild and crazy party
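For pulling revision histories, assuming the site runs MediaWiki with the standard api.php endpoint enabled, something like this could collect a page's revisions; the endpoint URL is a placeholder.

```python
import json
import urllib.parse
import urllib.request

API = "https://example.org/w/api.php"  # placeholder; point at the real wiki

def fetch_revisions(title, limit=50):
    """Fetch up to `limit` revisions (newest first) of one wiki page."""
    params = urllib.parse.urlencode({
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "user|timestamp|content",
        "rvlimit": limit,
        "format": "json",
    })
    with urllib.request.urlopen(f"{API}?{params}") as resp:
        data = json.load(resp)
    page = next(iter(data["query"]["pages"].values()))
    return page.get("revisions", [])
```

The oldest revision should be (or closely match) the original bot scrape, which gives the baseline for the diff step.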
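And a skeleton for the test cases themselves, using unittest with one tiny hand-built fixture. The imported module name is hypothetical (wherever the sketches above end up living), and the fixture text is invented for illustration.

```python
import unittest

# Hypothetical module containing the sketches above.
from extractor import human_added_lines, process

BOT_TEXT = "Acme Soup Kitchen\n123 Main St.\n"
HUMAN_TEXT = "Acme Soup Kitchen\n123 Main St.\nCall 555-123-4567\n"

class TestExtraction(unittest.TestCase):
    def test_identify_human_edits(self):
        self.assertEqual(human_added_lines(BOT_TEXT, HUMAN_TEXT),
                         ["Call 555-123-4567"])

    def test_classify_and_extract(self):
        result = process(human_added_lines(BOT_TEXT, HUMAN_TEXT))
        self.assertEqual(result["extracted"], [("phone", "555-123-4567")])

    def test_delete_decision(self):
        result = process(human_added_lines(BOT_TEXT, HUMAN_TEXT))
        self.assertTrue(result["delete_page"])

if __name__ == "__main__":
    unittest.main()
```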