Difference between revisions of "WhoisParsing"

Revision as of 05:48, 26 November 2007

What (summary)

Pull out all pieces of information from the whois record across all domain registrars. WhoisParsing is one Task in the larger WhoisRefresh Project.

Why this is important

  • This project will give us access to whois record fields (administrative contacts, technical contacts, domain info, etc.) that can be used to update the information on all the web pages currently hosted on AboutUs.
  • Lets us add more fields to each domain, such as the expiration date.
  • Gives us mastery over our technology. We can change it easily, or adapt it for a different problem.
  • Allows us to turn off Apache and the large Java tomcat process on our database master.

DoneDone

  • A stable of interesting test cases for each of the 50 largest registrars has been created and hand audited.
  • All of the test cases are passing.

Steps to DoneDone

  • Write the whois record parsers for the top 20 registrars.
  • Pass one or two test cases for each of these 20 registrars.
  • Write address parsers for the top 20 countries and pass test cases for different formats of addresses.
    • Australia
    • Brazil
    • China
    • Canada
    • Germany
    • Japan
    • Netherlands
    • Spain
    • United States
    • United Kingdom
  • Write a wrapper that takes a domain name, fetches the whois record, and then calls the parser on this record.
  • Write more test cases for the top 20 registrars
    • Belgium Domains
    • Capitol Domain
    • DirectNic
    • Domain Discover
    • Domain Doorman
    • Dot Register
    • Dotster
    • Enom
    • Fabulous
    • GoDaddy
    • Key Systems
    • Melbourne IT
    • Moniker
    • NameKing
    • Network Solutions
    • Register.com
    • Schlund Partner
    • Tucows
    • Wild West
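
As a rough illustration of the wrapper step listed above, the following Python sketch fetches a record over the plain-text whois protocol (TCP port 43) and hands it to a parser. The function names, the default server whois.verisign-grs.com, and the choice of Python are assumptions made for this example; the page does not say how the project's own wrapper is built.

```python
import socket
from typing import Callable, Dict


def fetch_whois(domain: str, server: str = "whois.verisign-grs.com", port: int = 43) -> str:
    """Fetch the raw whois text for a domain (server name is an assumption)."""
    with socket.create_connection((server, port), timeout=10) as sock:
        # The whois protocol is a single query line terminated by CRLF.
        sock.sendall((domain + "\r\n").encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_domain(
    domain: str,
    parse: Callable[[str], Dict],
    fetch: Callable[[str], str] = fetch_whois,
) -> Dict:
    """Hypothetical wrapper: fetch the record, then call the parser on it."""
    record = fetch(domain)
    return parse(record)
```

Passing the fetch function as a parameter keeps the wrapper testable without network access: a test can supply a stored whois record instead of querying a live server.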

Nov 26 2007

Objectives

  • Write 5 tests for the Default Parser.
  • Run the parser on 8779 records and measure how long it takes.
  • Refactor the Contact Parser and move the common parts into the parent class.
  • Find out why the SRS Plus and Domain People parsers are slow and fix them.
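
The refactoring objective above could look roughly like the sketch below: shared logic lives in a parent class, and a registrar-specific subclass overrides only what differs. All class and method names are hypothetical, and the "ATTN:" rule is taken from the JennaJoy.com sample later on this page, not from the project's code.

```python
class ContactParser:
    """Hypothetical parent class holding logic shared by all contact parsers."""

    def parse(self, block: str) -> dict:
        # Shared step: drop blank lines and surrounding whitespace.
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        return {"name": self.extract_name(lines), "lines": lines}

    def extract_name(self, lines):
        # Default rule: the first non-empty line is the contact name.
        return lines[0] if lines else None


class NetworkSolutionsContactParser(ContactParser):
    """Hypothetical subclass overriding only the registrar-specific rule."""

    def extract_name(self, lines):
        # Some records start with "ATTN: ..." where the name would be,
        # so treat the name as missing in that case.
        if lines and lines[0].upper().startswith("ATTN:"):
            return None
        return super().extract_name(lines)
```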


Status: Nov 23 2007

  • Updated a few parsers to fix issues in parsing.
  • Ran the WhoisParser (including the Default Parser) on 8779 records and fixed the parsers accordingly, with the following results:

Passing records: 7374; failing records: 1405

  • Continued work on the default parser. Ran the default parser on the same record set to extract the raw address, with the following results:

Passing records: 6501; failing records: 2278

Looks great! How did you arrive at these numbers? Is it possible to create ground-truth test cases for a representative sample, so that we have hard numbers on what percentage of the parsed records are correct and a better sense of how reliable that number is? Also, how long does it take to run on all of the records? -- Brandon 21:09, 25 November 2007 (PST)
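
One possible way to produce the numbers asked about here is sketched below in Python: a batch runner that counts passing and failing records and accumulates wall-clock time per registrar, plus an accuracy check against hand-audited ground truth. The function names and the (registrar, raw_text) record shape are assumptions for illustration, not the project's actual harness.

```python
import time
from collections import defaultdict


def run_batch(records, parse):
    """Run parse over (registrar, raw_text) pairs; count results and time spent.

    A record counts as passing when parse returns a non-empty result
    and does not raise.
    """
    passed = failed = 0
    seconds = defaultdict(float)
    for registrar, raw in records:
        start = time.perf_counter()
        try:
            ok = bool(parse(raw))
        except Exception:
            ok = False
        # Per-registrar totals help point at slow parsers.
        seconds[registrar] += time.perf_counter() - start
        if ok:
            passed += 1
        else:
            failed += 1
    return {"passed": passed, "failed": failed, "seconds": dict(seconds)}


def accuracy(parsed, truth):
    """Fraction of hand-audited records whose parsed output matches exactly."""
    if not truth:
        return 0.0
    correct = sum(1 for key, expected in truth.items() if parsed.get(key) == expected)
    return correct / len(truth)
```

Note that a "passing" record here only means the parser returned something; the accuracy function is what measures correctness against the audited sample.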

Plan for Tomorrow

  • Write the test cases for the Default parser.
  • Manually verify the results.
  • Optimize the SRSPLUS parser, which is currently slow.

Status: Nov 22 2007

  • We added another 2 registrars.
  • Started work on the default parser. We are following the multi-line hash strategy with some tweaks. So far it finds something in around 2500 of roughly 8500 records, but it still needs improvement.
  • Corrected small problems in the registrar parsers that were identified by running them on a large data set.
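
The page does not spell out the multi-line hash strategy. One plausible reading, sketched here in Python purely as an assumption, is to treat each "Heading:" line as a hash key and collect the lines that follow it as that key's value. The regular expression and function name are illustrative only.

```python
import re

# A heading is a line that starts with letters and reaches a colon
# without hitting a comma or digit first, e.g. "Administrative Contact:".
HEADER = re.compile(r"^\s*([A-Za-z][A-Za-z /]*?)\s*:\s*(.*)$")


def multiline_hash(record: str) -> dict:
    """Group the lines of a whois record under the heading that precedes them."""
    result = {}
    current = None
    for line in record.splitlines():
        match = HEADER.match(line)
        if match:
            current = match.group(1).strip().lower()
            values = result.setdefault(current, [])
            inline = match.group(2).strip()
            if inline:  # value on the same line as the heading
                values.append(inline)
        elif line.strip() and current is not None:
            result[current].append(line.strip())
    return result
```

A sketch like this will misread lines such as "ATTN: JENNAJOY.COM" as headings, which is the kind of case the tweaks mentioned above would have to handle.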

Plan for Tomorrow

  • We plan to finish the top 50 registrars tomorrow, which cover 90% of all domains. Only 2 or 3 are left, for which we don't have any test cases yet.
  • Continue on the default parser so that it can be finished by tomorrow.
  • Continue the exercise of running the parsers on large data sets and identifying problems and correcting them.
Sounds good. For today's report please start indicating the size of the dataset you're testing on and the results you are getting. Then, each day that we continue to work on this task you can report the results that you are getting at the end of the day and we can easily quantify our progress. -- Brandon 22:21, 22 November 2007 (PST)

Status: Nov 21 2007

  • We added around 8-9 registrars.
  • Corrected small problems in the registrar parsers that were identified by running them on a large data set.
  • Added a few more test cases for the parsers.

Plan for Tomorrow

  • We plan to add a few more registrars tomorrow.
  • Work on a default parser that can extract the address from any registrar's whois record.
  • Continue the exercise of running the parsers on large data sets and identifying problems and correcting them.
Excellent! The plan sounds great ... especially the default parser :-) Keep rockin! --Brandon 21:30, 21 November 2007 (PST)

Testing

For each "Test Domain" included in these tallies, we have an exhaustive test that is passing.

S#  Registrar           # "Interesting" Test Domains  Comment
1   GoDaddy             0 ToSkip
2   Melbourne IT        0 ToSkip
3   Enom                0 ToSkip
4   Network Solutions   0 ToSkip                      Most records have Domain Status set to nil.
5   Belgium Domains     0 ToSkip
6   Tucows              0 ToSkip                      Many UK tests failing; some have postal codes the same as the state.
7   Beijing Innovative  0 ToSkip


Formats of WhoIs For Registrars

Network Solutions

AEroiNstruments.com: Without Country in Address

c/o Network Solutions
P.O. Box 447  
Herndon, VA.  20172-0447

AEroiNstruments.com: Administrative Contact and Technical Contact either on the same line or, as here, on separate lines

Administrative Contact:
Technical Contact:

ArRail-Dental.com: China Address

     Y.P.Sun       
     Y.P.Sun
     Beijing Shengbin Company Limited
     304 Citic Building 2 No 19
     Jianguomenwai
     Beijing 100004
     100004
     CHN 
     999 999 9999 fax: 999 999 9999

Broadland-Gas.com: UK address

     BROADLAND GAS     
     35 The Street
     CARLTON  COLVILLE
     LOWESTOFT, SUFFOLK NR33 8JP 
     UK  

CommerceCenter.com: UK Address

  Lorien
  Felstead, Essex CM6 3LR 
  Felstead, ESSEX CM6 3LR 
  UK  

DuoCuisines.com: France Address

  REVANCHE
  20 Rue Bernard Lazare
  LE CAILAR 30740
  FR  

Hricn.com: China Address

  jin yuan, xiong
  Guizhou huangping jiuzhouzhongxue
  Guizhou, Guizhou 550000
  CN  

JennaJoy.com: Name Missing

     Technical Contact:
           
     ATTN: JENNAJOY.COM
     c/o Network Solutions
     P.O. Box 447
     Herndon, VA 20172-0447
     570-708-8780

OnegaTelecoms.com: City and State are same

  Unit 2 Ground Floor Caxton Street
  Studios Caxton Street North
  London, LONDON E16 1JL 
  GB  
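
The UK samples above (Broadland-Gas.com, CommerceCenter.com, OnegaTelecoms.com) all carry a "town, county postcode" line. A minimal Python sketch for splitting that line follows; the function name and the simplified postcode pattern are assumptions for this example and do not cover every valid UK postcode.

```python
import re

# Simplified UK postcode shape: 1-2 letters, a digit, an optional letter or
# digit, then a digit and two letters (e.g. "NR33 8JP", "CM6 3LR", "E16 1JL").
UK_LINE = re.compile(
    r"^(?P<place>.+?)\s+(?P<postcode>[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2})$",
    re.IGNORECASE,
)


def parse_uk_locality(line: str):
    """Split 'TOWN, COUNTY POSTCODE' into its parts; return None if no match."""
    match = UK_LINE.match(line.strip())
    if not match:
        return None
    parts = [p.strip() for p in match.group("place").split(",")]
    city = parts[0]
    county = parts[1] if len(parts) > 1 else None
    return {
        "city": city,
        "county": county,
        "postcode": match.group("postcode").upper(),
        # Flags records like OnegaTelecoms.com where city and county repeat.
        "city_equals_county": county is not None and city.lower() == county.lower(),
    }
```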

TradeFootball.com: IE address

  10b Beckett Way 
  Parkwest Business Park, Clondalkin D22 00000
  IE  

XmlConference.com: Fax is NULL

     Daste, Kevin      
     323 Pine St
     New Orleans, LA 70118
     US
     504-208-1566 fax: null
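
For the US-style samples above, the "City, ST ZIP" line and the "phone fax: ..." line can be split with two small patterns. This Python sketch is only an illustration; the function names are hypothetical, and treating a literal "null" fax as missing is an assumption based on the XmlConference.com sample.

```python
import re

# "Herndon, VA.  20172-0447" or "New Orleans, LA 70118"
US_LINE = re.compile(
    r"^(?P<city>[^,]+),\s*(?P<state>[A-Z]{2})\.?\s+(?P<zip>\d{5}(?:-\d{4})?)$"
)
# "504-208-1566 fax: null" or just "570-708-8780"
PHONE_LINE = re.compile(
    r"^(?P<phone>[\d\s().+-]+?)(?:\s+fax:\s*(?P<fax>.*))?$",
    re.IGNORECASE,
)


def parse_us_locality(line: str):
    """Split 'City, ST ZIP' into city, state and ZIP; None if no match."""
    match = US_LINE.match(line.strip())
    return match.groupdict() if match else None


def parse_phone_line(line: str):
    """Split 'phone fax: value'; an empty or literal 'null' fax becomes None."""
    match = PHONE_LINE.match(line.strip())
    if not match:
        return None
    fax = match.group("fax")
    if fax is not None:
        fax = fax.strip()
        if fax.lower() in ("", "null"):
            fax = None
    return {"phone": match.group("phone").strip(), "fax": fax}
```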


Retrieved from "http://aboutus.com/index.php?title=WhoisParsing&oldid=12610470"