WhoisParsing

OurWork WhoisParsing (Arif Iqbal) (7-10)

What (summary)

Pull out all pieces of information from the whois record across all domain registrars. WhoisParsing is one Task in the larger WhoisRefresh Project.

Why this is important

  • This project will help us get access to whois record fields (such as administrative contacts, technical contacts, and domain information) that could be used to update the information on all the web pages currently hosted on AboutUs.
  • Lets us add more fields to each domain, such as the expiration date.
  • Gives us mastery over our technology. We can change it easily, or adapt it for a different problem.
  • Allows us to turn off Apache and the large Java Tomcat process on our database master.

DoneDone

  • A stable of interesting test cases for each of the 50 largest registrars has been created and hand-audited.
  • All of the test cases are passing.

Dec 3, 2007

Objectives

  • Update the default parser and contact parser to handle the failing cases, especially the hashed ones.

Nov 30, 2007

Objectives

  • Make the ContactParser fast. The current problem is that the regular expressions have too many optional parts.
  • Run the latest parser on the 9000 Whois Records and do an analysis on the time taken, and the failed results.

Achieved

  • After tweaking the ContactParsers a bit, it seems to be working much faster now. The stats after running the parser (this includes applying the ContactParsers) on 8777 records are:
Contact Information Extracted     : 7475
Contact Information Not Extracted : 1302
Time taken in seconds             : 688.200859

Status of Failing Cases

Of the 1302 failing records, around 70% don't have any data associated with them, i.e.

  • Empty Whois Record  : 849 (65.2%)
  • No Contact Information  : 45 (3.4%)

The remaining records have contact information that could not be parsed by the DefaultParser:

  • Hashed kind of information that can be handled  : 232 (17.8%)
  • Only Contact Info available under heading of "Contact Info(rmation):": 3 (0.2%)
  • Easy, can be handled  : 156 (11.9%)
  • Improper Format  : 17 (1.3%)
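
The breakdown above can be checked with a few lines of arithmetic. This is only a sketch that reuses the counts reported in this entry; the shortened category labels are ours, chosen for illustration.

```python
# Counts reported for the Nov 30, 2007 run (8777 records, 1302 failing).
failing_total = 1302
breakdown = {
    "empty whois record": 849,
    "no contact information": 45,
    "hashed information": 232,
    "only 'Contact Info(rmation):' heading": 3,
    "easy, can be handled": 156,
    "improper format": 17,
}

# The categories should account for every failing record.
assert sum(breakdown.values()) == failing_total

for label, count in breakdown.items():
    share = 100.0 * count / failing_total
    print(f"{label:40s} {count:5d}  {share:5.1f}%")

# Records with nothing to parse (empty, or no contact information at all).
no_data = breakdown["empty whois record"] + breakdown["no contact information"]
no_data_share = 100.0 * no_data / failing_total

# Overall throughput of the run.
records = 8777
seconds = 688.200859
records_per_second = records / seconds
print(f"no data: {no_data} ({no_data_share:.1f}%), {records_per_second:.1f} records/s")
```

The percentages listed above look truncated rather than rounded, so the last digit printed here can differ by 0.1 for some categories.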

Nov 29, 2007

Objectives

Nov 28, 2007

Objectives

Today would be a great day if we can accomplish the following

  • Test the Contact Parser function that applies all the contact parsers serially (generic ContactParser). Arif
  • Get the 20 tests passing for the Default Parser. Laiq
  • Remove the parse contact function from the registrar parser and call the ContactParser class for parsing the contact information. Hassan|Arif

Achieved

  • The tests for Default Parser are passing.
  • The Contact Parser function has been tested; it is passing 27 out of 27 tests written for it.
  • Merged the parse contact functions of around 30 parsers.

Nov 27, 2007

Objectives

Today would be a great day if we can accomplish the following

  • Write 20 tests for the Default Parser. Laiq
  • Write all possible contact parsers in a single ContactParser class and write its test cases. Write the REs so that they cover more cases and are not overly restrictive. Arif | Hassan

Achieved

  • Tests for the default parser have been written, but they are not passing yet.
  • The ContactParser is done, with some generic parse contact functions and a function that applies these parsers serially. We have updated a few parsers but still need to update the remaining ones to start using the ContactParser.
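
A minimal sketch of the "apply the parsers serially" idea, assuming each contact parser is a function that returns a dict on success and None otherwise. The two patterns, the function names and the sample email address are hypothetical and are not taken from the project's code.

```python
import re
from typing import Callable, Dict, List, Optional

Contact = Dict[str, str]
ContactFn = Callable[[str], Optional[Contact]]

# Hypothetical format 1: "Name (email)" on a single line.
_NAME_EMAIL = re.compile(
    r"^\s*(?P<name>[^()\n]+?)\s*\((?P<email>[^()\s]+@[^()\s]+)\)\s*$", re.M)
# Hypothetical format 2: labelled "Name: ..." and "Email: ..." lines.
_LABELLED = re.compile(
    r"^\s*(?P<key>Name|Email)\s*:\s*(?P<value>.+?)\s*$", re.M | re.I)


def parse_name_email(block: str) -> Optional[Contact]:
    match = _NAME_EMAIL.search(block)
    if match is None:
        return None
    return {"name": match.group("name"), "email": match.group("email")}


def parse_labelled(block: str) -> Optional[Contact]:
    found = {m.group("key").lower(): m.group("value")
             for m in _LABELLED.finditer(block)}
    return found if "name" in found else None


class ContactParser:
    """Tries each contact parser in order and returns the first result."""

    def __init__(self, parsers: List[ContactFn]) -> None:
        self.parsers = parsers

    def parse(self, block: str) -> Optional[Contact]:
        for parser in self.parsers:
            result = parser(block)
            if result is not None:
                return result
        return None  # no parser recognised the block


contact_parser = ContactParser([parse_name_email, parse_labelled])
print(contact_parser.parse("Daste, Kevin (kevin@example.com)"))
```

Keeping each pattern small and trying them in order is one way to avoid a single RE with many optional parts; whether it is faster on real records would have to be measured.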

Nov 26, 2007

Objectives

  • Write 5 tests for the Default Parser.
  • Run the parser on 8779 records and find out the time it takes to run on them.
  • Refactor the Contact Parser and move the common ones into the parent class.
  • Find the reasons for the slowness of SRS Plus and Domain People and fix them.

Outcome

  • Result Statistics:
Total Records: 8777
Passing records are: 7374
Failing records are: 1403
Time taken: 358.803416 secs
  • Merged the contact parsers of three registrars. A few others are in progress, but a few cases are failing.
  • Two test cases written for the default parser. We plan to get the contact parsers done first, since otherwise the default parser's tests won't pass, and then concentrate fully on the test cases.
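
The pass/fail tally and timing reported above could be produced by a small harness along these lines. This is a sketch under assumptions: `run_batch` and `toy_parse` are hypothetical names, and a record counts as passing when the parser returns a non-empty result.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def run_batch(records: Iterable[str],
              parse: Callable[[str], Optional[dict]]) -> Dict[str, float]:
    """Run `parse` over every record and tally passes, failures and time."""
    passing = failing = 0
    start = time.perf_counter()
    for record in records:
        try:
            result = parse(record)
        except Exception:
            result = None  # a crash counts as a failure; the run continues
        if result:
            passing += 1
        else:
            failing += 1
    elapsed = time.perf_counter() - start
    return {"total": passing + failing, "passing": passing,
            "failing": failing, "seconds": elapsed}


def toy_parse(record: str) -> Optional[dict]:
    """Hypothetical stand-in parser: succeeds when a contact heading exists."""
    heading = "Administrative Contact:"
    if heading in record:
        return {"raw_address": record.split(heading, 1)[1].strip()}
    return None


stats = run_batch(["Administrative Contact:\n  Daste, Kevin", "", "no contact"],
                  toy_parse)
print(stats)
```

Catching exceptions per record keeps one bad whois record from aborting a run over thousands of files.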

Status: Nov 23 2007

  • Updated a few parsers to fix issues in parsing.
  • Ran the WhoisParser (including the Default Parser) on 8779 records and fixed the parsers accordingly, with the following results:

Passing Records : 7374
Failing Records : 1405

  • Continued work on the default parser. Ran the Default parser on the same record set to extract the raw address with the following results

Passing Records : 6501
Failing Records : 2278

Looks great! How did you arrive at these numbers? Is it possible to create ground truth test cases for a representative sample so that we have some hard numbers on what percentage of the parsed records are correct so that we can have a better sense of how reliable that number is? Also how long to run all of the record? -- Brandon 21:09, 25 November 2007 (PST)
So far, it has been on the basis of the raw_address output, just looking at the files in which we generate the raw_address. The output looks reasonably good when checked manually.
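
One way to get the hard numbers asked for above is to score the parser against a hand-audited sample. The sketch below assumes a ground truth of (record, expected fields) pairs; `score_against_ground_truth` and the deliberately naive `last_line_country` parser are hypothetical, and the sample records are shortened from the Network Solutions formats listed at the end of this page.

```python
from typing import Callable, Dict, List, Tuple

Fields = Dict[str, str]


def score_against_ground_truth(
    cases: List[Tuple[str, Fields]],
    parse: Callable[[str], Fields],
) -> Tuple[int, int, List[int]]:
    """Return (correct, total, indexes of wrong cases).

    A case counts as correct only when every hand-audited field is
    reproduced exactly by the parser.
    """
    wrong = []
    for index, (record, expected) in enumerate(cases):
        parsed = parse(record) or {}
        if any(parsed.get(key) != value for key, value in expected.items()):
            wrong.append(index)
    total = len(cases)
    return total - len(wrong), total, wrong


def last_line_country(record: str) -> Fields:
    """Naive stand-in parser: treats the last non-empty line as the country."""
    lines = [line.strip() for line in record.splitlines() if line.strip()]
    return {"country": lines[-1]} if lines else {}


cases = [
    ("BROADLAND GAS\n35 The Street\nLOWESTOFT, SUFFOLK NR33 8JP\nUK",
     {"country": "UK"}),
    ("REVANCHE\n20 Rue Bernard Lazare\nLE CAILAR 30740\nFR",
     {"country": "FR"}),
    ("Daste, Kevin\n323 Pine St\nNew Orleans, LA 70118\nUS\n504-208-1566 fax: null",
     {"country": "US"}),
]
correct, total, wrong = score_against_ground_truth(cases, last_line_country)
print(f"{correct}/{total} correct, wrong cases: {wrong}")
```

The third case fails on purpose (the last line is the phone number, not the country), which shows how the list of wrong indexes points at the records to inspect by hand.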

Plan for Tomorrow

  • Write the test cases for the Default parser.
  • Manually verify the results.
  • Update the SRSPLUS parser to fix its slowness.

Status: Nov 22 2007

  • We added another 2 registrars.
  • Started work on the default parser. We are following the multi-line hash strategy with some tweaks. So far it finds something in around 2500 of around 8500 records, but it still needs improvement.
  • Corrected small problems in the registrar parsers that were identified by running the parsers on a large data set.
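
The multi-line hash strategy is not spelled out on this page. One possible reading, shown here purely as an assumption, is that "Key: value" lines become hash entries and the unlabelled lines that follow are attached to the most recent key. The function name and the sample record are hypothetical.

```python
import re
from typing import Dict, List, Optional

# A key is a run of letters, spaces or slashes followed by a colon.
_KEY_LINE = re.compile(r"^\s*([A-Za-z][A-Za-z /]*?)\s*:\s*(.*)$")


def parse_to_hash(record: str) -> Dict[str, List[str]]:
    """Collect "Key: value" lines into a dict; lines without a key are
    appended to the most recent key (the multi-line part)."""
    result: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in record.splitlines():
        stripped = line.strip()
        if not stripped:
            current = None  # a blank line ends the current block
            continue
        match = _KEY_LINE.match(line)
        if match:
            current = match.group(1).strip().lower()
            values = result.setdefault(current, [])
            if match.group(2).strip():
                values.append(match.group(2).strip())
        elif current is not None:
            result[current].append(stripped)
    return result


sample = (
    "Administrative Contact:\n"
    "   Daste, Kevin\n"
    "   323 Pine St\n"
    "   New Orleans, LA 70118\n"
    "   US\n"
    "\n"
    "Domain Name: XMLCONFERENCE.COM\n"
)
print(parse_to_hash(sample))
```

A sketch like this has obvious limits: a value line such as "ATTN: JENNAJOY.COM" would be mistaken for a new key, which is the kind of case that would need the tweaks mentioned above.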

Plan for Tomorrow

  • We plan to finish the top 50 registrars tomorrow, which cover 90% of all domains. Only 2 or 3 are left, for which we don't have any test cases yet.
  • Continue on the default parser so that it can be finished by tomorrow.
  • Continue the exercise of running the parsers on large data sets and identifying problems and correcting them.
Sounds good. For today's report please start indicating the size of the dataset you're testing on and the results you are getting. Then, each day that we continue to work on this task you can report the results that you are getting at the end of the day and we can easily quantify our progress. -- Brandon 22:21, 22 November 2007 (PST)

Status: Nov 21 2007

  • We added around 8-9 registrars.
  • Corrected small problems in the registrar parsers that were identified by running the parsers on a large data set.
  • Added a few more test cases for the parsers.

Plan for Tomorrow

  • We plan to add a few more registrars tomorrow.
  • Work on a default parser that can extract the address from any registrar's whois record.
  • Continue the exercise of running the parsers on large data sets and identifying problems and correcting them.
Excellent! The plan sounds great ... especially the default parser :-) Keep rockin! --Brandon 21:30, 21 November 2007 (PST)

Steps to DoneDone

  • Write the whois record parsers for the top 20 registrars.
  • Pass one or two test cases for all the above 20 registrars.
  • Write address parsers for the top 20 countries and pass test cases for different formats of addresses.
    • Australia
    • Brazil
    • China
    • Canada
    • Germany
    • Japan
    • Netherlands
    • Spain
    • United States
    • United Kingdom
  • Write a wrapper that takes a domain name, fetches the whois record and then calls the parser on this record.
  • Write more test cases for the top 20 registrars
    • Belgium Domains
    • Capitol Domain
    • DirectNic
    • Domain Discover
    • Domain Doorman
    • Dot Register
    • Dotster
    • Enom
    • Fabulous
    • GoDaddy
    • Key Systems
    • Melbourne IT
    • Moniker
    • NameKing
    • Network Solutions
    • Register.com
    • Schlund Partner
    • Tucows
    • Wild West
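
The wrapper step listed above (take a domain name, fetch the whois record, call the parser) could look roughly like this. It is a sketch, not the project's code: `fetch_whois` and `lookup` are hypothetical names, the whois server has to be supplied by the caller, and the fetch function is passed in so that the wrapper can be tested without network access.

```python
import socket
from typing import Callable, Dict

WHOIS_PORT = 43  # standard whois port (RFC 3912)


def fetch_whois(domain: str, server: str, timeout: float = 10.0) -> str:
    """Send a whois query for `domain` to `server` and return the raw reply."""
    with socket.create_connection((server, WHOIS_PORT), timeout=timeout) as sock:
        sock.sendall(domain.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def lookup(domain: str,
           parse: Callable[[str], Dict[str, str]],
           fetch: Callable[[str], str]) -> Dict[str, str]:
    """Wrapper: normalise the domain, fetch its whois record, then parse it."""
    name = domain.strip().lower()
    record = fetch(name)
    parsed = parse(record)
    parsed["domain"] = name
    return parsed


# Usage with stand-ins for the fetcher and the parser (no network needed).
result = lookup(" XmlConference.com ",
                parse=lambda record: {"raw": record},
                fetch=lambda name: "record for " + name)
print(result)
```

In real use the `fetch` argument would wrap `fetch_whois` with the whois server of the relevant registrar, and `parse` would be the registrar parser or the default parser.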

Testing

For each "Test Domain" included in these tallies, we have an exhaustive test that is passing.

S#  Registrar           # "Interesting" Test Domains  Comment
1   GoDaddy             0 ToSkip
2   Melbourne IT        0 ToSkip
3   Enom                0 ToSkip
4   Network Solutions   0 ToSkip                      Most records have Domain Status set to nil.
5   Belgium Domains     0 ToSkip
6   Tucows              0 ToSkip                      Many UK tests failing; some have postal codes same as state.
7   Beijing Innovative  0 ToSkip


Formats of WhoIs For Registrars

Network Solutions

AEroiNstruments.com: Without Country in Address

c/o Network Solutions
P.O. Box 447  
Herndon, VA.  20172-0447

AEroiNstruments.com: Administrative Contact and Technical Contact are on the same line Administrative Contact and Technical Contact in Separate lines

Administrative Contact:
Technical Contact:

ArRail-Dental.com: China Address

     Y.P.Sun       
     Y.P.Sun
     Beijing Shengbin Company Limited
     304 Citic Building 2 No 19
     Jianguomenwai
     Beijing 100004
     100004
     CHN 
     999 999 9999 fax: 999 999 9999

Broadland-Gas.com: UK address

     BROADLAND GAS     
     35 The Street
     CARLTON  COLVILLE
     LOWESTOFT, SUFFOLK NR33 8JP 
     UK  

CommerceCenter.com: UK Address

  Lorien
  Felstead, Essex CM6 3LR 
  Felstead, ESSEX CM6 3LR 
  UK  

DuoCuisines.com: France Address

  REVANCHE
  20 Rue Bernard Lazare
  LE CAILAR 30740
  FR  

Hricn.com: China Address

  jin yuan, xiong
  Guizhou huangping jiuzhouzhongxue
  Guizhou, Guizhou 550000
  CN  

JennaJoy.com: Name Missing

     Technical Contact:
           
     ATTN: JENNAJOY.COM
     c/o Network Solutions
     P.O. Box 447
     Herndon, VA 20172-0447
     570-708-8780

OnegaTelecoms.com: City and State are same

  Unit 2 Ground Floor Caxton Street
  Studios Caxton Street North
  London, LONDON E16 1JL 
  GB  

TradeFootball.com: IE address

  10b Beckett Way 
  Parkwest Business Park, Clondalkin D22 00000
  IE  

XmlConference.com: Fax is NULL

     Daste, Kevin      
     323 Pine St
     New Orleans, LA 70118
     US
     504-208-1566 fax: null
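
Two of the line shapes above, "City, State Postal" and "phone fax: number", can be picked apart with small regular expressions. The sketch below is an illustration only: the function names are hypothetical, the postal pattern covers just 5 or 6 digit codes (with an optional ZIP+4 suffix) and UK-style postcodes, and it assumes the caller already knows which line is which.

```python
import re
from typing import Dict, Optional

# "City, State Postal", e.g. "Herndon, VA 20172-0447".
_CITY_LINE = re.compile(r"^\s*(?P<city>[^,]+?),\s*(?P<rest>.+?)\s*$")
# 5-6 digit code (optionally ZIP+4) or a UK-style postcode ending the line.
_POSTAL = re.compile(
    r"(?<!\S)(?P<postal>\d{5,6}(?:-\d{4})?|[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2})$")
# "phone fax: number", where the fax may be the literal string "null".
_PHONE_LINE = re.compile(
    r"^\s*(?P<phone>[\d(+][\d\s().+-]*?)\s*(?:fax:\s*(?P<fax>.*?))?\s*$", re.I)


def parse_city_line(line: str) -> Optional[Dict[str, Optional[str]]]:
    match = _CITY_LINE.match(line)
    if match is None:
        return None
    rest = match.group("rest")
    postal = _POSTAL.search(rest)
    if postal is None:
        return {"city": match.group("city"), "state": rest, "postal": None}
    # Drop the postal code and a trailing period such as in "VA.".
    state = rest[:postal.start()].strip().rstrip(".")
    return {"city": match.group("city"), "state": state,
            "postal": postal.group("postal")}


def parse_phone_line(line: str) -> Optional[Dict[str, Optional[str]]]:
    match = _PHONE_LINE.match(line)
    if match is None:
        return None
    fax = match.group("fax")
    if fax is None or fax.strip().lower() in ("", "null"):
        fax = None  # "fax: null" means no fax number was given
    return {"phone": match.group("phone"), "fax": fax}


print(parse_city_line("Herndon, VA.  20172-0447"))
print(parse_city_line("London, LONDON E16 1JL"))
print(parse_phone_line("504-208-1566 fax: null"))
```

Note that a name line such as "Daste, Kevin" also has the "X, Y" shape, so `parse_city_line` should only be applied to a line already identified as the city line, for example by its position in the address block.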

