WhoisParsing
What (summary)
Pull out all pieces of information from the whois record across all domain registrars. WhoisParsing is one Task in the larger WhoisRefresh Project.
Why this is important
- This project will give us access to whois record fields (administrative contacts, technical contacts, domain info, etc.) that could be used to update the information on all the web pages currently hosted on AboutUs.
- Lets us add more fields to each domain, such as the expiry date.
- Gives us mastery over our technology. We can change it easily, or adapt it for a different problem.
- Allows us to turn off Apache and the large Java Tomcat process on our database master.
DoneDone
- A stable of interesting test cases for each of the 50 largest registrars has been created and hand-audited.
- All of the test cases are passing.
Dec 3, 2007
Objectives
- Update the default parser and contact parser to handle the failing cases, especially the hashed ones.
Nov 30, 2007
Objectives
- Make the ContactParser fast. The current problem is that the regular expression has too many optional parts.
- Run the latest parser on the 9000 Whois Records and analyze the time taken and the failed results.
Achieved
- After tweaking the ContactParsers a bit, the parser runs much faster. The stats after running the Parser (this includes applying the ContactParsers) on 8777 records are:
- Contact Information Extracted : 7475
- Contact Information Not Extracted : 1302
- Time taken in seconds is: 688.200859
Status of Failing Cases
Out of the total 1302 failing records, around 70% don't have any data associated with them:
- Empty Whois Record : 849 (65.2%)
- No Contact Information : 45 (3.4%)
The remaining files have contact information that could not be parsed by the DefaultParser:
- Hashed-style information that can be handled : 232 (17.8%)
- Only Contact Info available under heading of "Contact Info(rmation):": 3 (0.2%)
- Easy, can be handled : 156 (11.9%)
- Improper Format : 17 (1.3%)
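The figures in this entry can be cross-checked with a few lines of arithmetic. The sketch below only re-computes shares from the counts reported above; the dictionary labels are shortened and the helper name `percent` is hypothetical. Percentages are rounded to one decimal here, so a couple of values can differ by 0.1 from the figures listed above.

```python
# Counts copied from the Nov 30, 2007 status entry above.
FAILING = {
    "Empty Whois Record": 849,
    "No Contact Information": 45,
    "Hashed information": 232,
    "Only under Contact Info(rmation) heading": 3,
    "Easy, can be handled": 156,
    "Improper Format": 17,
}
TOTAL_RECORDS = 8777
EXTRACTED = 7475
SECONDS = 688.200859


def percent(part, whole):
    """Share of `part` in `whole` as a percentage, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


failing_total = sum(FAILING.values())
breakdown = {label: percent(count, failing_total) for label, count in FAILING.items()}

# Records that carry no usable data at all (empty record or no contact block).
no_data_share = percent(
    FAILING["Empty Whois Record"] + FAILING["No Contact Information"], failing_total
)
success_rate = percent(EXTRACTED, TOTAL_RECORDS)
ms_per_record = round(1000.0 * SECONDS / TOTAL_RECORDS, 1)

for label, share in breakdown.items():
    print(f"{label}: {share}%")
print(f"no data: {no_data_share}%, extracted: {success_rate}%, {ms_per_record} ms/record")
```

The six categories add up to the 1302 failing records, and the two "no data" categories together account for roughly 69% of the failures, which matches the "around 70%" estimate.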
Nov 29, 2007
Objectives
- Merge the ContactParser with the remaining 10 parsers. Arif|Hassan
- Write tests to verify that the WhoisParser never crashes. Laiq
- Co-ordinate with Ali Aslam|Jason Parmer to fetch around 2000 and run the whoisrefresh on them
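One way to approach the "never crashes" objective is to feed the parser both hand-picked edge cases and random junk, and record any exception. This is a minimal sketch, not the project's actual test suite: `parse_whois` is a hypothetical stand-in for the real WhoisParser entry point, and `find_crashes` and `random_records` are illustrative helper names.

```python
import random
import string


def parse_whois(raw):
    """Hypothetical stand-in for the real parser: collects 'key: value' lines."""
    result = {}
    for line in raw.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            result[key.strip().lower()] = value.strip()
    return result


def find_crashes(parser, samples):
    """Run `parser` on every sample and return (sample, exception) pairs."""
    crashes = []
    for sample in samples:
        try:
            parser(sample)
        except Exception as exc:  # any exception counts as a crash
            crashes.append((sample, exc))
    return crashes


def random_records(count, seed=0):
    """Generate reproducible random strings to use as junk whois records."""
    rng = random.Random(seed)
    return [
        "".join(rng.choice(string.printable) for _ in range(rng.randint(0, 200)))
        for _ in range(count)
    ]


# Degenerate inputs, some taken from the record formats seen on this page.
EDGE_CASES = ["", "\n\n\n", ":", "fax: null", "Administrative Contact: Technical Contact:"]

if __name__ == "__main__":
    crashes = find_crashes(parse_whois, EDGE_CASES + random_records(500))
    print(f"{len(crashes)} crashing inputs")
```

Seeding the random generator keeps a failing input reproducible, so it can be copied into a regular test case once found.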
Nov 28, 2007
Objectives
Today would be a great day if we can accomplish the following
- Test the Contact Parser function that applies all the contact parsers serially (Generic ContactParser). Arif
- Get the 20 tests passing for the Default Parser. Laiq
- Remove the parse contact function from the registrar parser and call the ContactParser class for parsing the contact information. Hassan|Arif
Achieved
- The tests for Default Parser are passing.
- The Contact Parser function has been tested. It is passing 27 out of the 27 tests written for it.
- Merged the parse contact function of around 30 parsers.
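The idea of applying the contact parsers serially can be sketched as a list of small parser functions tried in order, where the first one that returns a result wins. This is an illustration under assumptions, not the actual ContactParser class: the function names, the two example formats, and the field names (`name`, `email`, `phone`) are all hypothetical.

```python
import re
from typing import Callable, Dict, List, Optional

Contact = Dict[str, str]
ContactParserFn = Callable[[str], Optional[Contact]]

# Assumed format 1: labelled lines such as "Name: Jane Doe".
LABELLED = re.compile(
    r"^\s*(name|email|phone)\s*:\s*(.+?)\s*$", re.IGNORECASE | re.MULTILINE
)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def parse_labelled(block: str) -> Optional[Contact]:
    """Succeeds only when the block has an explicit 'Name:' label."""
    fields = {key.lower(): value for key, value in LABELLED.findall(block)}
    return fields if "name" in fields else None


def parse_positional(block: str) -> Optional[Contact]:
    """Assumed format 2: the first non-empty line is the contact name."""
    lines = [line.strip() for line in block.splitlines() if line.strip()]
    if not lines:
        return None
    contact = {"name": lines[0]}
    match = EMAIL.search(block)
    if match:
        contact["email"] = match.group(0)
    return contact


# Most specific parser first, most permissive last.
PARSERS: List[ContactParserFn] = [parse_labelled, parse_positional]


def parse_contact(block: str, parsers: List[ContactParserFn] = PARSERS) -> Optional[Contact]:
    """Apply each parser in turn and return the first non-empty result."""
    for parser in parsers:
        result = parser(block)
        if result:
            return result
    return None
```

Ordering the list from strict to permissive means a loose fallback never shadows a format that a more specific parser handles better, and each registrar parser can call `parse_contact` instead of carrying its own copy of the logic.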
Nov 27, 2007
Objectives
Today would be a great day if we can accomplish the following
- Write 20 tests for the Default Parser. Laiq
- Write all the possible contact parsers in a single ContactParser class and write its test cases. Write the regular expressions so that they cover more cases and are not overly restrictive. Arif | Hassan
Achieved
- Tests for the default parser are written, but they are not passing yet.
- ContactParser is done, with some generic parse contact functions and a function that applies these parsers serially. We have updated a few parsers, but still need to update the remaining parsers to start using the ContactParser.
Nov 26, 2007
Objectives
- Write 5 tests for the Default Parser.
- Run the parser on 8779 records and find out how long it takes to run on them.
- Refactor the Contact Parser and move the common parsers into the parent class.
- Find the reasons for the slowness of SRS Plus and Domain People and correct them.
Outcome
- Result statistics:
- Total Records: 8777
- Passing records are: 7374
- Failing records are: 1403
- Time taken: 358.803416 secs
- Merged the contact parsers of three registrars. A few others are in progress, but a few cases are failing.
- Two test cases written for the default parser. We plan to get the Contact parsers done first, since otherwise the default parser tests won't pass, and then concentrate fully on the test cases.
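A batch run like the one reported above (records in, passing and failing counts, elapsed time) can be sketched as a small harness. This is an assumption-laden illustration: `run_batch` and `toy_parser` are hypothetical names, and "passing" is taken here to mean that the parser returned a non-empty result without raising.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def toy_parser(raw: str) -> Optional[dict]:
    """Hypothetical stand-in parser: succeeds if a 'Registrant' label is present."""
    if "Registrant" in raw:
        return {"registrant": raw.split(":", 1)[-1].strip()}
    return None


def run_batch(records: Iterable[str], parser: Callable[[str], Optional[dict]]) -> Dict[str, float]:
    """Run `parser` over all records and tally passes, failures and elapsed time."""
    passed = failed = 0
    start = time.perf_counter()
    for raw in records:
        try:
            result = parser(raw)
        except Exception:
            result = None  # a crash is counted as a failing record
        if result:
            passed += 1
        else:
            failed += 1
    elapsed = time.perf_counter() - start
    return {
        "total": passed + failed,
        "passed": passed,
        "failed": failed,
        "seconds": elapsed,
    }


if __name__ == "__main__":
    stats = run_batch(["Registrant: Example Org", "", "Registrant: Another Org"], toy_parser)
    print(stats)
```

Reporting the same four numbers after every run makes the daily results directly comparable.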
Status: Nov 23 2007
- Updated a few parsers to fix issues in parsing.
- Ran the WhoisParser (including the Default Parser) on 8779 records and fixed the parsers accordingly, with the following results:
Passing Records: 7374, Failing Records: 1405
- Continued work on the default parser. Ran the Default Parser on the same record set to extract the raw address, with the following results:
Passing Records: 6501, Failing Records: 2278
- Looks great! How did you arrive at these numbers? Is it possible to create ground truth test cases for a representative sample so that we have some hard numbers on what percentage of the parsed records are correct so that we can have a better sense of how reliable that number is? Also how long to run all of the record? -- Brandon 21:09, 25 November 2007 (PST)
- So far, the numbers are based on the raw_address output: a record counts as passing if a raw_address was generated for its file. The output looks reasonably good when inspected manually.
Plan for Tomorrow
- Write the test cases for the Default parser.
- Manually verify the results.
- Update SRSPLUS Parser to optimize its slow nature.
Status: Nov 22 2007
- We added another 2 registrars.
- Started work on the default parser. We are following the multi-line hash strategy with some tweaks. So far it finds something in around 2500 out of around 8500 records, but it still needs improvement.
- Corrected small problems from the registrar parser that were identified by running the parsers on large data set.
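The page does not spell out the multi-line hash strategy, so the following is only one plausible reading: a line of the form "Key: value" opens an entry in a hash, and the unlabelled lines that follow are collected under that key until a blank line or the next key. The function name `multiline_hash` and the sample record are hypothetical.

```python
import re
from typing import Dict, List, Optional

# A key is assumed to start with a letter and end at the first colon.
KEY_LINE = re.compile(r"^\s*([A-Za-z][A-Za-z /.'-]*?)\s*:\s*(.*)$")


def multiline_hash(raw: str) -> Dict[str, List[str]]:
    """Group a whois record into {key: [value lines]}."""
    record: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in raw.splitlines():
        if not line.strip():
            current = None  # a blank line closes the current entry
            continue
        match = KEY_LINE.match(line)
        if match:
            current = match.group(1).strip().lower()
            record.setdefault(current, [])
            value = match.group(2).strip()
            if value:
                record[current].append(value)
        elif current is not None:
            # Continuation line: belongs to the most recent key.
            record[current].append(line.strip())
    return record


if __name__ == "__main__":
    sample = (
        "Registrant:\n"
        "  Example Org\n"
        "  1 Main St\n"
        "\n"
        "Domain Name: EXAMPLE.COM\n"
        "Expires on: 01-Jan-2009\n"
    )
    print(multiline_hash(sample))
```

A known weakness of this reading is that any continuation line containing a colon after a leading word (a URL, for example) is mistaken for a new key, which is the kind of case the "tweaks" would have to cover.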
Plan for Tomorrow
- We plan to finish the top 50 registrars tomorrow, which cover 90% of all domains. Only 2 or 3 are left, for which we don't have any test cases yet.
- Continue on the default parser so that it can be finished by tomorrow.
- Continue the exercise of running the parsers on large data sets and identifying problems and correcting them.
- Sounds good. For today's report please start indicating the size of the dataset you're testing on and the results you are getting. Then, each day that we continue to work on this task you can report the results that you are getting at the end of the day and we can easily quantify our progress. -- Brandon 22:21, 22 November 2007 (PST)
Status: Nov 21 2007
- We added around 8-9 registrars.
- Corrected small problems from the registrar parser that were identified by running the parsers on large data set.
- Added a few more test cases for the parsers.
Plan for Tomorrow
- We plan to add a few more registrars tomorrow.
- Work on a default parser that can extract the address from any registrar's whois record.
- Continue the exercise of running the parsers on large data sets and identifying problems and correcting them.
- Excellent! The plan sounds great ... especially the default parser :-) Keep rockin! --Brandon 21:30, 21 November 2007 (PST)
Steps to DoneDone
- Write the whois record parsers for the top 20 registrars.
- Pass one or two test cases for all the above 20 registrars.
- Write address parsers for the top 20 countries and pass test cases for different formats of addresses.
- Australia
- Brazil
- China
- Canada
- Germany
- Japan
- Netherlands
- Spain
- United States
- United Kingdom
- Write a wrapper that takes a domain name, fetches the whois record and then calls the parser on this record.
- Write more test cases for the top 20 registrars
- Belgium Domains
- Capitol Domain
- DirectNic
- Domain Discover
- Domain Doorman
- Dot Register
- Dotster
- Enom
- Fabulous
- GoDaddy
- Key Systems
- Melbourne IT
- Moniker
- NameKing
- Network Solutions
- Register.com
- Schlund Partner
- Tucows
- Wild West
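The wrapper step listed above (domain name in, fetched record, parsed result out) can be sketched as follows. The WHOIS protocol itself is simple: open a TCP connection to port 43, send the query followed by CRLF, and read until the server closes the connection. Everything else here is an assumption for illustration: the function names, the default server (`whois.verisign-grs.com`, the registry server for .com), and the stand-in `parse_record`, which is not the project's parser.

```python
import socket
from typing import Callable, Dict


def fetch_whois(domain: str, server: str = "whois.verisign-grs.com",
                port: int = 43, timeout: float = 10.0) -> str:
    """Fetch the raw whois record for `domain` from `server`."""
    with socket.create_connection((server, port), timeout=timeout) as conn:
        conn.sendall(domain.encode("idna") + b"\r\n")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # server closed the connection: record is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_record(raw: str) -> Dict[str, str]:
    """Hypothetical stand-in parser: keeps the first 'key: value' per key."""
    fields: Dict[str, str] = {}
    for line in raw.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() and value.strip():
            fields.setdefault(key.strip().lower(), value.strip())
    return fields


def whois_lookup(domain: str,
                 fetch: Callable[[str], str] = fetch_whois,
                 parse: Callable[[str], Dict[str, str]] = parse_record) -> Dict[str, str]:
    """Normalise the domain, fetch its whois record and parse it."""
    return parse(fetch(domain.strip().lower()))
```

Passing the fetch and parse steps in as arguments keeps the wrapper testable without network access, for example `whois_lookup("example.com", fetch=lambda d: "Domain Name: EXAMPLE.COM")`.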
Testing
For each "Test Domain" included in these tallies, we have an exhaustive test that is passing.
S# | Registrar | # "Interesting" Test Domains | Comment |
---|---|---|---|
1 | GoDaddy | 0 | ToSkip |
2 | Melbourne IT | 0 | ToSkip |
3 | Enom | 0 | ToSkip |
4 | Network Solutions | 0 | ToSkip. Most records have Domain Status set to nil. |
5 | Belgium Domains | 0 | ToSkip |
6 | Tucows | 0 | ToSkip. Many UK tests failing; some have postal codes that are the same as the state. |
7 | Beijing Innovative | 0 | ToSkip |
Formats of WhoIs For Registrars
Network Solutions
AEroiNstruments.com: Without Country in Address
c/o Network Solutions P.O. Box 447 Herndon, VA. 20172-0447
AEroiNstruments.com: Administrative Contact and Technical Contact are on the same line Administrative Contact and Technical Contact in Separate lines
Administrative Contact: Technical Contact:
ArRail-Dental.com: China Address
Y.P.Sun Y.P.Sun Beijing Shengbin Company Limited 304 Citic Building 2 No 19 Jianguomenwai Beijing 100004 100004 CHN 999 999 9999 fax: 999 999 9999
Broadland-Gas.com: UK address
BROADLAND GAS 35 The Street CARLTON COLVILLE LOWESTOFT, SUFFOLK NR33 8JP UK
CommerceCenter.com: UK Address
Lorien Felstead, Essex CM6 3LR Felstead, ESSEX CM6 3LR UK
DuoCuisines.com: France Address
REVANCHE 20 Rue Bernard Lazare LE CAILAR 30740 FR
Hricn.com: China Address
jin yuan, xiong Guizhou huangping jiuzhouzhongxue Guizhou, Guizhou 550000 CN
JennaJoy.com: Name Missing
Technical Contact: ATTN: JENNAJOY.COM c/o Network Solutions P.O. Box 447 Herndon, VA 20172-0447 570-708-8780
OnegaTelecoms.com: City and State are same
Unit 2 Ground Floor Caxton Street Studios Caxton Street North London, LONDON E16 1JL GB
TradeFootball.com: IE address
10b Beckett Way Parkwest Business Park, Clondalkin D22 00000 IE
XmlConference.com: Fax is NULL
Daste, Kevin 323 Pine St New Orleans, LA 70118 US 504-208-1566 fax: null
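Several of the Network Solutions cases above (a trailing period after the state, UK postcodes, a fax of "null") can be handled by two small helpers. This sketch assumes that the "City, Region PostalCode" part sits on its own line in the original record, which the flattened samples above do not confirm; the function names are hypothetical.

```python
import re
from typing import Dict, Optional

# "City, Region PostalCode": the region may end with a period ("VA."),
# and the first token of the postal code must contain a digit.
CITY_LINE = re.compile(
    r"^(?P<city>[^,]+),\s*"
    r"(?P<region>[A-Za-z ]+?)\.?\s+"
    r"(?P<postal>(?=\S*\d)[A-Za-z0-9][A-Za-z0-9 -]*)$"
)
FAX_LINE = re.compile(r"^\s*fax\s*:\s*(.*)$", re.IGNORECASE)


def parse_city_line(line: str) -> Optional[Dict[str, str]]:
    """Split a 'City, Region PostalCode' line into its three parts."""
    match = CITY_LINE.match(line.strip())
    if not match:
        return None
    return {name: value.strip() for name, value in match.groupdict().items()}


def parse_fax(line: str) -> Optional[str]:
    """Return the fax number, treating an empty value or 'null' as missing."""
    match = FAX_LINE.match(line)
    if not match:
        return None
    value = match.group(1).strip()
    return None if value.lower() in ("", "null") else value


if __name__ == "__main__":
    for text in ("Herndon, VA. 20172-0447", "LOWESTOFT, SUFFOLK NR33 8JP"):
        print(parse_city_line(text))
    print(parse_fax("fax: null"))
```

Cases such as "City and State are same" still parse, since the helper does not compare the two fields; deciding what to do with a duplicated region is left to the caller.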