Difference between revisions of "WhoisParsing"
Arif Iqbal (talk | contribs) |
Arif Iqbal (talk | contribs) (→Nov 26 2007) |
Revision as of 13:06, 26 November 2007
What (summary)
Pull out all pieces of information from the whois record across all domain registrars. WhoisParsing is one Task in the larger WhoisRefresh Project.
Why this is important
- This project will give us access to whois record fields (administrative contacts, technical contacts, domain info, etc.) that could be used to update the information on all the web pages currently hosted on AboutUs.
- Adds more fields to the domain, such as the expiry date.
- Gives us mastery over our technology. We can change it easily, or adapt it for a different problem.
- Allows us to turn off Apache and the large Java tomcat process on our database master.
DoneDone
- A stable of interesting test cases for each of the largest 50 registrars has been created and hand audited.
- All of the test cases are passing.
Steps to DoneDone
- Write the whois record parsers for the top 20 registrars.
- Pass one or two test cases for all the above 20 registrars.
- Write address parsers for the top 20 countries and pass test cases for different formats of addresses.
  - Australia
  - Brazil
  - China
  - Canada
  - Germany
  - Japan
  - Netherlands
  - Spain
  - United States
  - United Kingdom
- Write a wrapper that takes a domain name, fetches the whois record and then calls the parser on this record.
- Write more test cases for the top 20 registrars
  - Belgium Domains
  - Capitol Domain
  - DirectNic
  - Domain Discover
  - Domain Doorman
  - Dot Register
  - Dotster
  - Enom
  - Fabulous
  - GoDaddy
  - Key Systems
  - Melbourne IT
  - Moniker
  - NameKing
  - Network Solutions
  - Register.com
  - Schlund Partner
  - Tucows
  - Wild West
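The wrapper step above could be sketched as follows. This is a minimal sketch in Python, not the project's actual code: the names `fetch_whois` and `parse_domain` are hypothetical, and the parser is passed in as a plain function because the real parser interface is not described on this page. The fetch uses the standard whois protocol (a query line sent over TCP port 43, with the server closing the connection when it is done).

```python
import socket
from typing import Callable, Dict


def fetch_whois(domain: str, server: str, port: int = 43, timeout: float = 10.0) -> str:
    """Fetch a raw whois record over TCP (hypothetical helper)."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        # The query is the domain name terminated by CRLF.
        sock.sendall(domain.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_domain(
    domain: str,
    fetch: Callable[[str], str],
    parse: Callable[[str], Dict[str, str]],
) -> Dict[str, str]:
    """Fetch the whois record for `domain` and call the parser on it."""
    record = fetch(domain)
    fields = dict(parse(record))  # copy so the parser's result is not mutated
    fields["domain"] = domain.lower()
    return fields
```

For a real lookup, the `fetch` argument could be bound to a server with `functools.partial(fetch_whois, server=...)`; choosing which whois server to ask for a given domain is left out of this sketch. Passing the fetcher in also lets the test cases run against saved records without touching the network.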
Nov 26 2007
Objectives
- Write 5 tests for the Default Parser.
- Run the parser on 8779 records and find out how long it takes to run on them.
- Refactor the Contact Parser and move the common parsers into the parent class.
- Find the reasons for the slowness of the SRS Plus and Domain People parsers and correct them.
Outcome
- Results Statistics:
- Total Records: 8777
- Passing records are: 7374
- Failing records are: 1403
- Time taken: 358.803416 secs
- Merged the contact parsers of three registrars. A few others are in progress, but a few cases are failing.
- Two test cases written for the default parser. We plan to finish the Contact parsers first, since their tests won't pass otherwise, and then concentrate fully on the test cases.
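To make the run statistics above easier to compare from day to day, the pass rate and throughput can be derived from the reported figures. This is only a small arithmetic sketch; the variable names are ours, and the numbers are the ones listed in the Outcome above.

```python
# Figures reported in the Nov 26 2007 Outcome.
total_records = 8777
passing = 7374
failing = 1403
elapsed_secs = 358.803416

# Sanity check: passing and failing records add up to the total.
assert passing + failing == total_records

pass_rate = passing / total_records
records_per_sec = total_records / elapsed_secs
ms_per_record = 1000 * elapsed_secs / total_records

print(f"pass rate:  {pass_rate:.1%}")
print(f"throughput: {records_per_sec:.1f} records/sec")
print(f"mean time:  {ms_per_record:.1f} ms/record")
```

This works out to a pass rate of roughly 84% and about 24 records per second.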
Status: Nov 23 2007
- Updated a few parsers to fix issues in parsing.
- Ran the WhoisParser (including the Default Parser) on 8779 records and fixed the parsers accordingly, with the following results:
Passing Records: 7374, Failing Records: 1405
- Continued work on the default parser. Ran the Default parser on the same record set to extract the raw address, with the following results:
Passing Records: 6501, Failing Records: 2278
- Looks great! How did you arrive at these numbers? Is it possible to create ground truth test cases for a representative sample so that we have some hard numbers on what percentage of the parsed records are correct so that we can have a better sense of how reliable that number is? Also how long to run all of the record? -- Brandon 21:09, 25 November 2007 (PST)
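One way to get the hard numbers asked for above is to score the parser against a hand-audited sample, field by field. The following is a sketch under assumptions: the function name `field_accuracy` is hypothetical, and each ground-truth case is assumed to be a pair of dictionaries (the audited fields and what the parser returned).

```python
from typing import Dict, List, Optional, Tuple

Fields = Dict[str, Optional[str]]


def field_accuracy(cases: List[Tuple[Fields, Fields]]) -> Dict[str, float]:
    """For each audited field, return the fraction of cases parsed correctly."""
    hits: Dict[str, int] = {}
    totals: Dict[str, int] = {}
    for expected, parsed in cases:
        for name, want in expected.items():
            totals[name] = totals.get(name, 0) + 1
            if parsed.get(name) == want:
                hits[name] = hits.get(name, 0) + 1
    return {name: hits.get(name, 0) / totals[name] for name in totals}


# Made-up sample: two audited cases, one with a wrongly parsed state.
sample = [
    ({"city": "Herndon", "state": "VA"}, {"city": "Herndon", "state": "VA"}),
    ({"city": "New Orleans", "state": "LA"}, {"city": "New Orleans", "state": None}),
]
print(field_accuracy(sample))
```

Reporting accuracy per field rather than per record shows which parts of the record (for example, state or postal code) are the least reliable.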
Plan for Tomorrow
- Write the test cases for the Default parser.
- Manually verify the results.
- Update the SRSPLUS Parser to fix its slowness.
Status: Nov 22 2007
- We added another 2 registrars.
- Started work on the default parser. We are following the multi-line hash strategy with some tweaks. So far it finds something in around 2500 of around 8500 records, but it still needs improvement.
- Corrected small problems in the registrar parsers that were identified by running the parsers on a large data set.
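This page does not describe the multi-line hash strategy in detail. One possible reading, shown here purely as an assumption, is that the record is split into blocks introduced by a `Label:` line, and each block's lines are stored in a hash keyed by that label. The function name `multiline_hash` and the label pattern are hypothetical.

```python
import re
from typing import Dict, List, Optional

# Assumed label shape: letters, spaces or slashes, followed by a colon.
LABEL_RE = re.compile(r"^\s*([A-Za-z][A-Za-z /]*?)\s*:\s*(.*)$")


def multiline_hash(record: str) -> Dict[str, List[str]]:
    """Group the lines of a whois record under the label that precedes them."""
    sections: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in record.splitlines():
        stripped = line.strip()
        if not stripped:
            current = None  # a blank line ends the current block
            continue
        match = LABEL_RE.match(line)
        if match:
            current = match.group(1)
            sections.setdefault(current, [])
            if match.group(2):  # value on the same line as the label
                sections[current].append(match.group(2))
        elif current is not None:
            sections[current].append(stripped)
    return sections
```

A sketch like this treats any `word:` prefix as a label, so lines such as `fax: ...` inside a contact block start a new key; a real default parser would need tweaks for such cases.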
Plan for Tomorrow
- We plan to finish the top 50 registrars tomorrow, which cover 90% of all domains. Only 2 or 3 are left, for which we don't have any test cases yet.
- Continue on the default parser so that it can be finished by tomorrow.
- Continue the exercise of running the parsers on large data sets and identifying problems and correcting them.
- Sounds good. For today's report please start indicating the size of the dataset you're testing on and the results you are getting. Then, each day that we continue to work on this task you can report the results that you are getting at the end of the day and we can easily quantify our progress. -- Brandon 22:21, 22 November 2007 (PST)
Status: Nov 21 2007
- We added around 8-9 registrars.
- Corrected small problems in the registrar parsers that were identified by running the parsers on a large data set.
- Added a few more test cases for the parsers.
Plan for Tomorrow
- We plan to add a few more registrars tomorrow.
- Work on a default parser that can extract the address from any registrar's whois record.
- Continue the exercise of running the parsers on large data sets and identifying problems and correcting them.
- Excellent! The plan sounds great ... especially the default parser :-) Keep rockin! --Brandon 21:30, 21 November 2007 (PST)
Testing
For each "Test Domain" included in these tallies, we have an exhaustive test that is passing.
S# | Registrar | # "Interesting" Test Domains | Comment |
---|---|---|---|
1 | GoDaddy | 0 | ToSkip |
2 | Melbourne IT | 0 | ToSkip |
3 | Enom | 0 | ToSkip |
4 | Network Solutions | 0 | ToSkip Most records have Domain Status set to nil. |
5 | Belgium Domains | 0 | ToSkip |
6 | Tucows | 0 | ToSkip Many UK tests failing, some have postal codes same as state |
7 | Beijing Innovative | 0 | ToSkip |
Formats of WhoIs For Registrars
Network Solutions
AEroiNstruments.com: Without Country in Address
c/o Network Solutions P.O. Box 447 Herndon, VA. 20172-0447
AEroiNstruments.com: Administrative Contact and Technical Contact are on the same line Administrative Contact and Technical Contact in Separate lines
Administrative Contact: Technical Contact:
ArRail-Dental.com: China Address
Y.P.Sun Y.P.Sun Beijing Shengbin Company Limited 304 Citic Building 2 No 19 Jianguomenwai Beijing 100004 100004 CHN 999 999 9999 fax: 999 999 9999
Broadland-Gas.com: UK address
BROADLAND GAS 35 The Street CARLTON COLVILLE LOWESTOFT, SUFFOLK NR33 8JP UK
CommerceCenter.com: UK Address
Lorien Felstead, Essex CM6 3LR Felstead, ESSEX CM6 3LR UK
DuoCuisines.com: France Address
REVANCHE 20 Rue Bernard Lazare LE CAILAR 30740 FR
Hricn.com: China Address
jin yuan, xiong Guizhou huangping jiuzhouzhongxue Guizhou, Guizhou 550000 CN
JennaJoy.com: Name Missing
Technical Contact: ATTN: JENNAJOY.COM c/o Network Solutions P.O. Box 447 Herndon, VA 20172-0447 570-708-8780
OnegaTelecoms.com: City and State are same
Unit 2 Ground Floor Caxton Street Studios Caxton Street North London, LONDON E16 1JL GB
TradeFootball.com: IE address
10b Beckett Way Parkwest Business Park, Clondalkin D22 00000 IE
XmlConference.com: Fax is NULL
Daste, Kevin 323 Pine St New Orleans, LA 70118 US 504-208-1566 fax: null
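Two of the Network Solutions quirks listed above (a US city line with or without a period after the state, and a fax given as `null`) can be handled with a small amount of normalization. This is a sketch only: the function names and the regular expression are assumptions, and it covers just the US `City, ST ZIP` shape, not the UK, China, France or IE formats.

```python
import re
from typing import Dict, Optional

# Matches "City, ST 12345" and "City, ST. 12345-6789".
US_LAST_LINE_RE = re.compile(
    r"^(?P<city>[^,]+),\s*(?P<state>[A-Z]{2})\.?\s+(?P<zip>\d{5}(?:-\d{4})?)$"
)


def parse_us_last_line(line: str) -> Optional[Dict[str, str]]:
    """Split a US city/state/ZIP line into fields, or return None if it does not fit."""
    match = US_LAST_LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def normalize_fax(value: str) -> Optional[str]:
    """Strip a leading 'fax:' label and map an empty or 'null' fax to None."""
    cleaned = value.strip()
    if cleaned.lower().startswith("fax:"):
        cleaned = cleaned[4:].strip()
    return None if cleaned.lower() in ("", "null") else cleaned


print(parse_us_last_line("Herndon, VA. 20172-0447"))
print(parse_us_last_line("LOWESTOFT, SUFFOLK NR33 8JP"))  # not a US line
print(normalize_fax("fax: null"))
```

Returning `None` for lines that do not fit lets a caller fall through to another country's address parser instead of guessing.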