Building a Site Submission Program
How to build your own site submission tool with CGI
5/25/98
by Edward Piou
In connection with our site marketing guide, I wrote a program that automatically submits your site to several major search engines for indexing. This guide will show you how to build your own such program for either personal or commercial use. You may want to view the text of this program in a separate browser window so you can see it in its entirety while reading through this guide. The text of this program is available at: http://www.ahref.com/guides/technology/199805/0525pioucode.txt
The Input Form
The form that collects input for this program needs text boxes for two variables: the URL to be indexed (named input_url) and the email address of the person submitting the URL (named input_email).
The HTML code for this form is below. Make sure the path to the CGI bin and the filename for the program match what you've used.
<FORM METHOD="POST" ACTION="/cgi-bin/multisubmit.pl">
<B>URL:</B>
<INPUT TYPE="TEXT" NAME="input_url" SIZE="32" MAXLENGTH="200" VALUE="http://">
<BR>
<B>Your email:</B>
<INPUT TYPE="TEXT" NAME="input_email" SIZE="18" MAXLENGTH="200">
<INPUT TYPE="SUBMIT" NAME="action" VALUE="Submit">
</FORM>
Lines 1-6 of the program indicate the location of the perl binary on our system and explain the copyright status of the program.
1 #!/usr/local/bin/perl -w
2 # Copyright Edward Piou, piou@ahref.com. Originally written for use
3 # on Anchor [ahref.com], 5/20/1998. Permission granted to use
4 # this code, in current or altered form, for private or commercial use
5 # provided these comments are preserved, and any changes to the code are
6 # noted in comments.
Lines 8-11 import the necessary Perl modules - CGI.pm, for handling the CGI input; and several LWP and HTTP modules, for dealing with HTTP requests. All of these modules are available at CPAN (http://www.perl.com/CPAN-local/README.html). Line 13 disables output buffering.
8 use CGI;
9 use LWP::UserAgent;
10 use HTTP::Request;
11 use HTTP::Status;
12
13 $| = 1;
Line 15 creates a new CGI object. Line 16 lets the user's browser know that the response page is an HTML document. Lines 17 and 18 use the CGI object to import the information from the form into variables in the program.
15 $cgi = new CGI;
16 print $cgi->header;
17 $input_url = $cgi->param('input_url');
18 $input_email = $cgi->param('input_email');
Lines 20-22 define several variables containing information unique to our site - it is info you should change if you copy the program. $our_ip is the IP address of the machine running the site submitter; we define it because one of the search engines (Hotbot) requires an IP address for submissions. We use our own IP address, rather than the IP address of the user, so that if there is a problem with our program, the Hotbot folks will know it is our machine that is causing problems. Lines 21 and 22 define our header and footer files - these files include the HTML code which goes before and after the dynamic output of the program.
20 my $our_ip = "205.177.109.84";
21 my $HEADER = "/documentpath/header.inc";
22 my $FOOTER = "/documentpath/footer.inc";
In lines 24-35 we do some basic checking on the URL and email address which were passed in. If either does not pass our test, we assign an error message to the variable $output_string, show that output, and exit the program. For the URL, we check to make sure that it starts with "http://". A stricter program might actually try to access the URL, and generate an error if the page is inaccessible. For the email address, we just check for an @ sign in the address.
24 if ($input_url !~ /^http:\/\//) {
25 $output_string = "<BLOCKQUOTE><H3>Invalid URL</H3>\n";
26 $output_string .= "<P>You input an invalid URL - try again!</BLOCKQUOTE>";
27 &show_output ($output_string);
28 exit;
29 }
30 if ($input_email !~ /@/) {
31 $output_string = "<BLOCKQUOTE><H3>Invalid Email Address</H3>\n";
32 $output_string .= "<P>You input an invalid email address - try again!</BLOCKQUOTE>";
33 &show_output ($output_string);
34 exit;
35 }
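The checks above are deliberately loose. As a rough sketch of slightly stricter tests that still avoid any network access, the helper below (check_input is a hypothetical name, not part of the original program) requires a dotted host name in the URL and a dotted domain after the @ sign. The patterns are illustrative assumptions, not a complete validator for either format.

```perl
use strict;
use warnings;

# Hypothetical helper: returns an empty string when both inputs look
# plausible, or a short error label for the first problem found.
sub check_input {
    my ($url, $email) = @_;

    # http:// followed by a dotted host name, an optional port,
    # and an optional path with no whitespace in it.
    return "Invalid URL"
        unless defined $url
            && $url =~ m{^http://[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+(?::\d+)?(?:/\S*)?\z};

    # Something before a single @, and a dotted domain after it.
    return "Invalid Email Address"
        unless defined $email
            && $email =~ /^[^\s\@]+\@[^\s\@]+\.[^\s\@]+\z/;

    return "";
}

my $problem = check_input('http://www.ahref.com/', 'piou@ahref.com');
print $problem eq "" ? "input accepted\n" : "$problem\n";
```

The returned label could be dropped into the same error page that lines 25-27 and 31-33 build.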
Line 37 defines the list of search engines which we will be submitting to. If you copy this program and add more search engines, you will want to add their names to this list. Before adding other search engines to this list, be sure to send them email and ask if they have a policy against remote programs (rather than people) submitting to their site. We don't include Infoseek in our list because they have told us they have a policy against automated submissions. Yahoo doesn't have a policy against such programs, but their site submission process requires stepping through several pages, a process beyond the scope of this simple script.
37 @site_list = qw (altavista excite hotbot lycos webcrawler);
Lines 39-65 define one hash for each search engine. The name of each hash is the same as the name we gave that search engine in @site_list on line 37; the main loop relies on this match to find the right hash.
Each hash includes a key/value pair describing the URL of the CGI that accepts submissions for the search engine. The key for the pair is the same in each hash: submission_page.
Each hash also includes a key/value pair describing text which appears on the search engine's response page when a successful submission has been made. The key here is success.
The other elements of each hash vary from search engine to search engine. To figure out what information each search engine requires, and the name of the variable that the information should appear under, view the source of the engine's submission form. The name of each variable will become a key in the hash; the value assigned to that key will come from hidden variables on the form page, or from the URL and email address submitted.
For example: viewing source on Altavista's submission form reveals that the field in which you input the URL is named q and there is a hidden input value on the form, named ad, with a value of 1. So we assign the value of $input_url (which we got from our own form) to the variable $altavista{"q"}, and assign 1 to $altavista{"ad"}.
39 $altavista{"submission_page"} = "http://add-url.altavista.digital.com/cgi-bin/newurl";
40 $altavista{"success"} = "has been recorded by our robot";
41 $altavista{"q"} = "$input_url";
42 $altavista{"ad"} = "1";
43
44 $excite{"submission_page"} = "http://www.excite.com/cgi/add_url.cgi";
45 $excite{"success"} = "Thank you!";
46 $excite{"url"} = "$input_url";
47 $excite{"email"} = "$input_email";
48 $excite{"look"} = "excite";
49
50 $hotbot{"submission_page"} = "http://www.hotbot.com/addurl.html";
51 $hotbot{"success"} = "Got it!";
52 $hotbot{"newurl"} = "$input_url";
53 $hotbot{"email"} = "$input_email";
54 $hotbot{"ip"} = "$our_ip";
55 $hotbot{"redirect"} = "http://www.hotbot.com/addurl2.html";
56
57 $lycos{"submission_page"} = "http://www.lycos.com/cgi-bin/spider_now.pl";
58 $lycos{"success"} = "We successfully spidered your page.";
59 $lycos{"query"} = "$input_url";
60 $lycos{"email"} = "$input_email";
61
62 $webcrawler{"submission_page"} = "http://webcrawler.com/cgi-bin/addURL.cgi";
63 $webcrawler{"success"} = "has been scheduled for indexing.";
64 $webcrawler{"url"} = "$input_url";
65 $webcrawler{"action"} = "add";
For Hotbot, the name of the field where you would normally type in the URL is newurl; the field for email is email. Hotbot also has a hidden field: the name is redirect and the value is http://www.hotbot.com/addurl2.html. Another hidden field, which is dynamically generated whenever you access Hotbot's submission form, is ip; this is where we input the variable $our_ip from above.
Excite, Lycos, and Webcrawler were filled out in a similar manner. If you want to add other search engines to this program, you'll need to create a new hash for each new search engine using this procedure.
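As an illustration of that procedure, here is a sketch of a definition for one more engine. Everything about the engine is invented for the example: the name examplesearch, its URL, its success text, and its field names are assumptions, and real values have to be copied from the engine's own submission form. The sketch also stores the definitions in a single hash of hashes rather than one named hash per engine, a different layout from the program's, which has the side benefit of working under "use strict". The helper form_fields is likewise hypothetical.

```perl
use strict;
use warnings;

my $input_url   = "http://www.example.com/";
my $input_email = 'webmaster@example.com';

# Hypothetical engine definition, keyed by engine name.
my %engines = (
    examplesearch => {
        submission_page => "http://search.example.com/cgi-bin/addurl",
        success         => "Your URL has been added",
        url             => $input_url,
        email           => $input_email,
        source          => "form",   # stands in for a hidden form field
    },
);

# Return only the key/value pairs that belong in the submission itself,
# leaving the engine definition untouched.
sub form_fields {
    my ($engine) = @_;
    my %fields = %$engine;
    delete @fields{qw(submission_page success)};
    return %fields;
}

my %fields = form_fields($engines{examplesearch});
print join(", ", sort keys %fields), "\n";
```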
Lines 67-70 create and configure a new UserAgent object using the LWP::UserAgent module. Line 68 gives our agent a name; this name will show up in the search engines' access logs and let them know what program was used to access their sites. Line 69 attaches an email address to the agent, so that the remote sites will know who to email if there is a problem. Line 70 sets the timeout value of the agent to 90 seconds. By default, an agent waits up to three minutes for the remote server's response. We have lowered that to 90 seconds, because we're a little impatient.
67 my $ua = new LWP::UserAgent;
68 $ua->agent ("ahref.com multisubmit 1.0");
69 $ua->from ("piou\@ahref.com");
70 $ua->timeout (90);
Lines 72-78 customize the program's output for our site a little bit more. We don't include this text in our header file because the header file might be used for other programs for which this text doesn't apply.
72 $output_string = "<BLOCKQUOTE><FONT SIZE=\"2\"><A HREF=\"/index.html\">";
73 $output_string .= "<B>Anchor</B></A> &gt; <A HREF=\"/guides/index.html\">Guides</A> &gt; ";
74 $output_string .= "<A HREF=\"/guides/industry/index.html\">Industry</A> &gt; ";
75 $output_string .= "Planting Seeds in All the Right Places<P></FONT>\n";
76 $output_string .= "<IMG SRC=\"/images/indyguideheader.gif\" ALT=\"Industry Guide\" ";
77 $output_string .= "ALIGN=BOTTOM WIDTH=\"455\" HEIGHT=\"61\" BORDER=\"0\">\n";
78 $output_string .= "<BR>\n<H3>Site Submitter Results</H3>\n";
Lines 80-99 do most of the work of the program.
80 foreach $sitename (@site_list) {
81 $query_string = $$sitename{submission_page} . "?";
82 foreach $key (keys %$sitename) {
83 if (($key ne "submission_page") && ($key ne "success")){
84 $query_string .= "$key=$$sitename{$key}&";
85 }
86 }
87 my $request = new HTTP::Request 'POST', $query_string;
88 my $response = $ua->request ($request);
89 $response_body = $response->content();
90 if (($response->code() == RC_OK) && ($response_body =~ /$$sitename{"success"}/)) {
91 $output_string .= "<BR>Submission to <B>$sitename</B> was successful.\n<HR>\n";
92 }
93 else {
94 $coder = $response->code();
95 $output_string .= "<BR>Submission to <B>$sitename</B> failed for some reason. ";
96 $output_string .= "(Response code $coder.) ";
97 $output_string .= "You will need to submit to $sitename by hand.\n<HR>\n";
98 }
99 }
At line 80, we start cycling through our list of search engines. Line 81 begins the construction of our query string: the full URL that we will access to make the submission. The expression $$sitename{submission_page} uses the string in $sitename as the name of the hash to look in, which is why the hash names must match the entries in @site_list. We start with the URL of the search engine's submission page and append a question mark, to show that what follows is a series of variable names and values. At line 82, we start cycling through the key/value pairs in the current search engine's hash. Line 83 skips the two keys that are not form variables, submission_page and success; for those, we go back to line 82 and move on to the next key. For every other key, line 84 appends the variable name and value to the query string, followed by an ampersand in case there are more variables to add. Note that the values are appended exactly as entered, without URL-encoding, so a submitted URL that itself contains characters such as & or # can garble the query string.
Line 87 creates an HTTP::Request object that uses the POST method, with the form data carried in the query string of the URL rather than in the body of the request. Line 88 performs the request and assigns the response to $response. Line 89 assigns the HTML from the response - the HTML you would normally see after submitting the form - to $response_body. Line 90 confirms that nothing went wrong. If there were no errors indicated by the HTTP header (we check $response->code() for this information), and the HTML which was returned includes the "success" text which we assigned to the hash, line 91 is executed: text indicating that the submission to this search engine was successful is added to the program's output. If there was an error, or the HTML did not include the "success" text, lines 94-97 are executed. We determine the HTTP response code on line 94, and add text indicating that there was a problem with the submission to the program's output.
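The following sketch shows one way the two fragile spots in this loop could be tightened; it is not the code the program uses. The hypothetical helpers url_encode and build_query percent-encode each name and value before joining them, and looks_successful compares the success text as a literal string, so that characters such as the period in Lycos's success message are not treated as regular expression wildcards. The example URL is made up, and url_encode assumes plain byte strings.

```perl
use strict;
use warnings;

# Percent-encode everything except RFC 3986 unreserved characters.
# Assumes byte strings; wide characters would need encoding first.
sub url_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf("%%%02X", ord($1))/ge;
    return $text;
}

# Join the form fields into "name=value&name=value", skipping the two
# bookkeeping keys. Keys are sorted only to make the result predictable.
sub build_query {
    my (%site) = @_;
    my @pairs;
    foreach my $key (sort keys %site) {
        next if $key eq "submission_page" || $key eq "success";
        push @pairs, url_encode($key) . "=" . url_encode($site{$key});
    }
    return join("&", @pairs);
}

# Treat the success text as a literal string, not a regular expression.
sub looks_successful {
    my ($body, $success_text) = @_;
    return index($body, $success_text) >= 0;
}

my %altavista = (
    "submission_page" => "http://add-url.altavista.digital.com/cgi-bin/newurl",
    "success"         => "has been recorded by our robot",
    "q"               => "http://www.example.com/a?b=1&c=2",
    "ad"              => "1",
);
print $altavista{"submission_page"}, "?", build_query(%altavista), "\n";
```

Joining the pairs with join also avoids the trailing ampersand that line 84 leaves at the end of the query string.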
Line 101 adds some final formatting to the program output, and line 102 calls the subroutine which actually sends out the output. Line 103 finishes the program.
101 $output_string .= "</BLOCKQUOTE>";
102 &show_output ($output_string);
103 exit;
The output subroutine, on lines 105-118, is fairly standard. It opens the header file and prints its contents to the browser; prints the output string, which has been accumulating text throughout the program; and then opens the footer file and prints its contents.
105 sub show_output {
106 my $output_string = shift (@_);
107 open (HEADER, "<", $HEADER) or die "Can't open file $HEADER: $!\n"; # name the file explicitly; one-argument open cannot see "my" variables
108 while (<HEADER>) {
109 print $_;
110 }
111 close HEADER;
112 print $output_string;
113 open (FOOTER, "<", $FOOTER) or die "Can't open file $FOOTER: $!\n";
114 while (<FOOTER>) {
115 print $_;
116 }
117 close FOOTER;
118 }
This program immediately tries to submit a URL to several search engines and determine if each submission was successful. Because of this, you have to wait for several things to happen before the program generates its output. The site running the program (www.ahref.com in our case) has to access each of the search engines' submission pages, and some of the search engines try to access the submitted URL before determining if the submission was successful. How quickly the program runs depends on how quickly these various connections are made.
If you don't want to keep your users waiting, you can split the program into two parts. The first part can take in the user's URL and email address, quickly show them a response page saying the site will be submitted, and place the URL in a queue of URLs to be submitted. The second part of the program can work through the queue on a daily (or even hourly) basis, submit each URL in it, and notify an administrator if any errors are encountered. If there are errors, the administrator will need to submit the site by hand to those search engines that did not accept it, or notify the user that a particular submission didn't work.
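A minimal sketch of such a queue, assuming a plain text file with one tab-separated "URL, email" entry per line, might look like the following. The subroutine names enqueue and drain_queue and the file format are illustrative assumptions; the original program contains nothing like them. File locking keeps the CGI and the scheduled job from writing to the file at the same moment.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Part one (the CGI): append one "url<TAB>email" line to the queue file.
sub enqueue {
    my ($queue_file, $url, $email) = @_;
    s/[\t\r\n]//g for ($url, $email);   # keep each entry on a single line
    open(my $fh, ">>", $queue_file) or die "Can't open $queue_file: $!\n";
    flock($fh, LOCK_EX) or die "Can't lock $queue_file: $!\n";
    print $fh "$url\t$email\n";
    close($fh) or die "Can't close $queue_file: $!\n";
}

# Part two (the scheduled job): return every pending entry as a hash
# reference and leave the queue file empty.
sub drain_queue {
    my ($queue_file) = @_;
    open(my $fh, "+<", $queue_file) or return ();   # nothing queued yet
    flock($fh, LOCK_EX) or die "Can't lock $queue_file: $!\n";
    my @entries;
    while (my $line = <$fh>) {
        chomp $line;
        my ($url, $email) = split /\t/, $line, 2;
        push @entries, { url => $url, email => $email };
    }
    seek($fh, 0, 0);
    truncate($fh, 0) or die "Can't truncate $queue_file: $!\n";
    close($fh);
    return @entries;
}
```

The CGI would call enqueue with a path the web server can write to, and a job run from cron could loop over the entries returned by drain_queue, running the submission loop once for each.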
If you copy this program and add more search engines, be sure to find out those search engines' policies on automated submissions. Irrational as it may seem, you may find that some sites will refuse to take such automated submissions.
Edward Piou is an Anchor producer and runs ep Productions (http://www.eppi.com), a Washington, D.C.-based development company.
© 1998-1999 Anchor Productions, Inc. All rights reserved.