[wellylug] spider needed

jumbophut jumbophut at gmail.com
Tue Aug 31 13:11:45 NZST 2004


On Tue, 31 Aug 2004 12:30:54 +1200, Brenda O'Hagan wrote:
> 
> the pages are the result of ASP scripts running (on windows something)
> and rsyncing the source scripts that produce this isn't what i want
> i need to spider the resulting wapsite..
> 
> any other spidering apps?? or is there a way to make wget just spider wml?
> 

You could try the --force-html option to wget, which makes it treat a
file passed with -i as HTML (use --base to resolve relative links).
--html-extension might also be relevant.

Otherwise, you probably need to do something along the lines of Nick's
suggestion, such as writing a Perl script to:

grab the WML home page using wget
manually grep for <a>, <anchor>, <img> tags and store targets in a list
for each item in target list:
    if already downloaded
        skip it
    else if an image
        download and put in appropriate directory structure
        add image to already-downloaded list
    else (must be an anchor target)
        if target within the web site
            if target is not a WML file
                download into appropriate directory structure
                add target to already-downloaded list
            else
                grab the WML page using wget
                add WML page to already-downloaded list
                grep for <a>, <anchor>, <img> tags
                store targets in list unless they dup ones already gathered
                for each item....
                    ..... (recursion)
                next
            end if
        end if
        // ignore stuff outside site
    end if
next
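
For what it's worth, here is a rough sketch of that outline. It is in
Python rather than Perl and fetches pages with urllib instead of shelling
out to wget, so treat it as an illustration of the idea, not a tested tool.
The names (LinkParser, fetch_url, local_path, crawl) are made up for this
example. It assumes the server labels WML pages as text/vnd.wap.wml, and it
reads the href from the <go> element nested inside <anchor>, since that is
where WML puts the target. It uses a work list instead of recursion, which
amounts to the same traversal.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect link and image targets from a WML page."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # <a href>, and <go href> (the target carried inside <anchor>)
        if tag in ("a", "go") and attrs.get("href"):
            self.targets.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.targets.append(attrs["src"])


def fetch_url(url):
    """Return (content_type, body_bytes); HTTP errors are not handled here."""
    with urlopen(url) as resp:
        return resp.headers.get_content_type(), resp.read()


def local_path(url, out_dir):
    """Map a URL to a file under out_dir (simplification: query strings are ignored)."""
    path = urlparse(url).path.lstrip("/")
    if not path or path.endswith("/"):
        path += "index.wml"
    return os.path.join(out_dir, path)


def crawl(start_url, out_dir, fetch=fetch_url):
    """Download start_url and everything it links to on the same host."""
    site = urlparse(start_url).netloc
    seen = set()            # the already-downloaded list
    queue = [start_url]     # the target list
    while queue:
        url = queue.pop()
        if url in seen or urlparse(url).netloc != site:
            continue        # already downloaded, or outside the site
        seen.add(url)
        content_type, body = fetch(url)
        dest = local_path(url, out_dir)
        os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
        with open(dest, "wb") as f:
            f.write(body)
        if content_type == "text/vnd.wap.wml":
            parser = LinkParser()
            parser.feed(body.decode("utf-8", errors="replace"))
            for target in parser.targets:
                # resolve relative links and drop #card fragments
                queue.append(urldefrag(urljoin(url, target)).url)
    return seen
```

Calling crawl("http://wap.example/index.wml", "mirror") (a placeholder URL)
would write the pages and images under ./mirror. The fetch argument is only
there so the download step can be swapped out, for instance to call wget
instead.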

You probably don't want to go there.

-- 
Tony (echo 'spend!,pocket awide' | sed 'y/acdeikospntw!, /l at omcgtjuba.phi/')



