[wellylug] spider needed
jumbophut
jumbophut at gmail.com
Tue Aug 31 13:11:45 NZST 2004
On Tue, 31 Aug 2004 12:30:54 +1200, Brenda O'Hagan wrote:
>
> the pages are the result of ASP scripts running (on windows something)
> and rsyncing the source scripts that produce this isn't what i want
> i need to spider the resulting wapsite..
>
> any other spidering apps?? or is there a way to make wget just spider wml?
>
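You could try the --force-html option to wget, although the manual
describes it as applying to a URL list read with -i, so it may not make
wget follow links inside WML pages.
--html-extension might also be relevant, but it only changes the names
the downloaded files are saved under.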
Otherwise, you probably need to do something along the lines of Nick's
suggestion, like write a perl script to:
  grab the WML home page using wget
  manually grep for <a>, <anchor>, <img> tags and store targets somewhere
  for each item in target list:
    if an image and not already downloaded
      download and put in appropriate directory structure;
      add image to already-downloaded list
    else must be an anchor tag
      if target within the web site
        if target is not a WML file
          download into appropriate directory structure;
          add target to already-downloaded list
        else
          grab the WML page using wget
          add WML page to already-downloaded list
          grep <a>,<anchor>
          store targets in list unless they dup ones
            already gathered
          for each item....
            ..... (recursion)
          next
        end if
        //ignore stuff outside site
      end if
    end if
  next
You probably don't want to go there.
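
If you do go there, the outline above can be sketched roughly as follows.
This is a minimal sketch in Python rather than Perl, and it makes several
assumptions that are not in the outline: the names WmlLinkParser,
extract_targets, fetch_url and crawl are made up for illustration; pages
are fetched with urllib instead of wget; a page is treated as WML when the
server reports the content type text/vnd.wap.wml; link targets are read
from <a href>, <img src> and the <go href> that normally sits inside
<anchor>; a work list replaces the recursion; and everything outside the
start host is skipped, images included. Results are kept in a dict instead
of being written into a directory structure.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class WmlLinkParser(HTMLParser):
    """Collect link and image targets from one WML page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # <a href> and <go href> (usually nested in <anchor>) are links;
        # <img src> is an image.
        if tag in ("a", "go") and attrs.get("href"):
            self.targets.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.targets.append(attrs["src"])


def extract_targets(wml_text, base_url):
    """Return absolute target URLs found in wml_text, without #fragments."""
    parser = WmlLinkParser()
    parser.feed(wml_text)
    return [urldefrag(urljoin(base_url, t))[0] for t in parser.targets]


def fetch_url(url):
    """Default fetcher: return (content_type, body_bytes) for url."""
    with urlopen(url) as resp:
        return resp.headers.get_content_type(), resp.read()


def crawl(start_url, fetch=fetch_url):
    """Fetch every same-host target reachable from start_url, once each."""
    site = urlparse(start_url).netloc
    downloaded = {}              # the "already-downloaded list": url -> bytes
    queue = [start_url]          # work list standing in for the recursion
    while queue:
        url = queue.pop()
        if url in downloaded or urlparse(url).netloc != site:
            continue             # skip duplicates and stuff outside the site
        content_type, body = fetch(url)
        downloaded[url] = body
        if content_type == "text/vnd.wap.wml":   # assumed WML test
            text = body.decode("utf-8", "replace")
            queue.extend(extract_targets(text, url))
    return downloaded
```

Calling crawl() with the URL of the WML home page returns a dict mapping
each fetched URL to its bytes; writing those out under a directory tree is
left out here. The fetch argument exists so the walk can be tried against
canned pages before pointing it at a real site.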
--
Tony (echo 'spend!,pocket awide' | sed 'y/acdeikospntw!, /l at omcgtjuba.phi/')