Crawling: i2p2.i2p recursive source loops #30
Opened 5 years ago
Last modified 5 years ago
#1781 assigned defect
Reported by: k1773r
Owned by: str4d
Priority: minor
Milestone: undecided
Component: www/i2p
Version: 0.9.24
Keywords:
Cc:
Parent Tickets:
Sensitive: no
Description
While crawling www.i2p2.i2p I get recursive links that lead to a "page not found" page, but the HTTP status is 200. Those pages contain further nested links, and the cycle starts over. Eventually the crawler hits a 404 (as shown below).
Crawler logs:
The first link is the page that was crawled; the second link is the page it was found on.
The crawler would detect the loop after a few levels of nesting, but for now I have just created an exclude regex.
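A minimal sketch of the kind of loop guard described above, assuming the recursion shows up as the same path segments repeating inside a URL. The crawler's real logic and the real URLs are not shown in this ticket, so the function name `looks_like_loop`, the `max_repeats` threshold, and the example paths are all hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def looks_like_loop(url: str, max_repeats: int = 2) -> bool:
    """Return True if any path segment occurs more than max_repeats times.

    Assumption: a recursive link loop produces URLs whose path keeps
    re-nesting the same segments, e.g. /en/docs/en/docs/en/docs/...
    """
    # Split the path into non-empty segments, ignoring query and fragment.
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if not segments:
        return False
    # Flag the URL when the most frequent segment exceeds the threshold.
    return max(Counter(segments).values()) > max_repeats


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    print(looks_like_loop("http://www.i2p2.i2p/en/docs/en/docs/en/docs/page"))
    print(looks_like_loop("http://www.i2p2.i2p/en/docs/how/intro"))
```

A check like this can run before a URL is queued, so the crawler stops descending without needing a site-specific exclude regex. The threshold is a trade-off: too low and legitimate paths with repeated segment names get skipped.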
comment:2 Changed 5 years ago by zzz
Owner: set to str4d
Status: new → assigned
comment:1 Changed 5 years ago by k1773r
Version: → 0.9.24
The status code also varies depending on which host is being used:
geti2p.net is 404
i2p-projekt.i2p is 302
i2p2.i2p is 200
for example on i2p2.i2p:
Each site also links to the other sites using the invalid/nonexistent page, which makes it even worse.
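Because the same missing page returns 404, 302, or 200 depending on the host, a crawler cannot rely on the status code alone. The sketch below shows one way a crawler might classify responses so that a 200 "page not found" page is treated like a 404. It is an illustration under assumptions: the function name `classify_response` is hypothetical, and the marker text is a guess, since the actual wording of the error page is not quoted in this ticket.

```python
def classify_response(status: int, body: str,
                      marker: str = "page not found") -> str:
    """Classify a fetched page as 'missing', 'redirect', or 'ok'.

    Assumption: the soft-404 page contains the marker text somewhere
    in its body (compared case-insensitively).
    """
    if status == 404:
        # Proper "not found" response, as reported for geti2p.net.
        return "missing"
    if status in (301, 302, 303, 307, 308):
        # Redirect, as reported for i2p-projekt.i2p; the caller should
        # follow it and classify the target instead.
        return "redirect"
    if status == 200 and marker in body.lower():
        # Soft 404: error page served with a success status,
        # as reported for i2p2.i2p.
        return "missing"
    return "ok"


if __name__ == "__main__":
    print(classify_response(200, "<h1>Page Not Found</h1>"))
    print(classify_response(302, ""))
    print(classify_response(200, "<h1>Welcome</h1>"))
```

Links extracted from a page classified as "missing" can then be dropped instead of queued, which breaks the loop at its source regardless of which host served the page.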