Crawling: i2p2.i2p recursive source loops #30

Open
opened 2025-04-21 14:47:45 -04:00 by idk · 2 comments
Opened 2016-04-06 15:38:09Z

Last modified 2016-05-04 16:25:58Z

#1781 assigned defect

Reported by: k1773r
Owned by: str4d
Priority: minor
Milestone: undecided
Component: www/i2p
Version: 0.9.24
Keywords:
Cc:
Parent Tickets:
Sensitive: no

Description

While crawling www.i2p2.i2p I get recursive links that lead to a "page not found" page, even though the HTTP status is 200. Those pages contain further nested links, and the cycle starts all over again. Eventually the crawl hits a 404 (as shown below).

Crawler logs (the first URL is the page crawled, the second URL is the page it was linked from):

    2016-04-06T**:19:40.798Z   404      22321 https://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_ru.html LEEEEEEEELERR http://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_ru.html text/html #044 20160406**1940424+346 sha1:66374BVL4IQZ3HBJXFVOAYAZBWU6VGEQ - -
    2016-04-06T**:19:39.700Z   404      22321 https://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_nl.html LEEEEEEEELERR http://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_nl.html text/html #018 20160406**1939082+603 sha1:JWWJX7KEBMZCBJSEZW6C3TQPEEA6VG32 - -
    2016-04-06T**:19:38.583Z   404      22321 https://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_it.html LEEEEEEEELERR http://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_it.html text/html #047 20160406**1938203+365 sha1:TNDZLJEXSFWTE3UZ3FX4BHELNBQSAW3F - -
    2016-04-06T**:19:37.853Z   404      22321 https://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_fr.html LEEEEEEEELERR http://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_fr.html text/html #029 20160406**1937490+336 sha1:UIIBTTZBEW2LHC5TIWALY33YBZPQ4Y5C - -
    2016-04-06T**:19:37.081Z   404      22321 https://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_zh.html LEEEEEEEELERR http://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_zh.html text/html #018 20160406**1936671+397 sha1:P6IKCGRG77YEY3U3QGET6JQICO2M274M - -
    2016-04-06T**:19:36.201Z   404      22321 https://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_es.html LEEEEEEEELERR http://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_es.html text/html #047 20160406**1935726+448 sha1:GWBZFXTRMUQZQIPJ4EKA3FW4ERRRLYHS - -
    2016-04-06T**:19:35.361Z   404      22321 https://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_de.html LEEEEEEEELERR http://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index_de.html text/html #040 20160406**1934995+353 sha1:M56A3Y62E7AJYUEURZ224EEEYXS3GYCP - -
    2016-04-06T**:19:34.526Z   404      22318 https://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index.html LEEEEEEEELERR http://geti2p.net/feeds/p/i2p/downloads/_static/styles/_static/styles/_static/_static/styles/_static/styles/_static/_static/index.html text/html #048 20160406**1934130+372 sha1:MAW4ZNR2RB4RFR6XG2UECOZCKQFT4TFW - -
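
To make the log lines above easier to read, here is a minimal sketch that splits one line into named fields. The ticket only states that the first URL is the crawled page and the second is where it came from; the other field names (`timestamp`, `status`, `size`, `hop_path`, `mime_type`) and the helper name `parse_crawl_line` are assumptions based on the column order, not something the reporter confirmed.

```python
from typing import NamedTuple


class CrawlRecord(NamedTuple):
    timestamp: str   # assumed: time the fetch finished
    status: int      # HTTP status code
    size: int        # assumed: response size in bytes
    url: str         # page that was crawled (stated in the ticket)
    hop_path: str    # assumed: how the crawler reached the page
    referrer: str    # page the link came from (stated in the ticket)
    mime_type: str   # assumed: reported content type


def parse_crawl_line(line: str) -> CrawlRecord:
    """Split one whitespace-separated crawler log line into named fields."""
    fields = line.split()
    if len(fields) < 7:
        raise ValueError(f"expected at least 7 fields, got {len(fields)}")
    return CrawlRecord(
        timestamp=fields[0],
        status=int(fields[1]),
        size=int(fields[2]),
        url=fields[3],
        hop_path=fields[4],
        referrer=fields[5],
        mime_type=fields[6],
    )
```

With records in this shape, grouping by `status` per host would show quickly which hosts answer the looping paths with 200 instead of 404.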

The crawler would detect the loop after several levels of nesting, but for now I have just created an exclude regex.
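
The exclude regex itself is not included in this ticket. Purely as an illustration, a pattern along the following lines would reject the looping paths seen in the logs. The pattern, the threshold of three repeats, and the name `is_loop_path` are all assumptions made for this sketch.

```python
import re

# Hypothetical pattern: three or more consecutive "_static" segments, each
# optionally followed by "styles", as in the looping URLs in the logs.
LOOP_PATTERN = re.compile(r"(?:/_static(?:/styles)?){3,}/")


def is_loop_path(url: str) -> bool:
    """Return True if the URL looks like one of the recursive _static loops."""
    return LOOP_PATTERN.search(url) is not None
```

A crawler could skip any URL for which `is_loop_path` returns True before queuing it. The threshold is a trade-off: a lower value cuts the loop off earlier but risks excluding legitimate paths that happen to nest `_static` directories.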

idk added this to the undecided milestone 2025-04-21 14:47:45 -04:00
idk added the #1781, i2p, undecided, www labels 2025-04-21 14:47:45 -04:00

comment:2 Changed 2016-05-04 16:25:58Z by zzz

Owner: set to str4d
Status: new → assigned


comment:1 Changed 2016-04-06 16:40:47Z by k1773r

Version: → 0.9.24

The status code also varies depending on which host is being used:

geti2p.net is 404

i2p-projekt.i2p is 302

i2p2.i2p is 200

For example, on i2p2.i2p:

    2016-04-06T**:39:16.911Z 200 6436 http://www.i2p2.i2p/feeds/p/i2p/downloads/_static/_static/styles/_static/styles/_static/_static/styles/_static/_static/donate.html LEEEEEEEEL http://www.i2p2.i2p/feeds/p/i2p/downloads/_static/_static/styles/_static/styles/_static/_static/styles/_static/_static/favicon.ico text/html #003 20160406**3916031+862 sha1:6QTBW2PILK47WKXLN6SKFFE5KWWH2AZR - -

The fact that each site links to the other site with the invalid/nonexistent page makes it even worse.
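
Because the same missing page comes back as 404, 302, or 200 depending on the host, a crawler cannot rely on the status code alone. The sketch below shows one way to classify a response as a dead end. It assumes the error page contains the literal text "page not found" (the ticket only describes it as a "page not found" site), and the function name `is_dead_end` and the `marker` parameter are hypothetical.

```python
def is_dead_end(status: int, body: str, marker: str = "page not found") -> bool:
    """Return True if a response should be treated as a missing page.

    A real 404 is a dead end. A 200 whose body contains the assumed
    error-page marker is treated as one too. Redirects such as 302 are not
    classified here; the caller would have to follow them and check the
    final response.
    """
    if status == 404:
        return True
    if status == 200 and marker in body.lower():
        return True
    return False
```

Links extracted from a response flagged this way would simply not be queued, which stops the nesting at the first bogus page regardless of which host served it.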

Reference: I2P_Developers/i2p.www#30