Started by Lidig Tinilalo, Oct 6, 2021

Open
Nutch does not crawl sites that allow all crawlers in robots.txt


I collected more than 300 URLs in seed.txt for Nutch to crawl, but Nutch did not crawl approximately 80 of them. I checked these sites and found that most of them allow crawling in robots.txt with:

User-agent: *

Why doesn't Nutch crawl these sites? Is there a way to fix this behavior?
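
To rule robots.txt in or out as the cause, the rules of a skipped site can be checked offline. The following is a minimal sketch using Python's standard `urllib.robotparser`, not Nutch's own robots.txt parser, so the two may disagree in edge cases. The function name `is_allowed`, the agent name `my-nutch-crawler`, and the example.com URL are hypothetical placeholders, not values from my setup.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, agent: str = "my-nutch-crawler") -> bool:
    """Return True if the given robots.txt text permits `agent` to fetch `url`.

    `agent` is a placeholder; substitute the agent name your crawler sends.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The rule as it appears on the affected sites: a wildcard agent with no Disallow lines.
BARE_WILDCARD = "User-agent: *\n"
# For comparison: an explicit allow-all and an explicit block-all.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

if __name__ == "__main__":
    url = "https://example.com/page.html"  # hypothetical URL
    for name, rules in [("bare", BARE_WILDCARD), ("allow", ALLOW_ALL), ("block", BLOCK_ALL)]:
        print(name, is_allowed(rules, url))
```

If this check reports the URLs as allowed, robots.txt is probably not the reason they were skipped, and the cause is more likely elsewhere in the crawl (for example URL filters, fetch errors, or redirects).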

0 Replies
