Started by Lidig Tinilalo, Oct 6, 2021

Nutch does not crawl sites that allow all crawlers in robots.txt

I collected more than 300 URLs in seed.txt that should be crawled by Nutch, but approximately 80 of them were not crawled. I checked these sites and found that most of them allow all crawlers in robots.txt with:

User-agent: *

Why doesn't Nutch crawl these sites? Is there a way to fix this behavior?
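For reference, a minimal sketch to sanity-check whether a given agent name is allowed by one of the skipped sites. It uses Python's standard urllib.robotparser rather than Nutch's own robots.txt handling, and the agent name and URL below are placeholders; substitute the http.agent.name value from your Nutch configuration and one of the affected seed URLs:

from urllib.robotparser import RobotFileParser

# Placeholders: use the http.agent.name value from your Nutch config
# and one of the skipped seed URLs.
agent = "MyNutchCrawler"
site = "https://example.com"

# Fetch and parse the site's robots.txt.
rp = RobotFileParser()
rp.set_url(site + "/robots.txt")
rp.read()

# True means robots.txt permits this agent to fetch the page.
print(rp.can_fetch(agent, site + "/"))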
