Information Retrieval — Problem Set II - Part 2
Programming Assignment 1
Due Date: Mon. Sep 11, 2023
In this work, you are going to scrape webpages within the KSU domain and analyze the fetched
pages. The objectives are:
1. Fetch and store the webpages.
2. Analyze how many webpages contain at least one email address in a raw textual format
(e.g., netid@kennesaw.edu, not obfuscated forms like netid at kennesaw dot edu).
3. Build a vocabulary of the collected webpages, and plot word frequency charts to see if
the distribution obeys Zipf's law.
Crawling the Web using Scrapy
We are going to use a Python package, Scrapy, for web crawling. Scrapy is a fast high-level web
crawling and web scraping framework, used to crawl websites and extract structured data from
their pages.
Follow the official installation guide to install Scrapy. We recommend installing Scrapy in a
Python virtualenv.
Python and Virtualenv
If you are not familiar with Python virtual environments, read this.
You will need to take the following steps to install your virtual environment.
Step 1: Install the latest version of Python on your PC or your Mac
Step 2: Install the Virtual Environment with the following command:
On a PC, from the command prompt after Python is installed:
python -m venv virtualworkspace
virtualworkspace\Scripts\activate.bat
python -m pip install --upgrade pip
Then, with the virtual environment activated:
pip install Scrapy
On the Mac after Python is installed:
sudo pip3 install virtualenv virtualenvwrapper
Edit your ~/.zshrc to enable the virtualenv plugin and set Python 3 as the default for
virtualenvwrapper, like so:
...
plugins=(...virtualenv)
...
# Virtualenvwrapper things
export VIRTUALENVWRAPPER_PYTHON='/usr/bin/python3'
export WORKON_HOME=$HOME/.virtualenvs
export PROJECT_HOME=$HOME/Workspace
source /usr/local/bin/virtualenvwrapper.sh
Create a virtual environment for this project and activate it. For example,
mkvirtualenv ir --python=python3
workon ir
Now you can install a Python package in the ir virtualenv.
pip install Scrapy
Scrapy Tutorial
We are going to extract textual data (e.g., titles, body text) from webpages and store it.
Follow this tutorial to learn the essentials.
Crawler for the KSU Web
Implement a spider subclass as described in the tutorial.
Specify the rules that your spider will follow. The rules provided to a spider govern how to
extract the links from a page and which callbacks should be called for those links. See
scrapy.spiders.Rule. The LinkExtractor of your spider must obey the following rules:
• Only the links with the domain name 'kennesaw.edu' should be extracted for the next
request.
• Duplicate URLs should not be revisited.
Your spider must obey the following (a settings sketch appears after this list):
• Let the webservers know who your spider is associated with. Include a phrase in the
USER_AGENT value to let them know this is part of the course experiments. Use a name
such as "KSU CS7263-IRbot/0.1".
• Be polite:
o Wait at least 2.0 sec before downloading consecutive pages from the same
domain
o Respect robots.txt policies
• Run in breadth-first search order (i.e., FIFO)
• Once you reach the desired number of fetched pages, terminate crawling (set
CLOSESPIDER_PAGECOUNT to a reasonable number; 1,000 is a good choice).
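One way to express these constraints is through Scrapy settings, either in settings.py or in the spider's custom_settings dictionary. The following is a minimal sketch (the contact phrase in the user-agent string is an assumption; adjust it to your own details):

# Sketch: settings enforcing the crawling rules above (settings.py or custom_settings)
custom_settings = {
    # Identify the bot to web servers as part of the course experiments
    'USER_AGENT': 'KSU CS7263-IRbot/0.1 (class project)',
    # Politeness: wait at least 2.0 seconds between requests to the same domain
    'DOWNLOAD_DELAY': 2.0,
    # Respect robots.txt policies
    'ROBOTSTXT_OBEY': True,
    # Crawl in breadth-first (FIFO) order
    'DEPTH_PRIORITY': 1,
    'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
    'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
    # Stop crawling once roughly 1,000 pages have been fetched
    'CLOSESPIDER_PAGECOUNT': 1000,
}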
Your spider should start crawling with the following three URLs (a spider skeleton sketch follows this list):
1. KSU home (www.kennesaw.edu)
2. CCSE home (ccse.kennesaw.edu)
3. And any KSU webpage of your choice
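Putting these pieces together, a spider skeleton might look like the sketch below. The class name, file name, and third seed URL are placeholders, and Scrapy's built-in duplicate filter handles the no-revisit requirement.

# spiders/ksu_spider.py -- skeleton sketch of a CrawlSpider (names are placeholders)
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class KSUSpider(CrawlSpider):
    name = 'ksu'
    # Only links within the kennesaw.edu domain are followed
    allowed_domains = ['kennesaw.edu']
    start_urls = [
        'https://www.kennesaw.edu/',
        'https://ccse.kennesaw.edu/',
        # add any KSU webpage of your choice as the third seed
    ]

    # Extract kennesaw.edu links and hand each fetched page to parse_items;
    # Scrapy's default dupefilter prevents revisiting duplicate URLs.
    rules = (
        Rule(LinkExtractor(allow_domains=['kennesaw.edu']),
             callback='parse_items', follow=True),
    )

    # custom_settings as sketched above (politeness, BFS order, page count)

    def parse_items(self, response):
        ...  # implemented in the next section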
Implement a parse function which extracts information from the url response. Yield a dictionary
which contains data of interest:
def parse_items(self, response):
    entry = dict.fromkeys(['pageid', 'url', 'title', 'body', 'emails'])
    # TODO. Extract corresponding information and fill the entry
    yield entry
• pageid: str, A unique identifier for the page. You may use a hash function (e.g., md5) to
create a unique ID for a URL.
• url: str, URL from which the page is fetched
• title: str, Title of the page (if exists)
• body: str, Body text of the page. get_text from the Python package BeautifulSoup might
be useful for extracting all the text in a document
• emails: list, A list of email addresses found in the document.
Use the function above as a callback function of your LinkExtractor.
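A sketch of one way to fill the entry, assuming BeautifulSoup is installed and using a simple (deliberately permissive) email regular expression; the imports go at the top of the spider module and parse_items is a method of the spider class:

import hashlib
import re
from bs4 import BeautifulSoup

# Simple pattern for raw email addresses; it will not cover every valid form.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

def parse_items(self, response):
    entry = dict.fromkeys(['pageid', 'url', 'title', 'body', 'emails'])
    # pageid: md5 hash of the URL as a unique identifier
    entry['pageid'] = hashlib.md5(response.url.encode('utf-8')).hexdigest()
    entry['url'] = response.url
    # title: the page title, if one exists
    entry['title'] = response.css('title::text').get(default='').strip()
    # body: all visible text, extracted with BeautifulSoup's get_text
    soup = BeautifulSoup(response.text, 'html.parser')
    entry['body'] = soup.get_text(separator=' ', strip=True)
    # emails: every raw email address found in the body text
    entry['emails'] = EMAIL_RE.findall(entry['body'])
    yield entry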
Run your crawler and save the scraped items
Start crawling by running the following command at the project root directory. Syntax:
scrapy crawl <spider_name>
If you use the -O option, Scrapy will dump the scraped items to a file. Syntax:
scrapy crawl <spider_name> -O ksu1000.json
Text Statistics
Now, you should have downloaded textual data from KSU webpages stored in a file (e.g., a
JSON file). We want to compute the following statistics:
• Average length of webpages in tokens (use simple split() for tokenization)
• Top ten most frequent email addresses
• Percentage of webpages that contain at least one email address
Your output should be similar to the following (a computation sketch appears after the sample output):
% python text_stats.py ksu1000.json
Average tokens per page: 523.352
Most Frequent Emails:
('[email protected]', 8)
('[email protected]', 4)
('[email protected]', 4)
('[email protected]', 4)
('[email protected]', 3)
('[email protected]', 3)
('[email protected]', 3)
('[email protected]', 3)
('[email protected]', 2)
('[email protected]', 2)
Percentage with at least one email: 0.061%
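One possible text_stats.py, sketched under the assumption that the input is the JSON array produced by the -O option above and that tokens are whitespace-separated:

# text_stats.py -- sketch of the statistics above (assumes the ksu1000.json format)
import json
import sys
from collections import Counter

def main(path):
    with open(path, encoding='utf-8') as f:
        pages = json.load(f)

    # Average number of whitespace-separated tokens per page
    token_counts = [len((page.get('body') or '').split()) for page in pages]
    print(f'Average tokens per page: {sum(token_counts) / len(pages):.3f}')

    # Ten most frequent email addresses across all pages
    email_counter = Counter(email.lower()
                            for page in pages
                            for email in (page.get('emails') or []))
    print('Most Frequent Emails:')
    for pair in email_counter.most_common(10):
        print(pair)

    # Percentage of pages containing at least one email address
    with_email = sum(1 for page in pages if page.get('emails'))
    print(f'Percentage with at least one email: {100.0 * with_email / len(pages):.3f}%')

if __name__ == '__main__':
    main(sys.argv[1])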
Word Frequencies:
We also want to analyze the word frequencies from the scraped KSU webpages. Build a
vocabulary and count the word frequencies. List the top 30 most common words before and
after removing stopwords. We expect to see a list similar to the following.
rank term freq. perc. rank term freq. perc.
------ -------- ------- ------- ------ ----------- ------- -------
1 and 15539 0.031 16 on 2991 0.006
2 the 12164 0.025 17 university 2928 0.006
3 of 9315 0.019 18 contact 2603 0.005
4 to 7990 0.016 19 about 2558 0.005
5 & 6512 0.013 20 search 2430 0.005
6 / 5743 0.012 21 information 2351 0.005
7 for 5333 0.011 22 faculty 2316 0.005
8 in 5178 0.01 23 student 2217 0.004
9 campus 4566 0.009 24 you 2203 0.004
10 ksu 4496 0.009 25 is 2201 0.004
11 a 4314 0.009 26 with 2161 0.004
12 kennesaw 4156 0.008 27 community 2014 0.004
13 students 3361 0.007 28 programs 2013 0.004
14 research 3146 0.006 29 global 1978 0.004
15 state 3065 0.006 30 marietta 1885 0.004
Now, remove any stopwords and punctuation, then print another ranking (a counting sketch
follows the table below). For stopwords, you may use nltk.corpus.stopwords. For removing
punctuation, you may use string.punctuation or the regular expression "[^\w\s]".
rank term freq. perc. rank term freq. perc.
------ ----------- ------- ------- ------ --------- ------- -------
1 campus 4566 0.009 16 marietta 1885 0.004
2 ksu 4496 0.009 17 resources 1873 0.004
3 kennesaw 4156 0.008 18 home 1855 0.004
4 students 3361 0.007 19 staff 1773 0.004
5 research 3146 0.006 20 program 1677 0.003
6 state 3065 0.006 21 diversity 1665 0.003
7 university 2928 0.006 22 ga 1564 0.003
8 contact 2603 0.005 23 © 1445 0.003
9 search 2430 0.005 24 2021 1441 0.003
10 information 2351 0.005 25 college 1354 0.003
11 faculty 2316 0.005 26 online 1346 0.003
12 student 2217 0.004 27 alumni 1308 0.003
13 community 2014 0.004 28 us 1303 0.003
14 programs 2013 0.004 29 safety 1247 0.003
15 global 1978 0.004 30 financial 1169 0.002
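The rankings can be produced along the lines of the following sketch, which builds the vocabulary with a lowercased whitespace split and then applies the stopword and punctuation filters described above (exact counts and percentages will vary with your crawl and tokenization choices):

# word_freq.py -- sketch of vocabulary building and the two frequency rankings
import json
import re
import sys
from collections import Counter

from nltk.corpus import stopwords   # requires nltk.download('stopwords') once

def term_counts(pages):
    # Vocabulary over all page bodies: lowercase, whitespace-split tokens
    counter = Counter()
    for page in pages:
        counter.update((page.get('body') or '').lower().split())
    return counter

def print_ranking(counter, total, top_n=30):
    # Percentages are reported relative to the total token count
    for rank, (term, freq) in enumerate(counter.most_common(top_n), start=1):
        print(f'{rank:>4}  {term:<15} {freq:>7}  {freq / total:.3f}')

if __name__ == '__main__':
    with open(sys.argv[1], encoding='utf-8') as f:
        pages = json.load(f)

    counts = term_counts(pages)
    total_tokens = sum(counts.values())
    print_ranking(counts, total_tokens)        # ranking with stopwords/punctuation

    # Strip punctuation ("[^\w\s]") and drop stopwords for the second ranking
    stops = set(stopwords.words('english'))
    cleaned = Counter()
    for term, freq in counts.items():
        term = re.sub(r'[^\w\s]', '', term)
        if term and term not in stops:
            cleaned[term] += freq
    print_ranking(cleaned, total_tokens)       # ranking without stopwords/punctuation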
Next, plot the word distribution for the data before removing stopwords and punctuation, and for
the data after removing them. Use a log-log plot; if the distribution follows Zipf's law, the curve
should appear approximately linear.
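For example, a minimal matplotlib sketch, assuming a Counter of term frequencies as built in the ranking sketch above:

import matplotlib.pyplot as plt

def plot_zipf(counter, title):
    # Frequencies sorted from most to least common; ranks are 1, 2, 3, ...
    freqs = [freq for _, freq in counter.most_common()]
    ranks = range(1, len(freqs) + 1)
    plt.figure()
    plt.loglog(ranks, freqs, marker='.', linestyle='none')
    plt.xlabel('rank (log scale)')
    plt.ylabel('frequency (log scale)')
    plt.title(title)
    plt.savefig(title.replace(' ', '_') + '.png')

Call it once with the raw counts and once with the stopword- and punctuation-free counts to produce the two required plots.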
What to submit
1. The code for your spider class, which should be located at [scrapy_project]/spiders/
2. The code for generating the text statistics and plots
3. Outputs similar to the ones shown above. Please write all the outputs in one single file
(e.g., docx, pdf, md). This file should include the following:
1. email statistics
2. two word frequency rankings (before and after removing stopwords and
punctuations)
3. two frequency plots (one with stopwords and punctuation and one without)
Please submit all artifacts via the D2L Assignments page.