========================
Project 2: Search Engine
========================

**CS2621 – Web Science**

*100 points*

You are to create a web search engine that works at the command line.
To do this, you will write two Python scripts, indexer.py and
search.py.
Indexer
=======

The indexer.py script should do the following:
1. After performing a crawl (using your other Python script), read all
   the HTML files that were stored in the “pages” directory. For each
   document, extract the title and the text from the body of the page
   (read the Beautiful Soup documentation to find out how). Beautiful
   Soup will include the title text in the extracted page text, and
   that is OK. Beautiful Soup may also break on some pages and include
   HTML as text, but we will not worry about these exceptions or bugs.
2. All text should be converted to lowercase, and non-alphanumeric
   characters should be ignored. So “123-456” would become “123” and
   “456”, and “joe@yahoo.com” would become “joe”, “yahoo”, “com”.
   Ignore the following stop words: a, an, and, are, as, at, be, by,
   for, from, has, he, in, is, it, its, of, on, that, the, to, was,
   were, will, with. Do not perform stemming.
3. A single inverted index should be created for the document corpus
   which maintains the document ID (numbered 1…n in order of the pages
   found in the “pages” directory), a 1 or 0 for whether the term is
   found in the title, and the term frequency from the body (normalized
   by the total number of tokens in the document after removing stop
   words).
4. After indexer.py has finished indexing all the web pages, it should
   output the index to index.dat, which looks like this:

   ::

      arkansas
         6 0 0.022
      model
         1 0 0.309
         3 0 0.015
         5 1 0.001
      tuesday
         2 0 0.082
      white
         2 1 0.018
      etc…
.. note::

   The indexed words are alphabetized, and there are 3 spaces before
   sets of three numbers (each separated by a single space) which are:
   doc ID, title (0 or 1), and normalized body TF (rounded to 3 decimal
   places). For example, the term white was found only in document 2;
   it was somewhere in the title and made up 1.8% of all the words in
   the document.
5. It may take some time for your program to run, so you should output
   information about the program’s status as it indexes the crawled
   pages. Outputting which file is being worked on would be helpful to
   the user who is waiting for the program to finish its work.
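The tokenizing and indexing steps above can be sketched as follows. The function names and the dict-based index shape are my own choices, not part of the assignment, and reading the HTML files with Beautiful Soup is omitted here:

```python
import re

# Stop words listed in step 2.
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for",
              "from", "has", "he", "in", "is", "it", "its", "of", "on",
              "that", "the", "to", "was", "were", "will", "with"}

def tokenize(text):
    """Lowercase, split on non-alphanumeric characters, drop stop words."""
    tokens = re.split(r"[^a-z0-9]+", text.lower())
    return [t for t in tokens if t and t not in STOP_WORDS]

def index_document(doc_id, title, page_text, index):
    """Add one document's postings to an index shaped as
    term -> {doc_id: [in_title, normalized_tf]}.

    page_text is all text of the page (it includes the title text).
    """
    title_terms = set(tokenize(title))
    tokens = tokenize(page_text)
    for term in tokens:
        posting = index.setdefault(term, {}).setdefault(doc_id, [0, 0])
        posting[0] = 1 if term in title_terms else 0
        posting[1] += 1  # raw occurrence count for now
    for term in set(tokens):  # normalize counts by total token count
        index[term][doc_id][1] /= len(tokens)

def write_index(index, path="index.dat"):
    """Write alphabetized terms; postings indented three spaces."""
    with open(path, "w") as f:
        for term in sorted(index):
            f.write(term + "\n")
            for doc_id in sorted(index[term]):
                in_title, tf = index[term][doc_id]
                f.write(f"   {doc_id} {in_title} {tf:.3f}\n")
```

Note that normalizing by the token count of the whole page text (title included) is what reproduces the TF values in the test data below.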
Search
======

After the index is written to index.dat, the search.py script will
allow the user to search the corpus for specific words. Here is how
it should operate:
1. First, read the search phrase at the command line. Examples:

   .. code:: bash

      $ search.py bisons
      $ search.py "landmark college"

   If no command line argument is supplied, the program should tell the
   user a search term is required and terminate. Ignore any command-line
   arguments after the first.
2. Next, the program should read the index from index.dat into memory.
   Note that you may want to use the same data structures as in
   indexer.py, so write your programs in a way that shares code rather
   than duplicating it in each script. (It’s OK to introduce new .py
   files to your project.)
3. For simplicity, all queries will be assumed to use boolean ANDs,
   and we will not implement phrase search. For example, the query
   landmark college should generate a boolean search for landmark AND
   college, so only documents containing both terms should be
   considered a match.
4. Remove any stop words from the query as was done when indexing the
   documents.
5. After determining which documents match the search terms, calculate
   the relevancy score for each term in each document:

   ::

      relevancy score = 0.9 * body TF + 0.1 * title score

   Do this for each term, and compute the average relevancy score
   across all terms. So if the search was for landmark college, you
   would compute the score for landmark and the score for college and
   average the two to determine the overall relevancy score.
6. The total number of results should first be displayed. Then display
   every document ID and score (out to 3 decimal places) ordered by
   score, and number the results. Example:

   .. code:: bash

      Results: 4
      1. docID 3, score 0.830
      2. docID 1, score 0.814
      3. docID 5, score 0.350
      4. docID 8, score 0.108
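Steps 3–6 can be sketched as below, assuming the index has been read back into a dict shaped as term -> {doc_id: (in_title, tf)}; the names here are illustrative only:

```python
def search(index, query_terms):
    """Boolean-AND matching with averaged relevancy scoring."""
    # Only documents that contain every query term are matches.
    doc_sets = [set(index.get(term, {})) for term in query_terms]
    matches = set.intersection(*doc_sets) if doc_sets else set()
    results = []
    for doc_id in matches:
        total = 0.0
        for term in query_terms:
            in_title, tf = index[term][doc_id]
            total += 0.9 * tf + 0.1 * in_title
        results.append((doc_id, total / len(query_terms)))
    # Highest-scoring documents first.
    return sorted(results, key=lambda r: r[1], reverse=True)

# Printing the results (step 6) would then look like:
#   print(f"Results: {len(results)}")
#   for rank, (doc_id, score) in enumerate(results, 1):
#       print(f"{rank}. docID {doc_id}, score {score:.3f}")
```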

**Bonus:** You can receive 5 bonus points by implementing phrase search.
So when the user searches for “landmark college”, assume they want
only documents containing that exact phrase. To accomplish this, you
will need to store the positions of the terms in the inverted index,
then use those positions to ensure the phrase's terms appear at
successive positions.
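One way to check successive positions for the bonus is sketched below. The shape of the positions structure is an assumption on my part; the assignment only says positions must be stored in the inverted index:

```python
def phrase_match(positions, terms, doc_id):
    """True if the terms occur at consecutive positions in doc_id.

    positions maps term -> {doc_id: list of token positions}.
    """
    starts = set(positions.get(terms[0], {}).get(doc_id, []))
    for offset, term in enumerate(terms[1:], start=1):
        later = set(positions.get(term, {}).get(doc_id, []))
        # Keep only starting positions that this term directly follows.
        starts = {p for p in starts if p + offset in later}
    return bool(starts)
```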
Zip your entire project directory and submit it to Canvas before it
is due. Make sure your output matches the specifications precisely to
avoid losing any points. If you use any code you find on the Web, you
must document the source in your program.
Test Data
=========

*a.html*

.. code:: html

   <title>cool!!! test!!!</title>
   <body>
   this 123-456.
   </body>
*b.html*

.. code:: html

   <html>
   <head>
   <title>Go Patriots!</title>
   </head>
   <body>
   And another test and test!
   </body>
   </html>
*c.html*

.. code:: html

   <body>
   This is a test.
   </body>
*Inverted index:*

.. code::

   123
      a 0 0.200
   456
      a 0 0.200
   another
      b 0 0.200
   cool
      a 1 0.200
   go
      b 1 0.200
   patriots
      b 1 0.200
   test
      a 1 0.200
      b 0 0.400
      c 0 0.500
   this
      a 0 0.200
      c 0 0.500
Search for "test this" results in the following:

::

   Results: 2
   1. docID c, score 0.450
   2. docID a, score 0.230
Search for "test patriots go" results in the following:

::

   Results: 1
   1. docID b, score 0.307
Search for "cool patriots" results in the following:

::

   Results: 0
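As a sanity check, the scores in the "test this" example follow directly from the relevancy formula, using the TF and title values from the inverted index above:

```python
# Document c: test and this each have body TF 0.500, neither in the title.
score_c = ((0.9 * 0.500 + 0.1 * 0) + (0.9 * 0.500 + 0.1 * 0)) / 2
# Document a: test is in the title (TF 0.200); this is not (TF 0.200).
score_a = ((0.9 * 0.200 + 0.1 * 1) + (0.9 * 0.200 + 0.1 * 0)) / 2
print(f"{score_c:.3f}")  # 0.450
print(f"{score_a:.3f}")  # 0.230
```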