aboutsummaryrefslogtreecommitdiff
path: root/search/assign.rst
blob: 70d0b08b516f6e12c550a778bbff414c8de0ae2d (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
========================
Project 2: Search Engine
========================

**CS2621– Web Science**

*100 points*

You are to create a web search engine that works at the command line.
To do this, you will write two Python scripts, indexer.py and
search.py. 

Indexer
=======

Indexer.py should do the following:

1. After performing a crawl (using your other Python script), read all
   the HTML files that were stored in the “pages” directory. For each
   document, extract the title and the text from the body of the page
   (read the Beautiful Soup documentation to find out how). Beautiful
   Soup will include the text of the page in the content of the page,
   and that is OK. Beautiful Soup may also break on some pages and
   include HTML as text, but we will not worry about these 
   exceptions or bugs.

2. All text should be converted to lowercase and non-alphanumeric
   characters should be ignored. So “123-456” would become “123” and
   “456”, and “joe@yahoo.com” would become “joe”, “yahoo”, “com”.
   Ignore the following stop words: a, an, and, are, as, at, be, by,
   for, from, has, he, in, is, it, its, of, on, that, the, to, was,
   were, will, with. Do not perform stemming.

3. A single inverted index should be created for the document corpus
   which maintains the document ID (numbered 1…n in order of the pages
   found in the “pages” directory), a 1 or 0 if the text is found in
   the title, and the term frequency from the body (normalized by the
   total number of tokens in the document after removing stop words).

4. After indexer.py has finished indexing all the web pages, it should
   output the index to index.dat which looks likethis:

::

  arkansas
    6 0 0.022
  model
    1 0 0.309
    3 0 0.015
    5 1 0.001
  tuesday
    2 0 0.082
  white
    2 1 0.018
  etc…

.. note ::
  The indexed words are alphabetized, and there are 3 spaces before
  sets of three numbers (each separated by a single space) which are:
  doc ID, title (0 or 1), and normalized body TF (rounded to 3 decimal
  places). For example, the term white was found only in document 2;
  it was somewhere in the title and made up 1.8% of all the words in
  the document.

5. It may take some time for your program to run, so you should output
   information about the program’s status as it indexes the crawled
   pages. Outputting what file is being worked on would be helpful to
   the user who is waiting for the program to finish its work.

Search
======

After the index is written to index.dat, the search.py script will
allow the user to search the corpus for specific words.  Here is how
it should operate:

1. First, read the search phrase at the command line. Examples:

  .. code :: bash

    $ search.py bisons
    $ search.py "landmark college"

If no command line argument is supplied, the program should tell the
user a search term is required and terminate.  Ignore any command-line
arguments after the first.

2. Next, the program should read the index from index.dat into memory.
   Note that you may want to use similar data structures used in
   indexer.py, so you should write your programs in a way where you
   share code without having redundant code in each script. (It’s OK
   to introduce new .py files to your project.)

3. For simplicity, all queries will be assumed to use boolean ANDs,
   and we will not implement phrase search. For example, the query
   landmark college should generate a boolean search for landmark AND
   college, so only documents containing both terms should be
   considered amatch.

4. Remove any stop words from the query as was done when indexing the
   documents.

5. After determining which documents match the search terms, calculate
   the relevancy score for each document: relevancy score = 0.9 * body
   TF + 0.1 * title score Do this for each term, and compute the
   average relevancy score for all terms. So if the search was for
   landmark college, you would compute the score for landmark and the
   score for college and compute the average to determine the overall
   relevancy score.

6. The total number of results should first be displayed. Then display
   every document ID and score (out to 3 decimal places) ordered by
   score, and number the results. Example:

.. code:: bash

  Results:
  1. docID
  2. docID
  3. docID
  4. docID

  4
  3,
  1,
  5,
  8,

  score
  score
  score
  score

  0.830
  0.814
  0.350
  0.108

**Bonus:** You can receive 5 bonus points by implementing phrase search.
So when the user searches for “landmark college”, assume they want
only documents with that exact phrase. To accomplish this, you will
need to store the positions of the terms that are stored in the
inverted index. Then use those positions to ensure the phrase matches
successive positions.  Zip your entire project directory and submit it
to Canvas before it is due. Make sure your output matches the
specifications precisely to avoid losing any points. If you use any
code you find in the Web, you must document the source in your
program.

Test Data
=========

*a.html*

.. code:: html

  <title>cool!!! test!!!</title>
  	<body>
  		this 123-456.
  	</body>

*b.html*

.. code:: html

  <html>
  	<head>
  		<title>Go Patriots!</title>
  	</head>
  	<body>
  		And another test and test!
  	</body>
  </html>

*c.html*

.. code:: html

  <body>
  	This is a test.
  </body>

*Inverted index:*

.. code::

  123
  1 0 0.200
  456
  1 0 0.200
  another
  3 0 0.200
  cool
  1 1 0.200
  cowboys
  3 1 0.200
  go
  3 1 0.200
  test
  1 1 0.200
  2 0 0.500
  3 0 0.400
  this
  1 0 0.200
  2 0 0.500

Search for "test this" results in the following:

::

  Results: 2
  1. docID 2, score 0.450
  2. docID 1, score 0.230

Search for "test patriots go" results in the following:

::

  Results: 1
  1. docID 3, score 0.310

Search for "cool patriots" results in the following:

::

  Results: 0