========================
Project 2: Search Engine
========================

**CS2621 – Web Science**

**100 points**

You are to create a web search engine that works at the command line. To do this, you will write two Python scripts,
indexer.py and search.py. The indexer.py script should do the following:

1. After performing a crawl (using your other Python script), read all the HTML files that were stored in the “pages” directory. For each document, extract the title and the text from the body of the page (read the Beautiful Soup documentation to find out how). Beautiful Soup will include the title text in the page's text content, and that is OK. Beautiful Soup may also break on some pages and include HTML as text, but we will not worry about these exceptions or bugs.

2. All text should be converted to lowercase and non-alphanumeric characters should be ignored. So “123-456” would become “123” and “456”, and “joe@yahoo.com” would become “joe”, “yahoo”, “com”. Ignore the following stop words: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with. Do not perform stemming.
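
The tokenization rule above can be sketched as follows (a minimal sketch; the function name and structure are illustrative, not required):

```python
import re

# Stop words listed in the assignment.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
    "has", "he", "in", "is", "it", "its", "of", "on", "that", "the",
    "to", "was", "were", "will", "with",
}

def tokenize(text):
    """Lowercase, split on runs of non-alphanumeric characters, drop stop words."""
    tokens = re.split(r"[^a-z0-9]+", text.lower())
    return [t for t in tokens if t and t not in STOP_WORDS]
```

For example, ``tokenize("joe@yahoo.com")`` returns ``["joe", "yahoo", "com"]``.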

3. A single inverted index should be created for the document corpus which maintains the document ID (numbered 1…n in order of the pages found in the “pages” directory), a 1 or 0 if the text is found in the title, and the term frequency from the body (normalized by the total number of tokens in the document after removing stop words).

4. After indexer.py has finished indexing all the web pages, it should output the index to index.dat which looks like this:

::

  arkansas
    6 0 0.022
  model
    1 0 0.309
    3 0 0.015
    5 1 0.001
  tuesday
    2 0 0.082
  white
    2 1 0.018
  etc…

.. note::
  The indexed words are alphabetized, and there are 3 spaces before sets of three numbers (each separated
  by a single space) which are: doc ID, title (0 or 1), and normalized body TF (rounded to 3 decimal places). For
  example, the term white was found only in document 2; it was somewhere in the title and made up 1.8% of all the
  words in the document.
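
A sketch of writing an index in this format, assuming the index is held in memory as a dict mapping each term to a list of (doc ID, title flag, normalized TF) postings (the function and variable names here are illustrative, not required):

```python
def write_index(index, path="index.dat"):
    """Write the inverted index in the index.dat format described above.

    index maps term -> list of (doc_id, in_title, tf) postings.
    """
    with open(path, "w") as f:
        for term in sorted(index):  # indexed words are alphabetized
            f.write(term + "\n")
            for doc_id, in_title, tf in index[term]:
                # 3 spaces, then doc ID, title flag (0/1), TF to 3 decimals
                f.write("   %d %d %.3f\n" % (doc_id, in_title, tf))
```

Note that sorting term strings also places numeric tokens such as 123 before alphabetic ones, matching the example above.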

5. It may take some time for your program to run, so you should output information about the program’s status as it
indexes the crawled pages. Outputting what file is being worked on would be helpful to the user who is waiting for
the program to finish its work.

After the index is written to index.dat, the search.py script will allow the user to search the corpus for specific words.
Here is how it should operate:

1. First, read the search phrase at the command line. Examples:

  .. code:: bash

    $ search.py bisons
    $ search.py "landmark college"

If no command line argument is supplied, the program should tell the user a search term is required and terminate.
Ignore any command-line arguments after the first.
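
The argument handling above might look like this (a sketch; the exact usage message is up to you):

```python
import sys

def read_query():
    """Return the search phrase, or exit if none was supplied."""
    if len(sys.argv) < 2:
        print('A search term is required, e.g.: search.py "landmark college"')
        sys.exit(1)
    return sys.argv[1]  # any arguments after the first are ignored
```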

2. Next, the program should read the index from index.dat into memory. Note that you may want to use data
structures similar to those in indexer.py, so structure your programs to share code rather than duplicating it
in each script. (It’s OK to introduce new .py files to your project.)

3. For simplicity, all queries will be assumed to use boolean ANDs, and we will not implement phrase search. For
example, the query landmark college should generate a boolean search for landmark AND college, so only
documents containing both terms should be considered a match.

4. Remove any stop words from the query as was done when indexing the documents.

5. After determining which documents match the search terms, calculate the relevancy score of each term in each matching document:

::

  relevancy score = 0.9 * body TF + 0.1 * title score

Then compute the average relevancy score over all query terms; that average is the document's overall relevancy score. So if the search was for landmark college, you would compute the score for landmark and the score for college and average the two.
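
Steps 3–5 can be sketched as follows, assuming the in-memory index maps each term to a dict of doc ID → (title flag, body TF); that layout is an assumption of this sketch, not a requirement:

```python
def search(index, terms):
    """Boolean-AND the query terms and score each matching document."""
    postings = [index.get(t, {}) for t in terms]
    # Only documents containing every term match (boolean AND).
    matching = set(postings[0]).intersection(*postings[1:]) if postings else set()
    results = []
    for doc in matching:
        # Per-term relevancy: 0.9 * body TF + 0.1 * title score, averaged.
        score = sum(0.9 * index[t][doc][1] + 0.1 * index[t][doc][0]
                    for t in terms) / len(terms)
        results.append((doc, score))
    return sorted(results, key=lambda r: -r[1])  # highest score first
```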

6. The total number of results should first be displayed. Then display every document ID and score (out to 3 decimal
places) ordered by score, and number the results. Example:

::

  Results: 4
  1. docID 3, score 0.830
  2. docID 1, score 0.814
  3. docID 5, score 0.350
  4. docID 8, score 0.108
Bonus: You can receive 5 bonus points by implementing phrase search. So when the user searches for “landmark
college”, assume they want only documents with that exact phrase. To accomplish this, you will need to store the
positions of the terms that are stored in the inverted index. Then use those positions to ensure the phrase matches
successive positions.
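
One way to sketch the position check, assuming each query term's postings are extended to carry a sorted list of that term's positions in the document (a hypothetical extension of the index described above):

```python
def phrase_match(positions):
    """positions[i] holds the positions of query term i in one document.

    Return True if the terms occur at consecutive positions, i.e. the
    document contains the exact phrase.
    """
    return any(all(p + i in positions[i] for i in range(1, len(positions)))
               for p in positions[0])
```

For example, if "landmark" occurs at positions [2, 9] and "college" at [10], the phrase matches because 9 and 10 are consecutive.
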

Zip your entire project directory and submit it to Canvas before it is due. Make sure your output matches the specifications
precisely to avoid losing any points. If you use any code you find on the Web, you must document the source in your
program.

Test Data
=========

a.html

.. code:: html

  <title>cool!!! test!!!</title>
  	<body>
  		this 123-456.
  	</body>

b.html

.. code:: html

  <html>
  	<head>
  		<title>Go Patriots!</title>
  	</head>
  	<body>
  		And another test and test!
  	</body>
  </html>

c.html

.. code:: html

  <body>
  	This is a test.
  </body>

Inverted index:

.. code::

  123
     1 0 0.200
  456
     1 0 0.200
  another
     3 0 0.200
  cool
     1 1 0.200
  go
     3 1 0.200
  patriots
     3 1 0.200
  test
     1 1 0.200
     2 0 0.500
     3 0 0.400
  this
     1 0 0.200
     2 0 0.500

Search for "test this" results in the following:

::

  Results: 2
  1. docID 2, score 0.450
  2. docID 1, score 0.230

Search for "test patriots go" results in the following:

::

  Results: 1
  1. docID 3, score 0.307

Search for "cool patriots" results in the following:

::

  Results: 0