========================
Project 2: Search Engine
========================

**CS2621 – Web Science**

*100 points*

You are to create a web search engine that works at the command line.
To do this, you will write two Python scripts, indexer.py and
search.py.

Indexer
=======

indexer.py should do the following:

1. After performing a crawl (using your other Python script), read all
   the HTML files that were stored in the “pages” directory. For each
   document, extract the title and the text from the body of the page
   (read the Beautiful Soup documentation to find out how). The text
   Beautiful Soup extracts from the page will include the title text
   along with the body content, and that is OK. Beautiful Soup may
   also break on some pages and include HTML as text, but we will not
   worry about these exceptions or bugs.

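For example, here is a minimal sketch of this extraction step. It
assumes Beautiful Soup 4 is installed and that the crawler left its
output as files in the “pages” directory (the exact file layout is an
assumption):

.. code:: python

   import os
   from bs4 import BeautifulSoup

   for doc_id, name in enumerate(sorted(os.listdir("pages")), start=1):
       with open(os.path.join("pages", name), encoding="utf-8") as f:
           soup = BeautifulSoup(f.read(), "html.parser")
       # Some pages (like c.html below) have no <title> at all
       title = soup.title.get_text() if soup.title else ""
       # get_text() over the whole document also picks up the title
       # text, which the assignment says is OK
       text = soup.get_text(separator=" ")
       print("Indexing %s (doc %d)" % (name, doc_id))  # status output
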
2. All text should be converted to lowercase and non-alphanumeric
   characters should be ignored. So “123-456” would become “123” and
   “456”, and “joe@yahoo.com” would become “joe”, “yahoo”, “com”.
   Ignore the following stop words: a, an, and, are, as, at, be, by,
   for, from, has, he, in, is, it, its, of, on, that, the, to, was,
   were, will, with. Do not perform stemming.

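One way to implement this tokenization, as a sketch using only the
standard library (the function name is an invention, not a
requirement):

.. code:: python

   import re

   STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for",
                 "from", "has", "he", "in", "is", "it", "its", "of", "on",
                 "that", "the", "to", "was", "were", "will", "with"}

   def tokenize(text):
       # Lowercase, split on runs of non-alphanumeric characters,
       # then drop empty strings and stop words
       tokens = re.split(r"[^a-z0-9]+", text.lower())
       return [t for t in tokens if t and t not in STOP_WORDS]
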
3. A single inverted index should be created for the document corpus
   which maintains the document ID (numbered 1…n in order of the pages
   found in the “pages” directory), a 1 or 0 if the term is found in
   the title, and the term frequency from the body (normalized by the
   total number of tokens in the document after removing stop words).

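A plain dictionary is one workable in-memory shape for this index;
here is a sketch that builds on the tokenize function above (the
layout is a design choice, not part of the spec):

.. code:: python

   from collections import Counter

   # term -> list of [doc_id, in_title (0 or 1), normalized body TF]
   index = {}

   def add_document(doc_id, title_tokens, body_tokens):
       counts = Counter(body_tokens)
       total = len(body_tokens)   # token count after stop-word removal
       for term, count in counts.items():
           in_title = 1 if term in title_tokens else 0
           index.setdefault(term, []).append([doc_id, in_title, count / total])
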
4. After indexer.py has finished indexing all the web pages, it should
   output the index to index.dat, which looks like this:

::

   arkansas
      6 0 0.022
   model
      1 0 0.309
      3 0 0.015
      5 1 0.001
   tuesday
      2 0 0.082
   white
      2 1 0.018
   etc…

.. note::
   The indexed words are alphabetized, and there are 3 spaces before
   sets of three numbers (each separated by a single space) which are:
   doc ID, title (0 or 1), and normalized body TF (rounded to 3 decimal
   places). For example, the term white was found only in document 2;
   it was somewhere in the title and made up 1.8% of all the words in
   the document.

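Writing the finished index in that format is then a short loop; a
sketch, assuming the dictionary layout from the earlier sketch:

.. code:: python

   with open("index.dat", "w") as out:
       for term in sorted(index):               # alphabetical order
           out.write(term + "\n")
           for doc_id, in_title, tf in index[term]:
               # 3 leading spaces; fields separated by single spaces
               out.write("   %d %d %.3f\n" % (doc_id, in_title, tf))
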
5. It may take some time for your program to run, so you should output
   information about the program’s status as it indexes the crawled
   pages. Outputting what file is being worked on would be helpful to
   the user who is waiting for the program to finish its work.

Search
======

After the index is written to index.dat, the search.py script will
allow the user to search the corpus for specific words. Here is how
it should operate:

1. First, read the search phrase at the command line. Examples:

   .. code:: bash

      $ search.py bisons
      $ search.py "landmark college"

If no command-line argument is supplied, the program should tell the
user a search term is required and terminate, as sketched below.
Ignore any command-line arguments after the first.

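A sketch of that argument check (the exact error message is up to
you):

.. code:: python

   import sys

   if len(sys.argv) < 2:
       print("Error: a search term is required, e.g. search.py bisons")
       sys.exit(1)
   query = sys.argv[1]        # any further arguments are ignored
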
2. Next, the program should read the index from index.dat into memory.
   Note that you may want to use the same data structures used in
   indexer.py, so structure your project so that the two scripts share
   code rather than duplicating it. (It’s OK to introduce new .py
   files to your project.)

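Reading index.dat back in is the mirror image of writing it, which
makes the parsing a good candidate for a shared module; a sketch:

.. code:: python

   def load_index(path="index.dat"):
       index = {}
       term = None
       with open(path) as f:
           for line in f:
               if line.startswith("   "):        # a posting line
                   doc_id, in_title, tf = line.split()
                   # doc IDs are numeric (1…n) per the spec
                   index[term].append((int(doc_id), int(in_title), float(tf)))
               else:                             # a term line
                   term = line.strip()
                   index[term] = []
       return index
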
3. For simplicity, all queries will be assumed to use boolean ANDs,
   and we will not implement phrase search. For example, the query
   landmark college should generate a boolean search for landmark AND
   college, so only documents containing both terms should be
   considered a match.

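Boolean AND amounts to intersecting, per term, the sets of document
IDs that carry that term; a sketch (the helper name is an invention):

.. code:: python

   def matching_docs(query_terms, index):
       doc_sets = []
       for term in query_terms:
           if term not in index:
               return set()       # one missing term means no matches
           doc_sets.append({doc_id for doc_id, _, _ in index[term]})
       return set.intersection(*doc_sets) if doc_sets else set()
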
4. Remove any stop words from the query as was done when indexing the
   documents.

5. After determining which documents match the search terms, calculate
   a relevancy score for each matching document and term:

   ::

      relevancy score = 0.9 * (body TF) + 0.1 * (title score)

   Do this for each term, then compute the average relevancy score
   over all terms. So if the search was for landmark college, you
   would compute the score for landmark and the score for college and
   average the two to get the overall relevancy score.

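In code, the weighting might look like the following sketch (it
assumes the index layout and the matching_docs helper sketched
earlier):

.. code:: python

   def relevancy(doc_id, query_terms, index):
       total = 0.0
       for term in query_terms:
           for d, in_title, tf in index[term]:
               if d == doc_id:
                   total += 0.9 * tf + 0.1 * in_title
       return total / len(query_terms)    # average over all terms
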
6. The total number of results should first be displayed. Then display
   every document ID and score (out to 3 decimal places) ordered by
   descending score, and number the results. Example:

.. code:: bash

   Results: 4
   1. docID 3, score 0.830
   2. docID 1, score 0.814
   3. docID 5, score 0.350
   4. docID 8, score 0.108

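Sorting and printing the results could then look like this sketch,
reusing the helpers above:

.. code:: python

   matches = matching_docs(query_terms, index)
   ranked = sorted(((relevancy(d, query_terms, index), d) for d in matches),
                   reverse=True)
   print("Results: %d" % len(ranked))
   for rank, (s, doc_id) in enumerate(ranked, start=1):
       print("%d. docID %s, score %.3f" % (rank, doc_id, s))
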
**Bonus:** You can receive 5 bonus points by implementing phrase search.
So when the user searches for “landmark college”, assume they want
only documents containing that exact phrase. To accomplish this, you
will need to store the positions of each term in the inverted index,
then use those positions to verify that the query terms appear in
successive positions.

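One way to check for successive positions, assuming each posting is
extended with a list of the term's token positions (that extension is
yours to design; this sketch just assumes it exists):

.. code:: python

   def phrase_match(positions_per_term):
       # positions_per_term[i] holds the token positions of query term i
       # within one document; the phrase occurs if some position p of
       # the first term is followed by p+1, p+2, ... for the rest
       first, rest = positions_per_term[0], positions_per_term[1:]
       return any(all(p + i + 1 in plist for i, plist in enumerate(rest))
                  for p in first)
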

Zip your entire project directory and submit it
to Canvas before it is due. Make sure your output matches the
specifications precisely to avoid losing any points. If you use any
code you find on the Web, you must document the source in your
program.

Test Data
=========

*a.html*

.. code:: html

   <title>cool!!! test!!!</title>
   <body>
   this 123-456.
   </body>

*b.html*

.. code:: html

   <html>
   <head>
   <title>Go Patriots!</title>
   </head>
   <body>
   And another test and test!
   </body>
   </html>

*c.html*

.. code:: html

   <body>
   This is a test.
   </body>

*Inverted index:*

.. code::

   123
      a 0 0.200
   456
      a 0 0.200
   another
      b 0 0.200
   cool
      a 1 0.200
   go
      b 1 0.200
   patriots
      b 1 0.200
   test
      a 1 0.200
      b 0 0.400
      c 0 0.500
   this
      a 0 0.200
      c 0 0.500


Search for "test this" results in the following:

::

   Results: 2
   1. docID c, score 0.450
   2. docID a, score 0.230

Search for "test patriots go" results in the following:

::

   Results: 1
   1. docID b, score 0.307

Search for "cool patriots" results in the following:

::

   Results: 0