========================
Project 2: Search Engine
========================

**CS2621 – Web Science**

*100 points*

You are to create a web search engine that works at the command line.
To do this, you will write two Python scripts, indexer.py and
search.py.

Indexer
=======

indexer.py should do the following (a rough sketch of one possible
approach appears at the end of this section):

1. After performing a crawl (using your other Python script), read all
   the HTML files that were stored in the “pages” directory. For each
   document, extract the title and the text from the body of the page
   (read the Beautiful Soup documentation to find out how). Beautiful
   Soup will include the title text in the text it extracts from the
   page, and that is OK. Beautiful Soup may also break on some pages
   and include HTML as text, but we will not worry about these
   exceptions or bugs.

2. All text should be converted to lowercase, and non-alphanumeric
   characters should be ignored. So “123-456” would become “123” and
   “456”, and “joe@yahoo.com” would become “joe”, “yahoo”, “com”.
   Ignore the following stop words: a, an, and, are, as, at, be, by,
   for, from, has, he, in, is, it, its, of, on, that, the, to, was,
   were, will, with. Do not perform stemming.

3. A single inverted index should be created for the document corpus.
   For each term it maintains the document ID (numbered 1…n in the
   order the pages appear in the “pages” directory), a 1 or 0
   indicating whether the term appears in the title, and the term
   frequency from the body (normalized by the total number of tokens
   in the document after removing stop words).

4. After indexer.py has finished indexing all the web pages, it should
   output the index to index.dat, which looks like this:

::

  arkansas
     6 0 0.022
  model
     1 0 0.309
     3 0 0.015
     5 1 0.001
  tuesday
     2 0 0.082
  white
     2 1 0.018
  etc…

.. note::
   The indexed words are alphabetized, and there are 3 spaces before
   sets of three numbers (each separated by a single space) which are:
   doc ID, title (0 or 1), and normalized body TF (rounded to 3 decimal
   places). For example, the term white was found only in document 2;
   it was somewhere in the title and made up 1.8% of all the words in
   the document.

5. It may take some time for your program to run, so you should output
   information about the program’s status as it indexes the crawled
   pages. Outputting which file is currently being processed would be
   helpful to the user who is waiting for the program to finish its
   work.

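How you structure the code is up to you; the following is only a
minimal sketch of one possible approach to the indexing steps above.
The function names, the use of a sorted directory listing to assign
document IDs, and the plain-dict index structure are illustrative
assumptions, not requirements.

.. code:: python

   # indexer sketch -- illustration only; adapt it to your own crawler's output
   import os
   import re
   from bs4 import BeautifulSoup

   STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for",
                 "from", "has", "he", "in", "is", "it", "its", "of", "on",
                 "that", "the", "to", "was", "were", "will", "with"}

   def tokenize(text):
       """Lowercase, split on non-alphanumeric characters, drop stop words."""
       tokens = re.split(r"[^a-z0-9]+", text.lower())
       return [t for t in tokens if t and t not in STOP_WORDS]

   def parse_page(path):
       """Return (title text, page text) extracted with Beautiful Soup."""
       with open(path, encoding="utf-8", errors="ignore") as f:
           soup = BeautifulSoup(f, "html.parser")
       title = soup.title.get_text() if soup.title else ""
       # get_text() on the whole document includes the title text, which is OK.
       return title, soup.get_text()

   def build_index(pages_dir="pages"):
       """index[term][doc_id] = (in_title, normalized body TF)"""
       index = {}
       for doc_id, name in enumerate(sorted(os.listdir(pages_dir)), start=1):
           print("Indexing", name)          # progress output (step 5)
           title, text = parse_page(os.path.join(pages_dir, name))
           title_tokens = set(tokenize(title))
           tokens = tokenize(text)
           for term in set(tokens):
               tf = tokens.count(term) / len(tokens)
               in_title = 1 if term in title_tokens else 0
               index.setdefault(term, {})[doc_id] = (in_title, tf)
       return index

   def write_index(index, out_file="index.dat"):
       """Terms alphabetized; postings prefixed with 3 spaces (step 4)."""
       with open(out_file, "w") as out:
           for term in sorted(index):
               out.write(term + "\n")
               for doc_id, (in_title, tf) in sorted(index[term].items()):
                   out.write("   %d %d %.3f\n" % (doc_id, in_title, tf))

   if __name__ == "__main__":
       write_index(build_index())
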
Search
======

After the index is written to index.dat, the search.py script will
allow the user to search the corpus for specific words. Here is how it
should operate (a rough sketch of one possible approach appears after
this list):

1. First, read the search phrase at the command line. Examples:

   .. code:: bash

      $ search.py bisons
      $ search.py "landmark college"

   If no command line argument is supplied, the program should tell
   the user a search term is required and terminate. Ignore any
   command-line arguments after the first.

2. Next, the program should read the index from index.dat into memory.
   Note that you may want to use data structures similar to those in
   indexer.py, so write your programs in a way that shares code rather
   than duplicating it in each script. (It’s OK to introduce new .py
   files to your project.)

3. For simplicity, all queries will be assumed to use boolean ANDs,
   and we will not implement phrase search. For example, the query
   landmark college should generate a boolean search for landmark AND
   college, so only documents containing both terms should be
   considered a match.

4. Remove any stop words from the query as was done when indexing the
   documents.

5. After determining which documents match the search terms, calculate
   the relevancy score of each matching document for each query term:
   relevancy score = 0.9 * body TF + 0.1 * title score. Then compute
   the average of the per-term scores. So if the search was for
   landmark college, you would compute the score for landmark and the
   score for college and average the two to determine the overall
   relevancy score.

6. The total number of results should be displayed first. Then display
   every document ID and score (to 3 decimal places) ordered by score,
   and number the results. Example:

   .. code:: bash

      Results: 4
      1. docID 3, score 0.830
      2. docID 1, score 0.814
      3. docID 5, score 0.350
      4. docID 8, score 0.108

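As with the indexer, the following is only a minimal sketch of one way
search.py might tie these steps together; it assumes the index.dat
format shown above and repeats the stop word list instead of importing
shared code, purely to keep the example self-contained.

.. code:: python

   # search sketch -- illustration only
   import sys

   STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for",
                 "from", "has", "he", "in", "is", "it", "its", "of", "on",
                 "that", "the", "to", "was", "were", "will", "with"}

   def load_index(path="index.dat"):
       """index[term][doc_id] = (in_title, body TF), read back from index.dat."""
       index, term = {}, None
       with open(path) as f:
           for line in f:
               if line.startswith("   "):      # posting: doc ID, title flag, TF
                   doc_id, in_title, tf = line.split()
                   index[term][doc_id] = (int(in_title), float(tf))
               else:                           # term line
                   term = line.strip()
                   index[term] = {}
       return index

   def search(query, index):
       terms = [t for t in query.lower().split() if t not in STOP_WORDS]
       if not terms:
           return []
       postings = [index.get(t, {}) for t in terms]
       # Boolean AND: keep only documents containing every query term (step 3).
       docs = set(postings[0]).intersection(*map(set, postings[1:]))
       results = []
       for doc in docs:
           # relevancy = 0.9 * body TF + 0.1 * title score, averaged over terms (step 5)
           score = sum(0.9 * p[doc][1] + 0.1 * p[doc][0] for p in postings) / len(terms)
           results.append((doc, score))
       return sorted(results, key=lambda r: r[1], reverse=True)

   if __name__ == "__main__":
       if len(sys.argv) < 2:
           print("A search term is required, e.g.: search.py \"landmark college\"")
           sys.exit(1)
       results = search(sys.argv[1], load_index())
       print("Results:", len(results))
       for rank, (doc, score) in enumerate(results, start=1):
           print("%d. docID %s, score %.3f" % (rank, doc, score))
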
**Bonus:** You can receive 5 bonus points by implementing phrase search.
So when the user searches for “landmark college”, assume they want only
documents containing that exact phrase. To accomplish this, you will
need to store the positions of each term occurrence in the inverted
index, and then use those positions to ensure the query terms appear in
successive positions.
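
A minimal sketch of the adjacency check, assuming you extend each
posting to also carry a list of the term's token positions in that
document (this positional layout is an assumption, not part of the
required index.dat format):

.. code:: python

   def phrase_match(positions_per_term):
       """positions_per_term: one list of token positions per query term,
       in query order, all for the same document. Returns True if some
       position p of the first term is followed by p+1 for the second
       term, p+2 for the third, and so on."""
       first, rest = positions_per_term[0], positions_per_term[1:]
       for p in first:
           if all(p + offset in positions
                  for offset, positions in enumerate(rest, start=1)):
               return True
       return False

   # e.g. "landmark college": landmark at [4, 17], college at [5, 40] -> True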


Zip your entire project directory and submit it to Canvas before it is
due. Make sure your output matches the specifications precisely to
avoid losing any points. If you use any code you find on the Web, you
must document the source in your program.

Test Data
=========

*a.html*

.. code:: html

   <title>cool!!! test!!!</title>
   <body>
   this 123-456.
   </body>

*b.html*

.. code:: html

   <html>
   <head>
   <title>Go Patriots!</title>
   </head>
   <body>
   And another test and test!
   </body>
   </html>

*c.html*

.. code:: html

   <body>
   This is a test.
   </body>

*Inverted index:*

.. code::

   123
      a 0 0.200
   456
      a 0 0.200
   another
      b 0 0.200
   cool
      a 1 0.200
   go
      b 1 0.200
   patriots
      b 1 0.200
   test
      a 1 0.200
      b 0 0.400
      c 0 0.500
   this
      a 0 0.200
      c 0 0.500

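As a check on the normalization: document b contains five tokens after
stop-word removal (go, patriots, another, test, test, with the title
text included), so the normalized TF of test is 2/5 = 0.400 and each of
the other terms is 1/5 = 0.200.
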
Search for "test this" results in the following:

::

   Results: 2
   1. docID c, score 0.450
   2. docID a, score 0.230

Search for "test patriots go" results in the following:

::

   Results: 1
   1. docID b, score 0.307
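
As a check, applying the formula from step 5 of the Search section to
document b:

::

   test:     0.9 * 0.400 + 0.1 * 0 = 0.360
   patriots: 0.9 * 0.200 + 0.1 * 1 = 0.280
   go:       0.9 * 0.200 + 0.1 * 1 = 0.280
   average:  (0.360 + 0.280 + 0.280) / 3 = 0.307 (to 3 decimal places)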

Search for "cool patriots" results in the following:

::

   Results: 0