diff options
author | Tucker Evans <tuckerevans24@gmail.com> | 2019-02-18 07:35:54 -0500 |
---|---|---|
committer | Tucker Evans <tuckerevans24@gmail.com> | 2019-02-18 07:35:54 -0500 |
commit | e8b1808eaf87a49e4c34ebbfb66854baa627418c (patch) | |
tree | 8a4bb15321992702b6b26e34bd2ed3a55bb7b0d9 /search | |
parent | 6cc5652a8af3361288393718ec2adb2889c9af1e (diff) |
Moves assignments to given course folder.
Diffstat (limited to 'search')
-rw-r--r-- | search/.gitignore | 6 | ||||
-rw-r--r-- | search/README.rst | 19 | ||||
-rw-r--r-- | search/assign.rst | 213 | ||||
-rw-r--r-- | search/index/index.go | 165 | ||||
-rw-r--r-- | search/indexer.go | 402 | ||||
-rw-r--r-- | search/search.go | 144 |
6 files changed, 0 insertions, 949 deletions
diff --git a/search/.gitignore b/search/.gitignore deleted file mode 100644 index 7523492..0000000 --- a/search/.gitignore +++ /dev/null @@ -1,6 +0,0 @@ -*test* -pages -index.dat -indexer -search -*.swp diff --git a/search/README.rst b/search/README.rst deleted file mode 100644 index e1d14fb..0000000 --- a/search/README.rst +++ /dev/null @@ -1,19 +0,0 @@ -============= -Search Engine -============= - -Setup -===== -In order for search.go to use the index package the directory "index" -must by copied (or linked) into a directory "search" that is in your -GOPATH. - -About -===== -Search Engine for web science class. - -See assign.rst for assignment details. - -Authors -======= -- Tucker Evans diff --git a/search/assign.rst b/search/assign.rst deleted file mode 100644 index 66e537e..0000000 --- a/search/assign.rst +++ /dev/null @@ -1,213 +0,0 @@ -======================== -Project 2: Search Engine -======================== - -**CS2621– Web Science** - -*100 points* - -You are to create a web search engine that works at the command line. -To do this, you will write two Python scripts, indexer.py and -search.py. - -Indexer -======= - -Indexer.py should do the following: - -1. After performing a crawl (using your other Python script), read all - the HTML files that were stored in the “pages” directory. For each - document, extract the title and the text from the body of the page - (read the Beautiful Soup documentation to find out how). Beautiful - Soup will include the text of the page in the content of the page, - and that is OK. Beautiful Soup may also break on some pages and - include HTML as text, but we will not worry about these - exceptions or bugs. - -2. All text should be converted to lowercase and non-alphanumeric - characters should be ignored. So “123-456” would become “123” and - “456”, and “joe@yahoo.com” would become “joe”, “yahoo”, “com”. - Ignore the following stop words: a, an, and, are, as, at, be, by, - for, from, has, he, in, is, it, its, of, on, that, the, to, was, - were, will, with. Do not perform stemming. - -3. A single inverted index should be created for the document corpus - which maintains the document ID (numbered 1…n in order of the pages - found in the “pages” directory), a 1 or 0 if the text is found in - the title, and the term frequency from the body (normalized by the - total number of tokens in the document after removing stop words). - -4. After indexer.py has finished indexing all the web pages, it should - output the index to index.dat which looks likethis: - -:: - - arkansas - 6 0 0.022 - model - 1 0 0.309 - 3 0 0.015 - 5 1 0.001 - tuesday - 2 0 0.082 - white - 2 1 0.018 - etc… - -.. note :: - The indexed words are alphabetized, and there are 3 spaces before - sets of three numbers (each separated by a single space) which are: - doc ID, title (0 or 1), and normalized body TF (rounded to 3 decimal - places). For example, the term white was found only in document 2; - it was somewhere in the title and made up 1.8% of all the words in - the document. - -5. It may take some time for your program to run, so you should output - information about the program’s status as it indexes the crawled - pages. Outputting what file is being worked on would be helpful to - the user who is waiting for the program to finish its work. - -Search -====== - -After the index is written to index.dat, the search.py script will -allow the user to search the corpus for specific words. Here is how -it should operate: - -1. First, read the search phrase at the command line. Examples: - - .. code :: bash - - $ search.py bisons - $ search.py "landmark college" - -If no command line argument is supplied, the program should tell the -user a search term is required and terminate. Ignore any command-line -arguments after the first. - -2. Next, the program should read the index from index.dat into memory. - Note that you may want to use similar data structures used in - indexer.py, so you should write your programs in a way where you - share code without having redundant code in each script. (It’s OK - to introduce new .py files to your project.) - -3. For simplicity, all queries will be assumed to use boolean ANDs, - and we will not implement phrase search. For example, the query - landmark college should generate a boolean search for landmark AND - college, so only documents containing both terms should be - considered amatch. - -4. Remove any stop words from the query as was done when indexing the - documents. - -5. After determining which documents match the search terms, calculate - the relevancy score for each document: relevancy score = 0.9 * body - TF + 0.1 * title score Do this for each term, and compute the - average relevancy score for all terms. So if the search was for - landmark college, you would compute the score for landmark and the - score for college and compute the average to determine the overall - relevancy score. - -6. The total number of results should first be displayed. Then display - every document ID and score (out to 3 decimal places) ordered by - score, and number the results. Example: - -.. code:: bash - - Results: 4 - 1. docID, 3, score, 0.830 - 2. docID, 1, score, 0.814 - 3. docID, 5, score, 0.350 - 4. docID, 8, score, 0.108 - -**Bonus:** You can receive 5 bonus points by implementing phrase search. -So when the user searches for “landmark college”, assume they want -only documents with that exact phrase. To accomplish this, you will -need to store the positions of the terms that are stored in the -inverted index. Then use those positions to ensure the phrase matches -successive positions. - - -Zip your entire project directory and submit it -to Canvas before it is due. Make sure your output matches the -specifications precisely to avoid losing any points. If you use any -code you find in the Web, you must document the source in your -program. - -Test Data -========= - -*a.html* - -.. code:: html - - <title>cool!!! test!!!</title> - <body> - this 123-456. - </body> - -*b.html* - -.. code:: html - - <html> - <head> - <title>Go Patriots!</title> - </head> - <body> - And another test and test! - </body> - </html> - -*c.html* - -.. code:: html - - <body> - This is a test. - </body> - -*Inverted index:* - -.. code:: - - 123 - a 0 0.200 - 456 - a 0 0.200 - another - b 0 0.200 - cool - a 1 0.200 - patriots - b 1 0.200 - go - b 1 0.200 - test - a 1 0.200 - c 0 0.500 - b 0 0.400 - this - a 0 0.200 - c 0 0.500 - -Search for "test this" results in the following: - -:: - - Results: 2 - 1. docID c, score 0.450 - 2. docID a, score 0.230 - -Search for "test patriots go" results in the following: - -:: - - Results: 1 - 1. docID b, score 0.310 - -Search for "cool patriots" results in the following: - -:: - - Results: 0 diff --git a/search/index/index.go b/search/index/index.go deleted file mode 100644 index 5d8ab65..0000000 --- a/search/index/index.go +++ /dev/null @@ -1,165 +0,0 @@ -package index - -import "fmt" -import "os" -import "io" -import "bufio" -import "sort" -import "errors" -import "strings" -import "strconv" - -/* TODO - - - Implement Forward Creation - - Implement Inverted from Forward - - Switch Indexer.go over to this package - -/********* - * Types * - *********/ - -type F_info struct { - Word string; - In_title bool; - Freq float64; -}; - -type I_info struct { - Doc string; - In_title bool; - Freq float64; -}; - -type F_entry struct{ - This *F_info; - Next *F_entry; -}; - -type I_entry struct{ - This *I_info; - Next *I_entry; -}; - -type F_index map[string]*F_entry; -type I_index map[string]*I_entry; - -type sortInverted struct{ - w string; - root *I_entry; -}; - - -/*************************** - * Forward Index Funcitons * - ***************************/ - -func NewForwardEntryStrings(text, title []string) (*F_entry, error) { - return nil, errors.New("not implemented"); -} - -/**************************** - * Inverted Index Functions * - ****************************/ - -func new_I_info() *I_info{ - return &I_info{"", false, 0.0}; -} - -func NewInvertedIndexFromFile(fname string) (I_index, error) { - var fd *os.File; - var br *bufio.Reader; - var err error; - var buf []byte; - var tmp *I_info; - var cur *I_entry; - var index I_index; - var word string - var info []string; - - fd, err = os.Open(fname); - if err != nil { - return nil, err; - } - - br = bufio.NewReader(fd); - if br == nil { - return nil, errors.New("Could not initialize reader"); - } - - index = make(I_index); - - for buf, err = br.ReadBytes('\n'); err != io.EOF; buf, err = br.ReadBytes('\n'){ - tmp = new_I_info(); - if err != nil { - return nil, err; - } - if buf[0] != '\t' { - word = strings.TrimSpace(string(buf)); - } else { - info = strings.Fields(string(buf)); - tmp.Doc = info[0]; - tmp.In_title = (info[1] == "1"); - tmp.Freq, _ = strconv.ParseFloat(info[2], 32); - if (index[word] == nil) { - index[word] = &I_entry{This: tmp, Next: nil}; - } else { - cur = index[word]; - for cur.Next != nil { - cur = cur.Next; - } - cur.Next = &I_entry{This: tmp, Next: nil}; - } - } - } - - return index, nil; -} - -func NewInvertedFromForward(f F_index) (I_index, error) { - return nil, errors.New("not implemented"); - -} - -func (x I_index) PrintToFile(fd *os.File) error{ - var i int; - var cur *I_entry; - var index []sortInverted; - - index = x.sortIndex(); - - for i = 0; i < len(index); i++ { - fmt.Fprintf(fd, "%s\n", index[i].w); - for cur = index[i].root; cur != nil; cur = cur.Next { - fmt.Fprintf(fd, "\t%s %d %.3f\n", cur.This.Doc, toInt(cur.This.In_title), cur.This.Freq); - } - } - return nil; -} - -func toInt(t bool) int{ - if (t){ - return 1; - } - return 0; -} - -func (unsort I_index) sortIndex() []sortInverted { - var i int; - var sorted []sortInverted; - - sorted = make([]sortInverted, len(unsort)); - - i = 0; - for k, v := range unsort { - sorted[i].w = k; - sorted[i].root = v; - i++; - } - - sort.Slice(sorted, func(i, j int) bool { - return sorted[i].w < sorted[j].w; - }); - - return sorted -} diff --git a/search/indexer.go b/search/indexer.go deleted file mode 100644 index d95f126..0000000 --- a/search/indexer.go +++ /dev/null @@ -1,402 +0,0 @@ -package main - -import "os" -import "sort" -import "golang.org/x/net/html" -import "log" -import "fmt" -import "github.com/PuerkitoBio/goquery" -import "github.com/kennygrant/sanitize" -import "strings" -import "flag" -import "errors" -import "regexp" - -type document struct { - fname string; - title []string; - text []string; - length int; -} - -type index struct { - doc *document; - title bool; - freq int; -} - -type wordSort struct { - w string; - root *wordList; -} - -type wordList struct { - this *index - next *wordList -} - -var r, nonAN *regexp.Regexp; -var stopWords []*regexp.Regexp; - - -func newDocument() *document { - return &document{"" , nil, nil, 0}; -} - -func RemoveNode(r, rn *html.Node) { - var found bool; - var n, item *html.Node; - var nodes map[int]*html.Node; - var i, j int; - - found = false; - nodes = make(map[int]*html.Node); - - for n = r.FirstChild; n != nil; n = n.NextSibling { - if n == rn { - found = true; - n.Parent.RemoveChild(n); - } - - nodes[i] = n; - i++; - } - - if !found { - for j = 0; j < i; j++ { - item = nodes[j]; - RemoveNode(item, rn); - } - } -} -func RemoveTag(doc *goquery.Selection, tag string) { - doc.Find(tag).Each(func(i int, s *goquery.Selection) { - RemoveNode(doc.Get(0), s.Get(0)); - }); -} - -func logReg(h []byte) []byte { - log.Printf("RegExp: %s", h); - return h; -} - -func parseDoc(fd *os.File, f_info os.FileInfo) (*document, error) { - var err error; - var text, t_text string; - var doc *goquery.Document; - var body, title *goquery.Selection; - var r_doc *document; - var i int; - - doc, err = goquery.NewDocumentFromReader(fd); - if err != nil { - log.Printf("goquery error: %s\n", err); - return nil, errors.New("Can't create goquery documnt"); - } - - body = doc.Find("body"); - RemoveTag(body, "script"); - RemoveTag(body, "noscript"); - - title = doc.Find("title"); - - //TODO add error detection - text, err = body.Html(); - t_text, err = title.Html(); - - - text = r.ReplaceAllString(text, "> <"); - t_text = r.ReplaceAllString(t_text, "> <"); - - text = sanitize.HTML(text); - t_text = sanitize.HTML(t_text); - - text = strings.ToLower(text); - t_text = strings.ToLower(t_text); - - text = nonAN.ReplaceAllString(text, " "); - t_text = nonAN.ReplaceAllString(t_text, " "); - - - for i = 0; i < len(stopWords); i++ { - text = stopWords[i].ReplaceAllString(text, " "); - t_text = stopWords[i].ReplaceAllString(t_text, " "); - } - r_doc = newDocument(); - - r_doc.fname = f_info.Name(); - r_doc.text = strings.Fields(text); - r_doc.title = strings.Fields(t_text); - r_doc.length = len(r_doc.text) + len(r_doc.title); - - return r_doc, nil; -} -func boolToInt(t bool) int { - if t { - return 1; - } - return 0; -} - -func printIndex(words []wordSort, fd *os.File) { - var i int; - var cur *wordList; - var fname string; - var t int; - var freq float64; - - for i = 0; i < len(words); i++ { - fmt.Fprintf(fd, "%s\n", words[i].w); - for cur = words[i].root; cur != nil; cur = cur.next { - fname = cur.this.doc.fname; - t = boolToInt(cur.this.title); - freq = float64(cur.this.freq) / float64(cur.this.doc.length); - - fmt.Fprintf(fd,"\t%s %d %.3f\n", fname, t, freq); - } - } -} - -func init() { - var err error; - log.SetOutput(os.Stderr); - r, err = regexp.Compile("><"); - if err != nil { - panic(err); - } - nonAN, err = regexp.Compile("[^a-zA-Z0-9]+"); - if err != nil { - panic(err); - } - //TODO add func to read in stop words from a file; - stopWords = make([]*regexp.Regexp, 26) - if err != nil { - panic(err); - } - stopWords[0], err = regexp.Compile("\\W+and\\W+"); - if err != nil { - panic(err); - } - stopWords[1], err = regexp.Compile("\\W+a\\W+"); - if err != nil { - panic(err); - } - stopWords[2], err = regexp.Compile("\\W+an\\W+"); - if err != nil { - panic(err); - } - stopWords[3], err = regexp.Compile("\\W+and\\W+"); - if err != nil { - panic(err); - } - stopWords[4], err = regexp.Compile("\\W+are\\W+"); - if err != nil { - panic(err); - } - stopWords[5], err = regexp.Compile("\\W+as\\W+"); - if err != nil { - panic(err); - } - stopWords[6], err = regexp.Compile("\\W+at\\W+"); - if err != nil { - panic(err); - } - stopWords[7], err = regexp.Compile("\\W+be\\W+"); - if err != nil { - panic(err); - } - stopWords[8], err = regexp.Compile("\\W+by\\W+"); - if err != nil { - panic(err); - } - stopWords[9], err = regexp.Compile("\\W+for\\W+"); - if err != nil { - panic(err); - } - stopWords[10], err = regexp.Compile("\\W+from\\W+"); - if err != nil { - panic(err); - } - stopWords[11], err = regexp.Compile("\\W+has\\W+"); - if err != nil { - panic(err); - } - stopWords[12], err = regexp.Compile("\\W+he\\W+"); - if err != nil { - panic(err); - } - stopWords[13], err = regexp.Compile("\\W+in\\W+"); - if err != nil { - panic(err); - } - stopWords[14], err = regexp.Compile("\\W+is\\W+"); - if err != nil { - panic(err); - } - stopWords[15], err = regexp.Compile("\\W+it\\W+"); - if err != nil { - panic(err); - } - stopWords[16], err = regexp.Compile("\\W+its\\W+"); - if err != nil { - panic(err); - } - stopWords[17], err = regexp.Compile("\\W+of\\W+"); - if err != nil { - panic(err); - } - stopWords[18], err = regexp.Compile("\\W+on\\W+"); - if err != nil { - panic(err); - } - stopWords[19], err = regexp.Compile("\\W+that\\W+"); - if err != nil { - panic(err); - } - stopWords[20], err = regexp.Compile("\\W+the\\W+"); - if err != nil { - panic(err); - } - stopWords[21], err = regexp.Compile("\\W+to\\W+"); - if err != nil { - panic(err); - } - stopWords[22], err = regexp.Compile("\\W+was\\W+"); - if err != nil { - panic(err); - } - stopWords[23], err = regexp.Compile("\\W+were\\W+"); - if err != nil { - panic(err); - } - stopWords[24], err = regexp.Compile("\\W+will\\W+"); - if err != nil { - panic(err); - } - stopWords[25], err = regexp.Compile("\\W+with\\W+"); - if err != nil { - panic(err); - } -} - -func main() { - // var words map[string]index - var p_dir, w, fname string; - var err error; - var i, j int; - var words map[string]*wordList; - var cur *wordList; - var tmp *index; - var sorted []wordSort; - - var files []os.FileInfo; - var dir, fd *os.File; - var dir_info, fd_info os.FileInfo; - var dir_mode os.FileMode; - - var doc *document; - - flag.StringVar(&p_dir, "d", "./pages", "pages directory"); - - flag.Parse(); - - words = make(map[string]*wordList); - - dir, err = os.Open(p_dir); - if err != nil { - log.Printf("Error accessing \"%s\":\t%s\n", p_dir, err); - os.Exit(1); - } - - dir_info, err = dir.Stat(); - dir_mode = dir_info.Mode(); - - if !dir_mode.IsDir() { - log.Printf("\"%s\" is not a directory\n", p_dir); - os.Exit(1); - } - - files, err = dir.Readdir(0); - if err != nil { - log.Printf("Error reading %s\n", p_dir); - os.Exit(1); - } - - for i = 0; i < len(files); i++ { - fd, err = os.Open(fmt.Sprintf("%s/%s", dir_info.Name(), files[i].Name())); - fd_info, err = fd.Stat(); - if err != nil { - log.Printf("Error getting info\n"); - os.Exit(1); - } - fname = fd_info.Name(); - - if err != nil { - log.Printf("Error reading %s/%s\n", dir_info.Name(), files[i].Name()); - } else { - fmt.Printf("Indexing %s...\n", fname); - doc, err = parseDoc(fd, fd_info); - if err != nil { - log.Printf("Error parsing %s/%s\n", dir_info.Name(), files[i].Name()); - } else { - /* Text */ - for j = 0; j < len(doc.text); j++ { - w = strings.ToLower(doc.text[j]); - - if words[w] == nil{ - tmp = &index{doc: doc, title: false, freq: 0}; - words[w] = &wordList{this: tmp, next: nil}; - } - - for cur = words[w];cur.next != nil && cur.this.doc.fname != fname; cur = cur.next{} - - if cur.this.doc.fname == fname { - cur.this.freq++ - } else if cur.next == nil { - tmp = &index{doc: doc, title: false, freq: 1}; - cur.next = &wordList{this: tmp, next: nil}; - } else { - panic(fmt.Sprintf("%v", cur)); - } - } - /* Title */ - for j = 0; j < len(doc.title); j++ { - w = strings.ToLower(doc.title[j]); - - if words[w] == nil{ - tmp = &index{doc: doc, title: true, freq: 0}; - words[w] = &wordList{this: tmp, next: nil}; - } - - for cur = words[w];cur.next != nil && cur.this.doc.fname != fname; cur = cur.next{} - - if cur.this.doc.fname == fname { - cur.this.title = true; - cur.this.freq++; - } else if cur.next == nil { - tmp = &index{doc: doc, title: true, freq: 1}; - cur.next = &wordList{this: tmp, next: nil}; - } else { - panic(fmt.Sprintf("%v", cur)); - } - } - } - } - fd.Close(); - } - sorted = make([]wordSort, len(words)); - i = 0; - for k,v := range words { - sorted[i].w = k; - sorted[i].root = v; - i++; - } - - sort.Slice(sorted, func(i, j int) bool { - return sorted[i].w < sorted[j].w; - }); - - fd,_ = os.Create("index.dat"); - printIndex(sorted, fd); - fd.Close(); -} diff --git a/search/search.go b/search/search.go deleted file mode 100644 index c144055..0000000 --- a/search/search.go +++ /dev/null @@ -1,144 +0,0 @@ -/************************************************ - * README * - * In order for search/index to be accessible * - * you must link this folder (search) into your * - * GOPATH * - ************************************************/ - - -package main - -import "search/index" -import "os" -import "fmt" -import "sort" -import "flag" -import "strings" - -type res struct { - doc string; - score float64; -}; - -func main() { - var init_index, sIndex index.I_index; - var tmp, results, root *index.I_entry; - var tmp_score float64; - var scores map[string]map[string]float64; // scores[doc][word] == score - var i,j int; - var searchBool, perWord, docAdded map[string]bool; //map[doc]bool - var resultSort []res; - var err error; - var fname, s string; - var search []string; - - flag.StringVar(&fname, "f", "./index.dat", "Index file"); - flag.StringVar(&s, "s", "" , "Search phrase"); - - flag.Parse(); - if len(s) == 0 { - fmt.Printf("Usage: search -s \"search phrase\" [-f index_file]"); - os.Exit(1); - } else { - search = strings.Fields(s); - } - - scores = make(map[string]map[string]float64); - searchBool = make(map[string]bool); - perWord = make(map[string]bool); - docAdded = make(map[string]bool); - - - sIndex = make(index.I_index); - - - - init_index, err = index.NewInvertedIndexFromFile(fname); - if err != nil { - panic(err) - } - for i = 0; i < len(search); i++ { - sIndex[search[i]] = init_index[search[i]] - } - - for _, v := range sIndex { - for tmp = v; tmp != nil; tmp = tmp.Next { - searchBool[tmp.This.Doc] = true; - scores[tmp.This.Doc] = make(map[string]float64); - } - } - - for _, v := range sIndex { - for tmp = v; tmp != nil; tmp = tmp.Next { - perWord[tmp.This.Doc] = true; - } - for d := range searchBool { - if !perWord[d] { - searchBool[d] = false; - } - } - perWord = make(map[string]bool); - } - - for k, v := range sIndex { - for tmp = v; tmp != nil; tmp = tmp.Next { - if searchBool[tmp.This.Doc] { - if tmp.This.In_title { - tmp_score = 1.0; - } else { - tmp_score = 0.0; - } - - scores[tmp.This.Doc][k] = (0.9 * tmp.This.Freq) + (0.1 * tmp_score); - } - } - - } - - i = 0; - results = &index.I_entry{nil, nil} - root = &index.I_entry{nil, nil}; - results.Next = root; - - j = 0; - - for _ ,v := range sIndex { - for tmp = v; tmp != nil; tmp = tmp.Next { - if (searchBool[tmp.This.Doc]) { - root.This = tmp.This; - docAdded[root.This.Doc] = false; - root.Next = &index.I_entry{nil, nil}; - root = root.Next; - j++ - } - } - } - - resultSort = make([]res, j); - - i = 0; - for root = results.Next; root.Next != nil; root = root.Next { - if (!docAdded[root.This.Doc]) { - j = 0; - tmp_score = 0; - for _ ,v := range scores[root.This.Doc] { - tmp_score += v; - j++; - } - tmp_score /= float64(j); - resultSort[i] = res{root.This.Doc, tmp_score}; - docAdded[root.This.Doc] = true; - i++; - } - } - resultSort = resultSort[:i]; - - sort.Slice(resultSort, func(i, j int) bool { - return resultSort[i].score > resultSort[j].score; - }); - - fmt.Printf("Results: %d\n", len(resultSort)); - for i = 0; i < len(resultSort); i++ { - fmt.Printf("\t%d. Doc: %s, Score: %.3f\n", i, resultSort[i].doc, resultSort[i].score); - } -} |