Search

Terms defined: term frequency - inverse document frequency

FIXME: build TF-IDF with stemming (Issue 278)

  • Want to search the abstracts of over 2000 papers
  • Use term frequency - inverse document frequency (TF-IDF)
    • Term frequency: how often each word occurs in each document, as a proportion of that document's length
    • Document frequency: proportion of documents in which a word appears
    • Inverse document frequency: the logarithm of one over that (i.e., how specific the word is)
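    • For example, if “snail” appears 3 times in a 100-word abstract, its term frequency there is 3/100 = 0.03; if 200 of 2000 documents contain it, its inverse document frequency is log(2000/200) ≈ 2.3, so its score in that document is about 0.03 × 2.3 ≈ 0.069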

Fetching Data

  • OpenAlex indexes over 250 million scholarly works
  • PyAlex provides a Python interface
  • Copy, paste, and tweak example
import argparse
import json
from pathlib import Path

import pyalex
from pyalex import Works


def main():
    """Main driver."""
    args = parse_args()
    if args.email:
        pyalex.config.email = args.email
    pager = (
        Works()
        .filter(concepts={"wikidata": args.concept})
        .paginate(method="page", per_page=200)
    )
    counter = 0
    for page in pager:
        for work in page:
            counter += 1
            if args.verbose:
                print(counter)
            ident = work["id"].split("/")[-1]
            data = {
                "doi": work["doi"],
                "year": work["publication_year"],
                "abstract": work["abstract"],
            }
            if all(data.values()):
                Path(args.outdir, f"{ident}.json").write_text(
                    json.dumps(data, ensure_ascii=False)
                )
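  • OpenAlex stores abstracts as inverted indexes rather than plain text; PyAlex reassembles the readable abstract when the "abstract" key is read, which is why the loop above can use it directly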
  • Additional definitions
WIKIDATA_LAND_SNAIL = "https://www.wikidata.org/wiki/Q6484264"
def parse_args():
    """Parse command-line arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--concept", type=str, default=WIKIDATA_LAND_SNAIL, help="Wikidata concept URL"
    )
    parser.add_argument("--email", type=str, default=None, help="user email address")
    parser.add_argument("--outdir", type=str, required=True, help="output directory")
    parser.add_argument(
        "--verbose", action="store_true", default=False, help="report progress"
    )
    return parser.parse_args()
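  • To run as a script, this presumably ends with the usual entry-point guard (not shown above):
if __name__ == "__main__":
    main()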
  • Produces 2192 JSON files
{
    "doi": "https://doi.org/10.1007/978-94-009-0343-2_40",
    "year": 1996,
    "abstract": "Helicid snails are suitable organisms…"
}

Building Index

  • Usual main driver
import argparse
import csv
import json
import sys
from collections import Counter, defaultdict
from math import log
from pathlib import Path


def main():
    """Main driver."""
    args = parse_args()
    abstracts = read_abstracts(args.bibdir)
    words_in_file = {
        filename: get_words(abstract) for filename, abstract in abstracts.items()
    }
    term_freq = calculate_tf(words_in_file)
    inverse_doc_freq = calculate_idf(words_in_file)
    tf_idf = calculate_tf_idf(term_freq, inverse_doc_freq)
    save(args.outfile, tf_idf)
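  • The matching parse_args isn't shown; a minimal sketch, assuming only the two options the driver and save use:
def parse_args():
    """Parse command-line arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--bibdir", type=str, required=True, help="bibliography directory")
    parser.add_argument(
        "--outfile", type=str, default=None, help="output file (default: stdout)"
    )
    return parser.parse_args()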
  • Reading abstracts from JSON is the simple part
def read_abstracts(bibdir):
    """Extract abstracts from bibliography entries."""
    result = {}
    for filename in Path(bibdir).iterdir():
        data = json.loads(filename.read_text())
        result[filename.name] = data["abstract"]
    return result
  • Getting words is a bit of a hack
    • For now, remove punctuation and hope for the best
def get_words(text):
    """Get words from text, stripping basic punctuation."""
    # str.strip treats its argument as a set of characters, so one call
    # removes any mix of them from both ends of each word
    words = [w.strip(",.'\"()%‰!?$‘’&~–—±·") for w in text.split()]
    return [w for w in words if w]
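  • An illustrative call (not from the original) shows both the stripping and its limits:
>>> get_words("Land snails (e.g., Helix aspersa) are 'suitable' organisms!")
['Land', 'snails', 'e.g', 'Helix', 'aspersa', 'are', 'suitable', 'organisms']
    • Interior punctuation survives (the period inside “e.g”): that is the “hope for the best” part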
  • Calculate term frequency
def calculate_tf(words_in_file):
    """Calculate term frequency of each word per document."""
    result = {}
    for filename, wordlist in words_in_file.items():
        total_words = len(wordlist)
        counts = Counter(wordlist)
        for w, count in counts.items():
            result[(filename, w)] = count / total_words
    return result
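  • For a toy document the numbers are easy to check by hand:
>>> calculate_tf({"a.json": ["snail", "snail", "shell"]})
{('a.json', 'snail'): 0.6666666666666666, ('a.json', 'shell'): 0.3333333333333333}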
  • Calculate inverse document frequency
def calculate_idf(words_in_file):
    """Calculate inverse document frequency of each word."""
    num_docs = len(words_in_file)
    word_sets = [set(words) for words in words_in_file.values()]
    result = {}
    for word in set().union(*word_sets):
        result[word] = log(num_docs / sum(word in per_doc for per_doc in word_sets))
    return result
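  • Again easy to check by hand; a word that appears in every document gets an IDF of log(1) = 0, so ubiquitous words drop out of the rankings entirely:
>>> idf = calculate_idf({"a.json": ["snail", "shell"], "b.json": ["snail", "slug"]})
>>> idf["snail"], idf["shell"], idf["slug"]
(0.0, 0.6931471805599453, 0.6931471805599453)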
  • Combine values
def calculate_tf_idf(term_freq, inverse_doc_freq):
    """Calculate overall score for each term in each document."""
    result = defaultdict(dict)
    for (filename, word), tf in term_freq.items():
        result[word][filename] = tf * inverse_doc_freq[word]
    return result
  • And save as CSV
def save(outfile, tf_idf):
    """Save results as CSV."""
    # Write to stdout when no file is given, and only close streams we opened
    stream = sys.stdout if outfile is None else open(outfile, "w", newline="")
    writer = csv.writer(stream)
    writer.writerow(("word", "doc", "score"))
    for word in sorted(tf_idf):
        for filename, score in sorted(tf_idf[word].items()):
            writer.writerow((word, filename, score))
    if stream is not sys.stdout:
        stream.close()
  • 258,000 distinct terms (!)
    • Of which several thousand contain non-Latin characters
  • 17 documents contain the word “search”
  • Of these, W2026888704.json has the highest score
word,doc,score
…,…,…
search,W1583262424.json,0.017354843942898893
search,W1790707322.json,0.010208731731116994
search,W1978369717.json,0.022087983200053132
search,W1981216857.json,0.04189100262079043
search,W2011577929.json,0.023030124663562513
search,W2026888704.json,0.05716889769425517
search,W2032021174.json,0.022813879361557227
search,W2082863826.json,0.017734877021940473
search,W2084509015.json,0.012992931294148902
search,W2086938190.json,0.020417463462233987
search,W2101925012.json,0.02466678326909487
search,W2316979134.json,0.045842984000110276
search,W2575616999.json,0.047796947252574
search,W2892782288.json,0.020678111931964636
search,W4206540709.json,0.028252071534951684
search,W4304606284.json,0.028584448847127585
search,W4386532853.json,0.033283262356244445
…,…,…
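  • A few lines of Python can answer queries like this against the saved CSV (top_docs is a hypothetical helper, not part of the scripts above):
import csv

def top_docs(csv_path, word, limit=5):
    """Return (doc, score) pairs for a word, highest score first."""
    with open(csv_path, newline="") as stream:
        rows = [row for row in csv.DictReader(stream) if row["word"] == word]
    rows.sort(key=lambda row: float(row["score"]), reverse=True)
    return [(row["doc"], float(row["score"])) for row in rows[:limit]]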
  • Upon inspection, that abstract includes phrases like “Search Dropdown Menu toolbar search search input”, which are probably a result of inaccurate web scraping
  • The good news is, TF-IDF is exactly the sort of thing we know how to write unit tests for
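  • A minimal pytest sketch, assuming the functions above live in a module called index (the module name is an assumption):
from math import log

import pytest

from index import calculate_idf, calculate_tf  # hypothetical module name


def test_tf_counts_repeated_words():
    tf = calculate_tf({"a.json": ["snail", "snail", "shell"]})
    assert tf[("a.json", "snail")] == pytest.approx(2 / 3)


def test_idf_is_zero_for_ubiquitous_words():
    idf = calculate_idf({"a.json": ["snail"], "b.json": ["snail", "slug"]})
    assert idf["snail"] == 0.0
    assert idf["slug"] == pytest.approx(log(2))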