Search

Terms defined: term frequency - inverse document frequency

FIXME: build TF-IDF with stemming (Issue 278)

  • Want to search the abstracts of over 2000 papers
  • Use term frequency - inverse document frequency (TF-IDF)
    • Term frequency: how often each word occurs in each document, as a proportion of that document's length
    • Document frequency: proportion of documents in which a word appears
    • Inverse document frequency: the logarithm of one over that (i.e., how specific the word is)
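    • For example, if “snail” appears 3 times in a 100-word abstract, its term frequency there is 3/100 = 0.03; if 200 of 2000 documents contain it, its inverse document frequency is log(2000/200) ≈ 2.3, so its score in that document is about 0.03 × 2.3 ≈ 0.069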

Fetching Data

  • OpenAlex indexes over 250 million scholarly works
  • PyAlex provides a Python interface
  • Copy, paste, and tweak example
import argparse
import json
from pathlib import Path

import pyalex
from pyalex import Works


def main():
    """Main driver."""
    args = parse_args()
    if args.email:
        pyalex.config.email = args.email
    pager = (
        Works()
        .filter(concepts={"wikidata": args.concept})
        .paginate(method="page", per_page=200)
    )
    counter = 0
    for page in pager:
        for work in page:
            counter += 1
            if args.verbose:
                print(counter)
            ident = work["id"].split("/")[-1]
            data = {
                "doi": work["doi"],
                "year": work["publication_year"],
                "abstract": work["abstract"],
            }
            if all(data.values()):
                Path(args.outdir, f"{ident}.json").write_text(
                    json.dumps(data, ensure_ascii=False)
                )
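  • OpenAlex stores abstracts as inverted indexes rather than plain text; PyAlex reassembles the readable abstract when the "abstract" key is read, which is why the loop above can use it directly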
  • Additional definitions
WIKIDATA_LAND_SNAIL = "https://www.wikidata.org/wiki/Q6484264"
def parse_args():
    """Parse command-line arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--concept", type=str, default=WIKIDATA_LAND_SNAIL, help="Wikidata concept URL"
    )
    parser.add_argument("--email", type=str, default=None, help="user email address")
    parser.add_argument("--outdir", type=str, required=True, help="output directory")
    parser.add_argument(
        "--verbose", action="store_true", default=False, help="report progress"
    )
    return parser.parse_args()
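  • To run as a script, this presumably ends with the usual entry-point guard (not shown above):
if __name__ == "__main__":
    main()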
  • Produces 2192 JSON files
{
    "doi": "https://doi.org/10.1007/978-94-009-0343-2_40",
    "year": 1996,
    "abstract": "Helicid snails are suitable organisms…"
}

Building Index

  • Usual main driver
import argparse
import csv
import json
import sys
from collections import Counter, defaultdict
from math import log
from pathlib import Path


def main():
    """Main driver."""
    args = parse_args()
    abstracts = read_abstracts(args.bibdir)
    words_in_file = {
        filename: get_words(abstract) for filename, abstract in abstracts.items()
    }
    term_freq = calculate_tf(words_in_file)
    inverse_doc_freq = calculate_idf(words_in_file)
    tf_idf = calculate_tf_idf(term_freq, inverse_doc_freq)
    save(args.outfile, tf_idf)
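  • The matching parse_args isn't shown; a minimal sketch, assuming only the two options the driver and save use:
def parse_args():
    """Parse command-line arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--bibdir", type=str, required=True, help="bibliography directory")
    parser.add_argument(
        "--outfile", type=str, default=None, help="output file (default: stdout)"
    )
    return parser.parse_args()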
  • Reading abstracts from JSON is the simple part
def read_abstracts(bibdir):
    """Extract abstracts from bibliography entries."""
    result = {}
    for filename in Path(bibdir).iterdir():
        data = json.loads(filename.read_text())
        result[filename.name] = data["abstract"]
    return result
  • Getting words is a bit of a hack
    • For now, remove punctuation and hope for the best
def get_words(text):
    """Get words from text, stripping basic punctuation."""
    # str.strip treats its argument as a set of characters, so one call
    # removes any mix of them from both ends of each word
    words = [w.strip(",.'\"()%‰!?$‘’&~–—±·") for w in text.split()]
    return [w for w in words if w]
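  • An illustrative call (not from the original) shows both the stripping and its limits:
>>> get_words("Land snails (e.g., Helix aspersa) are 'suitable' organisms!")
['Land', 'snails', 'e.g', 'Helix', 'aspersa', 'are', 'suitable', 'organisms']
    • Interior punctuation survives (the period inside “e.g”): that is the “hope for the best” part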
  • Calculate term frequency
def calculate_tf(words_in_file):
    """Calculate term frequency of each word per document."""
    result = {}
    for filename, wordlist in words_in_file.items():
        total_words = len(wordlist)
        counts = Counter(wordlist)
        for w, count in counts.items():
            result[(filename, w)] = count / total_words
    return result
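  • For a toy document the numbers are easy to check by hand:
>>> calculate_tf({"a.json": ["snail", "snail", "shell"]})
{('a.json', 'snail'): 0.6666666666666666, ('a.json', 'shell'): 0.3333333333333333}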
  • Calculate inverse document frequency
def calculate_idf(words_in_file):
    """Calculate inverse document frequency of each word."""
    num_docs = len(words_in_file)
    word_sets = [set(words) for words in words_in_file.values()]
    result = {}
    for word in set().union(*word_sets):
        result[word] = log(num_docs / sum(word in per_doc for per_doc in word_sets))
    return result
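  • Again easy to check by hand; a word that appears in every document gets an IDF of log(1) = 0, so ubiquitous words drop out of the rankings entirely:
>>> idf = calculate_idf({"a.json": ["snail", "shell"], "b.json": ["snail", "slug"]})
>>> idf["snail"], idf["shell"], idf["slug"]
(0.0, 0.6931471805599453, 0.6931471805599453)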
  • Combine values
def calculate_tf_idf(term_freq, inverse_doc_freq):
    """Calculate overall score for each term in each document."""
    result = defaultdict(dict)
    for (filename, word), tf in term_freq.items():
        result[word][filename] = tf * inverse_doc_freq[word]
    return result
  • And save as CSV
def save(outfile, tf_idf):
    """Save results as CSV."""
    # Write to stdout when no file is given, and only close streams we opened
    stream = sys.stdout if outfile is None else open(outfile, "w", newline="")
    writer = csv.writer(stream)
    writer.writerow(("word", "doc", "score"))
    for word in sorted(tf_idf):
        for filename, score in sorted(tf_idf[word].items()):
            writer.writerow((word, filename, score))
    if stream is not sys.stdout:
        stream.close()
  • 258,000 distinct terms (!)
    • Of which several thousand contain non-Latin characters
  • 17 documents contain the word “search”
  • Of these, W2026888704.json has the highest score
word,doc,score
…,…,…
search,W1583262424.json,0.017354843942898893
search,W1790707322.json,0.010208731731116994
search,W1978369717.json,0.022087983200053132
search,W1981216857.json,0.04189100262079043
search,W2011577929.json,0.023030124663562513
search,W2026888704.json,0.05716889769425517
search,W2032021174.json,0.022813879361557227
search,W2082863826.json,0.017734877021940473
search,W2084509015.json,0.012992931294148902
search,W2086938190.json,0.020417463462233987
search,W2101925012.json,0.02466678326909487
search,W2316979134.json,0.045842984000110276
search,W2575616999.json,0.047796947252574
search,W2892782288.json,0.020678111931964636
search,W4206540709.json,0.028252071534951684
search,W4304606284.json,0.028584448847127585
search,W4386532853.json,0.033283262356244445
…,…,…
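  • A few lines of Python can answer queries like this against the saved CSV (top_docs is a hypothetical helper, not part of the scripts above):
import csv

def top_docs(csv_path, word, limit=5):
    """Return (doc, score) pairs for a word, highest score first."""
    with open(csv_path, newline="") as stream:
        rows = [row for row in csv.DictReader(stream) if row["word"] == word]
    rows.sort(key=lambda row: float(row["score"]), reverse=True)
    return [(row["doc"], float(row["score"])) for row in rows[:limit]]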
  • Upon inspection, that abstract includes phrases like “Search Dropdown Menu toolbar search search input”, which are probably a result of inaccurate web scraping
  • The good news is, TF-IDF is exactly the sort of thing we know how to write unit tests for
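  • A minimal pytest sketch, assuming the functions above live in a module called index (the module name is an assumption):
from math import log

import pytest

from index import calculate_idf, calculate_tf  # hypothetical module name


def test_tf_counts_repeated_words():
    tf = calculate_tf({"a.json": ["snail", "snail", "shell"]})
    assert tf[("a.json", "snail")] == pytest.approx(2 / 3)


def test_idf_is_zero_for_ubiquitous_words():
    idf = calculate_idf({"a.json": ["snail"], "b.json": ["snail", "slug"]})
    assert idf["snail"] == 0.0
    assert idf["slug"] == pytest.approx(log(2))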