Text and Data Mining Resources

Disclaimer

  • If you are unsure whether your intended use of library electronic resources complies with library license agreements, please contact your subject librarian. The library can reach out to vendors to help you explore paid options that you can then write into your grant proposals.
  • Systematic downloading of materials from Atkins Library electronic resources is not supported by our license agreements and will result in a shutdown of access for all of our users. We can help you locate datasets to text mine, as well as contact information, instructions, and documentation for resources accessible through our library.
  • We cannot assist with programming or with using APIs to text mine these resources.

TDM Sources

Vendor Details

JSTOR - Data for Research (DfR) - free A self-service system for text mining. By creating a free DfR account you can download the metadata, word frequencies, citations, key terms, and N-grams of up to 1,000 documents. To get larger datasets or a type of data not available through the main site, contact JSTOR directly: support@ithaka.org (Guide for using DfR)
Science Direct (Elsevier) - free You can text mine all subscribed content as long as it is for non-commercial purposes and you use their Science Direct APIs. You must register before using these APIs.
SpringerLink - free (through library subscription) You can download subscribed and open access content for TDM purposes directly from the SpringerLink platform. 
Full-text content can be accessed via friendly URLs: PDF: http://link.springer.com/[DOI].pdf  OR HTML(when available): http://link.springer.com/[DOI].html
Content can be downloaded via a web browser or with an HTTP GET request using a scripting tool. Researchers are requested to be considerate and limit their downloading speed to a reasonable rate.
PLEASE read their instructions for more detailed information HERE
More information about Springer Nature's API Portal HERE or BioMed Central HERE
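The SpringerLink URL patterns above can be sketched in Python as follows. This is only an illustration under stated assumptions, not Springer's recommended client: the function names, the 5-second pause between requests, and the User-Agent string are hypothetical choices, and the only detail taken from this guide is the `http://link.springer.com/[DOI].pdf` / `.html` pattern. Check Springer's own instructions and your license terms before running anything like it.

```python
import time
from urllib.parse import quote
from urllib.request import Request, urlopen

BASE_URL = "http://link.springer.com"  # pattern quoted in this guide


def springer_url(doi, fmt="pdf"):
    """Build the friendly full-text URL for a DOI ('pdf' or 'html')."""
    if fmt not in ("pdf", "html"):
        raise ValueError("fmt must be 'pdf' or 'html'")
    # Keep the slash between DOI prefix and suffix; escape other unsafe characters.
    return f"{BASE_URL}/{quote(doi.strip(), safe='/')}.{fmt}"


def fetch_all(dois, fmt="pdf", delay_seconds=5.0, opener=urlopen, sleep=time.sleep):
    """Download each DOI in turn with plain HTTP GET requests.

    delay_seconds is an assumed "reasonable rate"; the guide does not
    give a number. opener and sleep are injectable so the function can
    be exercised without touching the network.
    """
    results = {}
    for index, doi in enumerate(dois):
        if index > 0:
            sleep(delay_seconds)  # pause between consecutive requests
        request = Request(
            springer_url(doi, fmt),
            # Hypothetical identifier; replace with your own contact details.
            headers={"User-Agent": "tdm-research-script (contact: you@example.edu)"},
        )
        with opener(request) as response:
            results[doi] = response.read()
    return results
```

Calling `springer_url("10.1007/xxxx", "html")` with a real DOI in place of the placeholder returns the HTML address; `fetch_all` takes a list of DOIs and returns a dictionary mapping each DOI to the raw bytes received.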
National Center for Biotechnology Information Multiple collections of articles/abstracts from the National Library of Medicine. You can click "Tools" or "Web APIs" on the top menu to find out more information about accessing the data here
PubMed Central Article Databases PMC and the NCBI Bookshelf offer several large datasets of journal articles and other scientific publications made available for retrieval under license terms that generally allow for more liberal redistribution and reuse than a traditional copyrighted work (e.g., Creative Commons licenses).
PLOS Search API (Public Library of Science)  Gives developers access to rich data that can be flexibly integrated into applications for the web, desktop or mobile devices.
HathiTrust

A partnership of academic and research institutions, offering a collection of millions of titles digitized from libraries around the world. Access HathiTrust Datasets and its Research Center for information about text and data mining.
Lesson on Text Mining in Python through the HTRC Feature Reader.

Also, a new HTRC Derived dataset (2.0) is available with documentation here. HTRC Extracted Features 2.0 is the most current version of a derived dataset consisting of metadata and data elements extracted from volumes in the HathiTrust Digital Library. The dataset is composed of 17+ million JSON files representing a snapshot of the HathiTrust corpus from February 2020.

Digital Public Library of America (DPLA) Access to DPLA API Codex, Bulk Download, Technical Documentation, and Sample Code and Libraries.
Internet Archive Instructions for developers and those interested in bulk download and API access.
Chronicling America (API)  Provides access to information about historic newspapers and select digitized newspaper pages. Search America's historic newspaper pages from 1789 to 1963 or use the U.S. Newspaper Directory to find information about American newspapers published between 1690 and the present.
Gale Eighteenth Century Collections Online (ECCO) Available at the Text Creation Partnership (ECCO-TCP). Online searching at the partnership is free, but text mining the collection requires a fee to access "raw" data.
Early English Books Online (EEBO) - free The Text Creation Partnership creates standardized, accurate XML/SGML encoded electronic text editions of early print books. Phase I is freely available (25,000 titles).
Oxford English Dictionary and Oxford University Press (Oxford Scholarship Online) - free  Oxford accommodates TDM for non-commercial use. Researchers are not required to request permission for non-commercial text mining. If you have any questions, please e-mail Data.Mining@oup.com
LexisNexis Academic (now Nexis Uni) - free/fee-based Not specifically offered for text mining, but because text files can be downloaded many articles at a time, mining is possible. You can contact LexisNexis to inquire about using or purchasing their "Data as a Service" for larger datasets. A link to the personal consultation service for mining with the LexisNexis Bulk Content API is also available.
CAP API (from Harvard Law Library) - free The Caselaw Access Project API, also known as CAPAPI, serves all official US court cases published in books from 1658 to 2018. The collection includes over six million cases scanned from the Harvard Law Library shelves.
Adam Matthew Databases - free/fee-based Source of unique primary digital collections that are readily available for mining. Researchers may contact the publisher directly at info@amdigital.co.uk to discuss data mining requests/projects. More information here.
Govinfo Bulk Data Repository 

Bills, regulations, rules, and papers of the presidents of the U.S.

Royal Society of Chemistry (RSC) Researchers should send the following information to jnl_licences@rsc.org one month before the activity begins:
Date to start, Completion date, Institution, Crawler IP address, Crawler user agent, Types of content (HTML / PDF), Institution contact email, and Researcher contact email.
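As a convenience, the items RSC asks for can be kept in a small checklist so nothing is left out of the notification e-mail. The sketch below is hypothetical: the class name, field names, and plain-text layout are illustrative choices, not an RSC form or format. Only the eight items themselves come from the list above.

```python
from dataclasses import dataclass, fields
from typing import List

# Labels follow the list in this guide; the field names are illustrative.
LABELS = {
    "start_date": "Date to start",
    "completion_date": "Completion date",
    "institution": "Institution",
    "crawler_ip_address": "Crawler IP address",
    "crawler_user_agent": "Crawler user agent",
    "content_types": "Types of content (HTML / PDF)",
    "institution_contact_email": "Institution contact email",
    "researcher_contact_email": "Researcher contact email",
}


@dataclass
class RscTdmNotice:
    start_date: str
    completion_date: str
    institution: str
    crawler_ip_address: str
    crawler_user_agent: str
    content_types: str
    institution_contact_email: str
    researcher_contact_email: str

    def missing(self) -> List[str]:
        """Return the labels of any items that are still blank."""
        return [
            LABELS[f.name]
            for f in fields(self)
            if not getattr(self, f.name).strip()
        ]

    def as_email_body(self) -> str:
        """Render one 'Label: value' line per item, in the listed order."""
        return "\n".join(
            f"{LABELS[f.name]}: {getattr(self, f.name)}" for f in fields(self)
        )
```

Calling `missing()` before sending shows which items still need to be filled in; `as_email_body()` produces text that can be pasted into a message to jnl_licences@rsc.org.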
Wiley databases Instructions can be found at the linked website.
Alpha Vantage Market API - free Includes historical price and volume data for public companies in the Americas, Europe, and Asia Pacific. Also includes company financials from SEC filings, real-time and historical foreign exchange rates, and other technical indicators. Each free API key allows up to 500 API calls per day by default. Please reach out to support@alphavantage.co if you need a higher rate limit. You may also refer to this guide, which describes some of the important concepts in the financial market data domain.

 

More APIs

Definition and Documents

Computer programs can crawl millions of articles or other content in digital form to derive or organize information from text or data. TDM can help researchers sort through large amounts of information, identify patterns and trends, and understand individual text as well as the connections between texts. (ARL)


The Association of Research Libraries Issue Brief ("Text and Data Mining and Fair Use in the U.S.")

The International Federation of Library Associations and Institutions (IFLA) Statement on Text and Data Mining

EXAMPLES and TOOLS 

Using HathiTrust Data

Using Google Books Data

Voyant Tools - web-based text reading and analysis environment. It is a scholarly project that is designed to facilitate reading and interpretive practices for digital humanities students and scholars as well as for the general public.