Some datasets are available from the Linguistic Data Consortium in 2020 through UNC Charlotte's School of Data Science. Please email Dr. Radford for more information.
|JSTOR - Data for Research (DfR)||A self-service system for text mining. By creating a free DfR account, you can download the metadata, word frequencies, citations, key terms, and N-grams of up to 1,000 documents. To get larger datasets, or a type of data not available through the main site, contact JSTOR directly: firstname.lastname@example.org (Guide for using DfR)|
|ScienceDirect (Elsevier) - free||You can text mine all subscribed content, as long as it is for non-commercial purposes and is done through their ScienceDirect APIs. You must register before using these APIs.|
|SpringerLink - free (through library subscription)||You can download subscribed and open access content for TDM purposes directly from the SpringerLink platform.
Full-text content can be accessed via friendly URLs: PDF: http://link.springer.com/[DOI].pdf or HTML (when available): http://link.springer.com/[DOI].html
Content can be downloaded via a web browser or with an HTTP GET request using a scripting tool. Researchers are requested to be considerate and limit their downloading speed to a reasonable rate.
Please read their instructions for more detailed information.
More information is also available about the Springer Nature API Portal and BioMed Central.|
|National Center for Biotechnology Information||Multiple collections of articles and abstracts from the National Library of Medicine. Click "Tools" or "Web APIs" in the top menu to learn more about accessing the data.|
|PLOS Search API (Public Library of Science)||Gives developers access to rich data that can be flexibly integrated into applications for the web, desktop or mobile devices.|
|HathiTrust Digital Library||A partnership of academic and research institutions, offering a collection of millions of titles digitized from libraries around the world. Access HathiTrust Datasets and its Research Center for information about text and data mining.
Also, a new HTRC derived dataset (2.0) is available with documentation. HTRC Extracted Features 2.0 is the most current version of a derived dataset consisting of metadata and data elements extracted from volumes in the HathiTrust Digital Library. The dataset is composed of 17+ million JSON files representing a snapshot of the HathiTrust corpus from February 2020.|
|Digital Public Library of America (DPLA)||Access to DPLA API Codex, Bulk Download, Technical Documentation, and Sample Code and Libraries.|
|Internet Archive||Instructions for developers and those interested in bulk download and API access.|
|Chronicling America (API)||Provides access to information about historic newspapers and select digitized newspaper pages. Search America's historic newspaper pages from 1789 to 1963, or use the U.S. Newspaper Directory to find information about American newspapers published from 1690 to the present.|
|Gale Eighteenth Century Collections Online (ECCO)||Available through the Text Creation Partnership (ECCO-TCP). Online searching at the partnership is free, but text mining the collection requires a fee to access the "raw" data.|
|Early English Books Online (EEBO) - free||The Text Creation Partnership creates standardized, accurate XML/SGML-encoded electronic text editions of early print books. Phase I is freely available (25,000 titles).|
|Oxford English Dictionary and Oxford University Press (Oxford Scholarship Online) - free||Oxford accommodates TDM for non-commercial use. Researchers are not required to request permission for non-commercial text mining. If you have any questions, please e-mail Data.Mining@oup.com|
|LexisNexis Academic (now Nexis Uni) - free/fee-based||Not specifically designed for text mining, but because text files for many articles can be downloaded at a time, mining is possible. For larger datasets, you can contact LexisNexis to inquire about using or purchasing their "Data as a Service". A link to the LexisNexis Bulk Content API mining personal consultation service is also available.|
|CAP API (from Harvard Law Library) - free||The Caselaw Access Project API, also known as CAPAPI, serves all official US court cases published in books from 1658 to 2018. The collection includes over six million cases scanned from the Harvard Law Library shelves.|
|Adam Matthew Databases - free/fee-based||Source of unique primary digital collections that are readily available for mining. Researchers may contact Adam Matthew directly at email@example.com to discuss data mining requests or projects. More information here.|
|Govinfo Bulk Data Repository||Bills, regulations, rules, and papers of the presidents of the U.S.|
|Royal Society of Chemistry (RSC)||Researchers should send the following information to firstname.lastname@example.org one month before the activity begins:
start date, completion date, institution, crawler IP address, crawler user agent, types of content (HTML/PDF), institution contact email, and researcher contact email.|
|Wiley databases||Instructions can be found at the linked website.|
|Alpha Vantage Market API - free||Includes historical price and volume data for public companies in the Americas, Europe, and Asia Pacific. Also includes company financials from SEC filings, real-time and historical foreign exchange rates, and other technical indicators. Each free API key allows up to 500 API calls per day by default. Please reach out to email@example.com if a higher rate limit is needed. You may also refer to this guide, which describes some of the important concepts in the financial market data domain.|
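The SpringerLink entry in the table above describes fetching full text with an HTTP GET request against DOI-based URLs, at a considerate download rate. A minimal Python sketch of that workflow might look like the following. It assumes the URL pattern exactly as quoted in that entry; the function names, the 5-second pause, and the placeholder DOI are illustrative assumptions and not part of Springer's instructions, which should be checked before any real download.

```python
import time
import urllib.request

# Base URL as quoted in the SpringerLink entry above.
SPRINGER_BASE = "http://link.springer.com"


def springer_url(doi: str, fmt: str = "pdf") -> str:
    """Build a full-text URL following the [DOI].pdf / [DOI].html pattern."""
    if fmt not in ("pdf", "html"):
        raise ValueError("fmt must be 'pdf' or 'html'")
    return f"{SPRINGER_BASE}/{doi.strip()}.{fmt}"


def download_all(dois, fmt="pdf", delay_seconds=5.0):
    """Fetch each DOI in turn, pausing between requests.

    delay_seconds is an assumed value chosen only to keep the request
    rate modest; the source does not state a specific limit.
    """
    results = {}
    for i, doi in enumerate(dois):
        if i > 0:
            time.sleep(delay_seconds)  # wait before every request after the first
        with urllib.request.urlopen(springer_url(doi, fmt), timeout=60) as resp:
            results[doi] = resp.read()  # raw bytes of the PDF or HTML response
    return results


if __name__ == "__main__":
    # "10.1007/example-doi" is a hypothetical placeholder, not a real DOI.
    print(springer_url("10.1007/example-doi", "html"))
```

Calling `download_all` with a list of DOIs covered by your subscription would return a dictionary mapping each DOI to the downloaded bytes. Keeping the pause as a parameter makes it easy to slow down further if the publisher asks for a lower rate.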
Computer programs can crawl millions of articles or other content in digital form to derive or organize information from text or data. TDM can help researchers sort through large amounts of information, identify patterns and trends, and understand individual text as well as the connections between texts. (ARL)
The Association of Research Libraries Issue Brief ("Text and Data Mining and Fair Use in the U.S.")
The International Federation of Library Associations and Institutions (IFLA) Statement on Text and Data Mining
Voyant Tools - web-based text reading and analysis environment. It is a scholarly project that is designed to facilitate reading and interpretive practices for digital humanities students and scholars as well as for the general public.