Site Text Dataset

Extracted text and site links. Available only in JSON format and as a daily feed.

Flat-File Layout

Name	Format	Description
url	String	Page from which the content was extracted.
domain	String	Normalized domain name (excluding any subdomain).
extraction_date	Date (YYYY-MM-DD)	Date on which the content was extracted.
page	String	Classification of the URL from which the content was extracted. Possible values are ‘home’, ‘about’, ‘careers’, ‘documentation’, ‘location’, ‘policy’, ‘products’, ‘services’.
extracted_text	String	Text extracted from the URL, with all HTML tags and other elements that cannot be used for embeddings removed.
extracted_links	Array[LinkObject]	List of the links found in <a> tags on the page, each with its corresponding or alternative text.
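
The layout above can be checked when loading a record. The following is a minimal Python sketch, assuming each record arrives as a single JSON object whose keys match the field names in the table. The names SiteTextRecord and parse_record are hypothetical, the sample values are made up, and because the fields of LinkObject are not described in this section, extracted_links is kept as an opaque list.

```python
import json
from dataclasses import dataclass
from datetime import date
from typing import Any, List

# Allowed values of the `page` field, as listed in the layout table.
PAGE_TYPES = {
    "home", "about", "careers", "documentation",
    "location", "policy", "products", "services",
}


@dataclass
class SiteTextRecord:
    """Hypothetical container mirroring the flat-file layout."""
    url: str
    domain: str
    extraction_date: date
    page: str
    extracted_text: str
    # LinkObject fields are not specified here, so entries stay untyped.
    extracted_links: List[Any]


def parse_record(raw: str) -> SiteTextRecord:
    """Parse one JSON record and validate the date and page fields."""
    obj = json.loads(raw)
    if obj["page"] not in PAGE_TYPES:
        raise ValueError(f"unexpected page value: {obj['page']!r}")
    return SiteTextRecord(
        url=obj["url"],
        domain=obj["domain"],
        # Raises ValueError if the date is not in YYYY-MM-DD form.
        extraction_date=date.fromisoformat(obj["extraction_date"]),
        page=obj["page"],
        extracted_text=obj["extracted_text"],
        extracted_links=list(obj["extracted_links"]),
    )


# Made-up sample record, for illustration only.
sample = json.dumps({
    "url": "https://www.example.com/about",
    "domain": "example.com",
    "extraction_date": "2024-01-15",
    "page": "about",
    "extracted_text": "Example Corp builds example products.",
    "extracted_links": [],
})

record = parse_record(sample)
```

Rejecting unknown page values up front is a design choice of this sketch; a consumer that prefers to tolerate new classifications could log the value and keep the record instead.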