Site Text Dataset
Extracted text and site links. Available only in JSON format and as a daily feed.
Flat-File Layout
| Name | Format | Description |
|---|---|---|
| url | String | Page from which the content was extracted. |
| domain | String | Normalized domain name (excluding any subdomain). |
| extraction_date | Date (YYYY-MM-DD) | Date on which the content was extracted. |
| page | String | Classification of the URL from which the content was extracted. Possible values are ‘home’, ‘about’, ‘careers’, ‘documentation’, ‘location’, ‘policy’, ‘products’, ‘services’. |
| extracted_text | String | Text extracted from the URL, with all HTML tags and other elements that cannot be used for embeddings removed. |
| extracted_links | Array[LinkObject] | List of links taken from <a> tags, each with its corresponding link text or alternative text. |
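
As an illustration of the layout above, the following is a minimal Python sketch that parses one record and checks it against the documented fields. Several things here are assumptions rather than part of the dataset specification: the helper name `parse_record`, the sample values in `SAMPLE`, and the idea that each record arrives as a single JSON object. The internal structure of a LinkObject is not described in this section, so the sketch only checks that `extracted_links` is a list and leaves its items untouched.

```python
import json
from datetime import date, datetime

# Allowed values for the `page` field, as listed in the table above.
PAGE_TYPES = {
    "home", "about", "careers", "documentation",
    "location", "policy", "products", "services",
}

# Fields documented as strings (extraction_date is a YYYY-MM-DD string in JSON).
STRING_FIELDS = ("url", "domain", "extraction_date", "page", "extracted_text")


def parse_record(raw: str) -> dict:
    """Parse one JSON record and validate it against the flat-file layout.

    Hypothetical helper for illustration; not part of any official client.
    """
    record = json.loads(raw)

    for field in STRING_FIELDS:
        if not isinstance(record.get(field), str):
            raise ValueError(f"{field} must be a string")

    if record["page"] not in PAGE_TYPES:
        raise ValueError(f"unexpected page value: {record['page']!r}")

    # Convert the YYYY-MM-DD string into a datetime.date object.
    record["extraction_date"] = datetime.strptime(
        record["extraction_date"], "%Y-%m-%d"
    ).date()

    # LinkObject fields are not specified here, so items are not inspected.
    if not isinstance(record.get("extracted_links"), list):
        raise ValueError("extracted_links must be an array")

    return record


# Made-up sample record; all values are placeholders.
SAMPLE = json.dumps({
    "url": "https://www.example.com/",
    "domain": "example.com",
    "extraction_date": "2024-01-15",
    "page": "home",
    "extracted_text": "Welcome to Example.",
    "extracted_links": [],
})

record = parse_record(SAMPLE)
print(record["domain"], record["extraction_date"], record["page"])
```

Validating `page` against the fixed set of values catches records that do not match the documented classification before they reach downstream processing such as embedding.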