DocumentCloud Search
Introduction
DocumentCloud's search is powered by Solr, an open source search engine by the Apache Software Foundation. Most of the search syntax is passed through directly to Solr — you can read Solr's documentation directly for information on how its syntax works. This document will reiterate the parts of that syntax that are applicable to DocumentCloud, as well as parts of the search that are specific to DocumentCloud.
Syntax
Specifying Terms
You may specify either single words to search for, such as document or report, or a phrase of multiple words to be matched as a whole by surrounding it in double quotes, such as "the mueller report".
Wildcard Searches
Terms can use ? to match any single character. For example, ?oat will match both goat and boat. You may use * to match zero or more characters, so J* will match J, John, Jane, or any other word beginning with a J. You may use these in any position of a term: beginning, middle, or end.
Note: This feature is only available to authenticated users. You may register for a free account at https://accounts.muckrock.com/ to use this feature.
Fuzzy Searches
By appending ~ to a term you can perform a fuzzy search, which will match close variants of the term based on edit distance. Edit distance is the number of letter insertions, deletions, substitutions, or transpositions needed to get from one word to another. This can be useful for finding documents with misspelled words or with poor OCR. By default ~ will allow an edit distance of 2, but you can specify an edit distance of 1 by using ~1. For example, book~ will match book, books, and looks.
Note: This feature is only available to authenticated users. You may register for a free account at https://accounts.muckrock.com/ to use this feature.
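To make the edit-distance rule concrete, here is a minimal Python sketch of the optimal string alignment distance, which counts insertions, deletions, substitutions, and adjacent transpositions. It is an illustration only: the function name edit_distance is hypothetical, and Solr's own fuzzy matching is implemented differently and may not agree with this sketch in every edge case.

```python
def edit_distance(a: str, b: str) -> int:
    """Optimal string alignment distance between two words (illustrative only)."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every letter of a[:i]
    for j in range(cols):
        dist[0][j] = j  # insert every letter of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Swapping two adjacent letters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


print(edit_distance("book", "books"))  # one insertion
print(edit_distance("book", "looks"))  # one substitution plus one insertion
```

Under this definition, books is within distance 1 of book and looks is within distance 2, which is why book~ matches both while book~1 would match only the first.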
Proximity Searches
Proximity searches allow you to search for multiple words within a certain distance of each other. A proximity search is specified by adding a ~ and a number after a phrase. For example, "mueller report"~10 will search for documents which contain the words mueller and report within 10 words of each other.
Ranges
Range searches allow you to search for fields that fall within a certain range. For example, pages:[2 TO 20] will search for all documents with 2 to 20 pages, inclusive. You can use { and } for exclusive ranges, as well as mix and match them. Although this is most useful on numeric and date fields, it will also work on text fields: [a TO c] will match all text alphabetically between a and c.
You can also use * for either end of the range to make it open ended. For example, pages:[100 TO *] will find all documents with at least 100 pages, while pages:[* TO 20] will find all documents with at most 20 pages. Note that TO must be uppercase.
Boosting
Boosting allows you to alter how the documents are scored. You can make one of your search terms more important in terms of ranking. Use the ^ operator with a number. By default, terms have a boost of 1. For example, mueller^4 report will search for documents containing mueller or report but give more weight to the term mueller.
Fields
By default, text is searched through title and source boosted to 10, description boosted to 5, and text boosted to 1. You can search any field specifically by using field:term syntax. For example, to just search for documents with report in the title, you can use title:report. The fielded search only affects a single term, so title:mueller report will search for mueller in the title, and report in the default fields. You can use title:"mueller report" to search for the exact phrase "mueller report" in the title, or use grouping, title:(mueller report), to search for mueller or report in the title.
Boolean Operators
You can require or omit certain terms, or apply more complex boolean logic to queries. You can require a term by prepending it with + and can omit a term by prepending it with -. You can also omit a term by preceding it with NOT. You can require multiple terms by combining them with AND, and require either (or both) terms by combining them with OR. For example, mueller AND report requires both mueller and report be present. +mueller -report would require mueller be present and require report to not be present. By default, multiple terms are combined with OR, but see filter fields for how they are handled specially. These boolean operators must be uppercase, or else they will be treated as search terms.
Grouping Terms
You can use parentheses to group terms, allowing for complex queries, such as (mueller OR watergate) AND report to require either mueller or watergate, and report to appear.
Specifying Dates and Times
Date times must be fully specified in the form "YYYY-MM-DDThh:mm:ssZ" where YYYY is the year, MM is the month, DD is the day, hh is the hour, mm is the minutes, and ss is the seconds. T is the literal T character and Z is the literal Z character. These are always expressed in UTC time. You may optionally include fractional seconds ("YYYY-MM-DDThh:mm:ss.fZ"). You must quote these for them to work in search queries.
You may also use NOW to stand in for the current time. This is most useful when combined with date time math, which allows you to add or subtract time in the following units: YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, MILLISECOND. For example, NOW+1DAY would be one day from now. NOW-2MONTHS would be 2 months in the past.
You may also use / to round down to the start of a time unit. For example, NOW/HOUR is the beginning of the current hour. These can be combined: NOW-1YEAR+2MONTHS/MONTH would be the beginning of the month, 2 months past one year ago. These are useful with ranged searches: [NOW-1MONTH TO *] would be all dates in the past month.
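If you build queries from a script, the quoting and UTC rules above are easy to get wrong. The following is a minimal Python sketch, assuming you start from a timezone-aware datetime; the helper names solr_datetime and date_range are hypothetical and not part of any DocumentCloud library.

```python
from datetime import datetime, timezone


def solr_datetime(moment: datetime) -> str:
    """Format an aware datetime as a quoted UTC string for use in a query."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    utc = moment.astimezone(timezone.utc)
    # The surrounding double quotes are required in search queries.
    return '"' + utc.strftime("%Y-%m-%dT%H:%M:%SZ") + '"'


def date_range(field: str, start: str = "*", end: str = "*") -> str:
    """Build an inclusive range clause; * leaves that end open."""
    return f"{field}:[{start} TO {end}]"


start = solr_datetime(datetime(2024, 1, 1, tzinfo=timezone.utc))
print(date_range("created_at", start))         # fixed start, open end
print(date_range("created_at", "NOW-1MONTH"))  # date math needs no quotes
```

Date math expressions such as NOW-1MONTH are passed through unquoted, while fully specified date times go through solr_datetime so that they are always quoted and expressed in UTC.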
Sorting
You may sort using the syntax sort:<sort type>. Possible sortings include:
- score (highest score first; default)
- created_at (newest first)
- page_count (largest first)
- title (alphabetical)
- source (alphabetical)
- data_keyname (alphabetical, ex: sort:data_request to sort by the key name request)
These may be reversed by prepending a - (sort:-page_count). You may use order as an alias to sort.
Escaping Special Characters
Special characters may be escaped by preceding them with a \. For example, \(1\+1\) will search for a literal "(1+1)" in the text instead of using the characters’ special meanings. If your query contains a syntax error, the parser will automatically escape your query to make a best effort at returning relevant results. The API response will contain a field escaped informing you if this auto-escape mechanism was triggered.
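When user-supplied text should be searched literally, you can escape it yourself before sending the query. The following is a minimal Python sketch. The character list is an assumption based on the special characters of the Lucene/Solr standard query syntax, and the name escape_query is hypothetical; the server's own auto-escape mechanism may behave differently.

```python
# Assumed set of characters with special meaning in the query syntax.
SPECIAL_CHARACTERS = set('+-&|!(){}[]^"~*?:\\/')


def escape_query(text: str) -> str:
    """Precede every special character with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARACTERS else ch for ch in text)


print(escape_query("(1+1)"))  # prints \(1\+1\)
```

Only escape the parts of a query that are meant literally. Escaping a whole query would also disable operators, wildcards, and field prefixes you intended to use.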
Filter Fields
The following fields may be searched on, which will filter the resulting documents based on their properties. By default, all fields included in the query are treated as required (e.g. user:1 report will show only documents from user 1 scored by the text query “report”). If you include the same field multiple times, the query is equivalent to applying OR between each occurrence of that field (e.g. user:1 user:2 report will show documents by user 1 or 2). If you include distinct fields, the query is equivalent to applying AND between each set of distinct fields (e.g. user:1 user:2 tag:email will find documents by user 1 or 2 and which are tagged as email). If you use any explicit boolean operators (AND or OR), those will take precedence (e.g. (user:1 AND tag:email) OR (user:2 AND tag:contract) would return documents by user 1 tagged as email as well as documents by user 2 tagged as contract). This allows you to make complex boolean queries using any available field.
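The implicit combination rule can be illustrated by writing out the explicit boolean query it is equivalent to. The following Python sketch groups repeated fields with OR and joins distinct fields with AND. It only mirrors the rule as described here and is not the server's implementation; the name explicit_filter_query is hypothetical.

```python
def explicit_filter_query(filters: list[tuple[str, str]]) -> str:
    """Spell out the implicit OR (same field) and AND (distinct fields) rule."""
    grouped: dict[str, list[str]] = {}
    for field, value in filters:
        grouped.setdefault(field, []).append(f"{field}:{value}")
    clauses = []
    for terms in grouped.values():
        if len(terms) == 1:
            clauses.append(terms[0])
        else:
            # Repeated fields are alternatives, so wrap them in one OR group.
            clauses.append("(" + " OR ".join(terms) + ")")
    return " AND ".join(clauses)


print(explicit_filter_query([("user", "1"), ("user", "2"), ("tag", "email")]))
# prints (user:1 OR user:2) AND tag:email
```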
Available fields:
- user
  Specify using the user ID. Also accepts the slug preceding the ID for readability (e.g. user:mitchell-kotler-1). account is an alias for user.
- organization
  Specify using the organization ID. Also accepts the slug preceding the ID for readability (e.g. organization:muckrock-1). group is an alias for organization.
- access
  Specify the access level. Valid choices are public, organization, and private.
- status
  Specify the status of the document. Valid choices are success, readable, pending, error, and nofile.
- project
  Specify using the project ID. Also accepts the slug preceding the ID for readability (e.g. project:panama-papers-1). projects is an alias for project.
- document
  Specify using the document ID. Also accepts the slug preceding the ID for readability (e.g. document:mueller-report-1). id is an alias for document.
- language
  Specify the language the document is in. Valid choices include:
  - ara - Arabic
  - zho - Chinese (Simplified)
  - tra - Chinese (Traditional)
  - hrv - Croatian
  - dan - Danish
  - nld - Dutch
  - eng - English
  - fra - French
  - deu - German
  - heb - Hebrew
  - hun - Hungarian
  - ind - Indonesian
  - ita - Italian
  - jpn - Japanese
  - kor - Korean
  - nor - Norwegian
  - por - Portuguese
  - ron - Romanian
  - rus - Russian
  - spa - Spanish
  - swe - Swedish
  - ukr - Ukrainian
- slug
  Specify the slug of the document.
- created_at
  Specify the date time the document was created.
- updated_at
  Specify the date time the document was last updated.
- page_count
  Specify the number of pages the document has. pages is an alias for page_count.
- data_*
  Specify arbitrary key-value data pairs on the document (e.g. the search query data_color:blue returns documents with data color: blue). Note that color is the key and blue is the value. Key/value pairs are case and spelling sensitive. If you want to find any document with a color key you can use data_color:*. You can use -data_color:* if you want to find any documents that do not have a key/value pair for color.
- tag
  This is an alias to data__tag, which is used by the site as a simple tagging system. Searching for tags is case and spelling sensitive. To find any documents that are tagged, you can use tag:*. You can use - to exclude results with a given tag. For example, -tag:significant would remove all documents from the search that are tagged as significant.
Text Fields
Text fields can be used to search for text in a particular field of the document. They are used to score the searches and are always treated as optional unless you use + or AND to require them.
- title
  The title of the document.
- source
  The source of the document.
- description
  The description of the document.
- text
  The full text of the document, as obtained by text embedded in the PDF or by OCR. doctext is an alias for text.
- page_no_*
  You may search the text on the given page of a document. To find all documents which contain the word report on page 2, you could use page_no_2:report.
Example Queries
Date ranges:
Find all documents uploaded by user 102112 in the last month
+user:102112 created_at:[NOW-1MONTH TO *]
Find all documents uploaded by user 102112 in the last 11 months.
+user:102112 created_at:[NOW-11MONTH TO *]
Find all documents uploaded by user 102112 between 11 months ago and 3 months ago.
+user:102112 created_at:[NOW-11MONTH TO NOW-3MONTH]
Find all documents uploaded by user 102112 in the last month with a page count of 41 pages.
+user:102112 created_at:[NOW-1MONTH TO *] AND page_count:41
Find all documents uploaded by user 102112 between the start of 2024-01-01 and the start of 2024-01-31 (UTC).
+user:102112 created_at:["2024-01-01T00:00:00Z" TO "2024-01-31T00:00:00Z"]
Key/value pair existence
Find all documents uploaded by user 102112 that have a _mr_status key (that is, the key exists)
+user:102112 AND data__mr_status:*
Find all documents uploaded by user 102112 that do not have a _mr_status key (the key does not exist)
+user:102112 AND -data__mr_status:*
Key/value pair searches
Find all the documents that have an entry for the key "Folder" on DocumentCloud
data_Folder:*
Find all documents that have a value of "From ARMY site - Environmental documents" for the key Folder
+data_Folder:"From ARMY site - Environmental documents"
Find all documents that have a value of 38 for the key Subfolder and "From ARMY site - Environmental documents" for the Folder.
+data_Folder:"From ARMY site - Environmental documents" AND +data_Subfolder:38
See sorting for an example on sorting by key/value pair.
Searching Tags
Find all documents that have been labelled with the tag "significant" on DocumentCloud
tag:significant
Project filter
Find all documents uploaded by user 102112 that are also in the project 214246
+user:102112 AND project:214246
Access level filter
Find all documents uploaded by user 102112 that are also private.
+user:102112 AND access:private
Sorting
Find all documents uploaded by user 102112 in the last month that are in project 214246, sorted by page_count so that the documents with the most pages appear first.
+user:102112 created_at:[NOW-1MONTH TO *] AND project:214246 sort:page_count
Find all documents uploaded to DocumentCloud that are also uploaded to IPFS, sorted by their IPFS cid (stored as a key/value pair):
sort:data_cid
In reverse alphabetical order:
sort:-data_cid
Wildcard Searches
Find all documents uploaded by user 102112 in the last month whose title contains a word starting with fy2017
+user:102112 created_at:[NOW-1MONTH TO *] AND +title:fy2017*
Text Field Searches
Find all documents uploaded to DocumentCloud that have Mueller somewhere in the title
title:Mueller*
Find all documents uploaded to DocumentCloud that have Edwin Mueller somewhere in the title.
title:"Edwin Mueller*"
Find all documents uploaded to DocumentCloud that have Mueller somewhere in the description.
description:Mueller*
Find all documents uploaded to DocumentCloud that have Mueller somewhere in the description and Barr somewhere in the title.
description:Mueller* AND title:Barr*
Find all documents uploaded to DocumentCloud that contain the word "Russian" in the document text and contain "Mueller" in the description and contain "Barr" in the title.
+description:Mueller* AND +title:Barr* AND text:Russian
Find all documents uploaded to DocumentCloud that contain "Mueller" in the description, "Barr" in the title, and "Russian" on page 4 of the document.
+description:Mueller* AND +title:Barr* AND page_no_4:Russian
API
You may search via the API:
GET /api/documents/search/
You may pass the query as described above in the q parameter (e.g. /api/documents/search/?q=some+text+user:1 to search for some text in documents by user 1). For all fielded searches, you may pass them in as standalone query parameters instead of in q if you prefer (e.g. /api/documents/search/?q=some+text&user=1 is the same query as the previous example). You may also negate fields by preceding them with a - in this way (e.g. /api/documents/search/?q=some+text&-user=1 to search for some text in documents not by user 1). You may specify the sort order using either sort or order as a parameter (e.g. /api/documents/search/?q=some+text+order:title and /api/documents/search/?q=some+text&order=title both search for some text in documents sorted by their title).
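As a rough illustration, a search URL with standalone field parameters can be assembled with Python's standard library. This sketch only builds the URL and makes no request. The host is the one used in the example responses in this document, and the helper name search_url is hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode

# Host taken from the example responses in this document.
BASE_URL = "https://api.www.documentcloud.org/api/documents/search/"


def search_url(query: str, params: Optional[dict] = None) -> str:
    """Build a search URL from a text query plus standalone field parameters."""
    # urlencode turns spaces into + and percent-encodes reserved characters.
    return BASE_URL + "?" + urlencode({"q": query, **(params or {})})


print(search_url("some text", {"user": 1}))
print(search_url("some text", {"-user": 1, "order": "title"}))
```

Passing filters as separate parameters avoids having to escape or quote them inside q, and a leading - on the parameter name negates the field as described above.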
You can also specify per_page, page, and expand as you would for /api/documents/. expand may be user or organization (or both: user,organization). The response will be in a JSON object like a list response:
{
"count": <number of results on the current page>,
"next": <next page url if applicable>,
"previous": <previous page url if applicable>,
"results": <list of results>,
"escaped": <bool>
}
with the addition of the escaped property to specify if the query had a syntax error and needed to be autoescaped.
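Because results are paged, a client usually follows the next URL until it is empty. The following is a minimal Python sketch of that loop. The fetch argument is a hypothetical callable that you supply: it should perform the HTTP GET for a URL and return the decoded JSON object, which keeps the sketch independent of any particular HTTP library.

```python
from typing import Callable, Iterator


def iter_results(first_url: str, fetch: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every result, following the "next" link from page to page."""
    url = first_url
    while url:
        page = fetch(url)
        yield from page["results"]
        url = page.get("next")  # null on the last page ends the loop


# Usage with canned pages standing in for real API responses.
pages = {
    "page-1": {"results": [{"id": 1}], "next": "page-2", "escaped": False},
    "page-2": {"results": [{"id": 2}], "next": None, "escaped": False},
}
print([doc["id"] for doc in iter_results("page-1", pages.__getitem__)])
```

A real client would also want to inspect the escaped property of the first page, since a true value means the query was not interpreted with its special syntax.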
You may also enable highlighting by setting the hl parameter to true. Each document will then contain a highlights property, which will contain relevant snippets from the document containing the given search term.
https://api.www.documentcloud.org/api/documents/search?q=report&hl=true
{
  "count": 413,
  "next": "https://api.www.documentcloud.org/api/documents/search/?q=report&page=2&hl=true",
  "previous": null,
  "results": [
    {
      "id": "20059100",
      "user": 100000,
      "organization": 10001,
      "access": "public",
      "status": "success",
      "title": "the-mueller-report",
      "slug": "the-mueller-report",
      "source": "gema_georgia_gov",
      "language": "eng",
      "created_at": "2020-04-05T13:36:08.507Z",
      "updated_at": "2020-04-24T18:47:52.985Z",
      "page_count": 448,
      "highlights": {
        "title": [
          "the-mueller-<em>report</em>"
        ],
        "page_no_9": [
          "-CrinP6te\nINTRODUCTION TO VOLUME T |\n\nThis <em>report</em> is submitted to the Attorey General pursuant to 28 C-F.R"
        ]
      },
      "data": {},
      "asset_url": "https://assets.documentcloud.org/"
    }
  ]
}
You may search within a document using the following endpoint:
GET /api/documents/<doc_id>/search/
This will return up to 25 highlights per page for your query. You may use the same search syntax as above, although most of the fielded queries will not be meaningful when searching within a single document.
Example response:
{
  "title": [
    "the-mueller-<em>report</em>"
  ],
  "page_no_9": [
    "-CrinP6te\nINTRODUCTION TO VOLUME T |\n\nThis <em>report</em> is submitted to the Attorey General pursuant to 28 C-F.R",
    " the Attorney\nGeneral a confidential <em>report</em> explaining the prosecution or declination decisions [the",
    " in detail in this <em>report</em>, the Special Counsel's investigation established that\nRussia interfered in"
  ],
  "page_no_10": [
    "\n‘overview of the two volumes of our <em>report</em>.\n\nThe <em>report</em> describes actions and events that the Special",
    ", the <em>report</em> points out\nthe absence of evidence or conflicts in the evidence about a particular fact or",
    " with\nconfidence, the <em>report</em> states that the investigation established that certain actions or events",
    "\n‘coordination in that sense when stating in the <em>report</em> thatthe investigation did not establish that the\n‘Trump",
    " Campaign coordinated with the Russian government in its election interference activities.\n\nThe <em>report</em> on"
  ]
}
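A response shaped like the one above can be reduced to a mapping from page number to plain-text snippets. The following is a small Python sketch, assuming the page_no_N key pattern and <em> markers shown in the example; the function name highlighted_pages is hypothetical.

```python
import re


def highlighted_pages(highlights: dict) -> dict:
    """Map page number to its snippets with the <em> markers removed."""
    pages = {}
    for key, snippets in highlights.items():
        match = re.fullmatch(r"page_no_(\d+)", key)
        if match:  # skips non-page keys such as "title"
            pages[int(match.group(1))] = [
                re.sub(r"</?em>", "", snippet) for snippet in snippets
            ]
    return pages


example = {
    "title": ["the-mueller-<em>report</em>"],
    "page_no_9": ["This <em>report</em> is submitted"],
}
print(highlighted_pages(example))  # {9: ['This report is submitted']}
```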