Although large citation databases such as Web of Science and Scopus are widely used in bibliometric research, they have several disadvantages, including limited availability, poor coverage of books and conference proceedings, and inadequate mechanisms for distinguishing among authors. We discuss these issues, then examine the comparative advantages and disadvantages of other bibliographic databases, with emphasis on (a) discipline-centered article databases such as EconLit, MEDLINE, PsycINFO, and SocINDEX, and (b) book databases such as
While nearly all research universities provide access to Web of Science or Scopus, these databases are available at only a small minority of undergraduate colleges. Systematic restrictions on access may result in systematic biases in the literature of scholarly communication and assessment.
The limitations of the largest citation databases influence the kinds of research that can be most readily pursued. In particular, research problems that use exclusively bibliometric data may be preferred over those that draw on a wider range of information sources.
Because books, conference papers, and other research outputs remain important in many fields of study, journal databases cover just one component of scholarly accomplishment. Likewise, data on publications and citation impact cannot fully account for the influence of scholarly work on teaching, practice, and public knowledge.
The automation of data compilation processes removes opportunities for investigators to gain first-hand, in-depth understanding of the patterns and relationships among variables. In contrast, manual processes may stimulate the kind of associative thinking that can lead to new insights and perspectives.
Although large citation databases are used extensively in research on scholarly communication and assessment, they have limitations that make them less than ideal for certain kinds of projects.
Our primary goal is to demonstrate that widely available databases such as SocINDEX and
Our secondary goal is to present a data set that illustrates these principles and to describe the methods used in its construction. Our data file (
For many scholars, the biggest disadvantage of Web of Science and Scopus is simply that neither resource is available to them. Although faculty at the major research universities often have access to at least one of these databases, the situation is very different elsewhere. Apart from those institutions in the Carnegie R1 and R2 categories, just 25% of American four-year colleges and universities provide access to either Scopus or Web of Science.
Our experience at several universities suggests that for institutions with 2,000 to 10,000 students, Scopus costs three to four times as much as SocINDEX or Sociological Abstracts. Moreover, Web of Science generally costs more than Scopus. Cost is not the only factor that influences library holdings, of course. The disciplinary databases each have a clear constituency—an academic department or school with a strong interest in maintaining access to specific databases and journals. Although each multidisciplinary database may be of some interest to a large number of faculty, no single group is likely to feel a compelling need to choose Scopus (for instance) over Biological Abstracts or MathSciNet. When a small group with well-defined interests and a larger group with more diffuse interests compete, the small group is likely to prevail (
Finally, there is a perception among some librarians and faculty that Scopus and Web of Science are not especially attractive to undergraduates—that the advantages of these databases (such as size, multidisciplinary scope, and citation-tracing capabilities) are offset by disadvantages such as complicated interfaces and marketing strategies that target expert users rather than undergraduates. Faculty may appreciate the breadth of Web of Science and Scopus, but students often adopt a more constrained approach to database selection. As a student at Manhattan College stated during web site usability testing, ‘If my paper is for a psychology class, I look for a database with “psych” in the name.’
Because Scopus and Web of Science are held by relatively few U.S. colleges and universities, many researchers who use bibliographic or bibliometric data must look elsewhere.
A second limitation of the large citation databases is their relatively poor coverage of books and conference proceedings. Scopus, Web of Science, and Google Scholar are all devoted primarily to journal articles. For instance, Scopus covers more than 39,000 journals but just 1,628 book series, 514 conference proceedings, and no books issued independently (i.e., not as part of a series). Likewise, the three most readily available Web of Science databases—SCI, SSCI, and Arts & Humanities Citation Index—include roughly one-twentieth as many books and chapters as journal articles. The Web of Science book and proceedings databases are much smaller, offered as separate products, and not widely held, even by research universities. Google Scholar makes no distinction between articles, conference papers, books, and other online resources that ‘look scholarly’ to its web-crawling mechanisms, and this indiscriminate approach does improve its coverage of conference papers and other non-journal documents. Google Scholar’s coverage of books and chapters is still limited, however, perhaps because books are far less likely than journals to be indexed and made available online (
The importance of conference proceedings is well established, especially in rapidly changing fields such as computer science (
Although faculty at the major research universities have far higher average article counts than those at other institutions, the highest average
A third limitation of the large citation databases is the difficulty of distinguishing among authors with similar names. To some extent, the problem can be traced to bibliographic errors in the databases themselves. For instance, Web of Science is not always consistent or reliable in its reporting of author names and institutional affiliations (
Although the difficulty of resolving authors’ names can be addressed through author identifier systems such as ORCID and ResearcherID, many authors are not included in either system. In compiling author information for our data set, we looked for databases that (a) provide complete and accurate author information, (b) maintain their own author identifier systems (although there are no such databases for sociology), (c) cover a limited range of subject areas and therefore minimize the number of instances in which ‘unwanted’ authors appear in the search results, and (d) provide, for each record, the full text of the title page and any other pages on which bibliographic or author information is likely to appear. We also conducted each search manually rather than relying on automated procedures. This last point is discussed more fully in section 4.3.
Accurate name disambiguation is especially important for
When evaluating the publishing productivity of American sociologists, we found no database that provided adequate coverage of both journal articles and books. (Other scholarly works, such as conference papers, were not included in our analysis.) We chose SocINDEX as our primary source of journal article data after evaluating Google Scholar, Scopus, Web of Science, SocINDEX, and Sociological Abstracts on the basis of five criteria:
1. Covers a large number of sociology journals, including the more prominent ones
2. Includes the journals of related fields in which sociologists routinely publish (e.g., criminology, demography, social statistics, and the social aspects of public health)
3. Provides reliable, easily compiled bibliographic information
4. Includes the author’s full first name—not just the initial—as part of the searchable name field(s)
5. Excludes subject areas unrelated to sociology, thereby minimizing the need to investigate and disambiguate matching and near-matching names
Despite their very broad coverage, none of the three large citation databases (Google Scholar, Scopus, and Web of Science) is as comprehensive as SocINDEX and Sociological Abstracts with regard to sociology (criteria 1 and 2). For instance, the Scopus
Google Scholar, with its idiosyncratic presentation of results, did not satisfy criterion 3. None of the three large citation databases satisfied criterion 5.
In many academic disciplines, a single database such as EconLit or PsycINFO is widely accepted as the foremost source of bibliographic information. In contrast, sociology has two contenders: SocINDEX and Sociological Abstracts. Although the two are comparable in many ways, SocINDEX has a broader subject scope and provides more thorough coverage of peer-reviewed journals (
Each discipline has its own subject databases with unique characteristics, of course. We recommend that investigators seeking the most appropriate data sources consider the literature of the relevant disciplines, the scope/coverage information available at publishers’ web sites, and the database reviews and comparative reports that have appeared in the library and information science (LIS) literature. LISTA, perhaps the most prominent LIS journal database, is freely available online (
We evaluated nine sources of bibliographic data for books. Six were withdrawn from consideration early in the process due to obvious gaps in coverage. Specifically,
1. We considered counting the books cited in key disciplinary journals, but those books are not necessarily representative of the literature as a whole.
2. Likewise, the books reviewed (or received for review) by key disciplinary journals are not representative. That list is also likely to be biased by publishers’ actions and by the policies and preferences of the journals’ editorial boards.
3. Book Citation Index includes just 60,000 books. It provides good coverage of highly cited books but poor coverage otherwise (
4. Books in Print is limited to titles currently in print, thereby excluding a substantial number of recently published books.
5. CHOICE Reviews covers only those books that are appropriate for liberal arts colleges, and, with few exceptions, only those that receive favorable reviews (
6. Google Books includes only those books that are available in digital format or cited online—a serious limitation, since fewer than half of all current print books are available in any digital format (
Three data sources are more comprehensive, however.
7.
8. OCLC WorldCat includes the books held by nearly 17,000 libraries worldwide as well as those available for purchase through major library vendors such as GOBI Library Solutions (
9. The GOBI database includes the books available through the largest U.S. academic library book vendor or through a selection of prominent used book dealers.
Of these nine information sources, we chose
Although
As noted in section 1, our data file includes five-year publication counts (2013-2017) for 2,132 professors and associate professors in 426 U.S. departments of sociology. Publication counts and related data are presented separately for individuals and for academic departments. The data file, in .xlsx format, is freely available through both Zenodo and openICPSR (
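For readers who want to work with the file directly, the following is a minimal sketch of loading it for analysis in Python; the file name and sheet name are placeholders rather than the actual names used in the distribution.

```python
# Minimal sketch of loading the data file; the file and sheet names below
# are placeholders, not the actual names used on Zenodo or openICPSR.
import pandas as pd

faculty = pd.read_excel("sociology_productivity_2013_2017.xlsx",
                        sheet_name="individuals")  # hypothetical sheet name

# Quick checks against the figures reported in the text
print(len(faculty))              # expected: 2,132 professors and associate professors
print(faculty.columns.tolist())  # variables listed in Appendix section A.9
```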
The data compilation procedures, described fully in the Appendix, result in a data set with several unique advantages:
Data for six distinct institution types allow for the investigation of publishing productivity across the full range of U.S. colleges and universities: top research universities, other R1 universities, other doctoral universities, master’s institutions, top liberal arts colleges, and other bachelor’s institutions.
Four productivity measures—articles, articles in high-impact journals, books, and books from high-impact publishers—allow researchers to identify book- and article-centered departments and individuals, and to explore the relationships between book and article counts.
The inclusion of key institutional and individual variables (e.g., department size, academic rank, gender, Ph.D. year, and Ph.D. institution) facilitates investigation of the correlates/determinants of scholarly productivity. Likewise, the identification of individuals and institutions allows for the linking of these data to the variables found in other data sets.
Multiple data sources and careful data cleaning/standardization procedures provide for a data set that is reliable and consistent in format. The data were compiled manually from authoritative sources, without relying on surveys or other instruments that might be subject to response bias.
Our data do have five significant limitations, however. First, assistant professors are not included. Section A.5 of the Appendix presents the rationale for this decision. Second, the data for institutions in the R1 and
Overall, six types of variables are included in the data file: general data on academic institutions (sociology departments), general data on individuals (sociology faculty), data on each journal article, data on each book, productivity data for individuals, and productivity data for academic institutions. For a list of the variables, see the Appendix, section A.9.
Although at least 25 studies have rated or ranked the scholarly output of sociologists and sociology departments since 1970, just three post-2000 analyses include rankings of at least 40 U.S. sociology departments based on articles published in a wide range of journals (
The productivity differential between faculty at the major research universities and those at other institutions appears to have declined over time. More generally, the link between institution type and publishing productivity is weaker now than in the past.
Although men are more productive than women at the R1 universities, women are more productive than men at the top liberal arts colleges, other bachelor’s institutions, and universities in the
Although the major research universities have the highest average article counts, the highest average
In general, especially high publication counts can be found among associate professors (rather than full professors), faculty with fewer than 17 years’ experience, and authors with doctorates from the most prestigious universities.
There is high variation in productivity among institutions and individuals within each of the six institution types.
The second study based on these data focused on the most productive faculty at various types of colleges and universities, yielding further evidence in support of the second finding, above (
We encourage others to use our data set, either to explore new areas of research or to investigate these same topics in greater detail or from different perspectives. As noted in section 4.1, these data may be of limited value for multidisciplinary research. At the same time, however, our data—and our methods, more generally—are well suited to research on academic or professional groups that can be clearly delineated on the basis of individuals’ characteristics or the characteristics of their publication outlets. Although bibliometric research often deals with interdisciplinary or multidisciplinary groups, an approach centered on particular disciplines or occupations may be more appropriate for research in fields such as labor economics and the sociology of professions.
The methods used to compile and clean the data are described in the Appendix. To summarize, we began by identifying the sociology departments in the population of interest—four-year public and nonprofit colleges and universities in the United States—and compiling rosters of all the faculty with professor or associate professor rank. We then searched SocINDEX, Amazon, and other publicly accessible sources (e.g., course catalogs, Google Scholar, the IPEDS Data System, OCLC WorldCat, personal web sites, ProQuest Dissertation Express, publishers’ web sites, and Scopus Sources) to compile information on institutional characteristics, individuals’ characteristics, and publishing productivity.
These methods are time-intensive, of course. Although we did not systematically record the time spent on each particular task (i.e., the average time per individual for the faculty rosters or the average time per article for the journal article data), we can provide a general sense of the time commitment required. Working 15-20 person-hours per week on data compilation, we spent 11 weeks compiling departmental rosters for 426 departments (2,132 individuals), 14 weeks compiling data on 4,928 journal articles, and 10 weeks compiling data on 598 books. That’s roughly 5 minutes per person record, 3 minutes per article record, and 18 minutes per book record.
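The per-record figures follow directly from the reported effort; a quick back-of-the-envelope check, assuming the midpoint of the stated 15-20 person-hours per week:

```python
# Back-of-the-envelope check of the per-record times reported above,
# assuming the midpoint of the stated 15-20 person-hours per week.
HOURS_PER_WEEK = 17.5

def minutes_per_record(weeks, records):
    return weeks * HOURS_PER_WEEK * 60 / records

print(round(minutes_per_record(11, 2132)))  # ~5 minutes per person record
print(round(minutes_per_record(14, 4928)))  # ~3 minutes per article record
print(round(minutes_per_record(10, 598)))   # ~18 minutes per book record
```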
Many information science researchers use automated methods to compile bibliographic and bibliometric data. For instance, application programming interfaces (APIs) can be used to harvest data from Google Scholar, Scopus, and WorldCat (
A second problem is the existence of multiple records for a single research contribution. In some cases, the papers are variants such as a published article and the corresponding manuscript or preprint, but the database provides no mechanism by which the researcher can choose to count these variants as the same work or as multiple works. In other cases, the exact same paper (e.g., the same PDF file) is represented by multiple records. As discussed in section 3.2, this is a particular problem with WorldCat, which has 109 records for Web of Science and its key components. Disentangling the relationships among these records requires considerable effort, and it cannot be done through an automated search mechanism. The same problem is readily apparent with Google Scholar.
A third difficulty is that many automated methods fail to retrieve all relevant records, often due to a lack of standardization in the underlying data. For an investigation of the citation impact of papers in predatory accounting journals (in progress), we evaluated the performance of the Publish or Perish search tool (
Finally, automated searches may lead researchers to bypass the mechanisms that would otherwise make them more familiar with the patterns and idiosyncrasies that exist within the data. We believe manual searching can give the investigator a deeper understanding of the relationships among variables, including facets of those relationships that might not be detected by the more common statistical methods. (For example, is there a threshold level at which institutional prestige begins to influence book productivity? Is the relationship between gender and article productivity conditional on a third characteristic? Do the a priori delineations of the variables capture the most important distinctions between institutions and groups, or would alternative specifications be more appropriate?) Likewise, the experience of compiling the data may suggest hypotheses, explanations for findings that emerge later in the research process, or caveats related to data interpretation that might not have come to light through automated searching. For example, when compiling the data we noticed that many mid-ranked bachelor’s and master’s universities have just one ‘superstar’ faculty member with far higher productivity than the others—and that a disproportionate number of those superstars are women. We interpreted our initial statistical results with this idea in mind, then later developed more careful, formal methods of evaluating the situation (
Our institutional population is based on the set of all four-year public and nonprofit colleges and universities in the United States (
The individuals in the population of interest include full-time faculty with the rank of (full) professor or associate professor. Faculty with endowed chairs or distinguished professor rank were included, as were those on sabbatical or temporary leave. Adjunct (part-time) faculty were excluded, as were instructors, lecturers, and assistant professors. (Section A.5 explains the exclusion of assistant professors.) Likewise, we excluded emeritus faculty as well as faculty with current, non-interim administrative appointments at the dean level or higher (e.g., provosts, vice provosts, and deans). Associate deans and department chairs were included, however. For departments with faculty from two or more academic disciplines, we included only the sociologists—those who hold doctorates in sociology
For interpretive and sampling purposes, we identified six types of colleges and universities:
TopR—Top research universities: The top 26 doctoral programs in sociology, based on the subjective ratings assigned by department chairs and graduate program directors at doctorate-granting institutions (
R1—Other R1 universities: Other institutions with a Carnegie classification of
OD—Other doctoral universities: Institutions with a Carnegie classification of
M—Master’s institutions: Institutions in the three Carnegie
TopLA—Top liberal arts colleges: The top 50 national liberal arts colleges that award the bachelor’s degree in sociology. ‘Top 50’ status is based on alumni giving rate, class size, faculty salaries, financial resources, graduation rate, percentage of faculty with terminal degrees, retention rate, student selectivity, and undergraduate academic reputation (
B—Other bachelor’s institutions: All other institutions with a Carnegie classification of
Despite the labels used by the Carnegie Foundation, none of the six types are defined on the basis of publishing productivity.
The data for four of the six institution types (TopR, OD, TopLA, and B) include the entire populations of interest. For those four institution types, Table
Population and sample sizes for six institution types.
| | TopR | R1 | OD | M | TopLA | B |
|---|---|---|---|---|---|---|
| Population or sample | P | S | P | S | P | P |
| Base population of institutions (a) | 26 | 89 | 201 | 695 | 50 | 469 |
| Number of departments checked (b) | 26 | 42 | 201 | 185 | 50 | 469 |
| Population of departments (c) | 26 | 64 | 21 | 406 | 41 | 200 |
| Sample of departments (d) | — | 30 | — | 108 | — | — |
| Population of faculty (e) | 546 | 867 | 205 | 1,518 | 165 | 403 |
| Sample of faculty | — | 409 | — | 404 | — | — |
Our data for the R1 and M institutions are sample data that include roughly 47% and 27% of the corresponding populations. For the R1 group, we began by identifying the 89 universities in the relevant Carnegie classification, excluding those already placed in the TopR group. To obtain the R1 sample, we listed those institutions in random order, then went down the list and compiled data for the departments that met our criteria (offers doctorate in sociology and has one or more full or associate professors) until the sample included at least 400 faculty. We had to check 42 departments before reaching the desired sample size, and about 71% of those departments—30 departments with 409 faculty—met our criteria. Based on that proportion, we estimated a population size of 64 departments with 867 professors and associate professors. (See Table
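The estimation step is a simple proportion; a short sketch using the figures reported above:

```python
# Estimating the R1 population of departments and faculty from the sample.
# All input figures are those reported in the text.
base_institutions = 89   # R1 universities outside the TopR group
checked = 42             # departments examined before reaching the target sample
eligible = 30            # checked departments that met our criteria
sampled_faculty = 409    # professors and associate professors in those 30 departments

est_departments = base_institutions * eligible / checked     # ~64
est_faculty = sampled_faculty * base_institutions / checked  # ~867
print(round(est_departments), round(est_faculty))
```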
Because the data file includes sample rather than population data for two of the six institution types, it is necessary to apply case weights when estimating population values that account for all six institution types combined. Weights of 2.1198 for the R1 group, 3.7574 for the M group, and 1.0 for the other four groups will result in unbiased estimates for the population. To arrive at a sample that is representative of the entire population without inflating the sample size—when undertaking significance tests, for instance—use case weights of 1.2201 for R1, 2.1628 for M, and 0.5756 for all other cases.
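To illustrate how the weights might be applied, here is a minimal sketch using pandas; the column names (inst_type, articles) are placeholders rather than the variable names used in the data file.

```python
# Sketch of applying the case weights; 'inst_type' and 'articles' are
# placeholder column names, not the variable names used in the data file.
import pandas as pd

POP_WEIGHTS = {"R1": 2.1198, "M": 3.7574}    # all other institution types: 1.0
NORM_WEIGHTS = {"R1": 1.2201, "M": 2.1628}   # all other institution types: 0.5756

def add_case_weights(df: pd.DataFrame) -> pd.DataFrame:
    df["w_pop"] = df["inst_type"].map(POP_WEIGHTS).fillna(1.0)
    df["w_norm"] = df["inst_type"].map(NORM_WEIGHTS).fillna(0.5756)
    return df

# Example: population-weighted mean article count
# weighted_mean = (df["articles"] * df["w_pop"]).sum() / df["w_pop"].sum()
```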
Basic institutional data—institution name, location, and control (public, private nonreligious, Roman Catholic, Protestant, or other religious)—were obtained from the IPEDS Data System (
Department rosters were compiled in the first three months of 2018 from university web sites, course catalogs, OCLC WorldCat, personal web sites, ProQuest Dissertation Express, and other publicly available sources. For each institution, we noted the highest sociology degree offered. For each individual, we recorded name, academic rank (professor or associate professor), gender (female or male), Ph.D. year, and Ph.D. institution. There are no missing values, although Ph.D. year was estimated for 8 of the 2,132 individuals.
Because the names of institutions and individuals were standardized, the personal names listed in our data file are not necessarily those used professionally by each individual. We may have used a full middle name, for instance, in order to provide for more reliable identification or to differentiate between individuals with similar names. Gender was determined from names, pronouns, and photographs as presented on personal web sites, university web sites, and sources such as RateMyProfessors. We found no cases in which our information sources suggested a gender category other than female or male.
Four measures were used to represent publishing productivity over the 2013-2017 period:
Articles: Number of articles in journals indexed by SocINDEX.
HI articles: Number of articles in high-impact journals indexed by SocINDEX.
Books: Number of books listed in Amazon.
HI books: Number of books from high-impact publishers listed in Amazon.
The consideration of both article and book counts is essential, since both forms of publication remain important within sociology. See sections A.6 and A.7 for notes on the delineation of high-impact journals and publishers.
Our four measures of publishing productivity all represent five-year productivity (January 2013 through December 2017) rather than lifetime productivity. While this constraint limits investigators’ ability to directly examine long-term trends, it also helps avoid two significant problems. First, by focusing on five-year productivity and excluding assistant professors, we ensure a five-year period of potential productivity for everyone in the population of interest. That is, we avoid the need to pro-rate scholarly productivity based on the number of research-active years. (Active engagement in research may or may not pre-date the Ph.D. year, so the inclusion of assistant professors would have required us to determine a ‘first research year’ for each faculty member with less than five years’ experience.) Second, the use of a five-year period minimizes the potential impact of name changes and avoids the difficulty of using older bibliographic records that sometimes list just initials rather than first names. Our use of multiple information sources gave us confidence in matching authors to scholarly works over a five-year period. That task would have been more difficult and less reliable if we had tried to match authors and works over a period of several decades.
SocINDEX searches for individual articles were conducted over a four-month period beginning in March 2018. Each SocINDEX search was limited to peer-reviewed journals and to contributions with a document type of
To ensure comprehensiveness, we searched for multiple variants of each author’s name: the full name and the short form (e.g., ‘Christopher’ and ‘Chris’), with the middle initial and without, with the full middle name (if known) and without. We also searched without the first name if there was any reason to believe the author did not use it consistently. Hyphenated last names were searched as written, with a space instead of a hyphen, with the first component alone, and with the second component alone.
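The variant list for each author can be generated systematically. A simplified sketch follows; the short-form lookup table is illustrative, and actual searches also drew on nicknames and alternate forms found on CVs and departmental web pages.

```python
# Simplified sketch of generating author-name search variants.
# The SHORT_FORMS table is illustrative; real searches also used nicknames
# and alternate forms found on CVs and departmental web pages.
SHORT_FORMS = {"Christopher": "Chris", "Elizabeth": "Liz"}  # hypothetical entries

def name_variants(first, last, middle=None):
    firsts = {first, SHORT_FORMS.get(first, first)}
    lasts = {last}
    if "-" in last:                      # hyphenated last names
        a, b = last.split("-", 1)
        lasts.update({last.replace("-", " "), a, b})
    variants = set()
    for f in firsts:
        for l in lasts:
            variants.add(f"{f} {l}")                   # no middle name
            if middle:
                variants.add(f"{f} {middle[0]}. {l}")  # middle initial
                variants.add(f"{f} {middle} {l}")      # full middle name
    return sorted(variants)

print(name_variants("Christopher", "Smith-Jones", middle="Alan"))
```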
As described in section A.5, the data file includes separate counts for
Our Amazon searches were conducted in June, July, and August 2018. We used the
Our book counts include only new books with initial publication dates from January 2013 through December 2017. We excluded chapters in edited volumes, editorships of edited volumes, new editions, translations, and re-publications such as paperback editions of titles originally issued in hardcover. We also excluded self-published books and books of fewer than 60 pages.
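These inclusion rules can be expressed as a simple screening function; the sketch below uses illustrative field names that do not correspond to any particular database.

```python
# Sketch of the book-screening rules described above. The field names are
# illustrative and do not correspond to any particular database's schema.
def counts_as_book(rec: dict) -> bool:
    if not (2013 <= rec["year"] <= 2017):
        return False   # outside the five-year window
    if rec["pages"] < 60 or rec["self_published"]:
        return False   # too short, or self-published
    if rec["new_edition"] or rec["translation"] or rec["republication"]:
        return False   # not a new book
    if rec["edited_volume"] or rec["chapter"]:
        return False   # editorships and chapters are excluded
    return True
```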
As noted in section A.5, the data file includes separate counts for
For all four productivity measures, we used harmonic weighting to assign credit for works with two or more authors. This method accounts for the number of authors as well as each individual’s place in the author list. Specifically, the credit assigned to the author in position i of an N-author work is (1/i) divided by the sum 1 + 1/2 + ... + 1/N, so the credits assigned to all authors of a single work sum to 1.
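A minimal implementation of this weighting scheme, consistent with the formula above:

```python
# Harmonic author credit: the i-th of N authors receives (1/i) divided by
# 1 + 1/2 + ... + 1/N, so the credits for a single work sum to 1.
def harmonic_credit(position: int, n_authors: int) -> float:
    denom = sum(1 / k for k in range(1, n_authors + 1))
    return (1 / position) / denom

# Example: a three-author article
print([round(harmonic_credit(i, 3), 3) for i in (1, 2, 3)])  # [0.545, 0.273, 0.182]
```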
Apart from certain identifier variables and note fields, six types of variables are included in the data set:
General data on academic institutions (sociology departments): institution name, institution type, location, public or private status, highest sociology degree offered, number of faculty in the department, percentage who are full rather than associate professors, percentage who are female, average years since Ph.D. (in 2018), percentage who earned the Ph.D. within each of three date ranges, and percentage with doctorates from top-25 universities.
For overviews of the use of Scopus and Web of Science in bibliometric research, see Baas et al. (
As used here,
Institutions in the Carnegie R1 and R2 categories (
Admittedly, the situation may be different in Europe and Asia, where undergraduate colleges are less common and universities are more likely to maintain doctoral programs.
Microsoft Academic is likely to exhibit the same kinds of errors as Google Scholar, since it is compiled using similar methods. Two other potential data sources, Crossref and Dimensions, rely on publisher-supplied bibliographic information but are also subject to omissions and inaccuracies (
The attempt was unsuccessful, since WorldCat lists not just those libraries with access to the current data, but those with any of the previous editions—the SSCI volumes once issued in print or on microfiche, for instance.
The Zenodo site includes the associated user notes while the openICPSR site does not. Moreover, Zenodo can be accessed anonymously while openICPSR requires registration.
Our methods account for scholarly productivity and for the relative standing of particular journals and book publishers, but not for the citation impact of each individual article or book. Notably, neither the large citation databases nor our data sources capture other important dimensions of scholarly impact, such as influence on teaching, practice, and public knowledge.
Likewise, our procedures cannot fully capture the most recent changes in authors’ affiliations or characteristics.
This emphasis follows the tradition of bibliometric research grounded in the social sciences, an approach established by authors such as Cole & Cole (
Most of the time spent on book records went into verifying authorship, determining whether particular books should be counted (e.g., distinguishing between new books and revised editions), and verifying bibliographic information (e.g., resolving discrepancies between the publication dates provided by two different sellers). A particular difficulty was that many authors had identical or near-identical names.
For instance, the journal name that appears on the web site may be different from that on the article PDFs, even for well-established journals. Inconsistencies in the use of ampersands, commas, and British or American spelling are especially common.
The authors have no competing interests to declare.