BibTeX is a LaTeX-related tool for handling bibliographies, developed by Oren Patashnik around 1988. Bibliographic entries are stored in a file (with extension .bib) separate from the LaTeX source, which only references them through keys with the \cite macro. The command-line program bibtex uses this external file and the information output by previous latex invocations to produce LaTeX code that the next latex invocations will include in the document.

If it were defined nowadays, the .bib file format would certainly be XML based and easy to parse. Unfortunately, this is not the case, and since BibTeX is still widely used, we have to deal with that format. This document is a reference for the file format, which I compiled when I wrote a parser for .bib files.

This document would not have been possible without the excellent documents written by Nicolas Markey and available at http://www.lsv.ens-cachan.fr/~markey/bibla.php. You can also find some information from the LaTeX Book at http://bibliographic.openoffice.org/bibtex-defs.html.

Standard entry types and fields vs. valid BibTeX files

An important thing to understand is that BibTeX has two distinct parts. The first one is a statically defined parser that reads entries from a text file (your .bib file). The second one is a programmable part that tells how to produce LaTeX code (the .bbl files) from those entries. The latter is defined by the infamous .bst style files, such as alpha.bst, that you rarely manipulate and that XdkBibtex (and in particular its Python binding) is meant to replace. The first part is purely syntactical, and the second is semantical.

Therefore, you can have a valid BibTeX file that is correctly parsed but produces no output with the standard style files (e.g. alpha.bst), because these styles expect standard field names such as author or title. In this document, we only refer to the syntactical part of BibTeX.

The file format

A BibTeX entry looks like this:

@Article{py03,
     author = {Xavier D\'ecoret},
     title  = "PyBiTex",
     year   = 2003
}
@Article(py03,
     author = {Xavier D\'ecoret},
     title  = "PyBiTex",
     year   = 2003
)

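where the delimiters can be either braces or parentheses, as shown in the two examples above (which are identical for BibTeX). An entry is made of an entry type (the word after the @ sign, here Article), a key (here py03) and a comma-separated list of fields, each written as name = value.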

Here are some examples of valid and invalid field values:

Not parsed by BibTeX:

@Article{key03,
  title = "The lonely { brace",
}

Parsed by BibTeX:

@Article{key03,
  title = "A {bunch {of} braces {in}} title"
}

Not parsed by BibTeX:

@Article{key01,
  author = "Simon \"the saint\" Templar",
}

Parsed by BibTeX:

@Article{key01,
  author = "Simon {"}the {saint"} Templar",
}

Not parsed by BibTeX:

@Article{key01,
  title = { The history of @ sign }
}

Parsed by BibTeX:

@Article{key01,
  title = "The history of @ sign"
}
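
The first two pairs of examples above suggest a simple check for quote-delimited values. Here is a minimal Python sketch, assuming only the two rules those examples illustrate: braces must be balanced, and a double quote may only appear inside braces. The function name is_valid_quoted_value is hypothetical, and this is not BibTeX's actual code.

```python
def is_valid_quoted_value(inner):
    """Check the text found between the two outer double quotes of a value.

    Assumed rules (inferred from the examples, not taken from BibTeX's
    source): braces must be balanced, and a double quote is only allowed
    inside braces.
    """
    depth = 0
    for ch in inner:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:  # closing brace without a matching opening one
                return False
        elif ch == '"' and depth == 0:
            # An unprotected quote would end the value prematurely.
            return False
    return depth == 0


print(is_valid_quoted_value("The lonely { brace"))                # False
print(is_valid_quoted_value("A {bunch {of} braces {in}} title"))  # True
print(is_valid_quoted_value(r'Simon \"the saint\" Templar'))      # False
print(is_valid_quoted_value('Simon {"}the {saint"} Templar'))     # True
```

Note that a backslash does not protect a quote in this sketch, which matches the "Simon" example: only braces do.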

Finally, there are three last things to know about the syntax of the .bib file format: comments, string variables and the @Preamble declaration.

Comments

Comments in BibTeX do not work the usual way, that is, with a pair of comment delimiters such as /* and */, or // and \n (end of line) as in C++. Instead, BibTeX splits the file into two kinds of areas: inside an entry and outside an entry, the boundary being indicated by the presence of an @ sign. When this character is met, BibTeX expects to find an entry as described above. Before that sign, and after an entry, everything is considered a comment! So the following file is correctly parsed:

Some {{comments} with unbalanced braces
....and a "commented" entry...

Book{landru21,
  author =	 {Landru, Henri D\'esir\'e},
  title =	 {A hundred recipes for you wife},
  publisher =	 {Culinary Expert Series},
  year =	 1921
}

..some other comments..before a valid entry...

@Book{steward03,
  author =	 { Martha Steward },
  title =	 {Cooking behind bars},
  publisher =	 {Culinary Expert Series},
  year =	 2003
}

The advantage of this definition of comments is that you can quickly "comment out" an entry by simply removing the @ at its beginning. BibTeX actually offers another way to comment out part of the file: if the entry type is @Comment, it is not considered to be the start of an entry. (Actually, the rule is that everything from the @Comment to the end of the line is ignored. The remaining lines of the commented entry are ignored by the first comment mechanism we described; in particular, a @Comment does not need to be a valid entry, i.e. it can for example omit commas between two fields.)

...and finally an entry commented by the use of the special @Comment entry type.

@Comment{steward03,
  author =	 {Martha Steward},
  title =	 {Cooking behind bars},
  publisher =	 {Culinary Expert Series},
  year =	 2003
}

A side effect of the very strong meaning of the @ sign in the file format is that when BibTeX encounters an error in an entry (such as a missing comma between two fields, or an unbalanced braced expression), it is able to recover rather well by skipping everything until the next @ sign.

The counterpart is that you cannot have an @ sign in your comments. You might wonder why BibTeX provides the special @Comment mechanism, since it would be easy to comment out an entry by just removing the @. The only answer I can come up with is that keeping the @ sign allows grepping for entries' keys in the file. You can thus easily count or search the entries commented out using the @Comment approach, which would be more complicated with the "no @ sign" approach. Note that Nicolas Markey offers another explanation, which is that it allows you to quickly comment out a set of entries by surrounding them with @Comment{...}, but I do not subscribe to that point of view since, in the following .bib file, the steward03 entry is still taken into account by bibtex:

@Comment{
  @Book{steward03,
    author =	 {Martha Steward},
    title =	 {Cooking behind bars},
    publisher =	 {Culinary Expert Series},
    year =	 2003
  }
}
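
The role of the @ sign described in this section can be illustrated with a small Python sketch. It is only an approximation based on a regular expression, not BibTeX's real scanner: the pattern ENTRY_START and the helpers entry_types and real_entry_types are hypothetical names, and accepting whitespace after the @ is an assumption.

```python
import re

# '@', optional spaces, an entry type made of letters, optional spaces,
# then an opening brace or parenthesis.
ENTRY_START = re.compile(r"@\s*([A-Za-z]+)\s*[{(]")


def entry_types(bib_text):
    """Return the lowercased type of every '@type{' or '@type(' found."""
    return [m.group(1).lower() for m in ENTRY_START.finditer(bib_text)]


def real_entry_types(bib_text):
    """Same as entry_types, without @Comment, @String and @Preamble."""
    special = {"comment", "string", "preamble"}
    return [t for t in entry_types(bib_text) if t not in special]


sample = """
Some {{comments} with unbalanced braces
Book{landru21, year = 1921 }
@Book{steward03, year = 2003 }
@Comment{
  @Book{steward03, year = 2003 }
}
"""

print(entry_types(sample))       # ['book', 'comment', 'book']
print(real_entry_types(sample))  # ['book', 'book']
```

The landru21 line is never seen because it has no @ sign, while the entry wrapped in @Comment{...} is still detected, which is the behaviour reported above for bibtex.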

String variables

In order to get coherent notation among your entries, BibTeX provides a useful mechanism to define strings. The following two lines are equivalent ways of defining a string constant (note again that, as for entries, you can use braces or parentheses as delimiters):

@String(mar = "march")
@String{mar = "march"}

The placeholder (variable/string name) must start with a letter and can contain any character in [a-z, A-Z, _, 0-9]. The placeholder is case insensitive. If a placeholder is defined several times, the last definition is kept. Once a placeholder is defined, you can use it as a field value (e.g. for the month field), as in the following example:

What your file contains:

@String(mar = "march")

@Book{sweig42,
  Author =       { Stefan Sweig },
  title =        { The impossible book },
  publisher =    { Dead Poet Society},
  year =         1942,
  month =        mar
}

What BibTeX sees:

@Book{sweig42,
  Author =       { Stefan Sweig },
  title =        { The impossible book },
  publisher =    { Dead Poet Society},
  year =         1942,
  month =        "march"
}

Be careful: if you place quotes or braces around the placeholder, the substitution is not made. But you can concatenate an explicit string of characters (with quotes or braces) with a string variable using the pound (#) sign, as in the following example:

What your file contains:

@String(mar = "march")

@Book{sweig42,
  ...
  month =        "1~mar"
}

What BibTeX sees:

@Book{sweig42,
  ...
  month =        "1~mar"
}

What your file contains:

@String(mar = "march")

@Book{sweig42,
  ...
  month =        "1~" # mar
}

What BibTeX sees:

@Book{sweig42,
  ...
  month =        "1~march"
}

String definitions can themselves use this mechanism, so you can write:

@String {firstname = "Xavier"}
@String {lastname  = "Decoret"}
@String {email      = firstname # "." # lastname # "@imag.fr"}
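
The substitution and concatenation rules of this section can be sketched in Python as follows. This is a simplified illustration, not BibTeX's implementation: the helpers split_concat and expand are hypothetical names, the string table is a plain dict keyed by lowercased placeholder, and an undefined placeholder simply raises KeyError.

```python
def split_concat(expr):
    """Split a field value on '#' signs found outside quotes and braces."""
    parts, current, depth, in_quotes = [], [], 0, False
    for ch in expr:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        elif ch == '"' and depth == 0:
            in_quotes = not in_quotes
        if ch == "#" and depth == 0 and not in_quotes:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


def expand(expr, strings):
    """Expand a field value, replacing bare placeholders by their value."""
    out = []
    for part in split_concat(expr):
        if len(part) >= 2 and part[0] == '"' and part[-1] == '"':
            out.append(part[1:-1])             # quoted: no substitution
        elif len(part) >= 2 and part[0] == "{" and part[-1] == "}":
            out.append(part[1:-1])             # braced: no substitution
        elif part.isdigit():
            out.append(part)                   # bare number
        else:
            out.append(strings[part.lower()])  # placeholder, case insensitive
    return "".join(out)


strings = {"mar": "march", "firstname": "Xavier", "lastname": "Decoret"}
strings["email"] = expand('firstname # "." # lastname # "@imag.fr"', strings)

print(expand('"1~mar"', strings))     # 1~mar
print(expand('"1~" # mar', strings))  # 1~march
print(expand("MAR", strings))         # march
print(strings["email"])               # Xavier.Decoret@imag.fr
```

Storing definitions in a dict also gives the "last definition is kept" behaviour for free, since a later assignment to the same key overwrites the earlier one.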

Finally, note that you cannot define several strings at once in a single @String command, although that would have been a convenient feature, coherent with the entry syntax. You must issue one command per string, as shown below:

Not parsed by BibTeX:

@String(mar = "march",
        apr = "april")

Parsed by BibTeX:

@String(mar = "march")
@String(apr = "april")

The @Preamble declaration

You can define some LaTeX commands that will be included in the .bbl file generated by BibTeX, using a declaration like this (note again that, as for entries, you can use braces or parentheses as delimiters):

@preamble {"This bibliography was generated on \today"}
@preamble ("This bibliography was generated on \today")

Such declarations can be placed anywhere in the document outside entries. If several of them appear in different places, they are concatenated in order of appearance.

String variables can be used within the preamble, so you can have files such as:

@String {maintainer = "Xavier D\'ecoret"}

@preamble { "Maintained by " # maintainer }

Names specifications

Well, I said that I would describe only the syntactical part of BibTeX, but that is not exactly true. BibTeX defines a de facto standard for describing a person's name in a field value (typically for the author field). For BibTeX, a name is composed of four parts: First, von, Last and Jr.

When BibTeX is given a string representing a name, it analyzes it to retrieve the four parts. The three recognized structures for the string are:

First von Last
von Last, First
von Last, Jr, First

The structure involves word delimitation, commas, and the case (uppercase or lowercase) of the first letters of words, as explained now. In what follows, a word is a sequence of pseudo-characters, where a pseudo-character is either a non-whitespace character or a well-balanced braced expression. If the first character of a word is neither a well-balanced braced expression nor a digit, a letter or a backslash, it is considered whitespace and therefore dropped (see the splitting examples later)!

First von Last: First is the longest sequence of whitespace-separated words starting with an uppercase letter (see case determination) that is not the whole string. von is the longest sequence of whitespace-separated words whose last word starts with a lowercase letter (note that, because First is maximal, the first word of von also starts with a lowercase letter) and that is not the whole string. Last is everything else. These rules imply that the Last part cannot be empty.
von Last, First: von is the longest sequence of whitespace-separated words whose last word does not start with an uppercase letter (see case determination). Last is everything else before the comma, and First is everything after the comma. Here again, the Last part cannot be empty.
von Last, Jr, First: Same as above for von and Last. Jr is everything between the first two commas, and First is everything after the second comma, whatever the case of their first letters.

Finally, to handle multiple authors, BibTeX splits the initial string on the word "and" and applies the scheme we have just seen to each part to get each author. The "and" must not be inside braces.
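
As an illustration of the splitting on "and", here is a minimal Python sketch. The function split_authors is a hypothetical helper, and it makes two assumptions that are not stated above: the word must be written in lowercase, and runs of whitespace are collapsed to a single space in the result.

```python
def split_authors(value):
    """Split an author field on the word 'and' found outside braces."""
    names, words, current, depth = [], [], [], 0

    def end_word():
        if current:
            words.append("".join(current))
            current.clear()

    def end_name():
        if words:
            names.append(" ".join(words))
            words.clear()

    for ch in value:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if ch.isspace() and depth == 0:
            end_word()
            # A complete top-level word 'and' separates two names.
            if words and words[-1] == "and":
                words.pop()
                end_name()
        else:
            current.append(ch)  # whitespace inside braces stays in the word
    end_word()
    end_name()
    return names


print(split_authors(r"Xavier D{\'e}coret and {Barnes and Noble} and Knuth, Donald"))
```

With this input the sketch returns three names, and the "and" inside {Barnes and Noble} is left untouched because it sits inside braces.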

Case determination

The algorithm used by BibTeX to determine whether a word starts with a lowercase letter is pretty tricky. Thanks again to Nicolas Markey for writing this down properly in his great document "Tame the BeaST". I am summarizing my understanding of it :

Name splitting test suite

Here is the test suite I used to check the name decomposition in BibTeX. It should help you become familiar with the tricks of name splitting. All these examples have been checked with bibtex using a special BST style file (names.bst) that displays the parts of authors' names. You can use it with the Python script test_names.py if you want (download it, run chmod a+x test_names.py to make it executable and then ./test_names.py afile.bib to display the names of the authors found in the file afile.bib).

Test suite for the first name specification form (First von Last)

tested author's value | First part | von part  | Last part | jr part | comment
AA BB                 | AA         |           | BB        |         | Testing simple case with no von.
AA                    |            |           | AA        |         | Testing that Last cannot be empty.
AA bb                 | AA         |           | bb        |         | Idem.
aa                    |            |           | aa        |         | Idem.
AA bb CC              | AA         | bb        | CC        |         | Testing simple von.
AA bb CC dd EE        | AA         | bb~CC~dd  | EE        |         | Testing simple von (with inner uppercase words).
AA 1B cc dd           | AA~1B      | cc        | dd        |         | Testing that digits are caseless (B fixes the case of 1B to uppercase).
AA 1b cc dd           | AA         | 1b~cc     | dd        |         | Testing that digits are caseless (b fixes the case of 1b to lowercase).
AA {b}B cc dd         | AA~{b}B    | cc        | dd        |         | Testing that pseudo-letters are caseless.
AA {b}b cc dd         | AA         | {b}b~cc   | dd        |         | Idem.
AA {B}b cc dd         | AA         | {B}b~cc   | dd        |         | Idem.
AA {B}B cc dd         | AA~{B}B    | cc        | dd        |         | Idem.
AA \BB{b} cc dd       | AA~\BB{b}  | cc        | dd        |         | Testing that non-letters are caseless (in particular, shows how LaTeX commands are considered).
AA \bb{b} cc dd       | AA         | \bb{b}~cc | dd        |         | Idem.
AA {bb} cc DD         | AA~{bb}    | cc        | DD        |         | Testing that caseless words are grouped with First primarily and then with Last.
AA bb {cc} DD         | AA         | bb        | {cc}~DD   |         | Idem.
AA {bb} CC            | AA~{bb}    |           | CC        |         | Idem.
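
The First von Last rows above can be reproduced with the following Python sketch. It is my own approximation, inferred from those rows, and not BibTeX's actual algorithm: the helpers words_of, word_case and split_first_von_last are hypothetical names, the case rule (first letter outside ordinary braced groups, with {\command ...} groups looked into) is an assumption, the dropping of odd leading characters is not implemented, and parts are joined with plain spaces instead of the ~ ties that bibtex outputs.

```python
def words_of(name):
    """Split a name on whitespace found outside braces."""
    words, current, depth = [], [], 0
    for ch in name:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if ch.isspace() and depth == 0:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


def word_case(word):
    """Return 'upper', 'lower' or None (caseless) for a word."""
    i, n = 0, len(word)
    while i < n:
        ch = word[i]
        if ch == "{":
            if word[i + 1:i + 2] == "\\":
                # {\command ...}: skip the command name, keep scanning inside.
                i += 2
                if i < n and word[i].isalpha():
                    while i < n and word[i].isalpha():
                        i += 1
                else:
                    i += 1
                continue
            # Ordinary braced group: skipped entirely, it has no case.
            depth = 1
            i += 1
            while i < n and depth > 0:
                depth += {"{": 1, "}": -1}.get(word[i], 0)
                i += 1
            continue
        if ch.isalpha():
            return "lower" if ch.islower() else "upper"
        i += 1
    return None


def split_first_von_last(name):
    """Split a name written in the 'First von Last' form."""
    words = words_of(name)
    # The final word is never a candidate, so that Last cannot be empty.
    lower = [i for i, w in enumerate(words[:-1]) if word_case(w) == "lower"]
    if lower:
        first = words[:lower[0]]
        von = words[lower[0]:lower[-1] + 1]
        last = words[lower[-1] + 1:]
    else:
        first, von, last = words[:-1], [], words[-1:]
    return " ".join(first), " ".join(von), " ".join(last)


print(split_first_von_last("AA bb CC dd EE"))   # ('AA', 'bb CC dd', 'EE')
print(split_first_von_last("AA {bb} cc DD"))    # ('AA {bb}', 'cc', 'DD')
print(split_first_von_last(r"AA \bb{b} cc dd"))
```

Each function returns the (First, von, Last) triple; the comma-separated forms are deliberately left out of this sketch.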
Test suite for the second and third specification forms (von Last, First and von Last, Jr, First)

tested author's value | First part | von part  | Last part | jr part | comment
bb CC, AA             | AA         | bb        | CC        |         | Simple case. Case does not matter for First.
bb CC, aa             | aa         | bb        | CC        |         | Idem.
bb CC dd EE, AA       | AA         | bb~CC~dd  | EE        |         | Testing simple von (with inner uppercase).
bb, AA                | AA         |           | bb        |         | Testing that the Last part cannot be empty.
BB,                   |            |           | BB        |         | Testing that First can be empty after the comma.
bb CC,XX, AA          | AA         | bb        | CC        | XX      | Simple Jr. Case does not matter for it.
bb CC,xx, AA          | AA         | bb        | CC        | xx      | Idem.
BB,, AA               | AA         |           | BB        |         | Testing that Jr can be empty between the commas.

I used dummy names (AA, BB, etc.) for clarity. Note also that, besides the name splitting, BibTeX also performs some modifications, like adding non-breakable spaces (~) between the words in the von part, and some others that I do not list here as they no longer concern the syntax (see the format.name$ function of the BST language in Markey's "Tame the BeaST").

Further remarks

I suggest you always use the second form (or the third if there is a Jr part), as it will save you from some easy-to-make mistakes. Indeed, suppose your author is the famous French explorer Paul Émile Victor (note the accent). If you forget the uppercase on the accented E and use the first form, the name will be incorrectly split. You won't have this problem with the second form. The table below summarizes what you get.

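Using first form

tested author's value  | First part     | von part  | Last part | jr part
Paul \'Emile Victor    | Paul \'Emile   |           | Victor    |
Paul {\'E}mile Victor  | Paul {\'E}mile |           | Victor    |
Paul \'emile Victor    | Paul           | \'emile   | Victor    |
Paul {\'e}mile Victor  | Paul           | {\'e}mile | Victor    |

Using second/third form

tested author's value  | First part     | von part  | Last part | jr part
Victor, Paul \'Emile   | Paul \'Emile   |           | Victor    |
Victor, Paul {\'E}mile | Paul {\'E}mile |           | Victor    |
Victor, Paul \'emile   | Paul \'emile   |           | Victor    |
Victor, Paul {\'e}mile | Paul {\'e}mile |           | Victor    |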

Actually, the second form allows name descriptions that are impossible with the first form. Take the French politician Dominique Galouzeau de Villepin. If you use the first form, the "de" will be interpreted as a von part and hence the First part will be "Dominique Galouzeau", which is incorrect.

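Using first form

tested author's value             | First part          | von part       | Last part             | jr part
Dominique Galouzeau de Villepin   | Dominique Galouzeau | de             | Villepin              |
Dominique {G}alouzeau de Villepin | Dominique           | {G}alouzeau de | Villepin              |

Using second/third form

tested author's value             | First part          | von part       | Last part             | jr part
Galouzeau de Villepin, Dominique  | Dominique           |                | Galouzeau de Villepin |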